Search
Search
Close this search box.

Multiplexed bulk and single-cell RNA-seq hybrid enables cost-efficient disease modeling with chimeric organoids – Nature Communications

Ethics statement

Written informed consent was obtained from the patient, and the study was approved by the Human Subjects Committee of Jinling Hospital, Nanjing University (2021DZGZR-YBB-109). The participants of the experiment of PBMC, iPSC, and iMac: samples 1-4 are from 60-70 years old donors, and samples 5–10 are from 10–40 years old donors. Sample 1/3/5/8/9/10 are male, and sample 2/4/6/7 are female. The participant of the experiment of kidney organoid: WT1 mutation sample was from a 20–30 years old male donor.

Vireo-bulk algorithm

In multiplexed bulk RNA-seq samples in which multiple donors are pooled in one experiment, our deconvolution method, Vireo-bulk, aims to estimate the proportions of read counts coming from each of the K donors through modeling of the expressed alleles governed by the distinct genotypes of each donor. Here, we consider N effective biallelic variants (or SNPs) sequenced that have different genotypes in at least two donors in the pool and sufficient read counts sequenced, for example, at 100 reads or UMIs and >5% from minor alleles, which can be obtained by probing with SNP arrays or high-throughput sequencing platforms (see genotyping section below). For a certain variant i in donor k, its genotype can be written as ({g}_{i,k,t}in {0,,1}), where (tin {{{{{mathrm{0,1,2}}}}}}) refers to the category of the genotype for homozygous reference BB, heterozygous AB, and homozygous alternative AA. In theory, the corresponding expressed allele frequency of alternative allele A in genotypes BB, AB, and AA should be θ = {({theta }_{0},{theta }_{1},{theta }_{2})}={0, 0.5, 1}, though deviation may occur due to technical noise, e.g., sequencing errors and genotyping errors, or allele-specific expression (ASE) effects. Nonetheless, these parameters θ = {({theta }_{0},{theta }_{1},{theta }_{2})}, which are globally shared across all variants and donors, can be estimated adaptively in our model (see below). By taking the known genotypes while ignoring this deviance, the allelic expression of variant i for each donor k in the mixture can be obtained as follows:

$${mu }_{i,k}={sum }_{t=0}^{2}, {g}_{i,k,t}{theta }_{t},, tin {0,1,2}$$

(1)

Meanwhile, for variant i in the multiplex bulk sample, for each A and B allele sequenced, the read count supporting the alternative allele is ({a}_{i}) and reference allele is ({b}_{i}) out of the ({d}_{i}={a}_{i}+{b}_{i}) total reads. For a certain read j, the observation of its allele ({r}_{j}in {A,B}) follows a Bernoulli distribution with genotype specific parameter ({theta }_{T}) given the genotype is T (i.e., ({g}_{i,k,T}=1)), as follows:

$$left{begin{array}{c}{{{{{rm{P}}}}}}left({r}_{j}={{{{{rm{B}}}}}} | {theta }_{T}right)={theta }_{T} Pleft({r}_{j}={{{{{rm{A}}}}}} | {theta }_{T}right)=1-{theta }_{T}end{array}right.$$

(2)

Given the donor proportion ({{{{{boldsymbol{phi }}}}}}={{phi }_{1},{phi }_{2}…{phi }_{K}}) with ({sum }_{{k}{=}{1}}^{{K}}{{phi }}_{{k}}{=}{1}). For a specific variant i, the likelihood of observing total ({a}_{i}) reads from the A allele and ({b}_{i}={d}_{i}-{a}_{i}) reads from the B allele can be expressed as follows:

$$begin{array}{c}{{{{{rm{p}}}}}}left({a}_{i},{b}_{i}|{{{{{{rm{mu }}}}}}}_{i},{{{{{rm{phi }}}}}}right)={prod }_{j=1}^{{a}_{i}}{sum }_{k=1}^{K}Pleft({r}_{j}=A | {{{{{{rm{mu }}}}}}}_{i},kright)Pleft({I}_{j}=k | {{{{{rm{phi }}}}}}right) times {prod }_{j=1}^{{b}_{i}}{sum }_{k=1}^{K}Pleft({r}_{j}=B | {{{{{{rm{mu }}}}}}}_{i},kright)Pleft({I}_{j}=k | {{{{{rm{phi }}}}}}right) {=}{left({{{{{{{boldsymbol{mu }}}}}}}}_{{i}}^{{{{{{{rm{top }}}}}}}}{{{{{{boldsymbol{phi }}}}}}}right)}^{{{a}}_{{i}}}{left({left({{{{{{bf{1}}}}}}}{-}{{{{{{{boldsymbol{mu }}}}}}}}_{{i}}right)}^{{{{{{{rm{top }}}}}}}}{{{{{{boldsymbol{phi }}}}}}}right)}^{{{b}}_{{i}}},=,{left({{{{{{{boldsymbol{mu }}}}}}}}_{{i}}^{{{{{{{rm{top }}}}}}}}{{{{{{boldsymbol{phi }}}}}}}right)}^{{{a}}_{{i}}}{left({1}{-}{{{{{{{boldsymbol{mu }}}}}}}}_{{i}}^{{{{{{{rm{top }}}}}}}}{{{{{{boldsymbol{phi }}}}}}}right)}^{{{b}}_{{i}}}end{array}$$

(3)

This likelihood is in the same form as the binomial distribution by taking the averaged allele rate across donors ({{{{{{{boldsymbol{mu }}}}}}}}_{{i}}^{{{{{{{rm{top }}}}}}}}{{{{{{boldsymbol{phi }}}}}}}). With conditional independence, the joint likelihood of (N) variants can be written by taking their product as follows:

$$Lleft({{{{{boldsymbol{theta }}}}}},{{{{{boldsymbol{phi }}}}}}right)={prod }_{i=1}^{N}Pleft({a}_{i},{b}_{i} | {{{{{{boldsymbol{mu }}}}}}}_{i},{{{{{boldsymbol{phi }}}}}}right)={prod }_{i=1}^{N}{left({{{{{{boldsymbol{mu }}}}}}}_{i}^{{{{{{rm{top }}}}}}}{{{{{boldsymbol{phi }}}}}}right)}^{{a}_{i}}{left(1-{{{{{{boldsymbol{mu }}}}}}}_{i}^{{{{{{rm{top }}}}}}}{{{{{boldsymbol{phi }}}}}}right)}^{{b}_{i}}$$

(4)

By introducing an Expectation-Maximization algorithm (Algorithm 1; below”)we obtain a maximum-likelihood estimation of ({{{{{boldsymbol{phi }}}}}}) and ({{{{{boldsymbol{theta }}}}}}). Alternatively, the genotype-specific allelic expression parameter ({{{{{boldsymbol{theta }}}}}}) can be set to a fixed value, e.g., (theta=left{{theta }_{0},{theta }_{1},{theta }_{2}right}=left{0.01,,0.5,,0.99right}) as default.

Detection of differential gene expression between donors with Vireo-bulk

In addition to quantifying the donor abundance by using genome-wide SNPs, Vireo-bulk can also be used to focus on a certain gene g (or a gene set with similar functions) and its SNPs. Therefore, the estimated value from this subset of SNPs reflects the product of the number of cells from each donor and their mean expression. If the expression levels of gene g are highly similar between donors, the estimated ({{{{{{boldsymbol{phi }}}}}}}_{g}) is expected to be close to the global donor abundance. In other words, a ({{{{{{boldsymbol{phi }}}}}}}_{g}) that is substantially different from the global donor abundance ({{{{{boldsymbol{phi }}}}}}) implies that significant differential expression exists in at least one donor.

Therefore, we introduced a likelihood ratio test to detect the differential gene expression between donors. Specifically, we compare the likelihood of observing the read counts for a certain gene (g) when using the global donor abundance ({{{{{boldsymbol{phi }}}}}}) (null hypothesis with no differential expression) and when using the specific estimated gene abundance ({{{{{{boldsymbol{phi }}}}}}}_{g}) (alternative hypothesis with differential expression) as follows:

$${H}_{0}:{L}_{0}left({a}_{g} | {d}_{g},{{{{{boldsymbol{phi }}}}}},{{{{{boldsymbol{theta }}}}}}right)=Pleft({a}_{g} | {d}_{g},{{{{{boldsymbol{phi }}}}}},{{{{{boldsymbol{theta }}}}}}right)={prod }_{i=1}^{{N}_{g}}pleft({a}_{i} | {d}_{i},{{{{{{boldsymbol{mu }}}}}}}_{i},{{{{{boldsymbol{phi }}}}}}right)$$

(5)

$${H}_{1}:{L}_{1}left({a}_{g} | {d}_{g},{{{{{boldsymbol{phi }}}}}},{{{{{boldsymbol{theta }}}}}}right)=Pleft({a}_{g} | {d}_{g},{{{{{boldsymbol{phi }}}}}},{{{{{boldsymbol{theta }}}}}}right)={prod }_{i=1}^{{N}_{g}}pleft({a}_{i} | {d}_{i},{{{{{{boldsymbol{mu }}}}}}}_{i},{{{{{{boldsymbol{phi }}}}}}}_{g}right)$$

(6)

Then, we calculate the likelihood ratio test statistic ({{{{{rm{lambda }}}}}}=-2log left({L}_{0}/{L}_{1}right)) and assume that it follows a Chi-square distribution with K-1 degrees of freedom to obtain a one-tailed p value.

Generally, gene-level abundance and differential expression tests require a sufficient number of SNPs with distinct genotypes between donors. In this study, we focused on 4000 genes covering 20,000 SNPs in total.

Genotyping calling

Different samples in different experiments are genotyped via different pipelines. For epilepsy family samples, raw WGS results were aligned to GRCh38 (Ensembl 93) via Sentieon ® bwa and joint calling was performed by the standard Sentieon® pipeline20. The genotype calling result was validated with the same parameter in the GATK pipeline21. For organoid donors, their genotype reference was called according to the RNA-seq results. For day 0 to day 14 organoid samples, isolated individual bulk RNA-seq samples were aligned to GRCh38 by HISAT 2.0, and bam files were processed by cellSNP-lite bulk mode to call donor-specific genotypes22. For samples collected after day 14, genotypes are called from isolated single-cell transcriptomes. Processed bam files given by the cellranger default pipeline were piled up via cellSNP-lite bulk mode and used as the reference for samples after day 1423.

RNA-seq data processing and analysis

Raw RNA-seq reads generated by Illumina in fastq format were first trimmed with the Trimmomatic tool and then aligned to the human genome (hg38 from GENCODE) by HISAT2 with default parameters24. Multimapped reads and PCR duplicates were masked for subsequent quantification and genotyping with RepeatMask. After the gene expression count was qualified by FeatureCounts, the edgeR package was used for count normalization and differential expression analysis. The FDR cutoff for DEGs was 0.125,26.

For the kidney organoid data, sample-separated bulk RNA-seq datasets are available; hence, the aligned bam files were used for genotyping through FreeBayes with the parameter -min-alternate-fraction 0.227. Then, the generated VCF file was filtered with -minQ 30 by BCFtools.

Single-cell expression matrix generation and stimulating data preprocessing

For reads produced by the 10X-Chromium V3 protocol, including both repeats and coding genes, single-cell RNA-seq was generated by 10X Genomics Chromium (chemical v2). The sequencing reads in fastq format were trimmed and then mapped to the hg38 genome index by the default cell ranger-3.0.2 pipeline. We used the cells called by cell ranger in the default setting. Bam files were genotyped at given known SNPs (FreeBayes called before) by CellSNP-lite 1.2.1 and demultiplexed by Vireo 0.2.3. Only cell barcodes predicted as singlets by Vireo were kept for generating simulated data by subsampling.

Analysis of scRNA-seq data

After inputting the UMI count sparse matrix from the cell ranger 3.0.2., DGE was normalized by log2(TPM)28. The Seurat R package (v4.0) was used for downstream analysis of single-cell transcriptome data from the cell ranger cell-by-gene UMI count matrix29. The top 2000 variable genes in the cleaned DGE were detected via VST methods by the FindVariableFeatures function. Then, the top 20 PCs were selected using the Jackstraw function, and their coordinates were used for uniform manifold approximation (UMAP) to generate low-dimensional cell embedding and SNN clustering (resolution=0.4).

Single-cell organoid atlas mapping and batch effect correction

The kidney organoid atlas was preprocessed by the Seurat R package (v4.0) with annotated cell type metadata. Query data were mapped to the atlas through TransferData (aweight.reduction = “cca”, dims = 1:30) after anchored reference FindTransferAnchors (1:30, reference.assay, normalization.method = “LogNormalize”, reduction = “cca”). The batch effect was calculated by the Jensen–Shannon divergence between proportions of cell types. For batch effect removal methods in the embedding plot, Harmony with default parameters was used to generate the dimensionally reduced matrix.

Reprogramming of donor PBMCs to iPSCs

Fresh whole blood was obtained from consenting donors, and then the PBMC Isolation Kit (Solarbio P8610) was used to isolate PBMCs from the samples. The PBMCs were cultured in H3000 (STEMCELL Technologies, Catalog # 100-0073) with CC100 (STEMCELL Technologies, Catalog # 02690). Subsequently, these PBMCs were electroporated with OriP/EBNA-1-based episomal plasmids expressing the reprogramming factors OCT3/4, SOX2, KLF-4, L-MYC, and LIN2830.

Generation of kidney organoids

iPSCs were induced to differentiate toward the primitive streak by treating cells with 7 μM CHIR99021(STEMCELL Technologies, Catalog # 72052) in TeSR-E6(STEMCELL Technologies, Catalog # 05946) medium for 4 days. Next, 200 ng/ml FGF9(MCE, Catalog # HY-P7177), 1 μg/ml heparin(STEMCELL Technologies, Catalog # 07980) and 1 μM CHIR99021 were added to induce the iPSCs to differentiate toward intermediate mesoderm (IM) for 3 days. The IM cells were digested into single cells, resuspended in 200 ng/ml FGF9, 1 μg/ml heparin, 1 μM CHIR99021, 0.1% PVA, 0.1% MC, and 10 μM ROCK inhibitor(STEMCELL Technologies, Catalog # 72308) medium, and cultured in a horizontal shaker. After 24 hours, the ROCK inhibitor was removed from the medium. On the next five days, all cytokines were removed and maintained in TeSR-E6 medium. Organoids were spontaneously formed in the following 13 days.

Immunofluorescence staining

Kidney organoids were fixed with 2% PFA for 20 min and incubated with primary antibodies overnight at 4 °C. The kidney organoids were then washed five times with PBS and incubated with secondary antibodies with fluorescent labels. After staining, the kidney organoids were dehydrated using a 25%, 50%, 75%, and 100% methanol series for 5 min, followed by clearing using benzyl alcohol and benzyl benzoate (BABB, 1:2 ratio). The clear kidney organoids were mounted on a glass-bottom dish (NEST Corporation). The stained cells and kidney organoids were observed via confocal microscopy (Nikon). The primary antibodies: WT1 (1:100, abcam, cat. no. ab89901), SALL1 (1:100, Thermo, cat. no. PA5-62057), PAX8 (1:100, Proteintech, cat. no. 10336-1-AP), NPHS1 (1:100, R&D System, cat. no. AF4269). The secondary antibodies: Donkey Anti-Rabbit IgG H&L Alexa Fluor® 488(1:200, abcam,ab150073), Donkey Anti-Rabbit IgG H&L Alexa Fluor® 647(1:200, abcam,ab150075), Donkey Anti-Rabbit IgG H&L Alexa Fluor® 568(1:200, abcam, ab175470), Donkey Anti-Sheep IgG H&L Alexa Fluor® 647 (1:200, abcam, ab150179), Donkey Anti-Sheep IgG H&L Alexa Fluor® 488(1:200, abcam, ab150177).

Statistics & Reproducibility

Statistical analyses were conducted utilizing R statistical software (version 3.6.1) and Python (version 3.10). In Vireo-bulk, Differentially Expressed Genes (DEGs) between donors were identified through the execution of a likelihood ratio test. For the significance test in the Venn diagram, Fisher’s exact test was employed. For isolated sample’s DEGs significance testing, p-values were calculated using the Wald test from the DEseq2 package. For significance testing in the violin plot, a one-sided t-test was performed. All correlations were assessed using Spearman’s rank correlation coefficient. A p-value of less than 0.05 was considered statistically significant in all analyses.

No statistical method was used to predetermine the sample size. No data were excluded from the analysis. The experiments were not randomized, and the investigators were not blinded to allocation during experiments and outcome assessment.

To ensure reproducibility, all methods and experimental protocols were meticulously described in the Methods section. The data and code utilized for the analyses are readily available in the Supplementary Information and have also been deposited in a public repository.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.