Mol. Cells

Integration of Single-Cell RNA-Seq Datasets: A Review of Computational Methods

Yeonjae Ryu, Geun Hee Han, Eunsoo Jung

Additional article information


With the increased number of single-cell RNA sequencing (scRNA-seq) datasets in public repositories, integrative analysis of multiple scRNA-seq datasets has become commonplace. Batch effects among different datasets are inevitable because of differences in cell isolation and handling protocols, library preparation technology, and sequencing platforms. To remove these batch effects for effective integration of multiple scRNA-seq datasets, a number of methodologies have been developed based on diverse concepts and approaches. These methods have proven useful for examining whether cellular features, such as cell subpopulations and marker genes, identified from a certain dataset, are consistently present, or whether their condition-dependent variations, such as increases in cell subpopulations in particular disease-related conditions, are consistently observed in different datasets generated under similar or distinct conditions. In this review, we summarize the concepts and approaches of the integration methods and their pros and cons as has been reported in previous literature.

Keywords: batch correction, data integration, single-cell RNA-seq


Single-cell RNA sequencing (scRNA-seq) datasets have been increasingly accumulated in public data repositories, such as the Single Cell Portal (, Gene Expression Omnibus (Barrett et al., 2013), and Human Cell Atlas (HCA) data portal (Regev et al., 2017). With an increase in the number of datasets, many efforts have been made for the integrative analysis of different scRNA-seq datasets. Multiple scRNA-seq datasets are often integrated and compared to check whether cellular features (e.g., cell subpopulations and their marker genes) identified from a certain dataset are shared or distinctive compared with those in other datasets produced under similar or different biological conditions. The integration of multiple scRNA-seq datasets has proven useful for reliably identifying shared or distinctive cellular features across datasets. Shared cellular features may not be clear when individual datasets are analyzed independently, owing to small numbers of cells or sparse expression data in each dataset and unwanted technical or biological variations within and across datasets. However, these cellular features can be corroborated by combining features from multiple datasets after correcting the unwanted variations.

In many cases, individual scRNA-seq datasets are generated from samples with distinctive characteristics (e.g., cell counts, tissue types, and conditions from which cells are isolated, such as healthy or diseased conditions) and using different experimental protocols (e.g., cell isolation and handling protocols and library preparation methods) or sequencing platforms. These differences inevitably lead to unwanted technical and biological variations across different datasets. Even within a single dataset, multiple batches with variations in sample characteristics, experimental protocols, or sequencing platforms can exist. These unwanted variations among batches, called batch effects, within and across datasets can decrease the chances of identifying underlying cellular features by introducing inconsistent cellular expression profile structures. Therefore, batch effects arising from both systematic technical and unwanted biological variations should be corrected before integrative analysis of multiple datasets to prevent misleading conclusions. To effectively correct batch effects, several computational methods have been developed based on different concepts and approaches. In this review, we conceptually categorize existing integration methods, describe the key algorithms employed in these methods, and summarize their advantages and disadvantages as reported in literature.


Definition of sample batches in multiple datasets

Batch effects occur mainly due to differences in the following three factors: (1) sample characteristics (donors, tissues, species, or disease conditions), (2) experimental protocols, and (3) sequencing platforms (Fig. 1A, top). The integration of multiple datasets begins by defining batches as sets of samples that are thought to be similar in terms of these factors (Fig. 1A, boxes in different colors). The factors leading to the most significant batch effects can vary across the integrated datasets. It is common for the major factor causing the largest batch effect to be chosen subjectively, and the batches are defined based on the major factor. For example, batches were defined as sets of samples from individual donors (Reichart et al., 2022; Smillie et al., 2019; Uchimura et al., 2020; Villa et al., 2022) (Fig. 1A, Dataset 1) or as individual datasets that were generated using different protocols (Cheng et al., 2021; Morabito et al., 2021) and/or sequencing platforms (Cheng et al., 2021) (Fig. 1A, Datasets 2-3). Several studies even defined batches as individual samples (Bryois et al., 2022; Yoon et al., 2022) (Fig. 1A, Dataset k), assuming that different samples are subjected to distinct technical variations. Moreover, when two factors (e.g., protocol and sequencing platform) cause significant batch effects, two sets of batches can be defined based on these factors, and the batch effects in the two sets can be sequentially corrected (Cheng et al., 2021; Morabito et al., 2021). After defining the batches, multiple datasets are then merged by concatenating the expression counts of cells for the same genes in the expression count matrix for each dataset. For each cell in the merged expression count matrix, the expression counts are divided by the total expression count, multiplied by a scale factor (e.g., 10,000), and log-transformed (Fig. 1B, Normalization). For each batch, a set of highly variable genes (e.g., 2,000 genes) with large variances across the cells in the batch are then selected using various tools (e.g., ‘FindVariableFeatures’ function in Seurat [v3 and higher hereafter] [Stuart et al., 2019], BASiCS [Vallejos et al., 2015], Brennecke [Brennecke et al., 2013], scLVM [Buettner et al., 2015], scran [Lun et al., 2016], and scVEGs [Chen et al., 2016]). The final set of highly variable genes most frequently selected across the batches are then selected (‘SelectIntegrationFeatures’ function of Seurat; Fig. 1B, Selection of highly variable genes). Although data integration can be performed using only highly variable genes or all genes, the use of highly variable genes only in subsequent analyses (Fig. 1B, Batch correction, Cell clustering, and Cell annotation) has been shown to be generally more effective for identifying underlying biological differences among cell types and/or for removing unwanted variations in the data (Luecken et al., 2022).

Figure F1
(A) Schematic illustration of defining batches by donors (Dataset 1), sample preparation protocols (Dataset 2), sequencing platforms (Dataset 3), and individual samples (donors; Dataset k). (B) Analytical flow of data ...

Categorization of integration methods

Batch correction is next performed for the normalized merged dataset using the selected highly variable genes (Fig. 1B, Batch correction). Batch correction methods can be conceptually classified into the following three categories depending on how they model batch effects: (1) linear decomposition methods, (2) similarity-based batch correction methods in reduced dimension space, and (3) generative models using a variational autoencoder. A similar categorization scheme for the first two categories was previously proposed by Xu et al. (2021).

Linear decomposition methods

Modeling batch effects using a generalized linear decomposition model was first introduced to remove batch effects within or across bulk RNA expression datasets when integrating multiple bulk RNA expression datasets. For example, the ‘removeBatchEffect’ function in limma (Ritchie et al., 2015) and ComBat (Johnson et al., 2007) have employed this linear decomposition model to delineate batch effects in bulk RNA expression datasets.

In the ‘removeBatchEffect’ function in limma, the merged data matrix ( X) including expression levels of N genes in K batches of M samples at H conditions is represented as a linear sum of (i) the overall gene expression matrix ( GMGE, N × M), (ii) condition-dependent gene expression matrix represented by multiplication of the condition regression coefficient matrix ( RC, N × H) and a condition design matrix ( DC, H × M), (iii) a batch term matrix represented by multiplication of the batch regression coefficient matrix ( RB, N × K) and a batch structure matrix ( DB, K × M), and (iv) an error matrix ( E, N × M) (Fig. 2A). In the overall gene expression matrix, each column (sample) contains the same vector (N × 1) including mean expression values of N genes across M samples. In the DC matrix, the column (H × 1) for sample m under condition h includes one in the h-th element and zeros in the other elements. Similarly, in the DB matrix, the column (K × 1) for sample m belonging to batch k includes one in the k-th element and zeros in the other elements. The condition ( RC) and batch regression coefficients ( RB) are estimated by minimizing the sum of the squared errors in E. The effects of the individual batches are linearly combined using regression coefficients (Fig. 2A, Batch term). Finally, the batch corrected matrix is generated by subtracting RBDB from X.

Figure F2
(A) Linear decomposition scheme used in limma and ComBat. Batches and conditions for cells are indicated by colors. Matrix sizes are denoted in left bottom (number of rows) and right ...

In addition to the above batch effects, called additive batch effects, ComBat estimates multiplicative batch effects. To this end, E is modified into multiplication of a regression coefficient matrix ( RMB, N × K), a batch structure matrix ( DB, K × M) and an error matrix ( E’, M × M) (Fig. 2A, Error matrix of ComBat). Unlike the linear regression in limma, the regression coefficients in this modified linear decomposition model are estimated using an empirical Bayesian method. With the multiplicative batch effect terms, batch correction was performed by [ X – ( GMGE + RCDC + RBDB)]/ RMBDB + ( GMGE + RCDC).

These approaches have been also used to remove batch effects when integrating multiple scRNA-seq datasets in early studies (Giustacchini et al., 2017; Kadoki et al., 2017; Smillie et al., 2019; Young et al., 2018). However, another linear decomposition model ZINB-WaVE (Risso et al., 2018) with an alternative model structure was developed specifically for integrating scRNA-seq datasets that have unique features such as zero inflation (dropouts), overdispersion, and different count distributions from bulk RNA-seq datasets. In this method, the expression count in the merged data matrix ( X, N × M) is defined as a random variable (y) following a zero-inflated negative binomial (ZINB) distribution: π δ(y) + (1 – π)fNB(y; μ, θ) where π is the probability of dropout (π), and δ and fNB are Dirac delta function and negative binomial probability mass function with the mean μ and an inverse dispersion parameter θ, respectively (Greene, 1994). The model decomposes the log μ matrix (N × M) into sample- ( S) and gene-level covariate matrices ( G) and unknown sample-level covariate matrix ( U) (Fig. 2B). Although S is defined for K batches, as in the aforementioned RBDB (Fig. 2B, Sample-level covariate), G is represented by multiplication of the design matrix ( DG, N × L) and gene regression coefficients ( RG, L × M). G was introduced to capture variations in L gene sets, each of which had different GC content and gene length distributions, possibly causing differences in the counts and quality of reads (Fig. 2B, Gene-level covariate). U was represented by multiplying the loading matrix ( RF, N × Q) and the factor matrix ( DF, Q × M) to capture the systematic cellular gene expression profiles in a low dimension of Q latent factors (Fig. 2B, Factor matrix). Notably, U may also include the effects of unknown factor-driven batches that may be missed when defined in S. The logit π (i.e., ln[π/(1 – π)]) matrix is also decomposed into Sπ, Gπ, and Uπ (Fig. 2B). However, the same factor matrix ( DF) is shared for both U and Uπ. The regression coefficients in RB, RG, and RF for the log μ and logit π matrices are then collectively estimated using the maximum likelihood method. DF was finally used as a cellular gene expression profile summarized in the Q-dimensional space during subsequent analyses (e.g., clustering and visualization).

Similarity-based batch correction methods in reduced dimension space

The above methods assume that there are no cell-level covariates and that all cells vary equivalently between different samples (or batches) in response to the same sources of variation. However, owing to distinct sensitivity in the response, cells can vary differentially between samples and thus between batches. For example, the three groups of cells shown in Fig. 3A exhibited different distribution changes between batches 1 and 2 in terms of their mean and standard deviation when visualized in a uniform manifold approximation and projection (UMAP) space. To address this cell-level covariate issue in batch correction, a number of methods have been developed, including canonical correlation analysis (CCA) (Butler et al., 2018), mutual nearest neighbor (MNN) (Haghverdi et al., 2018), fastMNN (Haghverdi et al., 2018), ‘IntegrateData’ function in Seurat (Stuart et al., 2019), Scanorama (Hie et al., 2019), BBKNN (Polański et al., 2020), Conos (Barkas et al., 2019), Harmony (Korsunsky et al., 2019), DESC (Li et al., 2020), LIGER (Welch et al., 2019), scMerge (Lin et al., 2019), and SAUCIE (Amodio et al., 2019). These methods start with the projection of cells in the merged dataset (X) onto a reduced space defined by several dimension reduction methods, such as principal component analysis (PCA), CCA, non-negative matrix factorization (NMF), and an autoencoder (e.g., a 2-dimentional space defined by two latent variables [LV1-2] in Fig. 3B). They then identify similar cells sharing expression profiles, which can be identified as pairs of cells between batches at the individual cell level (Fig. 3B, connected cells between batches 1 and 2 in a 2-dimentional LV space) or as cells from different batches in the same cluster at the cluster level. Batch effects are then corrected such that similar cells followed a common distribution in the reduced space (see batch-corrected cluster-level similar cells in Fig. 3C). Each step is described in detail below.

Dimension reduction
Figure F3
(A) Cell-level covariates. Three cell types (clusters) show differential variations between batches 1 and 2. (B) Anchored cell pairs between batches 1 and 2 on two-dimensional LV space. (C) Distributions ...

Projection values in the reduced dimension from the above linear or nonlinear methods are used to identify similar cells at the individual or cluster level (Fig. 3G), followed by batch correction for similar cells to follow a common distribution in the reduced space. However, SAUCIE does not identify similar cells, but directly performs batch correction to minimize discrepancies between batches (Amodio et al., 2019). To this end, SAUCIE randomly sets one batch as the reference batch, and corrects the mean and standard deviation of the embedded values for each non-reference batch with those of the reference batch while minimizing the reconstruction error. The weights in the autoencoder are thus determined to balance the reconstruction error and batch correction.

Identification of similar cells between batches

In practice, after batch correction using the aforementioned methods, the outputs, such as batch corrected merged data matrix (X from linear decomposition methods) or projections ( z) on the reduced space (e.g., Harmony and DESC), or updated factor loadings (e.g., LIGER) from these methods are subjected to another clustering (e.g., Louvain clustering using the kNN graph in Seurat), and identification of differentially expressed genes (e.g., ‘FindMarkers’ function in Seurat) and annotation of cell types (e.g., SingleR; Aran et al., 2019) are then performed for the resulting clusters.

Generative models with variational autoencoder

Linear decomposition and similarity-based methods using linear dimension reduction cannot effectively capture the nonlinear characteristics of batch effects and systematic biological signals. To address these issues, several methods using a generative model with variational autoencoder (e.g., scVI [Lopez et al., 2018], scGen [Lotfollahi et al., 2019], and trVAE [Lotfollahi et al., 2020]) have been developed.

Similarity-based methods often suffer from heavy computational loads during similarity searches and batch corrections, when hundreds of thousands of cells are integrated. scVI (Lopez et al., 2018) was developed to effectively model the nonlinear characteristics of data and resolve the computational load issue. The scVI assumes that the expression count (xm) for each gene in cell m follows the ZINB distribution p(xm|zm, sm, lm) where zm is the nonlinear LVs, sm is the batch information for cell m; and lm is a cell-specific size factor that accounts for variations during library construction and sequencing, which is not explicitly considered in the aforementioned methods. To estimate the ZINB probability, scVI employs a variational autoencoder model composed of ‘variational posterior’ and ‘generative model’ parts (Fig. 6A): the first part performs the inference of a variational posterior distribution q(zm, lm|xm, sm) for two unknown variables zm and lm given xm and sm using neural networks (NNs; Fig. 6A, NN1-4), and the second part estimates p(xm|zm, sm, lm) using other NNs (Fig. 6A, NN5-6). NN1-4 are trained to estimate the parameters (mean and standard deviation) of the Gaussian distributions that zm and lm are assumed to follow based on variational inference. After zm and lm (mean values) are sampled from their estimated distributions, zm, considered as batch corrected projections, are used to train NN5-6 to estimate the expected dropout (π) and frequency in the NB distribution, respectively. These expected values are finally used together with the sampled lm to estimate the expected expression counts (xm) that follow p(xm|zm, sm, lm). The weights of the NNs are updated such that the sampled zm and lm explain the observed xm given sm based on the ZINB distribution. The sampled zm can be used for clustering analysis, and the expected counts can be used to identify differentially expressed genes for cell clusters. scANVI (Xu et al., 2021), an extension of scVI, uses additional cell-type information such that the distribution of cells in the latent space (zm) reflects the cell types, thereby enabling batch corrections with the cell-type information considered.

Figure F6
(A) Architecture of scVI and schematic illustration of analytical steps in scVI. The outputs from NN5-6 are used to estimate the ZINB distribution p(xm|zm,sm,lm). (B). Schematic illustration of analytical steps ...

Similar to scANVI, scGen (Lotfollahi et al., 2019) uses the cell-type information obtained from the cell-type annotation of the clusters after cell clustering (e.g., three cell types in Fig. 6B, left). A variational autoencoder is used to estimate the distribution parameters (means and standard deviations) of the nonlinear LVs (zm) based on variational inference (Fig. 6B, middle top). The sampled zm are used to determine perturbation parameters (K × 1 δ) for K LVs as the difference between the mean vectors of zm for cells in batches (e.g., δ on 2-dimensional LV space between batches 1-2 in Fig. 6B, right). Batch effects are then corrected by applying δ to zm for cells in batch 2, and the batch corrected zm are then rescaled back to produce the batch corrected xm using the decoder NN (Fig. 6B, right). Finally, trVAE (Lotfollahi et al., 2020) employs a variational encoder similar to that in scGen to estimate the distribution parameters of zm; however, the input layer has additional L nodes that take the batch information for L batches. Optionally, nonlinear LVs can first be decoded in two sequential stages: from the LVs to an intermediate ym (g1 decoder to handle the discrepancy between batches) and then from ym to xm (g2 decoder).

Pros and cons

All the aforementioned methods have differences in model structures and parameters, dimension reduction, similarity search, and batch correction. These aspects of the individual methods are summarized in Supplementary Table S1. The unique characteristics of these methods provide advantages and disadvantages in terms of their performance (output type, speed, peak memory use, GPU support, etc.), which are summarized in Supplementary Table S1. ‘High’ performance in the table indicates that the corresponding methods were found to show good performance in terms of speed, memory, and batch correction performance evaluated based on diversity of batches in each cluster (i.e., cell type).

The characteristics of the individual methods are associated with the algorithms or approaches employed. Linear decomposition methods are the simplest, thereby providing an advantage in terms of analysis speed (Tran et al., 2020) (Supplementary Table S1, Speed). ComBat could model diverse batch effects using both additive and multiplicative batch terms compared to other linear decomposition methods, and also incorporated the empirical Bayes shrinkage of parameters to pool the information across genes, which provides robustness in correction of batches with small sample sizes. ZINB-WaVE includes unique gene-level covariates in the model. However, how its addition to the model improves performance has not been systematically tested, and the benefit of the gene-level covariate term is unclear, considering that the discrepancy among gene-level covariates may be handled by normalization strategies in other methods. Linear decomposition methods assume that the cell types between batches are similar (Haghverdi et al., 2018; Lin et al., 2019) and cannot effectively handle heterogeneity among cell-level covariates across samples or batches.

Although similarity-based batch correction methods effectively handle cell-level covariate issues, they commonly involve additional steps to search for similar cells and correct batch effects to match the distributions of similar cells. Because of these additional steps, some of them (e.g., MNN-based methods) often suffer from heavy computational loads (Supplementary Table S1, Speed and Peak memory use), which render limited applicability or scalability for scRNA-seq datasets, including hundreds of thousands of cells or more (Li et al., 2020). Both linear decomposition and similarity-based methods effectively handle unwanted non-systematic variations. For systematic unwanted variations, however, MNN-based methods can handle more effectively the systematic unwanted variations than the linear decomposition methods, as long as they identify pairs of similar cells with reasonable accuracy.

Similarity-based methods perform similarity searches and batch corrections mostly on reduced dimensions. Among the dimension reduction tools, CCA captures the shared sources of variations between batches and thus performs well when cell populations are largely shared between batches, whereas it is less likely to capture variations for cell subpopulations uniquely present in small numbers of batches. By contrast, PCA captures distinctive sources of variation with a sufficient number of LVs, which may make similarity search among these distinctive cell subpopulations possible. iNMF specifically models cells that are uniquely present in a small number of batches to effectively address this issue. Moreover, batch corrections based on batch vectors require large amounts of memory (Hie et al., 2019), thereby causing memory issues, particularly for datasets that include a large number of cells. Scanorama endeavored to resolve this memory issue by reducing peak memory usage, thereby improving the scalability of analyses with limited computing resources (Li et al., 2020; Luecken et al., 2022; Tran et al., 2020) (Supplementary Table S1, Speed, Peak memory use, and Performance). BBKNN and Conos provide no batch-corrected X or z, but several downstream analyses, such as functional gene program identification (Kotliar et al., 2019) or trajectory inference (Trapnell et al., 2014), require corrected X or z, thereby limiting the applicability of BBKNN and Conos (Luecken et al., 2022).

Among the cluster-level similarity search methods, Harmony shows high scalability in terms of both runtime and memory usage (Korsunsky et al., 2019; McKellar et al., 2021; Tran et al., 2020) (Supplementary Table S1, Speed). However, the KL divergence term in its objective function can make Harmony biased toward major cell types (clusters), including large numbers of cells. There may be a chance for Harmony to overcorrect batch effects for small cell types when they are integrated with major cell types. DESC does not require batch information, but corrects batch effects and simultaneously performs soft clustering. The exact number of batch-effect sources (or unwanted variations) is typically unknown. Although DESC may effectively handle these unknown sources of batch effects, systematic analyses are needed to understand the benefits from no use of batch information. LIGER was shown to have a tendency of focusing on batch effect correction rather than biological conservation, favoring its application to integration of cross-species datasets (Luecken et al., 2022) (Supplementary Table S1, Bio-conservation vs Batch correction).

scVI and scANVI are ‘all-inclusive’ tools that perform a range of analyses including normalization, dimension reduction, batch effect correction, imputation, clustering, and differential expression. They can effectively capture the nonlinear characteristics in data based on probabilistic models that statistically bound variations in random variables (xm or zm) using neural networks. Probabilistic models enable the propagation of the statistically bounded xm or zm to clustering, differential analysis, or cell-type annotation, providing effective handling of uncertainties in downstream analyses. These methods have been shown to scale up to very large datasets (Stuart et al., 2019). scVI and scANVI use the ZINB distribution for probabilistic modeling by default, but it is possible to use the NB distribution instead of the ZINB distribution. It is debatable which distribution better fits the read counts derived from either the droplet-based method or the full-length plate-based method. However, in the benchmarking analysis using both distributions, similar results in four datasets tested were shown (Xu et al., 2021).

As no single method performs well in all integration settings, several representative methods in the aforementioned categories (e.g., ComBat, Seurat’s anchoring method, harmony, and scVI) should be compared to select an appropriate method for data integration. Clusters of cells identified by each method on UMAP can be compared across the methods. Among the clusters, several are consistently identified with similar shapes and memberships in all methods. Focusing on these clusters provides the most reliable conclusions. However, these methods often produce inconsistent clusters, particularly for small clusters. For example, a small cluster can be identified as a distinct cluster in the ComBat and anchoring methods but merged into another large cluster in Harmony and scVI. Moreover, the relative position of the small cluster with respect to the other clusters on the UMAP cells can differ across methods. Furthermore, cells within a cluster can be evenly distributed or exhibit a skewed distribution. The patterns of these cluster characteristics across the methods can suggest whether batch effects are corrected too weakly, too much, or appropriately for particular clusters of interest, from linear and similarity-based (cell- or cluster-level) methods to variational autoencoders. Nevertheless, whether a small cluster is a biologically meaningful cell subpopulation or an artifact from cell isolation, and batch correction should be determined by detailed functional experiments for small clusters.


Data integration methods provide unique opportunities to systematically compare and federate cell types present in multiple sets of samples under similar or distinct disease conditions, thereby enabling a more comprehensive interpretation of the functional roles of different cell types under diverse disease conditions. Moreover, the integrative analysis of multiple datasets generated from diverse disease conditions can provide insights into the interplay of particular cell-type pairs by providing shared count variations of the cell-type pairs under disease conditions. Although data integration methods have improved from linear decomposition to nonlinear probabilistic methods, many problems remain to be resolved. A definition of batches is required for most tools. Although batches are defined subjectively based on the major sources of unwanted variation, there may still be unrecognized sources that cause batch effects. Moreover, most methods assume that batch effects are smaller than biological differences. However, there could be systematic unwanted variations that are similar in magnitude to biological differences. These systematic unwanted variations cannot be effectively distinguished from biological differences using the current methods. In addition, there is a significant need for tools that can effectively evaluate how robust or stable correction data integration methods can achieve in the presence of diverse types of non-systematic and systematic noises. Furthermore, data integration tends to be biased toward the major cell types that are commonly abundant across the integrated datasets, thereby not providing stable integration for the small cell types present only in particular small sets of datasets. Finally, owing to recent technical advances, the number of detected cells has substantially increased. Thus, the scalability of these methods needs to be improved to effectively handle large numbers of cells. Therefore, there are still plenty of room for improvements. Nonetheless, data integration methods have been applied to answer diverse single-cell-level biological and medical questions. Along with the improvement of these methods, their continuous application will shed new insights into cellular players and their interactions underlying disease pathogenesis, and provide new cellular targets for the treatment of various diseases.

Supplemental Materials

Note: Supplementary information is available on the Molecules and Cells website (

Article information

Mol. Cells.Feb 28, 2023; 46(2): 106-119.
Published online 2023-02-24. doi:  10.14348/molcells.2023.0009
1School of Biological Sciences, Seoul National University, Seoul 08826, Korea
Received January 10, 2023; Accepted January 19, 2023.
Articles from Mol. Cells are provided here courtesy of Mol. Cells


  • Amodio, M., van Dijk, D., Srinivasan, K., Chen, W.S., Mohsen, H., Moon, K.R., Campbell, A., Zhao, Y., Wang, X., Venkataswamy, M. (2019). Exploring single-cell data with deep multitasking neural networks. Nat. Methods. 16, 1139-1145.
  • Aran, D., Looney, A.P., Liu, L., Wu, E., Fong, V., Hsu, A., Chak, S., Naikawadi, R.P., Wolters, P.J., Abate, A.R. (2019). Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol.. 20, 163-172.
  • Argelaguet, R., Cuomo, A.S.E., Stegle, O., Marioni, J.C. (2021). Computational principles and challenges in single-cell data integration. Nat. Biotechnol.. 39, 1202-1215.
  • Barkas, N., Petukhov, V., Nikolaeva, D., Lozinsky, Y., Demharter, S., Khodosevich, K., Kharchenko, P.V. (2019). Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods. 16, 695-698.
  • Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Holko, M. (2013). NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res.. 41, D991-D995.
  • Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E. (2008). Fast unfolding of communities in large networks. J. Stat. Mech.. 2008, P10008.
  • Bolstad, B.M., Irizarry, R.A., Åstrand, M., Speed, T.P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 19, 185-193.
  • Brennecke, P., Anders, S., Kim, J.K., Kołodziejczyk, A.A., Zhang, X., Proserpio, V., Baying, B., Benes, V., Teichmann, S.A., Marioni, J.C. (2013). Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods. 10, 1093-1095.
  • Bryois, J., Calini, D., Macnair, W., Foo, L., Urich, E., Ortmann, W., Iglesias, V.A., Selvaraj, S., Nutma, E., Marzin, M. (2022). Cell-type-specific cis-eQTLs in eight human brain cell types identify novel risk genes for psychiatric and neurological disorders. Nat. Neurosci.. 25, 1104-1112.
  • Buettner, F., Natarajan, K.N., Casale, F.P., Proserpio, V., Scialdone, A., Theis, F.J., Teichmann, S.A., Marioni, J.C., Stegle, O. (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol.. 33, 155-160.
  • Butler, A., Hoffman, P., Smibert, P., Papalexi, E., Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol.. 36, 411-420.
  • Bzdok, D., Altman, N., Krzywinski, M. (2018). Statistics versus machine learning. Nat. Methods. 15, 233-234.
  • Chen, H.I., Jin, Y., Huang, Y., Chen, Y. (2016). Detection of high variability in gene expression from single-cell RNA-seq profiling. BMC Genomics. 17, 508.
  • Cheng, S., Li, Z., Gao, R., Xing, B., Gao, Y., Yang, Y., Qin, S., Zhang, L., Ouyang, H., Du, P. (2021). A pan-cancer single-cell transcriptional atlas of tumor infiltrating myeloid cells. Cell. 184, 792-809.e23.
  • Csardi, G., Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems. 1695, 1-9.
  • Giorgino, T. (2009). Computing and visualizing dynamic time warping alignments in R: the dtw Package. J. Stat. Softw.. 31, 1-24.
  • Giustacchini, A., Thongjuea, S., Barkas, N., Woll, P.S., Povinelli, B.J., Booth, C.A.G., Sopp, P., Norfo, R., Rodriguez-Meira, A., Ashley, N. (2017). Single-cell transcriptomics uncovers distinct molecular signatures of stem cells in chronic myeloid leukemia. Nat. Med.. 23, 692-702.
  • Greene, W.H. (1994). . Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models, , ed. (New York:New York University), pp. .
  • Haghverdi, L., Lun, A.T.L., Morgan, M.D., Marioni, J.C. (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol.. 36, 421-427.
  • Hie, B., Bryson, B., Berger, B. (2019). Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol.. 37, 685-691.
  • Johnson, W.E., Li, C., Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 8, 118-127.
  • Kadoki, M., Patil, A., Thaiss, C.C., Brooks, D.J., Pandey, S., Deep, D., Alvarez, D., von Andrian, U.H., Wagers, A.J., Nakai, K. (2017). Organism-level analysis of vaccination reveals networks of protection across tissues. Cell. 171, 398-413.e21.
  • Kim, Y., Kim, T.K., Kim, Y., Yoo, J., You, S., Lee, I., Carlson, G., Hood, L., Choi, S., Hwang, D. (2011). Principal network analysis: identification of subnetworks representing major dynamics using gene expression data. Bioinformatics. 27, 391-398.
  • Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., Baglaenko, Y., Brenner, M., Loh, P.R., Raychaudhuri, S. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods. 16, 1289-1296.
  • Kotliar, D., Veres, A., Nagy, M.A., Tabrizi, S., Hodis, E., Melton, D.A., Sabeti, P.C. (2019). Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. Elife. 8, e43803.
  • Kriebel, A.R., Welch, J.D. (2022). UINMF performs mosaic integration of single-cell multi-omic datasets using nonnegative matrix factorization. Nat. Commun.. 13, 780.
  • Li, X., Wang, K., Lyu, Y., Pan, H., Zhang, J., Stambolian, D., Susztak, K., Reilly, M.P., Hu, G., Li, M. (2020). Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat. Commun.. 11, 2338.
  • Lin, Y., Ghazanfar, S., Wang, K.Y.X., Gagnon-Bartsch, J.A., Lo, K.K., Su, X., Han, Z.G., Ormerod, J.T., Speed, T.P., Yang, P. (2019). scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proc. Natl. Acad. Sci. U. S. A.. 116, 9775-9784.
  • Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nat. Methods. 15, 1053-1058.
  • Lotfollahi, M., Naghipourfar, M., Theis, F.J., Wolf, F.A. (2020). Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics. 36, i610-i617.
  • Lotfollahi, M., Wolf, F.A., Theis, F.J. (2019). scGen predicts single-cell perturbation responses. Nat. Methods. 16, 715-721.
  • Luecken, M.D., Büttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Mueller, M.F., Strobl, D.C., Zappia, L., Dugas, M., Colomé-Tatché, M. (2022). Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods. 19, 41-50.
  • Lun, A.T., McCarthy, D.J., Marioni, J.C. (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122.
  • McKellar, D.W., Walter, L.D., Song, L.T., Mantri, M., Wang, M.F.Z., De Vlaminck, I., Cosgrove, B.D. (2021). Large-scale integration of single-cell transcriptomic data captures transitional progenitor states in mouse skeletal muscle regeneration. Commun. Biol.. 4, 1280.
  • Molania, R., Gagnon-Bartsch, J.A., Dobrovic, A., Speed, T.P. (2019). A new normalization for Nanostring nCounter gene expression data. Nucleic Acids Res.. 47, 6073-6083.
  • Morabito, S., Miyoshi, E., Michael, N., Shahin, S., Martini, A.C., Head, E., Silva, J., Leavy, K., Perez-Rosendahl, M., Swarup, V. (2021). Single-nucleus chromatin accessibility and transcriptomic characterization of Alzheimer's disease. Nat. Genet.. 53, 1143-1155.
  • Polański, K., Young, M.D., Miao, Z., Meyer, K.B., Teichmann, S.A., Park, J.E. (2020). BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics. 36, 964-965.
  • Regev, A., Teichmann, S.A., Lander, E.S., Amit, I., Benoist, C., Birney, E., Bodenmiller, B., Campbell, P., Carninci, P., Clatworthy, M. (2017). The human cell atlas. Elife. 6, e27041.
  • Reichart, D., Lindberg, E.L., Maatz, H., Miranda, A.M.A., Viveiros, A., Shvetsov, N., Gartner, A., Nadelmann, E.R., Lee, M., Kanemaru, K. (2022). Pathogenic variants damage cell composition and single cell transcription in cardiomyopathies. Science. 377, eabo1984.
  • Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S., Vert, J.P. (2018). A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun.. 9, 284.
  • Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res.. 43, e47.
  • Smillie, C.S., Biton, M., Ordovas-Montanes, J., Sullivan, K.M., Burgin, G., Graham, D.B., Herbst, R.H., Rogel, N., Slyper, M., Waldman, J. (2019). Intra- and inter-cellular rewiring of the human colon during ulcerative colitis. Cell. 178, 714-730.e22.
  • Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W.M., Hao, Y., Stoeckius, M., Smibert, P., Satija, R. (2019). Comprehensive integration of single-cell data. Cell. 177, 1888-1902.e21.
  • Tran, H.T.N., Ang, K.S., Chevrier, M., Zhang, X., Lee, N.Y.S., Goh, M., Chen, J. (2020). A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol.. 21, 12.
  • Trapnell, C., Cacchiarelli, D., Grimsby, J., Pokharel, P., Li, S., Morse, M., Lennon, N.J., Livak, K.J., Mikkelsen, T.S., Rinn, J.L. (2014). The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol.. 32, 381-386.
  • Uchimura, K., Wu, H., Yoshimura, Y., Humphreys, B.D. (2020). Human pluripotent stem cell-derived kidney organoids with improved collecting duct maturation and injury modeling. Cell Rep.. 33, 108514.
  • Vallejos, C.A., Marioni, J.C., Richardson, S. (2015). BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol.. 11, e1004333.
  • Villa, C.E., Cheroni, C., Dotter, C.P., Lopez-Tobon, A., Oliveira, B., Sacco, R., Yahya, A.C., Morandell, J., Gabriele, M., Tavakoli, M.R. (2022). CHD8 haploinsufficiency links autism to transient alterations in excitatory and inhibitory trajectories. Cell Rep.. 39, 110615.
  • Welch, J.D., Kozareva, V., Ferreira, A., Vanderburg, C., Martin, C., Macosko, E.Z. (2019). Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell. 177, 1873-1887.e17.
  • Xu, C., Lopez, R., Mehlman, E., Regier, J., Jordan, M.I., Yosef, N. (2021). Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol.. 17, e9620.
  • Yang, Z., Michailidis, G. (2016). A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. Bioinformatics. 32, 1-8.
  • Yoon, B.K., Oh, T.G., Bu, S., Seo, K.J., Kwon, S.H., Lee, J.Y., Kim, Y., Kim, J.W., Ahn, H.S., Fang, S. (2022). The peripheral immune landscape in a patient with myocarditis after the administration of BNT162b2 mRNA vaccine. Mol. Cells. 45, 738-748.
  • Young, A.L., Marinescu, R.V., Oxtoby, N.P., Bocchetta, M., Yong, K., Firth, N.C., Cash, D.M., Thomas, D.L., Dick, K.M., Cardoso, J. (2018). Uncovering the heterogeneity and temporal complexity of neurodegenerative diseases with Subtype and Stage Inference. Nat. Commun.. 9, 4273.

Figure 1

(A) Schematic illustration of defining batches by donors (Dataset 1), sample preparation protocols (Dataset 2), sequencing platforms (Dataset 3), and individual samples (donors; Dataset k). (B) Analytical flow of data integration. See text for details.

Figure 2

(A) Linear decomposition scheme used in limma and ComBat. Batches and conditions for cells are indicated by colors. Matrix sizes are denoted in left bottom (number of rows) and right top (number of columns) corners: N genes, M cells, H conditions, and K batches. The error matrix used in ComBat is depicted in parentheses. (B) Decomposition scheme used in ZINB-WaVE involving L gene-level covariates and Q unknown sample-level covariates.

Figure 3

(A) Cell-level covariates. Three cell types (clusters) show differential variations between batches 1 and 2. (B) Anchored cell pairs between batches 1 and 2 on two-dimensional LV space. (C) Distributions of cells after batch correction on the UMAP. (D and E) Schematic illustration of PCA (D) and CCA (E). PC1 and PC2 are defined to capture the largest and 2nd largest variance in the distribution of cells while u and v are defined to maximize the correlation between projections of X1 (batch 1) and X2 (batch 2) onto u and v. Decomposition schemes of X1 and X1 are also shown. (F) Architecture of the autoencoder that takes X as an input and tries to reconstruct X itself. During this reconstruction, the essential features of X are extracted in the nodes of the embedding layer. UMAP, uniform manifold approximation and projection; PCA, principal component analysis; CCA, canonical correlation analysis; PC, principal component.

Figure 4

(A) Similar cell pairs identified by cell-level similarity search (left) and similar clusters identified by clustering (right). (B) Schematic illustration of MNN strategy for identifying anchored cell pairs (left) and batch correction strategy (right). (C and D) Dynamic time warping involving selection of metagenes (C, top), determination of metagene expression profiles (C, bottom), generation of cumulative distance matrix (D, top), and dynamic time warping strategy (D, bottom). See text for details (B-D).

Figure 5

Schematic illustration of the analytical steps in Harmony (A), DESC (B), LIGER (C), and scMerge (D). See text for details.

Figure 6

(A) Architecture of scVI and schematic illustration of analytical steps in scVI. The outputs from NN5-6 are used to estimate the ZINB distribution p(xm|zm,sm,lm). (B). Schematic illustration of analytical steps in scGen. See text for details (A and B).