Mol. Cells 2023; 46(2): 106-119
Published online February 24, 2023
https://doi.org/10.14348/molcells.2023.0009
© The Korean Society for Molecular and Cellular Biology
Yeonjae Ryu1,2, Geun Hee Han1,2, Eunsoo Jung1,2, and Daehee Hwang1,*
1School of Biological Sciences, Seoul National University, Seoul 08826, Korea; 2These authors contributed equally to this work.
Correspondence to: daehee@snu.ac.kr
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.
With the increased number of single-cell RNA sequencing (scRNA-seq) datasets in public repositories, integrative analysis of multiple scRNA-seq datasets has become commonplace. Batch effects among different datasets are inevitable because of differences in cell isolation and handling protocols, library preparation technologies, and sequencing platforms. To remove these batch effects for effective integration of multiple scRNA-seq datasets, a number of methodologies have been developed based on diverse concepts and approaches. These methods have proven useful for examining whether cellular features identified from a certain dataset, such as cell subpopulations and marker genes, are consistently present in other datasets, and whether their condition-dependent variations, such as increases in cell subpopulations under particular disease-related conditions, are consistently observed in different datasets generated under similar or distinct conditions. In this review, we summarize the concepts and approaches of the integration methods and their pros and cons as reported in the literature.
Keywords: batch correction, data integration, single-cell RNA-seq
Single-cell RNA sequencing (scRNA-seq) datasets have accumulated rapidly in public data repositories, such as the Single Cell Portal (https://singlecell.broadinstitute.org/single_cell), Gene Expression Omnibus (Barrett et al., 2013), and Human Cell Atlas (HCA) data portal (Regev et al., 2017). With this increase in the number of datasets, many efforts have been made toward the integrative analysis of different scRNA-seq datasets. Multiple scRNA-seq datasets are often integrated and compared to check whether cellular features (e.g., cell subpopulations and their marker genes) identified from a certain dataset are shared with, or distinct from, those in other datasets produced under similar or different biological conditions. The integration of multiple scRNA-seq datasets has proven useful for reliably identifying shared or distinctive cellular features across datasets. Shared cellular features may not be clear when individual datasets are analyzed independently, owing to small numbers of cells or sparse expression data in each dataset and unwanted technical or biological variations within and across datasets. However, these cellular features can be corroborated by combining features from multiple datasets after correcting for the unwanted variations.
In many cases, individual scRNA-seq datasets are generated from samples with distinctive characteristics (e.g., cell counts, tissue types, and conditions from which cells are isolated, such as healthy or diseased conditions) and using different experimental protocols (e.g., cell isolation and handling protocols and library preparation methods) or sequencing platforms. These differences inevitably lead to unwanted technical and biological variations across different datasets. Even within a single dataset, multiple batches with variations in sample characteristics, experimental protocols, or sequencing platforms can exist. These unwanted variations among batches, called batch effects, within and across datasets can decrease the chances of identifying underlying cellular features by introducing inconsistent cellular expression profile structures. Therefore, batch effects arising from both systematic technical and unwanted biological variations should be corrected before integrative analysis of multiple datasets to prevent misleading conclusions. To effectively correct batch effects, several computational methods have been developed based on different concepts and approaches. In this review, we conceptually categorize existing integration methods, describe the key algorithms employed in these methods, and summarize their advantages and disadvantages as reported in the literature.
Batch effects occur mainly due to differences in the following three factors: (1) sample characteristics (donors, tissues, species, or disease conditions), (2) experimental protocols, and (3) sequencing platforms (Fig. 1A, top). The integration of multiple datasets begins by defining batches as sets of samples that are thought to be similar in terms of these factors (Fig. 1A, boxes in different colors). The factors leading to the most significant batch effects can vary across the integrated datasets. In practice, the major factor expected to cause the largest batch effect is often chosen subjectively, and batches are then defined based on that factor. For example, batches have been defined as sets of samples from individual donors (Reichart et al., 2022; Smillie et al., 2019; Uchimura et al., 2020; Villa et al., 2022) (Fig. 1A, Dataset 1) or as individual datasets generated using different protocols (Cheng et al., 2021; Morabito et al., 2021) and/or sequencing platforms (Cheng et al., 2021) (Fig. 1A, Datasets 2-3). Several studies have even defined batches as individual samples (Bryois et al., 2022; Yoon et al., 2022) (Fig. 1A).
Batch correction is next performed for the normalized merged dataset using the selected highly variable genes (Fig. 1B, Batch correction). Batch correction methods can be conceptually classified into the following three categories depending on how they model batch effects: (1) linear decomposition methods, (2) similarity-based batch correction methods in reduced dimension space, and (3) generative models using a variational autoencoder. A similar categorization scheme for the first two categories was previously proposed by Xu et al. (2021).
Modeling batch effects using a generalized linear decomposition model was first introduced to remove batch effects when integrating multiple bulk RNA expression datasets. For example, the ‘removeBatchEffect’ function in limma (Ritchie et al., 2015) and ComBat (Johnson et al., 2007) employ this linear decomposition model to delineate batch effects in bulk RNA expression datasets.
In the ‘removeBatchEffect’ function in limma, the merged data matrix (X), including the expression levels of genes across cells (or samples) from multiple batches, is modeled as a linear combination of batch indicator variables and biological covariates of interest plus a residual term. The batch coefficients are estimated by least-squares fitting, and the estimated batch terms are then subtracted from X to obtain the batch-corrected matrix.
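To make the idea concrete, the following is a minimal sketch of such additive batch removal, not the limma implementation itself: each gene is regressed on batch indicator variables, and the fitted batch terms are subtracted. The matrix shapes and variable names are illustrative assumptions.

# Minimal sketch of limma-style additive batch removal (illustrative, not limma itself).
import numpy as np

def remove_batch_effect(X, batch):
    """X: genes x cells log-expression matrix; batch: length-n array of batch labels."""
    batches = np.unique(batch)
    # One-hot design with the first batch as reference; the column of ones is the intercept.
    design = np.column_stack(
        [np.ones(X.shape[1])] + [(batch == b).astype(float) for b in batches[1:]]
    )
    # Least-squares fit of each gene's expression on the design matrix (cells x covariates).
    coef, *_ = np.linalg.lstsq(design, X.T, rcond=None)   # covariates x genes
    # Subtract only the batch-indicator terms, keeping the intercept (baseline expression).
    batch_part = design[:, 1:] @ coef[1:, :]               # cells x genes
    return X - batch_part.T

# Toy example: 100 genes, 50 cells, two batches.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
batch = np.array(["b1"] * 25 + ["b2"] * 25)
X_corrected = remove_batch_effect(X, batch)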
In addition to the above batch effects, called additive batch effects, ComBat estimates multiplicative batch effects. To this end, the residual term (E) is modified into the multiplication of a regression coefficient matrix (RMB) representing batch-specific scaling factors and the residual matrix, so that batch effects can also rescale the residual variance of each gene in each batch. Both the additive and multiplicative batch parameters are estimated using an empirical Bayes framework that pools information across genes, and the estimated batch effects are then removed from X.
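In practice, ComBat can be applied to a merged scRNA-seq dataset through its Scanpy implementation; a minimal sketch is shown below, assuming the merged AnnData object is stored in a hypothetical file 'merged.h5ad' with a 'batch' column in .obs.

# Minimal sketch of running ComBat on a merged dataset with Scanpy.
import scanpy as sc

adata = sc.read_h5ad("merged.h5ad")            # hypothetical merged dataset
sc.pp.normalize_total(adata, target_sum=1e4)   # library-size normalization
sc.pp.log1p(adata)                             # log-transformation
sc.pp.combat(adata, key="batch")               # corrects additive and multiplicative batch effects in adata.X in place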
These approaches have also been used to remove batch effects when integrating multiple scRNA-seq datasets in early studies (Giustacchini et al., 2017; Kadoki et al., 2017; Smillie et al., 2019; Young et al., 2018). However, another linear decomposition model, ZINB-WaVE (Risso et al., 2018), with an alternative model structure was developed specifically for integrating scRNA-seq datasets, which have unique features such as zero inflation (dropouts), overdispersion, and count distributions different from those of bulk RNA-seq datasets. In this method, the expression count in the merged data matrix (X) is modeled with a zero-inflated negative binomial (ZINB) distribution whose mean, dispersion, and zero-inflation probability are expressed as linear combinations of sample-level covariates (including batch), gene-level covariates, and unknown latent factors; the terms associated with unwanted (batch) variation are then removed to obtain corrected values.
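For reference, the ZINB-WaVE model structure (Risso et al., 2018) can be summarized as follows, using the notation of the original paper, where Y denotes the count matrix (the merged matrix X in this review), Xs the sample-level covariates such as batch, V the gene-level covariates, W the unknown latent factors, and O the offset matrices; this is a summary sketch rather than a full specification.

\[
Y_{ij} \sim \mathrm{ZINB}(\mu_{ij},\ \theta_{ij},\ \pi_{ij}), \qquad
\ln(\mu) = X_s\beta_\mu + (V\gamma_\mu)^\top + W\alpha_\mu + O_\mu,
\]
\[
\mathrm{logit}(\pi) = X_s\beta_\pi + (V\gamma_\pi)^\top + W\alpha_\pi + O_\pi, \qquad
\ln(\theta) = \zeta .
\]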
The above methods assume that there are no cell-level covariates and that all cells vary equivalently between different samples (or batches) in response to the same sources of variation. However, owing to distinct sensitivity in the response, cells can vary differentially between samples and thus between batches. For example, the three groups of cells shown in Fig. 3A exhibited different distribution changes between batches 1 and 2 in terms of their mean and standard deviation when visualized in a uniform manifold approximation and projection (UMAP) space. To address this cell-level covariate issue in batch correction, a number of methods have been developed, including canonical correlation analysis (CCA) (Butler et al., 2018), mutual nearest neighbor (MNN) (Haghverdi et al., 2018), fastMNN (Haghverdi et al., 2018), the ‘IntegrateData’ function in Seurat (Stuart et al., 2019), Scanorama (Hie et al., 2019), BBKNN (Polański et al., 2020), Conos (Barkas et al., 2019), Harmony (Korsunsky et al., 2019), DESC (Li et al., 2020), LIGER (Welch et al., 2019), scMerge (Lin et al., 2019), and SAUCIE (Amodio et al., 2019). These methods start with the projection of cells in the merged dataset (X) onto a reduced space defined by several dimension reduction methods, such as principal component analysis (PCA), CCA, non-negative matrix factorization (NMF), and an autoencoder (e.g., a 2-dimensional space defined by two latent variables [LV1-2] in Fig. 3B). They then identify similar cells sharing expression profiles, which can be identified as pairs of cells between batches at the individual cell level (Fig. 3B, connected cells between batches 1 and 2 in a 2-dimensional LV space) or as cells from different batches in the same cluster at the cluster level. Batch effects are then corrected such that similar cells follow a common distribution in the reduced space (see batch-corrected cluster-level similar cells in Fig. 3C). Each step is described in detail below.
scRNA-seq data are vulnerable to technical and biological noise owing to high dropout rates and low expression counts, leading to reduced power to decipher the underlying intrinsic biological differences between cells. Dimension reduction has been commonly employed to focus on the intrinsic information in the data and remove non-systematic noise during subsequent analyses of scRNA-seq data, such as searching for similar cells between batches, cell clustering and visualization (Bzdok et al., 2018). Furthermore, dimension reduction makes computation more convenient and efficient (Argelaguet et al., 2021). Hence, most methods use dimension reduction strategies for subsequent analyses.
Linear dimension reduction approaches have been the most frequently employed, including PCA/singular value decomposition (SVD) (Haghverdi et al., 2018), CCA (Butler et al., 2018), and NMF (Welch et al., 2019). PCA defines LVs, called principal components (PCs), that are orthogonal to each other and capture the largest variances in the data (Fig. 3D, PC1 and PC2). The number of PCs is determined such that the projection of the scRNA profiles (X) of individual cells onto the PCs sufficiently captures the variation (covariance) in the data, by minimizing the reconstruction error E = X – PᵀT, where P is the loading matrix whose rows define the PCs and T contains the projections (scores) of individual cells onto the PCs.
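As a concrete illustration, the following sketch performs PCA on a log-normalized cell-by-gene matrix with scikit-learn and recovers the scores (T), loadings (P), and reconstruction; the number of PCs and the variable names are illustrative assumptions.

# Minimal PCA sketch for a log-normalized cell x gene matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_counts = rng.poisson(1.0, size=(500, 2000)).astype(float)  # toy counts: 500 cells x 2000 genes
X_log = np.log1p(X_counts)                                    # log-normalized expression

pca = PCA(n_components=50)
T = pca.fit_transform(X_log)               # cell projections (scores) onto the PCs
P = pca.components_                        # PC loadings (50 x genes)
X_reconstructed = T @ P + pca.mean_        # reconstruction; E = X_log - X_reconstructed
print(pca.explained_variance_ratio_[:5])   # variance captured by the first PCs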
NMF defines non-negative LVs (often called metagenes) such that X is approximated by the product of two non-negative matrices: a loading matrix linking genes to factors and a coefficient matrix linking factors to cells. The non-negativity constraint yields parts-based, more interpretable factors. LIGER employs integrative NMF (iNMF), which decomposes the data from each batch into factors shared across batches and factors specific to individual batches (Welch et al., 2019).
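A minimal NMF sketch with scikit-learn is shown below; the number of factors (k = 20) is an illustrative assumption, and plain NMF does not include the shared/batch-specific factor split of iNMF.

# Minimal NMF sketch on a non-negative expression matrix (cells x genes).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(500, 2000)).astype(float)   # counts are non-negative

nmf = NMF(n_components=20, init="nndsvda", max_iter=400)
W = nmf.fit_transform(X)     # cell-by-factor coefficients (500 x 20)
H = nmf.components_          # factor-by-gene loadings ("metagenes", 20 x 2000)
# X is approximated by W @ H with all entries non-negative.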
Neural-network-based dimension reduction methods have been employed to effectively handle nonlinear correlations among variables in datasets. Among these methods, the autoencoder (Amodio et al., 2019) is one of the most frequently used. For the merged data matrix (X), the autoencoder is designed to reconstruct X itself and thus has (1) input and output layers with as many nodes as genes and (2) hidden layers with smaller numbers of nodes, the central bottleneck layer of which provides the nonlinear LV projections (z) of individual cells. The network weights are learned by minimizing the reconstruction error between the input X and its reconstruction from z.
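The following is a minimal autoencoder sketch in PyTorch for obtaining nonlinear LV projections of a log-normalized cell-by-gene matrix; the layer sizes and training settings are illustrative assumptions and do not correspond to any specific published method.

# Minimal autoencoder sketch for nonlinear dimension reduction.
import torch
import torch.nn as nn

n_genes, n_latent = 2000, 32

encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(), nn.Linear(256, n_latent))
decoder = nn.Sequential(nn.Linear(n_latent, 256), nn.ReLU(), nn.Linear(256, n_genes))

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.log1p(torch.poisson(torch.ones(500, n_genes)))   # toy log-normalized data (500 cells)

for epoch in range(100):
    z = encoder(X)                 # nonlinear LV projections (bottleneck layer)
    X_hat = decoder(z)             # reconstruction of the input
    loss = loss_fn(X_hat, X)       # reconstruction error to be minimized
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()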
Projection values in the reduced dimension from the above linear or nonlinear methods are used to identify similar cells at the individual or cluster level (Fig. 3G), followed by batch correction for similar cells to follow a common distribution in the reduced space. However, SAUCIE does not identify similar cells, but directly performs batch correction to minimize discrepancies between batches (Amodio et al., 2019). To this end, SAUCIE randomly sets one batch as the reference batch and adjusts the mean and standard deviation of the embedded values of each non-reference batch to match those of the reference batch while minimizing the reconstruction error. The weights in the autoencoder are thus determined to balance the reconstruction error and batch correction.
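The following sketch illustrates, in a simplified post-hoc form, the mean and standard deviation matching applied to the embedded values of non-reference batches; in SAUCIE itself this matching is a regularization term in the autoencoder's loss rather than a separate step, and the variable names here are assumptions.

# Simplified sketch of moment matching of embeddings between batches.
import numpy as np

def match_moments(z, batch, reference="batch1"):
    """z: cells x latent-dims embedding; batch: per-cell batch labels."""
    z_corrected = z.copy()
    ref = z[batch == reference]
    ref_mean, ref_std = ref.mean(axis=0), ref.std(axis=0)
    for b in np.unique(batch):
        if b == reference:
            continue
        mask = batch == b
        cur_mean, cur_std = z[mask].mean(axis=0), z[mask].std(axis=0)
        # Standardize the non-reference batch, then rescale to the reference moments.
        z_corrected[mask] = (z[mask] - cur_mean) / (cur_std + 1e-8) * ref_std + ref_mean
    return z_corrected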
Several methods have been developed to identify similar cells among different batches, which can be categorized into two groups according to whether similar cells between batches are identified as cell pairs at the individual cell level (Fig. 4A, left) or as sets of cells in the same cell cluster (Fig. 4A, right).
BBKNN and Conos do not explicitly correct the expression matrix; instead, they provide batch-aware weighted graphs for downstream analyses. BBKNN connects each cell to its k nearest neighbors identified separately within every batch, so that each cell's neighborhood is balanced across batches, and the resulting batch-balanced kNN graph is used for clustering and visualization. Conos performs pairwise comparisons between datasets to connect similar cells across them and combines these inter-dataset edges with intra-dataset kNN edges into a joint weighted graph, on which communities (cell subpopulations) are identified.
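As a usage illustration, BBKNN can be run through Scanpy's external API on a PCA embedding of the merged dataset; the file name and the 'batch' annotation key below are assumptions.

# Minimal sketch of building a batch-balanced kNN graph with BBKNN via Scanpy.
import scanpy as sc

adata = sc.read_h5ad("merged.h5ad")
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.pp.scale(adata)
sc.tl.pca(adata, n_comps=50)
sc.external.pp.bbknn(adata, batch_key="batch")   # replaces the neighbor graph with a batch-balanced one
sc.tl.umap(adata)                                # downstream steps use the BBKNN graph
sc.tl.leiden(adata)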
The ‘AlignSubspace’ function in Seurat v2 alternatively employs dynamic time warping for the cell-cell similarity search. It first selects the batch with the largest number of cells as the reference batch (e.g., batch 1). For a given query batch (e.g., batch 2), it applies CCA to identify the first LVs for batches 1 (u) and 2 (v), as illustrated in Fig. 3E; selects the top 30 genes with the highest contribution [i.e., the highest min(bicor(X1n,u), bicor(X2n,v)) for gene n, where bicor denotes the biweight midcorrelation]; computes a metagene for each cell as a weighted average of these genes; and then aligns the metagene values of the query batch to those of the reference batch using dynamic time warping, repeating this procedure for each LV.
DESC also performs cluster-level batch correction; however, its algorithm is different from that of Harmony. A stacked autoencoder, a variant of the autoencoder with pre-training and model fine-tuning steps, is first applied to the merged data matrix (X) to obtain nonlinear LV projections (zi for cell i). Soft clustering is then performed on the zi by iteratively minimizing the Kullback-Leibler (KL) divergence between the distribution of soft cluster assignments and an auxiliary target distribution that emphasizes high-confidence assignments. Because cells are progressively moved toward their cluster centers during this iterative optimization, regardless of their batch labels, batch effects are gradually removed as cells of the same type from different batches converge to the same cluster.
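Since this paragraph contrasts DESC with Harmony, a minimal sketch of running Harmony on a precomputed PCA embedding through Scanpy's wrapper around harmonypy is shown below; the file name, the 'batch' key, and the assumption that the AnnData object has already been normalized and reduced with PCA are illustrative.

# Minimal sketch of Harmony integration via Scanpy.
import scanpy as sc

adata = sc.read_h5ad("merged_preprocessed.h5ad")        # hypothetical object with PCA in .obsm['X_pca']
sc.external.pp.harmony_integrate(adata, key="batch")    # writes the corrected embedding to .obsm['X_pca_harmony']
sc.pp.neighbors(adata, use_rep="X_pca_harmony")         # build the kNN graph on the corrected embedding
sc.tl.umap(adata)
sc.tl.leiden(adata)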
LIGER also performs cluster-level batch correction. For clustering of cells, LIGER first applies integrative NMF (iNMF) to decompose the data from each batch into factors shared across batches and factors specific to individual batches. Each cell is assigned to the shared factor on which it has the largest loading, and joint clusters are refined using a shared factor neighborhood graph. Batch effects are then corrected by quantile normalization of the factor loadings of cells within each cluster across batches.
Finally, scMerge first performs k-means clustering of cells within each batch and identifies mutual nearest clusters between batches to define groups of pseudo-replicates. It then applies a factor-analysis model (RUVIII), using a set of stably expressed genes as negative controls together with the pseudo-replicates, to estimate and remove the unwanted variation from X.
In practice, after batch correction using the aforementioned methods, the outputs, such as the batch-corrected merged data matrix (X from linear decomposition methods), the corrected projections (z) in the reduced space (e.g., from Harmony and DESC), or the updated factor loadings (e.g., from LIGER), are subjected to another round of clustering (e.g., Louvain clustering using the kNN graph in Seurat); identification of differentially expressed genes (e.g., the ‘FindMarkers’ function in Seurat) and annotation of cell types (e.g., SingleR; Aran et al., 2019) are then performed for the resulting clusters.
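In a Python-based workflow, analogous downstream steps can be sketched with Scanpy as follows; 'X_corrected' is a placeholder for whichever corrected embedding the chosen integration method produced, and Leiden clustering and 'rank_genes_groups' play the roles of graph-based clustering and marker-gene identification, respectively.

# Minimal sketch of downstream analysis on a batch-corrected embedding.
import scanpy as sc

adata = sc.read_h5ad("integrated.h5ad")                              # hypothetical corrected object
sc.pp.neighbors(adata, use_rep="X_corrected")                        # kNN graph on the corrected embedding
sc.tl.leiden(adata, resolution=1.0)                                  # graph-based clustering
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")  # marker genes per cluster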
Linear decomposition and similarity-based methods using linear dimension reduction cannot effectively capture the nonlinear characteristics of batch effects and systematic biological signals. To address these issues, several methods using a generative model with variational autoencoder (e.g., scVI [Lopez et al., 2018], scGen [Lotfollahi et al., 2019], and trVAE [Lotfollahi et al., 2020]) have been developed.
Similarity-based methods often suffer from heavy computational loads during similarity searches and batch correction when hundreds of thousands of cells are integrated. scVI (Lopez et al., 2018) was developed to effectively model the nonlinear characteristics of the data and resolve this computational load issue. scVI assumes that the expression count (xng) of gene g in cell n follows a ZINB distribution whose parameters are nonlinear functions, implemented as decoder neural networks, of a low-dimensional latent variable (zn) and the batch label (sn). An encoder network approximates the posterior distribution of zn from the observed counts, and because the batch label is modeled explicitly, the learned latent representation zn is largely free of batch effects and can be used for downstream clustering and visualization.
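A minimal usage sketch with scvi-tools is shown below; the file name, the 'batch' key, the choice of 30 latent dimensions, and the batch name passed to 'transform_batch' are assumptions.

# Minimal sketch of scVI integration with scvi-tools.
import scanpy as sc
import scvi

adata = sc.read_h5ad("merged.h5ad")
adata.layers["counts"] = adata.X.copy()                     # scVI expects raw counts
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
model = scvi.model.SCVI(adata, n_latent=30)
model.train()
adata.obsm["X_scVI"] = model.get_latent_representation()    # batch-corrected latent space
# Decoded expression with all cells projected onto one batch (name must exist in adata.obs['batch']).
norm_expr = model.get_normalized_expression(transform_batch="batch1")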
Similar to scANVI, scGen (Lotfollahi et al., 2019) uses the cell-type information obtained from cell-type annotation of the clusters after cell clustering (e.g., three cell types in Fig. 6B, left). A variational autoencoder is used to estimate the distribution parameters (means and standard deviations) of the nonlinear LVs (z) for individual cells. For each cell type, scGen then computes the difference (delta vector) between the mean latent representations of cells from different batches and shifts the latent representation of each cell by this vector; the corrected latent values are decoded back into expression space to yield batch-corrected expression profiles.
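The latent-space arithmetic underlying this correction can be sketched in a simplified form as follows; the actual scGen operates on variational-autoencoder latent variables and decodes the shifted values back to expression space, whereas this sketch only illustrates the per-cell-type delta-vector shift, with illustrative variable names.

# Simplified sketch of per-cell-type delta-vector correction in latent space.
import numpy as np

def delta_shift(z, batch, cell_type, reference="batch1"):
    """Shift each non-reference cell by the per-cell-type mean difference in latent space."""
    z_corrected = z.copy()
    for ct in np.unique(cell_type):
        ref_mask = (batch == reference) & (cell_type == ct)
        if ref_mask.sum() == 0:
            continue
        ref_mean = z[ref_mask].mean(axis=0)
        for b in np.unique(batch):
            if b == reference:
                continue
            mask = (batch == b) & (cell_type == ct)
            if mask.sum() == 0:
                continue
            delta = ref_mean - z[mask].mean(axis=0)   # batch-difference (delta) vector
            z_corrected[mask] = z[mask] + delta        # move cells toward the reference batch
    return z_corrected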
All the aforementioned methods differ in model structures and parameters, dimension reduction, similarity search, and batch correction. The unique characteristics of each method confer advantages and disadvantages in terms of performance (output type, speed, peak memory use, GPU support, etc.); these aspects are summarized in Supplementary Table S1. ‘High’ performance in the table indicates that the corresponding methods showed good performance in terms of speed, memory, and batch correction, the latter evaluated based on the diversity of batches in each cluster (i.e., cell type).
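One simple way to quantify such batch diversity within clusters is the normalized entropy of batch labels per cluster, sketched below; this is an illustrative metric rather than the exact score used in the cited benchmarks, and it assumes at least two batches.

# Normalized batch entropy per cluster: 1 = batches fully mixed, 0 = cluster from a single batch.
import numpy as np

def batch_entropy_per_cluster(batch, cluster):
    batches = np.unique(batch)           # assumes len(batches) >= 2
    scores = {}
    for c in np.unique(cluster):
        labels = batch[cluster == c]
        p = np.array([(labels == b).mean() for b in batches])
        p = p[p > 0]
        entropy = -(p * np.log(p)).sum()
        scores[c] = entropy / np.log(len(batches))   # normalize to [0, 1]
    return scores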
The characteristics of the individual methods are associated with the algorithms or approaches employed. Linear decomposition methods are the simplest, which gives them an advantage in analysis speed (Tran et al., 2020) (Supplementary Table S1, Speed). Compared with other linear decomposition methods, ComBat can model more diverse batch effects using both additive and multiplicative batch terms, and it also incorporates empirical Bayes shrinkage of parameters to pool information across genes, which provides robustness when correcting batches with small sample sizes. ZINB-WaVE uniquely includes gene-level covariates in the model. However, how their addition improves performance has not been systematically tested, and the benefit of the gene-level covariate term is unclear, considering that discrepancies among gene-level covariates may be handled by the normalization strategies of other methods. Linear decomposition methods assume that the cell types between batches are similar (Haghverdi et al., 2018; Lin et al., 2019) and cannot effectively handle heterogeneity in cell-level covariates across samples or batches.
Although similarity-based batch correction methods effectively handle cell-level covariate issues, they commonly involve additional steps to search for similar cells and to correct batch effects so that the distributions of similar cells match. Because of these additional steps, some of them (e.g., MNN-based methods) often suffer from heavy computational loads (Supplementary Table S1, Speed and Peak memory use), which limits their applicability or scalability for scRNA-seq datasets containing hundreds of thousands of cells or more (Li et al., 2020). Both linear decomposition and similarity-based methods effectively handle unwanted non-systematic variations. MNN-based methods, however, can handle systematic unwanted variations more effectively than linear decomposition methods, as long as they identify pairs of similar cells with reasonable accuracy.
Similarity-based methods perform similarity searches and batch corrections mostly on reduced dimensions. Among the dimension reduction tools, CCA captures the shared sources of variations between batches and thus performs well when cell populations are largely shared between batches, whereas it is less likely to capture variations for cell subpopulations uniquely present in small numbers of batches. By contrast, PCA captures distinctive sources of variation with a sufficient number of LVs, which may make similarity search among these distinctive cell subpopulations possible. iNMF specifically models cells that are uniquely present in a small number of batches to effectively address this issue. Moreover, batch corrections based on batch vectors require large amounts of memory (Hie et al., 2019), thereby causing memory issues, particularly for datasets that include a large number of cells. Scanorama endeavored to resolve this memory issue by reducing peak memory usage, thereby improving the scalability of analyses with limited computing resources (Li et al., 2020; Luecken et al., 2022; Tran et al., 2020) (Supplementary Table S1, Speed, Peak memory use, and Performance). BBKNN and Conos provide no batch-corrected X or z, but several downstream analyses, such as functional gene program identification (Kotliar et al., 2019) or trajectory inference (Trapnell et al., 2014), require corrected X or z, thereby limiting the applicability of BBKNN and Conos (Luecken et al., 2022).
Among the cluster-level similarity search methods, Harmony shows high scalability in terms of both runtime and memory usage (Korsunsky et al., 2019; McKellar et al., 2021; Tran et al., 2020) (Supplementary Table S1, Speed). However, the KL divergence term in its objective function can bias Harmony toward major cell types (clusters) that contain large numbers of cells, and Harmony may therefore overcorrect batch effects for small cell types when they are integrated together with major cell types. DESC does not require batch information but corrects batch effects while simultaneously performing soft clustering. The exact number of batch-effect sources (or unwanted variations) is typically unknown; although DESC may effectively handle such unknown sources of batch effects, systematic analyses are needed to understand the benefits of not using batch information. LIGER was shown to have a tendency to focus on batch effect correction rather than biological conservation, favoring its application to the integration of cross-species datasets (Luecken et al., 2022) (Supplementary Table S1, Bio-conservation vs Batch correction).
scVI and scANVI are ‘all-inclusive’ tools that perform a range of analyses including normalization, dimension reduction, batch effect correction, imputation, clustering, and differential expression. They can effectively capture the nonlinear characteristics in the data based on probabilistic models that statistically bound variations in the random variables (e.g., the latent variables z and the expression counts x).
As no single method performs well in all integration settings, several representative methods from the aforementioned categories (e.g., ComBat, Seurat’s anchoring method, Harmony, and scVI) should be compared to select an appropriate method for data integration. The clusters of cells identified by each method on UMAP can be compared across the methods. Among the clusters, several are consistently identified with similar shapes and memberships by all methods, and focusing on these clusters provides the most reliable conclusions. However, the methods often produce inconsistent clusters, particularly small ones. For example, a small cluster can be identified as a distinct cluster by ComBat and the anchoring method but merged into another large cluster by Harmony and scVI. Moreover, the relative position of the small cluster with respect to the other clusters on the UMAP plot can differ across methods. Furthermore, cells within a cluster can be evenly distributed or exhibit a skewed distribution. The patterns of these cluster characteristics across the methods, from linear and similarity-based (cell- or cluster-level) methods to variational autoencoders, can suggest whether batch effects are corrected too weakly, too strongly, or appropriately for particular clusters of interest. Nevertheless, whether a small cluster is a biologically meaningful cell subpopulation or an artifact of cell isolation or batch correction should be determined by detailed functional experiments.
Data integration methods provide unique opportunities to systematically compare and federate cell types present in multiple sets of samples under similar or distinct disease conditions, thereby enabling a more comprehensive interpretation of the functional roles of different cell types under diverse disease conditions. Moreover, the integrative analysis of multiple datasets generated from diverse disease conditions can provide insights into the interplay of particular cell-type pairs by revealing shared variations in the counts of these cell-type pairs under disease conditions. Although data integration methods have improved from linear decomposition to nonlinear probabilistic methods, many problems remain to be resolved. A definition of batches is required for most tools. Although batches are defined subjectively based on the major sources of unwanted variation, there may still be unrecognized sources that cause batch effects. Moreover, most methods assume that batch effects are smaller than biological differences. However, there can be systematic unwanted variations that are similar in magnitude to biological differences, and such variations cannot be effectively distinguished from biological differences using current methods. In addition, there is a significant need for tools that can effectively evaluate how robust and stable the corrections achieved by data integration methods are in the presence of diverse types of non-systematic and systematic noise. Furthermore, data integration tends to be biased toward the major cell types that are commonly abundant across the integrated datasets, and thus does not provide stable integration for small cell types present in only a few datasets. Finally, owing to recent technical advances, the number of detected cells has substantially increased, so the scalability of these methods needs to be improved to handle large numbers of cells effectively. There is therefore still plenty of room for improvement. Nonetheless, data integration methods have been applied to answer diverse single-cell-level biological and medical questions. Along with the improvement of these methods, their continued application will shed new light on the cellular players and interactions underlying disease pathogenesis and provide new cellular targets for the treatment of various diseases.
This study was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF), funded by the Korean government (MSIT) (No. 2019M3A9B6066967).
Y.R., G.H.H., and E.J. wrote the original draft. Y.R., G.H.H., E.J., and D.H. reviewed and revised the manuscript.
The authors have no potential conflicts of interest to disclose.