# Integration of Single-Cell RNA-Seq Datasets: A Review of Computational Methods

Yeonjae Ryu, Geun Hee Han, Eunsoo Jung

## Abstract

With the increased number of single-cell RNA sequencing (scRNA-seq) datasets in public repositories, integrative analysis of multiple scRNA-seq datasets has become commonplace. Batch effects among different datasets are inevitable because of differences in cell isolation and handling protocols, library preparation technology, and sequencing platforms. To remove these batch effects for effective integration of multiple scRNA-seq datasets, a number of methodologies have been developed based on diverse concepts and approaches. These methods have proven useful for examining whether cellular features, such as cell subpopulations and marker genes, identified from a certain dataset, are consistently present, or whether their condition-dependent variations, such as increases in cell subpopulations in particular disease-related conditions, are consistently observed in different datasets generated under similar or distinct conditions. In this review, we summarize the concepts and approaches of the integration methods and their pros and cons as has been reported in previous literature.

**Keywords:**batch correction, data integration, single-cell RNA-seq

## INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) datasets have been increasingly accumulated in public data repositories, such as the Single Cell Portal (http://singlecell.broadinstitute.org/single_cell), Gene Expression Omnibus (Barrett et al., 2013), and Human Cell Atlas (HCA) data portal (Regev et al., 2017). With an increase in the number of datasets, many efforts have been made for the integrative analysis of different scRNA-seq datasets. Multiple scRNA-seq datasets are often integrated and compared to check whether cellular features (e.g., cell subpopulations and their marker genes) identified from a certain dataset are shared or distinctive compared with those in other datasets produced under similar or different biological conditions. The integration of multiple scRNA-seq datasets has proven useful for reliably identifying shared or distinctive cellular features across datasets. Shared cellular features may not be clear when individual datasets are analyzed independently, owing to small numbers of cells or sparse expression data in each dataset and unwanted technical or biological variations within and across datasets. However, these cellular features can be corroborated by combining features from multiple datasets after correcting the unwanted variations.

In many cases, individual scRNA-seq datasets are generated from samples with distinctive characteristics (e.g., cell counts, tissue types, and conditions from which cells are isolated, such as healthy or diseased conditions) and using different experimental protocols (e.g., cell isolation and handling protocols and library preparation methods) or sequencing platforms. These differences inevitably lead to unwanted technical and biological variations across different datasets. Even within a single dataset, multiple batches with variations in sample characteristics, experimental protocols, or sequencing platforms can exist. These unwanted variations among batches, called batch effects, within and across datasets can decrease the chances of identifying underlying cellular features by introducing inconsistent cellular expression profile structures. Therefore, batch effects arising from both systematic technical and unwanted biological variations should be corrected before integrative analysis of multiple datasets to prevent misleading conclusions. To effectively correct batch effects, several computational methods have been developed based on different concepts and approaches. In this review, we conceptually categorize existing integration methods, describe the key algorithms employed in these methods, and summarize their advantages and disadvantages as reported in literature.

## RESULTS

### Definition of sample batches in multiple datasets

Batch effects occur mainly due to differences in the following three factors: (1)
sample characteristics (donors, tissues, species, or disease conditions), (2) experimental
protocols, and (3) sequencing platforms (Fig. 1A, top). The integration of multiple datasets begins by defining batches as sets of
samples that are thought to be similar in terms of these factors (Fig. 1A, boxes in different colors). The factors leading to the most significant batch effects
can vary across the integrated datasets. It is common for the major factor causing
the largest batch effect to be chosen subjectively, and the batches are defined based
on the major factor. For example, batches were defined as sets of samples from individual
donors (Reichart et al., 2022; Smillie et al., 2019; Uchimura et al., 2020; Villa et al., 2022) (Fig. 1A, Dataset 1) or as individual datasets that were generated using different protocols
(Cheng et al., 2021; Morabito et al., 2021) and/or sequencing platforms (Cheng et al., 2021) (Fig. 1A, Datasets 2-3). Several studies even defined batches as individual samples (Bryois et al., 2022; Yoon et al., 2022) (Fig. 1A, Dataset *k*), assuming that different samples are subjected to distinct technical variations.
Moreover, when two factors (e.g., protocol and sequencing platform) cause significant
batch effects, two sets of batches can be defined based on these factors, and the
batch effects in the two sets can be sequentially corrected (Cheng et al., 2021; Morabito et al., 2021). After defining the batches, multiple datasets are then merged by concatenating
the expression counts of cells for the same genes in the expression count matrix for
each dataset. For each cell in the merged expression count matrix, the expression
counts are divided by the total expression count, multiplied by a scale factor (e.g.,
10,000), and log-transformed (Fig. 1B, Normalization). For each batch, a set of highly variable genes (e.g., 2,000 genes)
with large variances across the cells in the batch are then selected using various
tools (e.g., ‘FindVariableFeatures’ function in Seurat [v3 and higher hereafter] [Stuart et al., 2019], BASiCS [Vallejos et al., 2015], Brennecke [Brennecke et al., 2013], scLVM [Buettner et al., 2015], scran [Lun et al., 2016], and scVEGs [Chen et al., 2016]). The final set of highly variable genes most frequently selected across the batches
are then selected (‘SelectIntegrationFeatures’ function of Seurat; Fig. 1B, Selection of highly variable genes). Although data integration can be performed
using only highly variable genes or all genes, the use of highly variable genes only
in subsequent analyses (Fig. 1B, Batch correction, Cell clustering, and Cell annotation) has been shown to be generally
more effective for identifying underlying biological differences among cell types
and/or for removing unwanted variations in the data (Luecken et al., 2022).

### Categorization of integration methods

Batch correction is next performed for the normalized merged dataset using the selected highly variable genes (Fig. 1B, Batch correction). Batch correction methods can be conceptually classified into the following three categories depending on how they model batch effects: (1) linear decomposition methods, (2) similarity-based batch correction methods in reduced dimension space, and (3) generative models using a variational autoencoder. A similar categorization scheme for the first two categories was previously proposed by Xu et al. (2021).

### Linear decomposition methods

Modeling batch effects using a generalized linear decomposition model was first introduced to remove batch effects within or across bulk RNA expression datasets when integrating multiple bulk RNA expression datasets. For example, the ‘removeBatchEffect’ function in limma (Ritchie et al., 2015) and ComBat (Johnson et al., 2007) have employed this linear decomposition model to delineate batch effects in bulk RNA expression datasets.

In the ‘removeBatchEffect’ function in limma, the merged data matrix (
*N* genes in *K* batches of *M* samples at *H* conditions is represented as a linear sum of (i) the overall gene expression matrix
(
_{MGE}, *N* × *M*), (ii) condition-dependent gene expression matrix represented by multiplication of
the condition regression coefficient matrix (
_{C}, *N* × *H*) and a condition design matrix (
_{C}, *H* × *M*), (iii) a batch term matrix represented by multiplication of the batch regression
coefficient matrix (
_{B}, *N* × *K*) and a batch structure matrix (
_{B}, *K* × *M*), and (iv) an error matrix (
*N* × *M*) (Fig. 2A). In the overall gene expression matrix, each column (sample) contains the same vector
(*N* × 1) including mean expression values of *N* genes across *M* samples. In the
_{C} matrix, the column (*H* × 1) for sample *m* under condition *h* includes one in the *h*-th element and zeros in the other elements. Similarly, in the
_{B} matrix, the column (*K* × 1) for sample *m* belonging to batch *k* includes one in the *k*-th element and zeros in the other elements. The condition (
_{C}) and batch regression coefficients (
_{B}) are estimated by minimizing the sum of the squared errors in
_{B}_{B} from

In addition to the above batch effects, called additive batch effects, ComBat estimates
multiplicative batch effects. To this end,
_{MB}, *N* × *K*), a batch structure matrix (
_{B}, *K* × *M*) and an error matrix (
*M* × *M*) (Fig. 2A, Error matrix of ComBat). Unlike the linear regression in limma, the regression coefficients
in this modified linear decomposition model are estimated using an empirical Bayesian
method. With the multiplicative batch effect terms, batch correction was performed
by [
_{MGE} +
_{C}_{C} +
_{B}_{B})]/
_{MB}_{B} + (
_{MGE} +
_{C}_{C}).

These approaches have been also used to remove batch effects when integrating multiple
scRNA-seq datasets in early studies (Giustacchini et al., 2017; Kadoki et al., 2017; Smillie et al., 2019; Young et al., 2018). However, another linear decomposition model ZINB-WaVE (Risso et al., 2018) with an alternative model structure was developed specifically for integrating scRNA-seq
datasets that have unique features such as zero inflation (dropouts), overdispersion,
and different count distributions from bulk RNA-seq datasets. In this method, the
expression count in the merged data matrix (
*N* × *M*) is defined as a random variable (*y*) following a zero-inflated negative binomial (ZINB) distribution: *π* δ(*y*) + (1 – *π*)*f*_{NB}(*y*; *μ*, *θ*) where *π* is the probability of dropout (*π*), and δ and *f*NB are Dirac delta function and negative binomial probability mass function with the
mean *μ* and an inverse dispersion parameter *θ*, respectively (Greene, 1994). The model decomposes the log *μ* matrix (*N* × *M*) into sample- (
*K* batches, as in the aforementioned
_{B}_{B} (Fig. 2B, Sample-level covariate),
_{G}, *N* × *L*) and gene regression coefficients (
_{G}, *L* × *M*).
*L* gene sets, each of which had different GC content and gene length distributions,
possibly causing differences in the counts and quality of reads (Fig. 2B, Gene-level covariate).
_{F}, *N* × *Q*) and the factor matrix (
_{F}, *Q* × *M*) to capture the systematic cellular gene expression profiles in a low dimension of
Q latent factors (Fig. 2B, Factor matrix). Notably, U may also include the effects of unknown factor-driven
batches that may be missed when defined in
*π* (i.e., ln[*π*/(1 – *π*)]) matrix is also decomposed into
_{π},
_{π}, and
_{π} (Fig. 2B). However, the same factor matrix (
_{F}) is shared for both
_{π}. The regression coefficients in
_{B},
_{G}, and
_{F} for the log *μ* and logit *π* matrices are then collectively estimated using the maximum likelihood method.
_{F} was finally used as a cellular gene expression profile summarized in the Q-dimensional
space during subsequent analyses (e.g., clustering and visualization).

### Similarity-based batch correction methods in reduced dimension space

The above methods assume that there are no cell-level covariates and that all cells
vary equivalently between different samples (or batches) in response to the same sources
of variation. However, owing to distinct sensitivity in the response, cells can vary
differentially between samples and thus between batches. For example, the three groups
of cells shown in Fig. 3A exhibited different distribution changes between batches 1 and 2 in terms of their
mean and standard deviation when visualized in a uniform manifold approximation and
projection (UMAP) space. To address this cell-level covariate issue in batch correction,
a number of methods have been developed, including canonical correlation analysis
(CCA) (Butler et al., 2018), mutual nearest neighbor (MNN) (Haghverdi et al., 2018), fastMNN (Haghverdi et al., 2018), ‘IntegrateData’ function in Seurat (Stuart et al., 2019), Scanorama (Hie et al., 2019), BBKNN (Polański et al., 2020), Conos (Barkas et al., 2019), Harmony (Korsunsky et al., 2019), DESC (Li et al., 2020), LIGER (Welch et al., 2019), scMerge (Lin et al., 2019), and SAUCIE (Amodio et al., 2019). These methods start with the projection of cells in the merged dataset (_{X}) onto a reduced space defined by several dimension reduction methods, such as principal
component analysis (PCA), CCA, non-negative matrix factorization (NMF), and an autoencoder
(e.g., a 2-dimentional space defined by two latent variables [LV1-2] in Fig. 3B). They then identify similar cells sharing expression profiles, which can be identified
as pairs of cells between batches at the individual cell level (Fig. 3B, connected cells between batches 1 and 2 in a 2-dimentional LV space) or as cells
from different batches in the same cluster at the cluster level. Batch effects are
then corrected such that similar cells followed a common distribution in the reduced
space (see batch-corrected cluster-level similar cells in Fig. 3C). Each step is described in detail below.

*Dimension reduction*Projection values in the reduced dimension from the above linear or nonlinear methods are used to identify similar cells at the individual or cluster level (Fig. 3G), followed by batch correction for similar cells to follow a common distribution in the reduced space. However, SAUCIE does not identify similar cells, but directly performs batch correction to minimize discrepancies between batches (Amodio et al., 2019). To this end, SAUCIE randomly sets one batch as the reference batch, and corrects the mean and standard deviation of the embedded values for each non-reference batch with those of the reference batch while minimizing the reconstruction error. The weights in the autoencoder are thus determined to balance the reconstruction error and batch correction.

*Identification of similar cells between batches*In practice, after batch correction using the aforementioned methods, the outputs,
such as batch corrected merged data matrix (_{X} from linear decomposition methods) or projections (

### Generative models with variational autoencoder

Linear decomposition and similarity-based methods using linear dimension reduction cannot effectively capture the nonlinear characteristics of batch effects and systematic biological signals. To address these issues, several methods using a generative model with variational autoencoder (e.g., scVI [Lopez et al., 2018], scGen [Lotfollahi et al., 2019], and trVAE [Lotfollahi et al., 2020]) have been developed.

Similarity-based methods often suffer from heavy computational loads during similarity
searches and batch corrections, when hundreds of thousands of cells are integrated.
scVI (Lopez et al., 2018) was developed to effectively model the nonlinear characteristics of data and resolve
the computational load issue. The scVI assumes that the expression count (*x*_{m}) for each gene in cell *m* follows the ZINB distribution *p*(*x*_{m}|*z*_{m}, *s*_{m}, *l*_{m}) where *z*_{m} is the nonlinear LVs, *s*_{m} is the batch information for cell *m*; and *l*_{m} is a cell-specific size factor that accounts for variations during library construction
and sequencing, which is not explicitly considered in the aforementioned methods.
To estimate the ZINB probability, scVI employs a variational autoencoder model composed
of ‘variational posterior’ and ‘generative model’ parts (Fig. 6A): the first part performs the inference of a variational posterior distribution *q*(*z*_{m}, *l*_{m}|*x*_{m}, *s*_{m}) for two unknown variables *z*_{m} and *l*_{m} given *x*_{m} and *s*_{m} using neural networks (NNs; Fig. 6A, NN1-4), and the second part estimates *p*(*x*_{m}|*z*_{m}, *s*_{m}, *l*_{m}) using other NNs (Fig. 6A, NN5-6). NN1-4 are trained to estimate the parameters (mean and standard deviation)
of the Gaussian distributions that *z*_{m} and *l*_{m} are assumed to follow based on variational inference. After *z*_{m} and *l*_{m} (mean values) are sampled from their estimated distributions, *z*_{m}, considered as batch corrected projections, are used to train NN5-6 to estimate the
expected dropout (*π*) and frequency in the NB distribution, respectively. These expected values are finally
used together with the sampled *l*_{m} to estimate the expected expression counts (*x*_{m}) that follow *p*(*x*_{m}|*z*_{m}, *s*_{m}, *l*_{m}). The weights of the NNs are updated such that the sampled *z*_{m} and *l*_{m} explain the observed *x*_{m} given *s*_{m} based on the ZINB distribution. The sampled *z*_{m} can be used for clustering analysis, and the expected counts can be used to identify
differentially expressed genes for cell clusters. scANVI (Xu et al., 2021), an extension of scVI, uses additional cell-type information such that the distribution
of cells in the latent space (*z*_{m}) reflects the cell types, thereby enabling batch corrections with the cell-type information
considered.

Similar to scANVI, scGen (Lotfollahi et al., 2019) uses the cell-type information obtained from the cell-type annotation of the clusters
after cell clustering (e.g., three cell types in Fig. 6B, left). A variational autoencoder is used to estimate the distribution parameters
(means and standard deviations) of the nonlinear LVs (*z*_{m}) based on variational inference (Fig. 6B, middle top). The sampled *z*_{m} are used to determine perturbation parameters (*K* × 1 *δ*) for *K* LVs as the difference between the mean vectors of *z*_{m} for cells in batches (e.g., *δ* on 2-dimensional LV space between batches 1-2 in Fig. 6B, right). Batch effects are then corrected by applying *δ* to *z*_{m} for cells in batch 2, and the batch corrected *z*_{m} are then rescaled back to produce the batch corrected *x*_{m} using the decoder NN (Fig. 6B, right). Finally, trVAE (Lotfollahi et al., 2020) employs a variational encoder similar to that in scGen to estimate the distribution
parameters of *z*_{m}; however, the input layer has additional *L* nodes that take the batch information for *L* batches. Optionally, nonlinear LVs can first be decoded in two sequential stages:
from the LVs to an intermediate *y*_{m} (*g*_{1} decoder to handle the discrepancy between batches) and then from *y*_{m} to *x*_{m} (*g*_{2} decoder).

### Pros and cons

All the aforementioned methods have differences in model structures and parameters, dimension reduction, similarity search, and batch correction. These aspects of the individual methods are summarized in Supplementary Table S1. The unique characteristics of these methods provide advantages and disadvantages in terms of their performance (output type, speed, peak memory use, GPU support, etc.), which are summarized in Supplementary Table S1. ‘High’ performance in the table indicates that the corresponding methods were found to show good performance in terms of speed, memory, and batch correction performance evaluated based on diversity of batches in each cluster (i.e., cell type).

The characteristics of the individual methods are associated with the algorithms or approaches employed. Linear decomposition methods are the simplest, thereby providing an advantage in terms of analysis speed (Tran et al., 2020) (Supplementary Table S1, Speed). ComBat could model diverse batch effects using both additive and multiplicative batch terms compared to other linear decomposition methods, and also incorporated the empirical Bayes shrinkage of parameters to pool the information across genes, which provides robustness in correction of batches with small sample sizes. ZINB-WaVE includes unique gene-level covariates in the model. However, how its addition to the model improves performance has not been systematically tested, and the benefit of the gene-level covariate term is unclear, considering that the discrepancy among gene-level covariates may be handled by normalization strategies in other methods. Linear decomposition methods assume that the cell types between batches are similar (Haghverdi et al., 2018; Lin et al., 2019) and cannot effectively handle heterogeneity among cell-level covariates across samples or batches.

Although similarity-based batch correction methods effectively handle cell-level covariate issues, they commonly involve additional steps to search for similar cells and correct batch effects to match the distributions of similar cells. Because of these additional steps, some of them (e.g., MNN-based methods) often suffer from heavy computational loads (Supplementary Table S1, Speed and Peak memory use), which render limited applicability or scalability for scRNA-seq datasets, including hundreds of thousands of cells or more (Li et al., 2020). Both linear decomposition and similarity-based methods effectively handle unwanted non-systematic variations. For systematic unwanted variations, however, MNN-based methods can handle more effectively the systematic unwanted variations than the linear decomposition methods, as long as they identify pairs of similar cells with reasonable accuracy.

Similarity-based methods perform similarity searches and batch corrections mostly
on reduced dimensions. Among the dimension reduction tools, CCA captures the shared
sources of variations between batches and thus performs well when cell populations
are largely shared between batches, whereas it is less likely to capture variations
for cell subpopulations uniquely present in small numbers of batches. By contrast,
PCA captures distinctive sources of variation with a sufficient number of LVs, which
may make similarity search among these distinctive cell subpopulations possible. iNMF
specifically models cells that are uniquely present in a small number of batches to
effectively address this issue. Moreover, batch corrections based on batch vectors
require large amounts of memory (Hie et al., 2019), thereby causing memory issues, particularly for datasets that include a large number
of cells. Scanorama endeavored to resolve this memory issue by reducing peak memory
usage, thereby improving the scalability of analyses with limited computing resources
(Li et al., 2020; Luecken et al., 2022; Tran et al., 2020) (Supplementary Table S1, Speed, Peak memory use, and Performance). BBKNN and Conos provide no batch-corrected

Among the cluster-level similarity search methods, Harmony shows high scalability in terms of both runtime and memory usage (Korsunsky et al., 2019; McKellar et al., 2021; Tran et al., 2020) (Supplementary Table S1, Speed). However, the KL divergence term in its objective function can make Harmony biased toward major cell types (clusters), including large numbers of cells. There may be a chance for Harmony to overcorrect batch effects for small cell types when they are integrated with major cell types. DESC does not require batch information, but corrects batch effects and simultaneously performs soft clustering. The exact number of batch-effect sources (or unwanted variations) is typically unknown. Although DESC may effectively handle these unknown sources of batch effects, systematic analyses are needed to understand the benefits from no use of batch information. LIGER was shown to have a tendency of focusing on batch effect correction rather than biological conservation, favoring its application to integration of cross-species datasets (Luecken et al., 2022) (Supplementary Table S1, Bio-conservation vs Batch correction).

scVI and scANVI are ‘all-inclusive’ tools that perform a range of analyses including
normalization, dimension reduction, batch effect correction, imputation, clustering,
and differential expression. They can effectively capture the nonlinear characteristics
in data based on probabilistic models that statistically bound variations in random
variables (*x*_{m} or *z*_{m}) using neural networks. Probabilistic models enable the propagation of the statistically
bounded *x*_{m} or *z*_{m} to clustering, differential analysis, or cell-type annotation, providing effective
handling of uncertainties in downstream analyses. These methods have been shown to
scale up to very large datasets (Stuart et al., 2019). scVI and scANVI use the ZINB distribution for probabilistic modeling by default,
but it is possible to use the NB distribution instead of the ZINB distribution. It
is debatable which distribution better fits the read counts derived from either the
droplet-based method or the full-length plate-based method. However, in the benchmarking
analysis using both distributions, similar results in four datasets tested were shown
(Xu et al., 2021).

As no single method performs well in all integration settings, several representative methods in the aforementioned categories (e.g., ComBat, Seurat’s anchoring method, harmony, and scVI) should be compared to select an appropriate method for data integration. Clusters of cells identified by each method on UMAP can be compared across the methods. Among the clusters, several are consistently identified with similar shapes and memberships in all methods. Focusing on these clusters provides the most reliable conclusions. However, these methods often produce inconsistent clusters, particularly for small clusters. For example, a small cluster can be identified as a distinct cluster in the ComBat and anchoring methods but merged into another large cluster in Harmony and scVI. Moreover, the relative position of the small cluster with respect to the other clusters on the UMAP cells can differ across methods. Furthermore, cells within a cluster can be evenly distributed or exhibit a skewed distribution. The patterns of these cluster characteristics across the methods can suggest whether batch effects are corrected too weakly, too much, or appropriately for particular clusters of interest, from linear and similarity-based (cell- or cluster-level) methods to variational autoencoders. Nevertheless, whether a small cluster is a biologically meaningful cell subpopulation or an artifact from cell isolation, and batch correction should be determined by detailed functional experiments for small clusters.

## CONCLUSION

Data integration methods provide unique opportunities to systematically compare and federate cell types present in multiple sets of samples under similar or distinct disease conditions, thereby enabling a more comprehensive interpretation of the functional roles of different cell types under diverse disease conditions. Moreover, the integrative analysis of multiple datasets generated from diverse disease conditions can provide insights into the interplay of particular cell-type pairs by providing shared count variations of the cell-type pairs under disease conditions. Although data integration methods have improved from linear decomposition to nonlinear probabilistic methods, many problems remain to be resolved. A definition of batches is required for most tools. Although batches are defined subjectively based on the major sources of unwanted variation, there may still be unrecognized sources that cause batch effects. Moreover, most methods assume that batch effects are smaller than biological differences. However, there could be systematic unwanted variations that are similar in magnitude to biological differences. These systematic unwanted variations cannot be effectively distinguished from biological differences using the current methods. In addition, there is a significant need for tools that can effectively evaluate how robust or stable correction data integration methods can achieve in the presence of diverse types of non-systematic and systematic noises. Furthermore, data integration tends to be biased toward the major cell types that are commonly abundant across the integrated datasets, thereby not providing stable integration for the small cell types present only in particular small sets of datasets. Finally, owing to recent technical advances, the number of detected cells has substantially increased. Thus, the scalability of these methods needs to be improved to effectively handle large numbers of cells. Therefore, there are still plenty of room for improvements. Nonetheless, data integration methods have been applied to answer diverse single-cell-level biological and medical questions. Along with the improvement of these methods, their continuous application will shed new insights into cellular players and their interactions underlying disease pathogenesis, and provide new cellular targets for the treatment of various diseases.

## Supplemental Materials

*Note: Supplementary information is available on the Molecules and Cells website (www.molcells.org).*

## Article information

###### Articles from Mol. Cells are provided here courtesy of **Mol. Cells**

## References

- Amodio, M., van Dijk, D., Srinivasan, K., Chen, W.S., Mohsen, H., Moon, K.R., Campbell,
A., Zhao, Y., Wang, X., Venkataswamy, M. (2019). Exploring single-cell data with deep
multitasking neural networks. Nat. Methods.
*16*, 1139-1145. - Aran, D., Looney, A.P., Liu, L., Wu, E., Fong, V., Hsu, A., Chak, S., Naikawadi, R.P.,
Wolters, P.J., Abate, A.R. (2019). Reference-based analysis of lung single-cell sequencing
reveals a transitional profibrotic macrophage. Nat. Immunol..
*20*, 163-172. - Argelaguet, R., Cuomo, A.S.E., Stegle, O., Marioni, J.C. (2021). Computational principles
and challenges in single-cell data integration. Nat. Biotechnol..
*39*, 1202-1215. - Barkas, N., Petukhov, V., Nikolaeva, D., Lozinsky, Y., Demharter, S., Khodosevich,
K., Kharchenko, P.V. (2019). Joint analysis of heterogeneous single-cell RNA-seq dataset
collections. Nat. Methods.
*16*, 695-698. - Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M.,
Marshall, K.A., Phillippy, K.H., Sherman, P.M., Holko, M. (2013). NCBI GEO: archive
for functional genomics data sets--update. Nucleic Acids Res..
*41*, D991-D995. - Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E. (2008). Fast unfolding
of communities in large networks. J. Stat. Mech..
*2008*, P10008. - Bolstad, B.M., Irizarry, R.A., Åstrand, M., Speed, T.P. (2003). A comparison of normalization
methods for high density oligonucleotide array data based on variance and bias. Bioinformatics.
*19*, 185-193. - Brennecke, P., Anders, S., Kim, J.K., Kołodziejczyk, A.A., Zhang, X., Proserpio, V.,
Baying, B., Benes, V., Teichmann, S.A., Marioni, J.C. (2013). Accounting for technical
noise in single-cell RNA-seq experiments. Nat. Methods.
*10*, 1093-1095. - Bryois, J., Calini, D., Macnair, W., Foo, L., Urich, E., Ortmann, W., Iglesias, V.A.,
Selvaraj, S., Nutma, E., Marzin, M. (2022). Cell-type-specific cis-eQTLs in eight
human brain cell types identify novel risk genes for psychiatric and neurological
disorders. Nat. Neurosci..
*25*, 1104-1112. - Buettner, F., Natarajan, K.N., Casale, F.P., Proserpio, V., Scialdone, A., Theis,
F.J., Teichmann, S.A., Marioni, J.C., Stegle, O. (2015). Computational analysis of
cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations
of cells. Nat. Biotechnol..
*33*, 155-160. - Butler, A., Hoffman, P., Smibert, P., Papalexi, E., Satija, R. (2018). Integrating
single-cell transcriptomic data across different conditions, technologies, and species.
Nat. Biotechnol..
*36*, 411-420. - Bzdok, D., Altman, N., Krzywinski, M. (2018). Statistics versus machine learning.
Nat. Methods.
*15*, 233-234. - Chen, H.I., Jin, Y., Huang, Y., Chen, Y. (2016). Detection of high variability in
gene expression from single-cell RNA-seq profiling. BMC Genomics.
*17*, 508. - Cheng, S., Li, Z., Gao, R., Xing, B., Gao, Y., Yang, Y., Qin, S., Zhang, L., Ouyang,
H., Du, P. (2021). A pan-cancer single-cell transcriptional atlas of tumor infiltrating
myeloid cells. Cell.
*184*, 792-809.e23. - Csardi, G., Nepusz, T. (2006). The igraph software package for complex network research.
InterJournal, Complex Systems.
*1695*, 1-9. - Giorgino, T. (2009). Computing and visualizing dynamic time warping alignments in
R: the dtw Package. J. Stat. Softw..
*31*, 1-24. - Giustacchini, A., Thongjuea, S., Barkas, N., Woll, P.S., Povinelli, B.J., Booth, C.A.G.,
Sopp, P., Norfo, R., Rodriguez-Meira, A., Ashley, N. (2017). Single-cell transcriptomics
uncovers distinct molecular signatures of stem cells in chronic myeloid leukemia.
Nat. Med..
*23*, 692-702. - Greene, W.H. (1994). . Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models, , ed. (New York:New York University), pp. .
- Haghverdi, L., Lun, A.T.L., Morgan, M.D., Marioni, J.C. (2018). Batch effects in single-cell
RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol..
*36*, 421-427. - Hie, B., Bryson, B., Berger, B. (2019). Efficient integration of heterogeneous single-cell
transcriptomes using Scanorama. Nat. Biotechnol..
*37*, 685-691. - Johnson, W.E., Li, C., Rabinovic, A. (2007). Adjusting batch effects in microarray
expression data using empirical Bayes methods. Biostatistics.
*8*, 118-127. - Kadoki, M., Patil, A., Thaiss, C.C., Brooks, D.J., Pandey, S., Deep, D., Alvarez,
D., von Andrian, U.H., Wagers, A.J., Nakai, K. (2017). Organism-level analysis of
vaccination reveals networks of protection across tissues. Cell.
*171*, 398-413.e21. - Kim, Y., Kim, T.K., Kim, Y., Yoo, J., You, S., Lee, I., Carlson, G., Hood, L., Choi,
S., Hwang, D. (2011). Principal network analysis: identification of subnetworks representing
major dynamics using gene expression data. Bioinformatics.
*27*, 391-398. - Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., Baglaenko,
Y., Brenner, M., Loh, P.R., Raychaudhuri, S. (2019). Fast, sensitive and accurate
integration of single-cell data with Harmony. Nat. Methods.
*16*, 1289-1296. - Kotliar, D., Veres, A., Nagy, M.A., Tabrizi, S., Hodis, E., Melton, D.A., Sabeti,
P.C. (2019). Identifying gene expression programs of cell-type identity and cellular
activity with single-cell RNA-Seq. Elife.
*8*, e43803. - Kriebel, A.R., Welch, J.D. (2022). UINMF performs mosaic integration of single-cell
multi-omic datasets using nonnegative matrix factorization. Nat. Commun..
*13*, 780. - Li, X., Wang, K., Lyu, Y., Pan, H., Zhang, J., Stambolian, D., Susztak, K., Reilly,
M.P., Hu, G., Li, M. (2020). Deep learning enables accurate clustering with batch
effect removal in single-cell RNA-seq analysis. Nat. Commun..
*11*, 2338. - Lin, Y., Ghazanfar, S., Wang, K.Y.X., Gagnon-Bartsch, J.A., Lo, K.K., Su, X., Han,
Z.G., Ormerod, J.T., Speed, T.P., Yang, P. (2019). scMerge leverages factor analysis,
stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets.
Proc. Natl. Acad. Sci. U. S. A..
*116*, 9775-9784. - Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., Yosef, N. (2018). Deep generative
modeling for single-cell transcriptomics. Nat. Methods.
*15*, 1053-1058. - Lotfollahi, M., Naghipourfar, M., Theis, F.J., Wolf, F.A. (2020). Conditional out-of-distribution
generation for unpaired data using transfer VAE. Bioinformatics.
*36*, i610-i617. - Lotfollahi, M., Wolf, F.A., Theis, F.J. (2019). scGen predicts single-cell perturbation
responses. Nat. Methods.
*16*, 715-721. - Luecken, M.D., Büttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Mueller,
M.F., Strobl, D.C., Zappia, L., Dugas, M., Colomé-Tatché, M. (2022). Benchmarking
atlas-level data integration in single-cell genomics. Nat. Methods.
*19*, 41-50. - Lun, A.T., McCarthy, D.J., Marioni, J.C. (2016). A step-by-step workflow for low-level
analysis of single-cell RNA-seq data with Bioconductor. F1000Res.
*5*, 2122. - McKellar, D.W., Walter, L.D., Song, L.T., Mantri, M., Wang, M.F.Z., De Vlaminck, I.,
Cosgrove, B.D. (2021). Large-scale integration of single-cell transcriptomic data
captures transitional progenitor states in mouse skeletal muscle regeneration. Commun.
Biol..
*4*, 1280. - Molania, R., Gagnon-Bartsch, J.A., Dobrovic, A., Speed, T.P. (2019). A new normalization
for Nanostring nCounter gene expression data. Nucleic Acids Res..
*47*, 6073-6083. - Morabito, S., Miyoshi, E., Michael, N., Shahin, S., Martini, A.C., Head, E., Silva,
J., Leavy, K., Perez-Rosendahl, M., Swarup, V. (2021). Single-nucleus chromatin accessibility
and transcriptomic characterization of Alzheimer's disease. Nat. Genet..
*53*, 1143-1155. - Polański, K., Young, M.D., Miao, Z., Meyer, K.B., Teichmann, S.A., Park, J.E. (2020).
BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics.
*36*, 964-965. - Regev, A., Teichmann, S.A., Lander, E.S., Amit, I., Benoist, C., Birney, E., Bodenmiller,
B., Campbell, P., Carninci, P., Clatworthy, M. (2017). The human cell atlas. Elife.
*6*, e27041. - Reichart, D., Lindberg, E.L., Maatz, H., Miranda, A.M.A., Viveiros, A., Shvetsov,
N., Gartner, A., Nadelmann, E.R., Lee, M., Kanemaru, K. (2022). Pathogenic variants
damage cell composition and single cell transcription in cardiomyopathies. Science.
*377*, eabo1984. - Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S., Vert, J.P. (2018). A general
and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun..
*9*, 284. - Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., Smyth, G.K. (2015).
limma powers differential expression analyses for RNA-sequencing and microarray studies.
Nucleic Acids Res..
*43*, e47. - Smillie, C.S., Biton, M., Ordovas-Montanes, J., Sullivan, K.M., Burgin, G., Graham,
D.B., Herbst, R.H., Rogel, N., Slyper, M., Waldman, J. (2019). Intra- and inter-cellular
rewiring of the human colon during ulcerative colitis. Cell.
*178*, 714-730.e22. - Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W.M., Hao,
Y., Stoeckius, M., Smibert, P., Satija, R. (2019). Comprehensive integration of single-cell
data. Cell.
*177*, 1888-1902.e21. - Tran, H.T.N., Ang, K.S., Chevrier, M., Zhang, X., Lee, N.Y.S., Goh, M., Chen, J. (2020).
A benchmark of batch-effect correction methods for single-cell RNA sequencing data.
Genome Biol..
*21*, 12. - Trapnell, C., Cacchiarelli, D., Grimsby, J., Pokharel, P., Li, S., Morse, M., Lennon,
N.J., Livak, K.J., Mikkelsen, T.S., Rinn, J.L. (2014). The dynamics and regulators
of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat.
Biotechnol..
*32*, 381-386. - Uchimura, K., Wu, H., Yoshimura, Y., Humphreys, B.D. (2020). Human pluripotent stem
cell-derived kidney organoids with improved collecting duct maturation and injury
modeling. Cell Rep..
*33*, 108514. - Vallejos, C.A., Marioni, J.C., Richardson, S. (2015). BASiCS: Bayesian analysis of
single-cell sequencing data. PLoS Comput. Biol..
*11*, e1004333. - Villa, C.E., Cheroni, C., Dotter, C.P., Lopez-Tobon, A., Oliveira, B., Sacco, R.,
Yahya, A.C., Morandell, J., Gabriele, M., Tavakoli, M.R. (2022). CHD8 haploinsufficiency
links autism to transient alterations in excitatory and inhibitory trajectories. Cell
Rep..
*39*, 110615. - Welch, J.D., Kozareva, V., Ferreira, A., Vanderburg, C., Martin, C., Macosko, E.Z.
(2019). Single-cell multi-omic integration compares and contrasts features of brain
cell identity. Cell.
*177*, 1873-1887.e17. - Xu, C., Lopez, R., Mehlman, E., Regier, J., Jordan, M.I., Yosef, N. (2021). Probabilistic
harmonization and annotation of single-cell transcriptomics data with deep generative
models. Mol. Syst. Biol..
*17*, e9620. - Yang, Z., Michailidis, G. (2016). A non-negative matrix factorization method for detecting
modules in heterogeneous omics multi-modal data. Bioinformatics.
*32*, 1-8. - Yoon, B.K., Oh, T.G., Bu, S., Seo, K.J., Kwon, S.H., Lee, J.Y., Kim, Y., Kim, J.W.,
Ahn, H.S., Fang, S. (2022). The peripheral immune landscape in a patient with myocarditis
after the administration of BNT162b2 mRNA vaccine. Mol. Cells.
*45*, 738-748. - Young, A.L., Marinescu, R.V., Oxtoby, N.P., Bocchetta, M., Yong, K., Firth, N.C.,
Cash, D.M., Thomas, D.L., Dick, K.M., Cardoso, J. (2018). Uncovering the heterogeneity
and temporal complexity of neurodegenerative diseases with Subtype and Stage Inference.
Nat. Commun..
*9*, 4273.