Mol. Cells 2023; 46(2): 99-105
Published online February 28, 2023
https://doi.org/10.14348/molcells.2023.2178
© The Korean Society for Molecular and Cellular Biology
Correspondence to: yo.kim@amsterdamumc.nl
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.
Tumors are surrounded by a variety of tumor microenvironmental cells. Profiling individual cells within the tumor tissues is crucial to characterize the tumor microenvironment and its therapeutic implications. Since single-cell technologies are still not cost-effective, scientists have developed many statistical deconvolution methods to delineate cellular characteristics from bulk transcriptome data. Here, we present an overview of 20 deconvolution techniques, including cutting-edge techniques recently established. We categorized deconvolution techniques by three primary criteria: characteristics of methodology, use of prior knowledge of cell types and outcome of the methods. We highlighted the advantage of the recent deconvolution tools that are based on probabilistic models. Moreover, we illustrated two scenarios of the common application of deconvolution methods to study tumor microenvironments. This comprehensive review will serve as a guideline for the researchers to select the appropriate method for their application of deconvolution.
Keywords: statistical deconvolution, tumor microenvironment
Conventional bulk transcriptome analysis has contributed significantly to our understanding of the molecular mechanisms behind complex biological phenomena, yet it cannot fully uncover the intrinsic heterogeneity of samples. Previous studies revealed that tumors are surrounded by various microenvironmental cells, including endothelial, stromal and infiltrating immune cells, which interact with malignant cells to regulate tumor progression and therapeutic resistance (Baghban et al., 2020; Jin and Jin, 2020). Moreover, tumors consist of multiple subpopulations of malignant cells with different genotypic and phenotypic features (Dagogo-Jack and Shaw, 2018; Nguyen et al., 2016). However, bulk RNA-seq measures the accumulated gene expression levels of all cells in each sample, which limits its use for studying cellular heterogeneity. For these reasons, technologies such as laser capture microdissection (LCM) and fluorescence-activated cell sorting (FACS) have been developed to isolate and identify individual cells, further extending to single-cell RNA sequencing (scRNA-seq) (Lee et al., 2020). Although promising, single-cell technologies are labor-intensive and costly, which makes it difficult to secure enough samples and to determine proper markers for cell labeling (Lähnemann et al., 2020). Furthermore, the tissue dissociation step in scRNA-seq enriches cells that are easily detachable, which introduces a bias in the composition of the profiled cells (Denisenko et al., 2020).
Scientists thus developed various computational deconvolution methods to infer the abundance of different cell types from bulk RNA-seq data, making the most of pre-existing large cohort-based studies (e.g., The Cancer Genome Atlas [TCGA] and International Cancer Genome Consortium) (Campbell et al., 2020). Beyond cell type composition, advanced techniques can infer cell-type-specific gene expression levels, a task often referred to as purification. Based on this advancement, recent studies revealed distinct cellular states within the same cell type, defined by their specific transcriptome profiles (Andrade Barbosa et al., 2021; Chu et al., 2022; Luca et al., 2021). Because so many deconvolution methods have been published, it is challenging for users to select the most suitable one. Previous review articles have evaluated some deconvolution tools but focused only on benchmarking their performance using well-controlled gold standard data (Avila Cobos et al., 2020; Sturm et al., 2019). Few reviews have focused on the theoretical background of the methods, which can help users understand the strengths and weaknesses of each deconvolution method.
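The idea shared by most of the methods discussed in this review can be written as a linear mixing model. The formulation below is a generic sketch; the symbols are notation introduced here for illustration and are not taken from any specific method, and individual tools differ in their noise assumptions and constraints.

```latex
\begin{align}
y_{ij} &= \sum_{k=1}^{K} f_{jk}\, x_{ik} + \varepsilon_{ij}, \\
f_{jk} &\ge 0, \qquad \sum_{k=1}^{K} f_{jk} = 1 .
\end{align}
```

Here $y_{ij}$ is the bulk expression of gene $i$ in sample $j$, $x_{ik}$ is the expression of gene $i$ in cell type $k$, $f_{jk}$ is the fraction of cell type $k$ in sample $j$, and $\varepsilon_{ij}$ is a noise term. Estimating $f_{jk}$ with $x_{ik}$ given corresponds to inferring cell type fractions; purification additionally allows the cell-type-specific profile to vary per sample, that is, it estimates $x_{ijk}$ rather than a shared $x_{ik}$.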
Here, we provide a comprehensive overview of 20 statistical deconvolution tools, including some recently established techniques. We organized the tools by their algorithms and delineated the practical limitations that follow from their technical foundations. We focused on recent deconvolution techniques that can predict not only the cellular composition but also the cell-type-specific gene expression profiles. The majority of these methods are based on probabilistic models, which benefit from their flexible nature. Furthermore, we highlight recent applications to show how these techniques can delineate tumor ecosystems in large numbers of samples using commonly available bulk RNA-seq data. Finally, we share practical recommendations on how to apply deconvolution tools and perspectives on how they can contribute to tumor microenvironment studies.
We constructed an overview of 20 different deconvolution methods, including recent approaches, categorized by the characteristics of methodology, the use of prior knowledge for deconvolution and the outcome of the methods (Table 1). Among the 20 methods, nine deconvolution methods are based on linear approaches, which include robust linear regression (ABIS [Monaco et al., 2019]), regularized linear regression (DSA [Zhong et al., 2013], TIMER [Li et al., 2020], csSAM [Shen-Orr et al., 2010], MuSiC [Wang et al., 2019], quanTIseq [Finotello et al., 2019]), and non-negative matrix factorization (DECODER [Peng et al., 2019]). Three other methods, CIBERSORT (Newman et al., 2015), CIBERSORTx (Newman et al., 2019), and Bseq-SC (Baron et al., 2016), are based on support vector regression using a linear kernel, which makes them similar to linear regression methods. Other methods applied gene set enrichment approaches (e.g., MCP-counter [Becht et al., 2016]) or, more recently, probabilistic models (Fig. 1). Eighteen of the methods take prior knowledge of cell types for deconvolution (supervised/semi-supervised approach), among which csSAM (Shen-Orr et al., 2010) assumes that cell type fractions are known. In contrast, a few methods, such as DECODER (Peng et al., 2019) and CDSeq (Kang et al., 2019), do not require such an input (unsupervised approach). Although deconvolution after logarithmic transformation is known to lead to a downward bias (Zhong and Liu, 2012), some approaches, both old and recent, perform deconvolution in log-linear space.
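The linear approaches above can be illustrated with a minimal sketch based on non-negative least squares. This is not the implementation of any tool listed in Table 1; the function name `deconvolve_nnls`, the simulated signature and the noise-free mixture are all assumptions made for illustration. The last line repeats the fit on log-transformed data to show that the estimate then departs from the true fractions.

```python
import numpy as np
from scipy.optimize import nnls


def deconvolve_nnls(bulk, signature):
    """Estimate cell type fractions for one bulk sample (illustrative helper).

    bulk:      (n_genes,) expression values
    signature: (n_genes, n_cell_types) reference profiles on the same scale
    """
    coef, _ = nnls(signature, bulk)      # non-negative least squares fit
    total = coef.sum()
    if total == 0:
        raise ValueError("all coefficients are zero; check the inputs")
    return coef / total                  # rescale so the fractions sum to 1


# Simulated data (assumption): 200 genes, 3 cell types, no noise.
rng = np.random.default_rng(0)
signature = rng.lognormal(mean=2.0, sigma=1.0, size=(200, 3))
true_fractions = np.array([0.6, 0.3, 0.1])
bulk = signature @ true_fractions        # linear mixture of the profiles

# Deconvolution in linear space recovers the simulated fractions.
est_linear = deconvolve_nnls(bulk, signature)

# The same fit after log-transforming both inputs is no longer exact,
# because the log of a mixture is not the mixture of the logs.
est_log = deconvolve_nnls(np.log2(bulk + 1), np.log2(signature + 1))

print("linear space:", np.round(est_linear, 3))
print("log space:   ", np.round(est_log, 3))
```

Real data contain noise and cell types missing from the signature, so the exact recovery seen in linear space here should not be expected in practice.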
Among the supervised/semi-supervised methods that take prior knowledge of cell types for deconvolution, the vast majority employ a predetermined reference matrix of cell-type-specific gene expression data, often referred to as a signature. To construct a signature of preset cell types for deconvolution, cell-type-specific marker genes can be selected from databases or by performing differential gene expression analysis among the cell types. Early approaches constructed such a signature from gene expression data of purified cell populations, while recent approaches derive it from scRNA-seq data. CIBERSORTx (Newman et al., 2019) offers an internal tool that builds signatures representing each cell type by selecting marker gene reference profiles from scRNA-seq data. Although the vast majority of methods take only the expected gene expression profiles as the signature, MuSiC (Wang et al., 2019), Demix/DemixT (Ahn et al., 2013; Wang et al., 2018), EPIC (Racle et al., 2017), and BLADE (Andrade Barbosa et al., 2021) also take into account the variability of gene expression in each cell type for enhanced robustness. On the other hand, CDSeq (Kang et al., 2019) and DECODER (Peng et al., 2019) estimate the number of constituent cell types as well as their proportions from the bulk data without any signature (unsupervised approach). CDSeq additionally offers a quasi-unsupervised learning strategy, which augments the input bulk gene expression data with additional gene expression profiles of pure cell lines to obtain some guidance on cell type selection.
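As a rough illustration of how a signature might be derived from annotated scRNA-seq data, the sketch below computes the per-cell-type mean and variance of each gene and keeps the genes with the largest fold change against the other cell types. The function `build_signature`, the fold-change criterion and the simulated counts are assumptions for illustration; the tools discussed above use their own, more elaborate selection procedures.

```python
import numpy as np


def build_signature(counts, labels, n_markers=2):
    """Build a toy signature from labeled single-cell data (illustrative helper).

    counts: (n_cells, n_genes) expression values in linear scale
    labels: (n_cells,) cell type annotation of each cell
    Returns cell type names, selected gene indices, and the per-cell-type
    mean and variance of the selected genes (genes x cell types).
    """
    labels = np.asarray(labels)
    cell_types = np.unique(labels)
    mean = np.stack([counts[labels == t].mean(axis=0) for t in cell_types], axis=1)
    var = np.stack([counts[labels == t].var(axis=0) for t in cell_types], axis=1)

    markers = set()
    for k in range(len(cell_types)):
        # Highest mean expression among all other cell types, per gene.
        others = np.delete(mean, k, axis=1).max(axis=1)
        log_fc = np.log2(mean[:, k] + 1) - np.log2(others + 1)
        # Keep the genes with the largest fold change for this cell type.
        markers.update(np.argsort(log_fc)[::-1][:n_markers].tolist())

    genes = np.array(sorted(markers))
    return cell_types, genes, mean[genes], var[genes]


# Simulated data (assumption): 3 cell types, 30 cells each, 10 genes.
# Genes 0-5 are markers (two per cell type); genes 6-9 are uninformative.
rng = np.random.default_rng(1)
base = np.full((3, 10), 1.0)
base[0, [0, 1]] = 50.0
base[1, [2, 3]] = 50.0
base[2, [4, 5]] = 50.0
base[:, 6:] = 5.0
labels = np.repeat(["B", "T", "myeloid"], 30)
counts = rng.poisson(np.repeat(base, 30, axis=0)).astype(float)

cell_types, genes, sig_mean, sig_var = build_signature(counts, labels)
print("selected genes:", genes)
```

The mean matrix plays the role of the expected expression profile, while the variance matrix is the kind of additional information that variability-aware methods can exploit.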
The 20 methods in Table 1 can also be categorized by the type of outcome. In practice, a majority of supervised deconvolution methods can handle as many cell types as are included in the signature. However, enrichment-based approaches and probabilistic approaches often come with a limit on the number of cell types that can be used (e.g., two cell types for both ESTIMATE [Yoshihara et al., 2013] and ISOpure [Quon et al., 2013]), with the exception of xCell (Aran et al., 2017). Furthermore, enrichment analysis offers a less precise estimate of cell type abundance, which is often called an enrichment score. Unlike fractions, the score cannot be compared between cell types. Probabilistic methods require a sophisticated and often complex optimization strategy, which may limit the number of cell types, as in Demix/DemixT (Ahn et al., 2013; Wang et al., 2018). Although the vast majority of methods predict only the cell type fractions, recent approaches can also estimate the gene expression profiles of each cell type, often referred to as purification.
Probabilistic model-based deconvolution methods stand out for their exceptional flexibility, at the expense of complex mathematical formulation and high computational costs. First, linear regression methods, including support vector regression in CIBERSORT/CIBERSORTx, are bound to the assumption of normally distributed variability in gene expression data, whereas probabilistic models can adopt other variability assumptions such as log-normal (e.g., BLADE and Demix/DemixT) and multinomial variability (e.g., BayesPrism) (Chu et al., 2022). In particular, the log-normal distribution is more suitable than the normal distribution for modeling gene expression data, which cannot be log-transformed for deconvolution due to the known risk of bias (Zhong and Liu, 2012). Furthermore, probabilistic models enable the integration of multiple variables, both observed and hidden, to perform a more sophisticated deconvolution. BLADE and Demix/DemixT account for the gene expression variability observed from other data, such as scRNA-seq data. BayesPrism, CDSeq, BLADE and Demix/DemixT can perform
Established deconvolution techniques and commonly available RNA-seq data from previous cancer studies have been harnessed in many applications to study the tumor microenvironment (Fig. 2). Thorsson et al. (2018) identified six immune subtypes across 33 cancer types from TCGA using the CIBERSORT deconvolution method in combination with several immunogenic scoring methods, such as gene set analysis. In the downstream characterization of these six subtypes, they observed that a subtype with an elevated expression level of T helper cell markers, TH17 and TH1, was associated with the best prognosis. In contrast, subtypes with a mixed signature were associated with poor overall survival. Moreover, they suggested a global regulatory network model for all tumor types and immune subtypes that consists of regulatory relationships between subtype-specific transcription factors and cancer-type-specific somatic mutations. This network model illustrates how cancer-type-specific somatic mutations lead to a specific tumor immune microenvironment and an underlying key transcription factor. Recent studies have utilized the estimated gene expression profiles of each cell type (i.e., the outcome of the
Although deconvolution is an attractive and powerful tool to obtain an extra resolution of information from standard bulk RNA-seq data, it comes with a risk when it is applied arbitrarily. Given the complexity of the problem, it is best to make use of prior knowledge of cell types, as in supervised deconvolution techniques, while recognizing that cell quality is crucial. Ideally, prior knowledge is extracted from relevant scRNA-seq data that best reflect the biological context. This process includes the critical selection of cell types for deconvolution: as the number of cell types and the granularity of their classification increase, the number of parameters to be optimized increases and deconvolution becomes more challenging. Conversely, omitting a cell type that is abundant in the bulk gene expression data violates the common assumption in most deconvolution methods that bulk gene expression profiles can be reconstructed by combining cell-type-specific gene expression profiles. Finally, the accuracy of the predicted results can differ between cell types depending on many factors, including their abundance and the uniqueness of their gene expression pattern. Therefore, we recommend benchmarking the deconvolution performance for the specific prior knowledge extracted before its application. As in many benchmark experiments in deconvolution tools, an
There is still much room for development in deconvolution techniques. So far, there has been only limited application of deconvolution techniques to non-RNA molecule types, except for methylation data (Chakravarthy et al., 2018). Since the estimation of cell-type-specific molecular profiles is possible with recent deconvolution techniques, enabling deconvolution with non-RNA molecules can help us delineate cellular states with multiomics rather than RNA alone. The challenge of deconvolution with other molecular types is the limited availability of cell-type-specific information. Borrowing information from other data types can be an option, for instance, the results of RNA-based deconvolution or tumor fractions estimated from the allele fractions of mutation data (Poell et al., 2019). Although it is not straightforward for most deconvolution methods to account for such extra information, probabilistic models offer the possibility of integrating extra variables into the model. Multiomics profiling of individual cell types may enable us to delineate the molecular mechanisms that determine specific cellular behavior. For instance, integrating spatially resolved data, such as spatial transcriptome profiling (Moffitt et al., 2022) and multiplex immunofluorescence (Gorris et al., 2018), may enable us to study active immune cell infiltration and associated cellular signaling pathways. Spatial transcriptome techniques, such as NanoString's GeoMx and 10× Visium, provide high-resolution spatial information but fall short of single-cell resolution, which has led to the use of deconvolution techniques to gain a more accurate understanding of the cellular makeup (e.g., cell2location, a Bayesian deconvolution method for spatial transcriptomics data) (Kleshchevnikov et al., 2022).
With the abundance of large-scale multiomics studies and spatially resolved data already available, particularly in the field of oncology, advanced deconvolution techniques can be applied to gain an in-depth characterization of the tumor microenvironment.
Y.K. conceived and provided expertise. Y.K. and Y.I. wrote the manuscript.
The authors have no potential conflicts of interest to disclose.
Table 1. Overview of the 20 deconvolution methods covered in this review
Method | Algorithm | Supervised/unsupervised | Linear/log-linear | Use of marker gene expression profile (signature) | Use of single-cell RNA-seq data for signature | Account for gene expression variability | Cell type fractions | Cell-type-specific gene expression profiles | No. of cell types that can be handled*
---|---|---|---|---|---|---|---|---|---
ABIS | Robust linear regression | Supervised | Linear | Yes | No | No | Yes | No | Flexible (29)
DSA | Regularized linear regression | Supervised | Linear | Yes | No | No | Yes | Yes (group-mode) | Flexible (6)
TIMER | Regularized linear regression (multivariate normal) | Supervised | Linear | Yes | No | No | Yes | Yes | Flexible (6)
csSAM | Linear regression | Supervised | Log-linear | No | No | No | No | Yes (group-mode) | Flexible (5)
MuSiC | Weighted non-negative least squares | Supervised | Linear | Yes | Yes | Cross-subject variance | Yes | No | Flexible (13)
DECODER | NMF + Regularized linear regression | Unsupervised | Log-linear | No | No | No | Yes | Yes (group-mode) | Flexible (8)
CIBERSORT | nu-SVR (linear) | Supervised | Linear | Yes | No | No | Yes | No | Flexible (22)
CIBERSORTx | nu-SVR (linear) | Supervised | Linear | Yes | Yes | No | Yes | Yes (high resolution) | Flexible (10)
Bseq-SC | CIBERSORT + csSAM | Supervised | Linear | Yes | Yes | No | Yes | No | Flexible (6)
quanTIseq | Constrained least squares | Supervised | Linear | Yes | No | No | Yes | Yes (group-mode) | Flexible (10)
MCP-counter | Relative gene expression levels | Supervised | Log-linear | Yes | No | No | No; score | No | Flexible (10)
ESTIMATE | Gene set enrichment analysis (GSEA) | Supervised | Log-linear | Yes | No | No | No; score | No | 2
xCell | GSEA | Supervised | Log-linear | Yes | No | No | No; score | No | Flexible (64)
CDSeq | Probabilistic model (multinomial) | Unsupervised | Linear | No | No | No | Yes | Yes (group-mode) | Flexible (22)
Demix | Probabilistic model (log-normal) | Semi-supervised | Log-linear | Yes | No | Yes | Yes | Yes | 2
DemixT | Probabilistic model (log-normal) | Semi-supervised | Linear | Yes | No | Yes | Yes | Yes | 3
EPIC | Least squares | Supervised | Linear | Yes | No | Yes | Yes | No | Flexible (8)
BLADE | Probabilistic model (log-normal) | Supervised | Linear | Yes | Yes | Yes | Yes | Yes (high resolution) | Flexible (20)
ISOpure | Probabilistic model (multinomial) | Semi-supervised | Linear | Yes | No | No | Yes | Yes (high resolution) | 2
BayesPrism | Probabilistic model (multinomial) | Supervised | Linear | Yes | Yes | No | Yes | Yes (high resolution) | Flexible (10)
The deconvolution methods are categorized by characteristics of methodology (algorithm, supervised/unsupervised, linear/log-linear), use of prior knowledge (the three signature-related columns) and outcome (the last three columns).
NMF, non-negative matrix factorization; nu-SVR, nu-support vector regression.
*The maximum number of cell types used in the original study. The deconvolution technique may be able to handle more cell types when they are classified as flexible.
¹School of Biological Sciences, Seoul National University, Seoul 08826, Korea; ²Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, 1081 HZ Amsterdam, The Netherlands
Correspondence to:yo.kim@amsterdamumc.nl
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.
Tumors are surrounded by a variety of tumor microenvironmental cells. Profiling individual cells within the tumor tissues is crucial to characterize the tumor microenvironment and its therapeutic implications. Since single-cell technologies are still not cost-effective, scientists have developed many statistical deconvolution methods to delineate cellular characteristics from bulk transcriptome data. Here, we present an overview of 20 deconvolution techniques, including cutting-edge techniques recently established. We categorized deconvolution techniques by three primary criteria: characteristics of methodology, use of prior knowledge of cell types and outcome of the methods. We highlighted the advantage of the recent deconvolution tools that are based on probabilistic models. Moreover, we illustrated two scenarios of the common application of deconvolution methods to study tumor microenvironments. This comprehensive review will serve as a guideline for the researchers to select the appropriate method for their application of deconvolution.
Keywords: statistical deconvolution, tumor microenvironment
Conventional bulk transcriptome analysis has made a significant contribution to our understanding of the molecular mechanisms behind complex biological phenomena, yet it has been unable to fully uncover the intrinsic heterogeneity of samples. Previous studies revealed that tumors are surrounded by collections of various microenvironmental cells, including endothelial, stromal and infiltrating immune cells, which mutually interact with malignant cells to regulate tumor progression and its therapeutic resistance (Baghban et al., 2020; Jin and Jin, 2020). Moreover, tumors consist of multiple subpopulations of malignant cells with different genotypic and phenotypic features (Dagogo-Jack and Shaw, 2018; Nguyen et al., 2016). However, bulk RNA-seq data measures accumulated gene expression levels of all cells in each sample, which makes it limited to studying cellular heterogeneity. For these reasons, technologies like LCM (laser capture microdissection) or FACS (fluorescence-activated cell sorting) have been developed to isolate and identify each single cell, further extending to single-cell RNA sequencing (scRNA-seq) (Lee et al., 2020). Although promising, single-cell technologies still face obstacles to retaining enough samples and determining proper markers for cell labeling due to their labor-intensiveness and high cost (Lähnemann et al., 2020). Furthermore, the tissue dissociation step in scRNA-seq enriches cells that are easily detachable, essentially introducing a bias in the composition of cells to be profiled (Denisenko et al., 2020).
Scientists thus developed various computational deconvolution methods to infer the abundance of different cell types from bulk RNA-seq data to make the most of pre-existing large cohort-based studies (e.g., The Cancer Genome Atlas [TCGA] and International Cancer Genome Consortium) (Campbell et al., 2020). Beyond the cell type compositions, advanced techniques can infer cell-type-specific gene expression levels—often referred to as purification. Based on this advancement, recent studies revealed differential cellular states among the same cell type defined by their specific transcriptome profiles (Andrade Barbosa et al., 2021; Chu et al., 2022; Luca et al., 2021). Since so many deconvolution methods have been published, it is challenging for the users to select the most suitable method. Previous review articles have evaluated some deconvolution tools but only focused on benchmarking their performance using well-controlled gold standard data (Avila Cobos et al., 2020; Sturm et al., 2019). However, there have not been many reviews that focus on the theoretical background of the methods that can help users to understand the strengths and the weaknesses of each deconvolution method.
Here, we provide a comprehensive overview of 20 statistical deconvolution tools, including some recently established techniques. We organized the tools by their algorithms and delineated practical limitations due to their technical groundings. We focused on the recent deconvolution techniques that can predict not only the cellular composition but also the cell-type-specific gene expression profiles. The majority of these methods are based on probabilistic models, which benefit from their flexible nature. Furthermore, we will highlight the recent application to show how these techniques can delineate tumor ecosystems in large numbers of samples using commonly available bulk RNA-seq data. Finally, we will share some practical recommendations on how to apply deconvolution tools and perspectives on how deconvolution tools can contribute to tumor microenvironment studies.
We constructed an overview of 20 different deconvolution methods, including recent approaches categorized by the characteristics of methodology, use of prior knowledge for deconvolution and the outcome of the methods (Table 1). Among the 20 methods, nine deconvolution methods are based on linear approaches, which include robust linear regression (ABIS [Moncao et al., 2019]), regularized linear regression (DSA [Zhong et al., 2013], TIMER [Li et al., 2020], csSAM [Shen-Orr et al., 2010], MuSiC [Wang et al., 2019], quanTIseq [Fintello et al., 2019]), and non-negative matrix factorization (DECODER [Peng et al., 2019]). Three other methods, CIBERSORT (Newman et al., 2015), CIBERSORTx (Newman et al., 2019), and Bseq-SC (Baron et al., 2016), are based on support vector regression using a linear kernel, which makes them similar to linear regression methods. Other methods applied gene set enrichment approaches (e.g., MCP-counter [Becht et al., 2016]) or, more recently, probabilistic models (Fig. 1). Eighteen of the methods take prior knowledge of cell types for deconvolution (supervised/semi-supervised approach), among which csSAM (Shen-Orr et al., 2010) assumes cell type fractions are known. On the contrary, few methods do not require such an input, such as DECODER (Peng et al., 2019) and CDSeq (Kang et al., 2019) (unsupervised approach). Though it is known that deconvolution after logarithmic transformation leads to a downwards bias (Zhong and Liu, 2012), some of the old and recent approaches perform deconvolution in log-linear space.
Among the supervised/semi-supervised methods that take prior knowledge of cell types for deconvolution, the vast majority employ a predetermined reference matrix of cell-type-specific gene expression data, often referred to as a signature. To construct a signature of preset cell types for deconvolution, cell-type-specific marker genes can be selected from databases or by performing differentially expressed gene analysis among each of the cell types. Early approaches constructed such a signature from gene expression data of purified cell populations, while recent approaches take that from scRNA-seq data. CIBERSORTx (Newman et al., 2019) offers an internal tool to provide signatures that represent each cell type by selecting marker gene reference profiles from scRNA-seq data. Although the vast majority of methods take only the expected gene expression profiles as the signature, MuSiC (Wang et al., 2019), Demix/DemixT (Ahn et al., 2013; Wang et al., 2018), EPIC (Racle et al., 2017), and BLADE (Andrade Barbosa et al., 2021) also take into account the variability of gene expression in each cell type for an enhanced robustness. On the other hand, CDSeq (Kang et al., 2019) and DECODER (Peng et al., 2019) estimate the number of constituent cell types as well as their populations from the bulk data without any signature (unsupervised approach). However, CDSeq offers quasi-unsupervised learning strategy, which augments the input bulk gene expression data with additional gene expression profiles of pure cell lines to get some guidance on cell type selection.
The 20 methods in Table 1 can also be categorized by the type of outcomes. Practically, a majority of supervised deconvolution methods practically can handle as many cell types as in the signature. However, enrichment-based approaches and probabilistic approaches often come with a limit in the number of cell types that can be used (e.g., both 2 cell types for ESTIMATE [Yoshihara et al., 2013] and ISOpure [Quon et al., 2013]), with exception of xCell (Aran et al., 2017). Furthermore, enrichment analysis offers a less precise estimate of cell type abundance, which is often called an enrichment score. The score cannot be compared between cell types, unlike fractions. Probabilistic methods require a sophisticated and often complex optimization strategy which may limit the number of cell types, as in Demix/DemixT (Ahn et al., 2013; Wang et al., 2018). Although the vast majority of methods predict only the cell type fractions, recent approaches can also estimate the gene expression profiles of each cell type, often referred to as
Probabilistic model-based deconvolution methods stand out for their exceptional flexibility at the expense of complex mathematical formulation and high computational costs. First of all, unlike linear regression methods, including support vector regression in CIBERSORT/CIBERSORTx, that are bound to the normal variability assumptions for gene expression data, probabilistic models can take other variability assumptions like log-normal (e.g., BLADE and Demix/DemixT) and multinomial variability (e.g., BayesPrism) (Chu et al., 2022). In particular, log-normal distribution is more suitable than normal distribution in modeling gene expression data, which cannot be log-transformed for deconvolution due to the known risk of bias (Zhong and Liu, 2012). Furthermore, probabilistic models enable the integration of multiple variables, both observed and hidden, to perform a more sophisticated deconvolution. BLADE and Demix/DemixT account for the gene expression variability observed from other data, such as scRNA-seq data. BayesPrism, CDseq, BLADE and Demix/DemixT can perform
By harnessing the established deconvolution techniques and commonly available RNA-seq data from previous cancer studies, there have been many applications of deconvolution techniques to study the tumor microenvironment (Fig. 2). Thorsson et al. (2018) identified six immune subtypes from 33 cancer types from TCGA using CIBERSORT deconvolution method in combination with several immunogenic scoring methods, such as gene set analysis. In the downstream characterization of these six subtypes, they observed that a subtype with an elevated expression level of T helper cell markers, TH17 and TH1, was associated with the best prognosis. In contrast, subtypes with a mixed signature were associated with poor overall survival. Moreover, they suggested a global regulatory network model for all tumor types and immune subtypes that consists of regulatory relationships between subtype-specific transcription factors and cancer-type-specific somatic mutations. This network model illustrates how cancer-type-specific somatic mutations lead to a specific tumor immune microenvironment and an underlying key transcription factor. Recent studies have utilized the estimated gene expression profiles of each cell type (i.e., the outcome of the
Although deconvolution is an attractive and powerful tool to obtain an extra resolution of information from standard bulk RNA-seq data, it comes with a risk when it is applied arbitrarily. Given the complexity of the problem, it is best to make use of prior knowledge of cell types, as in the supervised deconvolution technique, comprehending that cell quality is crucial. It is ideal to extract prior knowledge from the relevant scRNA-seq data that best reflects the biological context. This process will include the critical selection of cell types for deconvolution while keeping in mind that as the more cell types with a detailed classification increase, the number of parameters to be optimized will increase and make deconvolution more challenging. On the contrary, missing an abundant cell type in bulk gene expression data will violate the common assumption in most deconvolution methods that they can reconstruct bulk gene expression profiles by combining cell-type-specific gene expression profiles. Finally, the accuracy of the predicted results can be different between cell types depending on many factors, including their abundance and unique gene expression pattern. Therefore, we recommend benchmarking the deconvolution performance for the specific prior knowledge extracted before its application. As in many benchmark experiments in deconvolution tools, an
There is still much more room for development in deconvolution techniques. So far, there has been only limited application of deconvolution techniques to non-RNA molecule types, except for methylation data (Chakravarthy et al., 2018). Since the estimation of cell-type-specific molecular profiling is possible with recent deconvolution techniques, enabling deconvolution with non-RNA molecules can help us delineate cellular states with multiomics rather than RNA alone. The challenge of deconvolution with other molecular types is less available cell-type-specific information. Borrowing information from other data types can be an option, for instance, RNA-based deconvolution or tumor fractions estimated by allele fraction of mutational data (Poell et al., 2019). Although it is not straightforward for most deconvolution methods to account for the extra information, probabilistic models bear the possibility of integrating extra variables into the model. Multiomics profiling of individual cell types may enable us to delineate cellular molecular mechanisms that determine specific cellular behavior. For instance, integrating spatially resolved data, such as spatial transcriptome profiling (Moffitt et al., 2022) and multiplex immunofluorescence technique (Gorris et al., 2018), may enable us to study active immune cell infiltration and associated cellular signaling pathways. Using spatial transcriptome techniques, such as Nanostring’s GeoMX and 10× Visium, although providing a high-resolution spatial information, the techniques fall short on single cell resolution, leading to the use of a deconvolution technique to gain a more accurate understanding of the cellular makeup (e.g., cell2location, a Bayesian deconvolution method for spatial transcription data) (Kleshchevnikov et al., 2022). 
With the abundance of large-scale multiomics studies and spatially resolved data already available, particularly in the field of oncology, advanced deconvolution techniques can be applied to gain an in-depth characterization of the tumor microenvironment.
Y.K. conceived and provided expertise. Y.K. and Y.I. wrote the manuscript.
The authors have no potential conflicts of interest to disclose.
Overview of the 20 deconvolution methods covered in this review
Method | Algorithm | Supervised/unsupervised | Linear/log-linear | Use of marker gene expression profile (signature) | Use of single-cell RNA-seq data for signature | Account for gene expression variability | Cell type fractions | Cell-type-specific gene expression profile | No. of cell types that can be handled*
---|---|---|---|---|---|---|---|---|---
ABIS | Robust linear regression | Supervised | Linear | Yes | No | No | Yes | No | Flexible (29)
DSA | Regularized linear regression | Supervised | Linear | Yes | No | No | Yes | Yes (group-mode) | Flexible (6)
TIMER | Regularized linear regression (multivariate normal) | Supervised | Linear | Yes | No | No | Yes | Yes | Flexible (6)
csSAM | Linear regression | Supervised | Log-linear | No | No | No | No | Yes (group-mode) | Flexible (5)
MuSiC | Weighted non-negative least squares | Supervised | Linear | Yes | Yes | Cross-subject variance | Yes | No | Flexible (13)
DECODER | NMF + regularized linear regression | Unsupervised | Log-linear | No | No | No | Yes | Yes (group-mode) | Flexible (8)
CIBERSORT | nu-SVR (linear) | Supervised | Linear | Yes | No | No | Yes | No | Flexible (22)
CIBERSORTx | nu-SVR (linear) | Supervised | Linear | Yes | Yes | No | Yes | Yes (high resolution) | Flexible (10)
Bseq-SC | CIBERSORT + csSAM | Supervised | Linear | Yes | Yes | No | Yes | No | Flexible (6)
quantTIseq | Constrained least squares | Supervised | Linear | Yes | No | No | Yes | Yes (group-mode) | Flexible (10)
MCP-counter | Relative gene expression levels | Supervised | Log-linear | Yes | No | No | No; score | No | Flexible (10)
ESTIMATE | Gene set enrichment analysis (GSEA) | Supervised | Log-linear | Yes | No | No | No; score | No | 2
xCell | GSEA | Supervised | Log-linear | Yes | No | No | No; score | No | Flexible (64)
CDSeq | Probabilistic model (multinomial) | Unsupervised | Linear | No | No | No | Yes | Yes (group-mode) | Flexible (22)
Demix | Probabilistic model (log-normal) | Semi-supervised | Log-linear | Yes | No | Yes | Yes | Yes | 2
DemixT | Probabilistic model (log-normal) | Semi-supervised | Linear | Yes | No | Yes | Yes | Yes | 3
EPIC | Least squares | Supervised | Linear | Yes | No | Yes | Yes | No | Flexible (8)
BLADE | Probabilistic model (log-normal) | Supervised | Linear | Yes | Yes | Yes | Yes | Yes (high resolution) | Flexible (20)
ISOpure | Probabilistic model (multinomial) | Semi-supervised | Linear | Yes | No | No | Yes | Yes (high resolution) | 2
BayesPrism | Probabilistic model (multinomial) | Supervised | Linear | Yes | Yes | No | Yes | Yes (high resolution) | Flexible (10)
The deconvolution methods are categorized by characteristics of methodology (algorithm, supervised/unsupervised, linear/log-linear), use of prior knowledge (marker gene signature, single-cell RNA-seq data, gene expression variability) and outcome (cell type fractions, cell-type-specific gene expression profile, number of cell types).
NMF, non-negative matrix factorization; nu-SVR, nu-support vector regression.
*The maximum number of cell types used in the original study. Methods classified as flexible may be able to handle more cell types.