Mol. Cells 2019; 42(8): 579-588
Published online July 16, 2019
https://doi.org/10.14348/molcells.2019.0065
© The Korean Society for Molecular and Cellular Biology
Correspondence to : insuklee@yonsei.ac.kr
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.
Gene set enrichment analysis (GSEA) is a popular tool to identify underlying biological processes in clinical samples using their gene expression phenotypes. GSEA measures the enrichment of annotated gene sets that represent biological processes for differentially expressed genes (DEGs) in clinical samples. GSEA may be suboptimal for functional gene sets; however, because DEGs from the expression dataset may not be functional genes
Keywords drug repositioning, gene network, gene set enrichment analysis, network-based analysis, pathway analysis
Molecular phenotypes of clinical samples have proven useful in disease diagnosis, patient stratification, and drug discovery. Gene expression profiling is probably the most accessible strategy for molecular phenotyping of clinical samples. DNA chip technology and RNA sequencing have been widely used for molecular profiling of patient-derived primary cells and cell lines. Numerous gene expression profiles of clinical samples are now freely available from public data repositories such as the Gene Expression Omnibus (GEO) (Barrett et al., 2013) and the National Cancer Institute Genomic Data Commons (NCI GDC) (Jensen et al., 2017). Functional analysis of genome-wide expression phenotypes is generally more interpretable with annotated gene sets rather than individual genes; therefore, many bioinformatics methods for gene set analysis have been developed over the past several years (de Leeuw et al., 2016). For clinical samples, the general purpose of gene set analysis of genome-wide expression profiles is to identify underlying disease-associated molecular processes, which can facilitate disease diagnosis and therapeutic intervention.
Two major approaches for gene set analysis of gene expression phenotypes are available:
The analytical limitations of over-representation approaches can be overcome by aggregate score approaches, which assign scores to each annotated gene set based on all the gene-specific scores of the member genes. Gene set enrichment analysis (GSEA) (Subramanian et al., 2005) is the most popular aggregate score approach available to date. In GSEA, genes for the expression profile are first rank-ordered by the gene-specific scores based on the expression difference, and then the enrichment score of each annotated gene set is computed based on a modified Kolmogorov–Smirnov (K–S) test. Despite its popularity, however, GSEA also has some shortcomings. For example, GSEA was designed to identify sets of genes that are differentially regulated in one direction, i.e., either up-regulated or down-regulated. If a gene set has matched genes for DEGs in which up-regulation and down-regulation are equally distributed, then its association with the expression phenotype may not be detected by GSEA. To overcome this limitation, a modified GSEA called absolute enrichment (AE) was developed that computes the absolute values of gene scores for both up- and down-regulated genes (Saxena et al., 2006).
Another shortcoming of GSEA is that DEGs do not necessarily represent the functional genes that are responsible for the molecular processes represented by the gene sets. Instead, observed DEGs may be dysregulated genes perturbed by genuine functional genes in the molecular process of interest. Given that GSEA assigns a score to each gene set based on the scores of significant DEGs, a gene set comprising
Network-based analysis of differential gene expression has been used to prioritize disease-causing genes (Nitsch et al., 2009) and essential genes of cancer cell lines (Jiang et al., 2015). These methods are based on the idea that functional genes for disease processes, such as tumorigenesis, tend to be surrounded by DEGs for that disease condition in the functional network. We, therefore, hypothesized that ordering genes by the differential expression of their local subnetworks (i.e., networks connecting each gene and its neighbors) will improve the ability to capture functional gene sets associated with the relevant biological processes. In this study, we present a network-based GSEA (NGSEA) that measures the enrichment scores of functional gene sets by utilizing the expression difference of not only individual genes but also their neighbors in the functional network. Although several network-based gene set analysis methods already have been proposed, these methods are modified from the over-representation approach, which identifies associations between two pre-selected gene sets, annotated gene sets from databases, or query gene sets from experiments based on relative closeness within the molecular network (Alexeyenko et al., 2012; Glaab et al., 2012; McCormack et al., 2013; Wang et al., 2012). To the best of our knowledge, NGSEA is the first network-based gene set analysis method that applies the aggregate score approach.
We found that NGSEA outperformed GSEA in retrieving KEGG pathway gene sets (Kanehisa et al., 2017) for matched gene expression data sets. We also applied NGSEA to drug prioritization for several diseases and found that NGSEA performed substantially better than Connectivity Map (CMap) (Lamb et al., 2006) in the ability to retrieve known drugs for matched cancer-associated gene expression data sets. We analyzed FDA-approved drugs to determine whether they had anti-cancer effects on colorectal cancer using NGSEA and experimentally validated the anti-cancer effect of budesonide, a chemical that is currently used as an anti-inflammatory drug. NGSEA is freely available for use as web-based software (
To evaluate the gene set analysis performance for gene expression phenotypes, we used a gold-standard expression dataset composed of expression profiles in which their matched KEGG pathway terms are already annotated. We used KEGG disease datasets from GEO (KEGGdzPathwaysGEO) obtained from Bioconductor (
We also compiled in-house gold-standard expression dataset composed of RNA-seq data from Expression Atlas (
We obtained pathway gene sets from human KEGG pathways (
To benchmark the ability to retrieve drugs for diseases, we compiled 17,063 links between 2,109 diseases and 1,481 chemicals based on a direct evidence of association as determined from the ‘therapeutic’ category of the Comparative Toxicogenomics Database (CTD) (
For network-based analysis of the differential expression of genes, we employed a genome-scale functional gene network, HumanNet-EN (Hwang et al., 2019), which is available from
We obtained the freely available software program javaGSEA version 3.0 from the Broad Institute (
In NGSEA, the original gene-based score, log2(Ratio), was modified via network-based integration of the gene-based scores for network neighbors. We assigned a network-based score (
where n
We performed GSEA, AE, and NGSEA with a gene list ordered by the log2(Ratio) value, the absolute value of the log2(Ratio), and the
We prioritized FDA-approved drugs for the 24 KEGG disease gene expression datasets using the CMap web server (
We conducted MTS (3-(4,5-dimethylthiazol-2-yl)-5-(3-carboxymethoxyphenyl)-2-(4-sulfophenyl)-2H-tetrazolium) assays to measure cell viability following drug treatment. We used two colorectal cancer cell lines, HCT116 and HT-29, which were obtained from the Korean Cell Line Bank, for the assay. We purchased two candidate drugs, dobutamine, and budesonide, from Sigma (USA). We dissolved chemicals in dimethyl sulfoxide (DMSO) prior to treatment. Cells were treated with the candidate drugs at concentrations ranging from 50 to 250 μM for 24, 48, and 72 h. MTS reagents then were added to the cells. The number of viable cells was counted based on the absorbance at 490 nm on an ELISA microplate reader (Molecular Devices, USA), and the cell viability percentage was calculated. All experiments were repeated six times.
As summarized in Figure 1, NGSEA differs from GSEA in the method of scoring genes, resulting in different gene orders between the two methods. The list of genes generated by GSEA is ordered by the log2(Ratio) gene score. In contrast, the list of genes generated by NGSEA is ordered using a network-based score. This score was based on two assumptions. The first assumption is that the annotated gene set for biological processes that are truly associated may contain both up- and down-regulated genes. Whereas GSEA was designed to find gene sets that are regulated in one direction, groups of genes or systems are often regulated in both directions. To address this problem, NGSEA uses the absolute value of the log2(Ratio) for analyses; a similar approach was employed previously in an AE analysis (Saxena et al., 2006). The second assumption is that the expression perturbation of gene regulators may cause severe dysregulation of their downstream genes such that the functional importance of a regulator for a given biological context would, in fact, be much greater than estimated by its own expression change. Thus, we expected that the expression difference in the local subnetwork would assign higher scores than the original gene-based score to truly functional genes. To address this problem, NGSEA integrated the mean of the absolute value of the log2(Ratio) for the network neighbors of each gene to account for the regulatory influence on its local subsystem.
We evaluated the ability of GSEA, AE, and NGSEA to retrieve annotated gene sets for matched gene expression data sets. For this analysis, we used gold-standard gene expression datasets from the Bioconductor’s KEGGdzPathwaysGEO package. We used a total of 24 expression data sets associated with 12 different diseases (
We observed a significantly higher rank distribution using NGSEA compared with GSEA and AE (
Next, we tested the robustness of the three enrichment analysis methods by comparing the assigned scores for the KEGG pathway terms between the different expression profiles for the same disease. The 24 expression data sets were derived from 12 diseases, and nine of the diseases have multiple expression data sets. We hypothesized that if an enrichment analysis retrieved pathways based on disease-specific signals rather than technical variation, then scores for the pathways between two different expression datasets for the same disease should have a higher correlation than those for different diseases. We, therefore, computed Pearson’s correlation coefficients (
As expected, the improved ranks for the matched KEGG pathways were due to improved ranks of their member genes in the gene list used for the enrichment analysis. For example, the network-based scoring method improved the rank of the KEGG term ‘Alzheimer’s disease’ from 17th to 5th and reduced the rank of an irrelevant KEGG term ‘
GSEA is the algorithmic foundation of the most popular drug repositioning system, the CMap (Lamb et al., 2006). The previous CMap database, which was based on the AffyMetrix HG-U133a chip, contained more than 7,000 expression profiles representing 1,309 compounds. A recent release of CMap included a database of reference expression profiles with more than 1,000-fold scale-up based on L1000 platform technology, which is a low-cost, high-throughput reduced representation expression profile method (Subramanian et al., 2017). The CMap system prioritizes drugs for diseases based on an inverse relationship between disease expression profiles and drug treatment expression profiles. To conduct a web-based CMap analysis, users submit signature genes for a given disease (e.g., the 50 most up-regulated and 50 most down-regulated genes from the disease-associated gene expression dataset). GSEA is applied to assign scores to each drug based on the anti-correlation of disease signature genes with the genes ordered by expression changes in drug condition, which represents the strength of the drug response. In contrast to CMap, which uses expression data from drug-treated samples, the NGSEA-based drug prioritization method uses functional genes for the drug’s mode of action; target genes. We used target genes for each FDA-approved drug as functional gene sets to test the association to diseases based on a list of genes ordered by network-based scores computed from disease-associated expression data (Fig. 3A). We compiled target gene sets for drugs from drug-target links based on active bioassays from the Drug Signature Database (DSigDB) (Yoo et al., 2015). We performed drug prioritization using NGSEA on 24 gene expression datasets for 12 diseases from KEGGdzPathwaysGEO and target gene sets for 165 FDA-approved drugs (i.e., those with more than 15 targets) from DSigDB.
We compared the ability of CMap and NGSEA to retrieve known drugs for each of the 24 disease-associated gene expression data sets. For benchmarking, we compiled 17,063 associations between 2,109 diseases and 1,481 chemicals based on direct evidence of an association from the ‘therapeutic’ category of the CTD (Davis et al., 2017). The performance of both CMap and NGSEA were determined using the area under the receiver operating characteristic curve (AUROC). To avoid a biased evaluation due to differences in the number of drugs tested, we included only drugs that were considered in both CMap and NGSEA for the AUROC analysis. We found significantly improved AUROC for drug recovery using NGSEA compared with CMap (
The effective retrieval of known drugs for various types of cancers suggested that we would be able to identify novel anti-cancer drugs using NGSEA. NGSEA yielded the highest improvement in the recovery of known anti-cancer drugs for colorectal cancer (GSE9348: AUROC = 0.488 and 0.775 by CMap and NGSEA, respectively) (Fig. 4A). We, therefore considered it highly likely that NGSEA would be able to identify novel drugs to treat colorectal cancer among the top repurposed FDA-approved chemicals. Among the top 30 chemicals predicted as candidates for colorectal cancer by NGSEA, six chemicals were currently used for the treatment of colorectal cancer and three additional chemicals had been tested in clinical trials for colorectal cancer (
To increase the usability of NGSEA, we have developed a web-based GSEA server (
In this report, we presented a network-based GSEA, NGSEA, which modified existing gene scoring methods by incorporating the differential expression information from neighbors in the functional gene network (Kim et al., 2019). We then demonstrated that NGSEA outperformed GSEA in retrieving both KEGG pathway terms and drugs for matched disease-associated gene expression data sets. Based on the benchmarking results, we have concluded that NGSEA will provide reliable functional information for interpretation of gene expression phenotypes of clinical samples. Most importantly, NGSEA performed well with functional gene sets. Because the original GSEA was designed to detect relationships between biological processes and chemicals based on expression signature genes, which are not necessarily functional genes, GSEA exhibited suboptimal performance with functional gene sets. MSigDB was designed explicitly to provide gene sets, many of which were derived from expression signatures, for GSEA. There are many other public databases; however, that contain annotated gene sets for diseases and pathways, and the majority of these gene sets include functional genes. We expect that NGSEA will be a useful tool for utilizing these available resources.
There are several existing methods that combine networks and gene set analysis. These methods measure the network distance between two gene sets (Alexeyenko et al., 2012; Glaab et al., 2012; McCormack et al., 2013; Wang et al., 2012). Although the sensitivity to detect the relationship between two gene sets was successfully improved using these methods compared with over-representation approaches, these methods still require the user to pre-select the query gene set by applying what is often an arbitrary differential expression score cutoff. These over-representation approaches to gene set analysis with network-based modification are therefore still limited by a lack of information regarding the relative orders between DEGs. We hypothesized that applying the network-based modification to an aggregate score approach to gene set analysis, which uses the differential expression values for the gene set analysis, would further improve the sensitivity to detect the relationship between two gene sets. To the best of our knowledge, NGSEA is the first method that combines network-based gene set analysis with an aggregate score approach.
A previously proposed method, calculation of impact factor (IF) for each pathway (Draghici et al., 2007), shares an idea with NGSEA in incorporating fold change values of DEGs over pathways. However, NGSEA significantly differ from IF analysis. NGSEA incorporates gene expression scores through genome-wide functional networks such as HumanNet, whereas IF analysis does so through topology of pathway knowledge such as KEGG. Therefore, NGSEA are allowed for all genes in the HumanNet which includes ~96% of human genes, whereas IF calculation can be applied to only ~35% of the human genes that are currently annotated in KEGG pathways.
We demonstrated that NGSEA could effectively retrieve known anti-cancer drugs for matched gene expression data sets. It is not clear why drug recovery using NGSEA was highly effective for different cancer types but not for neurodegenerative diseases such as Alzheimer’s disease, Parkinson’s disease, or Huntington’s disease (Fig. 3B). One possibility is that the current human gene network model is short of predictive power for brain cells because we also observed low performance of NGSEA for brain cancer. We observed similar results using another high-quality genome-scale human gene network, STRING (Szklarczyk et al., 2017) (
Mol. Cells 2019; 42(8): 579-588
Published online August 31, 2019 https://doi.org/10.14348/molcells.2019.0065
Copyright © The Korean Society for Molecular and Cellular Biology.
Heonjong Han1 , Sangyoung Lee1
, and Insuk Lee1,2,*
1Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Korea, 2Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul 03722, Korea
Correspondence to:insuklee@yonsei.ac.kr
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.
Gene set enrichment analysis (GSEA) is a popular tool to identify underlying biological processes in clinical samples using their gene expression phenotypes. GSEA measures the enrichment of annotated gene sets that represent biological processes for differentially expressed genes (DEGs) in clinical samples. GSEA may be suboptimal for functional gene sets; however, because DEGs from the expression dataset may not be functional genes
Keywords: drug repositioning, gene network, gene set enrichment analysis, network-based analysis, pathway analysis
Molecular phenotypes of clinical samples have proven useful in disease diagnosis, patient stratification, and drug discovery. Gene expression profiling is probably the most accessible strategy for molecular phenotyping of clinical samples. DNA chip technology and RNA sequencing have been widely used for molecular profiling of patient-derived primary cells and cell lines. Numerous gene expression profiles of clinical samples are now freely available from public data repositories such as the Gene Expression Omnibus (GEO) (Barrett et al., 2013) and the National Cancer Institute Genomic Data Commons (NCI GDC) (Jensen et al., 2017). Functional analysis of genome-wide expression phenotypes is generally more interpretable with annotated gene sets rather than individual genes; therefore, many bioinformatics methods for gene set analysis have been developed over the past several years (de Leeuw et al., 2016). For clinical samples, the general purpose of gene set analysis of genome-wide expression profiles is to identify underlying disease-associated molecular processes, which can facilitate disease diagnosis and therapeutic intervention.
Two major approaches for gene set analysis of gene expression phenotypes are available:
The analytical limitations of over-representation approaches can be overcome by aggregate score approaches, which assign scores to each annotated gene set based on all the gene-specific scores of the member genes. Gene set enrichment analysis (GSEA) (Subramanian et al., 2005) is the most popular aggregate score approach available to date. In GSEA, genes for the expression profile are first rank-ordered by the gene-specific scores based on the expression difference, and then the enrichment score of each annotated gene set is computed based on a modified Kolmogorov–Smirnov (K–S) test. Despite its popularity, however, GSEA also has some shortcomings. For example, GSEA was designed to identify sets of genes that are differentially regulated in one direction, i.e., either up-regulated or down-regulated. If a gene set has matched genes for DEGs in which up-regulation and down-regulation are equally distributed, then its association with the expression phenotype may not be detected by GSEA. To overcome this limitation, a modified GSEA called absolute enrichment (AE) was developed that computes the absolute values of gene scores for both up- and down-regulated genes (Saxena et al., 2006).
Another shortcoming of GSEA is that DEGs do not necessarily represent the functional genes that are responsible for the molecular processes represented by the gene sets. Instead, observed DEGs may be dysregulated genes perturbed by genuine functional genes in the molecular process of interest. Given that GSEA assigns a score to each gene set based on the scores of significant DEGs, a gene set comprising
Network-based analysis of differential gene expression has been used to prioritize disease-causing genes (Nitsch et al., 2009) and essential genes of cancer cell lines (Jiang et al., 2015). These methods are based on the idea that functional genes for disease processes, such as tumorigenesis, tend to be surrounded by DEGs for that disease condition in the functional network. We, therefore, hypothesized that ordering genes by the differential expression of their local subnetworks (i.e., networks connecting each gene and its neighbors) will improve the ability to capture functional gene sets associated with the relevant biological processes. In this study, we present a network-based GSEA (NGSEA) that measures the enrichment scores of functional gene sets by utilizing the expression difference of not only individual genes but also their neighbors in the functional network. Although several network-based gene set analysis methods already have been proposed, these methods are modified from the over-representation approach, which identifies associations between two pre-selected gene sets, annotated gene sets from databases, or query gene sets from experiments based on relative closeness within the molecular network (Alexeyenko et al., 2012; Glaab et al., 2012; McCormack et al., 2013; Wang et al., 2012). To the best of our knowledge, NGSEA is the first network-based gene set analysis method that applies the aggregate score approach.
We found that NGSEA outperformed GSEA in retrieving KEGG pathway gene sets (Kanehisa et al., 2017) for matched gene expression data sets. We also applied NGSEA to drug prioritization for several diseases and found that NGSEA performed substantially better than Connectivity Map (CMap) (Lamb et al., 2006) in the ability to retrieve known drugs for matched cancer-associated gene expression data sets. We analyzed FDA-approved drugs to determine whether they had anti-cancer effects on colorectal cancer using NGSEA and experimentally validated the anti-cancer effect of budesonide, a chemical that is currently used as an anti-inflammatory drug. NGSEA is freely available for use as web-based software (
To evaluate the gene set analysis performance for gene expression phenotypes, we used a gold-standard expression dataset composed of expression profiles in which their matched KEGG pathway terms are already annotated. We used KEGG disease datasets from GEO (KEGGdzPathwaysGEO) obtained from Bioconductor (
We also compiled in-house gold-standard expression dataset composed of RNA-seq data from Expression Atlas (
We obtained pathway gene sets from human KEGG pathways (
To benchmark the ability to retrieve drugs for diseases, we compiled 17,063 links between 2,109 diseases and 1,481 chemicals based on a direct evidence of association as determined from the ‘therapeutic’ category of the Comparative Toxicogenomics Database (CTD) (
For network-based analysis of the differential expression of genes, we employed a genome-scale functional gene network, HumanNet-EN (Hwang et al., 2019), which is available from
We obtained the freely available software program javaGSEA version 3.0 from the Broad Institute (
In NGSEA, the original gene-based score, log2(Ratio), was modified via network-based integration of the gene-based scores for network neighbors. We assigned a network-based score (
where n
We performed GSEA, AE, and NGSEA with a gene list ordered by the log2(Ratio) value, the absolute value of the log2(Ratio), and the
We prioritized FDA-approved drugs for the 24 KEGG disease gene expression datasets using the CMap web server (
We conducted MTS (3-(4,5-dimethylthiazol-2-yl)-5-(3-carboxymethoxyphenyl)-2-(4-sulfophenyl)-2H-tetrazolium) assays to measure cell viability following drug treatment. We used two colorectal cancer cell lines, HCT116 and HT-29, which were obtained from the Korean Cell Line Bank, for the assay. We purchased two candidate drugs, dobutamine, and budesonide, from Sigma (USA). We dissolved chemicals in dimethyl sulfoxide (DMSO) prior to treatment. Cells were treated with the candidate drugs at concentrations ranging from 50 to 250 μM for 24, 48, and 72 h. MTS reagents then were added to the cells. The number of viable cells was counted based on the absorbance at 490 nm on an ELISA microplate reader (Molecular Devices, USA), and the cell viability percentage was calculated. All experiments were repeated six times.
As summarized in Figure 1, NGSEA differs from GSEA in the method of scoring genes, resulting in different gene orders between the two methods. The list of genes generated by GSEA is ordered by the log2(Ratio) gene score. In contrast, the list of genes generated by NGSEA is ordered using a network-based score. This score was based on two assumptions. The first assumption is that the annotated gene set for biological processes that are truly associated may contain both up- and down-regulated genes. Whereas GSEA was designed to find gene sets that are regulated in one direction, groups of genes or systems are often regulated in both directions. To address this problem, NGSEA uses the absolute value of the log2(Ratio) for analyses; a similar approach was employed previously in an AE analysis (Saxena et al., 2006). The second assumption is that the expression perturbation of gene regulators may cause severe dysregulation of their downstream genes such that the functional importance of a regulator for a given biological context would, in fact, be much greater than estimated by its own expression change. Thus, we expected that the expression difference in the local subnetwork would assign higher scores than the original gene-based score to truly functional genes. To address this problem, NGSEA integrated the mean of the absolute value of the log2(Ratio) for the network neighbors of each gene to account for the regulatory influence on its local subsystem.
We evaluated the ability of GSEA, AE, and NGSEA to retrieve annotated gene sets for matched gene expression data sets. For this analysis, we used gold-standard gene expression datasets from the Bioconductor’s KEGGdzPathwaysGEO package. We used a total of 24 expression data sets associated with 12 different diseases (
We observed a significantly higher rank distribution using NGSEA compared with GSEA and AE (
Next, we tested the robustness of the three enrichment analysis methods by comparing the assigned scores for the KEGG pathway terms between the different expression profiles for the same disease. The 24 expression data sets were derived from 12 diseases, and nine of the diseases have multiple expression data sets. We hypothesized that if an enrichment analysis retrieved pathways based on disease-specific signals rather than technical variation, then scores for the pathways between two different expression datasets for the same disease should have a higher correlation than those for different diseases. We, therefore, computed Pearson’s correlation coefficients (
As expected, the improved ranks for the matched KEGG pathways were due to improved ranks of their member genes in the gene list used for the enrichment analysis. For example, the network-based scoring method improved the rank of the KEGG term ‘Alzheimer’s disease’ from 17th to 5th and reduced the rank of an irrelevant KEGG term ‘
GSEA is the algorithmic foundation of the most popular drug repositioning system, the CMap (Lamb et al., 2006). The previous CMap database, which was based on the AffyMetrix HG-U133a chip, contained more than 7,000 expression profiles representing 1,309 compounds. A recent release of CMap included a database of reference expression profiles with more than 1,000-fold scale-up based on L1000 platform technology, which is a low-cost, high-throughput reduced representation expression profile method (Subramanian et al., 2017). The CMap system prioritizes drugs for diseases based on an inverse relationship between disease expression profiles and drug treatment expression profiles. To conduct a web-based CMap analysis, users submit signature genes for a given disease (e.g., the 50 most up-regulated and 50 most down-regulated genes from the disease-associated gene expression dataset). GSEA is applied to assign scores to each drug based on the anti-correlation of disease signature genes with the genes ordered by expression changes in drug condition, which represents the strength of the drug response. In contrast to CMap, which uses expression data from drug-treated samples, the NGSEA-based drug prioritization method uses functional genes for the drug’s mode of action; target genes. We used target genes for each FDA-approved drug as functional gene sets to test the association to diseases based on a list of genes ordered by network-based scores computed from disease-associated expression data (Fig. 3A). We compiled target gene sets for drugs from drug-target links based on active bioassays from the Drug Signature Database (DSigDB) (Yoo et al., 2015). We performed drug prioritization using NGSEA on 24 gene expression datasets for 12 diseases from KEGGdzPathwaysGEO and target gene sets for 165 FDA-approved drugs (i.e., those with more than 15 targets) from DSigDB.
We compared the ability of CMap and NGSEA to retrieve known drugs for each of the 24 disease-associated gene expression data sets. For benchmarking, we compiled 17,063 associations between 2,109 diseases and 1,481 chemicals based on direct evidence of an association from the ‘therapeutic’ category of the CTD (Davis et al., 2017). The performance of both CMap and NGSEA were determined using the area under the receiver operating characteristic curve (AUROC). To avoid a biased evaluation due to differences in the number of drugs tested, we included only drugs that were considered in both CMap and NGSEA for the AUROC analysis. We found significantly improved AUROC for drug recovery using NGSEA compared with CMap (
The effective retrieval of known drugs for various types of cancers suggested that we would be able to identify novel anti-cancer drugs using NGSEA. NGSEA yielded the highest improvement in the recovery of known anti-cancer drugs for colorectal cancer (GSE9348: AUROC = 0.488 and 0.775 by CMap and NGSEA, respectively) (Fig. 4A). We, therefore considered it highly likely that NGSEA would be able to identify novel drugs to treat colorectal cancer among the top repurposed FDA-approved chemicals. Among the top 30 chemicals predicted as candidates for colorectal cancer by NGSEA, six chemicals were currently used for the treatment of colorectal cancer and three additional chemicals had been tested in clinical trials for colorectal cancer (
To increase the usability of NGSEA, we have developed a web-based GSEA server (
In this report, we presented a network-based GSEA, NGSEA, which modified existing gene scoring methods by incorporating the differential expression information from neighbors in the functional gene network (Kim et al., 2019). We then demonstrated that NGSEA outperformed GSEA in retrieving both KEGG pathway terms and drugs for matched disease-associated gene expression data sets. Based on the benchmarking results, we have concluded that NGSEA will provide reliable functional information for interpretation of gene expression phenotypes of clinical samples. Most importantly, NGSEA performed well with functional gene sets. Because the original GSEA was designed to detect relationships between biological processes and chemicals based on expression signature genes, which are not necessarily functional genes, GSEA exhibited suboptimal performance with functional gene sets. MSigDB was designed explicitly to provide gene sets, many of which were derived from expression signatures, for GSEA. There are many other public databases; however, that contain annotated gene sets for diseases and pathways, and the majority of these gene sets include functional genes. We expect that NGSEA will be a useful tool for utilizing these available resources.
There are several existing methods that combine networks and gene set analysis. These methods measure the network distance between two gene sets (Alexeyenko et al., 2012; Glaab et al., 2012; McCormack et al., 2013; Wang et al., 2012). Although the sensitivity to detect the relationship between two gene sets was successfully improved using these methods compared with over-representation approaches, these methods still require the user to pre-select the query gene set by applying what is often an arbitrary differential expression score cutoff. These over-representation approaches to gene set analysis with network-based modification are therefore still limited by a lack of information regarding the relative orders between DEGs. We hypothesized that applying the network-based modification to an aggregate score approach to gene set analysis, which uses the differential expression values for the gene set analysis, would further improve the sensitivity to detect the relationship between two gene sets. To the best of our knowledge, NGSEA is the first method that combines network-based gene set analysis with an aggregate score approach.
A previously proposed method, calculation of impact factor (IF) for each pathway (Draghici et al., 2007), shares an idea with NGSEA in incorporating fold change values of DEGs over pathways. However, NGSEA significantly differ from IF analysis. NGSEA incorporates gene expression scores through genome-wide functional networks such as HumanNet, whereas IF analysis does so through topology of pathway knowledge such as KEGG. Therefore, NGSEA are allowed for all genes in the HumanNet which includes ~96% of human genes, whereas IF calculation can be applied to only ~35% of the human genes that are currently annotated in KEGG pathways.
We demonstrated that NGSEA could effectively retrieve known anti-cancer drugs for matched gene expression data sets. It is not clear why drug recovery using NGSEA was highly effective for different cancer types but not for neurodegenerative diseases such as Alzheimer’s disease, Parkinson’s disease, or Huntington’s disease (Fig. 3B). One possibility is that the current human gene network model is short of predictive power for brain cells because we also observed low performance of NGSEA for brain cancer. We observed similar results using another high-quality genome-scale human gene network, STRING (Szklarczyk et al., 2017) (
Boah Lee, Seung Ju Park, Seulgi Lee, Seung Eun Park, Eunhye Lee, Ji-Joon Song, Youngjoo Byun, and Seyun Kim
Mol. Cells 2020; 43(3): 222-227 https://doi.org/10.14348/molcells.2020.0051Moon Hee Yang, Min-Suk Jung, Min Joo Lee, Kyung Hyun Yoo, Yeon Joo Yook, Eun Young Park, Seo Hee Choi, Young Ju Suh, Kee-Won Kim and Jong Hoon Park
Mol. Cells 2008; 26(2): 121-130 https://doi.org/10.14348/.2008.26.2.121