Mol. Cells 2021; 44(11): 843-850
Published online November 17, 2021
https://doi.org/10.14348/molcells.2021.0169
© The Korean Society for Molecular and Cellular Biology
Correspondence to: yoonsj@sookmyung.ac.kr
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.
The rapid increase in collateral omics and phenotypic data has enabled data-driven studies for the fast discovery of cancer targets and biomarkers. Thus, it is necessary to develop convenient tools for general oncologists and cancer scientists to carry out customized data mining without computational expertise. For this purpose, we developed innovative software that enables user-driven analyses assisted by knowledge-based smart systems. Publicly available data on mutations, gene expression, patient survival, immune score, drug screening and RNAi screening were integrated from the TCGA, GDSC, CCLE, NCI, and DepMap databases. The optimal selection of samples and other filtering options were guided by the smart function of the software for data mining and visualization on Kaplan-Meier plots, box plots and scatter plots of publication quality. We implemented unique algorithms for both data mining and visualization, thus simplifying and accelerating user-driven discovery activities on large multiomics datasets. The present Q-omics software program (v0.95) is available at http://qomics.sookmyung.ac.kr.
Keywords: biomarker, cancer bioinformatics, immune infiltrate, Kaplan-Meier plot, omics data mining, smart software
Large collateral datasets, including those on mutations, gene expression, drug/RNAi screening and patient survival, are publicly available from diverse resources (Barretina et al., 2012; Cancer Genome Atlas Research Network et al., 2013; Ghandi et al., 2019; Guan et al., 2019; Iorio et al., 2016; Monks et al., 2018; Shi et al., 2021). Integrated analysis of the cross-association of these datasets provides useful clues for finding novel targets, predictive biomarkers and related mechanisms (Jeong et al., 2020; Shen et al., 2019). For example, many genes and mutations have been found to be associated with the patient survival rate via analyses of datasets from the TCGA database (Cao et al., 2020; Eckstein et al., 2020; Hong et al., 2017; Kitsou et al., 2020; Yang et al., 2011; Zhong et al., 2020). Cell line databases provide clues for the identification of predictive biomarkers against drug resistance and/or sensitivity (Garnett et al., 2012; He et al., 2014; Kim et al., 2016; Li et al., 2021; Yang et al., 2013). Novel targets against subtype-specific cancer mutations have also been suggested (Biswas et al., 2019; Li et al., 2019; Park et al., 2019).
An explosive increase in these collateral datasets will provide important resources for diverse data-driven cancer research projects. However, systematic and integrated analyses of these datasets are still challenging to most oncologists and cancer researchers with no computational background. Many web-based tools have been developed to improve the utility of public cancer datasets, such as Oncomine (Rhodes et al., 2004), cBioPortal (Cerami et al., 2012), and TIMER2.0 (Li et al., 2020). Although these web-based applications provide useful tools for a quick data search with significant information, user-oriented customized calculation and data filtration are generally limited from these server-provided functions. Thus, flexible and comprehensive software is required for cancer scientists to carry out customized data processing and computation on their local computers.
Here, we attempted to develop innovative smart software for oncologists to easily start their own data mining projects without computational skills. We established two aims for this software. First, the process of data analysis and visualization should be simple and comprehensive by providing a user-friendly graphical interface and an intuitive organization of menus. Second, we tried to implement smart functions that guide users to find optimal outputs, i.e., associated data pairs and graphs, via real-time communication with a server-side knowledge base harboring billions of pre-calculated data pairs. For these purposes, we simultaneously developed stand-alone software with data processing and computation abilities and a server-side knowledge base that can be connected to local software. This report briefly presents the functions and utilities of this software, Q-omics v0.95. The smart system of the implemented knowledge base will be continuously updated with improved visualization options in the user interface. We expect that the present computer-aided, smart data mining system will have general utilities in all fields of oncology and cancer research without the requirements of bioinformatics skills.
Cell line-based large-scale data consisting of RNA sequencing data (Expression, ver. 20Q1), sgRNA sequencing data (CRISPR, ver. 20Q1), shRNA screening data (Achilles + DRIVE + Marcotte, DEMETER2), mutation data (Mutation Public, ver. 20Q4), and drug response data (Sanger GDSC1 and GDSC2) were obtained from the DepMap portal (https://depmap.org/portal/). RNA sequencing data are log2-transformed transcripts per million (TPM) + 1 values obtained using RSEM normalization. sgRNA and shRNA data are batch-corrected CERES gene knockout effects (Meyers et al., 2017) and DEMETER2 estimated gene knockdown effects (McFarland et al., 2018), respectively. Mutation data are provided in Mutation Annotation Format (MAF). Drug response data are published as IC50 (nM) values, which we transformed to logarithmic-scale pIC50 (M) values. To analyze associations between datasets, 20 lineages with a sufficient number of common cell lines between RNA sequencing data and other data (sgRNA and drug response) were used in this study. Furthermore, the gene expression data of NCI60 cell lines treated with 15 drugs were obtained from the GEO database (GSE116436) (Monks et al., 2018). Details on the cell line, number of lineages, number of cell lines, and number of genes/drugs are shown in Table 1.
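The two transformations named above can be sketched as follows. This is a minimal illustration, assuming the usual definition pIC50 = -log10(IC50 in mol/L); the function names are hypothetical and are not taken from the Q-omics source code.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar).

    Since 1 nM = 1e-9 M, -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


def log2_tpm_plus_one(tpm: float) -> float:
    """Return log2(TPM + 1); the +1 keeps zero expression finite (maps to 0)."""
    if tpm < 0:
        raise ValueError("TPM cannot be negative")
    return math.log2(tpm + 1.0)
```

On this scale a more potent compound (lower IC50) has a higher pIC50, so sensitivity and resistance can be compared on an additive axis.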
Patient RNA sequencing data, clinical data, and mutation data were obtained from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). In total, 33 cancer types were investigated in this study. For comparisons between normal and tumor data, paired normal and tumor tissue samples were collected from the 18 cancer types that had more than two matched tissue samples. RNA sequencing data in FPKM (fragments per kilobase of transcript per million fragments mapped) values were transformed to log2 (TPM + 1) values after downloading. In addition, immune cell enrichment scores for TCGA data were obtained from the xCell portal (https://xcell.ucsf.edu/) (Aran et al., 2017). Details on the tissue type, number of lineages, number of samples, and number of genes are shown in Table 1.
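The text does not spell out how FPKM values were converted. A common approach, shown here only as an assumption, rescales each sample so that its values sum to one million (TPM_i = FPKM_i / Σ_j FPKM_j × 10^6) before taking log2(TPM + 1). The function name is hypothetical.

```python
import math


def fpkm_to_log2_tpm(fpkm_values):
    """Convert one sample's FPKM values to log2(TPM + 1).

    Assumes the standard per-sample rescaling
    TPM_i = FPKM_i / sum(FPKM) * 1e6, applied across all genes of the sample.
    """
    total = sum(fpkm_values)
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return [math.log2(v / total * 1e6 + 1.0) for v in fpkm_values]
```

Because the rescaling is per sample, the conversion has to be applied to the complete gene vector of each sample, not to a filtered subset of genes.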
To analyze associations between two datasets, we performed a cross-association analysis between phenotypic efficacy and gene expression in our previous work (Jeong et al., 2020). In this study, we extended the concept of cross-association to analyze more diverse datasets, including those on gene expression, sgRNAs, shRNAs, the drug response, and mutations (Fig. 1). Cross-associations between each data type, such as drug versus RNA-seq and shRNA versus mutation, and within the same data type, such as sgRNA versus sgRNA and mutation versus mutation, can be analyzed.
Two association measures, predictivity and descriptivity, were used. Given any two datasets (X and Y), we assume that x and y are entries of X and Y, respectively. The predictivity of x measures the difference in x values between the two groups of samples obtained by splitting at the median of y. In contrast, the descriptivity of x measures the difference in y values between the two groups of samples obtained by splitting at the median of x. Significance was tested using Fisher’s exact test for categorical data (mutation) and Student’s
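A rough sketch of this kind of median-split comparison is given below. It reads "two groups divided by the median" as a split of the samples at the median, which is an interpretation, and it is not the published predictivity or descriptivity formula (see Jeong et al., 2020 for the definitions). All function names are hypothetical.

```python
import math
from statistics import mean, median, variance


def median_split(x, y):
    """Group the x values by whether the paired y value is <= or > median(y)."""
    cut = median(y)
    low = [xi for xi, yi in zip(x, y) if yi <= cut]
    high = [xi for xi, yi in zip(x, y) if yi > cut]
    return low, high


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))


def predictivity_like(x, y):
    """Difference in mean x between the high-y and low-y groups, plus its t statistic."""
    low, high = median_split(x, y)
    return mean(high) - mean(low), student_t(high, low)
```

Under the same reading, a descriptivity-style value is obtained by swapping the roles of the two vectors, that is, `predictivity_like(y, x)`.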
Survival data were analyzed using the Kaplan–Meier (KM) method, and the log-rank test was used to compare the survival outcomes of two groups as a test of statistical significance. Furthermore, the area under the curve (AUC) was calculated to provide an estimate of the size of the difference between two groups. In this study, overall survival (OS) and disease-free survival (DFS) were analyzed. The two analyses differed according to the definition of the primary endpoint: all causes of death during the study period were used to analyze OS, and a tumor event or death was used to analyze DFS.
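For readers unfamiliar with these methods, the Kaplan–Meier estimator and the two-group log-rank test can be sketched in plain Python as follows. This is a textbook illustration, not the Q-omics implementation; the function names are hypothetical, and the AUC-based effect size mentioned above is not reproduced here.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1 = event, 0 = censored)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect all subjects sharing this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value for 1 df)."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_a = expected_a = var = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / var
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

For OS the event flag would mark death from any cause, and for DFS it would mark a tumor event or death, matching the endpoint definitions above.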
For the single-gene survival analysis, patients were divided into two groups based on high or low expression of the given gene or mutation status of the given gene. The association of two genes can be determined in advance to generate several subgroups for combined-gene survival analysis. Furthermore, for a more sophisticated survival analysis, a subset of patients was selected using clinical information such as sex, stage, or any combination of sex and stage.
Q-omics was designed to run locally on user computers. While Q-omics is running, time-consuming or data/memory-intensive analyses are performed on the server computer. For example, the cross-association analysis on the user side covers only a given lineage, while the smart search retrieves the most highly associated pairs in all 20 lineages from the server side. Similarly, for the survival analysis, the user side calculates the survival rate based on a single gene in a given lineage, while the smart search provides the significance of the survival rate based on a given gene in all 33 lineages.
Box plots in Q-omics can be used to visualize differences in the distribution of numerical data between different groups. The differences between two groups were analyzed by calculating the fold change and
Q-omics also provides a platform for comparisons between drug-induced changes in gene expression. Gene expression data from NCI60 cell lines treated with 15 anticancer agents contained the measured expression values of nine genes at three time points (2, 4, and 24 h) and at three doses (0 nM, low dose, and high dose) (Monks et al., 2018). The low and high doses varied depending on the drug. For these data, box plots were generated to compare time- and dose-dependent gene expression. Groups were divided based on time points or doses, and the box plots display fold changes between time points (4 h vs 2 h and 24 h vs 2 h) or between doses (low dose vs 0 nM and high dose vs 0 nM) rather than raw gene expression values.
Scatter plots were used to display relationships between two numeric variables, and the strength and direction of the linear relationships were assessed by Pearson’s correlation coefficient in Q-omics.
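Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal plain-Python sketch follows; the function name is hypothetical and this is not the Q-omics code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

The value ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship); it captures only linear dependence.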
Q-omics was implemented in Python 3, and MySQL was used for the smart search.
Q-omics software runs on the user’s computer, providing a graphical interface and computational/visualization modules together with its own local database (Fig. 2A). To assist in user data mining, Q-omics interacts with a server-side knowledge base and retrieves relevant information for analysis. The knowledge base harbors billions of precalculated, significantly associated data pairs with related information such as sample filters and calculation options. Smart algorithms in the knowledge base promptly select the data pairs and information relevant to the user’s query and return them to Q-omics.
As described in Fig. 1, users can start data mining with one query (i.e., gene expression, mutations, drugs, or sh/sgRNAs). The front page of Q-omics provides a graphical interface for selecting the analysis type, query and sample type (Fig. 2B). All analyses are separated into those with patient samples and those with cell lines. Available analyses with patient samples are as follows: (1) survival analyses (Kaplan–Meier plots) according to gene expression and mutations, (2) differential gene expression analyses between normal and cancer cells, and (3) scatter/box plot analyses of gene expression and/or mutation pairs. Available analyses with cell lines are as follows: (1) cross-association analyses between any pair of datasets according to gene expression, mutations, shRNA screening data, sgRNA screening data and drug screening data, (2) change (induction) analyses of gene expression before/after drug treatments, and (3) scatter/box plot analyses of pairs according to gene expression, mutations, shRNAs, sgRNAs and drugs. The menu “Quick start examples” demonstrates the graphical outputs and smart functions of the software using the preselected analysis type and user-selected queries. In all analyses, the resulting graphs and data can be saved for further use.
Fig. 3A demonstrates the survival analysis module of the software. A Kaplan–Meier plot of BRCA patient data was generated by using user-selected options: CD24 gene expression with TP53 mutations. The graphical panel provides detailed information on selected samples and further filtering options such as sex and stage. Together with the panel of Kaplan–Meier plots, Q-omics software provides a panel of smart search results (Fig. 3B). This smart panel provides a list of genes that exhibit significant (
Fig. 4A shows the Q-omics output panel of a cross-association analysis between the user-selected drug, cisplatin, and 17,795 sgRNAs in lung cancer cell lines. The present example shows that the responses of 136 sgRNAs exhibit a positive association (
The predictivity and descriptivity measures from the cross-association calculation were reported to be useful for the systematic evaluation of targets and biomarkers from multiomics data (Jeong et al., 2020). Q-omics software provides a simple and easy interface for calculating and analyzing the cross-association between any data pair, such as gene expression, mutations, sh/sgRNA screening data and drug screening data, from diverse resources. Q-omics also provides smart search results related to the user’s query in the cross-association analysis. The software retrieves diverse association patterns with statistical significance to the user’s query from the knowledge base and assists users in the optimal selection of data pairs and visualization.
In summary, Q-omics is an innovative software program that enables users to carry out data mining and customized visualization without computational skills. The smart system of the software assists in the identification of new data pairs related to/associated with the user’s interests in real time. This software takes advantage of stand-alone software and web-based applications. Several discovery projects using this software are ongoing, and the results will be published in the near future.
This work was financially supported by grants from the National Research Foundation of Korea (NRF), including the Science Research Center Program (NRF-2016R1A5A1011974), and the Mid-career Researcher Program (NRF-2017R1A2B2007745 and NRF-2018R1A2B6009313), funded by the Korean government (MEST).
S.Y. contributed to the overall study design. J.L., S.J. (Seonghee Jin), H.Y., S.J. (Sumin Jeong), E.J., and S.Y. conceived and implemented the software. J.L., Y.K., E.J., and S.Y. designed and implemented the database. J.L., Y.K., E.J., and S.Y. wrote the manuscript.
The authors have no potential conflicts of interest to disclose.
Table 1. Numbers of data points integrated into Q-omics software

| | No. of lineages | No. of cell lines/No. of samples | No. of genes/No. of drugs | Data type |
|---|---|---|---|---|
| Cell line data | | | | |
| Gene expression | 20 | 1,061 | 19,137 | RNA sequencing |
| sgRNA | 20 | 741 | 18,110 | CRISPR |
| shRNA | 20 | 587 | 16,800 | RNAi shRNA |
| Drug response | 20 | 1,001 | 397 | Drug response |
| Mutation | 20 | 1,281 | 18,731 | Exome sequencing |
| Drug-induced gene expression | 13 | 60 | 12,305/15 | DNA microarray |
| Tissue data | | | | |
| Tumor gene expression | 33 | 9,951 | 38,311 | RNA sequencing |
| Paired normal vs. cancer: gene expression | 18 | 679 | 38,311 | RNA sequencing |
| Mutation | 33 | 9,100 | 20,850 | Exome sequencing |
| Immune | 33 | 8,954 | 64 (cell types) | Cell type enrichment score |
Published online November 30, 2021 https://doi.org/10.14348/molcells.2021.0169
Jieun Lee1,3, Youngju Kim1,3, Seonghee Jin1, Heeseung Yoo1, Sumin Jeong1, Euna Jeong2, and Sukjoon Yoon1,2,*
1Department of Biological Sciences, Sookmyung Women’s University, Seoul 04310, Korea, 2Research Institute of Women’s Health, Sookmyung Women’s University, Seoul 04310, Korea, 3These authors contributed equally to this work.