Mol. Cells 2021; 44(9): 680-687
Published online September 30, 2021
https://doi.org/10.14348/molcells.2021.2249
© The Korean Society for Molecular and Cellular Biology
Correspondence to : jongbhak@genomics.org
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.
Coronavirus disease, COVID-19 (coronavirus disease 2019), caused by SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), has a higher case fatality rate in European countries than in others, especially East Asian ones. One potential explanation for this regional difference is the diversity of the viral infection efficiency. Here, we analyzed the allele frequencies of a nonsynonymous variant rs12329760 (V197M) in the TMPRSS2 gene, a key enzyme essential for viral infection and found a significant association between the COVID-19 case fatality rate and the V197M allele frequencies, using over 200,000 present-day and ancient genomic samples. East Asian countries have higher V197M allele frequencies than other regions, including European countries which correlates to their lower case fatality rates. Structural and energy calculation analysis of the V197M amino acid change showed that it destabilizes the TMPRSS2 protein, possibly negatively affecting its ACE2 and viral spike protein processing.
Keywords allele frequency, case fatality rate, COVID-19, SARS-CoV-2, TMPRSS2
COVID-19 (coronavirus disease 2019) is an infectious disease caused by SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2). Appearing first during late 2019 in Wuhan, China, COVID-19 has spread rapidly worldwide (Lai et al., 2020). As of May 23, 2020, SARS-CoV-2 has infected >5 million people in over 200 countries, killing more than 330,000 people (European Centre for Disease Prevention and Control, 2020). Many European countries had been particularly affected, with Spain and Italy each having reached over 200,000 cases of infection and more than 27,000 deaths, reaching a maximum case fatality rate (CFR) of >10% (European Centre for Disease Prevention and Control, 2020). In contrast, many East Asian countries did not experience such dire effects, with South Korea, for instance, reporting a peak CFR of 2.4% (European Centre for Disease Prevention and Control, 2020). Multiple contributing factors could explain this difference, including timing and severity of lockdown measures (Sonn et al., 2020), population age ratio (Dowd et al., 2020), healthcare resource availability (Ji et al., 2020), smoking rate (Cai, 2020a; 2020b), and early tuberculosis (Bacillus Calmette–Guérin) vaccination (Hussein, 2020; Miller et al., 2020; Redelman-Sidi, 2020). In principle, genetic factors may also underpin differential susceptibility to SARS-CoV-2 (Cao et al., 2020; Williams et al., 2020; Yuan et al., 2014).
Genes encoding cellular serine protease (TMPRSS2), angiotensin-converting enzyme 2 (ACE2), cysteine proteases cathepsin B and cathepsin L (CatB, CatL), phosphatidylinositol 3-phosphate 5-kinase (PIKfyve), and two pore channel subtype 2 (TPC2) are notable for their critical roles in SARS-CoV-2 infection (Hoffmann et al., 2020; Ou et al., 2020). Particularly, the virus utilizes TMPRSS2 and CatB/L proteolytic activity for priming the viral spike protein, whereas ACE2 is the entry receptor for breaking into host cells (Hoffmann et al., 2020; Ou et al., 2020). A study has suggested TMPRSS2 inhibition as a clinical target because the priming step is a key factor determining successful entry into target cells (Hoffmann et al., 2020). Most of the recent publications on the SARS-CoV-2 susceptibility so far focused on ACE2 and TMPRSS2 as possible genetic determinants by analyzing their associations with sex hormones, their gene expression in various tissues and cell lines, and interactions with spike protein or inhibitors at a gene level (Hoffmann et al., 2020; Matsuyama et al., 2020; Mjaess et al., 2020; Song et al., 2020; Zhou et al., 2020).
To understand the genetic background of complex phenotypes in human populations, researchers commonly assess correlations with allele frequency (AF) (Asselta et al., 2020; Das and Ghate, 2020). This approach has identified a correlation between ancestral genetic composition and the CFR of COVID-19 (Das and Ghate, 2020). However, few have examined specific variants, their frequencies and individual contributions to SARS-CoV-2 susceptibility. Some reports are also based only on low-resolution intercontinental comparisons between Europeans and East Asians (Asselta et al., 2020; Das and Ghate, 2020; Kenyon, 2020). Based on these studies, not only do
In this study, we investigated intercountry AF differences of
Autosomal nonsynonymous variants located in TMPRSS2 were extracted from Korea2K variome set (n = 2,262) from the Korean Genome Project (Jeon et al., 2020), which turned out to contain 15 single nucleotide variants (SNVs). Alternative AFs of other populations were obtained from the PGG.SNV database (GRCh38) (n = 220,147) (Zhang et al., 2019), Italian Genome Reference Panel (IGRP1.0) (n = 926) (Cocca et al., 2020), and Lithuanian high density SNV data (n = 425) (Urnikyte et al., 2019). IGRP1.0 and Lithuanian genomes were lifted over to hg38 coordinates in Picard version 2.22.3 (), using LiftoverVcf with default options. The combined dataset included 223,760 samples from 4 variome databases with whole-genome sequencing, exome sequencing, or genotyping chip data (Supplementary Data S1). Allele counts were merged based on country of sample origin.
Populations were excluded if they could not be assigned to any specific country, if had fewer than 2,500 reported COVID-19 cases, or when AF or CFR information was unavailable. Nonsynonymous variants were included only if they were present in >10 countries and had a global AF of >1%. The final dataset used to calculate AF and CFR correlations contained 72,907 samples (from 29 countries) for TMPRSS2 V197M.
We downloaded COVID-19 data set on March 22, 2021 from Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data). We employed the equation from Daneshkhah et al. (2020), to calculate time-adjusted CFR (T-CFR) (Equation 1), which throughout this manuscript is averaged and referred to as AT-CFR.
where
Spearman’s correlation test was conducted between AF and AT-CFR in R version 3.5.1.
Variants were annotated in VEP version 99.2 (McLaren et al., 2016) with dbNSFP version 3.0 (Liu et al., 2016) to evaluate deleteriousness and conservation. Additionally, phastCons scores were obtained for primates, mammals, and vertebrates to determine interspecies conservation of significant variant sites.
We built a TMPRSS2 model using hepsin (1Z8G) as the template structure. The model was selected using PSI-BLAST sequence search (Altschul et al., 1997), along with alignment from NCBI. Two sets of TMPRSS2 models were generated using the Robetta web server (Kim et al., 2004) and I-TASSER (Yang et al., 2015): a wild-type TMPRSS2 model based on 1Z8G and a V197M mutant model based on the wild-type one. Valine of residue 65 of 1Z8G was also substituted with methionine to generate mutant type. Protein energies of wild-type and variant models were compared in dDFIRE (Yang and Zhou, 2008) and nDOPE (Shen and Sali, 2006) to determine structural stability (details in the Supplementary Methods). dDFIRE (Yang and Zhou, 2008) scores have been extracted from the protein structure based on the distance between two atoms and the three angles involved in the dipole-dipole interaction. nDOPE (Shen and Sali, 2006) was used to measure protein energy as a statistical potential dependent on the calculated atomic distance in the protein structure.
Ramachandran favorable regions were measured through MolProbity (Williams et al., 2018). The following tools were used to predict variation in TMPRSS2 protein stability for both wild-type and mutant-type models: PoPMuSiC (Dehouck et al., 2011), CUPSAT (Parthiban et al., 2006), I-Mutant3 (Capriotti et al., 2008), DUET (Pires et al., 2014a), mCSM (Pires et al., 2014b), SDM (Pandurangan et al., 2017), MuPro (Cheng et al., 2006). Visualizations were created in UCSF Chimera (Pettersen et al., 2004).
Ancient genomes were downloaded from the David Reich Lab (https://reich.hms.harvard.edu/datasets; Supplementary Data S2 and S3). Additional ancient European data for V197M (rs12329760) were obtained from the PGG.SNV database. Data format conversion was handled using PLINK version 1.9 (Chang et al., 2015). Presence of the two variants was verified and their frequencies calculated in different ancient populations (Supplementary Data S2-S4). Temporal variation in AF was visualized using the ggplot2 package in R.
We found two nonsynonymous exonic
The AF of V197M was negatively correlated with COVID-19 AT-CFR and AT-CFR-100 (Figs. 1B and 1C). The AF distribution pattern was consistent with previous reports, with V197M AF being significantly lower in most Europeans than in East Asians (Asselta et al., 2020) (Fig. 1D, Supplementary Data S5 and S7). In Chinese, Japanese, and Koreans, AF was 34.5%, 38.8%, and 36.8%, respectively (Supplementary Data S5). Among Europeans, the Finnish were a surprising outlier, with 37.3% AF (vs 19.9% in Italians, 17.8% in Spanish, and 22.6% in British) that corresponded to a low AT-CFR (Supplementary Data S5). Finnish AF significantly differed only from the Chinese population among East Asians (
We also found that V197M occurred in an extremely well-conserved position (phastCons17way_primate: 0.958, Supplementary Data S8) of the SRCR domain, suggesting that it is under purifying selection. Moreover, functional prediction tools SIFT (Ng and Henikoff, 2003) and PolyPhen2 (Adzhubei et al., 2010) regarded the variant as “deleterious” and “probably damaging”, respectively (Supplementary Data S8)
Interestingly, the V197M variant is absent in the great apes (Han et al., 2019; Prado-Martinez et al., 2013) and in all sequenced archaic hominin genomes (Denisovan, Neanderthal) (Supplementary Data S9). We further investigated presence of V197M variant in ancient human genomes. Tianyuan man’s genotype showed that the variant was already present in humans 40,000 years ago in East Asia (Supplementary Data S3). We also found V197M in ancient genomes I7021 and I13180 from Mongolia, dated 5,211-5,000 BCE and 3,013-2,876 BCE, respectively (Supplementary Data S2). Starting from the pre-Ice Age (34,000-26,000 years ago), the variant was present in European inhabitants (37,250 BCE sample GoyetQ116-1 from Belgium [Fu et al., 2016]) and remained ever since (Supplementary Data S2 and S4). Although small sample sizes precluded statistical analysis, V197M AF appeared to be higher in ancient East Asian populations (33.3%) than in ancient Europeans (16.3%) (Supplementary Data S3 and S4, Supplementary Fig. S6).
We used 3D protein models to investigate the effect of V197M on TMPRSS2. V197M increased energy score more than wild type (Table 1), suggesting reduced stability. Two programs (dDFIRE [Yang and Zhou, 2008], nDOPE [Shen and Sali, 2006]) were used to measure the effect of V197M on the protein.
We used two homology modeling tools (Robetta [Kim et al., 2004], I-TASSER [Yang et al., 2015]) (Supplementary Methods) and transmembrane serine protease hepsin (PDB ID 1Z8G chain A) (Herter et al., 2005) as the template (Supplementary Fig. S7). The resultant model contains both SRCR and nearby peptidase S1 domains of TMPRSS2 (Fig. 2) because the former was too small for modeling. Despite only minor structural changes to the SRCR domain (Fig. 2), V197M had a consistently destabilizing effect in TMPRSS2 (Table 1). A further indication of reduced stability in mutants was a decrease in the favored region of the Ramachandran plot. Seven computational protein-stability prediction tools confirmed the V197M variant as destabilizing (Supplementary Data S10).
This study has limitations. First, we only used public genome databases and variant frequency data that are not directly linked to COVID-19 patients and the CFR. However, a recent study conducted on COVID-19 patients in Italy confirmed V197M allele frequencies appear to be correlated to patients’ clinical outcomes (Monticelli et al., 2021). Furthermore, we could not completely normalize AT-CFR with relevant covariates, such as lockdown measures, mask availability, medical care standards, within-population or within-fatal-case age ratios, and SARS-CoV-2 test availability. However, we tested the Spearman’s correlation between AT-CFR and thirteen socio-economic variables such as population density and Gross Domestic Product (GDP) per capita in a pairwise manner and found that only the population density had significant positive correlations (Supplementary Figs. S8 and S9). Another limitation is the lack of variant frequency data on chromosome X, absent from many public databases such as PGG.SNV, even though the X chromosome contains a key player,
A previous report has noted that Europeans have significantly lower V197M AF than East Asians, a pattern speculated to be associated with COVID-19 CFR (Asselta et al., 2020). To confirm that this association in our study had not occurred by chance, we performed a V197M AF correlation test based on two different approaches (the correlation with AT-CFR and correlation with AT-CFR-100) applying multiple input filtering criteria (Supplementary Figs. S10-S12) as well as a series of linear regression analyses considering publicly available regional socio-economic and epidemiological variables (Supplementary Data S11-S16). Although we observed a very significant correlation between the AFs of the
Our evaluation of protein structural stability predicted that V197M destabilizes TMPRSS2 (Table 1, Supplementary Data S10). We suspect V197M variant to be related to the overall
In line with previous reports, we suggest that V197M acts to indirectly compromise the binding affinity of TMPRSS2 to SARS-CoV-2 spike protein and ACE2 (Bhattacharyya et al., 2020; Petersen et al., 2010; Sharma et al., 2020). This implies a protective role of the V197M variant against SARS-CoV-2 infections, but neither we nor previous researchers (Bhattacharyya et al., 2020; Paniri et al., 2021; Sharma et al., 2020) have uncovered any clear evidence or explanation for causation. Interestingly, the change from valine to methionine has a Grantham distance matrix value of only 22, the shortest distance from valine to any amino acid. Thus, V197M may lie on a thin boundary of extreme conservation versus functional benefit that may have arisen through the viral invasion and polymorphisms in different ethnic groups that caused 3D structural deviation. We speculate that East Asians have already experienced similar viral infections in the past (Souilmi et al., 2020), leading to natural selection on V197M in
Notably, the majority of the data in our study were obtained from PGG.SNV database (Zhang et al., 2019) (Supplementary Data S1), where we cannot check the duplicity as the genome data for each sample is not available, only allele counts. This also complicates checking duplicity between the PGG.SNV (Zhang et al., 2019) and the external data used (such as Korean [Jeon et al., 2020], Lithuanian [Urnikyte et al., 2019], and Italian data [Cocca et al., 2020]) (Supplementary Data S1). Duplicity could have possibly occurred as the same individual could have been sequenced or genotyped for multiple projects either within PGG.SNV database, or for e.g., PGG.SNV and Korea1K. Still, the chances of this happening must be very low and even if it is the case, it would not affect the overall statistics significantly.
This research was a part of Korean Genome Project (KGP) and was approved by the Institutional Review Board (IRB) of the Ulsan National Institute of Science and Technology (UNISTIRB-15-19-A, UNISTIRB-16-13-C). This work was supported by the Promotion of Innovative Businesses for Regulation-Free Special Zones funded by the Ministry of SMEs and Startups (MSS, Korea)(P0016193)(2.210511.01). This work was also supported by the Establishment of Demonstration Infrastructure for Regulation-Free Special Zones funded by the Ministry of SMEs and Startups (MSS, Korea)(P0016191)(2.210514.01). This work was also supported by the Research Project Funded by Ulsan City Research Fund (2.201052.01) of UNIST (Ulsan National Institute of Science & Technology). We thank Dr. Seung Gu Park for advising the data visualization and Jasmin Junseo Lee for editing grammatical errors.
S.J. and J.B. conceived the study design. S.J., C.Y., H.R., and A.B. performed analyses. A.B., C.Y., S.J., H.R., and H.C. wrote and edited the manuscript. J.B. secured funding. Y.J., Y.B., D.B., A.M., E.S.S., Y.S.C., N.R., and B.C.K. provided expertise and feedback. H.C. independently validated the results.
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Y.S.C. is an employee and B.C.K. and J.B. are the CEOs of Clinomics, Inc. Y.S.C., B.C.K., and J.B. have an equity interest in the company. D.B. is an employee and J.B. is the CEO of Geromics, Ltd. The rest of the authors have no potential conflicts of interest to disclose.
Effect of V197M variant on structural features
Modeled structure | Type of structures | dDFIRE | nDOPE | Ramachandran plot (favored) (%) | |
---|---|---|---|---|---|
Hepsin (1Z8G) | V197d | –822.28 | –1.586 | 97.53 | |
M197e | –812.80 | –1.439 | 96.76 | ||
Robettaa | TMPRSS2 (SRCR)b | V197 | –183.43 | –1.125 | 94.68 |
M197 | –176.48 | –0.907 | 93.62 | ||
TMPRSS2 (SRCR + peptidase S1)c | V197 | –730.79 | –1.135 | 94.77 | |
M197 | –725.40 | –1.062 | 93.90 | ||
I-TASSERa | TMPRSS2 (SRCR) | V197 | –184.17 | –0.909 | 91.67 |
M197 | –151.16 | –0.156 | 86.17 | ||
TMPRSS2 (SRCR + peptidase S1) | V197 | –700.23 | –0.704 | 92.44 | |
M197 | –615.98 | –0.129 | 86.05 |
aHomology modeling tools, bSRCR domain separated from modeled TMPRSS2 structure, cModeled TMPRSS2 structure, dWild type (V197) structure, eMutant type structure with M197 variant.
Mol. Cells 2021; 44(9): 680-687
Published online September 30, 2021 https://doi.org/10.14348/molcells.2021.2249
Copyright © The Korean Society for Molecular and Cellular Biology.
Sungwon Jeon1,2,9 , Asta Blazyte1,2,9
, Changhan Yoon1,2,9
, Hyojung Ryu1,2
, Yeonsu Jeon1,2
, Youngjune Bhak1,2
, Dan Bolser3
, Andrea Manica4
, Eun-Seok Shin5,6
, Yun Sung Cho7
, Byung Chul Kim7
, Namhee Ryoo8
, Hansol Choi1,2
, and Jong Bhak1,2,3,6,7,*
1Korean Genomics Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Korea, 2Department of Biomedical Engineering, College of Information and Biotechnology, UNIST, Ulsan 44919, Korea, 3Geromics, Ltd., Cambridge CB1 3NF, UK, 4Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, UK, 5Division of Cardiology, Department of Internal Medicine, Ulsan Medical Center, Ulsan 44686, Korea, 6Personal Genomics Institute (PGI), Genome Research Foundation (GRF), Cheongju 28160, Korea, 7Clinomics, Inc., Ulsan 44919, Korea, 8Department of Laboratory Medicine, Keimyung University School of Medicine, Daegu 42601, Korea, 9These authors contributed equally to this work.
Correspondence to:jongbhak@genomics.org
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.
Coronavirus disease, COVID-19 (coronavirus disease 2019), caused by SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), has a higher case fatality rate in European countries than in others, especially East Asian ones. One potential explanation for this regional difference is the diversity of the viral infection efficiency. Here, we analyzed the allele frequencies of a nonsynonymous variant rs12329760 (V197M) in the TMPRSS2 gene, a key enzyme essential for viral infection and found a significant association between the COVID-19 case fatality rate and the V197M allele frequencies, using over 200,000 present-day and ancient genomic samples. East Asian countries have higher V197M allele frequencies than other regions, including European countries which correlates to their lower case fatality rates. Structural and energy calculation analysis of the V197M amino acid change showed that it destabilizes the TMPRSS2 protein, possibly negatively affecting its ACE2 and viral spike protein processing.
Keywords: allele frequency, case fatality rate, COVID-19, SARS-CoV-2, TMPRSS2
COVID-19 (coronavirus disease 2019) is an infectious disease caused by SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2). Appearing first during late 2019 in Wuhan, China, COVID-19 has spread rapidly worldwide (Lai et al., 2020). As of May 23, 2020, SARS-CoV-2 has infected >5 million people in over 200 countries, killing more than 330,000 people (European Centre for Disease Prevention and Control, 2020). Many European countries had been particularly affected, with Spain and Italy each having reached over 200,000 cases of infection and more than 27,000 deaths, reaching a maximum case fatality rate (CFR) of >10% (European Centre for Disease Prevention and Control, 2020). In contrast, many East Asian countries did not experience such dire effects, with South Korea, for instance, reporting a peak CFR of 2.4% (European Centre for Disease Prevention and Control, 2020). Multiple contributing factors could explain this difference, including timing and severity of lockdown measures (Sonn et al., 2020), population age ratio (Dowd et al., 2020), healthcare resource availability (Ji et al., 2020), smoking rate (Cai, 2020a; 2020b), and early tuberculosis (Bacillus Calmette–Guérin) vaccination (Hussein, 2020; Miller et al., 2020; Redelman-Sidi, 2020). In principle, genetic factors may also underpin differential susceptibility to SARS-CoV-2 (Cao et al., 2020; Williams et al., 2020; Yuan et al., 2014).
Genes encoding cellular serine protease (TMPRSS2), angiotensin-converting enzyme 2 (ACE2), cysteine proteases cathepsin B and cathepsin L (CatB, CatL), phosphatidylinositol 3-phosphate 5-kinase (PIKfyve), and two pore channel subtype 2 (TPC2) are notable for their critical roles in SARS-CoV-2 infection (Hoffmann et al., 2020; Ou et al., 2020). Particularly, the virus utilizes TMPRSS2 and CatB/L proteolytic activity for priming the viral spike protein, whereas ACE2 is the entry receptor for breaking into host cells (Hoffmann et al., 2020; Ou et al., 2020). A study has suggested TMPRSS2 inhibition as a clinical target because the priming step is a key factor determining successful entry into target cells (Hoffmann et al., 2020). Most of the recent publications on the SARS-CoV-2 susceptibility so far focused on ACE2 and TMPRSS2 as possible genetic determinants by analyzing their associations with sex hormones, their gene expression in various tissues and cell lines, and interactions with spike protein or inhibitors at a gene level (Hoffmann et al., 2020; Matsuyama et al., 2020; Mjaess et al., 2020; Song et al., 2020; Zhou et al., 2020).
To understand the genetic background of complex phenotypes in human populations, researchers commonly assess correlations with allele frequency (AF) (Asselta et al., 2020; Das and Ghate, 2020). This approach has identified a correlation between ancestral genetic composition and the CFR of COVID-19 (Das and Ghate, 2020). However, few have examined specific variants, their frequencies and individual contributions to SARS-CoV-2 susceptibility. Some reports are also based only on low-resolution intercontinental comparisons between Europeans and East Asians (Asselta et al., 2020; Das and Ghate, 2020; Kenyon, 2020). Based on these studies, not only do
In this study, we investigated intercountry AF differences of
Autosomal nonsynonymous variants located in TMPRSS2 were extracted from Korea2K variome set (n = 2,262) from the Korean Genome Project (Jeon et al., 2020), which turned out to contain 15 single nucleotide variants (SNVs). Alternative AFs of other populations were obtained from the PGG.SNV database (GRCh38) (n = 220,147) (Zhang et al., 2019), Italian Genome Reference Panel (IGRP1.0) (n = 926) (Cocca et al., 2020), and Lithuanian high density SNV data (n = 425) (Urnikyte et al., 2019). IGRP1.0 and Lithuanian genomes were lifted over to hg38 coordinates in Picard version 2.22.3 (), using LiftoverVcf with default options. The combined dataset included 223,760 samples from 4 variome databases with whole-genome sequencing, exome sequencing, or genotyping chip data (Supplementary Data S1). Allele counts were merged based on country of sample origin.
Populations were excluded if they could not be assigned to any specific country, if had fewer than 2,500 reported COVID-19 cases, or when AF or CFR information was unavailable. Nonsynonymous variants were included only if they were present in >10 countries and had a global AF of >1%. The final dataset used to calculate AF and CFR correlations contained 72,907 samples (from 29 countries) for TMPRSS2 V197M.
We downloaded COVID-19 data set on March 22, 2021 from Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data). We employed the equation from Daneshkhah et al. (2020), to calculate time-adjusted CFR (T-CFR) (Equation 1), which throughout this manuscript is averaged and referred to as AT-CFR.
where
Spearman’s correlation test was conducted between AF and AT-CFR in R version 3.5.1.
Variants were annotated in VEP version 99.2 (McLaren et al., 2016) with dbNSFP version 3.0 (Liu et al., 2016) to evaluate deleteriousness and conservation. Additionally, phastCons scores were obtained for primates, mammals, and vertebrates to determine interspecies conservation of significant variant sites.
We built a TMPRSS2 model using hepsin (1Z8G) as the template structure. The model was selected using PSI-BLAST sequence search (Altschul et al., 1997), along with alignment from NCBI. Two sets of TMPRSS2 models were generated using the Robetta web server (Kim et al., 2004) and I-TASSER (Yang et al., 2015): a wild-type TMPRSS2 model based on 1Z8G and a V197M mutant model based on the wild-type one. Valine of residue 65 of 1Z8G was also substituted with methionine to generate mutant type. Protein energies of wild-type and variant models were compared in dDFIRE (Yang and Zhou, 2008) and nDOPE (Shen and Sali, 2006) to determine structural stability (details in the Supplementary Methods). dDFIRE (Yang and Zhou, 2008) scores have been extracted from the protein structure based on the distance between two atoms and the three angles involved in the dipole-dipole interaction. nDOPE (Shen and Sali, 2006) was used to measure protein energy as a statistical potential dependent on the calculated atomic distance in the protein structure.
Ramachandran favorable regions were measured through MolProbity (Williams et al., 2018). The following tools were used to predict variation in TMPRSS2 protein stability for both wild-type and mutant-type models: PoPMuSiC (Dehouck et al., 2011), CUPSAT (Parthiban et al., 2006), I-Mutant3 (Capriotti et al., 2008), DUET (Pires et al., 2014a), mCSM (Pires et al., 2014b), SDM (Pandurangan et al., 2017), MuPro (Cheng et al., 2006). Visualizations were created in UCSF Chimera (Pettersen et al., 2004).
Ancient genomes were downloaded from the David Reich Lab (https://reich.hms.harvard.edu/datasets; Supplementary Data S2 and S3). Additional ancient European data for V197M (rs12329760) were obtained from the PGG.SNV database. Data format conversion was handled using PLINK version 1.9 (Chang et al., 2015). Presence of the two variants was verified and their frequencies calculated in different ancient populations (Supplementary Data S2-S4). Temporal variation in AF was visualized using the ggplot2 package in R.
We found two nonsynonymous exonic
The AF of V197M was negatively correlated with COVID-19 AT-CFR and AT-CFR-100 (Figs. 1B and 1C). The AF distribution pattern was consistent with previous reports, with V197M AF being significantly lower in most Europeans than in East Asians (Asselta et al., 2020) (Fig. 1D, Supplementary Data S5 and S7). In Chinese, Japanese, and Koreans, AF was 34.5%, 38.8%, and 36.8%, respectively (Supplementary Data S5). Among Europeans, the Finnish were a surprising outlier, with 37.3% AF (vs 19.9% in Italians, 17.8% in Spanish, and 22.6% in British) that corresponded to a low AT-CFR (Supplementary Data S5). Finnish AF significantly differed only from the Chinese population among East Asians (
We also found that V197M occurred in an extremely well-conserved position (phastCons17way_primate: 0.958, Supplementary Data S8) of the SRCR domain, suggesting that it is under purifying selection. Moreover, functional prediction tools SIFT (Ng and Henikoff, 2003) and PolyPhen2 (Adzhubei et al., 2010) regarded the variant as “deleterious” and “probably damaging”, respectively (Supplementary Data S8)
Interestingly, the V197M variant is absent in the great apes (Han et al., 2019; Prado-Martinez et al., 2013) and in all sequenced archaic hominin genomes (Denisovan, Neanderthal) (Supplementary Data S9). We further investigated presence of V197M variant in ancient human genomes. Tianyuan man’s genotype showed that the variant was already present in humans 40,000 years ago in East Asia (Supplementary Data S3). We also found V197M in ancient genomes I7021 and I13180 from Mongolia, dated 5,211-5,000 BCE and 3,013-2,876 BCE, respectively (Supplementary Data S2). Starting from the pre-Ice Age (34,000-26,000 years ago), the variant was present in European inhabitants (37,250 BCE sample GoyetQ116-1 from Belgium [Fu et al., 2016]) and remained ever since (Supplementary Data S2 and S4). Although small sample sizes precluded statistical analysis, V197M AF appeared to be higher in ancient East Asian populations (33.3%) than in ancient Europeans (16.3%) (Supplementary Data S3 and S4, Supplementary Fig. S6).
We used 3D protein models to investigate the effect of V197M on TMPRSS2. V197M increased energy score more than wild type (Table 1), suggesting reduced stability. Two programs (dDFIRE [Yang and Zhou, 2008], nDOPE [Shen and Sali, 2006]) were used to measure the effect of V197M on the protein.
We used two homology modeling tools (Robetta [Kim et al., 2004], I-TASSER [Yang et al., 2015]) (Supplementary Methods) and transmembrane serine protease hepsin (PDB ID 1Z8G chain A) (Herter et al., 2005) as the template (Supplementary Fig. S7). The resultant model contains both SRCR and nearby peptidase S1 domains of TMPRSS2 (Fig. 2) because the former was too small for modeling. Despite only minor structural changes to the SRCR domain (Fig. 2), V197M had a consistently destabilizing effect in TMPRSS2 (Table 1). A further indication of reduced stability in mutants was a decrease in the favored region of the Ramachandran plot. Seven computational protein-stability prediction tools confirmed the V197M variant as destabilizing (Supplementary Data S10).
This study has limitations. First, we only used public genome databases and variant frequency data that are not directly linked to COVID-19 patients and the CFR. However, a recent study conducted on COVID-19 patients in Italy confirmed V197M allele frequencies appear to be correlated to patients’ clinical outcomes (Monticelli et al., 2021). Furthermore, we could not completely normalize AT-CFR with relevant covariates, such as lockdown measures, mask availability, medical care standards, within-population or within-fatal-case age ratios, and SARS-CoV-2 test availability. However, we tested the Spearman’s correlation between AT-CFR and thirteen socio-economic variables such as population density and Gross Domestic Product (GDP) per capita in a pairwise manner and found that only the population density had significant positive correlations (Supplementary Figs. S8 and S9). Another limitation is the lack of variant frequency data on chromosome X, absent from many public databases such as PGG.SNV, even though the X chromosome contains a key player,
A previous report has noted that Europeans have significantly lower V197M AF than East Asians, a pattern speculated to be associated with COVID-19 CFR (Asselta et al., 2020). To confirm that this association in our study had not occurred by chance, we performed a V197M AF correlation test based on two different approaches (the correlation with AT-CFR and correlation with AT-CFR-100) applying multiple input filtering criteria (Supplementary Figs. S10-S12) as well as a series of linear regression analyses considering publicly available regional socio-economic and epidemiological variables (Supplementary Data S11-S16). Although we observed a very significant correlation between the AFs of the
Our evaluation of protein structural stability predicted that V197M destabilizes TMPRSS2 (Table 1, Supplementary Data S10). We suspect V197M variant to be related to the overall
In line with previous reports, we suggest that V197M acts to indirectly compromise the binding affinity of TMPRSS2 to SARS-CoV-2 spike protein and ACE2 (Bhattacharyya et al., 2020; Petersen et al., 2010; Sharma et al., 2020). This implies a protective role of the V197M variant against SARS-CoV-2 infections, but neither we nor previous researchers (Bhattacharyya et al., 2020; Paniri et al., 2021; Sharma et al., 2020) have uncovered any clear evidence or explanation for causation. Interestingly, the change from valine to methionine has a Grantham distance matrix value of only 22, the shortest distance from valine to any amino acid. Thus, V197M may lie on a thin boundary of extreme conservation versus functional benefit that may have arisen through the viral invasion and polymorphisms in different ethnic groups that caused 3D structural deviation. We speculate that East Asians have already experienced similar viral infections in the past (Souilmi et al., 2020), leading to natural selection on V197M in
Notably, the majority of the data in our study were obtained from PGG.SNV database (Zhang et al., 2019) (Supplementary Data S1), where we cannot check the duplicity as the genome data for each sample is not available, only allele counts. This also complicates checking duplicity between the PGG.SNV (Zhang et al., 2019) and the external data used (such as Korean [Jeon et al., 2020], Lithuanian [Urnikyte et al., 2019], and Italian data [Cocca et al., 2020]) (Supplementary Data S1). Duplicity could have possibly occurred as the same individual could have been sequenced or genotyped for multiple projects either within PGG.SNV database, or for e.g., PGG.SNV and Korea1K. Still, the chances of this happening must be very low and even if it is the case, it would not affect the overall statistics significantly.
This research was a part of Korean Genome Project (KGP) and was approved by the Institutional Review Board (IRB) of the Ulsan National Institute of Science and Technology (UNISTIRB-15-19-A, UNISTIRB-16-13-C). This work was supported by the Promotion of Innovative Businesses for Regulation-Free Special Zones funded by the Ministry of SMEs and Startups (MSS, Korea)(P0016193)(2.210511.01). This work was also supported by the Establishment of Demonstration Infrastructure for Regulation-Free Special Zones funded by the Ministry of SMEs and Startups (MSS, Korea)(P0016191)(2.210514.01). This work was also supported by the Research Project Funded by Ulsan City Research Fund (2.201052.01) of UNIST (Ulsan National Institute of Science & Technology). We thank Dr. Seung Gu Park for advising the data visualization and Jasmin Junseo Lee for editing grammatical errors.
S.J. and J.B. conceived the study design. S.J., C.Y., H.R., and A.B. performed analyses. A.B., C.Y., S.J., H.R., and H.C. wrote and edited the manuscript. J.B. secured funding. Y.J., Y.B., D.B., A.M., E.S.S., Y.S.C., N.R., and B.C.K. provided expertise and feedback. H.C. independently validated the results.
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Y.S.C. is an employee and B.C.K. and J.B. are the CEOs of Clinomics, Inc. Y.S.C., B.C.K., and J.B. have an equity interest in the company. D.B. is an employee and J.B. is the CEO of Geromics, Ltd. The rest of the authors have no potential conflicts of interest to disclose.
. Effect of V197M variant on structural features .
Modeled structure | Type of structures | dDFIRE | nDOPE | Ramachandran plot (favored) (%) | |
---|---|---|---|---|---|
Hepsin (1Z8G) | V197d | –822.28 | –1.586 | 97.53 | |
M197e | –812.80 | –1.439 | 96.76 | ||
Robettaa | TMPRSS2 (SRCR)b | V197 | –183.43 | –1.125 | 94.68 |
M197 | –176.48 | –0.907 | 93.62 | ||
TMPRSS2 (SRCR + peptidase S1)c | V197 | –730.79 | –1.135 | 94.77 | |
M197 | –725.40 | –1.062 | 93.90 | ||
I-TASSERa | TMPRSS2 (SRCR) | V197 | –184.17 | –0.909 | 91.67 |
M197 | –151.16 | –0.156 | 86.17 | ||
TMPRSS2 (SRCR + peptidase S1) | V197 | –700.23 | –0.704 | 92.44 | |
M197 | –615.98 | –0.129 | 86.05 |
aHomology modeling tools, bSRCR domain separated from modeled TMPRSS2 structure, cModeled TMPRSS2 structure, dWild type (V197) structure, eMutant type structure with M197 variant..
Yun-Sook Lim, Lap P. Nguyen, Gun-Hee Lee, Sung-Geun Lee, Kwang-Soo Lyoo, Bumseok Kim, and Soon B. Hwang
Mol. Cells 2021; 44(9): 688-695 https://doi.org/10.14348/molcells.2021.0076Taewoo Kim, Jeong Seok Lee, and Young Seok Ju
Mol. Cells 2021; 44(6): 377-383 https://doi.org/10.14348/molcells.2021.0094Min Kyung Jung and Eui-Cheol Shin
Mol. Cells 2021; 44(6): 401-407 https://doi.org/10.14348/molcells.2021.0079