publications
* denotes equal contribution.
An up-to-date list is available on Google Scholar.
2024
- PreMode predicts mode of action of missense variants by deep graph representation learning of protein sequence and structural context. Guojie Zhong, Yige Zhao, Demi Zhuang, Wendy K Chung, and Yufeng Shen. bioRxiv, 2024
Accurate prediction of the functional impact of missense variants is important for disease gene discovery, clinical genetic diagnostics, therapeutic strategies, and protein engineering. Previous efforts have focused on predicting a binary pathogenicity classification, but the functional impact of missense variants is multi-dimensional. Pathogenic missense variants in the same gene may act through different modes of action (i.e., gain/loss-of-function) by affecting different aspects of protein function. They may result in distinct clinical conditions that require different treatments. We developed a new method, PreMode, to perform gene-specific mode-of-action predictions. PreMode models effects of coding sequence variants using SE(3)-equivariant graph neural networks on protein sequences and structures. Using the largest-to-date set of missense variants with known modes of action, we showed that PreMode reached state-of-the-art performance in multiple types of mode-of-action predictions by efficient transfer-learning. Additionally, PreMode’s prediction of G/LoF variants in a kinase is validated with inactive-active conformation transition energy changes. Finally, we show that PreMode enables efficient study design of deep mutational scans and optimization in protein engineering.
2023
- A probabilistic graphical model for estimating selection coefficient of missense variants from human population sequence data. Yige Zhao, Guojie Zhong, Jake Hagen, Hongbing Pan, Wendy K. Chung, and 1 more author. medRxiv, 2023
Accurately predicting the effect of missense variants is a central problem in interpretation of genomic variation. Commonly used computational methods do not capture the quantitative impact on fitness in populations. We developed MisFit to estimate missense fitness effect using biobank-scale human population genome data. MisFit jointly models the effect at molecular level (d) and population level (selection coefficient, s), assuming that in the same gene, missense variants with similar d have similar s. MisFit is a probabilistic graphical model that integrates deep neural network components and population genetics models efficiently with inductive bias based on biological causality of variant effect. We trained it by maximizing probability of observed allele counts in 236,017 European individuals. We show that s is informative in predicting frequency across ancestries and consistent with the fraction of de novo mutations given s. Finally, MisFit outperforms previous methods in prioritizing missense variants in individuals with neurodevelopmental disorders.
Funding: This work is supported by NIH grants (R35GM149527, R01GM120609, and P50HD109879), Simons Foundation (SFARI #1019623), and Columbia Precision Medicine Pilot grants program.
Data availability: All data produced in the present work are contained in the manuscript. https://github.com/ShenLab/MisFit
- VBASS enables integration of single cell gene expression data in Bayesian association analysis of rare variants. G. Zhong, Y. A. Choi, and Y. Shen. 2023
Rare or de novo variants have substantial contribution to human diseases, but the statistical power to identify risk genes by rare variants is generally low due to rarity of genotype data. Previous studies have shown that risk genes usually have high expression in relevant cell types, although for many conditions the identity of these cell types is largely unknown. Recent efforts in single cell atlas in human and model organisms produced a large amount of gene expression data. Here we present VBASS, a Bayesian method that integrates single-cell expression and de novo variant (DNV) data to improve power of disease risk gene discovery. VBASS models disease risk prior as a function of expression profiles, approximated by deep neural networks. It learns the weights of neural networks and parameters of Gamma-Poisson likelihood models of DNV counts jointly from expression and genetics data. On simulated data, VBASS shows proper error rate control and better power than state-of-the-art methods. We applied VBASS to published datasets and identified more candidate risk genes with support from literature or data from independent cohorts. VBASS can be generalized to integrate other types of functional genomics data in statistical genetics analysis.
2022
- Representation of missense variants for predicting modes of action. G. Zhong and Y. Shen. Machine Learning in Structural Biology (MLSB), Workshop at the 36th Conference on Neural Information Processing Systems (NeurIPS), 2022
Accurate prediction of functional impact for missense variants is fundamental for genetic analysis and clinical applications. Current methods focused on generating an overall pathogenicity prediction score while overlooking the fact that variant effect should be multi-dimensional via different modes of action, such as gain or loss of function, and loss of folding stability or enzymatic activity. Recent breakthroughs in high-capacity language models have enabled ab initio prediction of protein structures as well as self-supervised representation learning of protein sequence and functions. Here we present RESCVE, a method to learn a universal representation of sequence variation from protein context. We demonstrated the utility of the method in predicting a range of modes of action for missense variants through transfer learning.
- Statistical models of the genetic etiology of congenital heart disease. G. Zhong and Y. Shen. Curr Opin Genet Dev, 2022
Congenital heart disease (CHD) is a collection of anatomically and clinically heterogeneous structural anomalies of the heart at birth. Finding genetic causes of CHD can not only shed light on developmental biology of the heart, but also provide a basis for improving clinical care and interventions. The optimal study design and analytical approaches to identify genetic causes depend on the underlying genetic architecture. A few well-known syndromes with CHD as core conditions, such as Noonan and CHARGE, have known monogenic causes. The genetic causes for most CHD patients, however, are unknown and likely to be complex. In this review, we highlight recent studies that assume a complex genetic architecture of CHD with two main approaches. One is genomic sequencing studies aiming to identify rare or de novo risk variants with large genetic effect. The other is genome-wide association studies optimized for common variants with moderate genetic effect.
- Identification and validation of candidate risk genes in endocytic vesicular trafficking associated with esophageal atresia and tracheoesophageal fistulas. G. Zhong*, P. Ahimaz*, N. A. Edwards*, J. J. Hagen, C. Faure, and 13 more authors. HGG Adv, 2022
Esophageal atresias/tracheoesophageal fistulas (EA/TEF) are rare congenital anomalies caused by aberrant development of the foregut. Previous studies indicate that rare or de novo genetic variants significantly contribute to EA/TEF risk, and most individuals with EA/TEF do not have pathogenic genetic variants in established risk genes. To identify novel genetic contributions to EA/TEF, we performed whole genome sequencing of 185 trios (probands and parents) with EA/TEF, including 59 isolated and 126 complex cases with additional congenital anomalies and/or neurodevelopmental disorders. There was a significant burden of protein altering de novo coding variants in complex cases (p=3.3e-4), especially in genes that are intolerant of loss of function variants in the population. We performed simulation analysis of pathway enrichment based on background mutation rate and identified a number of pathways related to endocytosis and intracellular trafficking that as a group have a significant burden of protein altering de novo variants. We assessed 18 variants for disease causality using CRISPR-Cas9 mutagenesis in Xenopus and confirmed 13 with tracheoesophageal phenotypes. Our results implicate disruption of endosome-mediated epithelial remodeling as a potential mechanism of foregut developmental defects. This research may have implications for the mechanisms of other rare congenital anomalies.
- Discovering the Developmental Basis of Trachea-Esophageal Birth Defects: Evidence for Endosome-opathies. N. Edwards, G. Zhong, P. Ahimaz, A. Kenny, P. Kingma, and 4 more authors. The FASEB Journal, 2022
The trachea and esophagus (TE) arise from a common foregut tube during embryonic development. Disruptions in TE morphogenesis cause congenital trachea-esophageal defects (TEDs) such as esophageal atresia, tracheoesophageal fistula and tracheoesophageal clefts. TEDs occur in approximately 1 in 3500 births, but their etiology is poorly understood. We have established www.CLEARconsortium.org, a multidisciplinary team of clinicians, geneticists, bioinformaticians, stem cell and developmental biologists using patient genome sequencing, animal models and iPSC-derived human organoids to discover the genetic and developmental basis of trachea-esophageal birth defects. Using the complementary advantages of Xenopus and mouse models, we have defined the conserved molecular and cellular mechanisms that regulate normal TE morphogenesis. We show that, downstream of Hedgehog/Gli signaling, endosome-mediated epithelial remodeling regulates TE morphogenesis, which when disrupted results in tracheoesophageal clefts similar to those of human Pallister Hall syndrome patients. Proband-parent trio genome sequencing identified an enrichment of potential damaging de novo variants in genes encoding membrane/vesicular-trafficking proteins, suggesting a common “endosome-opathy” pathway. Ongoing CRISPR mutagenesis screens in Xenopus tropicalis assessing candidate causative variants from patients confirm that the endosome protein Itsn1 is essential for TE morphogenesis, suggesting that the ITSN1 variant is likely pathogenic in the patient. Finally, leveraging results from animal models, we have generated multi-lineage human esophageal organoids from iPSCs with patient mutations to identify how mutations impact human esophageal differentiation. Together these results significantly advance our understanding of TEDs with the goal of revealing phenotype-genotype associations that will inform prognosis and clinical treatment.
2021
- Towards better understanding of developmental disorders from integration of spatial single-cell transcriptomics and epigenomics. G. Zhong*, J. Wang*, S. He*, and X. Fu*. The 2021 ICML Workshop on Computational Biology, 2021
The recently emerging techniques of single cell spatial RNA seq make it possible to profile the transcriptomics data at single cell resolution without loss of the spatial information. However, it is still a challenge to measure epigenomics profiles at spatial levels. In this project, we developed an autoencoder based multi-omics integration method and applied it on spatial mouse fetal brain data to reconstruct the spatial epigenomics profiles. We compared our method with LIGER and showed its better performance on a public dataset measured by latent mixing metrics. We further developed a CNN model to predict autism risk genes based on the spatial RNA seq data. Our model is able to prioritize autism risk genes from whole genome level. Code of our project can be found at https://github.com/explorerwjy/ML_genomics
- mRNA Delivery of a Bispecific Single-Domain Antibody to Polarize Tumor-Associated Macrophages and Synergize Immunotherapy against Liver Malignancies. Y. Wang, K. Tiruthani, S. Li, M. Hu, G. Zhong, and 6 more authors. Adv Mater, 2021
Liver malignancies are among the tumor types that are resistant to immune checkpoint inhibition therapy. Tumor-associated macrophages (TAMs) are highly enriched and play a major role in inducing immunosuppression in liver malignancies. Herein, CCL2 and CCL5 are screened as two major chemokines responsible for attracting TAM infiltration and inducing their polarization toward cancer-promoting M2-phenotype. To reverse this immunosuppressive process, an innovative single-domain antibody that bispecifically binds and neutralizes CCL2 and CCL5 (BisCCL2/5i) with high potency and specificity is directly evolved. mRNA encoding BisCCL2/5i is encapsulated in a clinically approved lipid nanoparticle platform, resulting in a liver-homing biomaterial that allows transient yet efficient expression of BisCCL2/5i in the diseased organ in a multiple dosage manner. This BisCCL2/5i mRNA nanoplatform significantly induces the polarization of TAMs toward the antitumoral M1 phenotype and reduces immunosuppression in the tumor microenvironment. The combination of BisCCL2/5i with PD-1 ligand inhibitor (PD-Li) achieves long-term survival in mouse models of primary liver cancer and liver metastasis of colorectal and pancreatic cancers. The work provides an effective bispecific targeting strategy that could broaden the PD-Li therapy to multiple types of malignancies in the human liver.
- Author Corrections: Reconstruction of cell spatial organization from single-cell RNA sequencing data based on ligand-receptor mediated self-assembly. X. Ren*, G. Zhong*, Q. Zhang, L. Zhang, Y. Sun, and 1 more author. Cell Res, 2021
2020
- Reconstruction of cell spatial organization from single-cell RNA sequencing data based on ligand-receptor mediated self-assembly. X. Ren*, G. Zhong*, Q. Zhang, L. Zhang, Y. Sun, and 1 more author. Cell Res, 2020
Single-cell RNA sequencing (scRNA-seq) has revolutionized transcriptomic studies by providing unprecedented cellular and molecular throughputs, but spatial information of individual cells is lost during tissue dissociation. While imaging-based technologies such as in situ sequencing show great promise, technical difficulties currently limit their wide usage. Here we hypothesize that cellular spatial organization is inherently encoded by cell identity and can be reconstructed, at least in part, by ligand-receptor interactions, and we present CSOmap, a computational tool to infer cellular interaction de novo from scRNA-seq. We show that CSOmap can successfully recapitulate the spatial organization of multiple organs of human and mouse including tumor microenvironments for multiple cancers in pseudo-space, and reveal molecular determinants of cellular interactions. Further, CSOmap readily simulates perturbation of genes or cell types to gain novel biological insights, especially into how immune cells interact in the tumor microenvironment. CSOmap can be a widely applicable tool to interrogate cellular organizations based on scRNA-seq data for various tissues in diverse systems.
2019
- Landscape and Dynamics of Single Immune Cells in Hepatocellular Carcinoma. Q. Zhang, Y. He, N. Luo, S. J. Patel, Y. Han, and 20 more authors. Cell, 2019
The immune microenvironment of hepatocellular carcinoma (HCC) is poorly characterized. Combining two single-cell RNA sequencing technologies, we produced transcriptomes of CD45+ immune cells for HCC patients from five immune-relevant sites: tumor, adjacent liver, hepatic lymph node (LN), blood, and ascites. A cluster of LAMP3+ dendritic cells (DCs) appeared to be the mature form of conventional DCs and possessed the potential to migrate from tumors to LNs. LAMP3+ DCs also expressed diverse immune-relevant ligands and exhibited potential to regulate multiple subtypes of lymphocytes. Of the macrophages in tumors that exhibited distinct transcriptional states, tumor-associated macrophages (TAMs) were associated with poor prognosis, and we established the inflammatory role of SLC40A1 and GPNMB in these cells. Further, myeloid and lymphoid cells in ascites were predominantly linked to tumor and blood origins, respectively. The dynamic properties of diverse CD45+ cell types revealed by this study add new dimensions to the immune landscape of HCC.