Dry Lab

Resources

Computer

The computational genetics team contributes to the greater mission of the lab through the development of innovative software models to improve human genetic studies.

AGAIN

Building on our previously reported BPHunter, we precisely delineated the intronic segment between the branchpoint (BP) and the splice acceptor site (ACC), where AG-gain variants can interfere with splicing and become deleterious. We therefore developed AGAIN, a genome-wide method to systematically and efficiently pinpoint intronic AG-gain variants located in the BP-ACC region, based on the intron-specific distribution of BP. AGAIN flags variants in the high-risk [BP+8, ACC-4] region and predicts the protein-level outcomes of the two major mis-splicing consequences (new acceptor site and exon skipping).
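The idea of flagging an AG-gain variant inside the [BP+8, ACC-4] window can be illustrated with a minimal Python sketch. This is not the AGAIN implementation: the function names are hypothetical, positions are assumed to be 0-based indices along the transcribed strand, `acc_pos` is assumed to be the index of the last intronic base, and only single-nucleotide substitutions are handled.

```python
def creates_ag(seq: str, pos: int, alt: str) -> bool:
    """Return True if substituting `alt` at 0-based index `pos`
    creates an AG dinucleotide that is absent from the reference."""
    seq = seq.upper()
    mutated = seq[:pos] + alt.upper() + seq[pos + 1:]
    # Only the two dinucleotides overlapping `pos` can change.
    for start in (pos - 1, pos):
        if start >= 0 and start + 2 <= len(seq):
            if mutated[start:start + 2] == "AG" and seq[start:start + 2] != "AG":
                return True
    return False


def in_high_risk_region(pos: int, bp_pos: int, acc_pos: int,
                        bp_offset: int = 8, acc_offset: int = 4) -> bool:
    """Return True if `pos` lies in [BP+8, ACC-4] (inclusive).
    The coordinate convention here is an assumption of this sketch."""
    return bp_pos + bp_offset <= pos <= acc_pos - acc_offset


def flag_ag_gain(seq: str, pos: int, alt: str, bp_pos: int, acc_pos: int) -> bool:
    """Flag a substitution that both creates an AG and falls in the window."""
    return creates_ag(seq, pos, alt) and in_high_risk_region(pos, bp_pos, acc_pos)


# Toy intron tail: branchpoint motif, a pyrimidine-rich tract, then the acceptor AG.
toy_intron = "CTAAC" + "TTTTTTTTTATTTTTTTTTT" + "CAG"
toy_bp = 3                      # index of the branchpoint A (assumed)
toy_acc = len(toy_intron) - 1   # index of the last intronic base (assumed)
```

With this toy sequence, `flag_ag_gain(toy_intron, 15, "G", toy_bp, toy_acc)` reports a T>G change that turns AT into AG inside the window, whereas a T>C change at the same position is not flagged.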

BPHunter

The spliceosome recognizes intronic branchpoint (BP) motifs at the beginning of splicing and operates mostly within introns to define the exon–intron boundaries. BP variants may result in aberrant splicing (exon skipping, intron retention), which could be deleterious to gene products. We established a comprehensive genome-wide database of BP and developed BPHunter for the systematic and informative genome-wide detection of intronic variants that may disrupt BP and splicing, together with an effective strategy for prioritizing BP variant candidates. BPHunter not only constitutes an important resource for understanding BP, but should also drive the discovery of BP variants in human genetic diseases and traits.

GMUSCLE

The genotyping of CRISPR/Cas9-edited cells is challenging, and traditional genotyping methods are laborious. We therefore developed GMUSCLE, for “Genotyping MUltiplexed-Sequencing of CRISPR-Localized Editing”. In our approach, we sequence the CRISPR/Cas9-edited products in great depth and then use GMUSCLE for quantitative and qualitative identification of genotypes. GMUSCLE is user-friendly, time- and cost-efficient, and accurate. Beyond multiplexed sequencing, GMUSCLE can analyze sequencing data from bulk cell populations, cells edited at multiple target sites, different gene-editing protocols, and other organisms.

NHC

Network-based heterogeneity clustering (NHC) is a computational approach to detect physiological homogeneity amid genetic heterogeneity. It systematically converges genes of biological proximity on a background biological interaction network and captures the gene clusters that harbor presumably deleterious variants from a cohort of patients with the same disease, in an unbiased manner. It is suitable for diseases that have a homogeneous clinical phenotype and are likely caused by rare/uncommon variants with strong individual effects located in physiologically related genes. Moreover, we experimentally validated our prediction in a pilot study of herpes simplex encephalitis (HSE).

iMUBAC

Integration of multi-batch cytometry (iMUBAC) is a flexible, scalable, and robust computational framework for unsupervised cell-type identification across multiple batches of high-dimensional cytometry datasets, even without technical replicates. It overlays cells from multiple healthy controls across batches, learns batch-specific cell-type classification boundaries, and identifies aberrant immunophenotypes in patient samples from multiple batches in a unified manner.

PopViz

PopViz is an integrative and interactive webserver for the rapid visualization of population genetics data (gnomAD) and mutational damage prediction scores (CADD) for human genes. It provides multiple options for users to customize the search. PopViz is particularly useful for generating hypotheses about new disease-causing candidate genes and variants. It can help to reinforce or reject the plausibility of candidate genes, and to prioritize candidate variants for experimental testing.

SeqTailor

SeqTailor efficiently extracts DNA and protein sequences for genetic variants. It extracts wild-type and mutated (wt/mt) DNA sequences for human variants, with user-defined window sizes, from the human reference genome. It also annotates variants that have a direct impact on coding sequences (missense, frameshift, stop-gain, in-frame indel) and generates their full-length wt/mt protein sequences. SeqTailor bridges genetic variant data with DNA/protein sequence-based analyses and predictions.
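The wt/mt DNA extraction with a user-defined window can be sketched as follows. This is a minimal illustration, not SeqTailor's code: the function name `wt_mt_window` is hypothetical, the reference is assumed to be an in-memory string, and `pos` is assumed to be the 0-based index of the first reference base of the variant.

```python
def wt_mt_window(ref_seq: str, pos: int, ref: str, alt: str, window: int):
    """Return (wild-type, mutated) sequences spanning `window` bases
    on each side of a variant described by `ref` and `alt` alleles."""
    end = pos + len(ref)
    # Guard against a variant that does not match the reference sequence.
    if ref_seq[pos:end].upper() != ref.upper():
        raise ValueError("reference allele does not match the sequence")
    left = ref_seq[max(0, pos - window):pos]   # clipped at the sequence start
    right = ref_seq[end:end + window]          # slicing clips at the sequence end
    return left + ref_seq[pos:end] + right, left + alt + right


# Toy reference sequence (made up for illustration).
toy_ref = "ACGTACGTAC"
snv_wt, snv_mt = wt_mt_window(toy_ref, 4, "A", "G", 3)
del_wt, del_mt = wt_mt_window(toy_ref, 4, "AC", "A", 2)
```

The same function covers substitutions and simple indels, because the flanks are taken from the reference on either side of the `ref` allele and only the allele itself is swapped.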

MSC

MSC (mutation significance cutoff) is a quantitative approach that provides gene-level and gene-specific phenotypic impact cutoff values to improve the use of existing variant-level deleteriousness scores. It enables the filtering of benign variants from NGS data with little risk of removing disease-causing variants.
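Gene-specific filtering of this kind can be sketched as below. The cutoff values, gene names, and field names are invented for illustration and are not real MSC values; the function `filter_by_gene_cutoff` and its fallback `default_cutoff` for genes without a listed cutoff are assumptions of this sketch.

```python
def filter_by_gene_cutoff(variants, gene_cutoffs, default_cutoff):
    """Keep variants whose deleteriousness score is at or above the
    cutoff of their gene; genes without a cutoff use `default_cutoff`."""
    kept = []
    for variant in variants:
        cutoff = gene_cutoffs.get(variant["gene"], default_cutoff)
        if variant["score"] >= cutoff:
            kept.append(variant)
    return kept


# Hypothetical gene-level cutoffs and variant-level scores.
toy_cutoffs = {"GENE_A": 12.0, "GENE_B": 25.0}
toy_variants = [
    {"id": "v1", "gene": "GENE_A", "score": 15.0},  # above GENE_A cutoff
    {"id": "v2", "gene": "GENE_B", "score": 15.0},  # below GENE_B cutoff
    {"id": "v3", "gene": "GENE_C", "score": 15.0},  # falls back to default
]
toy_kept = filter_by_gene_cutoff(toy_variants, toy_cutoffs, default_cutoff=20.0)
```

The example shows why a gene-specific cutoff differs from a single global threshold: the same score of 15.0 is retained for one gene and filtered out for another.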

GDI

GDI (gene damage index) is a genome-wide, gene-level metric of the nonsynonymous mutational load in each protein-coding gene in the general population. It is an approach for predicting whether a given human protein-coding gene is likely to harbor disease-causing variants. GDI is particularly useful for filtering out false-positive variants when searching for disease-causing candidate variants in NGS data.

HGC

HGC (human gene connectome) is a human gene-centric approach for searching for disease-causing candidate genes by their biological proximity to genes already known to be responsible for the phenotype of interest. It is based on the hypothesis that the causal genes of a specific phenotype are usually functionally close to each other.
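Ranking candidates by proximity to known genes can be illustrated with a simplified sketch. It is not the HGC method: it assumes an unweighted toy interaction network, uses hop count from a breadth-first search as a stand-in for biological distance, and all gene names and function names are hypothetical.

```python
from collections import deque


def distances_from(network, sources):
    """Multi-source breadth-first search: hop count from the nearest
    source gene to every reachable gene in an unweighted network."""
    dist = {gene: 0 for gene in sources if gene in network}
    queue = deque(dist)
    while queue:
        gene = queue.popleft()
        for neighbor in network.get(gene, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[gene] + 1
                queue.append(neighbor)
    return dist


def rank_candidates(network, known_genes, candidates):
    """Sort candidates by distance to the known genes (closest first);
    unreachable genes go last, and ties are broken alphabetically."""
    dist = distances_from(network, known_genes)
    return sorted(candidates, key=lambda g: (dist.get(g, float("inf")), g))


# Toy undirected network given as an adjacency list.
toy_network = {
    "GENE_A": ["GENE_B"],
    "GENE_B": ["GENE_A", "GENE_C"],
    "GENE_C": ["GENE_B", "GENE_D"],
    "GENE_D": ["GENE_C"],
    "GENE_E": [],
}
toy_ranking = rank_candidates(toy_network, ["GENE_A"], ["GENE_D", "GENE_E", "GENE_B"])
```

Here `GENE_B` ranks first because it directly neighbors the known gene, and the isolated `GENE_E` ranks last because no path connects it to any known gene.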