In a recent study published in PNAS, researchers introduced the Genomic Pre-trained Network (GPN), a multispecies model designed to learn genome-wide variant effects through self-supervised pre-training on genomic deoxyribonucleic acid (DNA) sequences.
Study: DNA language models are powerful predictors of genome-wide variant effects.
Genetic variations in the genome contribute to complex diseases and agricultural traits, yet understanding them remains challenging. Although genome-wide association studies (GWAS) provide biological insights, identifying causal variants remains difficult.
Experimental validation is time-consuming and expensive, emphasizing the need for precise, scalable computational methods to estimate the effect of genetic variation across the entire genome.
Unsupervised pre-training on large protein sequence databases has proven effective at extracting complex information about proteins and learning the effects of variants in coding regions.
About the study
In the present study, the researchers proposed a genome-wide variant effect prediction technique based on unsupervised DNA language models, which achieved state-of-the-art performance in Arabidopsis thaliana, a model organism in plant biology and a source of insight into human disorders.
To pre-train a language model based on a convolutional neural network, the researchers used the concatenated genomes of Arabidopsis thaliana and seven related Brassicales species, with the AraGWAS catalog used for reference. The model was trained to predict masked nucleotides from their genomic context.
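The masking step behind this training objective can be illustrated with a minimal sketch. The mask symbol, the 15% masking rate, and the function name are assumptions made for illustration only; they are not taken from the study's code.

```python
import random

MASK = "?"  # hypothetical mask symbol, chosen for illustration


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of nucleotides.

    Returns the masked sequence and a dict mapping each masked
    position to the original nucleotide the model should predict.
    """
    rng = rng or random.Random()
    masked = list(seq)
    targets = {}
    for i, base in enumerate(seq):
        if rng.random() < mask_rate:
            targets[i] = base   # training label for this position
            masked[i] = MASK    # the model only sees the mask here
    return "".join(masked), targets


seq = "ACGTTGCAAGCTTACGGATC"
masked, targets = mask_sequence(seq, mask_rate=0.15, rng=random.Random(0))
print(masked)
print(targets)
```

A model trained this way is scored on how well it recovers the hidden nucleotides from the unmasked context on both sides.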
The scientists averaged GPN's contextual embeddings (512 dimensions) of nucleotides from the reference genome over 100 base pair (bp) windows. They visualized the averaged embeddings using Uniform Manifold Approximation and Projection (UMAP) to assess how well the model understood genomic organization.
A logistic regression classifier was trained with the averaged embeddings as features to measure GPN's ability to discriminate genomic regions. Each genomic position was then masked individually, and the model's output distribution over nucleotides, given the surrounding context, was recorded.
To make these predicted distributions easier to use, the researchers created sequence logos that can be viewed in the University of California Santa Cruz (UCSC) Genome Browser.
GPN scores were calculated for in silico mutagenesis of single-nucleotide polymorphisms (SNPs) across 1.0-Mb regions, and results were averaged across variant types. Subsequently, the researchers examined more than 10 million SNPs from natural A. thaliana accessions in the 1001 Genomes Project to assess GPN's ability to predict the functional impact of genetic variants.
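Averaging embeddings over windows can be sketched as follows. This is an illustration under stated assumptions: the windows are taken as non-overlapping, any incomplete trailing window is dropped, and the toy input has 4 dimensions rather than 512. The function name is hypothetical.

```python
def window_average(embeddings, window=100):
    """Average per-nucleotide embedding vectors over non-overlapping windows.

    `embeddings` is a list of equal-length vectors, one per nucleotide.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    averaged = []
    for start in range(0, len(embeddings) - window + 1, window):
        chunk = embeddings[start:start + window]
        dim = len(chunk[0])
        averaged.append(
            [sum(vec[d] for vec in chunk) / window for d in range(dim)]
        )
    return averaged


# Toy input: 250 positions with 4-dimensional embeddings.
toy = [[float(i), 1.0, 0.0, -float(i)] for i in range(250)]
windows = window_average(toy, window=100)
print(len(windows), windows[0])
```

Each averaged vector can then serve as one point for a UMAP projection or as one feature row for a classifier.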
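A sequence logo conventionally scales each nucleotide by the information content of its position. The sketch below uses that standard convention (2 bits minus the Shannon entropy for a four-letter alphabet); whether the study applied exactly this scaling is an assumption, and the function names are hypothetical.

```python
import math


def information_content(probs):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return math.log2(len(probs)) - entropy


def logo_heights(probs):
    """Letter heights for one logo column: probability times information content."""
    ic = information_content(probs)
    return {base: p * ic for base, p in probs.items()}


confident = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(logo_heights(confident))
print(logo_heights(uniform))
```

Positions where the model is confident produce tall letters, while positions with a near-uniform prediction produce almost none, which is why confident predictions stand out visually as candidate motifs.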
Code was provided to train a GPN model for any given species based solely on its DNA sequence, allowing unsupervised estimation of the effect of mutations across the entire genome. To assess the model's ability to identify putative functional variants, the researchers analyzed the enrichment of rare versus common genetic variants within the tails of genome-wide score distributions.
The GPN model, which was trained without supervision, effectively learned gene structures and DNA patterns in Arabidopsis thaliana, a plant biology model organism that is closely related to various agriculturally relevant species and can provide insight into human disorders.
The method outperformed established conservation scores such as phastCons and phyloP, which are based on a whole-genome alignment of 18 related Brassicales species. The internal representation of DNA sequences used by GPN can discriminate genomic regions such as untranslated regions (UTRs), introns and coding sequences, and its prediction confidence can aid in the discovery of regulatory grammar, such as transcription factor binding motifs.
The classifier distinguished intergenic, intron, coding sequence (CDS), UTR and non-coding ribonucleic acid (ncRNA) regions. Accuracy was highest for CDS (96%) and lowest for ncRNA (51%), the least common category.
Model prediction confidence was associated with the expected function of sites, and start and stop codon motifs were generally correctly predicted.
Using the log-likelihood ratio between the alternative and reference alleles, GPN can assign a pathogenicity or functionality score to each SNP in the genome. The ranking of variant types by the lowest percentiles of GPN scores was generally consistent with prior expectations of deleteriousness.
In models trained with repeat down-weighting of 0.0 and 0.1, eight percent and nine percent of variants in repeats, respectively, scored below the first decile of missense variants. Putative functional SNPs, defined as those in the lowest 0.1% of GPN scores, were 5.5-fold enriched in rare variants.
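The log-likelihood ratio can be made concrete with a short sketch. It assumes the model's predicted nucleotide distribution at the masked variant position is already available; the probabilities below are invented for illustration, the natural logarithm is an arbitrary choice of base, and the function name is hypothetical.

```python
import math


def llr_score(probs, ref, alt):
    """Log-likelihood ratio of the alternative versus the reference allele.

    `probs` is the predicted distribution over nucleotides at the
    masked variant position. Negative values mean the model considers
    the alternative allele less likely than the reference.
    """
    return math.log(probs[alt] / probs[ref])


# Invented distribution for one position where the reference allele is A.
probs = {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02}
score = llr_score(probs, ref="A", alt="T")
print(round(score, 3))
```

Under this definition, strongly negative scores mark substitutions that the model finds surprising in context, which is why the lowest percentiles of the score distribution are treated as candidate functional variants.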
GPN has the advantage of assigning significantly different scores to genetic variants in strong linkage disequilibrium (LD) with each other when the surrounding context is different.
The GPN-LD technique effectively distinguished GWAS hits from non-hits, with SNPs in the lowest one percent of GPN-LD scores being 10-fold enriched in GWAS hits.
Surprisingly, the model trained with an intermediate weight on repeats performed best. When evaluating the entire variant set, including positions that do not align to other Brassicales, the GPN-LD strategy produced significantly higher odds ratios.
Based on the study findings, GPN reliably predicts genome-wide variant effects from genomic sequence alone. It is applicable to any species and could be used to refine GWAS fine-mapping and polygenic risk scores.
Since GPN is trained only on DNA sequences, it can be used for understudied non-model species that lack comprehensive functional genomics data. The model learns from joint nucleotide distributions in similar contexts across genomes rather than from whole-genome alignments, which can be of poor quality in non-coding regions.
GPN predictions around splice junctions can help identify splicing factor binding sites. Future studies could assess the effect of down-weighting repeats based on their family or age.
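An enrichment of this kind can be summarized as an odds ratio from a two-by-two table of GWAS hits versus non-hits, inside versus outside the score tail. The counts below are hypothetical and chosen only to illustrate the arithmetic; the study's exact statistical procedure may differ.

```python
def odds_ratio(hits_in_tail, nonhits_in_tail, hits_outside, nonhits_outside):
    """Odds ratio for being a GWAS hit inside versus outside the score tail."""
    odds_in_tail = hits_in_tail / nonhits_in_tail
    odds_outside = hits_outside / nonhits_outside
    return odds_in_tail / odds_outside


# Hypothetical counts: 1,000 SNPs in the lowest 1% of scores, 99,100 elsewhere.
enrichment = odds_ratio(
    hits_in_tail=10,
    nonhits_in_tail=990,
    hits_outside=100,
    nonhits_outside=99000,
)
print(enrichment)
```

An odds ratio above 1 indicates that GWAS hits are over-represented among the lowest-scoring SNPs, and a significance test such as Fisher's exact test would normally accompany the point estimate.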