Advancing regulatory variant effect prediction with AlphaGenome
Deep learning models that predict functional genomic measurements from DNA sequences are powerful tools for deciphering the genetic regulatory code. Existing methods involve a trade-off between input sequence length and prediction resolution, thereby limiting their modality scope and performance. We present AlphaGenome, a unified DNA sequence model, which takes as input one megabase of DNA sequence and predicts thousands of functional genomic tracks up to single-base-pair resolution across diverse modalities. The modalities include gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage, and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest available external models in 25 of 26 evaluations of variant effect prediction. The ability of AlphaGenome to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically relevant variants near the TAL1 oncogene. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.
Interpreting the impact of genome sequence variation remains a central biological challenge. Non-coding variants, which reside outside of protein-coding regions, are particularly challenging to interpret because of the diverse molecular consequences they can elicit. For example, non-coding variants can modulate genome properties such as chromatin accessibility, epigenetic modifications and three-dimensional chromatin conformation. Variants can further influence messenger RNA availability by altering expression levels or modifying sequence composition through splicing changes. Additionally, variants can exhibit cell-type-specific or tissue-specific effects. Given that more than 98 percent of observed genetic variation in humans is non-coding, global characterization of the complex effects of this vast majority of variants remains intractable without computational predictions.
Computational methods can learn patterns from experimental data to predict and explain variant effects. One class of methods, sequence-to-function models, takes a DNA sequence as input and predicts genome tracks, a data format associating each DNA base pair with a value (representing read coverage, count or signal) derived from experimental assays performed in cell lines or tissues. Genome tracks span various data modalities measuring gene expression (with output types comprising RNA sequencing, cap analysis of gene expression sequencing, and precision nuclear run-on analysis of capped RNA), splicing (splice sites, splice site usage and splice junctions), DNA accessibility (DNase I hypersensitive site sequencing and assay for transposase-accessible chromatin sequencing), histone modification (chromatin immunoprecipitation sequencing), transcription factor binding or chromatin conformation (high-throughput chromosome or micrococcal nuclease-based conformation capture). Successfully trained sequence-to-function models accurately predict experimental measurements from input sequences. Furthermore, by comparing genome track predictions from an alternative sequence versus a reference sequence, these models can predict the molecular effects of variants.
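The reference-versus-alternative comparison described above can be illustrated with a minimal sketch. Everything here is hypothetical and is not the AlphaGenome API: `toy_track_model` is a stand-in that emits signal wherever a `TATA` motif starts, and `score_variant` summarizes the effect as the difference in total predicted signal, which is only one of many possible aggregation choices.

```python
from typing import Callable, List


def toy_track_model(sequence: str) -> List[float]:
    """Hypothetical stand-in for a trained model: one value per base pair.

    The value is 1.0 wherever a 'TATA' motif starts and 0.0 elsewhere.
    """
    return [1.0 if sequence[i:i + 4] == "TATA" else 0.0
            for i in range(len(sequence))]


def apply_snv(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return the sequence with a single-nucleotide variant applied.

    `position` is 0-based; the reference allele must match the sequence.
    """
    if sequence[position] != ref:
        raise ValueError("reference allele does not match the sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def score_variant(model: Callable[[str], List[float]], sequence: str,
                  position: int, ref: str, alt: str) -> float:
    """Score a variant as (total alternative signal) - (total reference signal)."""
    ref_track = model(sequence)
    alt_track = model(apply_snv(sequence, position, ref, alt))
    return sum(alt_track) - sum(ref_track)


reference = "GGCTATAAGGC"
# The A>G change at index 4 destroys the only TATA motif in the toy sequence.
score = score_variant(toy_track_model, reference, 4, "A", "G")
print(score)  # -1.0
```

A real sequence-to-function model would replace `toy_track_model`, and the aggregation step would typically be chosen per modality rather than being a plain sum.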
Currently, deep learning-based sequence-to-function models face two fundamental trade-offs constraining their ability to predict how variants affect diverse modes of biological regulation. First, often owing to computational limitations, models must trade off between capturing long-range genomic interactions and achieving nucleotide-level predictive resolution. Although models such as SpliceAI, BPNet and ProCapNet provide base-resolution predictions, they are restricted to short input sequences (for example, 10 kilobases or less), and thus may miss the influence of distal regulatory elements. Models such as Enformer and Borzoi can process longer sequences (approximately 200 to 500 kilobases) to capture broader context, but at the cost of reduced output resolution (bins of 128 or 32 base pairs), which can blur fine-scale regulatory features such as splice sites, transcription factor footprints or polyadenylation sites.
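To make the blurring effect of coarse output bins concrete, the following sketch sums a base-resolution track into non-overlapping bins. The track, the feature positions and the helper `bin_track` are illustrative assumptions, not taken from any of the models named above.

```python
from typing import List


def bin_track(track: List[float], bin_size: int) -> List[float]:
    """Sum a base-resolution track into non-overlapping bins of `bin_size` bp."""
    if len(track) % bin_size != 0:
        raise ValueError("track length must be a multiple of bin_size")
    return [sum(track[i:i + bin_size]) for i in range(0, len(track), bin_size)]


# Hypothetical 256-bp track with two sharp features 90 bp apart.
track = [0.0] * 256
track[10] = 1.0
track[100] = 1.0

binned_128 = bin_track(track, 128)  # both features fall into the first bin
binned_32 = bin_track(track, 32)    # features land in bins 0 and 3

print(binned_128)  # [2.0, 0.0]
print(binned_32)   # [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

At 128-bp bins the two features merge into a single value, whereas 32-bp bins keep them apart but still cannot place either one to the exact base.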
A second trade-off exists between capturing diverse modalities versus specializing in one or a few. Several state-of-the-art models are highly specialized for single modalities, such as SpliceAI for splice site prediction, ChromBPNet for local chromatin accessibility and Orca for three-dimensional genome architecture. However, specialized models alone are insufficient for capturing the diverse molecular consequences of variants across modalities. Even within a single modality like splicing, specialized models such as SpliceAI or Pangolin predict certain aspects (such as splice site prediction) while omitting others (such as splice junction prediction or competition between splice sites). Models like DeepSEA, Basenji, Enformer, Sei and Borzoi have demonstrated the utility and practicality of multimodal models. They allow users to use a single model for several modalities, instead of requiring several specialized models. Furthermore, their learned general sequence representation enables them to be readily fine-tuned for new tasks. However, these more generalist models can lag behind their specialized counterparts on certain tasks, such as splicing, or may lack particular modalities, such as contact maps.
Here we present AlphaGenome, a model that unifies multimodal prediction, long-sequence context and base-pair resolution into a single framework. The model takes one megabase of DNA sequence as input and predicts a diverse range of genome tracks across numerous cell types. The splicing predictions of AlphaGenome include a new splice junction prediction approach alongside splice site usage prediction. We evaluated the performance of AlphaGenome using a comprehensive set of benchmarks, covering both its ability to accurately predict genome tracks on previously unseen DNA sequences and its effectiveness in variant effect prediction tasks. AlphaGenome achieved state-of-the-art performance on 22 of 24 genome track prediction tasks and 25 of 26 variant effect prediction tasks. We performed extensive ablations of target resolution, sequence length, distillation and modality combinations to explain the performance of AlphaGenome and inform design choices for future sequence-to-function models. We envisage that AlphaGenome will provide a powerful and extensible foundation for analysing the regulatory code within the genome.
We first present key technical details of the AlphaGenome data and training procedure, alongside a high-level summary of our evaluations. We then demonstrate high-fidelity genome track prediction performance, a prerequisite for variant effect prediction. Next, we focus on variant effect prediction with modality-specific deep dives into splicing, gene expression and chromatin accessibility. Finally, we highlight the model's utility in cross-modality variant interpretation and dissect the impact of modelling choices on the performance of AlphaGenome.
Unifying DNA sequence-to-function model
AlphaGenome is a deep learning model designed to learn the sequence basis of diverse molecular phenotypes from human and mouse DNA. It simultaneously predicts 5,930 human or 1,128 mouse genome tracks across 11 modalities covering gene expression, detailed splicing patterns, chromatin state and chromatin contact maps. These span a variety of biological contexts, such as different tissue types, cell types and cell lines. These predictions are made on the basis of one megabase of DNA sequence, a context length designed to encompass a substantial portion of the relevant distal regulatory landscape. For instance, 99 percent of validated enhancer-gene pairs fall within one megabase.
AlphaGenome uses a U-Net-inspired backbone architecture to efficiently process input sequences into two types of sequence representations: one-dimensional embeddings (at 1-base-pair and 128-base-pair resolutions), which correspond to representations of the linear genome, and two-dimensional embeddings (at 2,048-base-pair resolution), which correspond to representations of spatial interactions between genomic segments. The one-dimensional embeddings serve as the basis for genomic track predictions, whereas the two-dimensional embeddings are the basis for predicting pairwise interactions (contact maps). Within the architecture, convolutional layers model local sequence patterns necessary for fine-grained predictions, whereas transformer blocks model coarser but longer-range dependencies in the sequence, such as enhancer-promoter interactions. Base-pair-resolution training on the full one-megabase sequence is enabled through sequence parallelism across eight interconnected tensor processing unit devices. Genomic track predictions are linear transformations of these sequence embeddings, aside from splice junction count prediction, which uses a separate mechanism that captures interactions between one-dimensional embeddings of donor-acceptor pairs.
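The relationship between input length, embedding resolutions and output heads can be sketched with some simple arithmetic and toy functions. This is an illustration under stated assumptions, not the published implementation: `SEQ_LEN` assumes that "one megabase" means 2^20 base pairs, `linear_head` is a generic per-position linear map, and `pairwise_scores` uses a plain dot product as a placeholder for the donor-acceptor interaction mechanism, whose actual form is not described in this section.

```python
from typing import List

SEQ_LEN = 2 ** 20  # assumption: "one megabase" taken as 1,048,576 bp


def n_positions(seq_len: int, resolution_bp: int) -> int:
    """Number of embedding positions at a given resolution."""
    return seq_len // resolution_bp


def linear_head(embeddings: List[List[float]],
                weights: List[List[float]],
                bias: List[float]) -> List[List[float]]:
    """Map each position's embedding (length C) to T track values.

    `weights` holds T weight vectors of length C; `bias` holds T offsets.
    """
    return [
        [sum(e * w for e, w in zip(emb, w_track)) + b
         for w_track, b in zip(weights, bias)]
        for emb in embeddings
    ]


def pairwise_scores(donor_embs: List[List[float]],
                    acceptor_embs: List[List[float]]) -> List[List[float]]:
    """Placeholder donor-acceptor interaction: dot product of each pair."""
    return [[sum(d * a for d, a in zip(donor, acceptor))
             for acceptor in acceptor_embs]
            for donor in donor_embs]


print(n_positions(SEQ_LEN, 128))   # 8192 positions at 128-bp resolution
print(n_positions(SEQ_LEN, 2048))  # 512, so a 512 x 512 pairwise map

tracks = linear_head([[1.0, 2.0], [0.0, 1.0]],
                     weights=[[1.0, 0.0], [0.5, 0.5]],
                     bias=[0.0, 1.0])
print(tracks)  # [[1.0, 2.5], [0.0, 1.5]]

junctions = pairwise_scores([[1.0, 0.0]], [[1.0, 2.0], [0.0, 3.0]])
print(junctions)  # [[1.0, 0.0]]
```

The point of the sketch is the shape bookkeeping: track heads act independently on each position of a one-dimensional embedding, whereas junction and contact predictions need a score for every pair of positions.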
We trained the model using a two-stage process: pretraining and distillation. The pretraining phase used the observed experimental data to produce two types of models. Fold-specific models were trained using a four-fold cross-validation scheme, with three quarters of the reference genome used for training and the remaining quarter held out for validation and testing. These models were then used to evaluate the generalization of AlphaGenome by predicting genomic tracks on unseen test reference genome intervals. Additionally, all-fold models were trained on all available intervals of the reference genome and served as teachers in the second stage (distillation). In the distillation phase, a single student model, sharing the pretrained architecture, was trained to predict the output of an ensemble of all-fold teachers using randomly augmented input sequences. This distilled student model, as shown previously, achieved improved robustness and variant effect prediction accuracy in a single model instance, making predictions across all modelled modalities and cell types with a single device call per variant. With inference taking less than one second on an NVIDIA H100 GPU, the student model is highly efficient for large-scale variant effect prediction relative to the alternative approach of ensembling several independently trained models.
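The two-stage procedure can be outlined in a short sketch. All names are hypothetical and several details are assumptions made for illustration: intervals are assigned to folds round-robin, the augmentation is random base substitution, the teachers are toy functions, and the ensemble output is taken as the mean of the teacher predictions. The source does not specify these choices.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_folds(intervals: Sequence[int], n_folds: int = 4) -> List[List[int]]:
    """Assign intervals to folds round-robin (an assumed assignment rule)."""
    return [list(intervals[k::n_folds]) for k in range(n_folds)]


def train_and_heldout(folds: List[List[int]],
                      heldout_index: int) -> Tuple[List[int], List[int]]:
    """Use one fold for validation/testing and the other folds for training."""
    train = [iv for k, fold in enumerate(folds) if k != heldout_index
             for iv in fold]
    return train, folds[heldout_index]


def augment(sequence: str, rng: random.Random,
            mutation_rate: float = 0.1) -> str:
    """Randomly substitute bases (an assumed augmentation scheme)."""
    return "".join(rng.choice("ACGT") if rng.random() < mutation_rate else base
                   for base in sequence)


def distillation_target(teachers: List[Callable[[str], List[float]]],
                        sequence: str) -> List[float]:
    """Average the per-base predictions of all teachers for one input."""
    predictions = [teacher(sequence) for teacher in teachers]
    return [sum(values) / len(values) for values in zip(*predictions)]


# Stage 1: fold-specific data split over eight hypothetical intervals.
folds = split_folds(list(range(8)))
train, heldout = train_and_heldout(folds, heldout_index=0)
print(train, heldout)  # [1, 5, 2, 6, 3, 7] [0, 4]

# Stage 2: the student would be trained to match the ensemble mean
# computed on augmented inputs; here we only compute the targets.
teacher_a = lambda s: [1.0 if b in "GC" else 0.0 for b in s]
teacher_b = lambda s: [1.0 if b == "G" else 0.0 for b in s]
target = distillation_target([teacher_a, teacher_b], "GCAT")
print(target)  # [1.0, 0.5, 0.0, 0.0]

augmented = augment("GCATGCAT", random.Random(0))
augmented_target = distillation_target([teacher_a, teacher_b], augmented)
```

Because the student is fitted to targets like `augmented_target`, a single forward pass through it approximates what would otherwise require running every teacher in the ensemble.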