The Words of Proteins: Motif-Level Language Modeling for Interpretable Protein Generation


Abstract

What are the fundamental units of protein sequences? Most protein language models treat amino acids as tokens, yet biological functions are not encoded at the single-residue level. Instead, they emerge from combinations of residues that form functional units, corresponding to conserved sequence motifs. Just as modern language models operate on learned sub-word units rather than characters, we argue that explicitly modeling at the functional motif level provides both mechanistic insight into sequence-function relationships and interpretable control over protein generation. We demonstrate this framework on copper- and nickel-binding metalloproteins, whose coordination chemistry is similar at the structural level, yet how sequence controls binding specificity remains underexplored. We develop a three-step workflow: (1) construction of a dictionary of conserved subsequence motifs; (2) prediction of the next motif and inter-motif gaps; and (3) sequence infilling given the predicted anchor motifs. Amino acid enrichment analysis confirms that extracted motifs capture known metal-coordination chemistry (His, Cys, Met). Compared to random masking, motif-guided generation improves Conserved Domain Database annotation rates with greater target-family enrichment. Compared to a fine-tuned ProtGPT2, which uses BPE tokenization, our proposed functional motif-level approach achieves more specific hits on metalloprotein families and substantially reduces off-target annotations. Generation output directly mirrors dictionary composition, demonstrating that motif vocabularies provide explicit control over generation scope. AlphaFold 3 structure prediction with explicit metal ions confirms plausible coordination geometry, supporting the presence of functional binding sites in generated sequences. Our results demonstrate that conserved sequence motifs carry functional information. Functional motif-level modeling enables interpretable, controllable protein generation, an important step toward compositional design of novel protein functions.

One Introduction

Protein language models (PLMs) have revolutionized computational biology by learning rich representations of protein sequences through self-supervised learning on large protein databases. Inspired by autoregressive and masked language models, PLMs like ESM-2 and ESM-3 treat proteins as sequences of amino acid tokens and learn to predict masked or next residues (the amino acids incorporated into proteins). These models have achieved remarkable success in protein sequence generation.

However, a fundamental disparity exists between how PLMs represent proteins and how Nature encodes biological functions. Biochemical research has established that protein function arises from specific arrangements of residues that form functional sites, binding pockets, and catalytic centers. These functional units correspond to evolutionarily conserved sequence motifs that encode essential functions.

This mismatch between representation granularity and functional encoding creates challenges for controllable generation of protein sequences with specific functions. When using masked PLMs like ESM-3 for protein design, users must specify which positions to mask. Without an understanding of functional motifs, however, random or naive masking may disrupt critical functional regions.

Some PLMs have moved beyond single-residue tokens. This parallels the evolution of natural language processing, where modern language models use learned subword units such as Byte Pair Encoding or WordPiece rather than raw characters. For example, ProtGPT2 employs BPE tokenization, requiring the model to implicitly learn functional motifs and evolutionary constraints from raw sequence data. However, BPE constructs its vocabulary through unsupervised merging of frequently co-occurring character pairs, yielding tokens that reflect corpus statistics rather than biological function.
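To make the contrast concrete, the following is a minimal sketch of a single BPE merge step applied to amino acid strings. It illustrates the generic BPE idea only and is not ProtGPT2's tokenizer; the function names and toy sequences are hypothetical.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Return the adjacent token pair that occurs most often in the corpus."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy corpus (hypothetical): each sequence starts as single-residue tokens.
corpus = [list("AALLAA"), list("LLAAL")]
pair = most_frequent_pair(corpus)           # chosen purely by frequency
corpus = [merge_pair(seq, pair) for seq in corpus]
```

The merge criterion here is frequency alone, which is why the resulting tokens track corpus statistics and carry no guarantee of functional meaning.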

Here we propose a different principle for vocabulary construction: grounding tokens in evolutionary signal rather than corpus statistics. We extract conserved subsequences from a target protein family and apply contrastive filtering against a background proteome, retaining motifs that are enriched within the family while discarding those that are ubiquitous. This procedure can be viewed as a form of implicit supervision from evolutionary selection, where motifs preserved within a family serve as positive examples of functional units and broadly distributed motifs act as negatives.
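The contrastive filtering idea can be sketched as follows. This is an illustration under stated assumptions, not the extraction procedure used in this work: it assumes fixed-length k-mers as candidate motifs and simple presence-fraction thresholds, and the names `kmer_presence`, `build_motif_dictionary`, `min_family_frac`, and `max_background_frac` are hypothetical.

```python
from collections import Counter

def kmer_presence(seqs, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def build_motif_dictionary(family, background, k=4,
                           min_family_frac=0.5, max_background_frac=0.05):
    """Keep k-mers common in the family (positives) and rare in the background (negatives)."""
    fam = kmer_presence(family, k)
    bg = kmer_presence(background, k)
    motifs = []
    for kmer, n in fam.items():
        fam_frac = n / len(family)
        bg_frac = bg.get(kmer, 0) / len(background)
        if fam_frac >= min_family_frac and bg_frac <= max_background_frac:
            motifs.append(kmer)
    return sorted(motifs)

# Toy data (hypothetical): "AAAA" is conserved in the family but also
# ubiquitous in the background, so the filter discards it.
family = ["AAAAHCGGC", "HCGGCAAAA", "AAAAGHCGGC"]
background = ["MKAAAALE", "TTAAAAVL", "GLKKVPPS"]
print(build_motif_dictionary(family, background))
```

In this toy case the retained entries are the His/Cys-containing k-mers, while the family-conserved but background-ubiquitous k-mer is dropped, which is the behavior the contrastive step is meant to provide.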

Beyond tokenization, prior generative protein design methods that leverage motifs typically use them as conditioning signals or structural anchors, often relying on structural information or learned embeddings without redefining the sequence representation itself. Our approach instead redefines the modeling unit by treating conserved sequence motifs as discrete tokens and trains an autoregressive transformer to predict the next motif token without any structural input, directly embedding functional and evolutionary priors into the generative process.
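As a rough illustration of what treating motifs as discrete tokens could look like, the sketch below converts a residue string into a sequence of (gap length, motif) pairs. The greedy left-to-right, longest-match rule is an assumption made for this example; the function name and the toy motif list are hypothetical.

```python
def tokenize_by_motifs(seq, motifs):
    """Greedy left-to-right, longest-match motif tokenization (an assumed rule).

    Returns (tokens, trailing_gap), where tokens is a list of
    (gap_length_before_motif, motif) pairs.
    """
    by_len = sorted(motifs, key=len, reverse=True)   # prefer longer motifs
    tokens, i, gap_start = [], 0, 0
    while i < len(seq):
        hit = next((m for m in by_len if seq.startswith(m, i)), None)
        if hit is None:
            i += 1                                   # residue belongs to a gap
            continue
        tokens.append((i - gap_start, hit))
        i += len(hit)
        gap_start = i
    return tokens, len(seq) - gap_start

# Toy example (hypothetical motifs and sequence).
tokens, trailing = tokenize_by_motifs("MKHCGGCLEAHHHHT", ["HCGGC", "HHHH", "HCGG"])
print(tokens, trailing)
```

An autoregressive model trained on such token sequences would then predict the next motif and the gap preceding it, rather than the next residue.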

This functional motif-level modeling provides several advantages for functional protein generation: (1) Interpretability: a vocabulary of motifs is inherently more interpretable than a vocabulary of amino acids, which lets researchers rationally design proteins with specific functions rather than treating the model as a black box. (2) Controllability: by specifying which protein family's motifs to include, users can explicitly control the functional properties of generated proteins. The motif dictionary defines the space of possible functions. (3) Efficiency: learning dependencies between functional units improves sample efficiency and achieves generation with functional precision.

We demonstrate our paradigm on metalloproteins, focusing on copper-binding (Cu(I) and Cu(II)) and nickel-binding (Ni(II)) protein families. Metalloproteins present an ideal test case for several reasons. First, metal coordination is a well-characterized function with known sequence determinants: residues such as histidine, cysteine, and methionine are established metal-coordinating amino acids. This provides ground truth for validating whether our extracted motifs capture biologically relevant features. Second, despite sharing similar coordination chemistry and binding pocket geometry at the structural level, how different metalloprotein subfamilies employ distinct sequence patterns to determine metal specificity remains elusive.

We develop a three-component workflow for motif-level protein generation: (a) family-specific motif dictionary construction, where we extract conserved subsequences from target protein families and filter against a background proteome to retain only function-specific motifs; (b) a motif-level language model, where we train a transformer to predict the next motif and inter-motif gap length, learning the grammar of how functional units combine within proteins; and (c) scaffold generation and infilling, where the transformer generates scaffolds with motifs as anchors and ESM-3 fills the intervening gaps while preserving functional constraints.
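The hand-off between scaffold generation and infilling can be sketched as below: a scaffold of (gap length, motif) pairs is expanded into a template in which motif residues are fixed and gap positions are masked. This is a minimal sketch under assumptions; the `_` placeholder stands in for whatever mask token the infilling model expects, and both function names are hypothetical.

```python
def scaffold_to_template(scaffold, trailing_gap=0, mask="_"):
    """Expand (gap_length, motif) pairs into a string with masked gap positions."""
    parts = []
    for gap, motif in scaffold:
        parts.append(mask * gap)    # positions left for the infilling model
        parts.append(motif)         # anchor residues that must stay fixed
    parts.append(mask * trailing_gap)
    return "".join(parts)

def anchors_preserved(template, filled, mask="_"):
    """Check that an infilled sequence kept every anchor residue unchanged."""
    return len(template) == len(filled) and all(
        t == mask or t == f for t, f in zip(template, filled)
    )

# Toy scaffold (hypothetical): two anchor motifs separated by gaps.
template = scaffold_to_template([(2, "HCGGC"), (3, "HHHH")], trailing_gap=1)
print(template)
print(anchors_preserved(template, "MKHCGGCLEAHHHHT"))
```

A check like `anchors_preserved` is one simple way to confirm that infilling altered only the gaps and left the functional anchors intact.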

Analysis demonstrates that our motif dictionary on metalloproteins is enriched in known metal-coordinating residues (His, Cys, Met), confirming that unsupervised extraction recovers genuine functional features. In generation experiments, motif-guided masking achieves higher functional recovery than random masking, and our functional motif-level generation outperforms a BPE-tokenized model (ProtGPT2), with more target-family hits and fewer off-target annotations. Notably, the transformer's output distribution mirrors dictionary composition, demonstrating predictable and interpretable control of generation. AlphaFold 3 predicted structures confirm plausible metal coordination geometry in generated sequences.
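One simple way to quantify residue enrichment is a log2 ratio of amino acid frequencies in the motif dictionary versus the background. The statistic used in this work is not specified in this section, so the pseudocounted log2 ratio below is an assumption, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_frequencies(seqs, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Pseudocounted residue frequencies over a set of sequences."""
    counts = Counter("".join(seqs))
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

def log2_enrichment(motifs, background, alphabet=AMINO_ACIDS):
    """Positive values mean the residue is over-represented in the motifs."""
    fm = residue_frequencies(motifs, alphabet)
    fb = residue_frequencies(background, alphabet)
    return {a: math.log2(fm[a] / fb[a]) for a in alphabet}

# Toy data (hypothetical): His/Cys-rich motifs against a generic background.
scores = log2_enrichment(["HCGGC", "HHHH", "MHCH"],
                         ["MKALLEGGSTVAAL", "PPGGSSTTKKVLAE"])
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

The pseudocount keeps the ratio finite for residues absent from one set; with real data a significance test would normally accompany the ratio.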

With this paper, we make the following contributions:

One. Conceptual reframing: We propose modeling proteins at the functional motif ("meaningful word") level. Like BPE in NLP, our work modifies the tokenization scheme, not the architecture or training objective. The distinction lies in the vocabulary: BPE optimizes for corpus compression; we optimize for functional specificity via evolutionary conservation.

Two. Practical workflow: We develop an end-to-end pipeline for motif-level protein generation, integrating dictionary construction, transformer-based scaffold generation, and PLM-based infilling.

Three. Interpretability demonstration: We demonstrate on metalloproteins that this approach provides a form of interpretability absent in end-to-end PLMs: dictionary composition directly predicts generation output, giving users explicit, inspectable control over functional scope.

Two Related Work

Three Method and Experimental Setup

Background Proteome. A non-metal background proteome was assembled from the RCSB Protein Data Bank using the same RCSB

Three point two Dictionary creation

Three point three Validating utility of dictionary

Three point four Model Training

Three point five Protein Sequence Generation

Three point six Generated Protein Sequence Evaluation

Four Results and Discussion

Four point two Generation Results

Four point three Structural Validation

Five Conclusion

Six Limitations and Ethical Considerations

Seven GenAI Disclosure

B Model training

D Generated sequences used for structural validation

Overview

The study presents a framework for motif-level language modeling to improve protein generation, emphasizing the importance of conserved sequence motifs in encoding biological functions. By applying this model to metalloproteins, the authors demonstrate improved specificity and interpretability in generated sequences compared to traditional methods.

Key Points

1. Modeling proteins at the functional motif level enables better interpretability and controllability in protein generation.
2. The proposed workflow includes motif dictionary construction, next motif prediction, and sequence infilling.
3. Results show enhanced performance in generating metalloproteins with accurate functional properties.
4. Conserved sequence motifs significantly improve the generation outcomes compared to traditional amino acid-level models.

Details

Authors: Anonymous Author(s)
Category: Biology and Natural Sciences
