The Words of Proteins: Motif-Level Language Modeling for Interpretable Protein Generation
Abstract
What are the fundamental units of protein sequences? Most protein language models treat amino acids as tokens, yet biological functions are not encoded at the single-residue level. Instead, they emerge from combinations of residues that form functional units, corresponding to conserved sequence motifs. Just as modern language models operate on learned sub-word units rather than characters, we argue that explicitly modeling at the functional motif level provides both mechanistic insight into sequence-function relationships and interpretable control over protein generation. We demonstrate this framework on copper- and nickel-binding metalloproteins, which share similar coordination chemistry at the structural level, yet for which the way sequence controls binding specificity remains underexplored. We develop a three-step workflow: (1) construction of a dictionary of conserved subsequence motifs; (2) prediction of the next motif and inter-motif gaps; and (3) sequence infilling given the predicted anchor motifs. Amino acid enrichment analysis confirms that the extracted motifs capture known metal-coordination chemistry (His, Cys, Met). Compared to random masking, motif-guided generation improves Conserved Domain Database annotation rates with greater target-family enrichment. Compared to a finetuned ProtGPT2 that uses BPE tokenization, our functional motif-level approach achieves more specific hits on metalloprotein families and substantially reduces off-target annotations. Generation output directly mirrors dictionary composition, demonstrating that motif vocabularies provide explicit control over generation scope. AlphaFold 3 structure prediction with explicit metal ions confirms plausible coordination geometry, validating functional binding sites in generated sequences. Our results demonstrate that conserved sequence motifs carry functional information. Functional motif-level modeling enables interpretable, controllable protein generation, an important step toward compositional design of novel protein functions.
1 Introduction
Protein language models (PLMs) have revolutionized computational biology by learning rich representations of protein sequences through self-supervised learning on large protein databases. Inspired by autoregressive and masked language models, PLMs such as ESM-2 and ESM-3 treat proteins as sequences of amino acid tokens and learn to predict masked or next residues (the amino acids incorporated into proteins). These models have achieved remarkable success in protein sequence generation.
However, a fundamental disparity exists between how PLMs represent proteins and how nature encodes biological functions. Biochemical research has established that protein function arises from specific arrangements of residues that form functional sites, binding pockets, and catalytic centers. These functional units correspond to evolutionarily conserved sequence motifs that encode essential functions.
This mismatch between representation granularity and functional encoding creates challenges for controllable generation of protein sequences with specific functions. When using masked PLMs such as ESM-3 for protein design, users must specify which positions to mask. Without knowledge of functional motifs, however, random or naive masking may disrupt critical functional regions.
Some PLMs have moved beyond single-residue tokens. This parallels the evolution of natural language processing, where modern language models use learned subword units such as Byte Pair Encoding or WordPiece rather than raw characters. For example, ProtGPT2 employs BPE tokenization, requiring the model to implicitly learn functional motifs and evolutionary constraints from raw sequence data. However, BPE constructs its vocabulary through unsupervised merging of frequently co-occurring character pairs, yielding tokens that reflect corpus statistics rather than biological function.
Here we propose a different principle for vocabulary construction: grounding tokens in evolutionary signal rather than corpus statistics. We extract conserved subsequences from a target protein family and apply contrastive filtering against a background proteome, retaining motifs that are enriched within the family while discarding those that are ubiquitous. This procedure can be viewed as a form of implicit supervision from evolutionary selection, where motifs preserved within a family serve as positive examples of functional units and broadly distributed motifs act as negatives.
Beyond tokenization, prior generative protein design methods that leverage motifs typically use them as conditioning signals or structural anchors, often relying on structural information or learned embeddings without redefining the sequence representation itself. Our approach instead redefines the modeling unit by treating conserved sequence motifs as discrete tokens and training an autoregressive transformer to predict the next motif token without any structural input, directly embedding functional and evolutionary priors into the generative process.
This functional motif-level modeling provides several advantages for functional protein generation: (1) Interpretability: a vocabulary of motifs is inherently more interpretable than a vocabulary of amino acids, which lets researchers design proteins with specific functions rationally rather than work with a black box. (2) Controllability: by specifying which protein families' motifs to include, users can explicitly control the functional properties of generated proteins. The motif dictionary defines the space of possible functions. (3) Efficiency: learning dependencies between functional units improves sample efficiency and achieves generation with functional precision.
We demonstrate our paradigm on metalloproteins, focusing on copper-binding (Cu(I) and Cu(II)) and nickel-binding (Ni(II)) protein families. Metalloproteins present an ideal test case for two reasons. First, metal coordination is a well-characterized function with known sequence determinants: residues such as histidine, cysteine, and methionine are established metal-coordinating amino acids. This provides ground truth for validating whether our extracted motifs capture biologically relevant features. Second, although metalloprotein subfamilies share similar coordination chemistry and binding pocket geometry at the structural level, how they employ distinct sequence patterns to determine metal specificity remains elusive.
We develop a three-component workflow for motif-level protein generation: (a) family-specific motif dictionary construction, where we extract conserved subsequences from target protein families and filter against a background proteome to retain only function-specific motifs; (b) a motif-level language model, where we train a transformer to predict the next motif and inter-motif gap length, learning the grammar of how functional units combine within proteins; and (c) scaffold generation and infilling, where the transformer generates scaffolds with motifs as anchors and ESM-3 fills the intervening gaps while preserving functional constraints.
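To make the hand-off between the scaffold generator and the infilling model concrete, the sketch below turns a predicted list of (gap length, motif) pairs into a masked template. It is an illustration under stated assumptions: the `(gap, motif)` representation, the `_` mask character, and both function names are hypothetical, and no infilling model is actually called.

```python
def assemble_scaffold(units, mask="_"):
    """Build a masked template from (gap_before, motif) pairs.

    Returns the template string and a per-position flag that is True
    where the residue belongs to an anchor motif and must stay fixed.
    """
    parts, fixed = [], []
    for gap, motif in units:
        parts.append(mask * gap + motif)
        fixed.extend([False] * gap + [True] * len(motif))
    return "".join(parts), fixed

def anchors_preserved(template, filled, mask="_"):
    """Check that an infilled sequence left every anchor residue untouched."""
    if len(template) != len(filled):
        return False
    return all(t == mask or t == f for t, f in zip(template, filled))

# Hypothetical scaffold: 2 free positions, a motif, 4 free positions, a motif.
units = [(2, "HCAH"), (4, "MTCG")]
template, fixed = assemble_scaffold(units)
print(template)  # __HCAH____MTCG
```

An infilling model would then replace only the masked positions. `anchors_preserved` is the kind of check one might run afterwards to confirm that the motif residues were kept as functional constraints.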
Our analysis shows that the motif dictionary built on metalloproteins is enriched in known metal-coordinating residues (His, Cys, Met), confirming that unsupervised extraction recovers genuine functional features. In generation experiments, motif-guided masking achieves higher functional recovery than random masking, and our functional motif-level generation outperforms a BPE-tokenized model (ProtGPT2), with more target-family hits and fewer off-target annotations. Notably, the transformer's output distribution mirrors dictionary composition, demonstrating predictable and interpretable control of generation. AlphaFold 3 predicted structures confirm plausible metal coordination geometry in generated sequences.
This paper makes the following contributions:
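One simple way to quantify residue enrichment of this kind is a log2 ratio of amino acid frequencies in the motif dictionary versus a background set. The sketch below assumes that statistic, a pseudocount of 1, and toy inputs; the text here does not specify which enrichment measure the analysis actually uses.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def residue_enrichment(motifs, background, pseudocount=1.0):
    """log2(frequency in motifs / frequency in background) per amino acid.

    The pseudocount avoids division by zero for residues absent from a set.
    """
    m = Counter("".join(motifs))
    b = Counter("".join(background))
    m_total = sum(m[a] for a in ALPHABET) + pseudocount * len(ALPHABET)
    b_total = sum(b[a] for a in ALPHABET) + pseudocount * len(ALPHABET)
    return {
        a: math.log2(((m[a] + pseudocount) / m_total)
                     / ((b[a] + pseudocount) / b_total))
        for a in ALPHABET
    }

# Toy inputs (made up for illustration only).
enrichment = residue_enrichment(["HCAH", "CHHM"], ["AAGGLLKK", "LLEEGGAA"])
for aa in sorted(enrichment, key=enrichment.get, reverse=True)[:3]:
    print(aa, round(enrichment[aa], 2))
```

Positive values indicate residues over-represented in the motifs relative to the background. With real data one would also want a significance test, which this sketch omits.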
1. Conceptual reframing: We propose modeling proteins at the functional motif ("meaningful word") level. Like BPE in NLP, our work modifies the tokenization scheme, not the architecture or training objective. The distinction lies in the vocabulary: BPE optimizes for corpus compression; we optimize for functional specificity via evolutionary conservation.
2. Practical workflow: We develop an end-to-end pipeline for motif-level protein generation, integrating dictionary construction, transformer-based scaffold generation, and PLM-based infilling.
3. Interpretability demonstration: We demonstrate on metalloproteins that this approach provides a form of interpretability absent in end-to-end PLMs: dictionary composition directly predicts generation output, giving users explicit, inspectable control over functional scope.