nature
ARTICLES
Evolutionary information for specifying a protein fold
Classical studies show that for many proteins, the information required for specifying the tertiary structure is contained in the amino acid sequence. Here, we attempt to define the sequence rules for specifying a protein fold by computationally creating artificial protein sequences using only statistical information encoded in a multiple sequence alignment and no tertiary structure information. Experimental testing of libraries of artificial WW domain sequences shows that a simple statistical energy function capturing coevolution between amino acid residues is necessary and sufficient to specify sequences that fold into native structures. The artificial proteins show thermodynamic stabilities similar to natural WW domains, and structure determination of one artificial protein shows excellent agreement with the WW fold at atomic resolution. The relative simplicity of the information used for creating sequences suggests a marked reduction in the potential complexity of the protein-folding problem.
A fundamental tenet of biochemistry is that the amino acid sequence of a protein specifies its atomic structure and biochemical function. But exactly what information in the sequence of a protein is necessary and sufficient for producing the fold and its biological activity? Despite considerable progress in understanding the mechanisms of protein folding, the answer to this fundamental question remains unknown. The main problem is the vast potential complexity of cooperative interactions between amino acids, in which the free energy contribution of one residue depends on those of other residues. These amino acid couplings could be pairwise and local in the three-dimensional structure, but could also involve more complex cooperativities in which collections of residues interact through three-way or higher-order couplings. Given that protein structures are typically compact and well packed, proteins could be dense and complex networks of inter-atomic interactions, requiring specification of a great number of mutual constraints between amino acid positions to define the fold.
An approach to defining the architecture of amino acid interactions in proteins is suggested by an evolution-based method known as statistical coupling analysis (SCA). This method postulates that regardless of spatial location or underlying mechanism, the conserved functional coupling of sites in a protein should drive their mutual coevolution. Given a sufficiently large and diverse multiple sequence alignment of a protein family, the mutual dependencies should be evident in the conserved statistical correlations between amino acid distributions at sites. Application of the SCA to several different protein families supports two general conclusions: first, the global pattern of coevolutionary interactions is sparse, so that a small set of positions mutually coevolves among a majority that are largely decoupled; second, the strongly coevolving residues are spatially organized into physically connected networks linking distant functional sites in the structure through packing interactions. Studies involving directed mutagenesis, structure determination, NMR dynamics, computational modelling, and literature study implicate these networks of coevolving residues in contributing to core aspects of protein function.
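The idea of reading statistical dependencies between sites directly from an alignment can be illustrated with a minimal sketch. The code below does not implement the SCA; it substitutes plain mutual information between two alignment columns as a simple stand-in measure of covariation. The function names and the four-sequence toy alignment are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in nats) between alignment columns i and j.

    This is a stand-in for a coevolution score, not the SCA measure itself.
    """
    n = len(alignment)
    counts_i = Counter(column(alignment, i))
    counts_j = Counter(column(alignment, j))
    counts_ij = Counter(zip(column(alignment, i), column(alignment, j)))
    mi = 0.0
    for (a, b), n_ab in counts_ij.items():
        p_ab = n_ab / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 0 and 1 covary perfectly,
# column 2 varies independently of column 0, column 3 is invariant.
toy_alignment = ["AKLW", "AKMW", "GELW", "GEMW"]

coupled = mutual_information(toy_alignment, 0, 1)
independent = mutual_information(toy_alignment, 0, 2)
invariant = mutual_information(toy_alignment, 0, 3)
```

In this toy case only the first pair of columns carries shared information; the independently varying column and the invariant column both score zero against column 0. A perfectly conserved site cannot show covariation under this measure, which is one reason a practical coevolution analysis needs a large and diverse alignment.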
More surprising, however, is the finding of sparseness. The SCA implies an unexpected degree of simplicity in amino acid interactions, with far fewer important constraints between residue pairs than would be expected from inspection of the atomic structure. Indeed, as has been pointed out, the evolution-based mapping of amino acid interactions does not look like the contact graph of the protein structure; many direct packing interactions show coevolution scores close to zero, and some distant sites linked through networks of coevolving residues are predicted to be coupled. Thus, the SCA mapping provides a picture of proteins as sparsely coupled architectures with redundant strong constraints linking a few sites, and a great deal of near-independent variation at most sites.
To test the overall hypothesis, we reasoned that if the information contained in the SCA is a good estimate of the total sequence information for specifying a protein, it should be possible to computationally build artificial members of the protein family using no information except the SCA-based parameters of sequence conservation and coupling. In principle, these artificial sequences should fold into a structure representative of the family, and should function in a manner indistinguishable from their natural counterparts. In this and the accompanying paper, we test this hypothesis in a computationally and experimentally facile model system, the WW domain.
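As a rough illustration of building artificial sequences from alignment statistics alone, the sketch below draws each position independently from the amino acid frequencies observed in the corresponding alignment column. It captures only site-specific conservation and deliberately omits coupling between sites, so it is not the design procedure tested here. The function names, the toy alignment, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import Counter


def site_frequencies(alignment):
    """Per-column amino acid frequencies for equal-length aligned sequences."""
    n = len(alignment)
    length = len(alignment[0])
    profiles = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        profiles.append({aa: c / n for aa, c in counts.items()})
    return profiles


def sample_sequence(profiles, rng):
    """Draw one sequence, sampling every site independently of the others."""
    residues = []
    for profile in profiles:
        amino_acids = list(profile)
        weights = [profile[aa] for aa in amino_acids]
        residues.append(rng.choices(amino_acids, weights=weights, k=1)[0])
    return "".join(residues)


# Hypothetical toy alignment; a real protein family alignment is far larger.
toy_alignment = ["AKLW", "AKMW", "GELW", "GEMW"]
profiles = site_frequencies(toy_alignment)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
artificial = [sample_sequence(profiles, rng) for _ in range(5)]
```

Sequences drawn this way reproduce the conservation pattern of each column on average but carry no information about which residues tend to occur together. A sampler that also honoured coupling would need an additional constraint on the joint statistics of pairs of sites.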