nature microbiology
A new view of the tree of life
The tree of life is one of the most important organizing principles in biology. Gene surveys suggest the existence of an enormous number of branches, but even an approximation of the full scale of the tree has remained elusive. Recent depictions of the tree of life have focused either on the nature of deep evolutionary relationships or on the known, well-classified diversity of life with an emphasis on eukaryotes. These approaches overlook the dramatic change in our understanding of life's diversity resulting from genomic sampling of previously unexamined environments. New methods to generate genome sequences illuminate the identity of organisms and their metabolic capacities, placing them in community and ecosystem contexts. Here, we use new genomic data from over one thousand uncultivated and little-known organisms, together with published sequences, to infer a dramatically expanded version of the tree of life, with Bacteria, Archaea and Eukarya included. The depiction is both a global overview and a snapshot of the diversity within each major lineage. The results reveal the dominance of bacterial diversification and underline the importance of organisms lacking isolated representatives, with substantial evolution concentrated in a major radiation of such organisms. This tree highlights major lineages currently underrepresented in biogeochemical models and identifies radiations that are probably important for future evolutionary analyses.
Early approaches to describe the tree of life distinguished organisms based on their physical characteristics and metabolic features. Molecular methods dramatically broadened the diversity that could be included in the tree because they circumvented the need for direct observation and experimentation by relying on sequenced genes as markers for lineages. Gene surveys, typically using the small subunit ribosomal RNA gene, provided a remarkable and novel view of the biological world, but questions about the structure and extent of diversity remain. Organisms from novel lineages have eluded surveys, because many are invisible to these methods due to sequence divergence relative to the primers commonly used for gene amplification. Furthermore, unusual sequences, including those with unexpected insertions, may be discarded as artefacts.
Whole genome reconstruction was first accomplished in nineteen ninety-five, followed by a near-exponential increase in the number of draft genomes reported each subsequent year. There are thirty thousand four hundred thirty-seven genomes from all three domains of life (Bacteria, Archaea and Eukarya) currently available in the Joint Genome Institute's Integrated Microbial Genomes database.
Contributing to this expansion in genome numbers are single cell genomics and metagenomics studies. Metagenomics is a shotgun sequencing-based method in which DNA isolated directly from the environment is sequenced, and the reconstructed genome fragments are assigned to draft genomes. New bioinformatics methods yield complete and near-complete genome sequences, without a reliance on cultivation or reference genomes. These genome- (rather than gene) based approaches provide information about metabolic potential and a variety of phylogenetically informative sequences that can be used to classify organisms. Here, we have constructed a tree of life by making use of genomes from public databases and one thousand eleven newly reconstructed genomes that we recovered from a variety of environments.
To render this tree of life, we aligned and concatenated a set of sixteen ribosomal protein sequences from each organism. This approach yields a higher-resolution tree than is obtained from a single gene, such as the widely used sixteen S rRNA gene. The use of ribosomal proteins avoids artefacts that would arise from phylogenies constructed using genes with unrelated functions and subject to different evolutionary processes. Another important advantage of the chosen ribosomal proteins is that they tend to be syntenic and co-located in a small genomic region in Bacteria and Archaea, reducing binning errors that could substantially perturb the geometry of the tree. Included in this tree is one representative per genus for all genera for which high-quality draft and complete genomes exist (three thousand eighty-three organisms in total).
Despite the methodological challenges, we have included representatives of all three domains of life. Our primary focus relates to the status of Bacteria and Archaea, as these organisms have been most difficult to profile using macroscopic approaches, and substantial progress has been made recently with acquisition of new genome sequences. The placement of Eukarya relative to Bacteria and Archaea is controversial. Eukaryotes are believed to be evolutionary chimaeras that arose via endosymbiotic fusion, probably involving bacterial and archaeal cells. Here, we do not attempt to confidently resolve the placement of the Eukarya. We position them using sequences of a subset of their nuclear-encoded ribosomal proteins, an approach that classifies them based on the inheritance of their information systems as opposed to lipid or other cellular structures.
Figure one presents a new view of the tree of life. This is one of a relatively small number of three-domain trees constructed from molecular information so far, and the first comprehensive tree to be published since the development of genome-resolved metagenomics. We highlight all major lineages with genomic representation, most of which are phylum-level branches. However, we separately identify the Classes of the Proteobacteria, because the phylum is not monophyletic.
The tree in Figure one recapitulates expected organism groupings at most taxonomic levels and is largely congruent with the tree calculated using traditional small subunit ribosomal RNA gene sequence information. The support values for taxonomic groups are strong at the Species through Class levels (greater than eighty-five percent), with moderate-to-strong support for Phyla (greater than seventy-five percent in most cases), but the branching order of the deepest branches cannot be confidently resolved. The lower support for deep branch placements is a consequence of our prioritization of taxon sampling over number of genes used for tree construction. As proposed recently, the Eukarya, a group that includes protists, fungi, plants and animals, branches within the Archaea, specifically within the TACK superphylum and sibling to the Lokiarchaeota. Interestingly, this placement is not evident in the small subunit ribosomal RNA tree, which has the three-domain topology proposed by Woese and co-workers in nineteen ninety. The two-domain Eocyte tree and the three-domain tree are competing hypotheses for the origin of Eukarya; further analyses to resolve these and other deep relationships will be strengthened with the availability of genomes for a greater diversity of organisms. Important advantages of the ribosomal protein tree compared with the small subunit ribosomal RNA gene tree are that it includes organisms with incomplete or unavailable small subunit ribosomal RNA gene sequences and more strongly resolves the deeper radiations. Ribosomal proteins have been shown to contain compositional biases across the three domains, driven by thermophilic, mesophilic and halophilic lifestyles as well as by a primitive genetic code. Continued expansion of the number of genome sequences for non-extremophile Archaea, such as the DPANN lineages, may allow clarification of these compositional biases.
A striking feature of this tree is the large number of major lineages without isolated representatives (red dots in Figure one). Many of these lineages are clustered together into discrete regions of the tree. Of particular note is the Candidate Phyla Radiation, highlighted in purple in Figure one. Based on information available from hundreds of genomes from genome-resolved metagenomics and single-cell genomics methods to date, all members have relatively small genomes and most have somewhat (if not highly) restricted metabolic capacities. Many are inferred (and some have been shown) to be symbionts. Thus far, all cells lack complete citric acid cycles and respiratory chains and most have limited or no ability to synthesize nucleotides and amino acids. It remains unclear whether these reduced metabolisms are a consequence of superphylum-wide loss of capacities or if these are inherited characteristics that hint at an early metabolic platform for life. If inherited, then adoption of symbiotic lifestyles may have been a later innovation by these organisms once more complex organisms appeared.
Figure two presents another perspective, where the major lineages of the tree are defined using evolutionary distance, so that the main groups become apparent without bias arising from historical naming conventions. This depiction uses the same inferred tree as in Figure one, but with groups defined on the basis of average branch length to the leaf taxa. We chose an average branch length that best recapitulated the current taxonomy (smaller values fragmented many currently accepted phyla and larger values collapsed accepted phyla into very few lineages, see Methods). Evident in Figure two is the enormous extent of evolution that has occurred within the Candidate Phyla Radiation. The diversity within the Candidate Phyla Radiation could be a result of the early emergence of this group and/or a consequence of rapid evolution related to symbiotic lifestyles. The Candidate Phyla Radiation is early-emerging on the ribosomal protein tree (Figure one), but not in the small subunit rRNA tree (Supplementary Figure two). Regardless of branching order, the Candidate Phyla Radiation, in combination with other lineages that lack isolated representatives (red dots in Figure two), clearly comprises the majority of life's current diversity.
Domain Bacteria includes more major lineages of organisms than the other Domains. We do not attribute the smaller scope of the Archaea relative to Bacteria to sampling bias because metagenomics and single-cell genomics methods detect members of both domains equally well. Consistent with this view, Archaea are less prominent and less diverse in many ecosystems (for example, seawater, hydrothermal vents, the terrestrial subsurface and human-associated microbiomes). The lower apparent phylogenetic diversity of Eukarya is fully expected, based on their comparatively recent evolution.
The tree of life as we know it has dramatically expanded due to new genomic sampling of previously enigmatic or unknown microbial lineages. This depiction of the tree captures the current genomic sampling of life, illustrating the progress that has been made in the last two decades following the first published genome. What emerges from analysis of this tree is the depth of evolutionary history that is contained within the Bacteria, in part due to the Candidate Phyla Radiation, which appears to subdivide the domain. Most importantly, the analysis highlights the large fraction of diversity that is currently only accessible via cultivation-independent genome-resolved approaches.
Methods
A data set comprehensively covering the three domains of life was generated using publicly available genomes from the Joint Genome Institute's IMG-M database, a previously developed data set of eukaryotic genome information, previously published genomes derived from metagenomic data sets and newly reconstructed genomes from current metagenome projects (see Supplementary Table one for NCBI accession numbers). From IMG-M, genomes were sampled such that a single representative for each defined genus was selected. For phyla and candidate phyla lacking full taxonomic definition, every member of the phylum was initially included. Subsequently, these radiations were sampled to an approximate genus level of divergence based on comparison with taxonomically described phyla, thus removing strain- and species-level overlaps. Finally, initial tree reconstructions identified aberrant long-branch attraction effects placing the Microsporidia, a group of parasitic fungi, with the Korarchaeota. The Microsporidia are known to contribute long branch attraction artefacts confounding placement of the Eukarya, and were subsequently removed from the analysis.
This study includes one thousand eleven organisms from lineages for which genomes were not previously available. The organisms were present in samples collected from a shallow aquifer system, a deep subsurface research site in Japan, a salt crust in the Atacama Desert, grassland meadow soil in northern California, a carbon dioxide-rich geyser system, and two dolphin mouths. Genomes were reconstructed from metagenomes as described previously. Genomes were only included if they were estimated to be greater than seventy percent complete based on presence/absence of a suite of fifty-one single copy genes for Bacteria and thirty-eight single copy genes for Archaea. Genomes were additionally required to have consistent nucleotide composition and coverage across scaffolds, as determined using the ggkbase binning software, and to show consistent placement across both small subunit rRNA and concatenated ribosomal protein phylogenies. This contributed marker gene information for one thousand eleven newly sampled organisms, whose genomes were reconstructed for metabolic analyses to be published separately.
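The completeness filter described above can be illustrated with a minimal sketch. It assumes that completeness is estimated as the simple fraction of expected single-copy marker genes detected in a genome, which the text does not state explicitly. The marker names and both function names are hypothetical placeholders; only the counts (fifty-one bacterial, thirty-eight archaeal) and the seventy percent cutoff come from the text.

```python
def estimate_completeness(found_genes, marker_set):
    """Fraction of the expected single-copy markers detected in a genome."""
    markers = set(marker_set)
    return len(markers & set(found_genes)) / len(markers)


def passes_completeness(found_genes, marker_set, threshold=0.70):
    """True if estimated completeness is strictly greater than the threshold."""
    return estimate_completeness(found_genes, marker_set) > threshold


# Hypothetical marker identifiers; the source gives only the set sizes.
BACTERIAL_MARKERS = [f"bac_marker_{i:02d}" for i in range(1, 52)]  # 51 genes
ARCHAEAL_MARKERS = [f"arc_marker_{i:02d}" for i in range(1, 39)]   # 38 genes

bin_a = BACTERIAL_MARKERS[:40]  # 40 of 51 markers detected
bin_b = BACTERIAL_MARKERS[:35]  # 35 of 51 markers detected

for label, genes in [("bin_a", bin_a), ("bin_b", bin_b)]:
    fraction = estimate_completeness(genes, BACTERIAL_MARKERS)
    print(label, round(fraction, 3), passes_completeness(genes, BACTERIAL_MARKERS))
```

The additional requirements in the text (consistent nucleotide composition, coverage and phylogenetic placement) are not modelled here.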
The concatenated ribosomal protein alignment was constructed as described previously. In brief, the sixteen ribosomal protein data sets (ribosomal proteins L two, L three, L four, L five, L six, L fourteen, L sixteen, L eighteen, L twenty-two, L twenty-four, S three, S eight, S ten, S seventeen and S nineteen) were aligned independently using MUSCLE version three point eight point three one. Alignments were trimmed to remove ambiguously aligned C and N termini as well as columns composed of more than ninety-five percent gaps. Taxa were removed if their available sequence data represented less than fifty percent of the expected alignment columns (ninety percent of taxa had more than eighty percent of the expected alignment columns). The sixteen alignments were concatenated, forming a final alignment comprising three thousand eighty-three genomes and two thousand five hundred ninety-six amino-acid positions. A maximum likelihood tree was constructed using RAxML version eight point one point two four, as implemented on the CIPRES web server, under the LG plus gamma model of evolution (PROTGAMMALG in the RAxML model section), with the number of bootstraps determined automatically (MRE-based bootstopping criterion). A total of one hundred fifty-six bootstrap replicates were conducted under the rapid bootstrapping algorithm, with one hundred sampled to generate proportional support values. The full tree inference required three thousand eight hundred forty computational hours on the CIPRES supercomputer.
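The gap trimming, concatenation and taxon filtering steps can be sketched as follows. This is an illustration on toy data, not the authors' pipeline: alignments are assumed to be plain dictionaries mapping taxon name to an aligned sequence with "-" for gaps, all function names are hypothetical, and the removal of ambiguously aligned termini is not modelled.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.95):
    """Drop columns in which more than max_gap_fraction of sequences are gaps."""
    names = list(alignment)
    width = len(alignment[names[0]])
    keep = [
        col for col in range(width)
        if sum(alignment[n][col] == "-" for n in names) / len(names) <= max_gap_fraction
    ]
    return {n: "".join(alignment[n][col] for col in keep) for n in names}


def concatenate(alignments):
    """Join per-protein alignments; taxa missing a protein are padded with gaps."""
    taxa = sorted(set().union(*alignments))
    combined = {t: "" for t in taxa}
    for aln in alignments:
        width = len(next(iter(aln.values())))
        for t in taxa:
            combined[t] += aln.get(t, "-" * width)
    return combined


def drop_sparse_taxa(alignment, min_fraction=0.5):
    """Keep taxa with sequence data in at least min_fraction of the columns."""
    return {
        t: seq for t, seq in alignment.items()
        if sum(ch != "-" for ch in seq) / len(seq) >= min_fraction
    }


# Toy per-protein alignments (hypothetical sequences).
protein_1 = {"A": "MK-L", "B": "MR-L", "C": "----"}
protein_2 = {"A": "GT", "B": "GS"}  # taxon C lacks this protein

trimmed = [trim_gappy_columns(p) for p in (protein_1, protein_2)]
combined = concatenate(trimmed)
final = drop_sparse_taxa(combined)
print(final)
```

In the toy data, the all-gap column of the first protein is removed, and taxon C is dropped because it has no sequence data in the concatenated alignment.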
To construct Figure two, we collapsed branches based on an average branch length criterion. Average branch length calculations were implemented in the Interactive Tree of Life online interface using the formula:
Average branch length equals the mean of (root distance to tip) minus (root distance to node), taken over all tips connecting to a node.
We tested values between zero point two five and zero point seven five at zero point zero five intervals, and selected a final threshold of less than zero point six five based on generation of a similar number of major lineages as compared to the taxonomy-guided clustering view in Figure one. The taxonomy view identified twenty-six archaeal and seventy-four bacterial phylum-level lineages (counting the Microgenomates and Parcubacteria as single phyla each), whereas an average branch length of less than zero point six five resulted in twenty-eight archaeal and seventy-six bacterial clades.
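The average branch length criterion and the threshold scan can be sketched on a toy tree. The authors used the Interactive Tree of Life interface; the code below is only an illustrative reimplementation of the stated formula. The tree encoding (a name plus a list of child and branch-length pairs), the function names and the toy branch lengths are all assumptions. Because root distance to tip minus root distance to node equals the node-to-tip distance, the sketch sums branch lengths from the node downward.

```python
def tip_distances(node):
    """Distances from a node to each of its descendant tips."""
    name, children = node
    if not children:
        return [0.0]
    distances = []
    for child, length in children:
        distances.extend(length + d for d in tip_distances(child))
    return distances


def average_branch_length(node):
    """Mean node-to-tip distance over all tips connecting to the node."""
    distances = tip_distances(node)
    return sum(distances) / len(distances)


def collapse(node, threshold=0.65):
    """Names of the topmost clades whose average branch length is below threshold."""
    name, children = node
    if not children or average_branch_length(node) < threshold:
        return [name]
    clades = []
    for child, _ in children:
        clades.extend(collapse(child, threshold))
    return clades


# Toy tree with hypothetical branch lengths.
clade_x = ("X", [(("t1", []), 0.2), (("t2", []), 0.4)])
clade_y = ("Y", [(("t3", []), 0.5), (("t4", []), 0.7)])
root = ("root", [(clade_x, 0.5), (clade_y, 0.3)])

# Scan thresholds from 0.25 to 0.75 in steps of 0.05, as in the text.
thresholds = [round(0.25 + 0.05 * i, 2) for i in range(11)]
for t in thresholds:
    print(t, len(collapse(root, t)))
```

Smaller thresholds split the toy tree into more clades and larger thresholds merge them, which mirrors the fragmentation and collapse behaviour described for Figure two.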
For a companion SSU rRNA tree, an alignment was generated from all SSU rRNA genes available from the genomes of the organisms included in the ribosomal protein data set. For organisms with multiple SSU rRNA genes, one representative gene was kept for the analysis, selected randomly. As genome sampling was confined to the genus level, we do not anticipate this selection process will have any impact on the resultant tree. All SSU rRNA genes longer than six hundred base pairs were aligned using the SINA alignment algorithm through the SILVA web interface. The full alignment was stripped of columns containing ninety-five percent or more gaps, generating a final alignment containing one thousand eight hundred seventy-one taxa and one thousand nine hundred forty-seven alignment positions. A maximum likelihood tree was inferred as described for the concatenated ribosomal protein trees, with RAxML run using the GTRCAT model of evolution. The RAxML inference included the calculation of three hundred bootstrap iterations (extended majority rules-based bootstopping criterion), with one hundred randomly sampled to determine support values.
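The gene selection step can be sketched as below. The text does not say whether the length filter was applied before or after the random choice; this sketch assumes the filter comes first, so that a genome is represented whenever it has at least one gene longer than six hundred base pairs. The function name, the fixed seed and the toy sequences are hypothetical.

```python
import random


def select_ssu_genes(genes_by_genome, min_length=600, seed=0):
    """Pick one SSU rRNA gene per genome at random among those longer than min_length."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    chosen = {}
    for genome, sequences in genes_by_genome.items():
        candidates = [s for s in sequences if len(s) > min_length]
        if candidates:
            chosen[genome] = rng.choice(candidates)
    return chosen


# Toy input: placeholder sequences of different lengths.
genes_by_genome = {
    "genome_1": ["A" * 1500, "C" * 1480, "G" * 300],
    "genome_2": ["T" * 450],   # only a short fragment, so it is excluded
    "genome_3": ["A" * 900],
}

selected = select_ssu_genes(genes_by_genome)
for genome, seq in selected.items():
    print(genome, len(seq))
```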
To test the effect of site selection stringency on the inferred phylogenies, we stripped the alignments of columns containing fifty percent or more gaps (compared with the original trimming threshold of ninety-five percent gaps). For the ribosomal protein alignment, this resulted in a fourteen percent reduction in alignment length (to two thousand two hundred thirty-two positions) and a forty-four point six percent reduction in computational time (approximately two thousand one hundred hours). For the SSU rRNA gene alignment, stripping columns with fifty percent or greater gaps reduced the alignment by twenty-four percent (to one thousand four hundred eighty-nine positions) and the computation time by twenty-eight percent. In both cases, the topology of the tree with the best likelihood was not changed significantly. The ribosomal protein tree resolved a two-domain topology with the Eukarya sibling to the Lokiarchaeota, while the SSU rRNA tree depicted a three-domain topology. The position of the CPR as deep-branching on the ribosomal protein tree and within the Bacteria on the SSU rRNA tree was also consistent. The alignments and inferred trees under the more stringent gap stripping are available upon request.
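As a quick arithmetic check, the reported reductions can be recomputed from the alignment lengths and run time given in the Methods. The function name is hypothetical; all input numbers come from the text.

```python
def percent_reduction(before, after):
    """Percentage decrease from before to after."""
    return 100.0 * (before - after) / before


# Ribosomal protein alignment: 2,596 positions trimmed to 2,232.
protein_reduction = percent_reduction(2596, 2232)

# SSU rRNA alignment: 1,947 positions trimmed to 1,489.
ssu_reduction = percent_reduction(1947, 1489)

# Run time after a 44.6 percent reduction from 3,840 computational hours.
reduced_hours = 3840 * (1 - 0.446)

print(round(protein_reduction, 1), round(ssu_reduction, 1), round(reduced_hours))
```

The values agree with the rounded figures in the text: about fourteen percent, about twenty-four percent, and roughly two thousand one hundred hours.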
Nomenclature. We have included names for two lineages for which we have previously published complete genomes. At the time of submission of the paper describing these genomes, the reviewer community was not uniformly open to naming lineages of uncultivated organisms based on such information. Given that this practice is now widely used, we re-propose the names for these phyla. Specifically, for WWE three we suggest the name Katanobacteria from the Hebrew 'katan', which means 'small', and for SR one we suggest the name Absconditabacteria from the Latin 'Abscondo' meaning 'hidden', as in 'shrouded'.
Accession codes. NCBI and/or JGI IMG accession numbers for all genomes used in this study are listed in Supplementary Table one. Additional ribosomal protein gene and sixteen S rRNA gene sequences used in this study have been deposited in GenBank under accession numbers KU eight six eight zero eight one through KU eight six nine five two one. The concatenated ribosomal protein and SSU rRNA alignments used for tree reconstruction are included as separate files in the Supplementary Information.