Bacterial diversification through geological time
Numerous studies have estimated plant and animal diversification dynamics; however, no comparable rigorous estimates exist for bacteria, the most ancient and widespread form of life on Earth. Here, we analyse phylogenies comprising up to 448,112 bacterial lineages to reconstruct global bacterial diversification dynamics. To handle such large phylogenies, we developed methods based on the statistical properties of infinitely large trees. We further analysed sequencing data from 60 environmental studies to determine the fraction of extant bacterial diversity missing from the phylogenies, a crucial parameter for estimating speciation and extinction rates. We estimate that there are about 1.4 to 1.9 million extant bacterial lineages when lineages are defined by 99% similarity in the 16S ribosomal RNA gene, and that bacterial diversity has been continuously increasing over the past 1 billion years. Recent bacterial extinction rates are estimated at 0.03 to 0.05 per lineage per million years, and are only slightly below estimated recent bacterial speciation rates. Most bacterial lineages ever to have inhabited this planet are estimated to be extinct. Our findings disprove the notion that bacteria are unlikely to go extinct, and provide a valuable perspective on the evolutionary history of a domain of life with a sparse and cryptic fossil record.
For over 3.5 billion years, the geochemical composition of our planet has been shaped by the evolution and diversification of bacteria. Most prominently, the Great Oxygenation Event was caused by cyanobacteria roughly 2.35 billion years ago and dramatically altered Earth's surface environments and the subsequent evolution of life. Despite the prominent role of bacteria in ancient and modern biospheres, little is known about the dynamics by which their diversity evolved over Earth's history. For many eukaryotes, the fossil record provides estimates of past diversity, revealing that extant global eukaryotic diversity represents only a small fraction of the total diversity that existed in the past. Analogous estimates for bacterial diversity are lacking, largely because their fossil record is extremely poor, and thus the clades that are known are those with extant representatives. Fortunately, past diversification dynamics also leave a footprint in molecular phylogenies of extant organisms. Many approaches have been developed to infer past diversification dynamics from these patterns. Despite these methodological advances, global bacterial diversification dynamics remain largely unresolved and much less studied than eukaryotic diversification. Previous studies examined only diversification within a single bacterial genus or a single archaeal phylum, or phylogenies covering only a small and biased portion of diversity. Many of these studies do not report absolute speciation or extinction rates. Importantly, no previous study properly accounted for the incomplete sampling of bacterial diversity represented in the phylogenies. Knowledge of the 'sampling fraction', in addition to any phylogenetic information, is critical for estimating speciation and extinction rates from phylogenies, even to an order of magnitude.
Because extant global bacterial diversity has so far been largely unknown, previous studies either assumed that the number of catalogued species was exhaustive (an inaccurate assumption), used local (rather than global) diversity estimates, such as for a small quantity of soil, or estimated the unknown sampling fraction directly from the phylogeny without additional information (an impossible task). Consequently, there exists no rigorous estimate of global bacterial speciation rates, extinction rates or total diversity over time, and this uncertainty has clouded our interpretation of bacterial evolution over Earth's history. It is commonly hypothesized that bacterial extinction may not even occur at significant rates, partly due to the large population sizes and wide dispersal ranges of bacteria, while others have hypothesized that animal extinctions could cause substantial host-associated bacterial extinctions.
To address these questions, we examined bacterial phylogenies comprising up to hundreds of thousands of clades, using mathematical tools that we developed specifically for large phylogenies. To properly account for the fraction of undiscovered diversity in our methods, thus resolving a long-standing problem in bacterial phylogenetics, we independently estimated global bacterial diversity using massive DNA sequencing data from 60 studies in diverse environments across the world. To evaluate the robustness of our results, we used numerical simulations and examined several phylogenies constructed using alternative methods. Importantly, some of our phylogenies were constructed from environmental sequences retrieved using culture-independent methods, providing a less biased (and thus more suitable) representation of bacterial diversity compared with previous studies. We used our methods, as well as the independently estimated global bacterial diversity, to reconstruct global bacterial speciation, extinction and diversification rates over the past 1 billion years.
We used two time-calibrated bacterial phylogenies, called timetrees, based on the 16S ribosomal RNA (rRNA) gene, a popular marker gene in microbial ecology and evolution (448,112 and 162,371 tips, respectively; see Supplementary Table 1 for an overview and the Methods for details). We also analysed cyanobacteria alone due to their great importance to Earth's evolution, using four 16S rRNA-based timetrees constructed with various methods (586, 6,308, 6,302 and 1,579 tips, respectively). In all cases, tips in the trees represent operational taxonomic units; that is, clusters in the 16S rRNA gene delineated at 99% similarity, a common microbial 'species' measure. We stress that bacterial operational taxonomic units only provide an approximate 'species' analogue to sexually reproducing organisms, and hence 'speciation' rates reported here should a priori only be interpreted as branching frequencies in 16S rRNA sequence space.
Estimating diversification dynamics from large timetrees
Our methods were derived from standard stochastic models for cladogenesis, in which extant lineages can split or go extinct randomly and independently of each other as time proceeds. These models predict the total number of extant lineages (total diversity) at each time point, as well as the number of lineages represented in the final timetree comprising only extant and sampled taxa (lineages through time, LTT). Our methods can account for the effect of incomplete taxon sampling, as well as for speciation and extinction rates that vary over time. In contrast to most existing methods, our methods consider timetrees in the continuum limit of infinitely many lineages, which yields novel ways to extract information from timetrees. Notably, given some LTT curve, one can estimate a quantity that is related to the diversification rate at each time point, and which we refer to as the pulled diversification rate (PDR):
$$r_p = \lambda - \mu - \frac{1}{\lambda}\frac{d\lambda}{dt} \qquad (1)$$
where $t$ is time, $\lambda$ is the instantaneous speciation rate and $\mu$ is the instantaneous extinction rate. The PDR partly resembles the diversification rate ($r = \lambda - \mu$), but is modified ('pulled') by the term $\lambda^{-1}\,d\lambda/dt$, which represents the relative rate of change of $\lambda$ over time and is small when $\lambda$ varies slowly. In contrast to the diversification rate, the PDR can be estimated 'non-parametrically' from the curvature and slope of the LTT curve at any point in time. This approach does not require fitting a specific parameterized model, nor a priori assumptions on how $\lambda$ and/or $\mu$ vary over time, nor assumptions about whether the PDR or diversification rate was positive or negative. More precisely, in the continuum limit, the PDR can be calculated using the LTT for any time $t$ using the formula:
$$r_p(t) = \alpha(t) - \frac{1}{\alpha(t)}\frac{d\alpha(t)}{dt} \qquad (2)$$
where $\alpha(t) = \frac{1}{\hat{N}(t)}\frac{d\hat{N}(t)}{dt}$ is the relative slope of the LTT and $\hat{N}(t)$ is the value of the LTT at time $t$. For finite trees, equation (2) is only an estimate.
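As a rough numerical illustration of the formula above, the following sketch estimates the PDR from an LTT sampled on a uniform time grid, using central finite differences for the relative slope and its derivative. It is not the estimation procedure used in this study. The function names, the grid and all parameter values are hypothetical; the synthetic LTT assumes constant speciation and extinction rates and a sampling fraction `RHO`, and uses the standard deterministic result for that case, which is not taken from the text.

```python
import math

def estimate_pdr_from_ltt(times, ltt):
    """Return [(t, r_p)] at interior points of a uniform time grid.

    Uses r_p = alpha - (1/alpha) * d(alpha)/dt, where alpha = d ln(LTT)/dt
    is the relative slope of the LTT.
    """
    h = times[1] - times[0]
    log_n = [math.log(n) for n in ltt]
    # alpha[j] approximates the relative slope at times[j + 1].
    alpha = [(log_n[i + 1] - log_n[i - 1]) / (2 * h)
             for i in range(1, len(times) - 1)]
    pdr = []
    for j in range(1, len(alpha) - 1):
        dalpha_dt = (alpha[j + 1] - alpha[j - 1]) / (2 * h)
        pdr.append((times[j + 1], alpha[j] - dalpha_dt / alpha[j]))
    return pdr

# Hypothetical constant rates (per unit time), sampling fraction and duration.
LAM, MU, RHO, T_TOTAL, N_PRESENT = 0.5, 0.4, 0.3, 50.0, 1.0e6
R = LAM - MU

def synthetic_ltt(t):
    """Deterministic LTT for constant rates (assumed textbook result)."""
    age = T_TOTAL - t
    decay = math.exp(-R * age)  # total diversity relative to the present
    # Probability that a lineage alive at this age has a sampled descendant.
    p_sampled = R / (LAM + (R / RHO - LAM) * decay)
    return N_PRESENT * decay * p_sampled

times = [0.1 * k for k in range(501)]
ltt = [synthetic_ltt(t) for t in times]
pdr = estimate_pdr_from_ltt(times, ltt)
# With constant rates the PDR should equal lambda - mu at every time point.
max_error = max(abs(r_p - R) for _, r_p in pdr)
```

Because the rates are constant in this synthetic case, the estimated PDR should stay close to the diversification rate throughout, even though the LTT slope itself changes towards the present. Real LTTs from finite trees are noisy, so some smoothing would be needed before differentiating twice.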
Similar to the PDR, one can also estimate 'pulled' versions of other important variables, including the pulled extinction rate (PER),
$$\mu_p = \mu + \lambda_o - \lambda + \frac{1}{\lambda}\frac{d\lambda}{dt} \qquad (3)$$
and the pulled total diversity (PTD),
$$N_p = N\,\frac{\lambda_o}{\lambda} \qquad (4)$$
(estimation formulas provided in Supplementary Information section 1.3). Here, $\lambda_o$ refers to the most recent speciation rate (that is, as observed near the tips of the tree) and $N$ is the total diversity at any given point in time. The PER and PTD are equal to the extinction rate $\mu$ and the total diversity $N$, respectively, when $\lambda$ is constant ($\lambda = \lambda_o$). If $\lambda$ varies slowly ($|\lambda^{-1}\,d\lambda/dt| \ll \mu$), the recent $\mu_p$ still resembles the recent extinction rate, although the difference increases for older ages. Rapid variations in $\lambda$ and/or $\mu$ will usually lead to substantial variations in $\mu_p$ and $r_p$.
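To make the relationship between the pulled variables and their constituents concrete, the sketch below evaluates the definitions directly from known rates: PDR = λ − μ − (1/λ) dλ/dt, PER = μ + λ_o − λ + (1/λ) dλ/dt and PTD = N λ_o/λ. This is the forward direction only and is not the estimation procedure from the Supplementary Information. Function names and numerical values are hypothetical.

```python
def pulled_diversification_rate(lam, mu, dlam_dt):
    """PDR: diversification rate minus the relative rate of change of lam."""
    return lam - mu - dlam_dt / lam

def pulled_extinction_rate(lam, mu, dlam_dt, lam_o):
    """PER, where lam_o is the most recent speciation rate."""
    return mu + lam_o - lam + dlam_dt / lam

def pulled_total_diversity(n, lam, lam_o):
    """PTD: total diversity rescaled by lam_o / lam."""
    return n * lam_o / lam

# Case 1 (hypothetical values): constant speciation rate, so lam == lam_o.
per_const = pulled_extinction_rate(lam=0.05, mu=0.03, dlam_dt=0.0, lam_o=0.05)
ptd_const = pulled_total_diversity(n=1.0e6, lam=0.05, lam_o=0.05)

# Case 2 (hypothetical values): past speciation rate lower than today's
# and increasing towards the present.
pdr_var = pulled_diversification_rate(lam=0.04, mu=0.03, dlam_dt=0.001)
per_var = pulled_extinction_rate(lam=0.04, mu=0.03, dlam_dt=0.001, lam_o=0.05)
ptd_var = pulled_total_diversity(n=1.0e6, lam=0.04, lam_o=0.05)
```

In the first case the PER and PTD reduce to the extinction rate and the total diversity. In the second they differ from them, and the PER equals λ_o minus the PDR, which follows directly from the definitions.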
In contrast to conventional maximum-likelihood or Bayesian methods for estimating $\lambda$, $\mu$, $r$ and $N$, the pulled variables $\mu_p$, $r_p$ and $N_p$ can be estimated from the LTT for each past time point without any assumptions about how $\lambda$ and $\mu$ varied over time, and without fitting a specific parameterized model. Model fitting is the current de facto standard in phylogenetics-based reconstruction of diversification, and is, in fact, included in the present study. However, it requires that a parameterized form be specified beforehand for $\lambda$ and $\mu$ (for example, accounting for rate shifts at discrete time points), leading to well-known trade-offs between model realism and temporal resolution on the one hand versus model simplicity and confidence in parameter estimation on the other hand. The caveat is that $\mu_p$, $r_p$ and $N_p$ are composite variables, and in general, solely knowing $\mu_p$, $r_p$ and $N_p$ does not unambiguously determine the constituents $\mu$, $\lambda$, $r$ and $N$. This limitation can be traced back to the fact that extinction partly erases a clade's history (further discussion in Supplementary Information section 5).
As we demonstrate here, pulled variables are a powerful tool for obtaining insight into past diversification dynamics and for testing model assumptions. Using timetrees simulated under realistic scenarios, we found that pulled variables can reveal past changes in diversification rates, such as those due to mass extinction events, oscillating speciation rates and short temporary spikes in the speciation rate, as well as diversity-dependent speciation and extinction rates. In particular, our simulations revealed that changes in the speciation and/or extinction rate usually lead to similarly strong changes in $\mu_p$ and that, reciprocally, a constant $\mu_p$ over time is a strong indication that both $\lambda$ and $\mu$ were constant or varied only slowly over time (details in Supplementary Information section 4). Our simulations also revealed that the magnitude of the PDR is usually comparable to the magnitude of the diversification rate, and in fact, in all of our simulations, the two closely resembled each other. Furthermore, we found that $N_p$ provides a quick way to roughly estimate past total diversities to order-of-magnitude accuracy, provided $\lambda$ does not change drastically over time (that is, by orders of magnitude).
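For readers unfamiliar with the underlying model class, the following minimal sketch simulates total diversity under a constant-rate birth-death process, in which each extant lineage splits or goes extinct independently. It is not the simulation code used in this study: it tracks only the lineage count rather than a full timetree, and the function name, rates, starting diversity and seed are hypothetical.

```python
import random

def simulate_total_diversity(lam, mu, n0, t_max, seed=0):
    """Gillespie-style simulation of the number of extant lineages.

    lam and mu are per-lineage speciation and extinction rates;
    returns a list of (time, lineage_count) pairs.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while n > 0:
        # With n independent lineages, the waiting time to the next event
        # (a split or an extinction) is exponential with rate n * (lam + mu).
        t += rng.expovariate(n * (lam + mu))
        if t > t_max:
            break
        # The event is a split with probability lam / (lam + mu).
        n += 1 if rng.random() < lam / (lam + mu) else -1
        trajectory.append((t, n))
    return trajectory

# Hypothetical example: 100 starting lineages followed for 50 time units.
trajectory = simulate_total_diversity(lam=0.05, mu=0.03, n0=100, t_max=50.0)
```

Simulating a full timetree additionally requires recording ancestry, pruning extinct lineages and subsampling tips, which this sketch deliberately omits.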