The Open Tree of Life: toward a global synthesis of phylogenetic knowledge

2013 REU Intern Joshua Stevens-Stein


Sophomore Biology and/or Computer Science major at The University of Chicago

REU Mentor: Dr. Richard Ree (Curator, Botany)

Symposium Presentation Title: Rooting PhyLoTA: A case study in taxonomically rooting phylogenetic trees

Symposium Presentation Abstract: The Open Tree of Life project seeks to produce an online, open, and comprehensive phylogenetic tree of the 1.8 million known species, an invaluable resource for the scientific community and for public education. Two difficulties facing the project are compiling all published phylogenetic data and incorporating those results into a unified format. One of the larger sources of such data, previously unmined and unrefined, is the PhyLoTA database, which contains 22,165 eukaryotic trees produced by all-against-all BLAST searches and sequence clustering algorithms. Unfortunately, because the trees were generated by clustering with no outgroup, they are unrooted. PhyLoTA's analyses are tied to the NCBI taxonomy, and biological classification has increasingly reflected phylogeny in recent decades, which suggested that the NCBI taxonomy could prove informative in rooting these trees. Using the Python packages Ivy and Graph-Tool, we examined each PhyLoTA tree in graph format and determined whether the taxa it contains, from most to least inclusive, formed clades within the larger tree. Any such subtree was extracted and saved, with its root placed on the branch connecting it to the rest of the tree. Of the original 22,165 trees (1,420,989 leaves), 14,338 yielded a total of 24,371 subtrees with 713,260 leaves. These subtrees are due for inclusion in the Open Tree of Life. The code will shortly be made publicly available for use in similar endeavors. Though analysis of the subtrees has thus far been cursory, it should shed light on the evolving relationship between taxonomy and phylogeny, and particularly on the degree of correspondence between the two across the entire tree of life.
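
The clade test described in the abstract can be illustrated with a minimal sketch. It uses plain Python rather than Ivy or Graph-Tool, and it is not the project's code: the function names, the edge-list representation, and the brute-force search over branches are all assumptions made for illustration. The idea is that, in an unrooted tree, the members of a taxon form a clade if removing a single branch separates exactly those leaves from everything else. That branch is then where the extracted subtree is rooted.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def leaves_beyond(adj, parent, node):
    """Return the leaves reachable from `node` without crossing back to `parent`."""
    found = set()
    stack = [(parent, node)]
    while stack:
        prev, cur = stack.pop()
        onward = [n for n in adj[cur] if n != prev]
        if not onward:
            found.add(cur)  # nothing further out, so `cur` is a leaf
        stack.extend((cur, n) for n in onward)
    return found


def find_rooting_edge(adj, taxon_leaves):
    """Find a branch (outside, inside) that cuts off exactly `taxon_leaves`.

    Returns None when the taxon does not form a clade in this tree.
    Hypothetical helper: a brute-force check of every branch in both directions.
    """
    target = set(taxon_leaves)
    for u in list(adj):
        for v in adj[u]:
            if leaves_beyond(adj, u, v) == target:
                return (u, v)
    return None


# A small unrooted tree: leaves A-E, internal nodes x, y, z (made-up labels).
edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"),
         ("y", "z"), ("D", "z"), ("E", "z")]
tree = build_adjacency(edges)

print(find_rooting_edge(tree, {"A", "B"}))  # a branch exists: {A, B} is a clade
print(find_rooting_edge(tree, {"A", "C"}))  # no branch isolates {A, C}
```

In this toy tree, the taxon containing A and B is cut off by the branch between y and x, so the subtree at x would be saved with its root on that branch. A taxon containing only A and C has no such branch and would be skipped. Checking every branch is quadratic in tree size, so a real implementation over 22,165 trees would likely use something faster.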

Original Project Description: In the field of systematic biology, scientists study species of all kinds to determine how one is related to another by evolutionary descent.  In other words, they are trying to reconstruct the great Tree of Life -- the branching genealogy of all species, traced all the way back to a single common ancestor. (The scientific term for 'Tree of Life' is 'phylogeny'.)

Individual scientists typically have expertise in only one or a few branches on the tree -- for instance, one might study dung beetles, while another studies Venus flytraps. Every year, experts like these publish thousands of scientific papers describing new phylogenetic trees for different groups of organisms: clams, birds, mushrooms, and so on. However, these newly discovered trees are generally recorded simply as figures embedded in the pages of scientific journals.

The Open Tree of Life project seeks to extract all these trees from the literature and graft them together by entering them into a common database. This will enable computational analyses that will produce, for the first time, an estimate of the Tree of Life that includes all species ever studied.

Research methods and techniques: Interns on this project will learn how to download data sets of DNA sequences, perform phylogenetic analyses, and interpret the results. They will also have the opportunity to learn basic computer programming and Linux shell computing, or to advance their current knowledge of these topics. Their contributions will be recorded in a public database for posterity. It is a perfect fit for anyone interested in both biology and computers.