--- title: "Base Classes and Functions for Phylogenetic Tree Input and Output" author: "Guangchuang Yu\\ School of Public Health, The University of Hong Kong" date: "`r Sys.Date()`" bibliography: treeio.bib csl: nature.csl output: prettydoc::html_pretty: toc: true theme: cayman highlight: github pdf_document: toc: true vignette: > %\VignetteIndexEntry{Tree Data Input and Output} %\VignetteEngine{knitr::rmarkdown} %\VignetteDepends{Biostrings) %\usepackage[utf8]{inputenc} --- ```{r style, echo=FALSE, results="asis", message=FALSE} knitr::opts_chunk$set(tidy = FALSE, message = FALSE) ``` ```{r echo=FALSE, results="hide", message=FALSE} library("ape") library("Biostrings") library("treeio") ``` The `treeio` package is an infrastructure that enables evolutionary evidences that inferred by commonly used software packages in the field to be used in `R`. For instance, *dN/dS* values or ancestral sequences inferred by [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html)[@yang_paml_2007], clade support values (posterior) inferred by [BEAST](http://beast2.org/)[@bouckaert_beast_2014] and short read placement by [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html)[@berger_EPA_2011] and [pplacer](http://matsen.fhcrc.org/pplacer/)[@matsen_pplacer_2010]. These evolutionary evidences can be further analyzed in `R` and used to annotate phylogenetic tree using `ggtree`. # Supported File Formats Most of the software in this field (including `R` packages) focus on `Newick` and `Nexus` file formats, while there are file formats from different evolution analysis software that contain supporting evidences within the file that are useful for further summary, analysis and annotation of the tree. The `treeio` package define several parser functions and `S4` classes to store statistical evidences inferred by commonly used software packages. It supports several file formats, including: + Newick (via `ape`) + Nexus (via `ape`) + New Hampshire eXtended format (NHX) + Jplace + Phylip and software output from: + [BEAST](http://beast2.org/)[@bouckaert_beast_2014] + [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html)[@berger_EPA_2011] + [HYPHY](http://hyphy.org/w/index.php/Main_Page)[@pond_hyphy_2005] + [PAML](http://abacus.gene.ucl.ac.uk/software/paml.html)[@yang_paml_2007] + [PHYLDOG](http://pbil.univ-lyon1.fr/software/phyldog/)[@boussau_genome-scale_2013] + [pplacer](http://matsen.fhcrc.org/pplacer/)[@matsen_pplacer_2010] + [r8s](http://loco.biosci.arizona.edu/r8s/)[@marazzi_locating_2012] + [RAxML](http://sco.h-its.org/exelixis/web/software/raxml/)[@stamatakis_raxml_2014] + [RevBayes](http://revbayes.github.io/intro.html)[@hohna_probabilistic_2014] # Parser functions The `treeio` package implement several parser functions, including: + `read.beast` for parsing output of [BEASE](http://beast2.org/) + `read.codeml` for parsing output of [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) (`rst` and `mlc` files) + `read.codeml_mlc` for parsing `mlc` file (output of `CODEML`) + `read.hyphy` for parsing output of [HYPHY](http://hyphy.org/w/index.php/Main_Page) + `read.jplace` for parsing `jplace` file including output from [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html) and [pplacer](http://matsen.fhcrc.org/pplacer/) + `read.nhx` for parsing `NHX` file including output from [PHYLODOG](http://pbil.univ-lyon1.fr/software/phyldog/) and [RevBayes](http://revbayes.github.io/intro.html) + `read.paml_rst` for parsing `rst` file (output of `BASEML` and `CODEML`) + `read.r8s` for parsing output of [r8s](loco.biosci.arizona.edu/r8s/) + `read.raxml` for parsing output of [RAxML](http://sco.h-its.org/exelixis/web/software/raxml/) # S4 classes Correspondingly, `ggtree` defines several `S4` classes to store evolutionary evidences inferred by these software packages, including: + _`apeBootstrap`_ for bootstrap analysis of `ape::boot.phylo()`[@paradis_ape_2004], output of `apeBoot()` defined in `ggtree` + _`beast`_ for storing output of `read.beast()` + _`codeml`_ for storing output of `read.codeml()` + _`codeml_mlc`_ for storing output of `read.codeml_mlc()` + _`hyphy`_ for storing output of `read.hyphy()` + _`jplace`_ for storing output of `read.jplace()` + _`nhx`_ for storing output of `read.nhx()` + _`paml_rst`_ for _`rst`_ file obtained by [PAML](http://abacus.gene.ucl.ac.uk/software/paml.html), including _`BASEML`_ and _`CODEML`_. + _`phangorn`_ for storing ancestral sequences inferred by `R` package `phangorn`[@schliep_phangorn_2011], output of `phyPML()` defined in `ggtree` + _`r8s`_ for storing output of `read.r8s()` + _`raxml`_ for storing output of `read.raxml()` The _`jplace`_ class is also designed to store user specified annotation data. In `treeio`, tree objects can be merged and evidences inferred from different phylogenetic analyses can be combined or compared and visualized. For each class, we defined _`get.fields`_ method to get the annotation features that available in the object that can be used to annotate a phylogenetic tree directly in `ggtree`. A _`get.tree`_ method can be used to convert tree object to `phylo` (or `multiPhylo` for `r8s`) object that are widely supported by other `R` packages. Detail descriptions of `slots` defined in each class are documented in class man pages. Users can use `class?className` (e.g. `class?beast`) to access man page of a class. # Getting Tree Data into R ## Parsing BEAST output ```{r} file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio") beast <- read.beast(file) beast ``` Since _`%`_ is not a valid character in _`names`_, all the feature names that contain _`x%`_ will convert to _`0.x`_. For example, _`length_95%_HPD`_ will be changed to _`length_0.95_HPD`_. The _`get.fields`_ method return all available features that can be used for annotation. ```{r} get.fields(beast) ``` ## Parsing PAML output ### _`rst`_ file The _`read.paml_rst`_ function can parse `rst` file from `BASEML` and `CODEML`. The only difference is the space in the sequences. For `BASEML`, each ten bases are separated by one space, while for `CODEML`, each three bases (triplet) are separated by one space. ```{r fig.width=12, fig.height=10, warning=FALSE, fig.align="center"} brstfile <- system.file("extdata/PAML_Baseml", "rst", package="treeio") brst <- read.paml_rst(brstfile) brst ``` Similarly, we can parse the `rst` file from `CODEML` and visualize the tree with amino acid substitution. ```{r} crstfile <- system.file("extdata/PAML_Codeml", "rst", package="treeio") crst <- read.paml_rst(crstfile) crst ``` We can find that these two figures have different evolution distances and subsitutions inferred by `BASEML` and `CODEML` are slightly different. ### CODEML #### mlc file The `mlc` file output by `CODEML` contains _`dN/dS`_ estimation. ```{r} mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="treeio") mlc <- read.codeml_mlc(mlcfile) mlc ``` Please aware that _`/`_ and _`*`_ are not valid characters in _`names`_, they were changed to _`_vs_`_ and _`_x_`_ respectively. So _`dN_vs_dS`_ is _`dN/dS`_, _`N_x_dN`_ is _`N*dN`_, and _`S_x_dS`_ is _`S*dS`_. #### _`CODEML`_ output: rst and mlc files\ We annotate the tree with information presented in _`rst`_ and _`mlc`_ file separately as demonstrated in previous sessions. We can also use both of them which is highly recommended. ```{r} ml <- read.codeml(crstfile, mlcfile) ml ``` All the features in both files are available for annotation. For example, we can annotate _`dN/dS`_ with the tree in _`rstfile`_ and amino acid substitutions with the tree in _`mlcfile`_. ## Parsing HYPHY output ```{r warning=FALSE} nwk <- system.file("extdata/HYPHY", "labelledtree.tree", package="treeio") ancseq <- system.file("extdata/HYPHY", "ancseq.nex", package="treeio") tipfas <- system.file("extdata", "pa.fas", package="treeio") hy <- read.hyphy(nwk, ancseq, tipfas) hy ``` ## Parsing r8s output [r8s](http://loco.biosci.arizona.edu/r8s/) output contains 3 trees, namely `TREE`, `RATO` and `PHYLO` for time tree, rate tree and absolute substitution tree respectively. ```{r fig.width=4, fig.height=6, width=60, warning=FALSE, fig.align="center"} r8s <- read.r8s(system.file("extdata/r8s", "H3_r8s_output.log", package="treeio")) r8s ``` ## Parsing output of RAxML bootstraping analysis ```{r fig.width=12, fig.height=10, width=60, warning=FALSE, fig.align="center"} raxml_file <- system.file("extdata/RAxML", "RAxML_bipartitionsBranchLabels.H3", package="treeio") raxml <- read.raxml(raxml_file) raxml ``` ## Parsing NHX tree NHX (New Hampshire eXtended) format is commonly used in phylogenetics software (e.g. [PHYLDOG](http://pbil.univ-lyon1.fr/software/phyldog/)[@boussau_genome-scale_2013], [RevBayes](http://revbayes.github.io/intro.html)[@hohna_probabilistic_2014], *etc*.) ```{r} nhxfile <- system.file("extdata/NHX", "ADH.nhx", package="treeio") nhx <- read.nhx(nhxfile) nhx ## ggtree(nhx) + geom_tiplab() + geom_point(aes(color=S), size=5, alpha=.5) + ## theme(legend.position="right") + ## geom_text(aes(label=branch.length, x=branch), vjust=-.5) + ## xlim(NA, 0.3) ``` ## Parsing Phylip tree Phylip format contains taxa sequences with Newick tree text. ```{r} phyfile <- system.file("extdata", "sample.phy", package="treeio") phylip <- read.phylip(phyfile) phylip ``` ## Parsing EPA and pplacer output [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html)[@berger_EPA_2011] and [PPLACER](http://matsen.fhcrc.org/pplacer/)[@matsen_pplacer_2010] have common output file format, `jplace`, which can be parsed by `read.jplace()` function. ```{r} jpf <- system.file("extdata/sample.jplace", package="treeio") jp <- read.jplace(jpf) print(jp) ``` In `ggtree`, we provide _`get.placements`_ method to access the placement. ```{r} ## get only best hit get.placements(jp, by="best") ## get all placement get.placements(jp, by="all") ``` This is only a tiny sample file. In reality, [EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html) and [PPLACER](http://matsen.fhcrc.org/pplacer/) may place thousands of short reads on a reference tree. We may, for example, count the number of placement and annotate this information in the tree. # References