--- title: "Introduction to the microbiome R package" author: "Leo Lahti, Sudarshan Shetty, et al." bibliography: - bibliography.bib date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true fig_caption: yes rmarkdown::md_document: toc: true rmarkdown::pdf_document: toc: true vignette: > %\VignetteIndexEntry{microbiome R package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, message=FALSE} library(knitr) require(knitcitations) # cleanbib() # options("citation_format" = "pandoc") bib <- read.bibtex("bibliography.bib") # Can also write math expressions, e.g. # $Y = X\beta + \epsilon$, footnotes^[A footnote here.] # > "He who gives up [code] safety for [code] speed deserves neither." # ([via](https://twitter.com/hadleywickham/status/504368538874703872)) knitr::opts_chunk$set(fig.width=10, fig.height=10, message=FALSE, warning=FALSE) ``` # Introduction The [microbiome R package](http://microbiome.github.io/microbiome) facilitates exploration and analysis of microbiome profiling data, in particular 16S taxonomic profiling. This vignette provides a brief overview with example data sets from published microbiome profiling studies [@lahti14natcomm, @Lahti13provasI, @OKeefe15]. A more comprehensive tutorial is available [on-line](http://microbiome.github.io/microbiome). Tools are provided for the manipulation, statistical analysis, and visualization of taxonomic profiling data. In addition to targeted case-control studies, the package facilitates scalable exploration of large population cohorts [@lahti14natcomm]. Whereas sample collections are rapidly accumulating for the human body and other environments, few general-purpose tools for targeted microbiome analysis are available in R. This package supports the independent [phyloseq](http://joey711.github.io/phyloseq) data format and expands the available toolkit in order to facilitate the standardization of the analyses and the development of best practices. See also the related [PathoStat pipeline](http://bioconductor.org/packages/release/bioc/html/PathoStat.html) [mare pipeline](https://github.com/katrikorpela/mare), [phylofactor](), and [structSSI]() for additional 16S rRNA amplicon analysis tools in R. The aim is to complement the other available packages, but in some cases alternative solutions have been necessary in order to streamline the tools and to improve complementarity. We welcome feedback, bug reports, and suggestions for new features from the user community via the [issue tracker](https://github.com/microbiome/microbiome/issues) and [pull requests](http://microbiome.github.io/microbiome/Contributing.html). See the [Github site](https://github.com/microbiome/microbiome) for source code and other details. These R tools have been utilized in recent publications and in introductory courses [@salonen14, @Faust15, @Shetty2017], and they are released under the [Two-clause FreeBSD license](http://en.wikipedia.org/wiki/BSD\_licenses). Kindly cite the work as follows: "Leo Lahti [et al.](https://github.com/microbiome/microbiome/graphs/contributors) (Bioconductor, 2017). Tools for microbiome analysis in R. Microbiome package version `r sessionInfo()$otherPkgs$microbiome$Version`. URL: (http://microbiome.github.io/microbiome) # Installation To install microbiome package in R (Bioconductor development version), use ```{r install, eval=FALSE} library(BiocInstaller) source("http://www.bioconductor.org/biocLite.R") useDevel() biocLite("microbiome") ``` Then load the package in R ```{r loading} library(microbiome) ``` # Data The microbiome package relies on the independent [phyloseq](http://joey711.github.io/phyloseq) data format. This contains an OTU table (taxa abundances), sample metadata (age, BMI, sex, ...), taxonomy table (mapping between OTUs and higher-level taxonomic classifications), and a phylogenetic tree (relations between the taxa). ## Example data sets Example data sets are provided to facilitate reproducible examples and further methods development. The HITChip Atlas data set [Lahti et al. Nat. Comm. 5:4344, 2014](http://www.nature.com/ncomms/2014/140708/ncomms5344/full/ncomms5344.html) contains 130 genus-level taxonomic groups across 1006 western adults. Load the example data in R with ```{r atlasdata} # Data from # http://www.nature.com/ncomms/2014/140708/ncomms5344/full/ncomms5344.html data(atlas1006) atlas1006 ``` The two-week diet swap study between western (USA) and traditional (rural Africa) diets, reported in [O'Keefe et al. Nat. Comm. 6:6342, 2015](http://dx.doi.org/10.1038/ncomms7342) ```{r dietswapdata} data(dietswap) # Data from http://dx.doi.org/10.1038/ncomms7342 dietswap ``` A parallel profiling of gut microbiota versus blood metabolites from [Lahti et al. PeerJ 1:e32, 2013](https://peerj.com/articles/32/) to characterize associations between human intestinal microbiota and blood serum lipids ```{r peerjdata} data(peerj32) # Data from https://peerj.com/articles/32/ ``` ## Data import You can import 16S profiling data from standard formats (Mother, BIOM, CSV, etc.). See the [tutorial](http://microbiome.github.io/microbiome/Data.html) for details. ## Data manipulation A phyloseq object can be subsetted, filtered, aggregated, transformed, and otherwise manipulated. For a comprehensive list of tools, see the [online tutorial](http://microbiome.github.io/microbiome/Preprocessing.html). To convert absolute counts to compositional (relative) abundances, for instance, use ```{r compositional} # dietswap is a phyloseq object; see above dietswap.compositional <- transform(dietswap, "compositional") ``` # Ecosystem indices ## Alpha diversity, richness, evenness, dominance, and rarity Commonly used ecosystem state variables include various indices to quantify alpha diversities, richness, evenness, dominance, and rarity (see functions with similar names). We provide a comprehensive set of such indices via a standardized interface. The function `global` calls these indicators with default parameters. For further options, see [tutorial](http://microbiome.github.io/microbiome/Diversity.html). ```{r global, results="asis"} g <- global(atlas1006, index = "gini") ``` Visually-Weighted Regression curve with smoothed error bars is based on the can be used to visualize sample variables [(1)](http://www.fight-entropy.com/2012/07/visually-weighted-regression.html), here the relation between age and diversity. This function operates on standard data frames. ```{r reg, fig.width=8, fig.height=4, dev="CairoPNG"} # Estimate Shannon diversity and add it to the phyloseq object sample_data(atlas1006)$diversity <- global(atlas1006, index = "shannon")[,1] # Compare age and microbiome diversity plot_regression(diversity ~ age, meta(atlas1006)) ``` # Core microbiota analysis Population frequencies, or **prevalence**, of the taxonomic groups exceeding a given detection threshold can be calculated with ```{r core-prevalence2} p <- prevalence(dietswap, detection = 0, sort = TRUE) ``` The **core microbiota** refers to the set of taxa that are detected in a remarkable fraction of the population above a given abundance threshold [see e.g. @Jalanka-Tuovinen11, @Salonen12cmi]. The core subset can be sliced from a phyloseq object as follows. ```{r core-data} # Taxa with over 50% prevance at .2% relative abundance dietswap.core <- core(dietswap.compositional, detection = .2/100, prevalence = 50/100) dietswap.core ``` Similar functions are available also for rare and variable taxa, see the microbiome [tutorial](http://microbiome.github.io/microbiome/Core.html) for a full description. To visualize the core as in [@Shetty2017], use ```{r corevisu, fig.width=20, fig.heigth=10, out.width = '350px'} library(ggplot2, quiet = TRUE) p <- plot_core(transform(dietswap.core, "compositional"), plot.type = "heatmap", colours = gray(seq(0,1,length=5)), prevalences = seq(.05, 1, .05), detections = 10^seq(log10(1e-3), log10(.2), length = 10), horizontal = TRUE) + xlab("Detection Threshold (Relative Abundance (%))") print(p) ``` # Microbiome composition Composition heatmap: Z-transformed taxon abundances ```{r compheat, fig.height=8, fig.width=12, out.width="300px"} tmp <- plot_composition(dietswap.core, plot.type = "heatmap", transform = "Z", mar = c(6, 13, 1, 1), sample.sort = "nationality") ``` Composition barplot ```{r compbar, fig.height=6, fig.width=12} plot_composition(transform(dietswap.core, "compositional"), plot.type = "barplot", sample.sort = "neatmap") ``` Visualize sample similarities, or the microbiome landscape [@Shetty2017] on an ordination map based on various [ordination methods](http://microbiome.github.io/microbiome/Ordination.html) (PCoA, NMDS, RDA etc) that are available through various other R packages. ```{r landscape4, fig.width=8, fig.height=6, out.width="400px"} # Project the samples with the given method and dissimilarity measure. # Ordinate the data; note that some ordinations are sensitive to random seed # "quiet" is used to suppress intermediate outputs set.seed(423542) # Get ordination x <- dietswap.core quiet(x.ord <- ordinate(x, method = "NMDS", distance = "bray")) # Pick the projected data (first two columns + metadata) quiet(proj <- phyloseq::plot_ordination(x, x.ord, justDF=TRUE)) # Rename the projection axes names(proj)[1:2] <- paste("Comp", 1:2, sep=".") p <- plot_landscape(proj[, 1:2], col = proj$nationality, legend = TRUE) print(p) ``` # Association heatmaps Let us cross-correlate example data containing microbiome profiling and blood serum lipids study [@Lahti13provasI]. ```{r heatmap-crosscorrelate4, fig.keep='none'} # Define data sets to cross-correlate data(peerj32) x <- log10(peerj32$microbes) # OTU Log10 (44 samples x 130 genera) y <- as.matrix(peerj32$lipids) # Lipids (44 samples x 389 lipids) # Cross correlate data sets and return a table correlation.table <- associate(x, y, method = "spearman", mode = "table", p.adj.threshold = 0.05, n.signif = 1) kable(head(correlation.table)) ``` Let us rearrange the data and plot the heatmap and highlight significant correlations with stars as in [@Lahti13provasI]. ```{r hm3, fig.width=12, fig.height=10, out.width="400px"} heat(correlation.table, "X1", "X2", fill = "Correlation", star = "p.adj", p.adj.threshold = 0.05) ``` # Bistability and tipping elements The microbiome package provides tools to [quantify stability and bimodality](http://microbiome.github.io/microbiome/Stability.html) in abundance data, following [@lahti14natcomm]. Let use Dialister as an example. The tipping point is here set manually but it can also be [estimated from the data](http://microbiome.github.io/microbiome/Stability.html). ```{r varpl2, fig.show='hold', out.width="250px"} # Use relative abundances and plot subject variation # during the follow-up period pseq <- transform(atlas1006, "compositional") plot_tipping(pseq, "Dialister", tipping.point = 0.004) # Show the bimodal population distribution hotplot(pseq, "Dialister", tipping.point = 0.004) ``` # Further reading The [on-line tutorial](http://microbiome.github.io/microbiome) provides many additional tools and examples, with more thorough descriptions. This package and tutorials are work in progress. We welcome feedback, for instance via [issue Tracker](https://github.com/microbiome/microbiome/issues), [pull requests](https://github.com/microbiome/microbiome/), or via [Gitter](https://gitter.im/microbiome). # Acknowledgements Thanks to all [contributors](https://github.com/microbiome/microbiome/graphs/contributors). Financial support has been provided by Academy of Finland (grants 256950 and 295741), [University of Turku](http://www.utu.fi/en/Pages/home.aspx), Department of Mathematics and Statistics. In addition, the work has been supported by [VIB lab for Bioinformatics and (eco-)systems biology](http://www.vib.be/en/research/scientists/Pages/Jeroen-Raes-Lab.aspx), VIB/KULeuven, Belgium, [Molecular Ecology group](http://www.mib.wur.nl/UK/), Laboratory of Microbiology, Wageningen University, Netherlands, and [Department of Veterinary Bioscience](http://www.vetmed.helsinki.fi/apalva/index.htm), University of Helsinki, Finland. This work relies on the independent [phyloseq](https://github.com/joey711/phyloseq) package and data structures for R-based microbiome analysis developed by Paul McMurdie and Susan Holmes. This work also utilizes a number of independent R extensions, including dplyr [@dplyr], ggplot2 [@Wickham_2009], phyloseq [@McMurdie2013], and vegan [@Oksanen_2015]. # References ```{r, echo=FALSE,results="asis"} bibliography() ```