---
title: "Introduction to the microbiome R package"
author: "Leo Lahti, Sudarshan Shetty, et al."
bibliography: 
- bibliography.bib
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    fig_caption: yes    
  rmarkdown::md_document:
    toc: true
  rmarkdown::pdf_document:
    toc: true    
vignette: >
  %\VignetteIndexEntry{microbiome R package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, echo=FALSE, message=FALSE}
library(knitr)
require(knitcitations)
# cleanbib()
# options("citation_format" = "pandoc")
bib <- read.bibtex("bibliography.bib")

# Can also write math expressions, e.g. 
# $Y = X\beta + \epsilon$, footnotes^[A footnote here.]
# > "He who gives up [code] safety for [code] speed deserves neither."
# ([via](https://twitter.com/hadleywickham/status/504368538874703872))
knitr::opts_chunk$set(fig.width=10, fig.height=10, message=FALSE, warning=FALSE)
```

# Introduction

The [microbiome R package](http://microbiome.github.io/microbiome)
facilitates exploration and analysis of microbiome profiling data, in
particular 16S taxonomic profiling.

This vignette provides a brief overview
with example data sets from published microbiome profiling studies
[@lahti14natcomm, @Lahti13provasI, @OKeefe15].
A more comprehensive tutorial is available
[on-line](http://microbiome.github.io/microbiome).

Tools are provided for the manipulation, statistical analysis, and
visualization of taxonomic profiling data. In addition to targeted
case-control studies, the package facilitates scalable exploration of
large population cohorts [@lahti14natcomm]. Whereas sample collections
are rapidly accumulating for the human body and other environments,
few general-purpose tools for targeted microbiome analysis are
available in R.  This package supports the independent
[phyloseq](http://joey711.github.io/phyloseq) data format and expands
the available toolkit in order to facilitate the standardization of
the analyses and the development of best practices. See also the
related [PathoStat
pipeline](http://bioconductor.org/packages/release/bioc/html/PathoStat.html)
[mare pipeline](https://github.com/katrikorpela/mare),
[phylofactor](), and [structSSI]() for additional 16S rRNA amplicon
analysis tools in R. The aim is to complement the other available
packages, but in some cases alternative solutions have been necessary
in order to streamline the tools and to improve complementarity.

We welcome feedback, bug reports, and suggestions for new features
from the user community via the [issue
tracker](https://github.com/microbiome/microbiome/issues) and [pull
requests](http://microbiome.github.io/microbiome/Contributing.html). See
the [Github site](https://github.com/microbiome/microbiome) for source
code and other details. These R tools have been utilized in recent
publications and in introductory courses [@salonen14, @Faust15,
@Shetty2017], and they are released under the [Two-clause FreeBSD
license](http://en.wikipedia.org/wiki/BSD\_licenses).

Kindly cite the work as follows: "Leo Lahti [et
al.](https://github.com/microbiome/microbiome/graphs/contributors)
(Bioconductor, 2017). Tools for microbiome analysis in R. Microbiome
package version `r sessionInfo()$otherPkgs$microbiome$Version`. URL:
(http://microbiome.github.io/microbiome)


# Installation

To install microbiome package in R (Bioconductor development version),
use

```{r install, eval=FALSE}
library(BiocInstaller)
source("http://www.bioconductor.org/biocLite.R")
useDevel()
biocLite("microbiome")
```

Then load the package in R

```{r loading}
library(microbiome)  
```


# Data

The microbiome package relies on the independent
[phyloseq](http://joey711.github.io/phyloseq) data format. This
contains an OTU table (taxa abundances), sample metadata (age, BMI,
sex, ...), taxonomy table (mapping between OTUs and higher-level
taxonomic classifications), and a phylogenetic tree (relations between
the taxa).


## Example data sets

Example data sets are provided to facilitate reproducible examples and
further methods development.

The HITChip Atlas data set [Lahti et al. Nat. Comm. 5:4344,
2014](http://www.nature.com/ncomms/2014/140708/ncomms5344/full/ncomms5344.html)
contains 130 genus-level taxonomic groups across 1006 western
adults. Load the example data in R with

```{r atlasdata}
# Data from 
# http://www.nature.com/ncomms/2014/140708/ncomms5344/full/ncomms5344.html
data(atlas1006) 
atlas1006
```

The two-week diet swap study between western (USA) and traditional
(rural Africa) diets, reported in [O'Keefe et al. Nat. Comm. 6:6342,
2015](http://dx.doi.org/10.1038/ncomms7342)

```{r dietswapdata}
data(dietswap) # Data from http://dx.doi.org/10.1038/ncomms7342
dietswap
```

A parallel profiling of gut microbiota versus blood metabolites from
[Lahti et al. PeerJ 1:e32, 2013](https://peerj.com/articles/32/) to
characterize associations between human intestinal microbiota and
blood serum lipids

```{r peerjdata}
data(peerj32) # Data from https://peerj.com/articles/32/
```


## Data import

You can import 16S profiling data from standard formats (Mother, BIOM,
CSV, etc.). See the
[tutorial](http://microbiome.github.io/microbiome/Data.html) for
details.


## Data manipulation

A phyloseq object can be subsetted, filtered, aggregated, transformed,
and otherwise manipulated. For a comprehensive list of tools, see the
[online
tutorial](http://microbiome.github.io/microbiome/Preprocessing.html).

To convert absolute counts to compositional (relative)
abundances, for instance, use

```{r compositional}
# dietswap is a phyloseq object; see above
dietswap.compositional <- transform(dietswap, "compositional")
```


# Ecosystem indices

## Alpha diversity, richness, evenness, dominance, and rarity

Commonly used ecosystem state variables include various indices to quantify
alpha diversities, richness, evenness, dominance, and rarity (see
functions with similar names). We provide a
comprehensive set of such indices via a standardized interface.

The function `global` calls these indicators with default parameters. 
For further options, see 
[tutorial](http://microbiome.github.io/microbiome/Diversity.html). 

```{r global, results="asis"}
g <- global(atlas1006, index = "gini")
```

Visually-Weighted Regression curve with smoothed error bars is based on the
can be used to visualize sample variables
[(1)](http://www.fight-entropy.com/2012/07/visually-weighted-regression.html), 
here the relation between age and diversity. 
This function operates on standard data frames.

```{r reg, fig.width=8, fig.height=4, dev="CairoPNG"}
# Estimate Shannon diversity and add it to the phyloseq object
sample_data(atlas1006)$diversity <- global(atlas1006, index = "shannon")[,1]

# Compare age and microbiome diversity
plot_regression(diversity ~ age, meta(atlas1006))
```


# Core microbiota analysis

Population frequencies, or **prevalence**, of the taxonomic groups
exceeding a given detection threshold can be calculated with

```{r core-prevalence2}
p <- prevalence(dietswap, detection = 0, sort = TRUE)
```

The **core microbiota** refers to the set of taxa that are detected in
a remarkable fraction of the population above a given abundance
threshold [see e.g. @Jalanka-Tuovinen11, @Salonen12cmi]. The core
subset can be sliced from a phyloseq object as follows.

```{r core-data}
# Taxa with over 50% prevance at .2% relative abundance
dietswap.core <- core(dietswap.compositional, 
                    detection = .2/100, prevalence = 50/100)
dietswap.core
```


Similar functions are available also for rare and variable taxa, see
the microbiome
[tutorial](http://microbiome.github.io/microbiome/Core.html) for a
full description.

To visualize the core as in [@Shetty2017], use

```{r corevisu, fig.width=20, fig.heigth=10, out.width = '350px'}
library(ggplot2, quiet = TRUE)
p <- plot_core(transform(dietswap.core, "compositional"), 
    plot.type = "heatmap", 
    colours = gray(seq(0,1,length=5)),
    prevalences = seq(.05, 1, .05), 
    detections = 10^seq(log10(1e-3), log10(.2), length = 10), 
    horizontal = TRUE) +
    xlab("Detection Threshold (Relative Abundance (%))") 
print(p)    
```


# Microbiome composition

Composition heatmap: Z-transformed taxon abundances

```{r compheat, fig.height=8, fig.width=12, out.width="300px"}
tmp <- plot_composition(dietswap.core, plot.type = "heatmap", transform = "Z", 
            mar = c(6, 13, 1, 1), sample.sort = "nationality")
```   


Composition barplot
        
```{r compbar, fig.height=6, fig.width=12}
plot_composition(transform(dietswap.core, "compositional"), 
    plot.type = "barplot", sample.sort = "neatmap")
```

Visualize sample similarities, or the microbiome landscape
[@Shetty2017] on an ordination map based on various [ordination
methods](http://microbiome.github.io/microbiome/Ordination.html)
(PCoA, NMDS, RDA etc) that are available through various other R
packages.

```{r landscape4, fig.width=8, fig.height=6, out.width="400px"}
# Project the samples with the given method and dissimilarity measure. 
# Ordinate the data; note that some ordinations are sensitive to random seed
# "quiet" is used to suppress intermediate outputs
set.seed(423542)

# Get ordination
x <- dietswap.core
quiet(x.ord <- ordinate(x, method = "NMDS", distance = "bray"))
# Pick the projected data (first two columns + metadata)
quiet(proj <- phyloseq::plot_ordination(x, x.ord, justDF=TRUE))
# Rename the projection axes
names(proj)[1:2] <- paste("Comp", 1:2, sep=".")

p <- plot_landscape(proj[, 1:2], col = proj$nationality, legend = TRUE)
print(p)
```


# Association heatmaps

Let us cross-correlate example data containing microbiome profiling
and blood serum lipids study [@Lahti13provasI].

```{r heatmap-crosscorrelate4, fig.keep='none'}
# Define data sets to cross-correlate
data(peerj32)
x <- log10(peerj32$microbes)   # OTU Log10 (44 samples x 130 genera)
y <- as.matrix(peerj32$lipids) # Lipids (44 samples x 389 lipids)

# Cross correlate data sets and return a table
correlation.table <- associate(x, y, 
    method = "spearman", mode = "table", p.adj.threshold = 0.05, n.signif = 1)

kable(head(correlation.table))
```

Let us rearrange the data and plot the heatmap and highlight
significant correlations with stars as in [@Lahti13provasI].

```{r hm3, fig.width=12, fig.height=10, out.width="400px"}
heat(correlation.table, "X1", "X2", fill = "Correlation", 
    star = "p.adj", p.adj.threshold = 0.05) 
```


# Bistability and tipping elements

The microbiome package provides tools to [quantify stability and
bimodality](http://microbiome.github.io/microbiome/Stability.html) in
abundance data, following [@lahti14natcomm]. Let use Dialister as an
example. The tipping point is here set manually but it can also be
[estimated from the
data](http://microbiome.github.io/microbiome/Stability.html).

```{r varpl2, fig.show='hold', out.width="250px"}
# Use relative abundances and plot subject variation 
# during the follow-up period
pseq <- transform(atlas1006, "compositional")
plot_tipping(pseq, "Dialister", tipping.point = 0.004)

# Show the bimodal population distribution
hotplot(pseq, "Dialister", tipping.point = 0.004)
```


# Further reading

The [on-line tutorial](http://microbiome.github.io/microbiome)
provides many additional tools and examples, with more thorough
descriptions. This package and tutorials are work in progress. We
welcome feedback, for instance via [issue
Tracker](https://github.com/microbiome/microbiome/issues), [pull
requests](https://github.com/microbiome/microbiome/), or via
[Gitter](https://gitter.im/microbiome).


# Acknowledgements

Thanks to all
[contributors](https://github.com/microbiome/microbiome/graphs/contributors).
Financial support has been provided by Academy of Finland (grants
256950 and 295741), [University of
Turku](http://www.utu.fi/en/Pages/home.aspx), Department of
Mathematics and Statistics. In addition, the work has been supported
by [VIB lab for Bioinformatics and (eco-)systems
biology](http://www.vib.be/en/research/scientists/Pages/Jeroen-Raes-Lab.aspx),
VIB/KULeuven, Belgium, [Molecular Ecology
group](http://www.mib.wur.nl/UK/), Laboratory of Microbiology,
Wageningen University, Netherlands, and [Department of Veterinary
Bioscience](http://www.vetmed.helsinki.fi/apalva/index.htm),
University of Helsinki, Finland. This work relies on the independent
[phyloseq](https://github.com/joey711/phyloseq) package and data
structures for R-based microbiome analysis developed by Paul McMurdie
and Susan Holmes. This work also utilizes a number of independent R
extensions, including dplyr [@dplyr], ggplot2 [@Wickham_2009],
phyloseq [@McMurdie2013], and vegan [@Oksanen_2015].


# References

```{r, echo=FALSE,results="asis"}
bibliography()
```