---
title: "curatedMetagenomicData"
abstract: >
    The curatedMetagenomicData package provides taxonomic, functional, and gene
    marker abundance for samples collected from different bodysites. It provides
    data from aproximately 3000 human microbiome samples that have been highly
    processed, refined, and curated such that analysis that might otherwise
    require a computing cluster can be done on an ordianary laptop.
output:
    BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{curatedMetagenomicData}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = TRUE)
```

# What `curatedMetagenomicData` provides

`curatedMetagenomicData` provides 6 types of data for each dataset:

1. species-level taxonomic profiles,
2. presence of unique, clade-specific markers,
3. abundance of unique, clade-specific markers,
4. abundance of gene families,
5. metabolic pathway coverage, and 
6. metabolic pathway abundance. 

Types 1-3 are generated by 
[MetaPhlAn2](http://huttenhower.sph.harvard.edu/metaphlan2); 4-6 are generated 
by [HUMAnN2](http://huttenhower.sph.harvard.edu/humann2).

Currently, `curatedMetagenomicData` provides:

* 2,875 samples from 13 datasets, primarily of the human gut but including body 
sites profiled in the Human Microbiome Project
* processed data from whole-metagenome shotgun metagenomics, with 
manually-curated metadata, as integrated and documented Bioconductor 
ExpressionSet objects
* ~200 fields of specimen metadata from original papers, supplementary files, 
and websites, with manual curation to standardize annotations
* processing of data through the 
[MetaPhlAn2](http://huttenhower.sph.harvard.edu/metaphlan2) pipeline for 
taxonomic abundance, and [HUMAnN2](http://huttenhower.sph.harvard.edu/humann2) 
pipeline for metabolic analysis
* This effort required analyzing ~23T of raw sequencing data, generating ~52T of
intermediate files

These datasets are documented in the 
[reference manual](https://tinyurl.com/curatedMetagenomicData-man).

# Using `curatedMetagenomicData` Resources

Use of the resources in `curatedMetagenomicData` is simplified with the use of
Bioconductor's ExperimentHub platform, which allows for the accessing of data
through an intuitive interface. First, `curatedMetagenomicData` is installed
using `BiocInstaller` and then called as a library - the process allows for the
user to simply call datasets as functions because the package is aware of the
resources present in ExperimentHub S3 buckets. 

```{r, message=FALSE}
# BiocInstaller::biocLite("curatedMetagenomicData")  #Bioconductor version
# BiocInstaller::biocLite("vobencha/curatedMetagenomicData")  #bleeding edge version
library(curatedMetagenomicData)
```

# Browsing `r Biocpkg("ExperimentHub")`

In the next section we will demonstrate convenience functions that make `r Biocpkg("ExperimentHub")`  transparent, but it can be useful and powerful to interact directly with `r Biocpkg("ExperimentHub")`. A "hub" connects you to the `r Biocpkg("ExperimentHub")` server and its metadata, without downloading any data:

```{r}
suppressPackageStartupMessages(library(ExperimentHub))
eh = ExperimentHub()
```

The following queries ExperimentHub for all records matching "curatedMetagenomicData":

```{r}
myquery = query(eh, "curatedMetagenomicData")
```

This is an abbreviated list of what was found:

```{r}
myquery
```

Note that the first column, "EH178" etc, are unique IDs for each dataset.  For a full list, `mcols(myquery)` produces a `data.frame`. The following is unevaluated in this script, but can be used to display an interactive spreadsheet of the results:

```{r, eval=FALSE}
View(mcols(myquery))
```

Or to write the search results to a csv file:

```{r, eval=FALSE}
write.csv(mcols(myquery), file="curatedMetagenomicData_allrecords.csv")
```

# Advanced searching of `r Biocpkg("ExperimentHub")`

Tags can (eventually) help identify useful datasets:
```{r}
head(myquery$tags)
```

Titles can be used to select body site and data type. Here are the MetaPhlan2 taxonomy tables for all the available Human Microbiome Project (HMP) body sites, found using a Perl regular expression to find "HMP" at the beginning of a string, followed by any number of characters ".*", followed by the word "metaphlan":

```{r}
grep("(?=HMP)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)
```

Here are all the metabolic pathway abundance for the stool body site:
```{r}
grep("(?=.*stool)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)
```

Or, here are all the data products available for the HiSeq component of the "Loman" study:
```{r}
(lomannames = grep("LomanNJ_2013_Hi", myquery$title, perl=TRUE, val=TRUE))
```

We could create a list of `ExpressionSet` objects containing all of the Loman HiSeq products as follows (not evaluated):

```{r, eval=FALSE}
(idx = grep("LomanNJ_2013_Hi", myquery$title, perl=TRUE))
loman.list = lapply(idx, function(i){
  return(myquery[[i]])
})
names(loman.list) = lomannames
loman.list
```

# Convenience functions

Individual data projects can be fetched without having to think about `r Biocpkg("ExperimentHub")`. A function call to a dataset name returns a Bioconductor ExpressionSet object:

```{r}
suppressPackageStartupMessages(library(curatedMetagenomicData))
loman.eset = LomanNJ_2013_Hi.metaphlan_bugs_list.stool()
```

# Features of `ExpressionSet` Objects

All datasets are represented as `ExpressionSet` objects because of the
integrative nature of the class and its ability to bind data and metadata. There
are three main functions, from the `Biobase` package, that provide access to
experiment-level metadata, subject-level metadata, and the data itself.

To access the experiment-level metadata the `experimentData()` function is used
to return a `MIAME` (Minimum Information About a Microarray Experiment) object.

```{r}
experimentData( loman.eset )
```

To access the subject-level metadata the `pData()` function is used to return a
`data.frame` containing subject-level variables of study.

```{r}
head( pData( loman.eset ) )
```

To access the data itself (in this case relative abundance), the `exprs()` function returns a variables
by samples (rows by columns) numeric matrix. Note the presence of "synthetic" clades at all levels of the taxonomy,
starting with kingdom, e.g. k__bacteria here:

```{r}
exprs( loman.eset )[1:6, 1:5]  #first 6 rows and 5 columns
```

Bioconductor provides further documentation of the ExpressionSet class and has 
published an excellent [introduction](https://tinyurl.com/ExpressionSetIntro).

## E. coli prevalence

Here's a direct, exploratory analysis of *E. coli* prevalence in the Loman 
dataset using the `ExpressionSet` object. More elegant solutions will be provided later using subsetting methods provided by the `r Biocpkg("phyloseq")` package,
but for users familiar with `grep()` and the `ExpressionSet` object, such
manual methods may suffice.

First, which *E. coli*-related taxa are available?

```{r}
grep("coli", rownames(loman.eset), value=TRUE)
```

Create a vector of *E. coli* relative abundances. This `grep` call with
a "$" at the end selects the only row that ends with "s__Escherichia_coli":
```{r}
x = exprs( loman.eset )[grep("s__Escherichia_coli$", rownames( loman.eset)), ]
summary( x )
```

This could be plotted as a histogram:
```{r}
hist( x, xlab = "Relative Abundance", main="Prevalence of E. Coli",
      breaks="FD")
```

# Taxonomy-Aware Analysis using `phyloseq`

For the MetaPhlAn2 bugs datasets (but not other data types), you gain a lot of 
taxonomy-aware, ecological analysis and plotting by conversion to a 
`r Biocpkg("phyloseq")` class object. `r Biocpkg("curatedMetagenomicData")`
provides the `ExpressionSet2phyloseq()` function to make this easy:

```{r, warning=FALSE}
loman.pseq = ExpressionSet2phyloseq( loman.eset )
```

## Estimating Absolute Raw Count Data

Absolute raw count data can be estimated from the relative count data by 
multiplying the columns of the `ExpressionSet` data by the number of reads for
each sample, as found in the `pData` column "number_reads". You could do this
manually using the `sweep()` function:

```{r}
loman.counts = sweep(exprs( loman.eset ), 2, loman.eset$number_reads, "*")
loman.counts[1:6, 1:5]
```

or just set the `relab` argument in `Expressionset2phyloseq()` to `FALSE`:

```{r}
loman.pseq2 = ExpressionSet2phyloseq( loman.eset, relab=FALSE )
otu_table( loman.pseq2 )[1:6, 1:5]
```

Note the simplified row names of the OTU table, resulting from the default 
argument `simplify=TRUE`. The simplified names are convenient, and lossless
because taxonomic information is now attainable by `tax_table(loman.pseq2)`

## Components of a `r Biocpkg("phyloseq")` object

The  `r Biocpkg("phyloseq")` objects currently contain 3 components, with
extractor functions hinted at by its show method:
```{r}
loman.pseq
```


`otu_table()` returns the same thing as `exprs( loman.eset )` did, the
Operational Taxanomic Unit (OTU) table. Here are the first 6 rows and 5 columns:

```{r, warning=FALSE}
otu_table( loman.pseq )[1:6, 1:5]
```

The same patient or participant data that was available from 
`pData( loman.eset )` is now availble using `sample_data()` on 
the `r Biocpkg("phyloseq")` object:

```{r, warning=FALSE}
sample_data( loman.pseq )[1:6, 1:5]
```

But this object also is aware of the taxonomic structure, which 
will enable the powerful subsetting methods of the `r Biocpkg("phyloseq")` 
package.

```{r, warning=FALSE}
head( tax_table( loman.pseq ) )
```

## Subseting / Pruning

The process of subsetting begins with the names of phylogenetic ranks:

```{r, warning=FALSE}
rank_names( loman.pseq )
```

Taxa can be filtered by these rank names. For example, to return an object
with only species and strains:

```{r, warning=FALSE}
subset_taxa( loman.pseq, !is.na(Species))
```

To keep only phylum-level data (not class or finer, and not kingdom-level):

```{r}
subset_taxa( loman.pseq, is.na(Class) & !is.na(Phylum))
```

Or to keep only Bacteroidetes phylum. Note that taxa names have been shortened 
from the rownames of the `ExpressionSet` object, for nicer plotting.
```{r}
loman.bd = subset_taxa( loman.pseq, Phylum == "Bacteroidetes")
head( taxa_names( loman.bd ) )
```

## Advanced Pruning

The `r Biocpkg("phyloseq")` package provides advanced pruning of taxa, such as the following which keeps only taxa that are among the most abundant 5% in at least five samples:

```{r, warning=FALSE}
keepotu = genefilter_sample(loman.pseq, filterfun_sample(topp(0.05)), A=5)
summary(keepotu)
subset_taxa(loman.pseq, keepotu)
```

Note that `r Biocpkg("phyloseq")` also provides `topk()` for selecting the 
most abundant `k` taxa, and other functions for advanced pruning of taxa.

## Taxonomy Heatmap

The `r Biocpkg("phyloseq")` package provides the `plot_heatmap()` function
to create heatmaps using a variety of built-in dissimilarity metrics for 
clustering. Here, we apply the same abundance filter as above, keep only 
strain-level OTUs. This function supports a large number of distance and ordination methods, here we use Bray-Curtis dissimilarity for distance and 
PCoA as the ordination method for organizing the heatmap.

```{r, warning=FALSE}
loman.filt = subset_taxa(loman.pseq, keepotu & !is.na(Strain))
plot_heatmap(loman.filt, method="PCoA", distance="bray")
```

## Taxonomy Histogram

Here we plot the top 20 most abundant species (not strains), defined by the sum of abundance across all samples in the dataset:

```{r, warning=FALSE}
loman.sp = subset_taxa(loman.pseq, !is.na(Species) & is.na(Strain))
par(mar = c(20, 4, 0, 0) + 0.15) #increase margin size on the bottom
barplot(sort(taxa_sums(loman.sp), TRUE)[1:20] / nsamples(loman.sp),
        ylab = "Total counts", las = 2)
```

## Alpha Diversity Estimation

The `r Biocpkg("phyloseq")` package calculates numerous alpha diversity
measures.  Here we compare three diversity in the species-level data, stratifying by stool texture:

```{r, warning=FALSE}
alphas = c("Shannon", "Simpson", "InvSimpson")
plot_richness(loman.sp, "stooltexture", measures = alphas)
```
Let's compare these three alpha diversity measures:

```{r, warning=FALSE}
pairs( estimate_richness(loman.sp, measures = alphas) )
```

## Beta Diversity / Dissimilarity Clustering

Numerous beta diversity / dissimilarity are provided by the `distance()` function when provided a `r Biocpkg("phyloseq")` object, and these can be used for any kind of clustering or classification scheme. For example, here is a hierarchical clustering dendrogram produced by the `hclust()` from the base R `stats` package with "Ward" linkage:

```{r, warning=FALSE}
mydist = distance(loman.sp, method="bray")
myhclust = hclust( mydist )
plot(myhclust, main="Bray-Curtis Dissimilarity", 
     method="ward.D", xlab="Samples", sub = "")
```

## Ordination Analysis

The `r Biocpkg("phyloseq")` package provides a variety of ordination methods, with convenient options for labelling points. Here is a 
Principal Coordinates Analysis plot of species-level taxa from the Loman
dataset, using Bray-Curtis distance:

```{r, warning=FALSE}
ordinated_taxa = ordinate(loman.sp, method="PCoA", distance="bray")
plot_ordination(loman.sp, ordinated_taxa, color="stooltexture", 
                title = "Bray-Curtis Principal Coordinates Analysis")
```


```{r}
plot_scree(ordinated_taxa, title="Screeplot")
```


# Adding Datasets

Authors welcome the addition of new datasets provided they can be or already 
have been run through the MetaPhlAn2 and HUMAnN2 pipelines. Please read the 
[developer documentation](https://tinyurl.com/cMDReadme) and contact the 
maintainer if you have a shotgun metagenomic dataset that would be of interest 
to the Bioconductor community.

# Reporting Bugs

Development of the `curatedMetagenomicData` package occurs on GitHub and bugs 
reported via GitHub issues will recieve the quickest response. Please visit the 
[project repository](https://github.com/vobencha/curatedMetagenomicData) and 
file an issue should you find one.

# Other Issues

Visit the Bioconductor support site at https://support.bioconductor.org/, 
breifly describe your issue, and add the #curatedmetagenomicdata tag.