---
title: "curatedMetagenomicData vignette: microbial taxonomy and functional data for the human microbiome"
abstract: >
    The `curatedMetagenomicData` package provides taxonomic, functional, and gene
    marker abundance for samples collected from different human bodysites. It provides
    processed data from whole-metagenome sequencing for thousand of human 
    microbiome samples, including the Human Microbiome Project. Microbiome data and 
    associated subject, specimen, and sequencing information are integrated as 
    Bioconductor ExpressionSet objects, allowing straightfoward analyses. A conversion 
    function also links taxonomic datasets to the `r BiocStyle::Biocpkg("phyloseq")` package for 
    ecological analyses.
output:
    BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{curatedMetagenomicData}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = TRUE)
suppressPackageStartupMessages(library(curatedMetagenomicData))
```

# What `curatedMetagenomicData` provides

`curatedMetagenomicData` provides 6 types of data for each dataset:

1. Species-level taxonomic profiles, expressed as relative abundance from kingdom to strain level
2. Presence of unique, clade-specific markers
3. Abundance of unique, clade-specific markers
4. Abundance of gene families
5. Metabolic pathway coverage
6. Metabolic pathway abundance

Types 1-3 are generated by 
[MetaPhlAn2](http://huttenhower.sph.harvard.edu/metaphlan2); 4-6 are generated by [HUMAnN2](http://huttenhower.sph.harvard.edu/humann2).

Currently, `curatedMetagenomicData` provides:

* `r nrow(combined_metadata)` samples from `r length(unique(combined_metadata$dataset_name))` datasets, primarily of the human gut but including body sites profiled in the Human Microbiome Project
* Processed data from whole-metagenome shotgun metagenomics, with manually-curated metadata, as integrated and documented Bioconductor 
ExpressionSet objects
* ~80 fields of specimen metadata from original papers, supplementary files, and websites, with manual curation to standardize annotations
* Processing of data through the [MetaPhlAn2](http://huttenhower.sph.harvard.edu/metaphlan2) pipeline for taxonomic abundance, and [HUMAnN2](http://huttenhower.sph.harvard.edu/humann2) pipeline for metabolic analysis
* This effort required analyzing ~100T of raw sequencing data

These datasets are documented in the 
[reference manual](https://bioconductor.org/packages/devel/data/experiment/manuals/curatedMetagenomicData/man/curatedMetagenomicData.pdf).

# Using `curatedMetagenomicData` Resources

Because `curatedMetagenomicData` utilizes Bioconductor's `r BiocStyle::Biocpkg("ExperimentHub")` platform, data are stored as Amazon S3 buckets and only downloaded individually when requested. Local caching occurs automatically; see `r BiocStyle::Biocpkg("ExperimentHub")` documentation to customize options such as where the cache is stored (by default it is `~/.ExperimentHub`). In case you haven't installed it yet, do:

```{r, message=FALSE}
# BiocManager::install("curatedMetagenomicData")
library(curatedMetagenomicData)
```

Note that newly datasets require installing the [development version](http://bioconductor.org/developers/how-to/useDevel/) of Bioconductor. 

## Available samples and metadata

The manually curated metadata for all available samples are provided in a single table `combined_metadata`:

```{r eval=FALSE}
?combined_metadata
View(combined_metadata)
```

```{r}
table(combined_metadata$antibiotics_current_use)
table(combined_metadata$disease)
```

### Read depth of all samples across all studies
`combined_metadata` also provides technical information for each sample like sequencing platform, read length, and read depth. The following uses `combined_metadata` to create a boxplot of read depth for each sample in each study, with boxes colored by body site. First, create a ranking of datasets by median read depth:

```{r}
dsranking <- combined_metadata %>%
  as.data.frame() %>% 
  group_by(dataset_name) %>%
  summarize(mediandepth = median(number_reads) / 1e6) %>%
  mutate(dsorder = rank(mediandepth)) %>%
  arrange(dsorder)
dsranking
```

Create a factor `ds` that is ordered according to the ranking by median read depth, to show datasets in order from lowest to highest median read depth, then create the box plot.
```{r}
suppressPackageStartupMessages(library(ggplot2))
combined_metadata %>%
  as.data.frame() %>%
  mutate(ds = factor(combined_metadata$dataset_name, levels=dsranking$dataset_name)) %>%
  ggplot(aes(ds, number_reads / 1e6, fill=body_site)) + 
  geom_boxplot() +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  labs(x="Dataset", y="Read Depth (millions)")
```

For the datasets that provide both stool and oral cavity profiles, notice how much lower the read depth of oral cavity profiles is due to the higher proportions of removed human DNA reads.

# Accessing datasets

Individual data projects can be fetched via per-dataset functions or the `curatedMetagenomicData()` function. A function call to a dataset name returns a Bioconductor ExpressionSet object:

```{r, message=FALSE}
suppressPackageStartupMessages(library(curatedMetagenomicData))
loman.eset = LomanNJ_2013.metaphlan_bugs_list.stool()
```

However, the above approach lacks additional options and versioning provided by the curatedMetagenomicData function, which returns a list of datasets. In this case
the list has length 1:

```{r}
loman <- curatedMetagenomicData("LomanNJ_2013.metaphlan_bugs_list.stool", dryrun = FALSE)
loman
loman.eset <- loman[[1]]
loman.eset
```

The following creates a list of two `ExpressionSet` objects providing the BritoIL_2016 and Castro-NallarE_2015 oral cavity taxonomic profiles:

```{r, message=FALSE, results='hide'}
oral <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
          "Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
esl
esl[[1]]
esl[[2]]
```

And the following would provide all stool metaphlan datasets if `dryrun = FALSE` were set:

```{r, eval=TRUE}
curatedMetagenomicData("*metaphlan_bugs_list.stool*", dryrun = TRUE)
```

## Merging multiple datasets

The following merges the two oral cavity datasets downloaded above into a single ExpressionSet. 

```{r}
eset <- mergeData(esl)
eset
```

This works for any number of datasets. The function will not stop you from merging different data types (e.g. metaphlan bugs lists with gene families), but you probably don't want to do that.

# Using `ExpressionSet` Objects

All datasets are represented as `ExpressionSet` objects because of the integrative nature of the class and its ability to bind data and metadata. There are three main functions, from the `Biobase` package, that provide access to experiment-level metadata, subject-level metadata, and the data itself.

To access the experiment-level metadata the `experimentData()` function is used to return a `MIAME` (Minimum Information About a Microarray Experiment) object.

```{r}
experimentData( loman.eset )
```

To access the subject-level metadata the `pData()` function is used to return a `data.frame` containing subject-level variables of study.

```{r}
head( pData( loman.eset ) )
```

To access the data itself (in this case relative abundance), the `exprs()` function returns a variables by samples (rows by columns) numeric matrix. Note the presence of "synthetic" clades at all levels of the taxonomy, starting with kingdom, e.g. k__bacteria here:

```{r}
exprs( loman.eset )[1:6, 1:5]  #first 6 rows and 5 columns
```

Bioconductor provides further documentation of the ExpressionSet class and has published an excellent [introduction](https://tinyurl.com/ExpressionSetIntro).

## Estimating Absolute Raw Count Data

Absolute raw count data can be estimated from the relative count data by multiplying the columns of the `ExpressionSet` data by the number of reads for each sample, as found in the `pData` column "number_reads". For demo purposes you could (but don't have to!) do this manually by dividing by 100 and multiplying by the number of reads:

```{r}
loman.counts = sweep(exprs( loman.eset ), 2, loman.eset$number_reads / 100, "*")
loman.counts = round(loman.counts)
loman.counts[1:6, 1:5]
```

or just set the `counts` argument in `curatedMetagenomicData()` to `TRUE`:

```{r}
loman.eset2 = curatedMetagenomicData("LomanNJ_2013.metaphlan_bugs_list.stool",
                                     counts = TRUE, dryrun = FALSE)[[1]]
all.equal(exprs(loman.eset2), loman.counts)
```


# E. coli prevalence

Here's a direct, exploratory analysis of *E. coli* prevalence in the Loman dataset using the `ExpressionSet` object. More elegant solutions will be provided later using subsetting methods provided by the `r BiocStyle::Biocpkg("phyloseq")` package, but for users familiar with `grep()` and the `ExpressionSet` object, such manual methods may suffice.

First, which *E. coli*-related taxa are available?

```{r}
grep("coli", rownames(loman.eset), value=TRUE)
```

Create a vector of *E. coli* relative abundances. This `grep` call with a "$" at the end selects the only row that ends with "s__Escherichia_coli":
```{r}
x = exprs( loman.eset )[grep("s__Escherichia_coli$", rownames( loman.eset)), ]
summary( x )
```

This could be plotted as a histogram:
```{r}
hist( x, xlab = "Relative Abundance", main="Prevalence of E. Coli",
      breaks="FD")
```

# Taxonomy-Aware Analysis using `phyloseq`

For the MetaPhlAn2 bugs datasets (but not other data types), you gain a lot of taxonomy-aware, ecological analysis and plotting by conversion to a `r BiocStyle::Biocpkg("phyloseq")` class object. `r BiocStyle::Biocpkg("curatedMetagenomicData")` provides the `ExpressionSet2phyloseq()` function to make this easy:

```{r, warning=FALSE}
suppressPackageStartupMessages(library(phyloseq))
loman.pseq = ExpressionSet2phyloseq( loman.eset )
```

Note the simplified row names of the OTU table, showing only the most detailed level of the taxonomy. This results from the default argument `simplify=TRUE`, which is convenient and lossless because taxonomic information is now attainable by `tax_table(loman.pseq2)`

## phylogenetic trees and UniFrac distances

Set **phylogenetictree = TRUE** to include a phylogenetic tree in the `phyloseq` object:
```{r}
loman.tree <- ExpressionSet2phyloseq( loman.eset, phylogenetictree = TRUE)
```

```{r}
wt = UniFrac(loman.tree, weighted=TRUE, normalized=FALSE, 
             parallel=FALSE, fast=TRUE)
plot(hclust(wt), main="Weighted UniFrac distances")
```


## Components of a phyloseq object

This  `r BiocStyle::Biocpkg("phyloseq")` objects contain 3 components, with extractor functions hinted at by its show method:

```{r}
loman.pseq
```

`otu_table()` returns the same thing as `exprs( loman.eset )` did, the Operational Taxanomic Unit (OTU) table. Here are the first 6 rows and 5 columns:

```{r, warning=FALSE}
otu_table( loman.pseq )[1:6, 1:5]
```

The same patient or participant data that was available from  `pData( loman.eset )` is now availble using `sample_data()` on 
the `r BiocStyle::Biocpkg("phyloseq")` object:

```{r, warning=FALSE}
sample_data( loman.pseq )[1:6, 1:5]
```

But this object also is aware of the taxonomic structure, which will enable the powerful subsetting methods of the `r BiocStyle::Biocpkg("phyloseq")` package.

```{r, warning=FALSE}
head( tax_table( loman.pseq ) )
```

## Subsetting / Pruning

The process of subsetting begins with the names of taxonomic ranks:

```{r, warning=FALSE}
rank_names( loman.pseq )
```

Taxa can be filtered by these rank names. For example, to return an object with only species and strains:

```{r, warning=FALSE}
subset_taxa( loman.pseq, !is.na(Species))
```

To keep only phylum-level data (not class or finer, and not kingdom-level):

```{r}
subset_taxa( loman.pseq, is.na(Class) & !is.na(Phylum))
```

Or to keep only Bacteroidetes phylum. Note that taxa names have been shortened from the rownames of the `ExpressionSet` object, for nicer plotting.

```{r}
loman.bd = subset_taxa( loman.pseq, Phylum == "Bacteroidetes")
head( taxa_names( loman.bd ) )
```

## Advanced Pruning

The `r BiocStyle::Biocpkg("phyloseq")` package provides advanced pruning of taxa, such as the following which keeps only taxa that are among the most abundant 5% in at least five samples:

```{r, warning=FALSE}
keepotu = genefilter_sample(loman.pseq, filterfun_sample(topp(0.05)), A=5)
summary(keepotu)
subset_taxa(loman.pseq, keepotu)
```

Note that `r BiocStyle::Biocpkg("phyloseq")` also provides `topk()` for selecting the most abundant `k` taxa, and other functions for advanced pruning of taxa.

## Taxonomy Heatmap

The `r BiocStyle::Biocpkg("phyloseq")` package provides the `plot_heatmap()` function to create heatmaps using a variety of built-in dissimilarity metrics for clustering. Here, we apply the same abundance filter as above, keep only strain-level OTUs. This function supports a large number of distance and ordination methods, here we use Bray-Curtis dissimilarity for distance and PCoA as the ordination method for organizing the heatmap.

```{r, warning=FALSE}
loman.filt = subset_taxa(loman.pseq, keepotu & !is.na(Strain))
plot_heatmap(loman.filt, method="PCoA", distance="bray")
```

## Taxonomy Histogram

Here we plot the top 20 most abundant species (not strains), defined by the sum of abundance across all samples in the dataset:

```{r, warning=FALSE}
loman.sp = subset_taxa(loman.pseq, !is.na(Species) & is.na(Strain))
par(mar = c(20, 4, 0, 0) + 0.15) #increase margin size on the bottom
barplot(sort(taxa_sums(loman.sp), TRUE)[1:20] / nsamples(loman.sp),
        ylab = "Total counts", las = 2)
```

## Alpha Diversity Estimation

The `r BiocStyle::Biocpkg("phyloseq")` package calculates numerous alpha diversity measures. Here we compare three diversity in the species-level data, stratifying by stool texture:

```{r, warning=FALSE}
alphas = c("Shannon", "Simpson", "InvSimpson")
plot_richness(loman.sp, "stool_texture", measures = alphas)
```
Let's compare these three alpha diversity measures:

```{r, warning=FALSE}
pairs( estimate_richness(loman.sp, measures = alphas) )
```

## Beta Diversity / Dissimilarity Clustering

Numerous beta diversity / dissimilarity are provided by the `distance()` function when provided a `r BiocStyle::Biocpkg("phyloseq")` object, and these can be used for any kind of clustering or classification scheme. For example, here is a hierarchical clustering dendrogram produced by the `hclust()` from the base R `stats` package with "Ward" linkage:

```{r, warning=FALSE}
mydist = distance(loman.sp, method="bray")
myhclust = hclust( mydist )
plot(myhclust, main="Bray-Curtis Dissimilarity", 
     method="ward.D", xlab="Samples", sub = "")
```

## Ordination Analysis

The `r BiocStyle::Biocpkg("phyloseq")` package provides a variety of ordination methods, with convenient options for labelling points. Here is a 
Principal Coordinates Analysis plot of species-level taxa from the Loman dataset, using Bray-Curtis distance:

```{r, warning=FALSE}
ordinated_taxa = ordinate(loman.sp, method="PCoA", distance="bray")
plot_ordination(loman.sp, ordinated_taxa, color="stool_texture", 
                title = "Bray-Curtis Principal Coordinates Analysis")
```


```{r}
plot_scree(ordinated_taxa, title="Screeplot")
```

# Using `r BiocStyle::Biocpkg("ExperimentHub")` directly

We recommend that most users use the convenience functions shown above to find `curatedMetagenomicData` datasets and navigate versions. However, it is also possible to use `r BiocStyle::Biocpkg("ExperimentHub")` directly for consistency with other `r BiocStyle::Biocpkg("ExperimentHub")` packages.

## Browsing `r BiocStyle::Biocpkg("ExperimentHub")`

In the next section we will demonstrate convenience functions that make `r BiocStyle::Biocpkg("ExperimentHub")` transparent, but it can be useful and powerful to interact directly with `r BiocStyle::Biocpkg("ExperimentHub")`. A "hub" connects you to the `r BiocStyle::Biocpkg("ExperimentHub")` server and its metadata, without downloading any data:

```{r}
suppressPackageStartupMessages(library(ExperimentHub))
eh = ExperimentHub()
```

The following queries ExperimentHub for all records matching `curatedMetagenomicData`:

```{r}
myquery = query(eh, "curatedMetagenomicData")
```

This is an abbreviated list of what was found:

```{r}
myquery
```

Note that the first column, "EH178" etc, are unique IDs for each dataset. For a full list, `mcols(myquery)` produces a `data.frame`. The following is unevaluated in this script, but can be used to display an interactive spreadsheet of the results:

```{r, eval=FALSE}
View(mcols(myquery))
```

Or to write the search results to a csv file:

```{r, eval=FALSE}
write.csv(mcols(myquery), file="curatedMetagenomicData_allrecords.csv")
```

## Advanced searching of `r BiocStyle::Biocpkg("ExperimentHub")`

Tags can (eventually) help identify useful datasets:
```{r}
head(myquery$tags)
```

Titles can be used to select body site and data type. Here are the MetaPhlan2 taxonomy tables for all the available Human Microbiome Project (HMP) body sites, found using a Perl regular expression to find "HMP" at the beginning of a string, followed by any number of characters ".*", followed by the word "metaphlan":

```{r}
grep("(?=HMP)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)
```

Here are all the metabolic pathway abundance for the stool body site:
```{r}
grep("(?=.*stool)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)
```

Or, here are all the data products available for the "Loman" study:
```{r}
(lomannames = grep("LomanNJ_2013", myquery$title, perl=TRUE, val=TRUE))
```

We could create a list of `ExpressionSet` objects containing all of the Loman products as follows (not evaluated):

```{r, eval=FALSE}
(idx = grep("LomanNJ_2013", myquery$title, perl=TRUE))
loman.list = lapply(idx, function(i){
  return(myquery[[i]])
})
names(loman.list) = lomannames
loman.list
```


# Addition of Datasets to `curatedMetagenomicData`

Authors welcome the addition of new metagenomic datasets provided that the raw data are hosted by NCBI/SRA and can be run through our [MetaPhlAn2 and HUMAnN2 pipeline](https://github.com/waldronlab/curatedMetagenomicData/tree/master/inst/pipeline). You can request the addition of a dataset by opening an issue on the [issue tracker](https://github.com/waldronlab/curatedMetagenomicData/issues), pointing us to the publication and raw data. 

You can speed up our ability to incorporate a new dataset by providing curated metadata. See https://github.com/waldronlab/curatedMetagenomicData for how to curate a new dataset.

# Reporting Bugs and Errors in Curation

Development of the `curatedMetagenomicData` package occurs on GitHub. Please visit the
[project repository](https://github.com/waldronlab/curatedMetagenomicData) and report software bugs and data problems on our [issue tracker](https://github.com/waldronlab/curatedMetagenomicData/issues).

# Other Issues

If you have an issue that is not documented elsewhere, visit the Bioconductor support site at https://support.bioconductor.org/, briefly describe your issue, and add the tag `curatedMetagenomicData`.