---
title: "tximeta: transcript quantification import with automatic metadata"
author: "Michael Love, Rob Patro, Charlotte Soneson, Peter Hickey"
date: "`r format(Sys.time(), '%m/%d/%Y')`"
output: 
  rmarkdown::html_document:
    highlight: tango
abstract: >
  `tximeta` performs numerous annotation and metadata gathering tasks on
  behalf of users during the import of transcript quantifications from
  *Salmon* or *Sailfish* into R/Bioconductor. Metadata and transcript
  ranges are added automatically, facilitating combining multiple
  genomic datasets and helping to prevent bioinformatic errors.
vignette: |
  %\VignetteIndexEntry{Transcript quantification import with automatic metadata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Analysis starts with sample table

The first step using `tximeta` is to read in the sample table, which
will become the *column data*, `colData`, of the final object,
a *SummarizedExperiment*. The sample table should contain all the
information we need to identify the *Salmon* quantification
directories. Here we will use a *Salmon* quantification file in the
*tximportData* package to demonstrate the usage of `tximeta`. We do
not have a sample table, so we construct one in R. It is recommended
to keep a sample table as a CSV or TSV file while working on an
RNA-seq project with multiple samples.

```{r}
dir <- system.file("extdata/salmon_dm", package="tximportData")
# here gzipped, normally these are not
files <- file.path(dir, "SRR1197474_cdna", "quant.sf.gz") 
file.exists(files)
coldata <- data.frame(files, names="SRR1197474", condition="A", stringsAsFactors=FALSE)
coldata
```

`tximeta` expects at least two columns in `coldata`: 

1. `files` - a pointer to the `quant.sf` files
2. `names` - the unique names that should be used to identify samples

# Running tximeta from a sample table

Normally, we would just run `tximeta` like so:

```{r eval=FALSE}
library(tximeta)
se <- tximeta(coldata)
```

However, to avoid downloading remote GTF files during this vignette,
we will point to a GTF file saved locally (in the *tximportData*
package). We link the transcriptome of the *Salmon* index to its
locally saved GTF. The standard recommended usage of `tximeta` would
be the code chunk above, or to specify a remote GTF source, not a
local one. This following code is therefore not recommended for a
typically workflow, but is particular to the vignette code.

```{r}
dir <- system.file("extdata", package="tximeta")
indexDir <- file.path(dir, "Drosophila_melanogaster.BDGP6.cdna.v92_salmon_0.10.2")
fastaFTP <- "ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz"
dir2 <- system.file("extdata/salmon_dm", package="tximportData")
gtfPath <- file.path(dir2,"Drosophila_melanogaster.BDGP6.92.gtf.gz")
suppressPackageStartupMessages(library(tximeta))
makeLinkedTxome(indexDir=indexDir,
                source="Ensembl",
                organism="Drosophila melanogaster",
                release="92",
                genome="BDGP6",
                fasta=fastaFTP,
                gtf=gtfPath,
                write=FALSE)
```

```{r}
library(tximeta)
se <- tximeta(coldata)
```

# What happened? 

`tximeta` recognized the signature of the transcriptome that the files
were quantified against, it accessed the GTF file of the transcriptome
source, found and attached the transcript ranges, and added the
appropriate transcriptome and genome metadata.  A remote GTF is only
downloaded once, and a local or remote GTF is only parsed to build a
*TxDb* once: if `tximeta` recognizes that it has seen this *Salmon*
index before, it will use a cached version of the metadata and
transcript ranges. 

Note the warning above that 9 of the transcripts are missing from the
GTF file and so are dropped from the final output. This is a problem
coming from the annotation source, and not easily avoided by
`tximeta`. 

We plan to create and maintain a large table of signatures for as many
sources, organisms, releases of transcriptomes as possible. `tximeta`
also has functions to support for *linked transcriptomes*, where one or
more sources for transcript sequences have been combined or
filtered. See the **Linked transcriptome** section below for a
demonstration. (The *makeLinkedTxome* function was used above to avoid
downloading the GTF during the vignette building process.)

# Examining SummarizedExperiment output

We, of course, have our coldata from before. Note that we've removed `files`.

```{r}
suppressPackageStartupMessages(library(SummarizedExperiment))
colData(se)
```

Here we show the three matrices that were imported. `tximeta` does not
yet support import of inferential replicates (Gibbs samples or
bootstrap samples), but this functionality will be added in a future
version.

```{r}
assayNames(se)
```

`tximeta` has imported the correct ranges for the transcripts:

```{r}
rowRanges(se)
```

We have appropriate genome information, which prevents us from making 
bioinformatic mistakes:

```{r}
seqinfo(se)
```

# Easy summarization to gene-level

Because the SummarizedExperiment maintains all the metadata of its
creation, it also keeps a pointer to the necessary database for
summarizing transcript-level quantifications and bias corrections to
the gene-level. If necessary, `summarizeToGene` can pull down the
remote source for summarization, but given that we've already built a
TxDb once, it simply loads the stashed version. In order to remove the
stashed TxDb and regenerate, one can remove the relevant entry from
the `tximeta` file cache that resides at the location given by
`getTximetaBFC()`.

```{r}
gse <- summarizeToGene(se)
rowRanges(gse)
```

# Add different identifiers

We would like to add support to easily map transcript or gene
identifiers from one annotation to another. This is just a prototype
function, but we show how we can easily add alternate IDs given that we
know the organism and the source of the transcriptome. (This function
currently only works for Gencode and Ensembl gene or transcript IDs
but could be extended to work for arbitrary sources.)

```{r}
library(org.Dm.eg.db)
gse <- addIds(gse, "REFSEQ", gene=TRUE)
mcols(gse)
```

# Run a differential expression analysis

The following code chunk demonstrates how to build a *DESeqDataSet*
and begin a differential expression analysis. 

```{r}
suppressPackageStartupMessages(library(DESeq2))
# here there is a single sample so we use ~1.
# expect a warning that there is only a single sample...
suppressWarnings({dds <- DESeqDataSet(gse, ~1)})
dds <- estimateSizeFactors(dds)
# ... and so on
```

The following code chunk demonstrates how to build a *DGEList* object,
for use with *edgeR* (for example of generating an object for running
*limma*, see details in the *tximport* vignette).

```{r}
suppressPackageStartupMessages(library(edgeR))
cts <- assays(gse)[["counts"]]
normMat <- assays(gse)[["length"]]
normMat <- normMat / exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y <- scaleOffset(y, t(t(log(normMat)) + o))
# ... see edgeR User's Guide for further steps
```

# Metadata galore

The following information is attached to the *SummarizedExperiment* by
`tximeta`: 

```{r}
names(metadata(se))
str(metadata(se)[["quantInfo"]])
str(metadata(se)[["txomeInfo"]])
str(metadata(se)[["tximetaInfo"]])
str(metadata(se)[["txdbInfo"]])
```

# Quantification files with an unknown transcriptome

`tximeta` automatically imports relevant metadata when the
transcriptome matches a known source, but also facilitates the
linking of transcriptomes used as for a *Salmon* index with relevant
public sources. The linking is important in the case that the
transcript sequence no longer matches a known source (combined or
filtered FASTA files), or if the source is not known to
`tximeta`. Below we demonstrate how to make a *linkedTxome* and how
to share and load a *linkedTxome*.

Here we point to *Salmon* quantification files which were quantified
against a transcriptome combining two Ensembl FASTA files: the cDNA
and the non-coding transcripts for *Drosophila melanogaster*.

```{r}
dir <- system.file("extdata/salmon_dm/SRR1197474", package="tximportData")
file <- file.path(dir, "quant.sf.gz")
file.exists(file)
coldata <- data.frame(files=file, names="SRR1197474", sample="1",
                      stringsAsFactors=FALSE)
```

Trying to import the files gives a message that `tximeta` couldn't find
a matching transcriptome, so it returns an un-ranged
*SummarizedExperiment*. 

```{r}
se <- tximeta(coldata)
```

# Linked transcriptome for reproducible analysis

If the transcriptome used to generate the *Salmon* index does not
match any transcriptomes from known sources (e.g. from combining or filtering
known transcriptome files), there is not much that can be done to
automatically populate the metadata during quantification
import. However, we can facilitate the following two cases: 

1) the transcriptome was created locally and has been linked to its
public source(s) 
2) the transcriptome was produced by another group, and
they have produced and shared a file that links the transcriptome to
public source(s)

`tximeta` offers functionality to assist reproducible analysis in both
of these cases.

In the case of the quantification file above, the transcriptome was
generated locally by downloading and combining the Ensembl cDNA and
non-coding FASTA files *Drosophila melanogaster*, release 92. The
following un-evaluated command line code chunk reproduces the
production of the transcriptome from publicly available sources.

```
wget ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz 
wget ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.ncrna.fa.gz
cat Drosophila_melanogaster.BDGP6.cdna.all.fa.gz Drosophila_melanogaster.BDGP6.ncrna.fa.gz > Drosophila_melanogaster.BDGP6.v92.fa.gz
```

To make this quantification reproducible, we make a `linkedTxome`
which records key information about the sources of the transcript
FASTA files, and the location of the relevant GTF file. It also
records the signature of the transcriptome that was computed by
*Salmon* during the `index` step.

By default, `linkedTxome` will write out a JSON file which can be
shared with others, linking the signature of the index with the other
metadata, including FASTA and GTF sources. By default, it will write
out to a file with the same name as the `indexDir`, but with a `.json`
extension added. This can be prevented with `write=FALSE`, and the
file location can be changed with `jsonFile`.

First we specify the path where the *Salmon* index is located. 

Typically you would not use `system.file` to find this directory, but
simply define `indexDir` to be the path of the *Salmon* directory on
your machine. Here we use `system.file` because we have included parts
of a *Salmon* index directory in the *tximeta* package itself for
demonstration of functionality in this vignette.

```{r}
dir <- system.file("extdata", package="tximeta")
indexDir <- file.path(dir, "Drosophila_melanogaster.BDGP6.v92_salmon_0.10.2")
```

Now we provide the location of the FASTA files and the GTF file for
this transcriptome. The recommended usage of `tximeta` would be to
specify a remote GTF source, as seen in the commented-out line below:

```{r}
fastaFTP <- c("ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz",
              "ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.ncrna.fa.gz")
#gtfFTP <- "ftp://ftp.ensembl.org/pub/release-92/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.92.gtf.gz"
```

Instead of the above commented-out FTP location for the GTF file, we
specify a location within an R package. This step is just to avoid
downloading from a remote FTP during vignette building. This use of
`system.file` to point to a file in an R package is specific to this
vignette and would not be used in a typical workflow.

```{r}
dir2 <- system.file("extdata/salmon_dm", package="tximportData")
gtfPath <- file.path(dir2,"Drosophila_melanogaster.BDGP6.92.gtf.gz")
```

Finally, we create a *linkedTxome*.  In this vignette, we point to a
temporary directory for the JSON file, but a more typical workflow
would write the JSON file to the same location as the *Salmon* index
by not specifying `jsonFile`.

`makeLinkedTxome` performs two operation: (1) it creates a new entry in
an internal table that links the transcriptome used in the *Salmon*
index to its sources, and (2) it creates a JSON file such that this
*linkedTxome* can be shared.

```{r}
tmp <- tempdir()
jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json"))
makeLinkedTxome(indexDir=indexDir,
                source="Ensembl", organism="Drosophila melanogaster",
                release="92", genome="BDGP6",
                fasta=fastaFTP, gtf=gtfPath,
                jsonFile=jsonFile)
```

After running `makeLinkedTxome`, the connection between this *Salmon*
index (and its signature) with the sources is saved for persistent
usage.

With use of `tximeta` and a *linkedTxome* -- as with `tximeta` on a
known, un-filtered, un-combined transcriptome -- the software
figures out if the remote GTF has been accessed and compiled into a
*TxDb* before, and on future calls, it will simply load the
pre-computed metadata and transcript ranges.

Note the warning that 9 of the transcripts are missing from the GTF
file and so are dropped from the final output. This is a problem
coming from the annotation source, and not easily avoided by
`tximeta`. 

```{r}
se <- tximeta(coldata)
```

We can see that the appropriate metadata and transcript ranges are
attached.

```{r}
rowRanges(se)
seqinfo(se)
```

# Clear *linkedTxomes*

The following code removes the entire table with information about the
*linkedTxomes*. This is just for demonstration, so that we can show
how to load a JSON file below.

**Note:** Running this code will clear any information about
*linkedTxomes*. Don't run this unless you really want to clear this
table!

```{r}
library(BiocFileCache)
if (interactive()) {
  bfcloc <- getTximetaBFC()
} else {
  bfcloc <- tempdir()
}
bfc <- BiocFileCache(bfcloc)
bfcinfo(bfc)
bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid)
bfcinfo(bfc)
```

# Loading *linkedTxome* JSON files

If a collaborator or the Suppmentary Files for a publication shares a
`linkedTxome` JSON file, we can likewise use `tximeta` to
automatically assemble the relevant metadata and transcript
ranges. This implies that the other person has used `tximeta` with the
function `makeLinkedTxome` demonstrated above, pointing to their
*Salmon* index and to the FASTA and GTF source(s).

We point to the JSON file and use `loadLinkedTxome` and then the
relevant metadata is saved for persistent usage. In this case, we
saved the JSON file in a temporary directory.

```{r}
jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json"))
loadLinkedTxome(jsonFile)
```

Again, using `tximeta` figures out whether it needs to access the
remote GTF or not, and assembles the appropriate object on the user's
behalf.

```{r}
se <- tximeta(coldata)
```

# Clear *linkedTxomes*

Finally, we clear the *linkedTxomes* table again so that the above
examples will work. This is just for the vignette code and not part of
a typical workflow.

**Note:** Running this code will clear any information about
*linkedTxomes*. Don't run this unless you really want to clear this
table!

```{r}
if (interactive()) {
  bfcloc <- getTximetaBFC()
} else {
  bfcloc <- tempdir()
}
bfc <- BiocFileCache(bfcloc)
bfcinfo(bfc)
bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid)
bfcinfo(bfc)
```

# Other quantifiers

`tximeta` can import the output from any quantifiers that are
supported by `tximport`, and if these are not *Salmon* or *Sailfish*
output, it will simply return a un-ranged *SummarizedExperiment*. We
are working to allow manually passing of the hash value of the
transcriptome, the cDNA sequences of which can be hashed with
[FastaDigest](https://github.com/COMBINE-lab/FastaDigest) 
(can be installed with `pip install fasta_digest`).

# Acknowledgments

The development of *tximeta* has benefited from suggestions from these
and other individuals in the community:

* Vincent Carey
* Lori Shepherd
* Martin Morgan
* Koen Van den Berge

# Next steps

### Basic functionality

* Switching `rowRanges` from transcript ranges to exons-by-transcript
  ranges list, or from gene ranges to exons-by-gene ranges list.
* As is already supported in `tximport`, also import inferential
  variance matrices (Gibbs samples or bootstrap samples)

### Facilitate plots and summaries
    
* Basic plots across samples: abundances, mapping rates, rich bias model parameters
* Time summaries: when quantified? when imported? I would love to
  know when the library was prepared and sequenced but this seems hopeless.

### Challenges

* Building out actual, sustainable plan for supporting as many
  organisms and sources as possible. We can define rules which
  determine where the FASTA and GTF files will be based on `source` and
  `release` (also here we ignored something like "type", e.g. CHR
  or ALL gene files from Gencode)
* Some support already for linked transcriptomes, see `linkedTxomes`
  vignette. Need to work more on combining multiple sources
  (potentially meta-transcriptomes from different organisms?), and
  also on how to approach de novo transcriptomes, and how to support
  reproducibility there.
* Facilitate functional annotation, either with vignettes/workflow or
  with additional functionality. E.g.: 
  housekeeping genes, arbitrary gene sets, genes expressed in GTEx tissues
* `liftOver` is clunky and doesn't integrate with
  *GenomeInfoDb*. It requires user input and there's a chance to
  mis-annotate. Ideally this should all be automated.

# Session info

```{r}
library(devtools)
session_info()
```