--- title: "tximeta: transcript quantification import with automatic metadata" author: "Michael Love, Rob Patro, Charlotte Soneson, Peter Hickey" date: "`r format(Sys.time(), '%m/%d/%Y')`" output: rmarkdown::html_document: highlight: tango abstract: > `tximeta` performs numerous annotation and metadata gathering tasks on behalf of users during the import of transcript quantifications from *Salmon* or *Sailfish* into R/Bioconductor. Metadata and transcript ranges are added automatically, facilitating combining multiple genomic datasets and helping to prevent bioinformatic errors. vignette: | %\VignetteIndexEntry{Transcript quantification import with automatic metadata} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Analysis starts with sample table The first step using `tximeta` is to read in the sample table, which will become the *column data*, `colData`, of the final object, a *SummarizedExperiment*. The sample table should contain all the information we need to identify the *Salmon* quantification directories. Here we will use a *Salmon* quantification file in the *tximportData* package to demonstrate the usage of `tximeta`. We do not have a sample table, so we construct one in R. It is recommended to keep a sample table as a CSV or TSV file while working on an RNA-seq project with multiple samples. ```{r} dir <- system.file("extdata/salmon_dm", package="tximportData") # here gzipped, normally these are not files <- file.path(dir, "SRR1197474_cdna", "quant.sf.gz") file.exists(files) coldata <- data.frame(files, names="SRR1197474", condition="A", stringsAsFactors=FALSE) coldata ``` `tximeta` expects at least two columns in `coldata`: 1. `files` - a pointer to the `quant.sf` files 2. `names` - the unique names that should be used to identify samples # Running tximeta from a sample table Normally, we would just run `tximeta` like so: ```{r eval=FALSE} library(tximeta) se <- tximeta(coldata) ``` However, to avoid downloading remote GTF files during this vignette, we will point to a GTF file saved locally (in the *tximportData* package). We link the transcriptome of the *Salmon* index to its locally saved GTF. The standard recommended usage of `tximeta` would be the code chunk above, or to specify a remote GTF source, not a local one. This following code is therefore not recommended for a typically workflow, but is particular to the vignette code. ```{r} dir <- system.file("extdata", package="tximeta") indexDir <- file.path(dir, "Drosophila_melanogaster.BDGP6.cdna.v92_salmon_0.10.2") fastaFTP <- "ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz" dir2 <- system.file("extdata/salmon_dm", package="tximportData") gtfPath <- file.path(dir2,"Drosophila_melanogaster.BDGP6.92.gtf.gz") suppressPackageStartupMessages(library(tximeta)) makeLinkedTxome(indexDir=indexDir, source="Ensembl", organism="Drosophila melanogaster", release="92", genome="BDGP6", fasta=fastaFTP, gtf=gtfPath, write=FALSE) ``` ```{r} library(tximeta) se <- tximeta(coldata) ``` # What happened? `tximeta` recognized the signature of the transcriptome that the files were quantified against, it accessed the GTF file of the transcriptome source, found and attached the transcript ranges, and added the appropriate transcriptome and genome metadata. A remote GTF is only downloaded once, and a local or remote GTF is only parsed to build a *TxDb* once: if `tximeta` recognizes that it has seen this *Salmon* index before, it will use a cached version of the metadata and transcript ranges. Note the warning above that 9 of the transcripts are missing from the GTF file and so are dropped from the final output. This is a problem coming from the annotation source, and not easily avoided by `tximeta`. We plan to create and maintain a large table of signatures for as many sources, organisms, releases of transcriptomes as possible. `tximeta` also has functions to support for *linked transcriptomes*, where one or more sources for transcript sequences have been combined or filtered. See the **Linked transcriptome** section below for a demonstration. (The *makeLinkedTxome* function was used above to avoid downloading the GTF during the vignette building process.) # Examining SummarizedExperiment output We, of course, have our coldata from before. Note that we've removed `files`. ```{r} suppressPackageStartupMessages(library(SummarizedExperiment)) colData(se) ``` Here we show the three matrices that were imported. `tximeta` does not yet support import of inferential replicates (Gibbs samples or bootstrap samples), but this functionality will be added in a future version. ```{r} assayNames(se) ``` `tximeta` has imported the correct ranges for the transcripts: ```{r} rowRanges(se) ``` We have appropriate genome information, which prevents us from making bioinformatic mistakes: ```{r} seqinfo(se) ``` # Easy summarization to gene-level Because the SummarizedExperiment maintains all the metadata of its creation, it also keeps a pointer to the necessary database for summarizing transcript-level quantifications and bias corrections to the gene-level. If necessary, `summarizeToGene` can pull down the remote source for summarization, but given that we've already built a TxDb once, it simply loads the stashed version. In order to remove the stashed TxDb and regenerate, one can remove the relevant entry from the `tximeta` file cache that resides at the location given by `getTximetaBFC()`. ```{r} gse <- summarizeToGene(se) rowRanges(gse) ``` # Add different identifiers We would like to add support to easily map transcript or gene identifiers from one annotation to another. This is just a prototype function, but we show how we can easily add alternate IDs given that we know the organism and the source of the transcriptome. (This function currently only works for Gencode and Ensembl gene or transcript IDs but could be extended to work for arbitrary sources.) ```{r} library(org.Dm.eg.db) gse <- addIds(gse, "REFSEQ", gene=TRUE) mcols(gse) ``` # Run a differential expression analysis The following code chunk demonstrates how to build a *DESeqDataSet* and begin a differential expression analysis. ```{r} suppressPackageStartupMessages(library(DESeq2)) # here there is a single sample so we use ~1. # expect a warning that there is only a single sample... suppressWarnings({dds <- DESeqDataSet(gse, ~1)}) dds <- estimateSizeFactors(dds) # ... and so on ``` The following code chunk demonstrates how to build a *DGEList* object, for use with *edgeR* (for example of generating an object for running *limma*, see details in the *tximport* vignette). ```{r} suppressPackageStartupMessages(library(edgeR)) cts <- assays(gse)[["counts"]] normMat <- assays(gse)[["length"]] normMat <- normMat / exp(rowMeans(log(normMat))) o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat)) y <- DGEList(cts) y <- scaleOffset(y, t(t(log(normMat)) + o)) # ... see edgeR User's Guide for further steps ``` # Metadata galore The following information is attached to the *SummarizedExperiment* by `tximeta`: ```{r} names(metadata(se)) str(metadata(se)[["quantInfo"]]) str(metadata(se)[["txomeInfo"]]) str(metadata(se)[["tximetaInfo"]]) str(metadata(se)[["txdbInfo"]]) ``` # Quantification files with an unknown transcriptome `tximeta` automatically imports relevant metadata when the transcriptome matches a known source, but also facilitates the linking of transcriptomes used as for a *Salmon* index with relevant public sources. The linking is important in the case that the transcript sequence no longer matches a known source (combined or filtered FASTA files), or if the source is not known to `tximeta`. Below we demonstrate how to make a *linkedTxome* and how to share and load a *linkedTxome*. Here we point to *Salmon* quantification files which were quantified against a transcriptome combining two Ensembl FASTA files: the cDNA and the non-coding transcripts for *Drosophila melanogaster*. ```{r} dir <- system.file("extdata/salmon_dm/SRR1197474", package="tximportData") file <- file.path(dir, "quant.sf.gz") file.exists(file) coldata <- data.frame(files=file, names="SRR1197474", sample="1", stringsAsFactors=FALSE) ``` Trying to import the files gives a message that `tximeta` couldn't find a matching transcriptome, so it returns an un-ranged *SummarizedExperiment*. ```{r} se <- tximeta(coldata) ``` # Linked transcriptome for reproducible analysis If the transcriptome used to generate the *Salmon* index does not match any transcriptomes from known sources (e.g. from combining or filtering known transcriptome files), there is not much that can be done to automatically populate the metadata during quantification import. However, we can facilitate the following two cases: 1) the transcriptome was created locally and has been linked to its public source(s) 2) the transcriptome was produced by another group, and they have produced and shared a file that links the transcriptome to public source(s) `tximeta` offers functionality to assist reproducible analysis in both of these cases. In the case of the quantification file above, the transcriptome was generated locally by downloading and combining the Ensembl cDNA and non-coding FASTA files *Drosophila melanogaster*, release 92. The following un-evaluated command line code chunk reproduces the production of the transcriptome from publicly available sources. ``` wget ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz wget ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.ncrna.fa.gz cat Drosophila_melanogaster.BDGP6.cdna.all.fa.gz Drosophila_melanogaster.BDGP6.ncrna.fa.gz > Drosophila_melanogaster.BDGP6.v92.fa.gz ``` To make this quantification reproducible, we make a `linkedTxome` which records key information about the sources of the transcript FASTA files, and the location of the relevant GTF file. It also records the signature of the transcriptome that was computed by *Salmon* during the `index` step. By default, `linkedTxome` will write out a JSON file which can be shared with others, linking the signature of the index with the other metadata, including FASTA and GTF sources. By default, it will write out to a file with the same name as the `indexDir`, but with a `.json` extension added. This can be prevented with `write=FALSE`, and the file location can be changed with `jsonFile`. First we specify the path where the *Salmon* index is located. Typically you would not use `system.file` to find this directory, but simply define `indexDir` to be the path of the *Salmon* directory on your machine. Here we use `system.file` because we have included parts of a *Salmon* index directory in the *tximeta* package itself for demonstration of functionality in this vignette. ```{r} dir <- system.file("extdata", package="tximeta") indexDir <- file.path(dir, "Drosophila_melanogaster.BDGP6.v92_salmon_0.10.2") ``` Now we provide the location of the FASTA files and the GTF file for this transcriptome. The recommended usage of `tximeta` would be to specify a remote GTF source, as seen in the commented-out line below: ```{r} fastaFTP <- c("ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz", "ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.ncrna.fa.gz") #gtfFTP <- "ftp://ftp.ensembl.org/pub/release-92/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.92.gtf.gz" ``` Instead of the above commented-out FTP location for the GTF file, we specify a location within an R package. This step is just to avoid downloading from a remote FTP during vignette building. This use of `system.file` to point to a file in an R package is specific to this vignette and would not be used in a typical workflow. ```{r} dir2 <- system.file("extdata/salmon_dm", package="tximportData") gtfPath <- file.path(dir2,"Drosophila_melanogaster.BDGP6.92.gtf.gz") ``` Finally, we create a *linkedTxome*. In this vignette, we point to a temporary directory for the JSON file, but a more typical workflow would write the JSON file to the same location as the *Salmon* index by not specifying `jsonFile`. `makeLinkedTxome` performs two operation: (1) it creates a new entry in an internal table that links the transcriptome used in the *Salmon* index to its sources, and (2) it creates a JSON file such that this *linkedTxome* can be shared. ```{r} tmp <- tempdir() jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json")) makeLinkedTxome(indexDir=indexDir, source="Ensembl", organism="Drosophila melanogaster", release="92", genome="BDGP6", fasta=fastaFTP, gtf=gtfPath, jsonFile=jsonFile) ``` After running `makeLinkedTxome`, the connection between this *Salmon* index (and its signature) with the sources is saved for persistent usage. With use of `tximeta` and a *linkedTxome* -- as with `tximeta` on a known, un-filtered, un-combined transcriptome -- the software figures out if the remote GTF has been accessed and compiled into a *TxDb* before, and on future calls, it will simply load the pre-computed metadata and transcript ranges. Note the warning that 9 of the transcripts are missing from the GTF file and so are dropped from the final output. This is a problem coming from the annotation source, and not easily avoided by `tximeta`. ```{r} se <- tximeta(coldata) ``` We can see that the appropriate metadata and transcript ranges are attached. ```{r} rowRanges(se) seqinfo(se) ``` # Clear *linkedTxomes* The following code removes the entire table with information about the *linkedTxomes*. This is just for demonstration, so that we can show how to load a JSON file below. **Note:** Running this code will clear any information about *linkedTxomes*. Don't run this unless you really want to clear this table! ```{r} library(BiocFileCache) if (interactive()) { bfcloc <- getTximetaBFC() } else { bfcloc <- tempdir() } bfc <- BiocFileCache(bfcloc) bfcinfo(bfc) bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid) bfcinfo(bfc) ``` # Loading *linkedTxome* JSON files If a collaborator or the Suppmentary Files for a publication shares a `linkedTxome` JSON file, we can likewise use `tximeta` to automatically assemble the relevant metadata and transcript ranges. This implies that the other person has used `tximeta` with the function `makeLinkedTxome` demonstrated above, pointing to their *Salmon* index and to the FASTA and GTF source(s). We point to the JSON file and use `loadLinkedTxome` and then the relevant metadata is saved for persistent usage. In this case, we saved the JSON file in a temporary directory. ```{r} jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json")) loadLinkedTxome(jsonFile) ``` Again, using `tximeta` figures out whether it needs to access the remote GTF or not, and assembles the appropriate object on the user's behalf. ```{r} se <- tximeta(coldata) ``` # Clear *linkedTxomes* Finally, we clear the *linkedTxomes* table again so that the above examples will work. This is just for the vignette code and not part of a typical workflow. **Note:** Running this code will clear any information about *linkedTxomes*. Don't run this unless you really want to clear this table! ```{r} if (interactive()) { bfcloc <- getTximetaBFC() } else { bfcloc <- tempdir() } bfc <- BiocFileCache(bfcloc) bfcinfo(bfc) bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid) bfcinfo(bfc) ``` # Other quantifiers `tximeta` can import the output from any quantifiers that are supported by `tximport`, and if these are not *Salmon* or *Sailfish* output, it will simply return a un-ranged *SummarizedExperiment*. We are working to allow manually passing of the hash value of the transcriptome, the cDNA sequences of which can be hashed with [FastaDigest](https://github.com/COMBINE-lab/FastaDigest) (can be installed with `pip install fasta_digest`). # Acknowledgments The development of *tximeta* has benefited from suggestions from these and other individuals in the community: * Vincent Carey * Lori Shepherd * Martin Morgan * Koen Van den Berge # Next steps ### Basic functionality * Switching `rowRanges` from transcript ranges to exons-by-transcript ranges list, or from gene ranges to exons-by-gene ranges list. * As is already supported in `tximport`, also import inferential variance matrices (Gibbs samples or bootstrap samples) ### Facilitate plots and summaries * Basic plots across samples: abundances, mapping rates, rich bias model parameters * Time summaries: when quantified? when imported? I would love to know when the library was prepared and sequenced but this seems hopeless. ### Challenges * Building out actual, sustainable plan for supporting as many organisms and sources as possible. We can define rules which determine where the FASTA and GTF files will be based on `source` and `release` (also here we ignored something like "type", e.g. CHR or ALL gene files from Gencode) * Some support already for linked transcriptomes, see `linkedTxomes` vignette. Need to work more on combining multiple sources (potentially meta-transcriptomes from different organisms?), and also on how to approach de novo transcriptomes, and how to support reproducibility there. * Facilitate functional annotation, either with vignettes/workflow or with additional functionality. E.g.: housekeeping genes, arbitrary gene sets, genes expressed in GTEx tissues * `liftOver` is clunky and doesn't integrate with *GenomeInfoDb*. It requires user input and there's a chance to mis-annotate. Ideally this should all be automated. # Session info ```{r} library(devtools) session_info() ```