--- title: "Generating reference files for spliced and unspliced abundance estimation with alignment-free methods" author: "Charlotte Soneson" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Generating reference files for spliced and unspliced abundance estimation with alignment-free methods} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(width = 70) ``` # Introduction In this vignette, we show how to prepare reference files for estimating abundances of spliced and unspliced abundances with alignment-free methods (e.g., [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html), [alevin](https://salmon.readthedocs.io/en/latest/alevin.html) or [kallisto](https://pachterlab.github.io/kallisto/)). Such abundances are used as input, e.g., for estimation of RNA velocity in single-cell data. ```{r setup} library(eisaR) ``` # Generate feature ranges The first step is to generate a `GRangesList` object containing the genomic ranges for the features of interest (spliced transcripts, and either unspliced transcripts or intron sequences). This is done using the `getFeatureRanges()` function, based on a reference GTF file. Here, we exemplify this with a small subset of a GTF file from [Gencode release 28](https://www.gencodegenes.org/human/release_28.html). We extract genomic ranges for spliced transcript and introns, where the latter are defined for each transcript separately (using the same terminology as in the `r Biocpkg("BUSpaRse")` package). For each intron, a flanking sequence of 50 nt is added on each side to accommodate reads mapping across an exon/intron boundary. ```{r} gtf <- system.file("extdata/gencode.v28.annotation.sub.gtf", package = "eisaR") grl <- getFeatureRanges( gtf = gtf, featureType = c("spliced", "intron"), intronType = "separate", flankLength = 50L, joinOverlappingIntrons = FALSE, verbose = TRUE ) ``` The output from `getFeatureRanges()` is a `GRangesList` object, with all the features of interest (both spliced transcripts and introns): ```{r} grl ``` The `metadata` slot of the `GRangesList` object contains a list of the feature IDs for each type of feature: ```{r} lapply(S4Vectors::metadata(grl)$featurelist, head) ``` As we can see, the intron IDs are identified by a `-I` suffix. Each feature is further annotated to a gene ID. For the intronic features, the corresponding gene ID also bears the `-I` suffix appended to the original gene ID. Having separate gene IDs for spliced transcripts and introns allows, for example, joint quantification of spliced and unspliced versions of a gene with alevin. Adding a suffix rather than defining a completely new gene ID allows us to easily match corresponding spliced and unspliced feature IDs. To further simplify the latter, the metadata of the `GRangesList` object returned by `getFeatureRanges()` contains `data.frame`s matching the corresponding gene IDs (as well as transcript IDs, if unspliced transcripts are extracted): ```{r} head(S4Vectors::metadata(grl)$corrgene) ``` # Extract feature sequences Once the genomic ranges of the features of interest are extracted, we can obtain the nucleotide sequences using the `extractTranscriptSeqs()` function from the `r Biocpkg("GenomicFeatures")` package. In addition to the ranges, this requires the genome sequence. This can be obtained, for example, from the corresponding BSgenome package, or from an external FASTA file. ```{r} suppressPackageStartupMessages({ library(BSgenome) library(BSgenome.Hsapiens.UCSC.hg38) }) seqs <- GenomicFeatures::extractTranscriptSeqs( x = BSgenome.Hsapiens.UCSC.hg38, transcripts = grl ) seqs ``` The resulting `DNAStringSet` can be written to a FASTA file and used to generate an index for alignment-free methods such as _Salmon_ and _kallisto_. # Generate an expanded GTF file In addition to extracting feature sequences, we can also export a GTF file with the full set of features. This is useful, for example, in order to generate a linked transcriptome for later import of estimated abundances with `r Biocpkg("tximeta")`. ```{r} exportToGtf( grl, filepath = file.path(tempdir(), "exported.gtf") ) ``` In the exported GTF file, each entry of `grl` will correspond to a "transcript" feature, and each individual range corresponds to an "exon" feature. In addition, each gene is represented as a "gene" feature. # Generate a transcript-to-gene mapping Finally, we can export a `data.frame` (or a tab-separated test file) providing a conversion table between "transcript" and "gene" identifiers. This is needed to aggregate the transcript-level abundance estimates from alignment-free methods such as _Salmon_ and _kallisto_ to the gene level. ```{r} df <- getTx2Gene(grl) head(df) tail(df) ``` # Session info ```{r} sessionInfo() ```