--- title: "Using factR" author: "Fursham Hamid" date: "`r Sys.Date()`" output: rmarkdown::html_document: highlight: kate toc: true toc_depth: 3 number_sections: true vignette: > %\VignetteIndexEntry{factR} %\VignetteEngine{knitr::rmarkdown} bibliography: references.bib --- ```{r setup, echo=FALSE, cache=FALSE} ## Global options options(max.print="100") knitr::opts_chunk$set( collapse = TRUE, comment = "#>") knitr::opts_knit$set(width=75) ``` ```{r silentload, include=FALSE} #Silent load all dependencies for vignette library(factR) library(AnnotationHub) library(Biostrings) library(BSgenome.Mmusculus.UCSC.mm10) library(GenomicFeatures) library(rtracklayer) library(tidyverse) ``` # Introduction Many eukaryotic genes give rise to multiple RNA isoforms, increasing the protein-coding capacity of the genome and extending the range of post-transcriptional regulation possibilities. High-throughput sequencing is often used to deduce repertoires of transcripts expressed in specific biological samples by aligning the data to genomic sequences and assembling the alignments into transcript architectures. This typically outputs Gene Transfer Format (GTF) files describing newly identified transcripts as sets of exonic coordinates, but lacking the information about their coding sequences (CDSs; also known as ORFs) and possible biological functions. To this end, we developed a package for functional annotation of custom-assembled transcriptomes in R (*factR*). *factR* predicts CDSs for novel RNA isoforms using a reference-guided process and then determines domain organisation of the protein products and possible susceptibility of transcripts to nonsense-mediated decay (NMD; a pathway destabilizing mRNAs with premature translation termination codons). *factR* also provides supporting tools for matching new transcripts to "official" gene IDs, visualizing transcript architectures and annotating alternatively spliced segments. # Getting started ## Installing *factR* To install *factR*, enter the following commands in your R environment: ```{r install.factR, eval = FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("factR") ``` ## Materials needed ### Custom-assembled transcriptome *factR* requires a custom transcriptome file in the GTF format as an input, and we provide the following three sample custom transcriptome files that can be used to test *factR* tools: 1. **bulk_merged_sample.gtf.gz** assembled using the HISAT2-StringtTie2 pipeline [@Pertea2016] from bulk RNA-seq data for mouse embryonic stem cells treated with the NMD inhibitor cycloheximide (CHX) or left untreated as a control. 2. **sc_merged_sample.gtf.gz** assembled using the HISAT2-StringtTie2 pipeline from single-cell RNA-seq data for glutamatergic neurons, GABAergic neurons, astrocytes or endothelial cells from the mouse visual cortex [@Tasic2018]. A simplified version of this GTF containing the first 500 genes on chr15 is available as part of *factR*. 3. **lr_merged_sample.gtf.gz** assembled using minimap2 [@Li2018] and StringtTie2 [@Kovaka2019] from brain-specific long-read Oxford Nanopore RNA sequencing data [@Sessegolo2019]. Method to download the above GTF files is described in the ["Importing and inspecting GTF data"](#importing-and-inspecting-gtf-data) section below. Users may alternatively prepare their own GTF files making sure that these contain both **gene_id** and **transcript_id** attributes. ### Reference transcriptome Several *factR* functions require reference transcriptome files (GTF or GFF3) as a guide. Such files can be accessed from within the R environment using e.g. the R package *AnnotationHub* or downloaded from an external database such as GENCODE or Ensembl. Both possibilities are described in the ["Updating gene info"](#updating-gene-info) section below. ### Genome *factR* needs genomic DNA sequence to predict CDSs. Users may obtain genome files using e.g. *BSgenome* or *AnnotationHub* or download them from an external database such as GENCODE or Ensembl. We describe this in more detail in the ["Constructing CDS information"](#constructing-cds-information) section below. # Using *factR* Load *factR* to the R environment as follows: ```{r loadfactR} library(factR) ``` ## Importing and inspecting GTF data *factR* handles transcriptome information in the form of *GenomicRanges* objects containing genomic interval data and relevant metadata. To create such an object from a GTF file, we use the `importGTF` function: ```{r importGTF} gtf <- system.file("extdata", "sc_merged_sample.gtf.gz", package = "factR") custom.gtf <- importGTF(gtf) ``` The imported GTF file is stored as a *GenomicRanges* object. ```{r checktype} class(custom.gtf) ``` Contents of the object can be examined using `head`: ```{r headobj} head(custom.gtf) ``` In addition to genomic coordinates (seqnames and ranges), a typical GTF file contains metadata describing the feature type (e.g. transcript or exon), transcript IDs and some information on their parental genes. Use the following command to calculate the total number of transcripts in the input transcriptome: ```{r counttx} length(unique(custom.gtf$transcript_id)) ``` Note that none of the transcripts in the custom.gtf object contain CDS information: ```{r countcds} length(unique(custom.gtf[custom.gtf$type=="CDS"]$transcript_id)) ``` ## Plotting transcript structures Users may visualize specific sets of transcripts using `viewTranscripts`. For example, the following will plot transcripts from the *Zfr* gene encoding a conserved zinc finger-containing RNA-binding protein with known neuronal functions : ```{r plottranscripts, message=FALSE, warning=FALSE} viewTranscripts(custom.gtf, "Zfr") ``` StringTie2, a popular transcript assembler used to generate our custom GTF, typically assigns arbitrary names to newly identified transcripts (e.g. "MSTRG.x.y") and uses the same prefix for their gene IDs (e.g. "MSTRG.x"). Incidentally, this is why the output above contains 10 previously known *Zfr* transcripts but lacks any novel entries. ## Updating gene info *factR* can update gene metadata in custom transcriptome objects using a reference annotation as guide. Below, we describe two alternative ways to obtain mouse reference transcriptome data from GENCODE. 1. Using *AnnotationHub* package ```{r load.gencode, eval = TRUE} # query database for mouse gencode basic annotation library(AnnotationHub) ah <- AnnotationHub() query(ah, c('Mus musculus', 'gencode', 'gff')) # Download full annotation ref.gtf <- ah[['AH49546']] ``` 2. Downloading from a GENCODE database (ref.gtf generated from this method will be used for the remaining of the workflow) ```{r import.gencode, warning=FALSE} tmp <- tempfile() download.file("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz", destfile = tmp) ref.gtf <- importGTF(tmp) ``` When choosing a reference, users should consider using one with the same chromosome naming style (e.g. "chr1" or "1"). Alternatively, the styles can be matched using the `matchChromosomes` function (see `help(matchChromosomes)` for more detail). Once the ref.gtf object is ready, novel transcripts in custom.gtf can be assigned to "official" gene IDs, whenever possible, using `matchGeneInfo`. By default, this function matches the query (custom.gtf) to the reference (ref.gtf) by finding overlapping coordinates: ```{r matchgenemeta, warning=FALSE} # matching gene metadata custom_matched_1.gtf <- matchGeneInfo(custom.gtf, ref.gtf) ``` To tune the performance of `matchGeneInfo`, we provide additional arguments specifying the name of columns containing "primary" and potentially "secondary" gene IDs from the query (custom.gtf). For further information, see `help(matchGeneInfo)`. ```{r matchgenemeta2, warning=FALSE} # matching gene metadata custom_matched_2.gtf <- matchGeneInfo(custom.gtf, ref.gtf, primary_gene_id = "gene_id", secondary_gene_id = "ref_gene_id") ``` Note that custom.gtf updated by `matchGeneInfo` now contains 10 known and 7 newly *Zfr* assembled transcripts: ```{r plottranscripts2} viewTranscripts(custom_matched_2.gtf, "Zfr") ``` ## Shortlisting novel transcripts As seen in the above example, custom transcriptomes typically combine both new and previously annotated transcripts. To select only newly predicted transcripts, run the following: ```{r subsettx, warning=FALSE} custom_new.gtf <- subsetNewTranscripts(custom_matched_2.gtf, ref.gtf) viewTranscripts(custom_new.gtf,"Zfr") ``` This will subset custom.gtf transcripts with distinct exonic coordinates compared to ref.gtf and will store these transcripts in the custom_new.gtf object. Some custom-built transcripts may differ from their reference counterparts by having different start or/and end coordinates, with otherwise similar exon-intron structure. To shortlist novel transcripts with distinct intronic coordinates only, simply set the "refine.by" argument to "intron": ```{r subsettx2, warning=FALSE} custom_new.gtf <- subsetNewTranscripts(custom_matched_2.gtf, ref.gtf, refine.by = "intron") viewTranscripts(custom_new.gtf, "Zfr") ``` We will use the custom_new.gtf object in the rest of the workflow. ## Constructing CDS information Functional annotation of newly assembled transcripts in *factR* begins by deducing their protein-coding sequences (CDSs). To search for putative CDSs, *factR* requires a genome sequence file, which can be obtained using R packages or downloaded from online databases (e.g. UCSC, GENCODE or Ensembl). Three alternative ways to retrieve mouse genomic sequence are described below. 1. Using *BSgenome* (Mmusculus generated from this method will be used for the remaining of the workflow) This package supports most sequenced genomes. Mouse mm10 sequence can be downloaded as follows: ```{r genomeBSgenome, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("BSgenome.Mmusculus.UCSC.mm10") ``` and loaded into R environment: ```{r loadBSgenome} library(BSgenome.Mmusculus.UCSC.mm10) Mmusculus <- BSgenome.Mmusculus.UCSC.mm10 ``` 2. Using *AnnotationHub* ```{r genomeAhub} library(AnnotationHub) ah <- AnnotationHub() query(ah, c("mm10","2bit")) ``` ```{r, eval=FALSE} # Retrieve mouse genome Mmusculus <- ah[['AH14005']] ``` 3. Downloading from a database (e.g. GENCODE) ```{r downloadGencode, eval=FALSE} tmp <- tempfile() download.file("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz", tmp) Mmusculus <- importFASTA(paste0(tmp, "/GRCm38.primary_assembly.genome.fa.gz")) ``` Once the genome sequence object is ready, *factR* can predict CDSs using its buildCDS() function and reference transcriptome data as a guide. buildCDS() first generates a database of previously annotated ATGs and uses this information to search for a potential translation start sites in query transcripts. buildCDS() then deduces the CDS and appends its coordinates to the custom transcriptome object. Let's run this function for our novel transcripts: ```{r buildcds, warning=FALSE} custom_new_CDS.gtf <- buildCDS(custom_new.gtf, ref.gtf, Mmusculus) ``` Note that the novel *Zfr* transcripts have been updated with information about likely CDSs (dark blue) and untranslated regions (light blue) : ```{r viewafterCDS} viewTranscripts(custom_new_CDS.gtf, "Zfr") ``` We can display exonic regions and CDSs more clearly (at the expense of loosing their bona fide genomic coordinates) by setting rescale_intron argument to TRUE. ```{r viewafterCDSscale} viewTranscripts(custom_new_CDS.gtf, "Zfr", rescale_introns = TRUE) ``` For comparison, here is the CDS situation in the reference Zfr transcripts: ```{r viewrefCDS} viewTranscripts(ref.gtf, "Zfr", rescale_introns = TRUE) ``` ## Predicting NMD To explore possible susceptibility of newly identified mRNA isoforms to NMD, we use the `predictNMD` function: ```{r predictNMD1, warning=FALSE} NMDprediction.out <- predictNMD(custom_new_CDS.gtf) head(NMDprediction.out) ``` `predictNMD` outputs a data frame containing information on features that may promote NMD and predicts NMD sensitivity for each CDS-containing transcript based on the distance between the stop codon and the last exon-exon junction. To identify putative NMD targets for specific genes (e.g. *Zfr*), run the following: ```{r predictNMD2} NMDprediction.Zfr <- predictNMD(custom_new_CDS.gtf, gene_name == "Zfr") head(NMDprediction.Zfr) ``` ## Predicting protein domains *factR* can also inspect domain structure of protein products encoded by newly identified mRNA isoforms using its `predictDomains()` function. This requires connection to the online PFAM database and may require a substantial amount of time and stable internet connection to query multiple transcripts simultaneously. To quickly explore functionality of the `predictDomains()` tool, users may prefer to begin with a relatively small subset of transcripts. For example, the following will predict domain structure for all transcripts of the Zfr gene: ```{r predDomain2} predictDomains(custom_new_CDS.gtf, Mmusculus, gene_name == "Zfr", progress_bar = FALSE) ``` The domain architectures for the set of query proteins can be additionally plotted by switching the argument "plot" to TRUE: ```{r predDomain3, eval = F} domains.Zfr <- predictDomains(custom_new_CDS.gtf, Mmusculus, gene_name == "Zfr", plot = TRUE, progress_bar = FALSE) ``` If you want to predict domains for all new transcripts and do not mind waiting for some time depending on the connection speed and the PFAM server load, run the following: ```{r predDomain1, eval=FALSE} domains.out <- predictDomains(custom_new_CDS.gtf, Mmusculus, progress_bar = FALSE) ``` ## Export output objects Annotated *custom_new_CDS.gtf* object can be exported to a GTF file as follows: ```{r exportgtf, eval=FALSE} library(rtracklayer) export(custom_new_CDS.gtf, "Custom_new.gtf", format = "GTF") ``` Finally, to export *NMDprediction.out* and *domains.out* data frames as tab-delimited text files, run the following: ```{r exporttable, eval=FALSE} write.table(NMDprediction.out, "Custom_new_NMD.txt", sep = "\t", row.names = FALSE, quote = FALSE) write.table(domains.out, "Custom_new_domains.txt", sep = "\t", row.names = FALSE, quote = FALSE) ``` # Citing *factR* Please cite *factR* if you find it useful: # Session Information This workflow was conducted on: ```{r sessioninfo} sessionInfo() ``` # References