--- title: "The sRNADiff package" author: "Matthias Zytnicki" date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('srnadiff')`" output: rmarkdown::html_vignette bibliography: references.bib vignette: > %\VignetteIndexEntry{Vignette Title} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup_knitr, include = FALSE, cache = FALSE} library("srnadiff") library("BiocStyle") library("knitr") library("rmarkdown") opts_chunk$set(message = FALSE, error = FALSE, warning = FALSE, cache = FALSE, fig.width = 5, fig.height = 5) ``` sRNADiff aims at finding differentially expressed short RNAs with or without annotation. To do so, sRNADiff uses up to 4 different strategies: - annotation-based, - naive, - HMM, - clustering. ## Example Raw data have been downloaded from the GEO data set [GSE62830](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62830), provided in [Viollet *et al.*, 2015](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0126439) [@data]. Adapters were removed with [fastx_clipper](http://hannonlab.cshl.edu/fastx_toolkit/commandline.html) and mapped with [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) [@bowtie2] on the human genome version GRCh38. This example provides this data, restricted to a small locus on chr14. It uses the whole genome annotation (with coding genes, etc.) and extracts miRNAs. The file `data.csv` contains three columns for each BAM file: - the file name (`FileName`) - a human readable sample name (`SampleName`) - a condition, e.g. `WT` (`Condition`) ```{r, standard, results="hide", warning=FALSE} dir <- system.file("extdata", package="srnadiff", mustWork = TRUE) data <- read.csv(file.path(dir, "data.csv")) gtfFile <- file.path(dir, "Homo_sapiens.GRCh38.76.gtf.gz") annotation <- readWholeGenomeAnnotation(gtfFile) bamFiles <- file.path(dir, data$FileName) replicates <- data$SampleName conditions <- factor(data$Condition) exp <- sRNADiffExp(annotation, bamFiles, replicates, conditions) diffRegions <- runAll(exp) ``` For your conveniance, the sample can be loaded with an only command `exp <- sRNADiffExample()`, so the script boils down to: ```{r, example} exp <- sRNADiffExample() diffRegions <- runAll(exp) ``` ## Rationale and scope of the package There is no real method for finding differentially expressed short RNAs. The most used method focusses on miRNAs, and only uses a standard RNA-Seq pipe-line on these genes. However, annotated tRF, siRNAs, piRNA, etc. are thus out of the scope of these analyses. Several ad hoc method have been used, and this package implements a unifying method, finding differentially expressed genes or regions of any kind. ## Explanation of the different strategies Each strategy segments the genome its own way and finds potential differentially expressed regions. Every strategy makes assumption of the profiles of the differentially expressed regions. These assumptions differ between strategies, and thus it is useful to combined them. ### Annotation This is the standard pipe-line, and the method simply provides the list of miRNA genes extracted from the given GTF file. If you do not provide a GTF file, this step is skipped. Unfortunately, there is no clear standard for an annotation format. We provide here four ways to parse the annotation file. #### Whole genome file annotation This GTF file can be found in the central repositories ([NCBI](https://www.ncbi.nlm.nih.gov/), [Ensembl](https://www.ensembl.org/index.html)) and contains all the annotation found in an organism (coding genes, tranposable element, etc.). The following function reads the annotation file and extracts the miRNAs. Annotation files may have different formats, but this command has been tested on several model organisms (including human) from Ensembl. ```{r, read_whole_genome} gtfFile <- file.path(dir, "Homo_sapiens.GRCh38.76.gtf.gz") annotation <- readWholeGenomeAnnotation(gtfFile) ``` #### miRBase annotation [miRBase](http://www.mirbase.org/) [@mirbase] is the central repository for miRNAs. If your organism is available, you can download their miRNA annotation in GFF3 format (check the "Browse" tab). The following function parses a GFF3 miRBase file, and extracts the precursor miRNAs. ```{r, read_mirbase_pre} gffFile <- file.path(dir, "mirbase21_GRCh38.gff3") annotation <- readMiRBasePreAnnotation(gffFile) ``` As a result, the reads will be counted per pre-miRNA, and the 5' and 3' arms, the miRNA and the miRNA* will be merged in the same feature. If you want to separate the two, use: ```{r, read_mirbase_mature} annotation <- readMiRBaseMatureAnnotation(gffFile) ``` #### Other format When the previous functions do not work, you can use your own parser with: ```{r, read_annotation} annotation <- readAnnotation(gtfFile, source="miRNA", feature="gene", name="gene_name") ``` The `source` parameter keeps all the lines such that the second field matches the given parameter (e.g. `miRNA`). The `feature` parameter keeps all the lines such that the third field matches the given parameter (e.g. `gene`). The name of the feature will be given by the tag `name` (e.g. `gene_name`). `source`, `feature` and `name` can be `NULL`. In this case, no selection is performed on `source` or `feature`. If `name` is null, then a systematic name is given (`annotation_N`). ### Naive This strategy is the simplest, most commonly used, to find transcription units. It proceeds in two steps. The first step merges all the BAM files. Regions with a minimum number coverage (summing all the reads in all input BAM files) are kept. The second steps merges regions that are within a given distance (also provided by the user). The default distance can be change using this function: ```{r, set_merge_distance} exp <- setMergeDistance(exp, 100) ``` This method is efficient when a gene is differentially expressed on the whole transcript. ### HMM This strategy first computes the coverage of every input BAM file of every nucleotide of the genome. If two genome nucleotides have exactly the same coverage profile, we merge them. We give these coverage profiles to `r Biocpkg("DESeq2")` [@deseq2], and compute the p-values. We thus have a p-value for each position of the genome. We then use a very crude hidden Markov model (HMM) on each chromosome of the genome. The HMM has two states: "not differentially expressed", and "differentially expressed". Each state uses a binomial distribution, the probability to emit a p-value < $t$, and the probability to emit a p-value >= $t$ (where $t$ is a threshold that can be tuned, initially set to 0.1). We did not try to optimize the parameters of the HMM, because it takes times, and do not improve the results. Instead, the HMM is used a crude (but effective) way to segment the genome. The parameters of the HMM can be changed using these functions: ```{r, hmm} exp <- setTransitionProbabilities(exp, noDiffToDiff = 0.0001, diffToNoDiff = 0.01) exp <- setEmissionProbabilities(exp, probability = 0.9) exp <- setEmissionThreshold(exp, threshold = 0.1) ``` where: * `noDiffToDiff` is the probability to move from the "not differentially expressed" state to the "differentially expressed" state, * `diffToNoDiff` is the probability to move from the "differentially expressed" state to the "not differentially expressed" state, * `probability` is the probability to emit a p-value < $t$ in the "differentially expressed" state, and a p-value >= $t$ in the "not differentially expressed" state, where $t$ is the next threshold, * `threshold` is the threshold that limits each state. This method is efficient when a region is highly differentially expressed, even if it is concealed in a longer transcript. ### Slicing This stretegy proceeds in three steps: * It normalizes counts using a CPM strategy developped in EdgeR. * The mean distribution of the coverage of the replicates is computed for each condition. * For each condition, we "slice" the mean coverage distribution: we add horizontal lines at height h (where h is the minimum depth parameter defined in the general parameter section), $h+1$, $h+2$, ..., $+\infty$, and the intervals where the distribution and the horizontal lines meet is a putative region. * These putative regions are the output of the strategy. Obviously, many putative regions may have very similar start and end points. If two regions differ only by a few base pairs, the smallest regions will be discarded. The default values can be changed using these functions: ```{r, slicing} exp <- setMinDifferences(exp, differences = 50) ``` This method is especially efficient when a differentially expressed region is large, even though the fold change is moderate. ### General parameters The three last strategies can be tuned by specifying: * the minimum and maximum regions sizes, * the minimum depth of the most expressed condition. The default values can be changed using these functions: ```{r, general_parameters} exp <- setMinDepth(exp, depth = 10) exp <- setSizes(exp, minSize = 15, maxSize = 1000) ``` ### Combination of the strategies #### Choice of the strategies All the regions given by each strategies are then combined into a list of regions. You can choose not to use some strategies, use the function ```{r, strategies} exp <- setStrategies(exp, annotation = TRUE, naive = FALSE, hmm = TRUE, slice = TRUE) ``` where `annotation`, `naive`, `hmm` and `slice` are logical variables, set to `FALSE` if you want to skip that step. #### Quantification of the features The selected regions are then quantified using of the `summarizeOverlaps` function of the `r Biocpkg("GenomicAlignments")` package. Notice that a read can overlap two different regions (e.g. extracted from the naive and the slicing methods), and thus can be counted twice for the quantification. You can adjust the minimum number of overlapping nucleotides between a read and a region to declare a hit, using: ```{r, minOverlap} exp <- setMinOverlap(exp, minOverlap = 20) ``` `r Biocpkg("DESeq2")` is then used to get the adjusted p-values of these regions. ### Extracting regions The regions, with corresponding information provided by `r Biocpkg("DESeq2")` (mean expression, fold-change, p-value, adjusted p-value, etc.), can be extracted with this command: ```{r regions} exp <- sRNADiffExample() exp <- runAll(exp) regions <- regions(exp, pvalue = 0.5) ``` where `pvalue` is the (adjusted) p-value threshold. The output in a `r Biocpkg("GenomicRanges")` object, and the information is accessible with the `mcols()` function. ## Visualization You can get the read coverage profile of your selected regions using the `plotRegion` method. For instance, if `exp` is your `srnadiff` object, you can type: ```{r, plot} plotRegion(exp, regions[1]) ``` Each line provides the read density on an input sample. The dark grey rectangle emphasizes the selected region. ## Misc ### Naming the conditions The conditions can be sorted (`WT` before `mutant`, `untreated` before `treated`) this way: ```{r, conditions} conditions <- factor(c("wt", "wt", "ko", "ko"), levels=c("wt", "ko")) ``` ### Using several cores The quantification and differential expression steps can be accelerated using several cores and the following command: ```{r, threads} exp <- setNThreads(exp, nThreads = 4) ``` ### Troubleshooting While installing the package, if the compiler complains and says #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options. Add this line Sys.setenv("PKG_CXXFLAGS"="-std=c++11") before installing the package. # Session information ```{r, session_info} devtools::session_info() ``` # References