%\VignetteIndexEntry{Differential transcript usage and transcript usage QTL analyses in RNA-seq with the DRIMSeq package}
%\VignettePackage{DRIMSeq}
%\VignetteEngine{knitr::knitr}

\documentclass[10pt]{article}

<<>>=
BiocStyle::latex()
@

\usepackage[utf8]{inputenc}
\usepackage[sort]{cite}
\usepackage{xstring}

\bioctitle[Differential transcript usage and transcript usage QTL analyses in RNA-seq with the DRIMSeq package]{DRIMSeq: Dirichlet-multinomial framework for differential transcript usage and transcript usage QTL analyses in RNA-seq}

\author{
Malgorzata Nowicka\thanks{\email{gosia.nowicka@uzh.ch}}, Mark D. Robinson\\
Institute for Molecular Life Sciences, University of Zurich, Switzerland\\
SIB Swiss Institute of Bioinformatics, University of Zurich, Switzerland
}

\begin{document}

\maketitle

\packageVersion{\Sexpr{BiocStyle::pkg_ver("DRIMSeq")}}

\newpage

\tableofcontents

\newpage

<<>>=
library(knitr)
opts_chunk$set(cache = FALSE, warning = FALSE,
  out.width = "7cm", fig.width = 7,
  out.height = "7cm", fig.height = 7)
@

%------------------------------------------------------------------------------
% Introduction
%------------------------------------------------------------------------------

\section{Main changes in the DRIMSeq package}

For the full list of changes, type:

<<>>=
news(package = "DRIMSeq")
@

Implementation of a regression framework for the differential transcript usage analysis. It allows fitting experimental designs more complicated than multiple group comparisons; for example, one can account for batch effects, handle paired samples, or model continuous time course changes. It also enables testing of more complicated contrasts.

Transcript-level analysis based on the beta-binomial model. In this case, each transcript ratio is modeled separately assuming it follows the beta-binomial distribution, which is a one-dimensional version of the Dirichlet-multinomial distribution. Since $(Y_1,\ldots,Y_q) \sim DM(\pi_1,\ldots,\pi_q, \gamma_0)$ implies $Y_j \sim BB(\pi_j,\gamma_0)$ \cite{Danaher1988}, we do not need to reestimate the beta-binomial parameters; only the likelihoods are recalculated.

\Rpackage{DRIMSeq} returns gene-level and transcript-level p-values which can be used as input to the stage-wise testing procedure \cite{VandenBerge2017} as screening and confirmation p-values, respectively. Such an approach provides increased power to identify the transcripts that are actually differentially used within a gene detected as a gene with DTU.

Usage of the term 'precision' instead of 'dispersion'. In differential analyses based on the negative-binomial model, a dispersion parameter is estimated. This dispersion parameter captures all sources of inter-library variation between replicates present in RNA-seq data. In the DM model, we do not directly estimate dispersion but a parameter called precision, which is closely linked to dispersion via the formula: $dispersion = 1 / (1 + precision)$. In the previous version of the package \cite{Nowicka2016}, we used 'dispersion' as the name for the functions and variables that, in fact, calculate and store precision estimates. Now, we use the term 'precision'.

\section{Overview of the Dirichlet-multinomial model}

For the statistical details about the Dirichlet-multinomial model, see the \Rpackage{DRIMSeq} paper \cite{Nowicka2016}.
In the \Rpackage{DRIMSeq} package, we implemented a Dirichlet-multinomial framework that can be used for modeling various types of multivariate count data, with the aim of finding instances where the ratios of the observed features differ between experimental conditions. Such a model can be applied, for example, in differential transcript usage (DTU) analysis or in analyses that aim at detecting SNPs associated with differential transcript usage (tuQTL analysis). In both cases, the multivariate features of a gene are its transcripts. The implementation of the Dirichlet-multinomial model in the \Rpackage{DRIMSeq} package is customized for the differential transcript usage and tuQTL analyses, but the data objects used by \Rpackage{DRIMSeq} can contain various types of counts. Therefore, other types of multivariate differential analyses can be performed, such as differential methylation analysis or differential polyA usage from polyA-seq data.

In short, the method consists of three statistical steps:

First, we use the Cox-Reid adjusted profile likelihood to estimate the precision, which is inversely related to dispersion, i.e., the variability of transcript ratios between samples (replicates) within conditions. Dispersion is needed in order to find significant changes in transcript ratios between conditions, which should be sufficiently stronger than the changes/variability within conditions.

Second, we use maximum likelihood to estimate, at the gene level, the regression coefficients of the Dirichlet-multinomial (DM) regression, the fitted transcript proportions in each sample and the full model likelihoods. For the analysis at the transcript level, we apply beta-binomial (BB) regression to each transcript separately. In the differential transcript usage analysis, the full model is defined by the user via the design matrix. In the QTL analysis, full models are defined by the genotypes of SNPs associated with a given gene.

Finally, we fit the null model DM and BB regressions and use the likelihood ratio statistics to test for differences in transcript proportions between the full and null models at the gene and transcript levels. In the differential transcript usage analysis, the null model is again defined by the user. In the QTL analysis, null models correspond to regressions with intercept only.

\section{Important notes}

Currently, transcript-level analysis based on the BB model is implemented only in the DTU analysis (\Rcode{bb\_model = TRUE}).

When the model (full or null) of interest corresponds to multiple (or one) group fitting, a shortcut algorithm called 'one way' (\Rcode{one\_way = TRUE}), which we adapted from the \Rfunction{glmFit} function in \Biocpkg{edgeR} \cite{McCarthy2012}, can be used. Choosing it is equivalent to running the original \Rpackage{DRIMSeq} implementation. In such a case, we use maximum likelihood to estimate the transcript proportions in each group separately, and the regression coefficients are then calculated using matrix operations. Otherwise, the regression coefficients are directly estimated with the maximum likelihood approach.

\section{Hints for DRIMSeq pipelines}

In this vignette, we present how one could perform the differential transcript usage analysis and the tuQTL analysis with the \Rpackage{DRIMSeq} package. We use small data sets so that the whole pipelines can be run within a few minutes in \R{} on a single-core computer. In practice, the package is designed to take advantage of multicore computing for larger data sets.
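For example, a parallel backend from \Biocpkg{BiocParallel} can be set up as sketched below (not evaluated here; the number of workers is arbitrary and should be adapted to the available machine, and on Windows, where \Rcode{MulticoreParam} is not supported, \Rcode{SnowParam} can be used instead). The backend is then passed to functions such as \Rfunction{dmPrecision} via their \Rcode{BPPARAM} argument.

<<eval = FALSE>>=
library(BiocParallel)

## Multicore (forked) backend with 4 workers; adjust to your machine
bp <- MulticoreParam(workers = 4)

## On Windows, a SNOW-based backend can be used instead
# bp <- SnowParam(workers = 4)

## The backend is then passed via the BPPARAM argument, for example:
# d <- dmPrecision(d, design = design_full, BPPARAM = bp)
@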
In the filtering function \Rfunction{dmFilter}, all the parameters that influence transcript count filtering are set to zero. This results in a very relaxed filtering, where transcripts with zero expression in all the samples are removed, together with genes for which only one transcript remains. Functions \Rfunction{dmPrecision}, \Rfunction{dmFit} and \Rfunction{dmTest}, which perform the actual statistical analyses described above, have many other parameters available for tweaking, but they have default values assigned. Those defaults were chosen based on many real data analyses.

Some of the steps are quite time consuming, especially the precision estimation, where the proportions of each gene are refitted for different precision parameters. To speed up the calculations, we have parallelized many functions using \Biocpkg{BiocParallel}. Thus, if possible, we recommend increasing the number of workers in \Robject{BPPARAM}.

In general, tuQTL analyses are more computationally intensive than differential transcript usage analyses because one needs to do the analysis for every SNP in the surrounding region of a gene. Additionally, a permutation scheme is used to compute the p-values. It is, however, feasible to perform the tuQTL analysis for small chunks of the genome, for example, per chromosome.

%------------------------------------------------------------------------------
% Differential transcript usage analysis workflow
%------------------------------------------------------------------------------

\section{Differential transcript usage analysis workflow}

\subsection{Example data}

To demonstrate the application of \Rpackage{DRIMSeq} in differential transcript usage analysis, we will use the \emph{pasilla} data set produced by Brooks et al. \cite{Brooks2011}. The aim of their study was to identify exons that are regulated by the pasilla protein, the \emph{Drosophila melanogaster} ortholog of mammalian NOVA1 and NOVA2 (well studied splicing factors). In their RNA-seq experiment, the libraries were prepared from 7 biologically independent samples: 4 control samples and 3 samples in which pasilla was knocked down. The libraries were sequenced on the Illumina Genome Analyzer II using single-end and paired-end sequencing and different read lengths. The RNA-seq data can be downloaded from the NCBI Gene Expression Omnibus (GEO) under the accession number GSE18508.

In the examples below, we use a subset of \software{kallisto} \cite{Bray2016} counts available in the \Biocexptpkg{PasillaTranscriptExpr} package, where you can find all the steps needed to preprocess the GEO data into a table of transcript counts.

\subsection{Differential transcript usage analysis between two conditions}

In this section, we present how to perform the DTU analysis between the two conditions, control and knock-down, without accounting for the batch effect, which here is the library layout. An analysis where batch effects are included is presented in the next section.

We start the analysis by creating a \Rclass{dmDSdata} object, which contains transcript counts and information about grouping samples into conditions. With each step of the pipeline, additional elements are added to this object. At the end of the analysis, the object contains results from all the steps, such as precision estimates, regression coefficients, fitted transcript ratios in each sample, likelihood ratio statistics, p-values and adjusted p-values at the gene and transcript levels.
As new elements are added, the object also changes its class, \Rclass{dmDSdata} $\rightarrow$ \Rclass{dmDSprecision} $\rightarrow$ \Rclass{dmDSfit} $\rightarrow$ \Rclass{dmDStest}, but each container inherits the slots and methods available for the previous one.

\subsubsection{Loading pasilla data into R}

The transcript-level counts obtained with \software{kallisto} and the metadata are saved as text files in the \Rcode{extdata} directory of the \Biocexptpkg{PasillaTranscriptExpr} package.

<<>>=
library(PasillaTranscriptExpr)

data_dir <- system.file("extdata", package = "PasillaTranscriptExpr")

## Load metadata
pasilla_metadata <- read.table(file.path(data_dir, "metadata.txt"),
  header = TRUE, as.is = TRUE)

## Load counts
pasilla_counts <- read.table(file.path(data_dir, "counts.txt"),
  header = TRUE, as.is = TRUE)
@

Load the \Rpackage{DRIMSeq} package.

<<>>=
library(DRIMSeq)
@

To create a \Rcode{dmDSdata} object, saved as variable \Robject{d}, we need to prepare a data frame containing information about the samples, which we will call \Robject{pasilla\_samples}. It has to contain a variable called \Rcode{sample\_id} with unique sample names that are identical to the column names in \Robject{pasilla\_counts} that correspond to samples. Additionally, it has to contain any other variables that the user would like to use in the downstream regression analysis. Here, we are interested in the differential analysis between the control and knock-down conditions. This information is stored in \Rcode{pasilla\_metadata\$condition}.

The data frame with counts, called \Robject{pasilla\_counts}, is already formatted in the right way. It contains a variable \Rcode{feature\_id} with unique transcript names, a variable \Rcode{gene\_id} with unique gene IDs, and count columns with the same names as \Rcode{sample\_id} in \Rcode{pasilla\_samples}.

When printing the variable \Robject{d}, you can see its class, its size (number of genes and samples) and which accessor methods can be applied. For a \Rcode{dmDSdata} object, there are two such methods, which return data frames with counts and samples.

<<>>=
pasilla_samples <- data.frame(sample_id = pasilla_metadata$SampleName,
  group = factor(pasilla_metadata$condition))
levels(pasilla_samples$group)

d <- dmDSdata(counts = pasilla_counts, samples = pasilla_samples)
d

head(counts(d), 3)
head(samples(d), 3)
@

You can also make a data summary plot, which is a histogram of the number of transcripts per gene.

<<>>=
plotData(d)
@

To make the analysis runnable within this vignette, we keep only a small subset of genes, which is defined in the following file.

<<>>=
gene_id_subset <- readLines(file.path(data_dir, "gene_id_subset.txt"))

d <- d[names(d) %in% gene_id_subset, ]
d
@

\subsubsection{Filtering}
\label{DS_filtering}

Genes may have many transcripts that are lowly expressed or not expressed at all. You can remove them using the \Rfunction{dmFilter} function. Filtering of lowly expressed transcripts can be done at two levels: minimal \textit{expression} using the \Robject{min\_samps\_feature\_expr} and \Robject{min\_feature\_expr} parameters, or minimal \textit{proportion} with \Robject{min\_samps\_feature\_prop} and \Robject{min\_feature\_prop}.

In the \emph{pasilla} experiment, we use filtering based only on absolute transcript expression, with parameters adjusted according to the number of replicates per condition. Since we have 3 knock-down and 4 control samples, we set \Robject{min\_samps\_feature\_expr} equal to 3.
In this way, we allow a situation where a transcript is expressed in one condition but not in the other, which is a case of differential transcript usage. The level of transcript expression is controlled by \Robject{min\_feature\_expr}. We set it to a value of 10, which means that only the transcripts that have at least 10 estimated counts in at least 3 samples are kept for the downstream analysis.

Filtering at the gene level ensures that the observed transcript ratios have some minimal reliability. Although the Dirichlet-multinomial model works on feature counts, not on feature ratios, and thus gives more confidence to ratios based on 100 versus 500 reads than on 1 versus 5, a minimal filtering based on gene expression removes genes with mostly zero counts and reduces the number of tests in the multiple testing correction. For the \emph{pasilla} data, we require genes to have at least 10 counts in all the samples: \Rcode{min\_samps\_gene\_expr = 7} and \Rcode{min\_gene\_expr = 10}.

<<>>=
# Check what is the minimal number of replicates per condition
table(samples(d)$group)

d <- dmFilter(d, min_samps_gene_expr = 7, min_samps_feature_expr = 3,
  min_gene_expr = 10, min_feature_expr = 10)
@

\subsubsection{Precision estimation}
\label{DS_precision_estimation}

Ideally, we would like to get accurate precision estimates for every gene, which is problematic when analyzing small data sets because precision estimates become inaccurate when the sample size decreases, especially for lowly expressed genes. As an alternative, we could assume that all the genes have the same precision and calculate a common precision based on all the data, but we expect this to be too strong an assumption. Moderated precision is a trade-off between gene-wise and common precision. The moderated estimates originate from a weighted likelihood, which is a combination of the common and individual likelihoods. We recommend this approach when analyzing small sample size data sets.

At this step, three values may be calculated: the mean expression of genes, the common precision and the gene-wise precisions. In the default setting, all of them are computed, and the common precision is used as an initial value in the grid approach to estimate the gene-wise precisions, which are shrunk toward the trended precision. The grid approach is adapted from the \Rfunction{estimateDisp} function in \Biocpkg{edgeR} \cite{McCarthy2012}.

By default, to estimate the common precision, we use 10\% (\Rcode{prec\_subset = 0.1}) of randomly selected genes. That is due to the fact that the common precision is used only as an initial value, and estimating it based on all the genes takes a substantial amount of time. To ensure that the analysis is reproducible, the user should set a random seed with \Rcode{set.seed()} before running the \Rcode{dmPrecision()} function. Thanks to that, the same subset of genes is selected each time.

To estimate the precision parameters, the user has to define a design matrix with the full model of interest, which will also be used later in the proportion estimation. Here, the full model is defined by the formula $\sim group$, which indicates that samples come from two conditions. This step of our pipeline is the most time consuming. Thus, for real data analysis, consider using \Rcode{BPPARAM = BiocParallel::MulticoreParam()} with more than one worker.
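For illustration only, such a call might look like the following sketch (not evaluated in this vignette; the number of workers is arbitrary and \Robject{design\_full} is the design matrix constructed in the next chunk):

<<eval = FALSE>>=
library(BiocParallel)

## Set a seed so that the gene subset used for the common precision
## is reproducible, then estimate precision on 4 cores
set.seed(123)
d <- dmPrecision(d, design = design_full,
  BPPARAM = MulticoreParam(workers = 4))
@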
<<>>=
## Create the design matrix
design_full <- model.matrix(~ group, data = samples(d))
design_full
@

<<>>=
## To make the analysis reproducible
set.seed(123)

## Calculate precision
d <- dmPrecision(d, design = design_full)
d

head(mean_expression(d), 3)
common_precision(d)
head(genewise_precision(d))
@

To inspect the behavior of the precision estimates, you can plot them against the mean gene expression. Normally, in differential analyses based on RNA-seq data, such a plot has the dispersion parameter on the y-axis. Here, the y-axis represents precision, since in the Dirichlet-multinomial model this is the parameter that is directly estimated. It is important to keep in mind that precision is inversely related to dispersion: $dispersion = 1 / (1 + precision)$. In RNA-seq data, we typically observe a trend where the dispersion decreases (precision increases) for genes with higher mean expression.

<<>>=
plotPrecision(d)
@

All of the plotting functions in the \Rpackage{DRIMSeq} package return a \Rclass{ggplot} object, which can be further modified using the \CRANpkg{ggplot2} package. For example, the user can increase the size of the points.

<<>>=
library(ggplot2)
ggp <- plotPrecision(d)
ggp + geom_point(size = 4)
@

\subsubsection{Proportion estimation}

In this step, we estimate the full model regression coefficients, the fitted proportions and the likelihoods. We use the same design matrix as in the precision estimation step.

By default, \Rcode{one\_way = TRUE}, which means that whenever the design corresponds to multiple group fitting, the 'one way' shortcut algorithm is used, which in fact corresponds to the first implementation of \Rpackage{DRIMSeq}. Transcript proportions are estimated for each condition separately, and the regression coefficients are then calculated using matrix operations. The 'one way' algorithm is adapted from the \Rfunction{glmFit} function in \Biocpkg{edgeR} \cite{McCarthy2012}. By setting \Rcode{verbose = 1}, we can see that the one way approach is used with the current design.

When \Rcode{bb\_model = TRUE} (the default), in addition to the gene-level Dirichlet-multinomial estimates, the transcript-level beta-binomial results are computed.

<<>>=
d <- dmFit(d, design = design_full, verbose = 1)
d

## Get fitted proportions
head(proportions(d))
## Get the DM regression coefficients (gene-level)
head(coefficients(d))
## Get the BB regression coefficients (feature-level)
head(coefficients(d, level = "feature"))
@

\subsubsection{Testing for differential transcript usage}
\label{DS_testing}

Calling the \Rfunction{dmTest} function results in two calculations. First, the null model is fitted. This null model can be defined by the user via the \Rcode{coef}, \Rcode{design} or \Rcode{contrast} parameters. Second, likelihood ratio statistics are used to test for the difference between the full and null models. Both steps are done at the gene and transcript levels when \Rcode{bb\_model = TRUE}.

In our example, we would like to test whether there are differences in transcript usage between control (CTL) and knock-down (KD). We can achieve that by using the \Rcode{coef} parameter, which indicates which columns of the full design should be removed to obtain the null design. We set it to \Rcode{"groupKD"}. The null design then has only an intercept column, which means that all the samples are treated as if they came from one condition. Note that, with \Rcode{one\_way = TRUE}, the one way approach is again used.
<<>>=
d <- dmTest(d, coef = "groupKD", verbose = 1)
design(d)
head(results(d), 3)
@

The same can be achieved by directly defining the null design matrix with the \Rcode{design} parameter.

<<>>=
design_null <- model.matrix(~ 1, data = samples(d))
design_null

d <- dmTest(d, design = design_null)
head(results(d), 3)
@

Or by using the \Rcode{contrast} parameter. The null design is then calculated using the approach from the \Rfunction{glmLRT} function in \Biocpkg{edgeR} \cite{McCarthy2012}.

<<>>=
contrast <- c(0, 1)

d <- dmTest(d, contrast = contrast)
design(d)
head(results(d), 3)
@

To obtain the results of the likelihood ratio tests, you have to call the function \Rfunction{results}, which returns a data frame with the likelihood ratio statistics, degrees of freedom, p-values and Benjamini and Hochberg (BH) adjusted p-values for each gene (by default) or for each transcript (when \Rcode{level = "feature"}).

<<>>=
head(results(d, level = "feature"), 3)
@

You can plot a histogram of the gene-level and transcript-level p-values.

<<>>=
plotPValues(d)
plotPValues(d, level = "feature")
@

For genes of interest, you can plot (as bar plots, line plots, box plots or ribbon plots) the observed transcript ratios together with the ratios estimated by the Dirichlet-multinomial model. You have to define the \Rcode{group\_variable} parameter, which should indicate a variable from \Rcode{samples(d)}. Currently, plots can be done only for categorical variables. We choose the \Rcode{"group"} column since it corresponds to the comparison of interest. Estimated proportions are marked with diamond shapes. As an example, we plot the top significant gene.

<<>>=
res <- results(d)
res <- res[order(res$pvalue, decreasing = FALSE), ]

top_gene_id <- res$gene_id[1]

plotProportions(d, gene_id = top_gene_id, group_variable = "group")

plotProportions(d, gene_id = top_gene_id, group_variable = "group",
  plot_type = "lineplot")

plotProportions(d, gene_id = top_gene_id, group_variable = "group",
  plot_type = "ribbonplot")
@

\subsubsection{Two-stage test}

\Rpackage{DRIMSeq} returns gene-level and transcript-level p-values which can be used as input to the stage-wise analysis \cite{VandenBerge2017} implemented in the \Rcode{stageR} package, currently available on GitHub at \url{https://github.com/statOmics/stageR}. As pointed out by the authors of \Rcode{stageR}, interpreting both gene-level and transcript-level adjusted p-values does not provide appropriate FDR control and should be avoided. However, applying stage-wise testing provides a useful biological interpretation of these results and improved statistical performance.

In short, the procedure consists of a screening stage and a confirmation stage. In the screening stage, gene-level BH-adjusted p-values are screened to detect genes for which the hypothesis of interest is significantly rejected. Only those genes are further considered in the confirmation stage, where, for each gene separately, the transcript-level p-values are adjusted to control the FWER at the BH-adjusted significance level of the screening stage.

It is important to note that the transcript-level stage-wise adjusted p-values for genes that do not pass the screening stage are set to \Rcode{NA}. Also, the stage-wise adjusted p-values cannot be compared to a significance level other than the one chosen in the stage-wise analysis. If that is of interest, one has to rerun this analysis with the new significance level.

The following code chunk is not evaluated by this vignette; to run it, the user has to make sure that the \Rcode{stageR} package is installed.
It shows how one can use the \Rpackage{DRIMSeq} output in the stage-wise analysis.

<<eval = FALSE>>=
library(stageR)

## Assign gene-level pvalues to the screening stage
pScreen <- results(d)$pvalue
names(pScreen) <- results(d)$gene_id

## Assign transcript-level pvalues to the confirmation stage
pConfirmation <- matrix(results(d, level = "feature")$pvalue, ncol = 1)
rownames(pConfirmation) <- results(d, level = "feature")$feature_id

## Create the gene-transcript mapping
tx2gene <- results(d, level = "feature")[, c("feature_id", "gene_id")]

## Create the stageRTx object and perform the stage-wise analysis
stageRObj <- stageRTx(pScreen = pScreen, pConfirmation = pConfirmation,
  pScreenAdjusted = FALSE, tx2gene = tx2gene)

stageRObj <- stageWiseAdjustment(object = stageRObj, method = "dtu",
  alpha = 0.05)

getSignificantGenes(stageRObj)

getSignificantTx(stageRObj)

padj <- getAdjustedPValues(stageRObj, order = TRUE,
  onlySignificantGenes = FALSE)

head(padj)
@

\subsection{Differential transcript usage analysis between two conditions accounting for batch effects}

The regression framework implemented in \Rpackage{DRIMSeq} allows accounting for batch effects. Here, the batch effect is the library layout stored in \Rcode{pasilla\_metadata\$LibraryLayout}. The steps of this analysis are the same as described above. The only difference is that we have to include the library layout variable in the sample information of the \Rcode{dmDSdata} object and define a full model that contains the batch effect.

<<>>=
pasilla_samples2 <- data.frame(sample_id = pasilla_metadata$SampleName,
  group = factor(pasilla_metadata$condition),
  library_layout = factor(pasilla_metadata$LibraryLayout))

d2 <- dmDSdata(counts = pasilla_counts, samples = pasilla_samples2)

## Subsetting to a vignette runnable size
d2 <- d2[names(d2) %in% gene_id_subset, ]

## Filtering
d2 <- dmFilter(d2, min_samps_gene_expr = 7, min_samps_feature_expr = 3,
  min_gene_expr = 10, min_feature_expr = 10)

## Create the design matrix
design_full2 <- model.matrix(~ group + library_layout, data = samples(d2))
design_full2

## To make the analysis reproducible
set.seed(123)

## Calculate precision
d2 <- dmPrecision(d2, design = design_full2)

common_precision(d2)
head(genewise_precision(d2))

plotPrecision(d2)

## Fit proportions
d2 <- dmFit(d2, design = design_full2, verbose = 1)

## Test for DTU
d2 <- dmTest(d2, coef = "groupKD", verbose = 1)
design(d2)
head(results(d2), 3)

## Plot p-value distribution
plotPValues(d2)
@

<<>>=
## Plot the top significant gene
res2 <- results(d2)
res2 <- res2[order(res2$pvalue, decreasing = FALSE), ]

top_gene_id2 <- res2$gene_id[1]

ggp <- plotProportions(d2, gene_id = top_gene_id2, group_variable = "group")
ggp + facet_wrap(~ library_layout)
@

%------------------------------------------------------------------------------
% tuQTL analysis workflow
%------------------------------------------------------------------------------

\section{tuQTL analysis workflow}

In the transcript usage QTL analysis, we want to identify genetic variants (here, bi-allelic SNPs) that are associated with changes in transcript usage. Such SNPs are then called transcript usage quantitative trait loci (tuQTLs). Ideally, we would like to test associations of every SNP with every gene. However, such an approach would be very costly computationally and in terms of multiple testing correction.
Under the assumption that SNPs that directly affect transcript usage are likely to be located in the close vicinity of genes, we test only the SNPs that lie within the gene body or within some range upstream and downstream of the gene.

\subsection{Example data}

To demonstrate the tuQTL analysis with the \Rpackage{DRIMSeq} package, we use data from the GEUVADIS project \cite{Lappalainen2013}, where 462 RNA-seq samples from lymphoblastoid cell lines were obtained. The genome sequencing data of the same individuals is provided by the 1000 Genomes Project. The samples in this project come from five populations: CEPH (CEU), Finns (FIN), British (GBR), Toscani (TSI) and Yoruba (YRI).

We use the transcript quantification (expected counts from FluxCapacitor) and genotypes available on the GEUVADIS project website \url{http://www.ebi.ac.uk/Tools/geuvadis-das/}, and the Gencode v12 gene annotation available at \url{http://www.gencodegenes.org/releases/12.html}.

In order to make this vignette runnable, we perform the analysis on subsets of bi-allelic SNPs and transcript expected counts for the CEPH population (91 individuals) that correspond to 50 randomly selected genes from chromosome 19. The full data set can be accessed from the \Biocexptpkg{GeuvadisTranscriptExpr} package, along with a description of the preprocessing steps.

\subsection{tuQTL analysis with the DRIMSeq package}

Assuming you have gene annotation, feature counts and bi-allelic genotypes expressed in terms of the number of alleles different from the reference, the \Rpackage{DRIMSeq} workflow for the tuQTL analysis is analogous to the one for differential transcript usage.

First, we have to create a \Rclass{dmSQTLdata} object, which contains feature counts, sample information and genotypes. Similarly to the differential transcript usage pipeline, results from every step are added to this object, and at the end of the analysis, it contains precision estimates, proportion estimates, likelihood ratio statistics, p-values and adjusted p-values. As new elements are added, the object also changes its class, \Rclass{dmSQTLdata} $\rightarrow$ \Rclass{dmSQTLprecision} $\rightarrow$ \Rclass{dmSQTLfit} $\rightarrow$ \Rclass{dmSQTLtest}. For each object, slots and methods are inherited from the previous one.

\subsubsection{Loading GEUVADIS data into R}

We use the subsets of data defined in the \Biocexptpkg{GeuvadisTranscriptExpr} package.

<<>>=
library(GeuvadisTranscriptExpr)

geuv_counts <- GeuvadisTranscriptExpr::counts
geuv_genotypes <- GeuvadisTranscriptExpr::genotypes
geuv_gene_ranges <- GeuvadisTranscriptExpr::gene_ranges
geuv_snp_ranges <- GeuvadisTranscriptExpr::snp_ranges
@

Load the \Rpackage{DRIMSeq} package.

<<>>=
library(DRIMSeq)
@

In the tuQTL analysis, the initial data object \Robject{d} is of class \Rclass{dmSQTLdata} and, in addition to feature counts and sample information, it contains the genotypes of SNPs located in the surrounding of each gene. This surrounding is defined with the \Rcode{window} parameter. In order to find out which SNPs should be tested with which genes, the \Rfunction{dmSQTLdata} function requires as input the locations of genes (\Rcode{gene\_ranges}) and SNPs (\Rcode{snp\_ranges}) stored as \Rclass{GRanges} objects. The variables with transcript IDs and gene IDs in the \Robject{counts} data frame must be named \Rcode{feature\_id} and \Rcode{gene\_id}, respectively. In the \Robject{genotypes} data frame, the variable with SNP IDs must be named \Rcode{snp\_id}.
<<>>=
colnames(geuv_counts)[c(1,2)] <- c("feature_id", "gene_id")
colnames(geuv_genotypes)[4] <- "snp_id"

geuv_samples <- data.frame(sample_id = colnames(geuv_counts)[-c(1,2)])

d <- dmSQTLdata(counts = geuv_counts, gene_ranges = geuv_gene_ranges,
  genotypes = geuv_genotypes, snp_ranges = geuv_snp_ranges,
  samples = geuv_samples, window = 5e3)
d
@

In our tuQTL analysis, we do not repeat tests for SNPs that define the same grouping of samples (genotype). We identify SNPs with identical genotypes across the samples and assign them to blocks. Estimation and testing are done at the block level, but the returned results are extended to the SNP level by repeating the block statistics for each SNP that belongs to a given block.

The data summary function \Rfunction{plotData} produces three histograms: the number of features per gene, the number of SNPs per gene and the number of blocks per gene.

<<>>=
plotData(d, plot_type = "features")
plotData(d, plot_type = "snps")
plotData(d, plot_type = "blocks")
@

\subsubsection{Filtering}

The filtering step eliminates genes and features with low expression, as in the differential transcript usage analysis (see section \ref{DS_filtering}). Additionally, it filters out the SNPs/blocks that do not define at least two genotypes, each present in at least \Robject{minor\_allele\_freq} individuals. Usually, \Robject{minor\_allele\_freq} is set to roughly 5\% of the total sample size.

Ideally, we would like genes to be expressed at some minimal level in all samples, because this would lead to better estimates of feature ratios. However, for some genes, missing values may be present in the counts data, or genes may be lowly expressed in some samples. Setting \Robject{min\_samps\_gene\_expr} to 91 may therefore exclude too many genes from the analysis. We can be slightly less stringent by requiring, for example, \Rcode{min\_samps\_gene\_expr = 70}.

<<>>=
d <- dmFilter(d, min_samps_gene_expr = 70, min_samps_feature_expr = 5,
  minor_allele_freq = 5, min_gene_expr = 10, min_feature_expr = 10)
@

\subsubsection{Precision estimation}

In the DTU analysis (section \ref{DS_precision_estimation}), the full model used in the precision estimation has to be defined by the user. Here, full models are defined by the genotypes. For a given SNP, the genotype can take numeric values of 0, 1 and 2. When \Rcode{one\_way = TRUE}, multiple group fitting is performed. When \Rcode{one\_way = FALSE}, a regression framework is used with the design matrix defined by the formula $\sim group$, where $group$ is a continuous (not categorical) variable with the genotype values 0, 1 and 2.

For the tuQTL analysis, \Rfunction{dmPrecision} has an additional parameter called \Rcode{speed}. If \Rcode{speed = FALSE}, gene-wise precisions are calculated for each gene-block pair. This calculation may take a long time, since there can be hundreds of SNPs/blocks per gene. If \Rcode{speed} is set to \Rcode{TRUE}, only one precision is calculated per gene (assuming the null model, i.e., a model with intercept only), and it is assigned to all the blocks matched to this gene.

In the default setting, \Rcode{speed = TRUE} and the common precision is used as an initial value in the grid approach to estimate gene-wise precisions with no moderation, since the sample size is quite large. Again, this step of the pipeline is one of the most time consuming. Thus, consider using \Rcode{BPPARAM = BiocParallel::MulticoreParam()} with more than one worker when performing a real data analysis.
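For illustration, a non-evaluated sketch of such a call is shown below (the number of workers is arbitrary; requesting \Rcode{speed = FALSE} computes gene-block-wise precisions and will further increase the run time):

<<eval = FALSE>>=
library(BiocParallel)

## To make the analysis reproducible
set.seed(123)

## Gene-block-wise precisions, computed on 4 cores
d <- dmPrecision(d, speed = FALSE,
  BPPARAM = MulticoreParam(workers = 4))
@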
<<>>=
## To make the analysis reproducible
set.seed(123)

## Calculate precision
d <- dmPrecision(d)

plotPrecision(d)
@

\subsubsection{Proportion estimation}

The Dirichlet-multinomial full model proportions/coefficients and likelihoods are estimated for each gene-block pair. Currently, no transcript-level analysis is implemented in the tuQTL workflow.

<<>>=
d <- dmFit(d)
@

\subsubsection{Testing for tuQTLs}

The \Rfunction{dmTest} function estimates the gene-level null model proportions/coefficients and likelihoods and performs the likelihood ratio tests. The null models correspond to models with intercept only.

In contrast to the DTU analysis, there are some additional challenges that have to be handled in the tuQTL analysis. These include a large number of tests per gene, highly variable allele frequencies (and thus models) and linkage disequilibrium. As in other sQTL studies, we apply a permutation approach to empirically assess the null distribution of associations and use it for the adjustment of the nominal p-values.

There are two permutation schemes available. When \Rcode{permutation\_mode} equals \Rcode{"all\_genes"}, the null p-value distribution is calculated from permutations of all the genes. When \Rcode{permutation\_mode = "per\_gene"}, the null distribution of p-values is calculated for each gene separately, based on permutations of that individual gene. The latter approach may take a lot of computational time. We suggest using the first option, which is also the default one.

The function \Rfunction{results} returns a data frame with the likelihood ratio statistics, degrees of freedom, p-values and Benjamini and Hochberg adjusted p-values for each gene-block/SNP pair.

<<>>=
d <- dmTest(d)

plotPValues(d)

head(results(d))
@

You can plot the observed transcript ratios for the tuQTLs of interest. Plotting the fitted values is not possible, as we do not return these estimates due to their size. When the sample size is large, we recommend using box plots as the \Rcode{plot\_type}. Here, we plot the tuQTL with the lowest p-value.

<<>>=
res <- results(d)
res <- res[order(res$pvalue, decreasing = FALSE), ]

top_gene_id <- res$gene_id[1]
top_snp_id <- res$snp_id[1]

plotProportions(d, gene_id = top_gene_id, snp_id = top_snp_id)

plotProportions(d, gene_id = top_gene_id, snp_id = top_snp_id,
  plot_type = "boxplot2")
@

%--------------------------------------------------
% Session information
%--------------------------------------------------

\section{Session information}

<<>>=
sessionInfo()
@

%--------------------------------------------------
% References
%--------------------------------------------------

\bibliography{References}

\end{document}