%\VignetteIndexEntry{Generally applicable transcriptome-wide analysis of translational efficiency using anota2seq} %\VignetteKeywords{Translation, ANCOVA} %\VignettePackage{anota2seq} %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage{float} <>= knitr::opts_chunk$set(message = FALSE, warning = FALSE, error = FALSE, tidy = FALSE, eval = TRUE) # turn off verbosity if(packageVersion("BiocStyle") < '2.5.0'){ BiocStyle::latex2() } else if(packageVersion("BiocStyle") >= '2.5.0'){ BiocStyle::latex() } @ \begin{document} \bioctitle[Generally applicable transcriptome-wide analysis of translational efficiency using anota2seq]{Generally applicable\\ transcriptome-wide analysis\\ of translational efficiency using anota2seq} \author{Christian Oertlin, Julie Lorent, Ola Larsson \thanks{\email{ola.larsson@ki.se}}} \maketitle \tableofcontents \newpage \addcontentsline{toc}{section}{Introduction} \section*{Introduction} Gene expression is a multi-step process including transcription, mRNA-transport, -stability and -translation. Dysregulated mRNA translation is commonly observed in human diseases such as cancer and understanding which mRNAs are differentially translated and the mechanisms that mediate such effects is therefore of high importance. Estimates of transcriptome-wide translational efficiency can be obtained using polysome-profiling and ribosome-profiling. Both approaches are based on isolation of translated mRNA (polysome-associated mRNA or Ribosome Protected Fragments [RPF]) followed by quantification using DNA-microarrays or RNA sequencing (RNAseq). A parallel total mRNA sample is also isolated and quantified in order to identify \emph{bona fide} changes in translational efficiency. More details are found in \cite{larsson2008, piccirillo}. \\ During analysis of the resulting data, three regulatory modes can be observed: changes in mRNA abundance (i.e. similar changes in total mRNA levels and levels of translated mRNA) and changes in translational efficiency leading to changes in protein levels (a change in the amount of translated mRNA that is not explained by a change in total mRNA) or buffering which maintains constant levels of translated mRNA (and hence also protein levels) despite altered levels of total mRNA. Efficient separation of these regulatory modes is necessary to elucidate underlying regulatory mechanisms \cite{Oertlin}. Studies of changes in translational efficiency commonly apply per sample differences (log scale) between levels of translated mRNA and total mRNA \cite{larsson2008} that are compared between treatments. However, as discussed in \cite{larsson2010} such translational efficiency scores and outputs from methods that use such scores will show spurious correlations leading to elevated false positive findings \cite{larsson2010}.\\ This bias from spurious correlations can be solved by using per-identifier regression-based analysis between levels of translated mRNA and total mRNA. Such analysis produces residuals that are uncorrelated with the total mRNA levels and changes in translational efficiency leading to altered protein levels or buffering can be identified using Analysis of Partial Variance (APV) \cite{larsson2010}. Anota2seq allows for identification of all three regulatory modes from polysome- or ribosome- profiling data quantified by DNA-microarrays or RNAseq. It uses APV and thereby eliminates spurious correlation bias. Here we illustrate the use of the anota2seq package. \section{Workflow} Analysis of translational activity using anota2seq includes the following steps: \begin{enumerate} \item Initialize an Anota2seqDataSet and pre-process RNA sequencing data using \Rfunction{anota2seqDataSetFromMatrix} or \Rfunction{anota2seqDataSetFromSE}. See section \ref{preproc} \item Assessment of model assumptions using \Rfunction{anota2seqPerformQC} and \Rfunction{ anota2seqResidOutlierTest}. See section \ref{qc} \item Analysis of changes in mRNA abundance and translational efficiency leading to altered protein levels or buffering using \Rfunction{anota2seqAnalyze}. See section \ref{analysis} \item Selection of identifiers and classification into different regulatory modes of gene expression using \Rfunction{anota2seqSelSigGenes} and \Rfunction{anota2seqRegModes}. See section \ref{genesSection} and \ref{regModesSection} respectively. \item Visualize the results using \Rfunction{anota2seqPlotPvalues}, \Rfunction{anota2seqPlotFC} and \Rfunction{anota2seqPlotGenes} See sections \ref{fcSection}, \ref{densitySection} and \ref{genesSection}, respectively. \end{enumerate} \section{Getting started} anota2seq provides a wrapper function called \Rfunction{anota2seqRun} which performs all analysis steps with relevant default parameters. Here we show an overview of the whole workflow using this function. We illustrate an analysis using count data from a RNA sequencing experiment on total mRNA (called anota2seq\_data\_T here) and on translated mRNA (polysome-associated mRNA or Ribosome Protected Fragments, data called anota2seq\_data\_P) on at least 2 conditions ("ctrl" and "treatment" here) and a vector of sample annotation (called anota2seq\_pheno\_vec). The following code performs normalization, assesses model assumptions and performs the analysis for the default contrast (treatment vs. control in this case): <>= library(anota2seq) data(anota2seq_data) @ <>= ads <- anota2seqDataSetFromMatrix( dataP = anota2seq_data_P[1:1000,], dataT = anota2seq_data_T[1:1000,], phenoVec = anota2seq_pheno_vec, dataType = "RNAseq", normalize = TRUE) ads <- anota2seqRun(ads) @ << echo = FALSE, message = FALSE, results = 'hide', eval = TRUE >>= unlink(c("ANOTA2SEQ_interaction_p_distribution.pdf", "ANOTA2SEQ_residual_distribution_summary.jpeg", "ANOTA2SEQ_residual_vs_fitted.jpeg", "ANOTA2SEQ_rvm_fit_for_all_contrasts_group.jpg", "ANOTA2SEQ_rvm_fit_for_interactions.jpg", "ANOTA2SEQ_rvm_fit_for_omnibus_group.jpg", "ANOTA2SEQ_simulated_vs_obt_dfbetas_without_interaction.pdf")) @ The regulatory modes can be quickly visualized: <>= anota2seqPlotFC(ads, selContrast = 1, plotToFile = FALSE) @ The following code chunk illustrates how to access the top list of significant changes in translational efficiency leading to altered protein levels (effect, adjusted p-value, regulatory mode): <>= head( anota2seqGetOutput( ads, analysis = "translation", output = "selected", getRVM = TRUE, selContrast = 1)[, c("apvEff", "apvRvmPAdj", "singleRegMode")]) @ This provided an overview of the features of the package. Each step of the analysis are detailed in the next section. \newpage \section{Transcriptome-wide analysis of translational efficiency using anota2seq} \subsection{Input Data}\label{inputSection} anota2seq can analyze data from both ribosome-profiling and polysome-profiling quantified by RNAseq or DNA-microarrays. anota2seq cannot use data from competitive two channel experiments when the polysome- associated mRNA is directly compared to total mRNA as these do not allow independent estimates of polysome- associated mRNA and total mRNA levels\footnote{A two-channel reference design should be applicable although we have not tested this data type.}. anota2seq requires 3 replicate experiments per group if there is 2 treatments. If there is more than two treatments, two replicates is sufficient but will result in reduced statistical power as compared to three replicates. We recommend three replicates in most cases.\\ In this vignette, we will use simulated data provided with the package to illustrate how to perform each step of the analysis. These data originate from the study by Oertlin et al. \cite{Oertlin} which compared methods for analysis of translatomes quantified by RNAseq. Eight samples were simulated from 2 sample classes ("control" and "treatment"); both total mRNA (anota2seq\_data\_T, raw RNAseq counts) and paired translated mRNA (anota2seq\_data\_P, raw RNAseq counts) are provided together with a sample class vector (anota2seq\_pheno\_vec). <>= data(anota2seq_data) # Polysome-associated mRNA and total mRNA columns must follow the same order head(anota2seq_data_P, n = 2) head(anota2seq_data_T, n = 2) # phenoVec must describe the sample class for corresponding columns # in dataT and dataP anota2seq_pheno_vec @ \subsection{Normalization and transformation of the raw data}\label{preproc} The anota2seq performance will vary depending on normalization and transformation of the data. We therefore recommend that the user tries several different transformations and normalization approaches while monitoring the quality control plots (the influential data points, the interactions and the normality of the residuals) and the RVM F-distribution fit plot if RVM is used (see sections \ref{qc} and \ref{rvmSection}). \\ anota2seq gives the options to supply normalized DNA-microarrays data, normalized and transformed RNAseq data or raw RNAseq data for both translated mRNA (i.e. polysome-associated mRNA or RPF) and total mRNA. As anota2seq requires data on a continuous log scale, raw RNAseq data (count data) will be pre-processed to ensure efficient analysis.\\ In general, RMA is an efficient normalization for DNA-microarray data (for Affymetrix GeneChips) while TMM-log2 normalization \cite{robinson, ritchie} is efficient for RNAseq data (this is the default method in anota2seq when raw RNAseq data is provided as input). The rlog algorithm from \Biocpkg{DESeq2} \cite{love} can also be used within anota2seq.\\ Normalization and transformation of RNAseq data are performed at the step of initialization of an \Rclass{Anota2seqDataSet} object in the \Rfunction{anota2seqDataSetFromMatrix} and \Rfunction{anota2seqDataSetFromSE} functions. Filtering of identifiers with 0 counts in at least one sample is also available when raw RNAseq data is provided (parameter \Rcode{filterZeroGenes}). Additionally, users can filter the dataset to remove identifiers with no variance in each mRNA source prior to analysis. This filtering prevents an APV analysis without variance which will result in an error and a halt in the analysis (parameter \Rcode{varCutOff}). <>= ads <- anota2seqDataSetFromMatrix( dataP = anota2seq_data_P[1:1000,], dataT = anota2seq_data_T[1:1000,], phenoVec = anota2seq_pheno_vec, dataType = "RNAseq", filterZeroGenes = TRUE, normalize = TRUE, transformation = "TMM-log2", varCutOff = NULL) @ <>= library(SummarizedExperiment) countData <- as.matrix(cbind(anota2seq_data_P[1:1000,],anota2seq_data_T[1:1000,])) # annotations anot <- data.frame( row.names = colnames(countData), #information on mRNA types RNA = c(rep("P",8),rep("T",8)), #samples classes treatment = rep(c(rep("ctrl",4),rep("treatment",4)),2), #sample Pairs samplePairs = rep(c(paste("ctrl",c(1:4),sep="_"),paste("treatment",c(1:4),sep="_")),2), # batch information, in this case replicate number batches = rep(c(1,2,3,4),4)) # Create the SummarizedExperiment mySummarizedExperiment <- SummarizedExperiment( assays = list(counts = countData), colData = anot) @ Similarly, an \Rclass{Anota2seqDataSet} object can be initialized from a \Rclass{SummarizedExperiment} object using \Rfunction{anota2seqDataSetFromSE} as follows\footnote{see \Rcode{help(anota2seqDataSetFromSE)} for details on required \Rcode{colData} formatting}: <>= adsFromSE <- anota2seqDataSetFromSE( se = mySummarizedExperiment, assayNum = 1, # Position of the count data in assays(mySummarizedExperiment) dataType = "RNAseq", normalize = TRUE, transformation = "TMM-log2") @ \subsection{Assessment of model assumptions} \label{qc} To apply APV, multiple assumptions need to be fulfilled for tens of thousands of identifiers, which is a substantial challenge to evaluate. Due to the high dimensionality of the data, anota2seq takes multiple testing into account when assessing assumption violations. If we observe the same number of problematic features as expected, we assume that we can apply anota2seq.\\ Using the following code, anota2seq performs quality control checks and outputs diagnostic plots (Fig. \ref{singleReg} to \ref{int}) which are further described below. <>= ads <- anota2seqPerformQC(Anota2seqDataSet = ads, generateSingleGenePlots = TRUE) @ <>= ads <- anota2seqPerformQC(Anota2seqDataSet = ads, generateSingleGenePlots = TRUE, fileName = "figure/singleReg.pdf", fileStem = "figure/") @ Highly influential data points may cause errors in the regression analyzes. On the one hand, we expect that a number of highly influential data points will appear merely by chance because of the large number of analyzes performed. Thus anota2seq attempts to establish if we, when considering all analyzed identifiers, observe more influential data points compared to what would be expected by chance. If the answer is no, then there are no concerns with the overall analysis. On the other hand, influential data points may nonetheless affect the specific APV analyzes in which they are found. For this reason, anota2seq provides an output (Fig. \ref{singleReg}) that can be used to flag these identifiers so that they can be examined in more detail if desired.\\ \begin{figure} \includegraphics[page=1]{figure//singleReg.pdf} \caption{anota2seq can be set to output identifier per identifier regressions between translated mRNA and total mRNA levels. Plotting symbols are taken from the \Rfunction{phenoVec} argument and the lines are the regression lines per samples class} \label{singleReg} \end{figure} For detection of influential data points, anota2seq uses standardized dfbeta for the slope of the regression and several thresholds to determine whether or not a data point is highly influential. As there is no known distribution of the dfbetas when the underlying data are normally distributed, anota2seq simulates data sets to obtain estimates of the expected number of outliers. The simulation is performed by sampling N (corresponding to the number of samples in the analysis) data points from the normal distribution and calling these data points the translated mRNA level. In detail, such translated mRNA levels are obtained by sampling data points from a normal distribution with a mean of the corresponding total mRNA level data point. Ten different such data sets are obtained with different variances when sampling translated mRNA level data. These data sets are then merged and frequencies of outlier dfbetas are calculated and compared to the frequencies of outlier dfbetas from the analyzed data (Fig. \ref{dfbs}). \\ \begin{figure} \includegraphics[]{figure//_simulated_vs_obt_dfbetas_without_interaction.pdf} \caption{A bar graph showing the obtained and expected (based on a simulation) number of influential data points as judged by different thresholds. For each threshold the difference between the obtained and the simulated frequency of outliers is shown.} \label{dfbs} \end{figure} APV assumes that the slopes of the regressions from each treatment are the same so that using the common slope is valid. This assumption postulates that the relationship between the translated mRNA level and the total mRNA level shows the same slope for each treatment, i.e., treatment and total mRNA levels do not interact in predicting translated mRNA levels. Again, because we analyze thousands of regressions, we expect that a number of interactions will arise simply due to chance. If the number of interactions does not exceed what is expected by chance, their p-values should follow a uniform distribution. Thus anota2seq provides an output allowing to compare the distribution of the interaction significances as well as their distribution after adjusting for multiple testing (Fig. \ref{int}).\\ \begin{figure} \includegraphics[page=1]{figure//_interaction_p_distribution.pdf} \caption{Assessment of whether the significances for the interactions follow the uniform NULL distribution. Shown are both density plots and histograms of the nominal and adjusted p-values (in this case adjusted using Benjamini-Hochberg FDR).} \label{int} \end{figure} Significance testing within the APV framework assumes that the residuals from the regressions are normally distributed. The \Rfunction{anota2seqResidOutlierTest} function assesses whether the residuals from the linear regressions (identifier by identifier) of translated mRNA level~total mRNA level are normally distributed. <>= ads <- anota2seqResidOutlierTest(ads) @ <>= ads <- anota2seqResidOutlierTest(ads, residFitPlot = FALSE, generateSingleGenePlots = TRUE, nGraphs = 12) file.rename(from = "ANOTA2SEQ_residual_distributions_single.pdf", to = "figure/ANOTA2SEQ_residual_distributions_single.pdf") file.rename(from = "ANOTA2SEQ_residual_distribution_summary.jpeg", to = "figure/ANOTA2SEQ_residual_distribution_summary.jpeg") @ anota2seq generates normal Q-Q plots of the residuals. If the residuals are normally distributed, the data quantiles will form a straight diagonal line from bottom left to top right. Because there are typically relatively few data points, anota2seq calculates ''envelopes'' based on a set of samplings from the normal distribution using the same number of data points as for the true data \cite{Venables}. To enable a comparison, both the true and the sampled data are scaled (variance=1) and centered (mean=0). The samples (both true and sampled) are then sorted and the true sample is compared to the envelopes of the sampled series at each sort position. The result is presented as a Q-Q plot of the true data where the envelopes of the sampled series are indicated. If there are 99 samplings we expect that 1/100 values should be outside the range obtained from the samplings. Thus it is possible to assess if approximately the expected number of outlier residuals are obtained. anota2seq provides a single identifier output (Fig. \ref{residSingle}) as well as a summary output (Fig. \ref{resid}).\\ \begin{figure} \includegraphics[]{figure//ANOTA2SEQ_residual_distributions_single.pdf} \caption{Assessment of whether the residuals are approximately normally distributed. Shown is the output from the single identifier alternative within \Rfunction{anota2seqResidOutlierTest}. The Q-Q plot for the identifier is compared to the outer limits of a set of Q-Q plots generated by sampling from the normal distribution.} \label{residSingle} \end{figure} \begin{figure} \includegraphics[]{figure//ANOTA2SEQ_residual_distribution_summary.jpeg} \caption{Assessment of whether the residuals are approximately normally distributed. Shown is the output from all identifiers using the \Rfunction{anota2seqResidOutlierTest} function. The Q-Q plot for the identifiers is compared to the outer limits of a set of Q-Q plots generated by sampling from the normal distribution. The obtained and expected percentage of outliers is indicated at each rank position and combined.} \label{resid} \end{figure} While anota2seq enables testing of the issues discussed above, it is left to the user to decide whether it is possible to use anota2seq to identify changes in translational efficiency affecting protein levels or buffering. A few issues that may cause problems in the quality control are: \begin{enumerate} \item Outlier samples. One or a few outlier samples in the analysis (either from the translated mRNA data or the total mRNA data) could give rise to many influential data points. Thus, if there are more influential data points than would be expected, a careful quality control of the data followed by identification and exclusion of outlier samples might be needed to resolve such issues. \item More significant interactions compared to what is expected by chance could be caused by bias in the data set. \item If the resulting residuals deviate strongly from normality an alternative normalization method could be tested. \end{enumerate} \clearpage \subsection{Analysis of changes in translational efficiency leading to altered protein levels or buffering} \label{analysis} Once the data set has been validated as suitable for analysis, significant changes in translational efficiency affecting protein levels or buffering can be identified. \\ Translational buffering is a regulatory pattern which decouples mRNA levels from protein levels (despite altered levels of total mRNAs between conditions, translated mRNA levels remain constant; mRNAs under such a regulatory pattern are colored in dark and light blue in the example presented in Fig. \ref{fig:gettingStartedVisualizeResults}) and which potentially holds important information regarding how gene expression is regulated. anota2seq allows to distinguish between changes in translational efficiency leading to altered protein levels (orange and red colored mRNAs in Fig. \ref{fig:gettingStartedVisualizeResults}) and buffering. Both analyzes can be performed on our sample data using the following code: <>= ads <- anota2seqAnalyze(Anota2seqDataSet = ads, analysis = c("translation", "buffering")) @ <>= ads <- anota2seqAnalyze(Anota2seqDataSet = ads, analysis = c("translation", "buffering"), fileStem = "figure/") @ While \Rfunction{anota2seqPerformQC} performs an omnibus treatment effect test when there are more than 2 treatments, \Rfunction{anota2seqAnalyze} allows the user to set custom contrasts using the \Rcode{contrasts} parameter. In the example above, the default contrast ("treatment" vs. "control") is used. \subsubsection{Random variance model (RVM) to improve power in detection of changes in translational efficiency leading to altered protein levels or buffering} \label{rvmSection} RVM is an empirical Bayes method which has been shown to increase statistical power for small N analysis \cite{wright}. In RVM, the variance of each identifier is adjusted using the variance obtained from an inverse gamma distribution derived from the variances of all identifiers. A key assumption in RVM is that the resulting variances follow a theoretical F-distribution. anota2seq tests this for the analysis of omnibus group effects (Fig. \ref{rvm}), omnibus interactions (not shown, output of \Rfunction{anota2seqPerformQC}), and the identification of changes in translational efficiency leading to altered protein levels and buffering (not shown, output of \Rfunction{anota2seqAnalyze}). Each of these analyzes generates a comparison of the obtained empirical distribution compared to the theoretical distribution (similarity is then assessed using a KS test whose alternative hypothesis should be rejected for a good fit). We have noticed that the normalization of the data can strongly influence the fit but that RVM seems to be applicable in most cases after identifying an efficient normalization/transformation. It is necessary to validate that application of RVM does not influence the distribution of the interaction p-values (not shown, output of \Rfunction{anota2seqPerformQC}). \Rfunction{anota2seqAnalyze} performs analyzes both with and without RVM ; we recommend using RVM as it improves the power to detect changes in translational efficiency leading to altered protein levels or buffering within anota2seq \cite{larsson2010}. \begin{figure}[H] \includegraphics{figure//_rvm_fit_for_omnibus_group.jpg} \caption{An output from the \Rfunction{anota2seqPerformQC} function (used with parameter \Rcode{useRVM = TRUE}) comparing obtained variances to the theoretical F-distribution. RVM assumes that the empirical and the theoretical distributions are similar.} \label{rvm} \end{figure} \subsubsection{Visualization of the results from \Rfunction{anota2seqAnalyze}}\label{densitySection} \Rfunction{anota2seqAnalyze} outputs details of the tests for each identifier (information about slopes of the APV model, test statistics, effect, unadjusted and adjusted p-value): <>= head(anota2seqGetOutput( ads, analysis = "translation", output = "full", selContrast = 1, getRVM = TRUE)) @ The density of p-values can be visualized using the \Rfunction{anota2seqPlotPvalues} function on the output of \Rfunction{anota2seqAnalyze} (Fig. \ref{fig:pvalDensity}). <>= par(mfrow = c(1, 2)) anota2seqPlotPvalues(ads, selContrast = 1, plotToFile = FALSE) @ \subsubsection{Unrealistic models of changes in translation efficiency}\label{slopeSection} The slopes that are fitted in the anota2seq APV models can take unrealistic values that will influence the analysis of changes in translation efficiency leading to altered protein levels or buffering. anota2seq therefore tests whether slopes for analysis of changes in translational efficiency affecting protein levels that are >1 differ from 1 and slopes for analysis of changes in translational efficiency leading to buffering that are <-1 differ from -1. Furthermore, slopes < 0 for analysis of changes in translational efficiency affecting protein levels or > 0 for analysis of changes in translational efficiency leading to buffering, indicate unlikely but not impossible translational control so these events are also tested for. Results of these tests (p-values) are found in the output of \Rfunction{anota2seqPerformQC}, \Rfunction{anota2seqAnalyze} and \Rfunction{anota2seqRun} functions. These p-values can be used to filter or flag identifiers with unrealistic slopes or slopes revealing unlikely translational control. \subsubsection{Identifier selection and visualization of single gene regressions}\label{genesSection} The output from \Rfunction{anota2seqAnalyze} can be filtered using the \Rfunction{anota2seqSelSigGenes}. Identifiers can be selected based on several criteria: \begin{itemize} \item include only realistic slopes (see section \ref{slopeSection}), using parameters \Rcode{minSlopeTranslation}, \Rcode{maxSlopeTranslation}\footnote{and similarly \Rcode{minSlopeBuffering} and \Rcode{maxSlopeBuffering}} and \Rcode{slopeP} \item include a minimum effect threshold, using parameter \Rcode{minEff} \item include only significant identifiers according to a defined p-value or adjusted p-value threshold (parameter \Rcode{maxP} and \Rcode{maxPAdj}) \end{itemize} An example of code to perform this filtering is as follows: <>= ads <- anota2seqSelSigGenes(Anota2seqDataSet = ads, selContrast = 1, minSlopeTranslation = -1, maxSlopeTranslation = 2, minSlopeBuffering = -2, maxSlopeBuffering = 1, maxPAdj = 0.05) @ Once the \Robject{Anota2seqDataSet} object has been filtered, single gene regressions can be visualized using the \Rfunction{anota2seqPlotGenes} function (Fig. \ref{fig:singleGeneRegressions_translation} and \ref{fig:singleGeneRegressions_buffering}). The graphical output includes both the graphical interpretation of the APV analysis and the key statistics from both the standard and the RVM based analysis. <>= anota2seqPlotGenes(ads, selContrast = 1, analysis = "translation", plotToFile = FALSE) @ <>= anota2seqPlotGenes(ads, selContrast = 1, analysis = "buffering", plotToFile = FALSE) @ \subsubsection{Note about analysis of translational buffering} The APV model fitted in anota2seq for analysis of changes in translational efficiency leading to altered protein levels consists in a model with translated mRNA as independent variable and total mRNA and the sample class variable as dependent variables. In other words, a common slope for all sample categories is considered and the translational effect is defined as a difference in intercepts \cite{larsson2010}. This regression model can be visualized in Fig. \ref{fig:singleGeneRegressions_translation}.\\ Translational buffering is defined as changes in total mRNA level that are not paralleled by changes in levels of translated mRNA. As such, performing analysis of buffering considers total mRNA as independent variable and translated mRNA as dependent variable (together with the sample class; as illustrated in Fig. \ref{fig:singleGeneRegressions_buffering}). \clearpage \subsection{Categorizing genes into regulatory modes}\label{regModesSection} Polysome or ribosome profiling allows the user to distinguish between three regulatory modes: changes in mRNA abundance (i.e. similar changes in total mRNA levels and levels of translated mRNA) and translational efficiency leading to altered protein levels or buffering \cite{Oertlin}. For that, the \Rfunction{anota2seqAnalyze} and \Rfunction{anota2seqSelSigGenes} functions have to be run with parameter \Rcode{analysis} set to "translation" and "buffering" as shown above but analysis of differential expression of total mRNA and translated mRNA is also required. For that, the same functions can be used with \Rcode{analysis} parameter set to "total mRNA" and "translated mRNA" as shown below: <>= ads <- anota2seqAnalyze(Anota2seqDataSet = ads, analysis = c("total mRNA", "translated mRNA"), fileStem = "figure/") ads <- anota2seqSelSigGenes(Anota2seqDataSet = ads, analysis = c("total mRNA", "translated mRNA"), selContrast = 1, minSlopeTranslation = -1, maxSlopeTranslation = 2, minSlopeBuffering = -2, maxSlopeBuffering = 1, maxPAdj = 0.05) @ <>= ads <- anota2seqAnalyze(Anota2seqDataSet = ads, analysis = c("total mRNA", "translated mRNA")) ads <- anota2seqSelSigGenes(Anota2seqDataSet = ads, analysis = c("total mRNA", "translated mRNA"), selContrast = 1, minSlopeTranslation = -1, maxSlopeTranslation = 2, minSlopeBuffering = -2, maxSlopeBuffering = 1, maxPAdj = 0.05) @ Once all analyzes have been performed, all regulated identifiers can be categorized into one of these regulatory modes using the \Rfunction{anota2seqRegModes} function. This categorization into gene expression regulatory patterns might be of interest in order to elucidate underlying mechanisms. <>= ads <- anota2seqRegModes(ads) @ Notably, there is a hierarchy such that mRNAs identified as changing their translational efficiency leading to altered protein levels will belong to the translation group and no other group; mRNAs that change their levels in the translated pool and total mRNA pool but are not identified as changing their translational efficiency leading to altered protein levels will be in the abundance group; and mRNAs that are identified as changing their translational efficiency leading to buffering and are not in the former two groups are allocated to the set of buffered mRNAs. Specifically, the \Rfunction{anota2seqRegModes} function adds a column named "singleRegModes" indicating the classification into regulatory modes in the data.frame containing gene by gene statistical results. This output can be accessed by using \Rfunction{anota2seqGetOutput} with \Rcode{output} parameter set to "regModes". << regModesOutput, echo=TRUE, eval = TRUE >>= head(anota2seqGetOutput(object = ads, output="regModes", selContrast = 1, analysis="buffering", getRVM = TRUE))[, c("apvSlope", "apvEff", "apvRvmP", "apvRvmPAdj", "singleRegMode")] @ \subsubsection{Visualizing the different regulatory modes}\label{fcSection} anota2seq provides the \Rfunction{anota2seqPlotFC} function which plots the translated mRNA log Fold Change vs. the total mRNA log Fold Change and colors genes according to their regulatory mode (Fig. \ref{fig:fcPlot}). <>= anota2seqPlotFC(ads, selContrast = 1, plotToFile = FALSE) @ \subsection{One-step procedure with \Rfunction{anota2seqRun}}\label{runSection} In addition to application of each of the functions within anota2seq which provides the maximum flexibility, the anota2seq package provides the option to perform a one-step analysis of translated mRNA, total mRNA and changes in translational efficiency leading to altered protein levels or buffering. This analysis performs quality control followed by analysis of changes in translational efficiency affecting protein levels or buffering. A filtering is also performed (as in \Rfunction{anota2seqSelSigGenes}) as well as categorization into regulatory modes. <>= ads <- anota2seqRun( Anota2seqDataSet = ads, thresholds = list( maxPAdj = 0.05, minEff = 1.5), performQC = TRUE, performROT = TRUE, useRVM = TRUE) @ The output of the \Rfunction{anota2seqRun} function can be supplied to the \Rfunction{anota2seqPlotPvalues}, \Rfunction{anota2seqPlotGenes} and \Rfunction{anota2seqPlotFC} functions for similar visualization of the results as in Fig. \ref{fig:pvalDensity}, \ref{fig:singleGeneRegressions_translation}, \ref{fig:singleGeneRegressions_buffering} and \ref{fig:fcPlot}. \section{Extending anota2seq to analysis of other data sources} In principle, any data source where the intention is to identify changes in a subset that is independent of a background can be analyzed (e.g. RIP-SEQ data). \section{New features in anota2seq compared to \Biocpkg{anota}} The core models in anota2seq are similar to those in the \Bioconductor{} package \Biocpkg{anota}. However, there are many differences including: \begin{itemize} \item \Biocpkg{anota} was designed to analyze data from DNA-microarrays platforms. anota2seq allows analysis of both DNA-microarrays and RNA sequencing (section \ref{inputSection}) \item Implementation of analysis of translational buffering (section \ref{analysis}) \item anota2seq allows for batch adjustment (parameter \Rcode{batchVec} of \Rfunction{anota2seqDataSetFromMatrix} and \Rfunction{anota2seqDataSetFromSE}) \item anota2seq provides additional functions in order to easily and consistently visualize the results of analyzes: \Rfunction{anota2seqPlotPvalues} (section \ref{densitySection}) and \Rfunction{anota2seqPlotFC} (section \ref{fcSection}) \item anota2seq provides a wrapper function which performs all steps of the workflow (section \ref{runSection}) \item anota2seq provides a classification of mRNAs into different gene expression regulatory modes: changes in mRNA abundance, or translational efficiency leading to altered protein levels or buffering (section \ref{regModesSection}) \end{itemize} \newpage \addcontentsline{toc}{section}{References} \bibliography{references} \end{document}