%\VignetteIndexEntry{Generally applicable transcriptome-wide analysis of translational efficiency using anota2seq} %\VignetteKeywords{Translation, ANCOVA} %\VignettePackage{anota2seq} %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage{float} <>= knitr::opts_chunk$set(message = FALSE, warning = FALSE, error = FALSE, tidy = FALSE, eval = TRUE) # turn off verbosity if(packageVersion("BiocStyle") < '2.5.0'){ BiocStyle::latex2() } else if(packageVersion("BiocStyle") >= '2.5.0'){ BiocStyle::latex() } @ \begin{document} \bioctitle[Generally applicable transcriptome-wide analysis of translational efficiency using anota2seq]{Generally applicable\\ transcriptome-wide analysis\\ of translational efficiency using anota2seq} \author{Christian Oertlin, Julie Lorent, Ola Larsson \thanks{\email{ola.larsson@ki.se}}} \maketitle \tableofcontents \newpage \addcontentsline{toc}{section}{Introduction} \section*{Introduction} Gene expression is a multi-step process including transcription, mRNA-transport, -stability and -translation. Dysregulated mRNA translation is commonly observed in human diseases such as cancer and understanding which mRNAs are differentially translated and the mechanisms that mediate such effects is therefore of high importance. Estimates of transcriptome-wide translational efficiency can be obtained using polysome-profiling and ribosome-profiling. Both approaches are based on isolation of translated mRNA (polysome-associated mRNA or Ribosome Protected Fragments [RPF]) followed by quantification using DNA-microarrays or RNA sequencing (RNAseq). A parallel total mRNA sample is also isolated and quantified in order to identify \emph{bona fide} changes in translational efficiency. More details are found in \cite{larsson2008, piccirillo}. \\ During analysis of the resulting data, three modes for regulation of gene expression can be assessed: changes in mRNA abundance (i.e. similar changes in total mRNA levels and levels of translated mRNA) and changes in translational efficiency leading to changes in protein levels (a change in the amount of translated mRNA that is not explained by a change in total mRNA) or buffering which maintains constant levels of translated mRNA (and hence also protein levels) despite altered levels of total mRNA. Efficient separation of these modes for regulation of gene expression is necessary to elucidate underlying regulatory mechanisms \cite{Oertlin}. Studies of changes in translational efficiency commonly apply per sample differences (log scale) between levels of translated mRNA and total mRNA \cite{larsson2008} that are compared between treatments. However, as discussed in \cite{larsson2010} such translational efficiency scores and outputs from methods that use such scores will show spurious correlations leading to elevated false positive findings \cite{larsson2010}.\\ This bias from spurious correlations can be avoided by using per-feature regression-based analysis between levels of translated mRNA and total mRNA. Such analysis produces residuals that are uncorrelated with the total mRNA levels and changes in translational efficiency leading to altered protein levels or buffering can be identified using Analysis of Partial Variance (APV) \cite{larsson2010}. Anota2seq allows for identification of all three gene expression modes from polysome- or ribosome- profiling data quantified by DNA-microarrays or RNAseq. It uses APV and thereby eliminates spurious correlation bias. Here we illustrate the use of the anota2seq package. \section{Workflow} Analysis of translational activity using anota2seq includes the following steps: \begin{enumerate} \item Initialize an Anota2seqDataSet and pre-process RNA sequencing data using \Rfunction{anota2seqDataSetFromMatrix} or \Rfunction{anota2seqDataSetFromSE}. See section \ref{preproc} \item Assessment of model assumptions using \Rfunction{anota2seqPerformQC} and \Rfunction{ anota2seqResidOutlierTest}. See section \ref{qc} \item Analysis of changes in mRNA abundance and translational efficiency leading to altered protein levels or buffering using \Rfunction{anota2seqAnalyze}. See section \ref{analysis} \item Selection of features and classification into modes for regulation of gene expression using \Rfunction{anota2seqSelSigGenes} and \Rfunction{anota2seqRegModes}. See section \ref{genesSection} and \ref{regModesSection}, respectively. \item Visualize the results using \Rfunction{anota2seqPlotPvalues}, \Rfunction{anota2seqPlotFC} and \Rfunction{anota2seqPlotGenes} See sections \ref{fcSection}, \ref{densitySection} and \ref{genesSection}, respectively. \end{enumerate} \section{Getting started} anota2seq provides a wrapper function called \Rfunction{anota2seqRun} which performs all analysis steps with relevant default parameters. Here we show an overview of the whole workflow using this function. We illustrate an analysis using count data from an RNA sequencing experiment on total mRNA (called anota2seq\_data\_T here) and on translated mRNA (polysome-associated mRNA or Ribosome Protected Fragments, data called anota2seq\_data\_P) on conditions ("ctrl" and "treatment") and a vector of sample annotation (called anota2seq\_pheno\_vec). The following code performs normalization, assesses model assumptions and performs the analysis for the default contrast (treatment vs. control in this case): <>= library(anota2seq) data(anota2seq_data) @ <>= ads <- anota2seqDataSetFromMatrix( dataP = anota2seq_data_P[1:1000,], dataT = anota2seq_data_T[1:1000,], phenoVec = anota2seq_pheno_vec, dataType = "RNAseq", normalize = TRUE) ads <- anota2seqRun(ads) @ << echo = FALSE, message = FALSE, results = 'hide', eval = TRUE >>= unlink(c("ANOTA2SEQ_interaction_p_distribution.pdf", "ANOTA2SEQ_residual_distribution_summary.jpeg", "ANOTA2SEQ_residual_vs_fitted.jpeg", "ANOTA2SEQ_rvm_fit_for_all_contrasts_group.jpg", "ANOTA2SEQ_rvm_fit_for_interactions.jpg", "ANOTA2SEQ_rvm_fit_for_omnibus_group.jpg", "ANOTA2SEQ_simulated_vs_obt_dfbetas_without_interaction.pdf")) @ The gene expression modes can quickly be visualized: <>= anota2seqPlotFC(ads, selContrast = 1, plotToFile = FALSE) @ The following code illustrates how to access a top list of significant changes in translational efficiency leading to altered protein levels (effect [minEff], adjusted p-value [apvRvmPAdj], gene expression mode [singleRegMode]; notice that, in anota2seq, "translation" refers to changes in translated mRNA after adjustment for corresponding changes in total RNA and is distinct from "translated mRNA" which refers to changes in translated mRNA without such adjustment): <>= head( anota2seqGetOutput( ads, analysis = "translation", output = "selected", getRVM = TRUE, selContrast = 1)[, c("apvEff", "apvRvmPAdj", "singleRegMode")]) @ This provided an overview of the features of the package. All steps of the analysis are detailed in the next section. \newpage \section{Transcriptome-wide analysis of translational efficiency using anota2seq} \subsection{Input Data}\label{inputSection} anota2seq can analyze data from both ribosome-profiling and polysome-profiling quantified by RNAseq or DNA-microarrays. anota2seq cannot use data from competitive two channel experiments when the polysome-associated mRNA is directly compared to total mRNA as these do not allow independent estimates of polysome-associated mRNA and total mRNA levels\footnote{A two-channel reference design should be applicable although we have not tested this data type.}. anota2seq requires 3 replicate experiments per group if there are 2 conditions. If there are more than two conditions, two replicates is sufficient but will result in reduced statistical power as compared to three replicates. We recommend three replicates in most cases, regardless of the number of conditions.\\ In this vignette, we will use simulated data provided with the package to illustrate how to perform each step of the analysis. These data originate from the study by Oertlin et al. \cite{Oertlin} which compared methods for analysis of translatomes quantified by RNAseq. Eight samples were simulated from 2 sample classes ("control" and "treatment"); both total mRNA (anota2seq\_data\_T, raw RNAseq counts) and paired translated mRNA (anota2seq\_data\_P, raw RNAseq counts) are provided together with a sample class vector (anota2seq\_pheno\_vec). <>= data(anota2seq_data) # Polysome-associated mRNA and total mRNA columns must follow the same order head(anota2seq_data_P, n = 2) head(anota2seq_data_T, n = 2) # phenoVec must describe the sample class for corresponding columns # in dataT and dataP anota2seq_pheno_vec @ \subsection{Normalization and transformation of the raw data}\label{preproc} The anota2seq performance will vary depending on normalization and transformation of the data. We therefore recommend that the user tries several different transformations and normalization approaches while monitoring the quality control plots (the influential data points, the interactions and the normality of the residuals) and the RVM F-distribution fit plot if RVM is used (see sections \ref{qc} and \ref{rvmSection}). \\ anota2seq accepts normalized DNA-microarray data, normalized and transformed RNAseq data or raw RNAseq data for both translated mRNA (i.e. polysome-associated mRNA or RPF) and total mRNA as input. As anota2seq requires data on a continuous log scale, raw RNAseq data (count data) will be pre-processed to ensure efficient analysis.\\ In general, RMA is an efficient normalization for DNA-microarray data (for Affymetrix GeneChips) while TMM-log2 normalization \cite{robinson, ritchie} is efficient for RNAseq data (this is the default method in anota2seq when raw RNAseq data are provided as input). The rlog algorithm from \Biocpkg{DESeq2} \cite{love} can also be used within anota2seq.\\ Normalization and transformation of RNAseq data are performed during initialization of an \Rclass{Anota2seqDataSet} object when applying \Rfunction{anota2seqDataSetFromMatrix} or \Rfunction{anota2seqDataSetFromSE} functions. Filtering of features with 0 counts in at least one sample is also available when raw RNAseq data is provided (parameter \Rcode{filterZeroGenes}). Additionally, users can filter the dataset to remove features with no variance in each mRNA source prior to analysis. This filtering prevents an APV analysis without variance which will result in an error and a halt in the analysis (parameter \Rcode{varCutOff}). <>= ads <- anota2seqDataSetFromMatrix( dataP = anota2seq_data_P[1:1000,], dataT = anota2seq_data_T[1:1000,], phenoVec = anota2seq_pheno_vec, dataType = "RNAseq", filterZeroGenes = TRUE, normalize = TRUE, transformation = "TMM-log2", varCutOff = NULL) @ <>= library(SummarizedExperiment) countData <- as.matrix(cbind(anota2seq_data_P[1:1000,],anota2seq_data_T[1:1000,])) # annotations anot <- data.frame( row.names = colnames(countData), #information on mRNA types RNA = c(rep("P",8),rep("T",8)), #samples classes treatment = rep(c(rep("ctrl",4),rep("treatment",4)),2), #sample Pairs samplePairs = rep(c(paste("ctrl",c(1:4),sep="_"),paste("treatment",c(1:4),sep="_")),2), # batch information, in this case replicate number batches = rep(c(1,2,3,4),4)) # Create the SummarizedExperiment mySummarizedExperiment <- SummarizedExperiment( assays = list(counts = countData), colData = anot) @ Similarly, an \Rclass{Anota2seqDataSet} object can be initialized from a \Rclass{SummarizedExperiment} object using \Rfunction{anota2seqDataSetFromSE} as follows\footnote{see \Rcode{help(anota2seqDataSetFromSE)} for details on required \Rcode{colData} formatting}: <>= adsFromSE <- anota2seqDataSetFromSE( se = mySummarizedExperiment, assayNum = 1, # Position of the count data in assays(mySummarizedExperiment) dataType = "RNAseq", normalize = TRUE, transformation = "TMM-log2") @ \subsection{Assessment of model assumptions} \label{qc} To apply APV within anota2seq, multiple assumptions need to be fulfilled for tens of thousands of features, which is a substantial challenge to evaluate. Due to the high dimensionality of the data, anota2seq accounts for multiple testing when assessing assumption violations. Thus, if a similar number violations as expected by chance occur, it is assumed that anota2seq can be applied.\\ Using the following code, anota2seq performs quality control checks and outputs diagnostic plots (Fig. \ref{dfbs} to \ref{int}) which are further described below. <>= ads <- anota2seqPerformQC(Anota2seqDataSet = ads, generateSingleGenePlots = TRUE) @ <>= ads <- anota2seqPerformQC(Anota2seqDataSet = ads, generateSingleGenePlots = TRUE, fileName = "figure/singleReg.pdf", fileStem = "figure/") @ \subsubsection{Highly influential datapoints} Highly influential data points may cause errors during regression analyzes. It is expected that a number of highly influential data points will appear merely by chance because of the large number of analyzes performed. Thus anota2seq attempts to establish if there are more influential data points compared to what would be expected by chance when considering all analyzed features. For detection of influential data points, anota2seq uses standardized dfbeta for the slope of the regression and several thresholds to determine whether or not a data point is highly influential. As there is no known distribution of the dfbetas when the underlying data are normally distributed, anota2seq simulates data sets to obtain estimates of the expected number of outliers. The simulation is performed by sampling N (corresponding to the number of samples in the analysis) data points from the normal distribution and calling these data points the translated mRNA level. In detail, such translated mRNA levels are obtained by sampling data points from a normal distribution with a mean of the corresponding total mRNA level data point. Ten different such data sets are obtained with different variances when sampling translated mRNA data. These data sets are then merged and frequencies of outlier dfbetas are calculated and compared to the frequencies of outlier dfbetas from the analyzed data (Fig. \ref{dfbs}). \\ \begin{figure} \includegraphics[]{figure//_simulated_vs_obt_dfbetas_without_interaction.pdf} \caption{A bar graph showing the obtained and expected (based on a simulation) number of influential data points as judged by different thresholds. For each threshold the difference between the obtained and the simulated frequency of outliers is shown. } \label{dfbs} \end{figure} If there are no substantial differences between the simulated and obtained number of influential data points as shown in (Fig \ref{dfbs}), we can assume that the model assumption is met. A substantial difference between obtained and simulated proportions of influential data points, however, indicates that the data set contains issues which makes anota2seq analysis (and likely analysis using any other algorithm as well) unreliable \footnote{Highly influential data points can be due to poor data replication.}. It is then advised to perform quality control of included samples to see if one or a few samples behave as outliers (e.g. using principal components analysis [PCA]). Clues regarding whether there is a single sample or multiple samples which behave as outliers may also be possible to obtain by visualizing single feature regressions (Fig. \ref{singleReg}) but in most instances PCA will be a more powerful approach to identify outlier samples. Removing outlier samples (both translated mRNA and total mRNA) may result in a data set which passes quality control.\\ \begin{figure} \includegraphics[page=1]{figure//singleReg.pdf} \caption{anota2seq can be set to output feature per feature regressions between translated mRNA and total mRNA levels. Plotting symbols are taken from the \Rfunction{phenoVec} argument and the lines are the regression lines per samples class} \label{singleReg} \end{figure} \subsubsection{Common slopes between treatment groups} APV assumes that the slopes of the regressions from each individual condition are the same such that using the common slope is valid. This assumption postulates that the relationship between the translated mRNA level and the total mRNA level shows the same slope for each condition, i.e., condition and total mRNA levels do not interact in predicting translated mRNA levels. As discussed above, because we analyze thousands of regressions, we expect that a number of interactions with low p-values will arise simply due to chance also when true positive interactions are absent. If the number of interactions does not exceed what is expected by chance, their p-values should follow a uniform distribution and anota2seq can be applied using common slopes for all features. Thus anota2seq provides an output allowing assessment of the distribution of p-values for the interaction together with the distribution after adjusting for multiple testing (Fig. \ref{int}).\\ \begin{figure} \includegraphics[page=1]{figure//_interaction_p_distribution.pdf} \caption{Assessment of whether the p-values for the interaction follow the uniform NULL distribution. Shown are both density plots and histograms of the nominal and adjusted p-values (in this case adjusted using Benjamini-Hochberg FDR).} \label{int} \end{figure} \subsubsection{Normal distribution of regression residuals} Significance testing within the APV framework assumes that residuals from the regressions are normally distributed. The \Rfunction{anota2seqResidOutlierTest} function assesses whether the residuals from the linear regressions (feature by feature) of translated mRNA level to total mRNA level are normally distributed by generating normal Q-Q plots of the residuals\footnote{If the residuals are normally distributed, the data quantiles will form a straight diagonal line from bottom left to top right. If the resulting residuals deviate strongly from normality an alternative normalization method could be tested.}.\\ <>= anota2seqResidOutlierTest(ads, residFitPlot = FALSE, generateSingleGenePlots = TRUE, nGraphs = 12) @ <>= ads <- anota2seqResidOutlierTest(ads, residFitPlot = FALSE, generateSingleGenePlots = TRUE, nGraphs = 12) file.rename(from = "ANOTA2SEQ_residual_distributions_single.pdf", to = "figure/ANOTA2SEQ_residual_distributions_single.pdf") file.rename(from = "ANOTA2SEQ_residual_distribution_summary.jpeg", to = "figure/ANOTA2SEQ_residual_distribution_summary.jpeg") @ Because there are typically relatively few data points, anota2seq calculates ''envelopes'' based on a set of samplings from a normal distribution using the same number of data points as the input data (i.e. corresponding to the number of samples)\cite{Venables}. To enable a comparison, both the true and the sampled data are scaled (variance=1) and centered (mean=0). The samples (both true and sampled) are then sorted and the input data is compared to the envelopes of the sampled series at each sort position. The result is presented as a Q-Q plot of the input data where the envelopes of the sampled series are indicated. If there are 99 samplings we expect that 1/100 values should be outside the range obtained from the samplings. Thus it is possible to assess if approximately the expected number of outlier residuals are obtained. anota2seq provides a summary output (Fig. \ref{resid}) as well as a feature by feature output (Fig. \ref{residSingle}). If approximately the expected number of outlier residuals are observed, the data set is suitable for anota2seq analysis. If there is a strong deviation as judged by the summary output (Fig. \ref{resid}), the per-feature plots may be useful to identify whether single or multiple samples contribute to outlier residuals (Fig. \ref{residSingle}). Such samples may be possible to identify as outliers in PCA analysis and excluded from analysis as indicated above. Alternatively, another normalization/transformation approach could be evaluated.\\ \begin{figure} \includegraphics[]{figure//ANOTA2SEQ_residual_distribution_summary.jpeg} \caption{Assessment of whether the residuals are approximately normally distributed. Shown is the output from all features using the \Rfunction{anota2seqResidOutlierTest} function. The Q-Q plot for the features is compared to the outer limits of a set of Q-Q plots generated by sampling from the normal distribution. The obtained and expected percentage of outliers is indicated at each rank position and combined.} \label{resid} \end{figure} \begin{figure} \includegraphics[]{figure//ANOTA2SEQ_residual_distributions_single.pdf} \caption{Assessment of whether the residuals are approximately normally distributed. Shown is the output from the single feature alternative within \Rfunction{anota2seqResidOutlierTest}. The Q-Q plot for the feature is compared to the outer limits of a set of Q-Q plots generated by sampling from the normal distribution.} \label{residSingle} \end{figure} \subsubsection{An example of quality control performed on a data set with non-optimal normalization/transformation } In section \ref{qc} we illustrated quality control within anota2seq on a dataset that fulfills all criteria. As a comparison, here we showcase quality control using a dataset which was normalized using an approach which does not allow analysis using anota2seq. First and somewhat surprising the assessment of influential data points indicates fewer such data points compared to what is expected (Fig. \ref{dfbsBad}). Second, we obtain a substantial elevation of outlier residuals (1\% expected vs 3.6\% obtained) (Fig. \ref{badResid}). The issue regarding the residuals becomes evident in the per-feature analysis, as multiple features do not display the expected straight line from bottom left to top right but rather a non-linear pattern (Fig \ref{badSingleResid}). Finally, the distribution of variances from all features does not follow the theoretical F-distribution (Fig. \ref{badRVM}) indicating that RVM can not be applied to this dataset (see further below). The last aspect (Fig. \ref{badRVM}) is from our experience a good integrated sensor for that deviations from what is expected by chance within other quality measures are substantial enough to question the reliability of anota2seq analysis (and likely analysis using other algorithms as well).\\ \begin{figure} \includegraphics[]{figure/badSim.pdf} \caption{A bar graph showing the obtained and expected (based on a simulation) number of influential data points as judged by different thresholds for a poor-quality data set. For each threshold the difference between the obtained and the simulated frequency of outliers is shown. Notice that fewer influential data points as compared what is expected by chance is observed.} \label{dfbsBad} \end{figure} \begin{figure} \includegraphics[]{figure/badResid.pdf} \caption{Assessment of whether the residuals are approximately normally distributed for a poor-quality data set. Shown is the output from all features using the \Rfunction{anota2seqResidOutlierTest} function. The Q-Q plot for the features is compared to the outer limits of a set of Q-Q plots generated by sampling from the normal distribution. The obtained and expected percentage of outliers is indicated at each rank position and combined. Notice that a substantially higher proportion of outlier residuals is observed.} \label{badResid} \end{figure} \begin{figure} \includegraphics[]{figure/badSingleResid.pdf} \caption{Assessment of whether the residuals are approximately normally distributed for a poor-quality data set. Shown is the output from the single feature alternative within \Rfunction{anota2seqResidOutlierTest}. The Q-Q plot for the feature is compared to the outer limits of a set of Q-Q plots generated by sampling from the normal distribution. Notice, that residuals from several individual features (residual from each samples is indicated with an "x") show a non-linear pattern.} \label{badSingleResid} \end{figure} \begin{figure} \includegraphics[]{figure/badRVM.pdf} \caption{An output from the \Rfunction{anota2seqPerformQC} function (used with parameter \Rcode{useRVM = TRUE}) comparing obtained variances to the theoretical F-distribution for a poor-quality data set. RVM assumes that the empirical and the theoretical distributions are similar. Notice the strong difference between the expected theoretical and obtained distributions. A Kolmogorov-Smirnov test is used to assess a difference between the distributions and the p-value is indicated. Data sets with small deviations where lines largely overlap albeit with a, for a Kolmogorov-Smirnov test on approximately 10 000 data points, modest p-value (e.g. larger than approximately 0.001) are considered fit for further analysis.} \label{badRVM} \end{figure} \clearpage \subsection{Analysis of changes in translational efficiency leading to altered protein levels or buffering} \label{analysis} Once the data set has been validated as suitable for analysis, significant changes in translational efficiency affecting protein levels or buffering can be identified. \\ Translational buffering decouples mRNA levels from protein levels (despite altered levels of total mRNA between conditions, translated mRNA levels remain constant; such mRNAs are colored in dark and light blue in the example presented in Fig. \ref{fig:gettingStartedVisualizeResults}), which potentially holds important information regarding how gene expression is regulated. anota2seq distinguishes between changes in translational efficiency leading to altered protein levels (orange and red colored mRNAs in Fig. \ref{fig:gettingStartedVisualizeResults}) and buffering. Both analyzes can be performed on the sample data using the following code: <>= ads <- anota2seqAnalyze(Anota2seqDataSet = ads, analysis = c("translation", "buffering")) @ <>= ads <- anota2seqAnalyze(Anota2seqDataSet = ads, analysis = c("translation", "buffering"), fileStem = "figure/") @ While \Rfunction{anota2seqPerformQC} performs an omnibus treatment effect test when there are more than 2 treatments, \Rfunction{anota2seqAnalyze} allows the user to set custom contrasts using the \Rcode{contrasts} parameter. In the example above, the default contrast ("treatment" vs. "control") is used. \subsubsection{Random variance model (RVM) to improve power in detection of changes in translational efficiency leading to altered protein levels or buffering} \label{rvmSection} RVM is an empirical Bayes method which has been shown to increase statistical power for small N analysis \cite{wright}. In RVM, the variance of each feature is adjusted using the variance obtained from an inverse gamma distribution derived from the variances of all features. A key assumption in RVM is that the resulting variances follow a theoretical F-distribution. anota2seq tests this for the analysis of omnibus group effects (Fig. \ref{rvm}), omnibus interactions (not shown, output of \Rfunction{anota2seqPerformQC}), and identification of changes in translational efficiency leading to altered protein levels and buffering (not shown, output of \Rfunction{anota2seqAnalyze}). Each of these analyzes generates a comparison of the obtained empirical distribution compared to the theoretical distribution (similarity is then assessed using a KS test whose alternative hypothesis should be rejected for a good fit). We have noticed that the normalization of the data can strongly influence the fit but that RVM seems to be applicable in most cases after identifying an efficient normalization/transformation. It is necessary to validate that application of RVM does not influence the distribution of the interaction p-values (not shown, output of \Rfunction{anota2seqPerformQC}). \Rfunction{anota2seqAnalyze} performs analyzes both with and without RVM but we strongly recommend using RVM as it improves the power to detect changes in translational efficiency leading to altered protein levels or buffering within anota2seq \cite{larsson2010}. \begin{figure}[H] \includegraphics{figure//_rvm_fit_for_omnibus_group.jpg} \caption{An output from the \Rfunction{anota2seqPerformQC} function (used with parameter \Rcode{useRVM = TRUE}) comparing obtained variances to the theoretical F-distribution. RVM assumes that the empirical and the theoretical distributions are similar. A Kolmogorov-Smirnov test is used to assess a difference between the distributions and the p-value is indicated. Data sets with small deviations where lines largely overlap albeit with a, for a Kolmogorov-Smirnov test on thousands of data points, modest p-value (e.g. larger than approximately 0.001) are considered fit for further analysis.} \label{rvm} \end{figure} \subsubsection{Visualization of the results from \Rfunction{anota2seqAnalyze}}\label{densitySection} \Rfunction{anota2seqAnalyze} outputs details of the tests for each feature (information about slopes of the APV model, test statistics, effect, unadjusted and adjusted p-value): <>= head(anota2seqGetOutput( ads, analysis = "translation", output = "full", selContrast = 1, getRVM = TRUE)) @ The density of p-values can be visualized using the \Rfunction{anota2seqPlotPvalues} function on the output of \Rfunction{anota2seqAnalyze} (Fig. \ref{fig:pvalDensity}). <>= par(mfrow = c(1, 2)) anota2seqPlotPvalues(ads, selContrast = 1, plotToFile = FALSE) @ \subsubsection{Unrealistic models of changes in translation efficiency}\label{slopeSection} The slopes that are fitted in the anota2seq APV models can take unrealistic values that will influence the analysis of changes in translation efficiency leading to altered protein levels or buffering. anota2seq therefore tests whether slopes for analysis of changes in translational efficiency affecting protein levels that are >1 differ from 1 and slopes for analysis of changes in translational efficiency leading to buffering that are <-1 differ from -1. Furthermore, as slopes < 0 for analysis of changes in translational efficiency affecting protein levels or > 0 for analysis of changes in translational efficiency leading to buffering indicate unlikely but not impossible translational control these events are also assessed. Results of these tests (p-values) are found in the output of \Rfunction{anota2seqPerformQC}, \Rfunction{anota2seqAnalyze} and \Rfunction{anota2seqRun} functions. These p-values can be used to filter or flag features with unrealistic slopes or slopes revealing unlikely translational control. Such filtering is also applied by default in \Rfunction{anota2seqSelSigGenes}. \subsubsection{Feature selection and visualization of single gene regressions}\label{genesSection} The output from \Rfunction{anota2seqAnalyze} can be filtered using the \Rfunction{anota2seqSelSigGenes}. Features can be selected based on several criteria: \begin{itemize} \item include only realistic slopes (see section \ref{slopeSection}), using parameters \Rcode{minSlopeTranslation}, \Rcode{maxSlopeTranslation}\footnote{and similarly \Rcode{minSlopeBuffering} and \Rcode{maxSlopeBuffering}} and \Rcode{slopeP} \item include a minimum effect threshold, using parameter \Rcode{minEff} \item include only significant features according to a defined p-value or adjusted p-value threshold (parameter \Rcode{maxP} and \Rcode{maxPAdj}) \end{itemize} An example of code to perform this filtering is as follows: <>= ads <- anota2seqSelSigGenes(Anota2seqDataSet = ads, selContrast = 1, minSlopeTranslation = -1, maxSlopeTranslation = 2, minSlopeBuffering = -2, maxSlopeBuffering = 1, maxPAdj = 0.05) @ Once the \Robject{Anota2seqDataSet} object has been filtered, single gene regressions can be visualized using the \Rfunction{anota2seqPlotGenes} function (Fig. \ref{fig:singleGeneRegressions_translation} and \ref{fig:singleGeneRegressions_buffering}). The graphical output includes both the graphical interpretation of the APV analysis and the key statistics from both the standard and the RVM based analysis. <>= anota2seqPlotGenes(ads, selContrast = 1, analysis = "translation", plotToFile = FALSE) @ <>= anota2seqPlotGenes(ads, selContrast = 1, analysis = "buffering", plotToFile = FALSE) @ \subsubsection{Note about analysis of translational buffering} The APV model fitted in anota2seq for analysis of changes in translational efficiency leading to altered protein levels uses a model with translated mRNA as dependent variable and total mRNA and the sample class variable as independent variables. In other words, a common slope for all sample categories is considered and the change in translational efficiency is defined as the difference in intercepts \cite{larsson2010}. This regression model is visualized in Fig. \ref{fig:singleGeneRegressions_translation}.\\ Translational buffering is defined as changes in total mRNA level that are not paralleled by changes in levels of translated mRNA. As such, performing analysis of buffering considers total mRNA as dependent variable and translated mRNA as independent variable (together with the sample class; as illustrated in Fig. \ref{fig:singleGeneRegressions_buffering}). \clearpage \subsection{Categorizing genes into gene expression modes}\label{regModesSection} Polysome or ribosome profiling allows the user to distinguish between three gene expression modes: changes in mRNA abundance (i.e. similar changes in total mRNA levels and levels of translated mRNA) and translational efficiency leading to altered protein levels or buffering \cite{Oertlin}. For that, the \Rfunction{anota2seqAnalyze} and \Rfunction{anota2seqSelSigGenes} functions have to be run with parameter \Rcode{analysis} set to "translation" and "buffering" as shown above but analysis of differential expression of total mRNA and translated mRNA is also required. For that, the same functions can be used with \Rcode{analysis} parameter set to "total mRNA" and "translated mRNA" as shown below (all 4 modes can also be indicated at the same time in the "analysis" parameter; the code below assumes that "translation" and "buffering" has already been analyzed as indicated above): <>= ads <- anota2seqAnalyze(Anota2seqDataSet = ads, analysis = c("total mRNA", "translated mRNA"), fileStem = "figure/") ads <- anota2seqSelSigGenes(Anota2seqDataSet = ads, analysis = c("total mRNA", "translated mRNA"), selContrast = 1, minSlopeTranslation = -1, maxSlopeTranslation = 2, minSlopeBuffering = -2, maxSlopeBuffering = 1, maxPAdj = 0.05) @ <>= ads <- anota2seqAnalyze(Anota2seqDataSet = ads, analysis = c("total mRNA", "translated mRNA")) ads <- anota2seqSelSigGenes(Anota2seqDataSet = ads, analysis = c("total mRNA", "translated mRNA"), selContrast = 1, minSlopeTranslation = -1, maxSlopeTranslation = 2, minSlopeBuffering = -2, maxSlopeBuffering = 1, maxPAdj = 0.05) @ Once all four analyzes (translation, buffering, translated mRNA and total mRNA) have been performed, all regulated features can be categorized into gene expression modes using the \Rfunction{anota2seqRegModes} function. <>= ads <- anota2seqRegModes(ads) @ Notably, there is a hierarchy such that mRNAs identified as changing their translational efficiency leading to altered protein levels will belong to the translation group and no other group (thus these mRNAs may also be regulated by changes in abundance); mRNAs that change their levels in the translated pool and total mRNA pool but are not identified as changing their translational efficiency leading to altered protein levels will be in the abundance group; and mRNAs that are identified as changing their translational efficiency leading to buffering and are not in the former two groups are allocated to the set of buffered mRNAs. Specifically, the \Rfunction{anota2seqRegModes} function adds a column named "singleRegModes" indicating the classification into gene expression modes in the data.frame containing gene by gene statistics output. This output can be accessed by using \Rfunction{anota2seqGetOutput} with \Rcode{output} parameter set to "regModes". << regModesOutput, echo=TRUE, eval = TRUE >>= head(anota2seqGetOutput(object = ads, output="regModes", selContrast = 1, analysis="buffering", getRVM = TRUE))[, c("apvSlope", "apvEff", "apvRvmP", "apvRvmPAdj", "singleRegMode")] @ \subsubsection{Visualizing the different gene expression modes}\label{fcSection} anota2seq includes the \Rfunction{anota2seqPlotFC} function which plots the translated mRNA log2 fold-change vs. the total mRNA log2 fold-change and colors genes according to their mode of regulation (Fig. \ref{fig:fcPlot}). <>= anota2seqPlotFC(ads, selContrast = 1, plotToFile = FALSE) @ A table (data frame) corresponding to the anota2seqPlotFC plot (when visualize is set to "all") can be obtained by: <>= anota2seqGetOutput(ads,output="singleDf",selContrast=1) @ \subsection{Complete analysis using the one-step procedure function \Rfunction{anota2seqRun}}\label{runSection} In addition to application of each of the functions within anota2seq which provides the maximum flexibility, the anota2seq package provides the option to perform a one-step analysis of translated mRNA, total mRNA and changes in translational efficiency leading to altered protein levels or buffering. This analysis performs quality control followed by analysis of changes in translational efficiency affecting protein levels or buffering. A filtering is also performed (as in \Rfunction{anota2seqSelSigGenes}) as well as categorization into gene expression modes. <>= ads <- anota2seqRun( Anota2seqDataSet = ads, thresholds = list( maxPAdj = 0.05, minEff = 1.5), performQC = TRUE, performROT = TRUE, useRVM = TRUE) @ The output of the \Rfunction{anota2seqRun} function can be supplied to the \Rfunction{anota2seqPlotPvalues}, \Rfunction{anota2seqPlotGenes} and \Rfunction{anota2seqPlotFC} functions for similar visualization of the results as in Fig. \ref{fig:pvalDensity}, \ref{fig:singleGeneRegressions_translation}, \ref{fig:singleGeneRegressions_buffering} and \ref{fig:fcPlot}.\\ \section{Extending anota2seq to analysis of other data sources} In principle, any data source where the intention is to identify changes in a subset that is independent of a background can be analyzed (e.g. RIP-SEQ data). \section{New features in anota2seq compared to \Biocpkg{anota}} The core models in anota2seq are similar to those in the \Bioconductor{} package \Biocpkg{anota}. However, there are many differences including: \begin{itemize} \item \Biocpkg{anota} was designed to analyze data from DNA-microarrays platforms. anota2seq allows analysis of data from both DNA-microarrays and RNA sequencing (section \ref{inputSection}) \item Implementation of analysis of translational buffering (section \ref{analysis}) \item anota2seq allows for batch adjustment (parameter \Rcode{batchVec} of \Rfunction{anota2seqDataSetFromMatrix} and \Rfunction{anota2seqDataSetFromSE}) \item anota2seq provides additional functions in order to easily and consistently visualize the results of analyzes: \Rfunction{anota2seqPlotPvalues} (section \ref{densitySection}) and \Rfunction{anota2seqPlotFC} (section \ref{fcSection}) \item anota2seq provides a wrapper function which performs all steps of the workflow (section \ref{runSection}) \item anota2seq provides a classification of mRNAs into different gene expression modes: changes in mRNA abundance, or translational efficiency leading to altered protein levels or buffering (section \ref{regModesSection}) \end{itemize} \newpage \addcontentsline{toc}{section}{References} \bibliography{references} \end{document}