%\VignetteIndexEntry{INSPEcT} %\VignettePackage{INSPEcT} %\VignetteDepends{INSPEcT} %\VignetteEngine{knitr::knitr} \documentclass[11pt]{article} \usepackage[utf8]{inputenc} <>= BiocStyle::latex() @ \newcommand{\Rmethod}[1]{{\Rfunction{#1}}} \title{INSPEcT - INference of Synthesis, Processing and dEgradation rates from Transcriptomic data} \author{de Pretis S. - Furlan M. - Pelizzola M.} \begin{document} \maketitle \tableofcontents \section{Introduction} \Rpackage{INSPEcT} is an R/Bioconductor compliant solution for the study of RNA transcriptional dynamics from RNA-seq data. It is based on a system two of ordinary differential equations (ODEs) that describes the synthesis ($k_1$) and processing ($k_2$) of premature RNA ($P$) and the degradation ($k_3$) of mature RNA ($M=T-P$): \begin{equation}\label{eq:modelsystem} \left\{ \begin{array}{l l} \dot{P}=k_1 - k_2 \, \cdot \, P \\ \dot{T}=k_1 - k_3 \, \cdot \, (T - P) \end{array} \right. \end{equation} The total RNA ($T$) is measured from the exonic quantification of transcripts in RNA-seq experiment, while premature RNA ($P$) is measured by the intronic quantification. The model is based on two commonly accepted assumptions: premature RNAs are not degraded, but only converted to mature RNA, and nuclear export occurs at a rate considerably faster than other rates and can be disregarded.\\ The package supports the analysis of several experimental designs, such as steady-state conditions or time-course data, and the presence or absence of the nascent RNA library. According to the experimental design, the software returns different outputs. In particular, when the nascent RNA is present, the rates of synthesis, processing and degradation can be calculated both in time-course and in steady-state state conditions and tested for differential regulation. Without the information from the nascent RNA, the rates of synthesis, processing and degradation, as well as their significant variation, can be calculated only in time-course, while only a variation in the ratio between processing and degradation (post-transcriptional ratio) can assessed between steady states. A schema of the possible configurations and related outputs is reported in Table \ref{table:INSPEcTanalysisDesigns}. \begin{table}\label{table:INSPEcTanalysisDesigns} \caption{Table reporting the experimental designs suitable for INSPEcT} \begin{tabular}{cl|c|c|c|} \cline{3-5} & & \multicolumn{3}{c|}{\textbf{Total and nascent RNA}} \\ \cline{3-5} & & \textit{Synthesis} & \textit{Processing} & \textit{Degradation} \\ \cline{1-1} \cline{3-5} \multicolumn{1}{|c|}{\textbf{Steady state}} & & Yes & Yes & Yes \\ \cline{1-1} \cline{3-5} \multicolumn{1}{|c|}{\textbf{Time-course}} & & Yes & Yes & Yes \\ \cline{1-1} \cline{3-5} \\ \cline{3-5} & & \multicolumn{3}{c|}{\textbf{Total RNA}} \\ \cline{3-5} & & \textit{Synthesis} & \textit{Processing} & \textit{Degradation} \\ \cline{1-1} \cline{3-5} \multicolumn{1}{|c|}{\textbf{Steady state}} & & No & \multicolumn{2}{c|}{Ratio $\frac{Processing}{Degradation}$} \\ \cline{1-1} \cline{3-5} \multicolumn{1}{|c|}{\textbf{Time-course}} & & Yes & Yes & Yes \\ \cline{1-1} \cline{3-5} \end{tabular} \end{table} \section{Quantification of Exon and Intron features} The \Rpackage{INSPEcT} analysis starts with the quantification of exonic and intronic quantifications and the estimation of their variance from the replicates in the different conditions of the analysis. \Rpackage{INSPEcT} can estimate transcripts abundances and associated variances starting from different entry points according to the source of data available: SAM (or BAM) files, read counts or transcripts abundances. The user can choose to estimate the variance by means of two different softwares: \Rpackage{DESeq2} or \Rpackage{plgem}. By default, \Rpackage{DESeq2} is used when starting from BAM or read counts, while only \Rpackage{plgem} can be used when starting from transcripts abundances. <>= library(INSPEcT) @ \subsection{From BAM or SAM files} In case the user has access to the raw aligned data, \Rpackage{INSPEcT} can quantify transcripts abundaces and variances directly from SAM or BAM files. Transcrips annotation should be provided with an annotation object of class \Rcode{TxDb}. \Rpackage{INSPEcT} retrieves the genomic coordinates of the exons and defines introns as inter-exonic regions. By default, \Rpackage{INSPEcT} collapses the exons of the transcripts that belong to the same gene (argument \Rcode{by} set equal to \Rcode{'gene'}), but can work also at the transcript level (argument \Rcode{by} set equal to \Rcode{'tx'}). When working at the transcript level, \Rpackage{INSPEcT} cannot discriminate betweeen transcript, therefore we suggest to assign reads to all the overlapping features by setting the argument \Rcode{allowMultiOverlap} to \Rcode{TRUE}. \Rpackage{INSPEcT} manage strandness of the reads with the argument \Rcode{strandSpecific}, which can be set to \Rcode{0}, in case of non-stranded reads, \Rcode{1} for stranded and \Rcode{2} for reverse-stranded experiments. \Rpackage{INSPEcT} prioritizes the exon annotation, meaning that only the reads that do not overlap with any of the annotated exon are eventually accounted for introns. Here we report an example where the data from 4 sample BAM files (2 time points, 2 replicates each, as specified with the argument \Rcode{experimentalDesign}) is quantified in a non-stranded specific way, at the gene level, leaving un-assigned the reads mapping to more than one feature. <>= require(TxDb.Mmusculus.UCSC.mm9.knownGene) txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene bamfiles_paths <- system.file('extdata/', c('bamRep1.bam','bamRep2.bam','bamRep3.bam','bamRep4.bam'), package='INSPEcT') exprFromBAM <- quantifyExpressionsFromBAMs(txdb=txdb , BAMfiles=bamfiles_paths , by = 'gene' , allowMultiOverlap = FALSE , strandSpecific = 0 , experimentalDesign=c(0,0,1,1)) @ The function returns a list containing exonic/intronic expressions and variances, as well as exonic/intronic feature widths extracted from the annotation, exonic/intronic counts and counts statitics. <>= names(exprFromBAM) exprFromBAM$countsStats @ \subsection{From read counts} In case the matices of intronic and exonic reads have been already calulated, \Rpackage{INSPEcT} calulates the mean expression and the associated variance using, by default, the software \Rpackage{DESeq2}. The package \Rpackage{INSPEcT} contains the read counts associated to the intronic and exonic features of 500 genes profiled both for the nascent and total RNA in 11 time-points and replicated 3 times each following the induction of Myc in 3T9 cells [ref to the paper]. As a complementary information, the width of the intronic and exonic features for the same set of genes, as well as the sequencing depth of all the samples is provided within the package. The nascent and total quantification evaluated in this section of the vignette will be used later to generate the synthetic dataset for the benchmarking of the method. <>= data('allcounts', package='INSPEcT') data('featureWidths', package='INSPEcT') data('libsizes', package='INSPEcT') nascentCounts<-allcounts$nascent matureCounts<-allcounts$mature expDes<-rep(c(0,1/6,1/3,1/2,1,1.5,2,4,8,12,16),3) nasExp_DESeq2<-quantifyExpressionsFromTrCounts( allcounts=nascentCounts ,libsize=nascentLS ,exonsWidths=exWdths ,intronsWidths=intWdths ,experimentalDesign=expDes) matExp_DESeq2<-quantifyExpressionsFromTrCounts( allcounts=matureCounts ,libsize=totalLS ,exonsWidths=exWdths ,intronsWidths=intWdths ,experimentalDesign=expDes) @ \subsection{From transcripts abundances} In order to exemplify the quantification of mean transcripts abundances and variances starting from their replicated quantification, we transform the read counts loaded from the package in the section above into RPKMs using the width of intron and exon and the library sizes <>= trAbundancesFromCounts <- function(counts, widths, libsize) t(t(counts/widths)/libsize*10^9) nascentTrAbundance <- list( exonsAbundances=trAbundancesFromCounts( nascentCounts$exonsCounts, exWdths, nascentLS), intronsAbundances=trAbundancesFromCounts( nascentCounts$intronsCounts, intWdths, nascentLS)) @ Once we have obtained transcripts abundances for introns and exons in the different conditions and replicates of the experimental design, we can calulate the mean expression and variance (using \Rpackage{plgem}) by: <>= nasExp_plgem<-quantifyExpressionsFromTrAbundance( trAbundaces = nascentTrAbundance , experimentalDesign = expDes) @ \section{Analysis of RNA dynamics in time-course experiments} \subsection{Analysis of Total and Nascent RNA} In case of the joint analysis of total and nascent RNA data, for each transcript with at least one intron, the exonic and intronic quantifications are available in the Total and in the Nascent RNA libraries. The system of equations \eqref{eq:modelsystem} describes the exonic and intronic quantifications of the Total library, but when integrated between zero and the time of labeling used to generate the nascent RNA transcripts ($t_L$) it describes the exonic and intronic quantifications of the Nascent library. For matter of simplicity and to avoid unrealistic high rates of synthesis, processing and degradation, \Rpackage{INSPEcT} assumes by default that the nascent transcripts are not degradaed during the labeling time. The set of equations to describe Total and Nascent RNA is: \begin{equation}\label{eq:2} \left\{ \begin{array}{l l} \dot{P}_{R}=k_1 - k_2 \, \cdot \, P_{R} \\ \dot{T}_{R}=k_1 - k_3 \, \cdot \, (T_{R} - P_{R}) \\ s_F \, \cdot \, P_{L}=\frac{k_1}{k_2} - ( 1 - e^{k_2 \, \cdot \, t_L} ) \\ s_F \, \cdot \, T_{L}=k_1 \, \cdot \, t_L \end{array} \right. \end{equation} where $P_{R}$ and $T_{R}$ are the premature and total RNA levels, respectively, estimated from the total RNA library, $P_{L}$ and $T_{L}$ are premature and total RNA levels, respectively, estimated from the nascent RNA library. This system describes a single condition or time-point. In case of steady-state analysis, $\dot{P}_{R}$ and $\dot{T}_{R}$ are equal to zero, while in case of the time-course analysis they are estimated from the interpolation of $P_R(t)$ and $T_R(t)$ during the entire time-course via cubic splines. \Rpackage{INSPEcT} exploit the fact that the system is overdetermined (three unknowns - $k_1$, $k_2$ adn $k_3$, and four equations) to calculate a scaling factor between the Total and Nascent RNA quantifications, $s_F$, that multiplies $P_{L}$ and $T_{L}$. To avoid overfitting, a single scaling-factor representative of the entire population of transcripts is used, which represents the normalization factor between the Nascent and Total libraries (ref to INSPEcT paper). \\ \subsubsection{Estimation of rates from intronic and exonic quantifications} Once obtained the intronic and exonic quantifications from the Total and Nascent library, the rates of synthesis, processing and degradation are calculated following the procedure described in the section above during the initialization of the \Rpackage{INSPEcT} object with the function \Rmethod{newINSPEcT}: <>= tpts<-c(0,1/6,1/3,1/2,1,1.5,2,4,8,12,16) tL<-1/6 nascentInspObj<-newINSPEcT(tpts=tpts ,labeling_time=tL ,nascentExpressions=nasExp_DESeq2 ,matureExpressions=matExp_DESeq2) @ Once created the \Rpackage{INSPEcT} object, the method \Rmethod{ratesFirstGuess} returns the expressions of premature and total RNA for the analyzed genes, as well as the rates of synthesis, processing and degradation. Additionally, the method \Rmethod{ratesFirstGuessVar} returns the variances of premature and total RNA, and of the synthesis rates. <>= round(ratesFirstGuess(nascentInspObj,'total')[1:5,1:3],3) round(ratesFirstGuessVar(nascentInspObj,'total')[1:5,1:3],3) round(ratesFirstGuess(nascentInspObj,'synthesis')[1:5,1:3],3) @ The \Rpackage{INSPEcT} object can be subsetted to focus on a specific set of genes. For the sake of speeding up the downstream analysis, we will focus on the first 10 genes of the INSPEcT object. The complete pattern of total and pre-mRNA concentrations variation together with the synthesis, degradation and processing rates can be visualized using the \Rmethod{inHeatmap} method (Figure \ref{fig:inHeatmapNascentTimeCourse}). \begin{center} <>= nascentInspObj10<-nascentInspObj[1:10] inHeatmap(nascentInspObj10, clustering=FALSE) @ \end{center} In case of a long $t_L$ (more than 30 minutes), it can be useful to set the argument \Rcode{degDuringPulse=TRUE} to estimate the rates of the RNA life-cycle without assuming that nascent transcripts are not degradaed during the labeling time. The longer the labelling time is the weaker this assumption gets, however, taking into account nascent RNA degradation involves the solution of a more complicated system of equations and can originate unrealistic high rates of synthesis, processing and degradation. <>= nascentInspObjDDP<-newINSPEcT(tpts=tpts ,labeling_time=tL ,nascentExpressions=nasExp_DESeq2 ,matureExpressions=matExp_DESeq2 ,degDuringPulse=TRUE) @ For short labeling times, as 10 minutes can be considered, the absence of degradation during the pulse is a good assumption and similar rates are estimated in both cases. In Figure \ref{fig:firstGuessCompareDDP} we compared the degradation rates estimated with or without the assumption that degradation of mature RNA occurs during the labeling time. \begin{center} <>= k3 <- ratesFirstGuess(nascentInspObj, 'degradation') k3ddp <- ratesFirstGuess(nascentInspObjDDP, 'degradation') plot(rowMeans(k3), rowMeans(k3ddp), log='xy', xlim=c(.2,10), ylim=c(.2,10), xlab='no degradation during pulse', ylab='degradation during pulse', main='first guess degradation rates') abline(0,1,col='red') @ \end{center} \subsubsection{Selection of the regulative scenario} The rates evaluated so far are highly exposed to the experimental noise, in particular processing and degradation rates which sum up the noise coming from different sources of experimental data (such as the signal from the Total and the nascent libraries). For this reason, it could be challenging to distinguish real variations of a rate from noise. Once priors have been computed, the method \Rmethod{modelRates} add a second step of modeling to identify the rates that are significantly regulated over the time course. This method tests wether the variation of a rate over time (synthesis, processing, degradation, or a combination of them) is required to obtain a good fit of the experimental data. To this purpose, 8 different models are tested for each gene, representing all the combinations of constant or variable synthesis, processing and degradation rates, eventually choosing the best trade off between goodness of fit and simplicity of the model. During this step, constant functions, sigmoid functions or impulse models are used to represent the rates over time. This evaluation can be computationally intense and some parameters are initialized randomly, therefore the argument \Rcode{seed} guarantees the reproducibility of the results. <>= nascentInspObj10<-modelRates(nascentInspObj10, seed=1) @ The rates that are computed through this second modeling procedure can be accessed through the method \Rmethod{viewModelRates} and visualized with the \Rmethod{inHeatmap} method, setting the argument \Rcode{type='model'} (Figure \ref{fig:viewModelRatesAndInHeatmapNascentTimeCourse}). \begin{center} <>= round(viewModelRates(nascentInspObj10, 'synthesis')[1:5,1:3],3) inHeatmap(nascentInspObj10, type='model', clustering=FALSE) @ \end{center} The metohd \Rmethod{geneClass} returns the transcriptional regulatory mechanism assigned to each modeled gene. In particular, each gene is assigned to a class named after the set of varying rates: "0" denotes a gene whose rates are constant over time, "a" denotes a gene whose synthesis changes over time, "b" denotes a gene whose degradation changes over time, "c" denotes a gene whose processing changes over time.\\ <>= geneClass(nascentInspObj10) @ Additionally, the metohd \Rmethod{plotGene} can be used to fully investigate the profiles of RNA concentrations and rates of a single gene. Estimated synthesis, degradation and processing rates, premature RNA and total RNA concentrations are displayed with solid thin lines, while their standard deviations are in dashed lines and the modelled rates and concentrations are in thick solid lines. This example shows a gene of class "ab", indicating that its expression levels are controlled by synthesis and degradation rates (Figure \ref{fig:plotGeneNascentTimeCourse}). \begin{center} <>= plotGene(nascentInspObj10, ix="101206") @ \end{center} The quality of each model is evaluated through log likelihood and chi-square, to take into account the different complexities of the regulatory scenarios we compute also the chi-square p value; it can visualized as a histogram (Figure \ref{fig:chi2NascentTimeCourse}) and eventually used to discard badly fitted genes. \begin{center} <>= chisq<-chisqmodel(nascentInspObj10) hist(log10(chisq), main='', xlab='log10 chi-squared p-value') discard<-which(chisq>1e-4) featureNames(nascentInspObj10)[discard] nascentInspObj_reduced<-nascentInspObj10[-discard] @ \end{center} The model selection is a fundamental and subtle stage of the analysis, it influences all the results shown above and should be carefully triggered according to the specific dataset under analysis. See the \textbf{Parameters setting} and \textbf{Evaluation of time-course model on simulated data} sections for further details. \subsection{Analysis of Total RNA without Nascent} The absence of the nascent RNA component makes the identification of a unique solution of the system more difficult. Nonetheless, the joint analysis of premature and mature RNA time-course RNA-seq analysis is an important source of information in regards of RNA dynamics. The main idea behind this relies in the fact that genes that are modulated transcriptionally (i.e. by the synthesis rate) respond in their RNA levels following a precise pattern, that is premature RNA and mature RNA are modulated to the same extent (same fold change compared to unperturbed condition) but with different timing. In fact, the mature RNA follow the modulation of the premature RNA by a delay that is indicative to the mature RNA stability (i.e. degradation rate). Deviations from this behavior are signs of post-transcriptional regulation. We developed a computational approach that explots these concepts and is able to quantify RNA dynamics based on time-course profiling of total RNA-seq data with good approximation. Without nascent RNA, the sole information of total ($T$, exonic quantification) and premature ($P$, intronic quantification) from the Total library is available and used to model the RNA dynamics. In particular, we model premature RNA ($P$) and mature RNA ($M=T-P$), using the equation: \begin{equation}\label{eq:modelsystemMature} \begin{array}{l l} \dot{M}=k_2 \, P(t) - k_3 \, M(t) \end{array} \end{equation} We start solving the equations assuming constant processing and degradation rates over the time-course and interpolating $P$ with a linear piecewise in order to find the $k2$ and $k3$ that minimize the error over $M$. While $k1(t)$ is obtained as follows: \begin{equation}\label{eq:k1modelsystemMature} \begin{array}{l l} k_1(t) = \dot{P(t)} + k_2 \, P(t) \\ \end{array} \end{equation} After that, $k_1(t)$ is used in combination with $P(t)$ and $T(t)$ to obatain $k_2(t)$ and $k_3(t)$ using the same procedure implemented for the joint analysis of nascent and total RNA. \subsubsection{Estimation of rates from intronic and exonic quantifications} The procedure described above is implemented in the method \Rmethod{newINSPEcT}, which creates the \Rclass{INSPEcT} object for the analysis and gives a first estimation of the synthesis, processing and degradation rates along the time-course, for those genes with enough signal for both introns and exons. In comparison to the mature and nascent configuration, from the practical point of view, nothing changes except for the absence of $t_L$ and new synthesized expression data. <>= tpts<-c(0,1/6,1/3,1/2,1,1.5,2,4,8,12,16) matureInspObj<-newINSPEcT(tpts=tpts ,labeling_time=NULL ,nascentExpressions=NULL ,matureExpressions=matExp_DESeq2) @ All the methods that apply to the \Rcode{INSPEcT} object created with the nascent RNA apply also to the object created with the sole total RNA. For example, we can compare the time-averaged rates of synthesis estimated in this first step of modeling with or without the nascent RNA, by extracting themwith the \Rmethod{ratesFirstGuess} (Figure \ref{fig:firstGuessCompare}). \begin{center} <>= cg <- intersect(featureNames(nascentInspObj), featureNames(matureInspObj)) k1Nascent <- ratesFirstGuess(nascentInspObj[cg,], 'synthesis') k1Mature <- ratesFirstGuess(matureInspObj[cg,], 'synthesis') smoothScatter(log10(rowMeans(k1Nascent)), log10(rowMeans(k1Mature)), main='synthesis rates', xlab='nascent + mature RNA', ylab='mature RNA') abline(0,1,col='red') @ \end{center} \subsubsection{Selection of the regulative scenario} Similarly to the procedure with the nascent RNA, inspect provide an additional step of modeling, where the software assigns to the rates of the RNA cycle a precise functional form to describe their behavior over time (see the corresponding section in "Analysis of Total and Nascent RNA"). In this case, RNA dynamics can be modeled either an impulse functions or constant functions. During this procedure 8 independent models for each gene are tested (all the combinations of contant/impulsive functions for each of the three rates), and the best model is chosen based on the trade-off between the goodness of fit and the simplicity. This modeling procedure refines the estimation of the rates and gives a statistical confidence about the variation of each rate during the time-course. <>= matureInspObj10 <- matureInspObj[1:10, ] matureInspObj10 <- modelRates(matureInspObj10, seed=1) @ In figure \ref{fig:plotGeneMatureTimeCourse}, the same gene represented in figure \ref{fig:plotGeneNascentTimeCourse} is plotted. Also in this case, estimated synthesis, degradation and processing rates, premature RNA and total RNA concentrations are displayed with solid thin lines, while their standard deviations are in dashed lines and the modelled rates and concentrations are in thick solid lines. \begin{center} <>= plotGene(matureInspObj10, ix="101206") @ \end{center} \subsection{Performance evaluation with simulated data} \Rpackage{INSPEcT} provides functionalities to build a synthetic dataset for which the transcriptional scenario is known for each gene. Simulated data can be used to evaluate the performance of \Rpackage{INSPEcT} in classifying each rate as constant or variable over time, to estimate the number of time points and replicates necessary to achieve a given performance and to configure the model selection parameters.\\ The first step consists in sampling from an \Rpackage{INSPEcT} object: absolute values of the rates, fold changes, correlations between these two elements and associated variances. Once these parameters distributions have been estimated, they are used to simulate a given number of genes (\textit{nGenes = ...}). Optionally, the user can provide the probabilities for a rate to be modelled as a constant, sigmoid or impulse function; by default they are respectively 0.5, 0.2 and 0.3.\\ <>= simRates<-makeSimModel(nascentInspObj, 1000, seed=1) @ \Rmethod{makeSimModel} produces an object of class \Rclass{INSPEcT\_model}, to be analysed it must be transformed in an \Rclass{INSPEcT} object using the \Rmethod{makeSimDataset} method. It takes as arguments the number of replicates required and the time points at which the data should be virtually collected. Regarding the latter point, the new time-course must be designed as a sampling of the original experimental time window (same initial and final conditions). The object created by this method can be modelled via \Rmethod{modelRates} as any other object of class \Rclass{INSPEcT}, and in this case the results could be compared with the ground truth and the perfromace of the procedure with given number of replicates and time points can be estimated. In order to avoid long computation time, the next two chunks of code have been previously evaluated and will be not computed within this vignette. <>= newTpts <- c(0.00, 0.13, 0.35, 0.69, 1.26, 2.16, 3.63, 5.99, 9.82, 16.00) nascentSim2replicates<-makeSimDataset(object=simRates ,tpts=newTpts ,nRep=2 ,NoNascent=FALSE ,seed=1) nascentSim2replicates<-modelRates(nascentSim2replicates[1:100] ,seed=1) newTpts <- c(0.00, 0.10, 0.26, 0.49, 0.82, 1.32, 2.06, 3.16, 4.78, 7.18, 10.73, 16.00) nascentSim3replicates<-makeSimDataset(object=simRates ,tpts=newTpts ,nRep=3 ,NoNascent=FALSE ,seed=1) nascentSim3replicates<-modelRates(nascentSim3replicates[1:100] ,seed=1) @ Starting from the very same \Robject{simRates}, it is also possible to generate \Rclass{INSPEcT} objects without information about the nascent RNA (argument \Rcode{NoNascent} set to TRUE). However, it is not allowed to produce artificial gene sets starting from \Rpackage{INSPEcT} object without nascent RNA because the experimental estimation of the synthesis rate is mandatory to guarantee the good quality of the simulated data. <>= newTpts <- c(0.00, 0.13, 0.35, 0.69, 1.26, 2.16, 3.63, 5.99, 9.82, 16.00) matureSim2replicates<-makeSimDataset(object=simRates ,tpts=newTpts ,nRep=2 ,NoNascent=TRUE ,seed=1) modelSelection(matureSim2replicates)$thresholds$chisquare <- 1 matureSim2replicates<-modelRates(matureSim2replicates[1:100] ,seed=1) newTpts <- c(0.00, 0.10, 0.26, 0.49, 0.82, 1.32, 2.06, 3.16, 4.78, 7.18, 10.73, 16.00) matureSim3replicates<-makeSimDataset(object=simRates ,tpts=newTpts ,nRep=3 ,NoNascent=TRUE ,seed=1) modelSelection(matureSim3replicates)$thresholds$chisquare <- 1 matureSim3replicates<-modelRates(matureSim3replicates[1:100] ,seed=1) @ Once the simulated data have been produced and analysed, it is possible to compare the results obtained by the method \Rmethod{modelRates} and the object \Robject{simRates}, which contains the ground truth of rates.\\ The method \Rmethod{rocCurve}, for example, measures the performace of the constant/variable classification individually for the rates of synthesis, processing and degradation, using the receiver operating characteristic (ROC) curve. (Figure \ref{fig:ROCs}). \begin{center} <>= data("nascentSim2replicates","nascentSim3replicates", "matureSim2replicates","matureSim3replicates",package='INSPEcT') par(mfrow=c(2,2)) rocCurve(simRates[1:100],nascentSim2replicates) title("2rep. 10t.p. Total and nascent RNA", line=3) rocCurve(simRates[1:100],nascentSim3replicates) title("3rep. 12t.p. Total and nascent RNA", line=3) rocCurve(simRates[1:100],matureSim2replicates) title("2rep. 10t.p. Total RNA", line=3) rocCurve(simRates[1:100],matureSim3replicates) title("3rep. 12t.p. Total RNA", line=3) @ \end{center} Further, the method \Rmethod{rocThresholds} can be used to assess the sensitivity and specificity that is achieved with different thresholds of the chi-squared and Brown tests (Figure \ref{fig:ROCsThresholds}). \begin{center} <>= rocThresholds(simRates[1:100],nascentSim2replicates, bTsh=c(.01,.01,.05),cTsh=.1) @ \end{center} \subsection{Parameters setting} If desired, different parameters can be set for both the modelling and the testing part. Regarding the modelling part, we might want to increase the number of different initializations that are performed for each gene (\textit{nInit} option) or increase the maximum number of steps in the rates optimization process (\textit{nIter} option). All these choices could improve the performance of the method, but also the needed computational time; the impact of these options on the quality of the modelling can be evaluated using simulated datasets. <>= nascentInspObj10<-removeModel(nascentInspObj10) modelingParams(nascentInspObj10)$nInit<-20 modelingParams(nascentInspObj10)$nIter<-1000 @ Instead, we might want to change the thresholds for chi-squared and log-likelihood ratio tests, or define the specific set of models to be compared with log-likelihood ratio test while assessing if a given rate is variable or not. The chi-squared test is used to discard bad models that will not enter in the model selection. By default, the threshold of acceptance is set to 0.2 and in this example we decide to be more stringent and set the threshold to 0.1. In case of low number of time points, few or no models can result in a low chi-square. In these case, it could be necessary to use larger values for the chi-squared test threshold, or be completely comprehensive and set the threshold to 1. The threshold relative to the Brown test is used to call differential regulation and can be set independently for each rate. Low threshold of the Brown test reflect in a more strict call for differential regulation of the rate. In this example, we are changing the thresholds of both the chi-squared test and the Brown method for combining p-values. We are also imposing that only the constant models ("0") and processing varying model ("c") are tested one agains each other used to to assess the variability of the processing rates using log-likelihood ratio test. Usually, all nested models that differ for the processing rate are used ( "0"vs"c", "c"vs"bc", "c"vs"ac" and "ab"vs"abc" ). <>= modelSelection(matureInspObj10)$thresholds$chisquare<-.1 modelSelection(matureInspObj10)$thresholds$brown<-c(synthesis=.01 ,processing=.01 ,degradation=.05) @ To have a sense of all parameters that can be set, type: <>= ## modelling modelingParams(matureInspObj10) ## model selection and testing framework modelSelection(matureInspObj10) @ \section{Analysis of RNA dynamics in steady state RNA-seq data} \subsection{Analysis of Total and Nascent RNA} \Rpackage{INSPEcT} is also designed to analyze RNA dynamics in steady state conditions and to compare the RNA dynamics of two steady state at time. In order to exemplify the procedure, we treat the data relative to the time-course analyzed above as 11 individual steady states. \Rpackage{INSPEcT} recognizes that the analysis is at steady state when a character vector of condtions is given as time-points. <>= conditions <- letters[1:11] nasSteadyObj <- newINSPEcT(tpts=conditions ,labeling_time=tL ,nascentExpressions=nasExp_DESeq2 ,matureExpressions=matExp_DESeq2) @ Once created the steady-state \Rpackage{INSPEcT} object, the first guess of the rates can be extracted with the method \Rmethod{ratesFirstGuess}, as usual. If we want to test for differential regulation in the RNA dynamics between two conditions, the method \Rmethod{compareSteady} can be used. The \Rpackage{INSPEcT} object created with 11 conditions can be subsetted in order to select the ones that we want to compare. In this case, we selected the first and the last conditions, tested for differential regulation and visualized the results relative to individual rates using the methods \Rmethod{synthesis}, \Rmethod{processing} and \Rmethod{degradation}. These methods return the rates computed by the modeling in the two conditions, the log transfromed geometric mean of the rate, the log2 fold change between the two conditions and the statistics relative to the test. In case a rate was not assessed as differentially regulated the same value is reported for the two conditions. Also in this case, the p-values thresholds of the chi-squared and Brown test can be modified by the \Rmethod{modelSelection}. <>= diffrates <- compareSteady(nasSteadyObj[,c(1,11)]) head(round(synthesis(diffrates),2)) head(round(processing(diffrates),2)) head(round(degradation(diffrates),2)) @ Following this, the method \Rmethod{plotMA} can be used to plot the results of the comparison for each indiviadual rate. In figure (Figure \ref{fig:steadyStateNascent3}), we show the results for the synthesis rate. All the points that do not lie on the line y = 0 corresponds to genes called differentially regulated by the method \Rmethod{compareSteady}. This method select differentially regulated genes based on the p-value of the Brown test previosly chosen. Additionally, it is possible to visualize genes that are differentially according to a certain threshold of the Benjamini and Hochberg correction of the p-values (argument \textbf{padj}). Those genes will be visualized as orange triangles. <>= plotMA(diffrates, rate='synthesis', padj=1e-3) @ \subsection{Analysis of Total without Nascent RNA} It is not possible to estimate the kinetic rates from steady state data without nascent RNA expression data because of the underdetermination of the system (Equations \ref{eq:modelsystem}). The only information that can be extracted is the ratio between the premature ($P$) and mature ($M$) RNA: \begin{equation}\label{eq:PTratio} \nonumber \frac{P}{M} = \frac{P_{T}}{T_{T} - P_{T}} = \frac{k_3}{k_2} \\ \end{equation} where $P_{R}$ and $T_{R}$ are the premature and total RNA levels, respectively, estimated from the total RNA library. However, even from this aggregated quantity, valuable information regarding the differential post transcriptional regulation of genes in different samples can still be obtained.\\ This analysis is performed by the metohd \Rmethod{compareSteadyNoNascent} which takes as input an object of class \Rpackage{INSPEcT} created witouh nascent RNA, an expression threshold for premature and total RNA, to select reliable data to be analysed (\textit{expressionThreshold}) and the distance, expressed in log2, from the expected behaviour beyond which a gene should be considered as post-transcriptionally regulated.\\ We noticed that the ratio $P$ over $M$ decreases along with the expression following a power law, which can be easily fitted in the log log space by a linear model. This power law is calculated in the first step of the \Rmethod{compareSteadyNoNascent} analysis. Following that, genes that deviates from this trend across conditions are identified as differentially post-transcriptionally regulated. <>= conditions <- letters[1:11] matureInspObj <- newINSPEcT(tpts=conditions ,matureExpressions=matExp_DESeq2) regGenes<-compareSteadyNoNascent(inspectIds=matureInspObj ,expressionThreshold=0.25 ,log2FCThreshold=2.) regGenes[1:6,1:5] table(regGenes, useNA='always') @ The output of the function is a matrix of booleans, as large as the expression input data, where TRUE means that a specific gene, in a given condition, is post-transcriptionally regulated in a peculiar way.\\ The entire analysis is based on the estimation of a standard behaviour of a set of genes which means that benefits from the size of the dataset under analysis. The example is, obviously, just a proff of concept to explicit arguments and output of \Rfunction{compareSteadyNoNascent}; for this reason, we have decided to use time-course data. A real application does not require data to be temporally related in any way. \section{Analysis of the delay introduced by the rates of processing to RNA responsiveness} During the transition between steady-states, RNA kinetic rates define the speed at which the mature form of a transcript can be set at a new level, i.e. the responsiveness. The rate of degradation is typically considered the major determinant of responsiveness: the lower is transcript stability, the higher is its responsiveness. Nonetheless, RNA processing rate influences RNA responiveness and specifically introduces a delay to it, which is commonly considered non-influent due to the speed at which the processing of the transcripts occur. To better elucidate the role of processing rate in this context, we devised two metrics, $\tau$ and $\Delta$, that measure the additional time required to double mature RNA levels following a doubling in synthesis rates, compared to a scenario where processing occurs instantaneously ($\tau$), and the number of RNA molecules involved in this delay ($\Delta$). These metrics can be calculated from the kinetic rates of individual genes, or can be approximated with accuracy from premature and mature RNA abundances. The responsiveness of genes with high values for both metrics is thus significantly burdened by the processing step.\\ The two metrics, as well as the genes delayed by the processing rates, can be easily calculated from the \Rpackage{INSPEcT} object. <>= tau <- calculateTau(nasSteadyObj) procDelay<- processingDelay(inspectIds=nasSteadyObj ,tauThreshold=1.2 ,deltaThreshold=1.0) table(procDelay, useNA='always') @ In case the object was created without the nascent RNA information, approximated metrics will be calculated. \begin{center} <>= tau_approx <- calculateTau(matureInspObj) cg <- intersect(featureNames(nasSteadyObj), featureNames(matureInspObj)) plot(tau[cg,], tau_approx[cg,], log='xy' , main=expression(paste(tau,' metric')) , xlab='with nascent RNA' , ylab='approximation without nascent RNA') @ \end{center} \section{About this document} <>= sessionInfo() @ \end{document}