%\VignetteIndexEntry{INSPEcT}
%\VignettePackage{INSPEcT}
%\VignetteDepends{INSPEcT}
%\VignetteEngine{knitr::knitr}
\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=

BiocStyle::latex()
@

\newcommand{\Rmethod}[1]{{\Rfunction{#1}}}

\title{INSPEcT - INference of Synthesis, Processing and dEgradation rates from Transcriptomic data}
\author{de Pretis S. - Furlan M. - Pelizzola M.}

\begin{document}

\maketitle

\tableofcontents

\section{Introduction}

  \Rpackage{INSPEcT} is an R/Bioconductor compliant solution for the study of RNA transcriptional dynamics
  from RNA-seq data. It is based on a system two of ordinary differential equations (ODEs) that describes the synthesis ($k_1$) 
  and processing ($k_2$) of premature RNA ($P$) and the degradation ($k_3$) of mature RNA ($M=T-P$): 

  \begin{equation}\label{eq:modelsystem}
    \left\{
      \begin{array}{l l}
    \dot{P}=k_1 - k_2 \, \cdot \, P \\
    \dot{T}=k_1 - k_3 \, \cdot \, (T - P)
      \end{array}
    \right.
  \end{equation}

  The total RNA ($T$) is measured from the exonic quantification of transcripts in RNA-seq experiment, while 
  premature RNA ($P$) is measured by the intronic quantification. The model is based on two commonly accepted assumptions: 
  premature RNAs are not degraded, but only converted to mature RNA, and nuclear export occurs at a rate considerably 
  faster than other rates and can be disregarded.\\
  The package supports the analysis of several experimental designs, such as steady-state conditions or time-course data, 
  and the presence or absence of the nascent RNA library. According to the experimental design, the software 
  returns different outputs. In particular, when the nascent RNA
  is present, the rates of synthesis, processing and degradation can be calculated both in time-course and in steady-state
  state conditions and tested for differential regulation. Without the information from the nascent RNA, 
  the rates of synthesis, processing and degradation, as well as their significant variation, can be calculated only in time-course, 
  while only a variation in the ratio between processing and degradation (post-transcriptional ratio) can assessed 
  between steady states. A schema of the possible configurations 
  and related outputs is reported in Table \ref{table:INSPEcTanalysisDesigns}.

  \begin{table}\label{table:INSPEcTanalysisDesigns}
  \caption{Table reporting the experimental designs suitable for INSPEcT}

  \begin{tabular}{cl|c|c|c|}
  \cline{3-5}
                                              &  & \multicolumn{3}{c|}{\textbf{Total and nascent RNA}}              \\ \cline{3-5}
                                              &  & \textit{Synthesis} & \textit{Processing} & \textit{Degradation}  \\ \cline{1-1} \cline{3-5}
  \multicolumn{1}{|c|}{\textbf{Steady state}} &  & Yes                & Yes                 & Yes                   \\ \cline{1-1} \cline{3-5}
  \multicolumn{1}{|c|}{\textbf{Time-course}}  &  & Yes                & Yes                 & Yes                   \\ \cline{1-1} \cline{3-5}
  \\
  \cline{3-5}
                                              &  & \multicolumn{3}{c|}{\textbf{Total RNA}}                         \\ \cline{3-5}
                                              &  & \textit{Synthesis} & \textit{Processing} & \textit{Degradation} \\ \cline{1-1} \cline{3-5}
  \multicolumn{1}{|c|}{\textbf{Steady state}} &  & No                 & \multicolumn{2}{c|}{Ratio $\frac{Processing}{Degradation}$}                 \\ \cline{1-1} \cline{3-5}
  \multicolumn{1}{|c|}{\textbf{Time-course}}  &  & Yes                & Yes                 & Yes                  \\ \cline{1-1} \cline{3-5}
  \end{tabular}

  \end{table}

  \section{Quantification of Exon and Intron features}

  The \Rpackage{INSPEcT} analysis starts with the quantification of exonic and intronic quantifications and 
  the estimation of their variance from the replicates in the different conditions of the
  analysis. \Rpackage{INSPEcT} can estimate transcripts abundances and associated variances starting from 
  different entry points according to the source of data available: SAM (or BAM) files, read counts 
  or transcripts abundances. The user can choose to estimate the variance by means of two different 
  softwares: \Rpackage{DESeq2} or \Rpackage{plgem}. By default, \Rpackage{DESeq2} is used when starting 
  from BAM or read counts, while only \Rpackage{plgem} can be used when starting from transcripts abundances.

  <<INSPEcT_loading,eval=TRUE,echo=TRUE,message=FALSE,warnings=FALSE>>=
  library(INSPEcT)
  @

  \subsection{From BAM or SAM files}

  In case the user has access to the raw aligned data, \Rpackage{INSPEcT} can quantify 
  transcripts abundaces and variances directly from SAM or BAM files. Transcrips annotation should 
  be provided with an annotation object of class \Rcode{TxDb}. \Rpackage{INSPEcT} retrieves the genomic coordinates of
  the exons and defines introns as inter-exonic regions. By default, \Rpackage{INSPEcT} collapses the exons of the
  transcripts that belong to the same gene (argument \Rcode{by} set equal to \Rcode{'gene'}), but can work also 
  at the transcript level (argument \Rcode{by} set equal to \Rcode{'tx'}). When working at the transcript level, \Rpackage{INSPEcT}
  cannot discriminate betweeen transcript, therefore we suggest to assign reads to all the overlapping features 
  by setting the argument \Rcode{allowMultiOverlap} to \Rcode{TRUE}. \Rpackage{INSPEcT} manage strandness of the reads with the 
  argument \Rcode{strandSpecific}, which can be set to \Rcode{0}, in case of non-stranded reads, \Rcode{1} for stranded
  and \Rcode{2} for reverse-stranded experiments. \Rpackage{INSPEcT} prioritizes the exon annotation, meaning that only the reads
  that do not overlap with any of the annotated exon are eventually accounted for introns. Here we report an 
  example where the data from 4 sample BAM files (2 time points, 2 replicates each, as specified with the 
  argument \Rcode{experimentalDesign}) is quantified in a non-stranded specific way, at the gene level, 
  leaving un-assigned the reads mapping to more than one feature.

  <<quantifyExpressionsFromBAMs,eval=TRUE,message=FALSE,warnings=FALSE>>=

  require(TxDb.Mmusculus.UCSC.mm9.knownGene)
  txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene

  bamfiles_paths <- system.file('extdata/',
    c('bamRep1.bam','bamRep2.bam','bamRep3.bam','bamRep4.bam'), 
    package='INSPEcT')
  
  exprFromBAM <- quantifyExpressionsFromBAMs(txdb=txdb
             , BAMfiles=bamfiles_paths
             , by = 'gene'
             , allowMultiOverlap = FALSE
             , strandSpecific = 0
             , experimentalDesign=c(0,0,1,1))
  
  @

  The function returns a list containing exonic/intronic expressions and variances, as well as exonic/intronic 
  feature widths extracted from the annotation, exonic/intronic counts and counts statitics.

  <<quantifyExpressionsFromBAMs2,eval=TRUE,message=TRUE,warnings=FALSE>>=

  names(exprFromBAM)
  exprFromBAM$countsStats

  @


  \subsection{From read counts}

  In case the matices of intronic and exonic reads have been already calulated, \Rpackage{INSPEcT} calulates
  the mean expression and the associated variance using, by default, the software \Rpackage{DESeq2}. The 
  package \Rpackage{INSPEcT} contains the read counts associated to the intronic and exonic features of 
  500 genes profiled both for the nascent and total RNA in 11 time-points and replicated 3 times each 
  following the induction of Myc in 3T9 cells [ref to the paper]. As a complementary information, the width
  of the intronic and exonic features for the same set of genes, as well as the sequencing depth of all the 
  samples is provided within the package. The nascent and total quantification evaluated in this section of
  the vignette will be used later to generate the synthetic dataset for the benchmarking of the method.

  <<quantifyExpressionsFromCountsNascentTimeCourse,eval=TRUE,message=FALSE,warnings=FALSE>>=

  data('allcounts', package='INSPEcT')
  data('featureWidths', package='INSPEcT')
  data('libsizes', package='INSPEcT')

  nascentCounts<-allcounts$nascent
  matureCounts<-allcounts$mature
  expDes<-rep(c(0,1/6,1/3,1/2,1,1.5,2,4,8,12,16),3)

  nasExp_DESeq2<-quantifyExpressionsFromTrCounts(
          allcounts=nascentCounts
          ,libsize=nascentLS
          ,exonsWidths=exWdths
          ,intronsWidths=intWdths
          ,experimentalDesign=expDes)

  matExp_DESeq2<-quantifyExpressionsFromTrCounts(
          allcounts=matureCounts
          ,libsize=totalLS
          ,exonsWidths=exWdths
          ,intronsWidths=intWdths
          ,experimentalDesign=expDes)

  @

  \subsection{From transcripts abundances}

  In order to exemplify the quantification of mean transcripts abundances and variances 
  starting from their replicated quantification, we transform the read counts loaded from the package 
  in the section above into RPKMs using the width of intron and exon and the library sizes 

  <<createExpressions,eval=TRUE,message=FALSE,warnings=FALSE>>=
      
  trAbundancesFromCounts <- function(counts, widths, libsize)
          t(t(counts/widths)/libsize*10^9)
  nascentTrAbundance <- list(
    exonsAbundances=trAbundancesFromCounts(
          nascentCounts$exonsCounts, exWdths, nascentLS),
    intronsAbundances=trAbundancesFromCounts(
          nascentCounts$intronsCounts, intWdths, nascentLS))
  @

  Once we have obtained transcripts abundances for introns and exons in the different conditions 
  and replicates of the experimental design, we can calulate the mean expression and variance 
  (using \Rpackage{plgem}) by:

  <<quantifyExpressionsFromTrAbundancesNascentTimeCourse,eval=TRUE, message=FALSE,warnings=FALSE>>=

  nasExp_plgem<-quantifyExpressionsFromTrAbundance(
          trAbundaces = nascentTrAbundance
          , experimentalDesign = expDes)

  @

  \section{Analysis of RNA dynamics in time-course experiments}

  \subsection{Analysis of Total and Nascent RNA}

  In case of the joint analysis of total and nascent RNA data, for each transcript with at least one intron, 
  the exonic and intronic quantifications are available in the Total and in the Nascent RNA libraries. The system 
  of equations \eqref{eq:modelsystem} describes the exonic and intronic quantifications of the Total library, 
  but when integrated between zero and the time of labeling used to generate the nascent RNA transcripts ($t_L$) 
  it describes the exonic and intronic quantifications of the Nascent library. 
  For matter of simplicity and to avoid unrealistic high rates of synthesis, processing and degradation, 
  \Rpackage{INSPEcT} assumes by default that the nascent transcripts are not degradaed during the labeling time.
  The set of equations to describe Total and Nascent RNA is:

  \begin{equation}\label{eq:2}
  \left\{
    \begin{array}{l l}
      \dot{P}_{R}=k_1 - k_2 \, \cdot \, P_{R} \\
      \dot{T}_{R}=k_1 - k_3 \, \cdot \, (T_{R} - P_{R}) \\
      s_F \, \cdot \, P_{L}=\frac{k_1}{k_2} - ( 1 - e^{k_2 \, \cdot \, t_L} )  \\
      s_F \, \cdot \, T_{L}=k_1 \, \cdot \, t_L
    \end{array}
  \right.
  \end{equation}

  where $P_{R}$ and $T_{R}$ are the premature and total RNA levels, respectively, estimated
  from the total RNA library, $P_{L}$ and $T_{L}$ are premature and total RNA levels, respectively, estimated from 
  the nascent RNA library. This system describes a single condition or time-point. In case of steady-state analysis, 
  $\dot{P}_{R}$ and $\dot{T}_{R}$ are equal to zero, while in case of the time-course analysis 
  they are estimated from the interpolation of $P_R(t)$ and $T_R(t)$ during the entire time-course via cubic splines. 
  \Rpackage{INSPEcT} exploit the fact that the system is overdetermined (three unknowns - $k_1$, $k_2$ adn $k_3$, 
  and four equations) to calculate a scaling factor between the Total and Nascent RNA quantifications, $s_F$, 
  that multiplies $P_{L}$ and $T_{L}$. To avoid overfitting, a single scaling-factor representative
  of the entire population of transcripts is used, which represents the normalization factor between 
  the Nascent and Total libraries (ref to INSPEcT paper).
  \\
  
  \subsubsection{Estimation of rates from intronic and exonic quantifications}
  Once obtained the intronic and exonic quantifications from the Total and Nascent library,
  the rates of synthesis, processing and degradation are calculated following the procedure described in the 
  section above during the initialization of the \Rpackage{INSPEcT} object with the function \Rmethod{newINSPEcT}:

  <<newINSPEcTNascentTimeCourse,eval=TRUE,message=FALSE,warnings=FALSE>>=

    tpts<-c(0,1/6,1/3,1/2,1,1.5,2,4,8,12,16)
    tL<-1/6
    nascentInspObj<-newINSPEcT(tpts=tpts
                              ,labeling_time=tL
                              ,nascentExpressions=nasExp_DESeq2
                              ,matureExpressions=matExp_DESeq2)
  @

  Once created the \Rpackage{INSPEcT} object, the method \Rmethod{ratesFirstGuess} returns the expressions of 
  premature and total RNA for the analyzed genes, as well as the rates of synthesis, processing and degradation. 
  Additionally, the method \Rmethod{ratesFirstGuessVar} returns the variances of premature and total RNA, 
  and of the synthesis rates. 

  <<newINSPEcTNascentTimeCourseShowResults,eval=TRUE,message=FALSE,warnings=FALSE>>=
    
    round(ratesFirstGuess(nascentInspObj,'total')[1:5,1:3],3)
    round(ratesFirstGuessVar(nascentInspObj,'total')[1:5,1:3],3)
    round(ratesFirstGuess(nascentInspObj,'synthesis')[1:5,1:3],3)

  @

  The \Rpackage{INSPEcT} object can be subsetted to focus on a specific set of genes. For the sake of speeding up 
  the downstream analysis, we will focus on the first 10 genes of the INSPEcT object. 
  The complete pattern of total and pre-mRNA concentrations variation together with the synthesis, degradation and 
  processing rates can be visualized using the \Rmethod{inHeatmap} method (Figure \ref{fig:inHeatmapNascentTimeCourse}).

  \begin{center}
  <<inHeatmapNascentTimeCourse,eval=TRUE,message=FALSE,warnings=FALSE, fig.width=6, fig.height=4, fig.cap="inHeatmap of the ratesFirstGuess. Heatmap representing the concentrations of total RNA, of premature RNA and the first guess of the rates of the RNA life cycle", fig.align="center">>=

  nascentInspObj10<-nascentInspObj[1:10]
  inHeatmap(nascentInspObj10, clustering=FALSE)
  @
  \end{center}

  In case of a long $t_L$ (more than 30 minutes), it can be useful to set the argument \Rcode{degDuringPulse=TRUE} 
  to estimate the rates of the RNA life-cycle without assuming that nascent transcripts are 
  not degradaed during the labeling time. The longer the labelling time is the weaker this assumption gets, 
  however, taking into account nascent RNA degradation involves the solution of a more complicated system of 
  equations and can originate unrealistic high rates of synthesis, processing and degradation.

  <<newINSPEcTNascentTimeCourseDDP,eval=TRUE,message=FALSE,warnings=FALSE>>=

    nascentInspObjDDP<-newINSPEcT(tpts=tpts
                              ,labeling_time=tL
                              ,nascentExpressions=nasExp_DESeq2
                              ,matureExpressions=matExp_DESeq2
                              ,degDuringPulse=TRUE)
  @

  For short labeling times, as 10 minutes can be considered, the absence of degradation during the pulse is a good 
  assumption and similar rates are estimated in both cases. In Figure \ref{fig:firstGuessCompareDDP} we compared the 
  degradation rates estimated with or without the assumption that degradation of mature RNA occurs during the labeling time.

  \begin{center}
  <<firstGuessCompareDDP,eval=TRUE,message=FALSE,warnings=FALSE, fig.width=3.5, fig.height=3.5, hold=TRUE, fig.cap="Dotplot of synthesis rates calculated with or without the assumption of no degradation during pulse. First guess of the synthesis rates calcualted with or without degDuringPulse option. In red is represented the line of equation y = x.", fig.align="center">>=

  k3 <- ratesFirstGuess(nascentInspObj, 'degradation')
  k3ddp <- ratesFirstGuess(nascentInspObjDDP, 'degradation')
  plot(rowMeans(k3), rowMeans(k3ddp), log='xy', 
       xlim=c(.2,10), ylim=c(.2,10), 
       xlab='no degradation during pulse',
       ylab='degradation during pulse',
       main='first guess degradation rates')
  abline(0,1,col='red')

  @
  \end{center}

  \subsubsection{Selection of the regulative scenario}

  The rates evaluated so far are highly exposed to the experimental noise, in particular processing and degradation rates
  which sum up the noise coming from different sources of experimental data (such as the signal from the 
  Total and the nascent libraries). For this reason, it could be challenging to distinguish real variations of a rate
  from noise. Once priors have been computed, the method \Rmethod{modelRates} add a second step of modeling to 
  identify the rates that are significantly regulated over the time course. This method tests wether the variation of a 
  rate over time (synthesis, processing, degradation, or a combination of them) is required to obtain a good fit 
  of the experimental data. To this purpose, 8 different models are tested for each gene, representing all the 
  combinations of constant or variable synthesis, processing and degradation rates, eventually choosing 
  the best trade off between goodness of fit and simplicity of the model. 
  During this step, constant functions, sigmoid functions or impulse models are used to represent the rates over time. 
  This evaluation can be computationally intense and some parameters are initialized randomly, therefore the argument 
  \Rcode{seed} guarantees the reproducibility of the results.

  <<modelRatesNascentTimeCourse,eval=TRUE,message=FALSE,warnings=FALSE>>=
  nascentInspObj10<-modelRates(nascentInspObj10, seed=1)
  @

  The rates that are computed through this second modeling procedure can be accessed through the method 
  \Rmethod{viewModelRates} and visualized with the \Rmethod{inHeatmap} method, setting the argument \Rcode{type='model'} 
  (Figure \ref{fig:viewModelRatesAndInHeatmapNascentTimeCourse}).

  \begin{center}
  <<viewModelRatesAndInHeatmapNascentTimeCourse,eval=TRUE,message=FALSE,warnings=FALSE, fig.width=6, fig.height=4, fig.cap="inHeatmap of the rates after the second step of modeling. Heatmap representing the concentrations of total RNA, of premature RNA and the rates of the RNA life cycle after the second step of modeling.", fig.align="center">>=
  round(viewModelRates(nascentInspObj10, 'synthesis')[1:5,1:3],3)
  inHeatmap(nascentInspObj10, type='model', clustering=FALSE)
  @
  \end{center}

  The metohd \Rmethod{geneClass} returns the transcriptional regulatory mechanism assigned to each modeled gene. 
  In particular, each gene is assigned to a class named after the set of varying rates: "0" denotes a gene whose 
  rates are constant over time, "a" denotes a gene whose synthesis changes over time, "b" denotes a gene whose 
  degradation changes over time, "c" denotes a gene whose processing changes over time.\\

  <<geneClassNascentTimeCourse,eval=TRUE,message=FALSE,warnings=FALSE>>=
  geneClass(nascentInspObj10)
  @

  Additionally, the metohd \Rmethod{plotGene} can be used to fully investigate the profiles of RNA concentrations 
  and rates of a single gene. Estimated synthesis, degradation and processing rates, premature RNA and total RNA 
  concentrations are displayed with solid thin lines, while their standard deviations are in dashed lines and 
  the modelled rates and concentrations are in thick solid lines. This example shows a gene of class "ab", 
  indicating that its expression levels are controlled by synthesis and degradation rates 
  (Figure \ref{fig:plotGeneNascentTimeCourse}).

  \begin{center}
  <<plotGeneNascentTimeCourse,eval=TRUE,message=TRUE,warnings=FALSE, fig.width=10, fig.height=3, fig.cap="Plot of a single gene resolved using nascent and total RNA. Plot of concentrations and rates over time for the given gene 101206. Total and nascent RNA were used to resolve the whole set of rates and concentrations.", fig.align="center">>=

  plotGene(nascentInspObj10, ix="101206")
  @
  \end{center}

  The quality of each model is evaluated through log likelihood and chi-square, to take into account the different 
  complexities of the regulatory scenarios we compute also the chi-square p value; it can visualized as a 
  histogram (Figure \ref{fig:chi2NascentTimeCourse}) and eventually used to discard badly fitted genes.

  \begin{center}
  <<chi2NascentTimeCourse,eval=TRUE,message=FALSE,warnings=FALSE, fig.width=3.5, fig.height=3.5, hold=TRUE, fig.cap="Goodness of the fit. Histogram of the p-values from the goodness of fit test for modeled genes.", fig.align="center">>=
  
  chisq<-chisqmodel(nascentInspObj10)
  hist(log10(chisq), main='', xlab='log10 chi-squared p-value')
  discard<-which(chisq>1e-4)
  featureNames(nascentInspObj10)[discard]
  nascentInspObj_reduced<-nascentInspObj10[-discard]  
  @
  \end{center}

  The model selection is a fundamental and subtle stage of the analysis, it influences all the results shown above and should 
  be carefully triggered according to the specific dataset under analysis. See the \textbf{Parameters setting} and 
  \textbf{Evaluation of time-course model on simulated data} sections for further details.

  \subsection{Analysis of Total RNA without Nascent}

  The absence of the nascent RNA component makes the identification of a unique solution of the system more difficult. 
  Nonetheless, the joint analysis of premature and mature RNA time-course RNA-seq analysis is an important source 
  of information in regards of RNA dynamics. The main idea behind this relies in the fact that genes that 
  are modulated transcriptionally (i.e. by the synthesis rate) respond in their RNA levels following a precise 
  pattern, that is premature RNA and mature RNA are modulated to the same extent (same fold change compared to 
  unperturbed condition) but with different timing. In fact, the mature RNA follow the modulation 
  of the premature RNA by a delay that is indicative to the mature RNA stability (i.e. degradation rate). 
  Deviations from this behavior are signs of post-transcriptional regulation. We developed a computational 
  approach that explots these concepts and is able to quantify RNA dynamics based on time-course profiling 
  of total RNA-seq data with good approximation. Without nascent RNA, the sole information of total 
  ($T$, exonic quantification) and premature ($P$, intronic quantification) from the Total library is available 
  and used to model the RNA dynamics. In particular, we model premature RNA ($P$) and mature RNA ($M=T-P$), 
  using the equation:

    \begin{equation}\label{eq:modelsystemMature}
        \begin{array}{l l}
      \dot{M}=k_2 \, P(t) - k_3 \, M(t)
        \end{array}
    \end{equation}

  We start solving the equations assuming constant processing and degradation rates over the time-course and interpolating 
  $P$ with a linear piecewise in order to find the $k2$ and $k3$ that minimize the error over $M$. While $k1(t)$ is obtained 
  as follows:

    \begin{equation}\label{eq:k1modelsystemMature}
        \begin{array}{l l}
      k_1(t) = \dot{P(t)} + k_2 \, P(t) \\
        \end{array}
    \end{equation}

  After that, $k_1(t)$ is used in combination with $P(t)$ and $T(t)$ to obatain $k_2(t)$ and $k_3(t)$ using the
  same procedure implemented for the joint analysis of nascent and total RNA.

  \subsubsection{Estimation of rates from intronic and exonic quantifications}

  The procedure described above is implemented in the method \Rmethod{newINSPEcT}, which creates the \Rclass{INSPEcT} object 
  for the analysis and gives a first estimation of the synthesis, processing and degradation rates along the time-course, 
  for those genes with enough signal for both introns and exons. In comparison to the mature and nascent configuration, from the 
  practical point of view, nothing changes except for the absence of $t_L$ and new synthesized expression data.

  <<newINSPEcTTimeCourse,eval=TRUE,message=FALSE,warnings=FALSE>>=
  tpts<-c(0,1/6,1/3,1/2,1,1.5,2,4,8,12,16)
  
  matureInspObj<-newINSPEcT(tpts=tpts
                          ,labeling_time=NULL
                          ,nascentExpressions=NULL
                          ,matureExpressions=matExp_DESeq2)
  @

  All the methods that apply to the \Rcode{INSPEcT} object created with the nascent RNA apply also to the object 
  created with the sole total RNA. For example, we can compare the time-averaged rates of synthesis estimated 
  in this first step of modeling with or without the nascent RNA, by extracting themwith the \Rmethod{ratesFirstGuess} 
  (Figure \ref{fig:firstGuessCompare}).

  \begin{center}
  <<firstGuessCompare,eval=TRUE,message=FALSE,warnings=FALSE, fig.width=3.5, fig.height=3.5, hold=TRUE, fig.cap="Dotplot of synthesis rates calculated with or without nascent RNA. First guess of the synthesis rates calcualted on the same set of genes with or without the nascent RNA information. In red is represented the line of equation y = x.", fig.align="center">>=

  cg <- intersect(featureNames(nascentInspObj), 
                  featureNames(matureInspObj))
  k1Nascent <- ratesFirstGuess(nascentInspObj[cg,], 'synthesis')
  k1Mature <- ratesFirstGuess(matureInspObj[cg,], 'synthesis')
  smoothScatter(log10(rowMeans(k1Nascent)), 
                log10(rowMeans(k1Mature)),
                main='synthesis rates',
                xlab='nascent + mature RNA', 
                ylab='mature RNA')
  abline(0,1,col='red')
  @
  \end{center}

  \subsubsection{Selection of the regulative scenario}

  Similarly to the procedure with the nascent RNA, inspect provide an additional step of modeling, where the software assigns
  to the rates of the RNA cycle a precise functional form to describe their behavior over time (see the corresponding section in 
  "Analysis of Total and Nascent RNA"). In this case, RNA dynamics can be modeled either 
  an impulse functions or constant functions. During this procedure 8 independent models for each gene are tested (all the 
  combinations of contant/impulsive functions for each of the three rates), and the best model is chosen based on the trade-off
  between the goodness of fit and the simplicity. This modeling procedure refines the estimation of the rates and gives a
  statistical confidence about the variation of each rate during the time-course.

  <<modelRatesMatureTimeCourse,eval=TRUE,message=FALSE,warnings=FALSE>>=
  matureInspObj10 <- matureInspObj[1:10, ]
  matureInspObj10 <- modelRates(matureInspObj10, seed=1)
  @

  In figure \ref{fig:plotGeneMatureTimeCourse}, the same gene represented in figure \ref{fig:plotGeneNascentTimeCourse} is plotted.
  Also in this case, estimated synthesis, degradation and processing rates, premature RNA and total RNA concentrations 
  are displayed with solid thin lines, while their standard deviations are in dashed lines and the modelled rates 
  and concentrations are in thick solid lines.

  \begin{center}
  <<plotGeneMatureTimeCourse,eval=TRUE,message=TRUE,warnings=FALSE, fig.width=10, fig.height=3, fig.cap="Plot of a single gene resolved using only total RNA. Plot of concentrations and rates over time for the given gene 101206. Only total RNA was used to resolve the whole set of rates and concentrations.", fig.align="center">>=
  plotGene(matureInspObj10, ix="101206")
  @
  \end{center}

  \subsection{Performance evaluation with simulated data}

  \Rpackage{INSPEcT} provides functionalities to build a synthetic dataset for which the transcriptional scenario 
  is known for each gene. Simulated data can be used to evaluate the performance of \Rpackage{INSPEcT} in classifying 
  each rate as constant or variable over time, to estimate the number of time points and replicates necessary to achieve 
  a given performance and to configure the model selection parameters.\\
  The first step consists in sampling from an \Rpackage{INSPEcT} object: absolute values of the rates, fold changes, 
  correlations between these two elements and associated variances. Once these parameters distributions have been 
  estimated, they are used to simulate a given number of genes (\textit{nGenes = ...}). Optionally, the user can 
  provide the probabilities for a rate to be modelled as a constant, sigmoid or impulse function; by default they 
  are respectively 0.5, 0.2 and 0.3.\\

  <<generateSimModel,eval=TRUE,message=FALSE,warnings=FALSE>>=
  simRates<-makeSimModel(nascentInspObj, 1000, seed=1)
  @

  \Rmethod{makeSimModel} produces an object of class \Rclass{INSPEcT\_model}, to be analysed it must be transformed 
  in an \Rclass{INSPEcT} object using the \Rmethod{makeSimDataset} method. It takes as arguments the number of 
  replicates required and the time points at which the data should be virtually collected. Regarding the latter point, 
  the new time-course must be designed as a sampling of the original experimental time window (same initial and final 
  conditions). The object created by this method can be modelled via \Rmethod{modelRates} as any other object of 
  class \Rclass{INSPEcT}, and in this case the results could be compared with the ground truth and the perfromace 
  of the procedure with given number of replicates and time points can be estimated. In order to avoid long computation
  time, the next two chunks of code have been previously evaluated and will be not computed within this vignette.

  <<simulatedDataNascent,eval=FALSE,message=FALSE,warnings=FALSE>>=

  newTpts <- c(0.00, 0.13, 0.35, 0.69, 1.26, 2.16, 3.63, 
               5.99, 9.82, 16.00)
  nascentSim2replicates<-makeSimDataset(object=simRates
                                     ,tpts=newTpts
                                     ,nRep=2
                                     ,NoNascent=FALSE
                                     ,seed=1)
  nascentSim2replicates<-modelRates(nascentSim2replicates[1:100]
                                 ,seed=1)

  newTpts <- c(0.00, 0.10, 0.26, 0.49, 0.82, 1.32, 2.06, 
               3.16, 4.78, 7.18, 10.73, 16.00)
  nascentSim3replicates<-makeSimDataset(object=simRates
                                     ,tpts=newTpts
                                     ,nRep=3
                                     ,NoNascent=FALSE
                                     ,seed=1)
  nascentSim3replicates<-modelRates(nascentSim3replicates[1:100]
                                 ,seed=1)
  @

  Starting from the very same \Robject{simRates}, it is also possible to generate \Rclass{INSPEcT} objects without 
  information about the nascent RNA (argument \Rcode{NoNascent} set to TRUE). However, it is not allowed to produce
  artificial gene sets starting from \Rpackage{INSPEcT} object without nascent RNA because the experimental 
  estimation of the synthesis rate is mandatory to guarantee the good quality of the simulated data.

  <<simulatedData,eval=FALSE,message=FALSE,warnings=FALSE>>=
  
  newTpts <- c(0.00, 0.13, 0.35, 0.69, 1.26, 2.16, 3.63, 
               5.99, 9.82, 16.00)
  matureSim2replicates<-makeSimDataset(object=simRates
                                   ,tpts=newTpts
                                   ,nRep=2
                                   ,NoNascent=TRUE
                                   ,seed=1)
  modelSelection(matureSim2replicates)$thresholds$chisquare <- 1
  matureSim2replicates<-modelRates(matureSim2replicates[1:100]
                               ,seed=1)

  newTpts <- c(0.00, 0.10, 0.26, 0.49, 0.82, 1.32, 2.06, 
               3.16, 4.78, 7.18, 10.73, 16.00)
  matureSim3replicates<-makeSimDataset(object=simRates
                                   ,tpts=newTpts
                                   ,nRep=3
                                   ,NoNascent=TRUE
                                   ,seed=1)
  modelSelection(matureSim3replicates)$thresholds$chisquare <- 1
  matureSim3replicates<-modelRates(matureSim3replicates[1:100]
                               ,seed=1)

  @

  Once the simulated data have been produced and analysed, it is possible to compare the results obtained by the 
  method \Rmethod{modelRates} and the object \Robject{simRates}, which contains the ground truth of rates.\\
  The method \Rmethod{rocCurve}, for example, measures the performace of the constant/variable classification
  individually for the rates of synthesis, processing and degradation, using the receiver operating characteristic 
  (ROC) curve. (Figure \ref{fig:ROCs}).

  \begin{center}

  <<ROCs, message=FALSE, warnings=FALSE, eval=TRUE, fig.width=10, fig.height=10, fig.cap="ROC analysis of INSPEcT classification. For each rate, INSPEcT classification performance is measured in terms of sensitivity, TP / (TP + FN), and specificity, TN / (TN + FP), using a ROC curve analysis; false negatives (FN) represent cases where the rate is identified as constant while it was simulated as varying; false positives (FP) represent cases where INSPEcT identified a rate as varying while it was simulated as constant; on the contrary, true positives (TP) and negatives (TN) are cases of correct classification of varying and constant rates, respectively; sensitivity and specificity are computed using increasing thresholds for the Brown method used to combine multiple p-values derived from the log-likelihood ratio tests", fig.align="center">>=

  data("nascentSim2replicates","nascentSim3replicates",
    "matureSim2replicates","matureSim3replicates",package='INSPEcT')

  par(mfrow=c(2,2))
  rocCurve(simRates[1:100],nascentSim2replicates)
  title("2rep. 10t.p. Total and nascent RNA", line=3)
  
  rocCurve(simRates[1:100],nascentSim3replicates)
  title("3rep. 12t.p. Total and nascent RNA", line=3)
  
  rocCurve(simRates[1:100],matureSim2replicates)
  title("2rep. 10t.p. Total RNA", line=3)
  
  rocCurve(simRates[1:100],matureSim3replicates)
  title("3rep. 12t.p. Total RNA", line=3)

  @
  
  \end{center}

  Further, the method \Rmethod{rocThresholds} can be used to assess the sensitivity and specificity that is achieved 
  with different thresholds of the chi-squared and Brown tests (Figure \ref{fig:ROCsThresholds}).

  \begin{center}
  <<ROCsThresholds, message=FALSE, warnings=FALSE, eval=TRUE, fig.width=8, fig.height=4, fig.cap="Effect of chi-squared and Brown tests thresholds in INSPEcT classification. Plot of the sensitivity (black curve) and specificity (red curve) that is achieved after performing the log-likelihood ratio and Brown method for combining p-values with selected thresholds; thresholds that can be set for chi-squared test to accept models that will undergo the log-likelihood ratio test and for Brown p-value to assess variability of rates">>=

  rocThresholds(simRates[1:100],nascentSim2replicates,
    bTsh=c(.01,.01,.05),cTsh=.1)
 
  @
  \end{center}

  \subsection{Parameters setting}
      
  If desired, different parameters can be set for both the modelling and the testing part. Regarding the modelling part, 
  we might want to increase the number of different initializations that are performed for each gene (\textit{nInit} option) 
  or increase the maximum number of steps in the rates optimization process (\textit{nIter} option). All these choices 
  could improve the performance of the method, but also the needed computational time; the impact of these options 
  on the quality of the modelling can be evaluated using simulated datasets.

  <<parameters1, message=FALSE, warnings=FALSE, eval=TRUE>>=

  nascentInspObj10<-removeModel(nascentInspObj10)

  modelingParams(nascentInspObj10)$nInit<-20
  modelingParams(nascentInspObj10)$nIter<-1000

  @

  Instead, we might want to change the thresholds for chi-squared and log-likelihood ratio tests, or define the specific 
  set of models to be compared with log-likelihood ratio test while assessing if a given rate is variable or not. 
  The chi-squared test is used to discard bad models that will not enter in the model selection. By default, the threshold of 
  acceptance is set to 0.2 and in this example we decide to be more stringent and set the threshold to 0.1. In case of low
  number of time points, few or no models can result in a low chi-square. In these case, it could be necessary to use larger 
  values for the chi-squared test threshold, or be completely comprehensive and set the threshold to 1. The threshold 
  relative to the Brown test is used to call differential regulation and can be set independently for each rate. Low threshold 
  of the Brown test reflect in a more strict call for differential regulation of the rate. In this example, we are changing 
  the thresholds of both the chi-squared test and the Brown method for combining p-values. We are also imposing that only 
  the constant models ("0") and processing varying model ("c") are tested one agains each other used to to assess 
  the variability of the processing rates using log-likelihood ratio test. Usually, all nested models that differ 
  for the processing rate are used ( "0"vs"c", "c"vs"bc", "c"vs"ac" and "ab"vs"abc" ).

  <<parameters2, message=FALSE, eval=TRUE>>=

  modelSelection(matureInspObj10)$thresholds$chisquare<-.1
  modelSelection(matureInspObj10)$thresholds$brown<-c(synthesis=.01
                                      ,processing=.01
                                      ,degradation=.05)
  @

  To have a sense of all parameters that can be set, type:

  <<parameters3, message=TRUE, eval=TRUE>>=

  ## modelling
  modelingParams(matureInspObj10)

  ## model selection and testing framework
  modelSelection(matureInspObj10)
  @

  \section{Analysis of RNA dynamics in steady state RNA-seq data}

  \subsection{Analysis of Total and Nascent RNA}

  \Rpackage{INSPEcT} is also designed to analyze RNA dynamics in steady state conditions and to compare the RNA dynamics 
  of two steady state at time. In order to exemplify the procedure, we treat the data relative to the time-course analyzed 
  above as 11 individual steady states. \Rpackage{INSPEcT} recognizes that the analysis is at steady state when a character 
  vector of condtions is given as time-points.

  <<steadyStateNascent1, message=FALSE, warnings=FALSE, eval=TRUE>>=
  conditions <- letters[1:11]
  nasSteadyObj <- newINSPEcT(tpts=conditions
                          ,labeling_time=tL
                          ,nascentExpressions=nasExp_DESeq2
                          ,matureExpressions=matExp_DESeq2)
  @

  Once created the steady-state \Rpackage{INSPEcT} object, the first guess of the rates can be extracted with the 
  method \Rmethod{ratesFirstGuess}, as usual. If we want to test for differential regulation in the RNA dynamics 
  between two conditions, the method \Rmethod{compareSteady} can be used. The \Rpackage{INSPEcT} object created 
  with 11 conditions can be subsetted in order to select the ones that we want to compare. In this case, we selected 
  the first and the last conditions, tested for differential regulation and visualized the results relative to individual 
  rates using the methods \Rmethod{synthesis}, \Rmethod{processing} and \Rmethod{degradation}. These methods return
  the rates computed by the modeling in the two conditions, the log transfromed geometric mean of the rate, the log2 fold
  change between the two conditions and the statistics relative to the test. In case a rate was not assessed as 
  differentially regulated the same value is reported for the two conditions. Also in this case, the p-values thresholds
  of the chi-squared and Brown test can be modified by the \Rmethod{modelSelection}.

  <<steadyStateNascent2, message=FALSE, warnings=FALSE, eval=TRUE>>=

  diffrates <- compareSteady(nasSteadyObj[,c(1,11)])

  head(round(synthesis(diffrates),2))
  head(round(processing(diffrates),2))
  head(round(degradation(diffrates),2))
  @

  Following this, the method \Rmethod{plotMA} can be used to plot the results of the comparison for each indiviadual rate.
  In figure (Figure \ref{fig:steadyStateNascent3}), we show the results for the synthesis rate. All the points that do not 
  lie on the line y = 0 corresponds to genes called differentially regulated by the method \Rmethod{compareSteady}. This method
  select differentially regulated genes based on the p-value of the Brown test previosly chosen. Additionally, it is possible 
  to visualize genes that are differentially according to a certain threshold of the Benjamini and Hochberg correction of
  the p-values (argument \textbf{padj}). Those genes will be visualized as orange triangles.
  
  <<steadyStateNascent3, message=TRUE, eval=TRUE, fig.width=5, fig.height=5, hold=TRUE, fig.cap="Representation of rate geometric mean adn variation between conditions. plotMA generated image. Orange triangles correspond to genes whose rates are differentially used between the two conditions, blue cloud correspond to the whole distribution of rates.", fig.align="center">>=
  plotMA(diffrates, rate='synthesis', padj=1e-3)
  @

  \subsection{Analysis of Total without Nascent RNA}

  It is not possible to estimate the kinetic rates from steady state data without nascent RNA expression data because of the 
  underdetermination of the system (Equations \ref{eq:modelsystem}). The only information that can be extracted is the ratio
  between the premature ($P$) and mature ($M$) RNA:

  \begin{equation}\label{eq:PTratio}
  \nonumber
    \frac{P}{M} = \frac{P_{T}}{T_{T} - P_{T}} = \frac{k_3}{k_2} \\
  \end{equation}

  where $P_{R}$ and $T_{R}$ are the premature and total RNA levels, respectively, estimated
  from the total RNA library. However, even from this aggregated quantity, valuable information regarding 
  the differential post transcriptional regulation of genes in different samples can still be obtained.\\
  This analysis is performed by the metohd \Rmethod{compareSteadyNoNascent} which takes as input an object of class
  \Rpackage{INSPEcT} created witouh nascent RNA, an expression threshold for premature and total RNA, to select reliable data 
  to be analysed (\textit{expressionThreshold}) and the distance, expressed in log2, from the expected behaviour beyond which a 
  gene should be considered as post-transcriptionally regulated.\\
  We noticed that the ratio $P$ over $M$ decreases along with the expression following a power law, which can be easily fitted in
  the log log space by a linear model. This power law is calculated in the first step of the \Rmethod{compareSteadyNoNascent}
  analysis. Following that, genes that deviates from this trend across conditions are identified as differentially
  post-transcriptionally regulated.

  <<steadyStateTotal1, message=FALSE, warnings=FALSE, eval=TRUE>>=

  conditions <- letters[1:11]
  matureInspObj <- newINSPEcT(tpts=conditions
                             ,matureExpressions=matExp_DESeq2)
  regGenes<-compareSteadyNoNascent(inspectIds=matureInspObj
                    ,expressionThreshold=0.25
                    ,log2FCThreshold=2.)
  regGenes[1:6,1:5]
  table(regGenes, useNA='always')
  @

  The output of the function is a matrix of booleans, as large as the expression input data, where TRUE means that a specific gene, 
  in a given condition, is post-transcriptionally regulated in a peculiar way.\\
  The entire analysis is based on the estimation of a standard behaviour of a set of genes which means that benefits from the 
  size of the dataset under analysis. The example is, obviously, just a proff of concept to explicit arguments and output of 
  \Rfunction{compareSteadyNoNascent}; for this reason, we have decided to use time-course data. A real application does not
  require data to be temporally related in any way.

  \section{Analysis of the delay introduced by the rates of processing to RNA responsiveness}

  During the transition between steady-states, RNA kinetic rates define the speed at which the mature form of a transcript 
  can be set at a new level, i.e. the responsiveness. The rate of degradation is typically considered the major determinant of 
  responsiveness: the lower is transcript stability, the higher is its responsiveness. Nonetheless, RNA processing rate
  influences RNA responiveness and specifically introduces a delay to it, which is commonly considered non-influent due to
  the speed at which the processing of the transcripts occur. To better elucidate the role of processing rate in this context,
  we devised two metrics, $\tau$ and $\Delta$, that measure the additional time required to double mature RNA levels following 
  a doubling in synthesis rates, compared to a scenario where processing occurs instantaneously ($\tau$), and the number of RNA 
  molecules involved in this delay ($\Delta$). These metrics can be calculated from the kinetic rates of individual genes, or 
  can be approximated with accuracy from premature and mature RNA abundances. The responsiveness of genes with high values 
  for both metrics is thus significantly burdened by the processing step.\\
  The two metrics, as well as the genes delayed by the processing rates, can be easily calculated from the \Rpackage{INSPEcT}
  object.

  <<processingDelay1, message=FALSE, warnings=FALSE, eval=TRUE>>=

  tau <- calculateTau(nasSteadyObj)

  procDelay<- processingDelay(inspectIds=nasSteadyObj
                             ,tauThreshold=1.2
                             ,deltaThreshold=1.0)
  table(procDelay, useNA='always')

  @  

  In case the object was created without the nascent RNA information, approximated metrics will be calculated.

  \begin{center}
  <<processingDelay2, message=FALSE, warnings=FALSE, eval=TRUE, fig.width=4, fig.height=4, fig.cap="">>=

  tau_approx <- calculateTau(matureInspObj)
  cg <- intersect(featureNames(nasSteadyObj), featureNames(matureInspObj))
  plot(tau[cg,], tau_approx[cg,], log='xy'
      , main=expression(paste(tau,' metric'))
      , xlab='with nascent RNA'
      , ylab='approximation without nascent RNA')

  @
  \end{center}

\section{About this document}

<<r>>=
sessionInfo()
@

\end{document}