%\VignetteIndexEntry{kissDE Reference Manual} %\VignettePackage{kissDE} %\VignetteEngine{utils::Sweave} \documentclass{article} <>= BiocStyle::latex() @ \usepackage{booktabs} % book-quality tables \usepackage[utf8]{inputenc} \usepackage{multicol} \usepackage{multirow} \usepackage{amsmath} \usepackage{amssymb} <>= options(digits=3, width=80, prompt=" ", continue=" ") @ \newcommand{\kde}{\Biocpkg{kissDE}} \newcommand{\kp}{\software{KisSplice}} \newcommand{\krg}{\software{KisSplice2refgenome}} \bioctitle{The 'kissDE' package} \author[1]{Clara Benoit-Pilven} \author[2]{Camille Marchet} \author[3]{Janice Kielbassa} \author[1]{Audric Cologne} \author[1]{Aurélie Siberchicot} \author[1]{Vincent Lacroix\thanks{\email{vincent.lacroix@univ-lyon1.fr}}} \affil[1]{Université de Lyon, Université Lyon 1, CNRS UMR5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France} \affil[2]{Univ Rennes, Inria, CNRS, IRISA, France} \affil[3]{Synergie Lyon Cancer, Université Lyon 1, Centre Léon Berard, Lyon, France} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \begin{abstract} \kde~is a package dedicated to the analysis of count data obtained from the quantification of pairs of variants in RNA-Seq data.\\ It can be used to study splice variants, where the two variants of the pair differ by the inclusion/exclusion of an exonic or intronic region. It can also be used to study genomic variants (whenever they are transcribed), which differ by a single nucleotide variation (SNV) or an indel.\\ The statistical framework is based on similar hypotheses as \Biocpkg{DESeq2} \cite{DESeq2} and includes its normalization method using geometric means. Counts are modelled using the negative binomial distribution. We use the framework of the generalised linear model, and we test for association of a variant with a condition using a likelihood ratio test.\\ This vignette explains how to use this package.\\ The workflow for SNPs/SNVs is fully described in Lopez-Maestre et al. \cite{Lopez-Maestre2016}, the workflow for splicing is fully described in Benoit-Pilven et al. \cite{Benoit-Pilven} \end{abstract} \packageVersion{\Sexpr{BiocStyle::pkg_ver("kissDE")}} \tableofcontents \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newpage \section{Prerequisites} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Use case} \kde~is meant to work on pairs of variants that have been quantified across different conditions. It can deal with single nucleotide variations (SNPs, mutations, RNA editing), indels or alternative splicing.\\ As \kde~was first designed to be a brick of the \kp~\cite{KisSplice} pipeline (web page: \url{http://kissplice.prabi.fr/}), the \Rfunction{kissplice2counts} function can be directly applied to the output files from \kp~or \krg. Yet, \kde~can also run with any other software which produces count data as long as this data is properly formatted.\\ \kde~was designed to work with at least two replicates for each condition, which means that the minimal input contains the read counts of the variants for 4 different samples, each couple representing a biological condition and its 2 replicates. There can be more replicates and more conditions, but it is not mandatory to have an equal number of replicates in each condition.\\ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Install and load \kde} In a \R{} session, \Bioconductor{} has first to be installed. <>= source("http://bioconductor.org/biocLite.R") @ Then, the \kde~package can be installed from \Bioconductor{} and finally loaded. <>= biocLite("kissDE") @ <>= library(kissDE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Quick start} Here we present the basic \R{} commands for an analysis with \kde. These commands require an external output file of \kp~, for example \file{output_kissplice.fa} (which is not included in this package). To deal with other types of input files, please refer to section \ref{subsec:input}. The funtions used in \kde~are \Rfunction{kissplice2counts}, \Rfunction{qualityControl}, \Rfunction{diffExpressedVariants} and \Rfunction{writeOutputKissDE}. For each function, default values of the parameters are used. For more details on functions and their parameters see section \ref{sec:workflow}. Here we assume that there are two conditions ($condition\_1$ and $condition\_2$) with two biological replicates and we also assume that the RNA-Seq libraries are single-end. <>= counts <- kissplice2counts("output_kissplice.fa") conditions <- c(rep("condition_1", 2), rep("condition_2", 2)) qualityControl(counts, conditions) results <- diffExpressedVariants(counts, conditions) writeOutputKissDE(results, output = "kissDE_output.tab") @ Note that the functions \Rfunction{kissplice2counts} and \Rfunction{diffExpressedVariants} may take some time to run (see section \ref{subsec:time} for more details on running time). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{\kde's workflow} \label{sec:workflow} In this section, the successive steps and functions of a differential analysis with \kde~are described. \begin{figure} \includegraphics{Workflow_KissDE.png} \caption{\label{fig:workflow}Schema of \kde's workflow. Numbers in light blue point to the section of this vignette explaining the step.} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Input data} \label{subsec:input} \kde's input is a table of raw counts and a vector describing the number of conditions and replicates per condition. The table of raw counts can either be directly provided by the user or obtained with \kp~or \krg~(\url{http://kissplice.prabi.fr/training/}). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Condition vector} \label{subsubsec:condition} The condition vector describes the order of the columns in the count table.\\ As an example, the counts are ordered as follow: the two first counts represent the two replicates of $condition\_1$ and the two following counts the two replicates of $condition\_2$. In this case, the condition vector for these 2 conditions with 2 replicates per condition, would be: <>= myConditions <- c(rep("condition_1", 2), rep("condition_2", 2)) @ In the case where the input data contains more than 2 conditions, we advise the user to remove samples from the analysis in order to compare 2 conditions only, because \kde was uniquely tested in this context. To remove samples from the analysis the "*" character can be used: <>= myConditions <- c(rep("condition_1", 2), rep("*", 2), rep("condition_3", 2)) @ Here, there are 3 conditions and 2 replicates per condition, but only $condition\_1$ and $condition\_3$ will be considered in the analysis.\\ If the count table was loaded from \kp~or~\krg~output, the condition vector must contain the samples in the same order they were given to \kp~(see sections \ref{subsubsec:input_ks} and \ref{subsubsec:input_krg}).\\ \warning{To run \kde, all conditions must have replicates. So each condition must at least be present twice in the condition vector. If this is not the case, an error message will be printed.} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{User's own data (without \kp): table of counts format} \label{subsubsec:input_count} Let's assume we work with two conditions ($condition\_1$ and $condition\_2$) and two replicates per condition. An input example table contained in a flat file called \file{table_counts_alt_splicing.txt} is loaded and stored in a \Robject{tableCounts} object.\\ \bioccomment{\Robject{fpath1} contains the absolute path of the file on the user's hard disk.} <>= fpath1 <- system.file("extdata", "table_counts_alt_splicing.txt", package="kissDE") tableCounts <- read.table(fpath1, head = TRUE) @ In \kde, the table of counts must be formatted as follows: <>= head(tableCounts) @ It must be a data frame with: \begin{itemize} \item \textbf{in rows:} \begin{itemize} \item One variation is represented by two lines, one for each variant. For instance, for SNVs, one allele is described in the first line, and the other in the second line. For alternative splicing events, the inclusion isoform and the exclusion isoform have one line each. \item The header must contain the column names in the flat file. \end{itemize} \item \textbf{in columns:} \begin{itemize} \item The first column (\Rcode{eventsName}) contains the name of the variation. \item The second column (\Rcode{eventsLength}) contains the effective size of the variant in nucleotides (bp). The effective size corresponds to the number of read mapping positions used when estimating the abundance of a variant.\\ For the exclusion variant (2nd line), which should correspond to an exon-exon junction, it corresponds to: \begin{equation} effectiveLengthExclu = readLength - 2 * overhang + 1 \end{equation} where $overhang$ corresponds to the minimal number of bases needed to accept that a read is aligned to a junction.\\ For the inclusion variant (1st line), it corresponds to: \begin{equation} effectiveLengthInclu = effectiveLengthExclu + variablePartLength \end{equation} where $variablePartLength$ is the length of the region only present in the inclusion variant.\\ In the special case where the abundance of the inclusion variant has been estimated using only junction reads, then the effective length of the inclusion variant is: \begin{equation} effectiveLengthInclu = 2 * effectiveLengthExclu \end{equation} This information is used only in the context of alternative splicing. In the context of SNVs, it can be set to 0. It is used to assess which splice variants may induce a frameshift (the difference of length between the inclusion and exclusion variant is not a multiple of 3). It is also used to precisely estimate the PSI (Percent Spliced In). \item All other columns (\Rcode{cond1rep1}, \Rcode{cond1rep2}, \Rcode{cond2rep1}, \Rcode{cond2rep2}) contain read counts of a variant in a sample. In the example above, \Rcode{cond1rep1} is the number of reads supporting this variant in the first replicate of $condition\_1$, \Rcode{cond1rep2} is the number of reads supporting replicate 2 in $condition\_1$, \Rcode{cond2rep1} and \Rcode{cond2rep2} are counts for replicates 1 and 2 of $condition\_2$. \end{itemize} \end{itemize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Input table from \kp~output} \label{subsubsec:input_ks} \kde~was developped to deal with \kp~output, which is in fasta format. Below is the first four lines of an example of \kp~output: <>= headfasta <- system.file("extdata", "head_output_kissplice_alt_splicing_fasta.txt", package = "kissDE") writeLines(readLines(headfasta)) @ Events are reported in blocks of 4 lines, the first two lines correspond to one variant of the splicing event (or one allele of the SNV), the following two lines correspond to the other variant (or the other allele). As for all fasta file, there is a header line beginning with the \Rcode{>} symbol and a line with the sequence. Each variant correspond to one entry in the fasta file. Headers contain information used in \kde. In the example, there are: \begin{itemize} \item elements shared by the headers of the two variants: \begin{itemize} \item \Rcode{bcc\_68965|Cycle\_4} is the event's ID. \item \Rcode{Type\_1} means that the sequences correspond to a splicing event. \Rcode{Type\_0} corresponds to SNVs. \end{itemize} \item elements that are specific to a variant: \begin{itemize} \item \Rcode{upper\_path\_length\_112} and \Rcode{lower\_path\_length\_82} gives the length of the nucleotide sequences. Upper path and lower path are a denomination for the representation of each variant in \kp's graph. For alternative splicing events, the upper path represents the inclusion isoform and the lower path the exclusion isoform. \item \Rcode{AS1\_1|SB1\_1|S1\_0|ASSB1\_0|AS2\_0|SB2\_0|S2\_0|ASSB2\_0|AS3\_0| SB3\_0|...} and \Rcode{AB1\_21|AB2\_12|AB3\_12|AB4\_2|AB5\_5|...} summarizes the counts found by \kp~quantification step. Here \kp~was run with the option \Rcode{counts} set to 2. For the upper path, we have 4 counts for each sample: AS, SB, S and ASSB. For the lower path, we have 1 count per sample: AB. The different reads categories are shown on Figure \ref{fig:readstype}. There are 8 sets of counts because we gave 8 files in input to \kp~(denotated by the number before the "\_" character). Each count (denotated by the number after the "\_" character) corresponds to the reads coming from each file that could be mapped on the variant, in the order they have been passed to \kp. \end{itemize} \item a rank information which is a deprecated measure. \end{itemize} \begin{figure} \includegraphics{reads_type.png} \caption{\label{fig:readstype}Different categories of reads. In this figure, we show an example of an alternative skipped exon. AS reads correspond to reads spanning the junction between the excluded sequence and its left flanking exon, SB to reads spanning the junction between the excluded sequence and its right flanking exon, ASSB to reads spanning the two inclusion junctions, S to reads entirely included in the alternative sequence and AB to reads spanning the junction between the two flanking exons. S reads correspond to exonic reads and all other categories of reads represented here correspond to junction reads.} \end{figure} \kde~can be used on any type of events output by \kp~(0: SNV, 1: alternative splicing events, 3: indels,...). The user should refer to \kp~manual (\url{http://kissplice.prabi.fr/documentation/}) for further questions about the \kp~format and its output.\\ To be used in \kde, \kp~output must be converted into a table of counts. This can be done with the \Rfunction{kissplice2counts} function. In the example below, the \kp~output file called \file{output_kissplice_alt_splicing.fa}, included in the \kde~package, is loaded. The table of counts yielded by the \Rfunction{kissplice2counts} function is stored in \Robject{myCounts}.\\ \bioccomment{\Robject{fpath2} contains the absolute path of the file on the user's hard disk.} <>= fpath2 <- system.file("extdata", "output_kissplice_alt_splicing.fa", package="kissDE") myCounts <- kissplice2counts(fpath2, counts = 2, pairedEnd = TRUE) @ The counts returned by \Rfunction{kissplice2counts} are extracted from the \kp~header. By default, \Rfunction{kissplice2counts} expects single-end reads and one count for each variant.\\ The \Rcode{counts} parameter of \Rfunction{kissplice2counts} must be the same as the \Rcode{counts} parameter used to obtain data with \kp. The possible values are 0, 1 or 2. 0 is the default value for both \Rfunction{kissplice2counts} and \kp.\\ The user can also specify the \Rcode{pairedEnd} parameter in \Rfunction{kissplice2counts}. If RNA-Seq libraries are paired-end, \Rcode{pairedEnd} should be set to \Rcode{TRUE}. In this case, the \Rfunction{kissplice2counts} function expects the counts of the paired-end reads to be next to each other. If it is not the case, an additional \Rcode{order} parameter should be used to indicate the actual order of the counts. For instance, if the experimental design is composed of two conditions with two paired-end replicates and if the input in \kp~ followed this order:\\ cond1\_sample1\_readpair1, cond1\_sample2\_readpair1, cond2\_sample1\_readpair1,\\ cond2\_sample2\_readpair1, cond1\_sample1\_readpair2, cond1\_sample2\_readpair2,\\ cond2\_sample1\_readpair2 and cond2\_sample2\_readpair2.\\ The order vector should be equal to \Rcode{c(1,2,3,4,1,2,3,4)}.\\ An example of a paired-end dataset run with \Rcode{counts} equal to 0 is shown in section \ref{subsec:SNV}. \Rfunction{kissplice2counts} returns a list of four elements, including \Rcode{countsEvents} which contains the table of counts required in \kde. <>= names(myCounts) head(myCounts$countsEvents) @ \Robject{myCounts\$countsEvents} has the same structure as the \Robject{tableCounts} object in the section \ref{subsubsec:input_count}. It is a data frame with: \begin{itemize} \item \textbf{in rows:} One variation is represented by two lines, one for each variant. For instance for SNVs, one allele is described in the first line and the other in the second line. For alternative splicing events (as in this example), the inclusion and the exclusion isoform have one line each. \item \textbf{in columns:} \begin{itemize} \item The first column (\Rcode{events.names}) contains the name of the variation, using \kp~notation. \item The second column (\Rcode{events.length}) contains the size of the variant in bp, extracted from the \kp~header. \item All others columns (\Rcode{counts1, counts2, counts3, counts4}) contain counts for each replicate in each condition for the variant. \end{itemize} ~ \end{itemize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Input table from \krg~output} \label{subsubsec:input_krg} The \Rfunction{kissplice2counts} function can also deal with \krg~output data, in this case the \Rcode{k2rg} parameter has to be set to \Rcode{TRUE}. \krg~allows the annotation of the alternative splicing events. It assigns each event a gene and a type of alternative splicing event, among which: Exon Skipping (ES), Intron Retention (IR), Alternative Donor (AltD), Alternative Acceptor (AltA). Interested users should refer to \krg~manual for further questions about \krg~format and output (\url{http://kissplice.prabi.fr/tools/kiss2refgenome/}). In the example below, \file{output_k2rg_alt_splicing.txt}, a \krg's output included in the \kde~package, is loaded. The \Rfunction{kissplice2counts} function uses the same \Rcode{counts} and \Rcode{pairedEnd} parameters as explained in the section \ref{subsubsec:input_ks}. The table of counts yielded by the \Rfunction{kissplice2counts} function is stored in \Robject{myCounts\_k2rg}. It has exactly the same structure as detailed in section \ref{subsubsec:input_ks}.\\ \bioccomment{\Robject{fpath3} contains the absolute path of the file on the user's hard disk.} <>= fpath3 <- system.file("extdata", "output_k2rg_alt_splicing.txt", package="kissDE") myCounts_k2rg <- kissplice2counts(fpath3, counts = 2, pairedEnd = TRUE, k2rg = TRUE) names(myCounts_k2rg) head(myCounts_k2rg$countsEvents) @ The \krg~output contains information about the type of splicing events. By default, all of the splicing events are analysed in \kde, but it is also possible to focus on subtypes of events. This events selection will speed up \kde's running time and improve statistical power for choosen events. To do this, the \Rfunction{kissplice2counts} function contains two parameters: \Rcode{keep} and \Rcode{remove}. Both take a character vector indicating the types of events to keep or remove. The event names must be part of this list: \Rcode{deletion}, \Rcode{insertion}, \Rcode{IR}, \Rcode{ES}, \Rcode{altA}, \Rcode{altD}, \Rcode{altAD}, \Rcode{alt}, \Rcode{unclassified}.\\ Thus, if the user is only interested in intron retention events, the \Rcode{keep} option should be set to \Rcode{c("IR")}. If the user isn't interessed in deletions and insertions, the \Rcode{remove} option should be equal to \Rcode{c("insertion", "deletion")}.\\ The \Rcode{keep} and \Rcode{remove} parameters can be used at the same time only if \Rcode{ES} is part of the \Rcode{keep} vector. The \Rcode{remove} vector will then act on the different types of exon skipping: multi-exon skipping (\Rcode{MULTI}) or exon skipping associated with an alternative acceptor site (\Rcode{altA}), an alternative donor site (\Rcode{altD}), both alternative acceptor and donor site (\Rcode{altAD}) or an undetermined alternative splice site (\Rcode{alt}). Thus, in this specific case, the \Rcode{remove} vector should contain names from this list: \Rcode{MULTI}, \Rcode{altA}, \Rcode{altD}, \Rcode{altAD}, \Rcode{alt}. If the user wants to analyse only cassette exon events (i.e., a single exon is skipped or included), the following command should be used: <>= myCounts_k2rg_ES <- kissplice2counts(fpath3, counts = 2, pairedEnd = TRUE, k2rg = TRUE, keep = c("ES"), remove = c("MULTI", "altA", "altD", "altAD", "alt")) @ %TODO: check if it works also on SNV %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Quality Control} \kde~contains a function that allows the user to control the quality of the data and to check if no error occured at the data loading step. This data quality assessment is essential and should be done before the differential analysis. The \Rfunction{qualityControl} function takes as input a count table (see sections \ref{subsubsec:input_count}, \ref{subsubsec:input_ks} and \ref{subsubsec:input_krg}) and a condition vector (see section \ref{subsubsec:condition}): <>= qualityControl(myCounts, myConditions) @ It produces 2 graphs: \begin{itemize} \item{a heatmap of the sample-to-sample distances using the 500 most variant events (see left panel of Figure \ref{fig:qualityControl})} \item{the factor map formed by the first two axes of a principal component analysis (PCA) using the 500 most variant events (see right panel of Figure \ref{fig:qualityControl})} \end{itemize} \begin{figure} \includegraphics[page=1,width=0.45\textwidth]{kissDE-qualityControl_howto.pdf} \includegraphics[page=2,width=0.45\textwidth]{kissDE-qualityControl_howto.pdf} \caption{\label{fig:qualityControl}Quality control plots. Left: Heatmap of the sample-to-sample distances. Right: Principal Component Analysis.} \end{figure} These two graphs show the similarities and the differences between the analyzed samples. Replicates of the same condition are expected to cluster together. If this is not the case, the user should check if the order of the samples in the count table and in the condition vector is the same. If it is, this could mean that a sample is contaminated or has an abnormality that will influence the differential analysis. The user can then go back to the quality control of the raw data to solve the problem or decide to remove the sample from the analysis. In the heatmap plot, the samples that cluster together are from the same condition. In the PCA plot, the first principal component (PC1) summarize 90.2\% of the total variance of the dataset. This first axis clearly separates the 2 conditions. The created graphs can be saved by setting the \Rcode{storeFigs} parameter of the \Rfunction{qualityControl} function to \Rcode{TRUE} (then graphs are stored in a \file{kissDEFigures} folder, created in a temporary directory, which is removed at the end of the user \R{} session) or to the path where the user wants to store his/her graphs. We recommend to use this parameter when the \Rfunction{qualityControl} function is used in an automatized workflow. To customize the PCA plot, the data frame used for this plot can be extracted by setting the option \Rcode{returnPCAdata} to \Rcode{TRUE} as follows: <>= PCAdata <- qualityControl(myCounts, myConditions, returnPCAdata = TRUE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Differential analysis} \label{subsec:diffanalysis} When data are loaded, the differential analysis can be run using the \Rfunction{diffExpressedVariants} function. This function has two mandatory parameters: a count table (\Rcode{countsData} parameter, see sections \ref{subsubsec:input_count}, \ref{subsubsec:input_ks} and \ref{subsubsec:input_krg}) and a condition vector (\Rcode{conditions} parameter, see section \ref{subsubsec:condition}). In the example below, the differential analysis results are stored in the \Robject{myResults} object: <>= myResults <- diffExpressedVariants(countsData = myCounts, conditions = myConditions) @ The \Rfunction{diffExpressedVariants} function has three parameters to change the filters or the flags applied on the data, and one parameter to indicate if the replicates are technical or biological: \begin{itemize} \item \Rcode{pvalue}: By default, the p-value threshold to output the significant events is set to 1. So all variants are output in the final table. This parameter must be a numeric value between 0 and 1. Be aware that by setting \Rcode{pvalue} to 0.05, only events that have been identified as significant between the conditions with a false discovery rate (FDR) $\leqslant$ 5\% will be present in the final table. A posteriori changing this threshold will require to re-run the differential analysis. \item \Rcode{filterLowCountsVariants}: This parameter allows to change the threshold to filter low expressed events before testing (as explained in section \ref{subsec:filter}). By default, it is set to 10. \item \Rcode{flagLowCountsConditions}: This parameter allows to change the threshold to flag low expressed events (as explained in section \ref{subsec:flag}). By default, it is set to 10. \item \Rcode{technicalReplicates}: Boolean value indicating if the user is working with technical replicates only (we do not advise users to mix biological and technical replicates in their analyses). If this parameter is set to \Rcode{TRUE}, the counts will be modeled with a Poisson distribution. If it is equal to \Rcode{FALSE}, the counts will be modeled with a Negative Binomial distribution. For more information, see section \ref{subsec:dispersion}. By default, this option is set to \Rcode{FALSE}. \end{itemize} The \Rfunction{diffExpressedVariants} function returns a list of 6 objects: <>= names(myResults) @ The \Rcode{uncorrectedPVal} and \Rcode{correctedPval} outputs are numeric vectors with p-values before and after correction for multiple testing. \Rcode{resultFitNBglmModel} is a data frame containing the results of the fitting of the model to the data. \Rcode{k2rgFile} is a string containing either the \krg~file path and name or NULL if no \krg~file was used as input. For explanations about the \Rcode{finalTable} and \Rcode{f/psiTable} outputs, see section \ref{subsubsec:finaltable} and section \ref{subsubsec:psitable}, respectively.\\ To visualize the distribution of the p-values before the application of the Benjamini-Hochberg \cite{benjamini1995} multiple testing correction procedure, the histogram of the p-values before correction can be plotted by using the following command: <>= hist(myResults$uncorrectedPVal, main="Histogram of p-values", xlab = "p-values", breaks = 50) @ Because the dataset used here is small ($\sim$ 100 lines), the histograms of the two complete datasets presented in the case studies (section \ref{sec:casestudies}) are represented. As expected, the histograms show a uniform distribution with a peak near 0 (Figure \ref{fig:distribpvalue}). \begin{figure} \includegraphics[width=\textwidth]{hist_pvalue_uncorrected.png} \caption{\label{fig:distribpvalue}Distribution of p-values before correction for multiple testing. Left: for the complete dataset presented in section \ref{subsec:SNV}. Right: for the complete dataset presented in section \ref{subsec:AS}.} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Output results} \subsubsection{Final table} \label{subsubsec:finaltable} The \Rcode{finalTable} object is the main output of the \Rfunction{diffExpressedVariants} function. The first 3 rows of the \Robject{myResults\$finalTable} output are as follows: <>= print(head(myResults$finalTable, n = 3), row.names = FALSE) @ The columns of this table contain the following information: \begin{itemize} \item \Rcode{ID} is the event identifier. Each event is represented by one row in the table. \item \Rcode{Length\_diff} contains the variable part length in a splicing event. It is the length difference between the upper and lower path. This column is not relevant for SNVs. \item \Rcode{Variant1\_condition\_1\_repl1\_Norm} and following columns contain the counts for each replicate of each variant after normalization (raw counts are normalized as in the \Biocpkg{DESeq2} \Bioconductor{} \R{} package, see details in section \ref{subsec:norm}). The first half of these columns concerns the first variant of each event, the second half the second variant. \item \Rcode{Adjusted\_pvalue} contains p-values adjusted by a Benjamini-Hochberg procedure. \item \Rcode{Deltaf/DeltaPSI} summarizes the magnitude of the effect (see details in section \ref{subsec:psi}). \item \Rcode{lowcounts} contains booleans which flag low counts events as described in section \ref{subsec:flag}. A \Rcode{TRUE} value means that the event has low counts (counts below the chosen threshold). \end{itemize} In the \Rcode{finalTable} output, events are sorted by p-values and then by magnitude of effect (based on their absolute values), so that the top candidates for further investigation/validation appear at the beginning of the output.\\ \warning{When the p-value computed by \kde~is lower than the smallest number greater than zero that can be stored (i.e., 2.2e-16), this p-value is set to 0.} \paragraph{}To save results, a tab-delimited file can be written with \Rfunction{writeOutputKissDE} function where an \Rcode{output} parameter (containing the name of the saved file) is required. Here, the \Robject{myResults} output is saved in a file called \file{results_table.tab}: <>= writeOutputKissDE(myResults, output = "kissDE_results_table.tab") @ Users can choose to export only events passing some thresholds on adjusted p-value and/or Deltaf/DeltaPSI using the options \Rcode{adjPvalMax} and \Rcode{dPSImin} of the \Rfunction{writeOutputKissDE} function. For example, if we want to save in a file called \file{results_table_filtered.tab} only events with the adjusted p-value $\leqslant$ 0.05 and the Deltaf/DeltaPSI absolute value $\geqslant$ 0.10, the following command can be used: <>= writeOutputKissDE(myResults, output = "kissDE_results_table_filtered.tab", adjPvalMax = 0.05, dPSImin = 0.10) @ If the counts table was built from a \krg~output with the \Rfunction{kissplice2counts} function, running the \Rfunction{writeOutputKissDE} will write a file merging results of differential analysis with \krg~data. As previously explained (section \ref{subsubsec:finaltable}), users can choose to save only events passing thresholds: <>= writeOutputKissDE(myResults_K2RG, output = "kissDE_K2RG_results_table.tab", adjPvalMax = 0.05, dPSImin = 0.10) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{f/PSI table} \label{subsubsec:psitable} The \Rcode{f/psiTable} output of the \Rfunction{diffExpressedVariants} function contains the f values for SNV analysis or PSI values for alternative splicing analysis (see details and computation in section \ref{subsec:psi}) for each event in each sample. The first three rows of the \Rcode{f/psiTable} output of the \Robject{myResults} object (created in the section \ref{subsec:diffanalysis}) look like this: <>= head(myResults$`f/psiTable`, n = 3) @ This output can be useful to carry out downstream analysis or to produce specific plots (like heatmap on f/PSI events). To use this information with external tools, this table can be saved in a tab-delimited file (here called \file{result_PSI.tab}), setting the \Rcode{writePSI} parameter to \Rcode{TRUE} in the \Rfunction{writeOutputKissDE} function: <>= writeOutputKissDE(myResults, output = "result_PSI.tab", writePSI = TRUE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{\kde's theory} In this section, the different steps of the \kde~main function, \Rfunction{diffExpressedVariants}, are detailed. They are summarized in the Figure \ref{fig:kdetheory}. \begin{figure} \centering \includegraphics[width=0.5\textwidth]{Workflow_diffExpressedVariants.png} \caption{\label{fig:kdetheory}The different steps of the \Rfunction{diffExpressedVariants} function. Numbers in light blue point to the section of this vignette explaining the step.} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Normalization} \label{subsec:norm} In a first step, counts are normalized with the default normalization methods provided by the \Biocpkg{DESeq2} \cite{DESeq2} package. The size factors are estimated using the sum of counts of both variants for each event, which is a proxy of the gene expression. By using this normalization, we correct for library size, because the sequencing depth can vary between samples. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Estimation of dispersion} \label{subsec:dispersion} A model to describe the counts distribution is first chosen. When working with technical replicates (\Rcode{technicalReplicates} = \Rcode{TRUE} in \Rfunction{diffExpressedVariants}), the Poisson model is chosen in \kde~and the \CRANpkg{glmnet} \R{} package \cite{glmnet} (model $\mathcal{M}(\phi=0)$) is used.\\ When working with biological replicates (\Rcode{technicalReplicates} = \Rcode{FALSE} in \Rfunction{diffExpressedVariants}), the Poisson distribution's variance parameter is in general not flexible enough to describe the data, because replicates add several sources of variance.\\ This overdispersion is often modeled using a Negative Binomial distribution. In \kde, the overdispersion parameter, $\phi$, is estimated using the \Biocpkg{DSS} \R{} package \cite{DSS1, DSS2, DSS3, DSS4} (model $\mathcal{M}(\phi=\phi^i_{\text{DSS}})$.\\ The \Biocpkg{DSS} package (and, to our knowledge, every other package estimating the overdispersion of the Negative Binomial model) is suited for differential expression analysis (one count per sample). In differential splicing and SNV analysis, two counts (one for each splice variant or allele) are associated with each sample. In order to mimic gene expression, the overdispersion parameter $\phi$ is estimated on the sum of the splice variant or allele counts of each sample. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Pre-test filtering} \label{subsec:filter} If global counts for both variants are too low (option \Rcode{filterLowCountsVariants}), the event is not tested. The rationale behind this filter is to speed up the analysis and gain statistical power.\\ Here we present an example to explain how \Rcode{filterLowCountsVariants} option works. Let's assume that there are two conditions and two replicates per condition. \Rcode{filterLowCountsVariants} keeps its default value, 10.\\ \begin{table}[h] \begin{tabular}{c|c|c|c|c|c} &\multicolumn{2}{c|}{Condition 1} & \multicolumn{2}{c|}{Condition 2} & Sum by variant\\ & replicate 1 & replicate 2 & replicate 1 & replicate 2 & \\ \hline Variant 1 & 2 & 1 & 3 & 2 & 2+1+3+2=8 $<$ 10 \\ Variant 2 & 8 & 0 & 1 & 0 & 8+0+1+0=9 $<$ 10 \\ \end{tabular} \caption{Example of an event filtered out before the differential analysis, because less than 10 reads support each variant.} \label{tab:filter1} \end{table} \paragraph{}In this example (Table \ref{tab:filter1}), the two variants have global counts less than 10, this event will be used to compute the overdispersion, but will not be used to compute the models. It will neither appear in the result table.\\ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Model fitting} Then we design two models to take into account interactions with variants (SNVs or alternative isoforms) and experimental conditions as main effects. We use the generalised linear model framework. The expected intensity $\lambda_{ijk}$ can be written as follows: \begin{equation} \mathcal{M_0}:\:\:\log \lambda_{ijk} = \mu + \alpha_{i} +\beta_{j}\\ \end{equation} \begin{equation} \mathcal{M_1}:\:\:\log \lambda_{ijk} = \mu + \alpha_{i} + \beta_{j} + \left(\alpha \beta \right)_{ij} \end{equation} where $\mu$ is the local mean expression of the transcript that contains the variant, $\alpha_{i}$ the effect of variant $i$ on the expression, $\beta_{j}$ the contribution of condition $j$ to the total expression, and $\left(\alpha \beta \right)_{ij}$ the interaction term. To avoid singular hessian matrices while fitting models, pseudo-counts (\textit{i.e.}, systematic random allocation of ones) were considered for variants showing many zero counts. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Likelihood ratio test} To select between $\mathcal{M_0}$ and $\mathcal{M_1}$, we perform a Likelihood Ratio Test (LRT) with one degree of freedom. In the null hypothesis $H_0:\{\left(\alpha \beta \right)_{ij}=0\}$, there is no interaction between variant and condition. For events where $H_0$ is rejected, the interaction term is significant to explain the count's distribution, which leads to conclude to a differential usage of a variant across conditions. p-values are then adjusted with a 5\% false discovery rate (FDR) following a Benjamini-Hochberg procedure \cite{benjamini1995} to account for multiple testing. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Flagging low counts} \label{subsec:flag} If in at least $n-1$ conditions (be $n$ the number of conditions $\geq 2$) an event has low counts (option \Rcode{flagLowCountsConditions}), it is flagged (\Rcode{TRUE} in the last column of the \Rcode{finalTable} output).\\ In the example Table \ref{tab:flag1}, we can see that the counts are quite contrasted, variant 1 seemed more expressed in condition 2 and variant 2 in condition 1. Moreover, this event has enough counts for each variant not to be filtered out when the \Rcode{filterLowCountsVariants} parameter is set to 10:\\ \begin{table}[h] \begin{tabular}{c|c|c|c|c|c} &\multicolumn{2}{c|}{Condition 1} & \multicolumn{2}{c|}{Condition 2}& Sum by variant\\ & replicate 1 & replicate 2 & replicate 1 & replicate 2 & \\ \hline Variant 1 & 1 & 0 & 60 & 70 & 1+0+60+70=131 $>$ 10 \\ Variant 2 & 5 & 3 & 10 & 20 & 5+3+10+20=38 $>$ 10 \\ \hline Sum by condition &\multicolumn{2}{c|}{\textbf{9} $<$ 10 } & \multicolumn{2}{c|}{160 $>$ 10} & \\ \end{tabular} \caption{Example of an event flagged as having low counts, because less than 10 reads support this event in the first condition.} \label{tab:flag1} \end{table} \paragraph{}However, in $n-1$ (here 1) condition, the global count for one condition is less than 10 (9 for condition 1), so \Rcode{flagLowCountsConditions} option will flag this event as \Rcode{'Low\_Counts'}. This event may be interesting because it has the potential to be found as differential. However, it will be hard to validate it experimentally, because the gene is poorly expressed in condition 1. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Magnitude of the effect} \label{subsec:psi} When a gene is found to be differentially spliced between two conditions, or an allele is found to be differentially present in two populations/conditions, one concern which remains is to quantify the magnitude of this effect. Indeed, especially in RNA-Seq, where some genes are very highly expressed (and hence have very high read counts), it is often the case that we detect significant (p-value $\leqslant$ 0.05) but weak effects. When dealing with genomic variants, we quantify the magnitude of the effect using the difference of allele frequencies (f) between the two conditions. When dealing with splicing variants, we quantify the magnitude of the effect using the difference of Percent Spliced In (PSI) between the two conditions. These two measures turn out to be equivalent and can be summarized using the following formula: \begin{equation} PSI \:=\: f \:=\: \frac{\#counts*\_variant_1}{\#counts*\_variant_1 + \#counts\_variant_2} \end{equation} \begin{equation} \Delta PSI \:=\: PSI_{cond1} - PSI_{cond2} \end{equation} \begin{equation} \Delta f \:=\: f_{cond1} - f_{cond2} \end{equation} In this formula, $\#counts*\_variant_1$ correspond to the normalized number of reads of the $variant_1$, itself normalized for the variant length. Indeed, by construction, $variant_1$ always have a length greater than or equal to the $variant_2$. That's why we divide the normalized number of reads of the $variant_1$ by the ratio of the length of the $variant_1$ and the $variant_2$. The $\Delta$PSI/$\Delta$f is computed as follows: \begin{itemize} \item First, individual (per replicate) PSI/f are calculated. If counts for both upper and lower paths are too low ($<10$) after normalization, the individual PSI/f are not computed. \item Then mean PSI/f are computed for each condition. If more than half of the individual PSI/f were not calculated at the previous step, the mean PSI/f is not computed either. \item Finally, we output $\Delta$PSI/$\Delta$f. Unless one of the mean PSI/f of a condition could not be computed, $\Delta$PSI/$\Delta$f is calculated subtracting one condition PSI/f from another. $\Delta$PSI/$\Delta$f absolute value vary between 0 and 1, with values close to 0 indicating low effects and values close to 1 strong effects. Note that the conditions are ordered alphabetically, and that \kde~substract the condition coming first in the alphabet to the other. \end{itemize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Case studies} \label{sec:casestudies} To detect SNVs (SNPs, mutations, RNA editing) or alternative splicing (AS) in the expressed regions of the genome, \kp~can be run on RNA-seq data. Counts can then be analysed using \kde. We present two distinct case study with \kde: analysis of AS events and analysis of SNVs. \subsection{Application of \kde~to alternative splicing} \label{subsec:AS} This first example corresponds to the case of differential analysis of alternative splicing (AS) events. The sample data presented here is a subset of the case study used in \cite{Benoit-Pilven} (\url{http://kissplice.prabi.fr/pipeline_ks_farline/}). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Dataset} The data used in this example comes from the ENCODE project \cite{Djebali2012}. The samples are from a neuroblastoma cell line, SK-N-SH, with or without a retinoic acid treatment. Each condition is composed of two biological replicates. The data are paired-end.\\ In a preliminary step, \kp~has been run to analyse these two conditions. Results from \kp~(type 1 events) were then mapped to the reference genome with \software{STAR} \cite{Dobin2013} and analyzed with \krg. \krg~enables to annotate the AS events discovered by \kp. It assigns to each event a gene and a type of alternative splicing (Exon Skipping (ES), Intron Retention (IR), Alternative Donor (AltD), Alternative Acceptor (AltA), \dots).\\ For further information on these tools (\kp~ and \krg), please refer to the manual that can be found on this web page: \url{http://kissplice.prabi.fr/}.\\ The output file of \krg~is a tab-delimited file that stores the annotated alternative splicing events found in the dataset. Below is an extract of this file (the first 3 rows and first 10 columns), where each row is one alternative splicing event of our data: <>= fileInAS <- system.file("extdata", "output_k2rg_alt_splicing.txt", package = "kissDE") exampleK2RG <- read.table(fileInAS) names(exampleK2RG) <- c("Gene_Id","Gene_name", "Chromosome_and_genomic_position","Strand","Event_type", "Variable_part_length","Frameshift_?","CDS_?","Gene_biotype", "number_of_known_splice_sites/number_of_SNPs") print(head(exampleK2RG[,c(1:10)], 3), row.names = FALSE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Load data} The \Rfunction{kissplice2counts} function allows to load directly the \krg~ output file (here called \file{output_k2rg_alt_splicing.txt}) into a format compatible with \kde's main functions.\\ \bioccomment{\Robject{fileInAS} contains the absolute path of the file on the user's hard disk.}\\ The \Rcode{k2rg} parameter is set to \Rcode{TRUE} to indicate that the file comes from \krg~and not directly from \kp. As these samples are paired-end, the \Rcode{pairedEnd} parameter is set to \Rcode{TRUE}. The \Rcode{counts} parameter must be set to the same value (i.e., 2) used in \kp~and \krg~to indicate which type of counts are given in the input. Here the exonic reads are not taken into account (\Rcode{exonicReads = FALSE}). Only junction reads will be used (see Figure \ref{fig:readstype}). The table of counts is stored in a \Robject{myCounts\_AS} object (for a detailed description of its structure, see section \ref{subsubsec:input_krg}): <>= fileInAS <- system.file("extdata", "output_k2rg_alt_splicing.txt", package = "kissDE") myCounts_AS <- kissplice2counts(fileInAS, pairedEnd = TRUE, k2rg = TRUE, counts = 2, exonicReads = FALSE) head(myCounts_AS$countsEvents) @ To perform the differential analysis, a vector that describes the experimental plan is needed. In this case study, there are two replicates of the SK-N-SH cell line without treatment (SKNSH) followed by two replicates of the same cell line treated with retinoic acid (SKSNH-RA). So the \Robject{myConditions\_AS} vector is defined as follows: <>= myConditions_AS <- c(rep("SKNSH",2), rep("SKNSH-RA",2)) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Quality control} Before running the differential analysis, we check that the data was loaded correctly, using the \Rfunction{qualityControl} function. <>= qualityControl(myCounts_AS, myConditions_AS) @ \begin{figure} \includegraphics[page=1,width=0.45\textwidth]{kissDE-qualityControl_AS.pdf} \includegraphics[page=2,width=0.45\textwidth]{kissDE-qualityControl_AS.pdf} \caption{\label{fig:qcAS}Quality control plots on alternative data. Left: Heatmap of the sample-to-sample distances for the alternative splicing dataset. Right: Principal Component Analysis for the alternative splicing dataset.} \end{figure} On both plots returned by the \Rfunction{qualityControl} function (Figure \ref{fig:qcAS}), the replicates of the same condition seem to be more similar between themselves than to the samples of the other condition. On the heatmap (left of Figure \ref{fig:qcAS}), the samples of the same condition cluster together. On the PCA plot (right of Figure \ref{fig:qcAS}), the first principal component (which summarises 88\% of the total variance) clearly discriminates the two conditions. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Differential analysis} The main function of \kde, \Rfunction{diffExpressedVariants}, can now be run to compute the differential analysis. Outputs are stored in a \Robject{myResult\_AS} object (for a detailed description of its structure, see section \ref{subsubsec:finaltable}) and the result for the first three events is given below: <>= myResult_AS <- diffExpressedVariants(myCounts_AS, myConditions_AS) head(myResult_AS$finalTable, n = 3) @ The first event in the \Robject{myResult\_AS} output has a very low p-value (\Rcode{Adjusted\_pvalue} column, less than \Rcode{2.2e-16}) and a very contrasted $\Delta PSI$ (\Rcode{Deltaf/DeltaPSI} column, equal to \Rcode{-0.804}) close to the maximum value (1 in absolute). This gene is differentially spliced. When the SK-N-SH cell line is treated with retinoic acid, the inclusion variant becomes the major isoform. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Export results} In order to facilitate the downstream analysis of the results, two tables are exported: the result table (\Robject{myResults\_AS{\$}finalTable} object, see section \ref{subsubsec:finaltable}) is saved in a \file{results_table.tab} file and the PSI table (\Robject{myResults\_AS{\$}$^{\backprime}$f/psiTable$^{\backprime}$}, see section \ref{subsubsec:psitable}) is saved in a \file{psi_table.tab} file. Here are the commands to carry out this task: <>= writeOutputKissDE(myResults_AS, output = "results_table.tab") writeOutputKissDE(myResults_AS, output = "psi_table.tab", writePSI = TRUE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Application of \kde~to SNPs/SNVs} \label{subsec:SNV} This second example present an analysis of SNPs/SNVs done with \kde~on RNA-Seq data from a subset of the case study presented in \cite{Lopez-Maestre2016} (\url{http://kissplice.prabi.fr/TWAS/}).\\ The original purpose of this study was to demonstrate that the method can deal with pooled data (i.e. individuals are pooled prior to sequencing). Pooling can be used to decrease the costs. It is also sometimes the only option, when too few RNA is available per individual. The method can in principle be used on unpooled data, polyploid genomes, and for the detection of somatic mutations, but has for now only been evaluated for the detection of SNPs/SNVs in pooled RNAseq data.\\ In the remaining, we use the term SNV, which designates a variation of a single nucleotide, without any restriction on the frequency of the two alleles. The term SNP is indeed classically used for variants present in at least 1\% of a population. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Dataset} The dataset comes from the human GEUVADIS project. Two populations were selected: Toscans (TSC) and Central Europeans (CEU). For each population, we selected 10 individuals, which are pooled in two groups of 5. Each group corresponds to a replicate for \kde. The conditions being compared are the populations. \begin{figure} \includegraphics[width=0.7\textwidth]{experimental_design_SNP.png} \caption{\label{fig:designSNP}Experimental design of the SNP dataset. Each cross corresponds to an individual.} \end{figure} The data are paired-end. So each sample consists of 2 files. In total, 8 files have been used: 4 files for the two TSC samples and 4 files for the two CEU samples. Paired-end files from a same sample have been given as following each other to \kp.\\ \kp~outputs a fasta file that stores SNVs found in the dataset. Its structure is described in section \ref{subsubsec:input_ks}. The first SNV is presented below: <>= headfasta <- system.file("extdata", "head_output_kissplice_SNV_fasta.txt", package="kissDE") writeLines(readLines(headfasta)) @ Events are reported in 4 lines, the two first represent one allele of the SNV, the two last the other allele. Thus the sequences only differ from each other at one position which corresponds to the SNV, here A/C in the center of the sequence (at position 42). Because \kp~was run with the default value of the \Rcode{counts} parameter (i.e., 0), the counts have the following format \Rcode{C1\_x|C2\_y|...|Cn\_z}. In this example, there are 8 counts because we input 8 files. Each count corresponds to the reads coming from each file that could be mapped on the variant, in the order they have been passed to \kp. This information is particularly important in \kde~since it represents the counts used for the test. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Load data} The first step is to convert this fasta file (here called \file{output_kissplice_SNV.fa}) into a format that will be used in \kde~ main functions, thanks to the \Rfunction{kissplice2counts} function.\\ \bioccomment{\Robject{fileInSNV} contains the absolute path of the file on the user's hard disk.}\\ Due to paired-end RNA-Seq data, the \Rcode{pairedEnd} parameter was set to \Rcode{TRUE}. This conversion in a table of counts is stored in the \Robject{myCounts\_SNV} object (for a detailed description of its structure, see section \ref{subsubsec:input_ks}) and can be done as follows: <>= fileInSNV <- system.file("extdata", "output_kissplice_SNV.fa", package = "kissDE") myCounts_SNV <- kissplice2counts(fileInSNV, pairedEnd = TRUE) head(myCounts_SNV$countsEvents) @ To perform the differential analysis, a vector with the conditions has to be provided.\\ In the example, there are two replicates of TSC and two replicates of CEU, thus the condition vector \Robject{myConditions\_SNV} is: <>= myConditions_SNV <- c(rep("TSC",2), rep("CEU",2)) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Quality control} Before running the differential analysis, we recommand to check if the data was correctly loaded, by running the \Rfunction{qualityControl} function. <>= qualityControl(myCounts_SNV, myConditions_SNV) @ \begin{figure} \includegraphics[page=1,width=0.45\textwidth]{kissDE-qualityControl_SNV.pdf} \includegraphics[page=2,width=0.45\textwidth]{kissDE-qualityControl_SNV.pdf} \caption{\label{fig:qcSNV}Quality control plots on SNV data. Left: Heatmap of the sample-to-sample distances on SNV data. Right: Principal Component Analysis on SNV data.} \end{figure} On both plots outputed (Figure \ref{fig:qcSNV}), the replicates of the same condition seem to be more similar between themselves than to the samples of the other condition. On the heatmap (left of Figure \ref{fig:qcSNV}), the samples of the same condition cluster together. On the PCA plot (right of Figure \ref{fig:qcSNV}), the first principal component (which summarises 88\% of the total variance) clearly discriminates the two conditions. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Differential analysis} The main function of \kde, \Rfunction{diffExpressedVariants}, can now be run to compute the statistical test.\\ Outputs are stored in a \Robject{myResult\_SNV} object (for a detailed description of its structure, see section \ref{subsubsec:finaltable}) and the result for the first three events is printed: <>= myResult_SNV <- diffExpressedVariants(myCounts_SNV, myConditions_SNV) head(myResult_SNV$finalTable, n = 3) @ The first event in the \Robject{myResult\_SNV} output has a low p-value (\Rcode{Adjusted\_pvalue} column, equal to \Rcode{8.63e-13}) and a very high absolute value of $\Delta f$ (\Rcode{Deltaf/DeltaPSI} column, equal to \Rcode{-0.926}) close to the maximum value (1 in absolute). This SNP would typically be population specific. One allele is enriched in the Toscan population, the other in the European population. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Export results} We consider as significant the events that have an adjusted p-value lower than 5\%, so we set \Rcode{adjPvalMax = 0.05}. Results passing this threshold are saved in a \file{final_table_significants.tab} file, with the \Rfunction{writeOutputKissDE} function, as follows: <>= writeOutputKissDE(myResults_SNV, output = "final_table_significants.tab", adjPvalMax = 0.05) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Time / Requirements} \label{subsec:time} The data loading function (\Rfunction{kissplice2counts}) and the statistical analysis function (\Rfunction{diffExpressedVariants}) are the most time-consuming steps. Here is an example of the running time of these two functions on the two complete datasets presented in the case studies(section \ref{sec:casestudies}). The time presented were evaluated on a desktop computer with the following caracteristics: Intel Core i7, CPU 2,60 GHz, 16G RAM. \begin{table}[h] \begin{tabular}{c|c|c|c|c} \multirow{2}{*}{Dataset} & \multirow{2}{*}{Options} & Number of & \multicolumn{2}{|c}{Running time} \\ & & events & \footnotesize{\Rfunction{kissplice2counts}} & \footnotesize{\Rfunction{diffExpressedVariants}} \\ \hline \multirow{3}{*}{AS data} & \Rcode{counts=2}, & \multirow{3}{*}{59132} & \multirow{3}{*}{7m} & \multirow{3}{*}{50m} \\ & \Rcode{pairedEnd=TRUE} & & & \\ & \Rcode{k2rg=TRUE} & & & \\ \hline \multirow{2}{*}{SNV data} & \Rcode{counts=0}, & \multirow{2}{*}{64824} & \multirow{2}{*}{2m} & \multirow{2}{*}{1h} \\ & \Rcode{pairedEnd=TRUE} & & & \\ \end{tabular} \caption{Profiling. Running time of the two principal functions of \kde~(\Rfunction{kissplice2counts} and \Rfunction{diffExpressedVariants} for two datasets (AS dataset from the ENCODE project \cite{Djebali2012} described in section \ref{subsec:AS} and SNP dataset from the GEUVADIS project \cite{Lappalainen2013} described in section \ref{subsec:SNV}).} \label{tab:time} \end{table} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session info} <>= sessionInfo() @ \bibliography{bibliography} \end{document}