%\VignetteIndexEntry{MultiRNAflow: A R package for analysing RNA-seq raw counts with different time points and several biological conditions.} %\VignettePackage{MultiRNAflow} %\VignetteEngine{knitr::knitr} \documentclass{article} <>= BiocStyle::latex(use.unsrturl=FALSE) @ %\usepackage{booktabs} % book-quality tables \usepackage{hyperref} \setcounter{secnumdepth}{4} \setcounter{tocdepth}{3} \newcommand{\subsubsubsection}[1]{\paragraph{#1}\mbox{}\\} \newcommand\BiocStyle{\Rpackage{BiocStyle}} \newcommand\knitr{\Rpackage{knitr}} \newcommand\rmarkdown{\Rpackage{rmarkdown}} \newcommand\Sweave{\software{Sweave}} \newcommand\latex[1]{{\ttfamily #1}} \makeatletter \def\thefpsfigure{\fps@figure} \makeatother \newcommand{\exitem}[3]{% \item \latex{\textbackslash#1\{#2\}} #3 \csname#1\endcsname{#2}.% } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \title{MultiRNAflow: An R package for analysing RNA-seq raw counts with different time points and several biological conditions} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \author[1]{Loubaton Rodolphe} \author[1]{Champagnat Nicolas} \author[1]{Vallois Pierre} \author[2,3]{Vallat Laurent} \affil[1]{Universit\'e de Lorraine, CNRS, Inria, IECL, F-54000 Nancy, France} \affil[2]{Universit\'e de Strasbourg, INSERM, IRFAC UMR-S1113, Strasbourg, France} \affil[3]{Department of molecular genetic of cancers, Molecular hematology unit, Strasbourg University Hospital, Strasbourg, France} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} <>= library(knitr) opts_chunk$set( concordance=TRUE ) @ \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{abstract} Our R package MultiRNAflow provides an easy to use unified framework allowing to make both unsupervised and supervised (DE) analysis for datasets with an arbitrary number of biological conditions and time points. In particular, our code makes a deep downstream analysis of DE information, e.g. identifying temporal patterns across biological conditions and DE genes which are specific to a biological condition for each time. \end{abstract} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newpage \tableofcontents \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% <>= options(prompt = " ", continue = " ") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction}\label{sec:introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Context}\label{sec:context} In eukaryotic cells, genes are expressed in the form of RNA molecules (transcription) which are then translated into proteins with a cellular function. In resting cells, at the steady state, transcription is affected by stochastic phenomena generating a transcriptional noise within cells. After modification of the cellular environment (cellular stress, receptor activation), hundreds of genes are activated, inducing a dynamic temporal transcriptional response allowing an adapted response of the cells to the initial modification of the environment \cite{Yosef2013GRNdynamicPerturbation}. Alterations in these temporal transcriptional responses are at the origin of pathologies (infection, cancer) and are extensively studied by biologists \cite{BarJoseph2012DynamicBiologicalProcesses} through sometimes complex experimental designs. Recent technological developments now make it possible to quantify the transcription of all genes in the genome by sequencing RNA molecules (RNAseq). These analyses generate raw count data whose properties (discrete data) are different from the fluorescence intensity data (continuous data) generated by previous microarray techniques. These data are presented as a table reporting, for each sample, the number of reads for each gene, obtained from the alignment of collected RNA transcripts to a reference genome (or transcriptome). The raw count of a gene (or transcript) corresponds to the number of reads mapped to the RNA sequence of this gene (or transcript). Experimental or systematic errors may occur during sequencing (uncertainty on sequencing depth, effective library sizes, ...), which require the transformation of the original raw counts.\newline The first methods developed consist to compute either count per million (CPM) or reads per kilobase of transcripts per million reads mapped (RPKM) \cite{Mortazavi2008QuantifyingTranscriptomesRNAseq} or transcripts per million (TPM) \cite{Li2010TPM,Wagner2012TPM} as they were supposed to be a good measure of RNA molar concentration (RMC) of a transcript in a sample. Although the previous methods are useful for comparing gene expressions within a sample, as they represent the relative abundance of a gene or transcript in a sample, they should not be used for comparison between samples \cite{Zhao2020MisuseTPMRPKM,Wagner2012TPM} above all if samples belong to different biological conditions and/or time points.\newline New methods of normalization have been developed in order to be able both to compare gene expression within a sample and gene expression between samples which belong to different biological conditions and/or time points. These methods of normalization, ensure that differences between samples are only due to their membership to different biological conditions and/or time points. The most used R package for normalization are DESeq2 \cite{Love2014DESeq2} and EdgeR \cite{Robinson2010edgeR}. The main goals of normalization of raw counts data are both to allow for unsupervised analysis of data (within samples or between samples) and to compare gene reads in different biological condition and/or time points, to detect so-called \textbf{Differentially Expressed} (DE) genes. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{State of the art}\label{sec:stateart} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Several R packages propose tools to normalize data, find DE genes and realize unsupervised analysis, such as IDEAL \cite{Marini2020IDEAL}, RNASeqR \cite{Chao2021RNASeqR}, GEO2RNAseq \cite{Seelbinder2019GEO2RNAseq} and RNAflow \cite{Lataretu2020RNAflow}. These packages use the package DESeq2 and/or the package EdgedR in order to realize the normalization and the DE analysis. They can all detect DE genes in the case where samples belong to different biological conditions, although RNASeqR is limited to only two biological conditions. These packages were not designed to deal with temporal data, although they could be adapted to this situation. However, none of them can analyse RNA-seq data with both several time points and several biological conditions. The package IDEAL allows to study more than two biological conditions but all the other packages are restricted only to two biological conditions. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Our R package MultiRNAflow}\label{sec:DescriptionPackage} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Our R package MultiRNAflow, built from the R package DESeq2 \cite{Love2014DESeq2}, provides an easy to use unified framework allowing to automatically make both unsupervised and supervised (DE) analysis for datasets with an arbitrary number of biological conditions and time points. Specifically, our package realizes: \begin{enumerate} \item Exploratory (unsupervised) analyses of the data. \item Statistical (supervised) analysis of transcriptional response (differentially expressed (DE) genes). \item Analysis of temporal expression profiles of DE genes from different groups. \item Functional analysis of temporal expression clusters. \end{enumerate} In particular, our code makes a deep downstream analysis of DE information, e.g. identifying temporal patterns across biological conditions and DE genes which are specific to a biological condition for each time. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Supported dataset}\label{sec:dataset} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Our package takes as input a transcriptional dataset with an experimental design that includes multiple groups of individuals and/or multiple times, t0 (reference time) and ti (1 $\leq$ i $\leq$ n). The dataset is a table of raw counts where lines correspond to genes and columns correspond to samples. Each sample show the raw counts of an individual sequencing, corresponding to a biological condition, an individual sampling in this biological condition and a time. In this document, we will illustrate the use of our package on four examples of datasets (see section \nameref{sec:DatasetPackage}) belonging to three cases: \begin{itemize} \item \textbf{Case 1}. Samples belong to different biological conditions. \item \textbf{Case 2}. Measures were realized at different time points. \item \textbf{Case 3}. Samples belong to different biological conditions and measures were realized at different time points. \end{itemize} \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Steps of the algorithm}\label{sec:stepsalgo} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{figure} \centering \includegraphics[width=0.75\linewidth,height=0.75\textheight]{MultiRNAflowImage/MultiRNAflow_Figure.pdf} \caption{Outputs from the package with a dataset containing several biological conditions and several time points. The experimental design is shown in (A). Exploratory analysis includes 3D PCA (B), temporal clustering of gene expression (C) and detailed temporal gene expression (D). Supervised statistical analysis (analysis map shown in (E)) include DE genes between each time and the reference time for each biological condition (F and G); specific DE genes for each biological condition at each time (H) or at at least one time (I); signature DE genes or each biological condition and each time (J). GO enrichment analysis is realized with the R package gprofiler2 (K) or by generating input files for several GO software programs, such as Webgestalt (L) or GSEA (M).} \label{fig:ApllicationNoteImage} \end{figure} The package MultiRNAflow realizes the following steps: \begin{itemize} \item Normalization, realized with the R package DESeq2 \cite{Love2014DESeq2}. \item Exploratory data analysis (unsupervised) which includes \begin{itemize} \item Visualization of individual patterns using factorial analysis with the R package FactoMineR \cite{Le2008FactoMineR}. \item Visualization of biological conditions and/or temporal clusters with the R package ComplexHeatmap \cite{Gu2016ComplexHeatmaps} \item Visualization of groups of genes with similar temporal behavior with the R package Mfuzz \cite{Futschik2005SoftClusteringMicroarray,Kumar2007Mfuzz} \end{itemize} \item Statistical analysis (supervised) of the transcriptional response of different groups of individuals over time with the R package DESeq2 \cite{Love2014DESeq2}, which includes \begin{itemize} \item Temporal statistical analysis \item Statistical analysis by group \item Combination of temporal and group statistical analysis \item Gene Ontology (GO) enrichment analysis using the R package GO enrichment with gprofiler2 \cite{Kolberg2020gprofiler2} (Figure~\ref{fig:ApllicationNoteImage}.K), and automatically generates outputs that can be implemented in DAVID \cite{Sherman2021DAVID}, Webgestalt \cite{Liao2019WebGestalt} (Figure~\ref{fig:ApllicationNoteImage}.L) and GSEA \cite{Subramanian2005GSEA} (Figure~\ref{fig:ApllicationNoteImage}.M) for further analysis using these databases. \end{itemize} \end{itemize} Below, we will give a short description of each of these steps before a full description of the package outputs for four examples of datasets. The following figure illustrates the short description below. It gathers a selection of graphs produced by our package for the \nameref{sec:dataWeger2021} \cite{Weger2021TemporalMouseLiver} in \textbf{case 3}. The experimental design is shown in Figure\ref{fig:ApllicationNoteImage}.A and Figure~\ref{fig:ApllicationNoteImage}.E. The authors analyzed the role of invalidation (ko) of two genes (Bmal1 and Cry1\slash{2}) on murine transcriptional dynamics over the course of the day (4 groups (Bmal1 ko (invalidated gene), Bmal1 wt (wild type, not invalidated), Cry1{\slash}2 ko, Cr1\slash{2} wt) $\times$ 4 individuals per group $\times$ 6 times $=$ 48 transcriptional time points) (Figure~\ref{fig:ApllicationNoteImage}.A). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Normalization}\label{sec:stepsnormalization} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The function \textbf{DATAnormalization()} of our package allows to realize the three methods of normalization proposed in DESeq2, which must always be performed on raw counts: \begin{itemize} \item Relative log expression (rle) \cite{Anders2010SizeFactorVST}. Each column of the raw counts is scaled by a specific scalar called size factors, estimated using the "median ratio method". \item Regularized logarithm (rlog) \cite{Love2014DESeq2}. This method of normalization transforms the count data to the log2 scale in a way which minimizes differences between samples for rows (so genes) with small counts. This transformation removes the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. This method of normalization is realized by the r function \texttt{rlog()} of the R package DESeq2. \item Variance Stabilizing Transformation (vst) \cite{Anders2010SizeFactorVST}. This method of normalization is similar to the rlog normalization. The vst normalization is faster but the rlog normalization is more robust in the case when the size factors vary widely. This method of normalization is realized by the r function \texttt{vst()} of the R package DESeq2. \end{itemize} As mentioned in the DESeq2 manual, the rle transformation is used to realize DE analysis and the rlog and vst transformations are advised for unsupervised analysis (factorial analysis, clustering) or other machine learning methods. In addition, the function \textbf{DATAnormalization()} allows to plot the distribution of normalized read counts for each sample using boxplots. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Exploratory data analysis (unsupervised)}\label{sec:stepsPCAhcpc} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsubsection{Factorial analysis}\label{sec:stepsPCA} Factorial analysis is realized by the two functions \textbf{PCAanalysis()} and \textbf{HCPCanalysis()} which implement two methods: Principal Component Analysis (PCA) and Hierarchical Clustering on Principle Components (HCPC). The two methods allow to visualize the temporal evolution of the transcription within each group of individuals and similarities or differences in transcriptional behaviors between groups. Of note, the PCA visualization is optimized thanks to the dynamic 3D PCA (see Figure~\ref{fig:ApllicationNoteImage}.B) allowing several viewing angles. \begin{itemize} \item \textbf{Case 1}. In the PCA graphs, samples are colored according to biological condition. \item \textbf{Case 2}. In the PCA graphs, consecutive time points within a sample are linked to help visualization of temporal patterns. \item \textbf{Case 3}. The PCA graphs combine the two previous displayings. \end{itemize} \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsubsection{Visualization of groups of genes with similar temporal behavior} \label{sec:stepsMFUZZ} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The R function \textbf{MFUZZanalysis()} allows, in \textbf{cases 2} and \textbf{cases 3}, to find the most common temporal behavior among all genes and all individuals in a given biological condition. This is done using the R package Mfuzz \cite{Futschik2005SoftClusteringMicroarray,Kumar2007Mfuzz} based on soft clustering. When there are several replicates per time, the Mfuzz package realizes soft clustering from the mean expression per time for each gene (see Figure~\ref{fig:ApllicationNoteImage}.C). When there are several biological conditions, the algorithm realizes the Mfuzz analysis for each biological condition. As in any clustering method, we need to find out the optimal number of clusters. Although a method is already implemented in the Mfuzz package, this method seems to fail when the number of genes is too big. Our function \textbf{MFUZZclusternumber()} finds the optimal number of cluster using kmeans, from the package R stats \cite{Rteam2021R} or HCPC. Among the outputs, \textbf{MFUZZclusternumber()} returns \begin{itemize} \item a graph indicating the selected number of clusters for each biological condition \item the results of the soft clustering for each biological condition \end{itemize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsubsection{Visualization of the data}\label{sec:stepsTemporalExpr} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The R function \textbf{DATAplotExpressionGenes()} allows to plot gene expression profiles according to time and/or biological conditions (see Figure~\ref{fig:ApllicationNoteImage}.D). The user can either use raw counts data or normalized data. \begin{itemize} \item \textbf{Case 1}. The output is a graph where are plotted: a box plot, a violin plot, and error bars (standard deviation) for each biological condition. \item \textbf{Case 2}. The output is a graph where are plotted: the evolution of the expression of each replicate across time (red lines) and the evolution of the mean and the standard deviation of the expression across time (black lines). \item \textbf{Case 3}. The output is a graph where are plotted: the evolution of the mean and the standard deviation of the expression across time for each biological condition. \end{itemize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Statistical analysis of the transcriptional response of different groups of individuals over time} \label{sec:SupervisedStatisticalAnalysis} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsubsection{Differentially expressed (DE) genes with DESeq2}\label{sec:stepsDE} The function \textbf{DEanalysisGlobal()} detects DE genes. \textbf{Case 1}. Our algorithm search for differentially expressed (DE) genes between all pairs of biological conditions. This allows in particular to determine which genes are specific to each biological condition. A gene is called specific to a biological condition BC, if the gene is DE between BC and any other biological conditions, but not DE between any pairs of other biological conditions.\newline Among the outputs, \textbf{DEanalysisGlobal()} returns \begin{itemize} \item a Venn barplot which gives the number of genes for each possible intersection. We consider that a set of pairs of biological conditions forms an intersection if there is at least one gene which is DE for each of these pairs of biological conditions, but not for the others. \item a barplot which gives the number of specific (and over- and under-expressed, also often called up- and down-regulated) genes per biological condition. \end{itemize} \textbf{Case 2}. Our algorithm looks for differentially expressed genes between each time ti (1$\leq i\leq n$) and the reference time t0.\newline Among the outputs, \textbf{DEanalysisGlobal()} returns \begin{itemize} \item an alluvial graph of differentially expressed (DE) genes. \item a Venn barplot which gives the number of genes per temporal pattern. By temporal pattern, we mean the set of times ti such that the gene is DE between ti and the reference time t0. \end{itemize} \textbf{Case 3}. Our algorithm realizes a mix of the two previous cases. First, for each biological condition, the algorithm realizes \textbf{Case 2}. Then for each time, the algorithm realizes \textbf{Case 1}. We then can find specific genes. A gene is called specific to a biological condition BC at a time ti, if the gene is DE between BC and any other biological conditions at time ti, but not DE between any pairs of other biological conditions at time ti. The algorithm also finds signature genes: a gene is called signature of a biological condition BC at a given time ti if the gene is specific for BC at time ti and DE between ti versus t0 for BC. Among the outputs, \textbf{DEanalysisGlobal()} returns \begin{itemize} \item An alluvial graph of differentially expressed (DE) genes, for each biological condition (see Figure~\ref{fig:ApllicationNoteImage}.G). \item A barplot which gives the number of DE genes per time, for each biological condition (see Figure~\ref{fig:ApllicationNoteImage}.F). \item A barplot which gives the number of specific genes for each biological condition, one per time (see Figure~\ref{fig:ApllicationNoteImage}.H). \item An alluvial graph of genes which are specific at at least one time, for each biological condition (see Figure~\ref{fig:ApllicationNoteImage}.I). \item A graph which gives for each biological condition, the number of signature genes and non signatures genes per time ti versus the reference time t0 (see Figure~\ref{fig:ApllicationNoteImage}.J). \end{itemize} \subsubsubsection{Heatmaps, ration intensity (MA) plots and volcano plots}\label{sec:stepsVolMAheat} Clustering of samples versus genes (heatmap, not shown) allows visualization of correlations between gene expressions according to biological conditions or times. Clustering of samples versus samples allows visualization of correlations between individuals and groups. As there usually are to many genes in a dataset, the heatmaps are realized after the supervised analysis in order to reduce the number of genes for the heatmap. \clearpage \subsubsubsection{Gene ontology and gene enrichment}\label{sec:stepsGO} Gene Ontology (GO) enrichment analysis is a technique for interpreting DE genes through the Gene Ontology system of classification. The goal is to retrieve a functional profile of DE genes and better understand the underlying biological processes. We recommend the most used online tools and software: GSEA \cite{Subramanian2005GSEA}, DAVID \cite{Sherman2021DAVID}, WebGestalt \cite{Liao2019WebGestalt}, g:Profiler \cite{Raudvere2019gProfiler}, Panther \cite{Thomas2020PANTHER}, ShinyGO \cite{GE2020ShinyGO}, Enrichr \cite{Kuleshov2016Enrichr} and GOrilla \cite{Eden2009GOrilla}. Each of these softwares and online tools requires specific input files in order to realize their analysis. The R function \textbf{GSEApreprocessing()} automatically creates all required files.\newline Figure~\ref{fig:ApllicationNoteImage}.L and Figure~\ref{fig:ApllicationNoteImage}.M were produced by the online tool WebGestalt and the software GSEA. Alternatively, the function \textbf{GSEAquickAnalysis()} provides a GSEA analysis with the R package gprofiler2 \cite{Kolberg2020gprofiler2}. Among the ouputs, \textbf{GSEAquickAnalysis()} returns %%%%%%%%%%%%%%%%%%%%%%%% \begin{itemize} \item a data.frame (output \texttt{resGSEAgprofiler2Fission\$GSEAresults}) giving information about all detected gene ontologies the list of associated genes. \item a lollipop graph (Figure~\ref{fig:FissionLollipop}) showing the most important Gene Ontologies ranked by their $-\log_{10}(\mbox{pvalue})$. The y-axis indicates the \texttt{MaxNumberGO} most significant gene ontologies and pathways associated to the selected DE genes. The gene ontologies and pathways are sorted into descending order. The x-axis indicates the -log10(pvalues). The higher is a lollipop the more significant is a gene ontology or pathway. A lollipop is yellow if the pvalues is smaller than 0.05 (significant) and blue otherwise. \item A Manhattan plot (Figure~\ref{fig:FissionManhattan}) ranking all genes ontologies according to the functional database (G0::BP, G0::CC, G0::MF and KEGG) \end{itemize} \begin{figure} \centering \includegraphics[width=0.75\linewidth,height=0.75\textheight]{MultiRNAflowImage/Fission_2-5LollipopChart.pdf} \caption{Lollipop chart giving the most significant gene ontologies} \label{fig:FissionLollipop} \end{figure} \begin{figure} \centering \includegraphics[width=0.75\linewidth,height=0.75\textheight]{MultiRNAflowImage/Fission_2-5ManhattanPlot.pdf} \caption{Mahattan plot indicating all genes ontologies} \label{fig:FissionManhattan} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Dataset used as examples in the package} \label{sec:DatasetPackage} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% In order to explain the use of each function in our package, we use four datasets. \subsubsection{Mouse dataset 1}\label{sec:dataAntoszewski2022} The \textbf{Mouse dataset 1} \cite{Antoszewski2022TcellNotch1} is accessible on the Gene Expression Omnibus (GEO) database with the accession number GSE169116. This dataset contains the transcription profile of 12 mice belonging to 4 biological conditions: \begin{itemize} \item 3 mice with wild type Notch1 and wild type Tcf1 \item 3 mice with wild type Notch1 and Tcf1 knocked-down \item 3 mice with Notch1 induced and wild type Tcf1 \item 3 mice with Notch1 induced and Tcf1 knocked-down. \end{itemize} The dataset contains temporal expression data of 39017 genes. Notch1 is a well-established lineage specifier for T cells and among the most frequently mutated genes throughout all subclasses of T cell acute lymphoblastic leukemia \cite{Antoszewski2022TcellNotch1}. To illustrate the use of our package in \textbf{case 1}, we selected 500 genes giving a representative sample of each DE profile across biological conditions, in particular genes that are specific to each biological condition.\newline This sub dataset is saved in the file \textbf{RawCounts\_Antoszewski2022\_MOUSEsub500}. \subsubsection{Fission dataset}\label{sec:dataLeong2014} The \textbf{Fission dataset} \cite{Leong2014FissionOsmoticStress} is accessible on the Gene Expression Omnibus (GEO) database with the accession number GSE56761. The dataset can also be obtained with the R package "fission" \cite{Leong2014FissionOsmoticStress}. This dataset contains the temporal transcription profile of 18 wild type fission yeasts (wt) and 18 fission yeasts where atf1 is knocked-out (mut), hence 36 samples. The dataset contains temporal expression data of 7039 genes. Data were collected 0, 15, 30, 60, 120 and 180 minutes after an osmotic stress. The gene atf1 codes for a transcription factor which alters sensitivity to oxidative stress. To illustrate the use of our package in \textbf{case 2}, we focus on the biological condition wt and select 500 genes giving a representative sample of each temporal DE profile in this biological condition. This sub dataset is saved in the file \textbf{RawCounts\_Leong2014\_FISSIONsub500wt}. \subsubsection{Leukemia dataset}\label{sec:dataSchleiss2021} The \textbf{Leukemia data} \cite{Schleiss2021LeukProliferation} is accessible on the Gene Expression Omnibus (GEO) database with the accession number GSE130385. This dataset contains the temporal transcription profile of 3 Proliferating (P) and 3 Non Proliferating (NP) primary chronic lymphocytic leukemia (CLL) B-cells samples. Data were collected at 0, 1h, 1h30, 3h30, 6h30, 12h, 24h, 48h and 96h after cell stimulation (so $(3+3)\times 9 = 54$ samples in total). The latest time point corresponds to the emergence of the proliferation clusters. The dataset contains temporal expression data of 25369 genes. To illustrate the use of our package in \textbf{case 3} with two biological conditions, we selected 500 genes giving a representative sample of each DE profile across time and biological conditions, in particular genes that are signature genes of each biological condition. This sub dataset is saved in the file \textbf{RawCounts\_Schleiss2021\_CLLsub500}. \subsubsection{Mouse data 2}\label{sec:dataWeger2021} The \textbf{Mouse dataset 2} \cite{Weger2021TemporalMouseLiver} is accessible on the Gene Expression Omnibus (GEO) database with the accession number GSE135898. This dataset contains the temporal transcription profile of 16 mice with Bmal1 and Cry1{\slash}2 knocked-down under an ad libitum (AL) or night restricted feeding (RF) regimen. Data were collected at 0, 4h, 8h, 16, 20h and 24h. Therefore, there are six time points and eight biological conditions. As there are only two mice per biological condition, we decided not to take into account the effect of the regimen. The dataset contains temporal expression data of 40327 genes. To illustrate the use of our package in \textbf{case 3} with more than two biological conditions, we take 500 genes, over the global 40327 genes in the original dataset. This sub dataset is saved in the file \textbf{RawCounts\_Weger2021\_MOUSEsub500}. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Preamble}\label{sec:preamble} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{R version and R packages to install}\label{sec:Rversionpackage} Before installing the necessary packages, you must install (or update) the R software in a version superior to 4.2.1 "Funny-Looking Kid" (released on 2022/06/23) from \href{https://cran.r-project.org}{CRAN (Comprehensive R Archive Network)}. Then, in order to use the MultiRNAflow package, the following R packages must be installed: \begin{itemize} \item From \href{https://cran.r-project.org}{CRAN}: \href{https://cran.r-project.org/web/packages/scales/index.html}{scales} (>= 1.2.1), \href{https://cran.r-project.org/web/packages/ggsci/index.html}{ggsci} (>= 2.9), \href{https://cran.r-project.org/web/packages/RColorBrewer/index.html}{RColorBrewer} (>= 1.1.3), \href{https://cran.r-project.org/web/packages/reshape2/index.html}{reshape2} (>= 1.4.4), \href{https://cran.r-project.org/web/packages/plyr/index.html}{plyr} (>= 1.8.8), \href{https://ggplot2.tidyverse.org}{ggplot2} (>= 3.4.0), \href{https://cran.r-project.org/web/packages/ggalluvial/index.html}{ggalluvial} (>= 0.12.3), \href{https://cran.r-project.org/web/packages/ggrepel/index.html}{ggrepel} (>= 0.9.2), \href{https://cran.r-project.org/web/packages/FactoMineR/index.html}{FactoMineR} (>= 2.6), \href{https://cran.r-project.org/web/packages/factoextra/index.html}{factoextra} (>= 1.0.7), \href{https://cran.r-project.org/web/packages/plot3D/index.html}{plot3D} (>= 1.4), \href{https://cran.r-project.org/web/packages/plot3Drgl/index.html}{plot3Drgl} (>= 1.0.3), \href{https://cran.r-project.org/web/packages/UpSetR/index.html}{UpSetR} (>= 1.4.0), \href{https://cran.r-project.org/web/packages/gprofiler2/index.html}{gprofiler2} (>= 0.2.1). % \item From \href{https://cran.r-project.org}{CRAN} and usually already included by default in R: graphics (>= 4.2.2), grDevices (>= 4.2.2), grid (>= 4.2.2), methods (>= 4.2.2), stats (>= 4.2.2), utils (>= 4.2.2). % \item From \href{https://bioconductor.org}{Bioconductor}: \href{https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html}{SummarizedExperiment} (>= 1.28.0), \href{https://bioconductor.org/packages/release/bioc/html/DESeq2.html}{DESeq2} (>= 1.38.1), \href{https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html}{ComplexHeatmap} (>= 2.14.0), \href{https://www.bioconductor.org/packages/release/bioc/html/Mfuzz.html}{Mfuzz} (>= 2.58.0). % \end{itemize} Before installing a package, for instance the package FactoMineR, the user must check if the package is already installed with the command \texttt{library(FactoMineR)}. If the package has not been previously installed, the user must use the command \texttt{install.packages("FactoMineR")} (for CRAN). For beginners in programming, we recommend to follow the steps below for importing CRAN and Bioconductor packages. For the packages which must be download from CRAN, <>= Cran.pck<-c("scales", "reshape2", "plyr", "ggplot2", "ggrepel", "ggalluvial", "FactoMineR", "factoextra", "plot3D", "plot3Drgl", "UpSetR", "gprofiler2") @ the user can copy and paste the following lines of code for each package in order to download the missing packages. <>= Select.package.CRAN<-"FactoMineR" # if(!require(package=Select.package.CRAN, quietly=TRUE, character.only=TRUE, warn.conflicts=FALSE)){ install.packages(pkgs=Select.package.CRAN, dependencies=TRUE) }# if(!require(package=Cran.pck[i], quietly=TRUE, character.only=TRUE)) @ If the package is already installed (for instance here "FactoMineR"), the previous lines of code will return nothing. For the packages which must be download from Bioconductor, <>= Bioconductor.pck<-c("Mfuzz","SummarizedExperiment","DESeq2","ComplexHeatmap") @ the user must first copy and paste the following lines of code in order to install "BiocManager" <>= if(!require(package="BiocManager", quietly=TRUE, character.only=TRUE, warn.conflicts=FALSE)){ install.packages("BiocManager") }# if(!require(package="BiocManager", quietly=TRUE, character.only=TRUE)) @ then copy and paste the following lines of code in order to install the version 3.16 of bioconductor (it works with R version 4.2.0) <>= BiocManager::install(version = "3.16") @ and then copy and paste the following lines of code for each package in order to download the missing packages. <>= Select.package.CRAN<-"DESeq2" if(!require(package=Select.package.CRAN, quietly=TRUE, character.only=TRUE, warn.conflicts=FALSE)){ BiocManager::install(pkgs=Select.package.CRAN) }# if(!require(package=Select.package.CRAN, quietly=TRUE, character.only=TRUE)) @ If the package is already installed (for instance here "DESeq2"), the previous lines of code will return nothing. Once all packages have been installed, you may load our package. <>= library(MultiRNAflow) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Main functions} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Our package contains 38 functions among which 11 are main functions. The user should only use \begin{itemize} \item \textbf{DATAprepSE()} to store all information about the dataset in a standardized way (SummarizedExperiment class object) \item these main functions for exploratory data analysis (unsupervised analysis) % \begin{itemize} \item \textbf{DATAnormalization()}. This function allows to normalize raw counts data and the results will be used by the functions \textbf{PCAanalysis()}, \textbf{HCPCanalysis()}, \textbf{MFUZZanalysis()}\newline and \textbf{DATAplotExpressionGenes()}. \item \textbf{PCAanalysis()}. The function realizes the PCA analysis. \item \textbf{HCPCanalysis()}. The function realizes the clustering analysis with the R package HCPC. \item \textbf{MFUZZanalysis()}. The function realizes the temporal clustering analysis with the R package Mfuzz \item \textbf{DATAplotExpressionGenes()}. This function allows to plot the profile expression of all chosen genes. \end{itemize} % \item these main functions for supervised analysis % \begin{itemize} \item \textbf{DEanalysisGlobal()}. This function realizes the differential expression analysis. \item \textbf{DEplotVolcanoMA()}. The function plots and save all Volcano and MA plots from the output of \textbf{DEanalysisGlobal()}. \item \textbf{DEplotHeatmaps()}. The function plots a correlation heatmap and a heatmap of the normalized data for a selection of DE genes. \item \textbf{GSEAQuickAnalysis()}. The function realizes the GSEA analysis with the R package gprofiler2. \item \textbf{GSEApreprocessing()}. The function saves files to be used by 8 GSEA software and online tools. \end{itemize} % \end{itemize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Load of the dataset}\label{sec:LoadDataset} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% If the user wants to use our package with in one of the dataset included in MultiRNAflow, the user must first write in the R console either <>= data("RawCounts_Antoszewski2022_MOUSEsub500") @ in order to load the \nameref{sec:dataAntoszewski2022}, either <>= data("RawCounts_Leong2014_FISSIONsub500wt") @ in order to load the \nameref{sec:dataLeong2014}, either <>= data("RawCounts_Schleiss2021_CLLsub500") @ in order to load the \nameref{sec:dataSchleiss2021}, either <>= data("RawCounts_Weger2021_MOUSEsub500") @ in order to load the \nameref{sec:dataWeger2021}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Structure of the dataset}\label{sec:StructureDataset} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The dataset must be a data.frame containing raw counts data. If it is not the case, the function \textbf{DATAprepSE()} will stop and print an error. Each line should correspond to a gene, each column to a sample, except a particular column that may contain strings of characters describing the names of the genes. The first line of the data.frame should contain the names of the columns (strings of characters) that must have the following structure. <>= data("RawCounts_Leong2014_FISSIONsub500wt") colnames(RawCounts_Leong2014_FISSIONsub500wt) @ In this example, "Gene" indicates the column which contains the names of the different genes. The other column names contain all kind of information about the sample, including the biological condition, the time of measurement and the name of the individual (e.g patient, replicate, mouse, yeasts culture...). Other kinds of information can be stored in the column names (such as patient information), but they will not be used by the package. The various information in the column names must be separated by underscores. The order of these information is arbitrary but must be the same for all columns. For instance, the sample "wt\_t0\_r1" corresponds to the first replicate (r1) of the wild type yeast (wt) at time t0 (reference time).\newline The information located to the left of the first underscore will be considered to be in position 1, the information located between the first underscore and the second one will be considered to be in position 2, and so on. In the previous example, the biological condition is in position 1, the time is in position 2 and the replicate is in position 3.\newline In most of the functions of our package, the order of the previous information in the column names will be indicated with the inputs \texttt{Group.position}, \texttt{Time.position} and \texttt{Individual.position}. Similarly the input \texttt{Column.gene} will indicate the number of the column containing gene names. For example, in the previous dataset, the function \textbf{DATAprepSE()} must be called with the following arguments: <>= data("RawCounts_Leong2014_FISSIONsub500wt") @ <>= resSEexample <- DATAprepSE(RawCounts=RawCounts_Leong2014_FISSIONsub500wt, Column.gene=1, Group.position=NULL, Time.position=2, Individual.position=3) @ Here, the argument \texttt{Column.gene=1} means that the first column of the dataset contain genes name, \texttt{Time.position=2} means that the time of measurements is between the first and the second underscores in the columns names, \texttt{Individual.position=3} means that the name of the individual is between the second and the third underscores in the columns names and \texttt{Group.position=NULL} means that there is only one biological condition in the dataset. Similarly, \texttt{Time.position=NULL} would mean that there is only one time of measurement for each individual and \texttt{Column.gene=NULL} would mean that there is no column containing gene names. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Preprocessing step for the four dataset with DATAprepSE()} \label{sec:PreprocessingStep} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The preprocessing step is realized by our R function \textbf{DATAprepSE()} to store all information about the dataset in a standardized way (SummarizedExperiment class object). The user must realize this step in order to realize exploratory data analysis (unsupervised analysis, section \ref{sec:ExploratoryAnalysis}) or statistical analysis of the transcriptional response (supervised analysis, section \ref{sec:SupervisedAnalysis}) \subsection{Mouse data (case 1)}\label{sec:DATAprepSEus1} In this section we use the mouse subdataset \textbf{RawCounts\_Antoszewski2022\_MOUSEsub500} (see the subsection \ref{sec:dataAntoszewski2022}) in order to explain \textbf{DATAprepSE()} in \textbf{Case 1}. The following lines of code realize the normalization step <>= resSEmus1500 <- DATAprepSE(RawCounts=RawCounts_Antoszewski2022_MOUSEsub500, Column.gene=1, Group.position=1, Time.position=NULL, Individual.position=2) @ The function returns \begin{itemize} \item \Rclass{SummarizedExperiment} class object containing all information of the dataset to be used for exploratory data analysis \item \Rclass{DESeqDataSet} class object to be used for statistical analysis of the transcriptional response. \end{itemize} Write \texttt{?DATAprepSE} in your console for more information about the function. \subsection{Fission data (case 2)}\label{sec:DATAprepSEfission} In this section we use the fission yeast subdataset \textbf{RawCounts\_Leong2014\_FISSIONsub500wt} (see the subsection \nameref{sec:dataLeong2014}) in order to explain \textbf{DATAprepSE()} in \textbf{case 2}. The following lines of code realize the normalization step <>= resSEfission500 <- DATAprepSE(RawCounts=RawCounts_Leong2014_FISSIONsub500wt, Column.gene=1, Group.position=NULL, Time.position=2, Individual.position=3) @ The function returns \begin{itemize} \item \Rclass{SummarizedExperiment} class object containing all information of the dataset to be used for exploratory data analysis \item \Rclass{DESeqDataSet} class object to be used for statistical analysis of the transcriptional response. \end{itemize} Write \texttt{?DATAprepSE} in your console for more information about the function. \subsection{Leukemia data (Case 3 with two biological conditions)} \label{sec:DATAprepSEleuk} In this section we use the Chronic lymphocytic leukemia (CLL) subdataset\newline \textbf{RawCounts\_Schleiss2021\_CLLsub500} (see subsection \nameref{sec:dataSchleiss2021}) in order to explain \textbf{DATAprepSE()} in \textbf{case 3} when there only two biological conditions. The following lines of code realize the normalization step. <>= resSEleuk500 <- DATAprepSE(RawCounts=RawCounts_Schleiss2021_CLLsub500, Column.gene=1, Group.position=2, Time.position=4, Individual.position=3) @ The function returns \begin{itemize} \item \Rclass{SummarizedExperiment} class object containing all information of the dataset to be used for exploratory data analysis \item \Rclass{DESeqDataSet} class object to be used for statistical analysis of the transcriptional response. \end{itemize} Write \texttt{?DATAprepSE} in your console for more information about the function. \subsection{Mouse data 2 (case 3 with more than two biological conditions)} \label{sec:DATAprepSEmus2} In this section we use the mouse subdataset \textbf{RawCounts\_Weger2021\_MOUSEsub500} (see Subsection \nameref{sec:dataAntoszewski2022}) in order to explain \textbf{DATAprepSE()} in \textbf{case 3} when there are more than two biological conditions. The following lines of code realize the normalization step. <>= resSEmus2500 <- DATAprepSE(RawCounts=RawCounts_Weger2021_MOUSEsub500, Column.gene=1, Group.position=1, Time.position=2, Individual.position=3) @ The function returns \begin{itemize} \item \Rclass{SummarizedExperiment} class object containing all information of the dataset to be used for exploratory data analysis \item \Rclass{DESeqDataSet} class object to be used for statistical analysis of the transcriptional response. \end{itemize} Write \texttt{?DATAprepSE} in your console for more information about the function. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Exploratory data analysis for the four dataset (unsupervised analysis)} \label{sec:ExploratoryAnalysis} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Normalization with DATAnormalization()}\label{sec:DATAnormalization} \subsubsection{Mouse data (case 1)}\label{sec:DATAnormalizationMus1} In this section we use the mouse subdataset \textbf{RawCounts\_Antoszewski2022\_MOUSEsub500} (see the subsection \ref{sec:dataAntoszewski2022}) in order to explain \textbf{DATAnormalization()} in \textbf{Case 1}. The following lines of code realize the normalization step from the results of the function \textbf{DATAprepSE()} (subsection \ref{sec:DATAprepSEmus1}) <>= resNORMmus1500 <- DATAnormalization(SEres=resSEmus1500, Normalization="rle", Blind.rlog.vst=FALSE, Plot.Boxplot=TRUE, Colored.By.Factors=TRUE, Color.Group=NULL, path.result=NULL) @ If \Rcode{Plot.Boxplot=TRUE} a boxplot showing the distribution of the normalized expression (\texttt{Normalization="rle"} means that the rle method is used) of genes for each sample is returned. If \texttt{Colored.By.Factors=TRUE}, the color of the boxplots would be different for different biological conditions. By default (if \texttt{Color.Group=NULL}), a color will be automatically applied for each biological condition. Colors can be changed the colors by creating the following data.frame <>= data.col.mus50<-data.frame(Name=c("N1wtT1wt","N1haT1wt","N1haT1ko","N1wtT1ko"), Col=c("black","red","green","blue")) @ and setting \texttt{Color.Group=data.col.mus50}. The x-labels correspond to the new name of the samples which are just biological information, time information and individual information separated by a dot. If the user wants to see the 6th first rows of the normalized data, he can write in his console \texttt{head(res.Norm.Mus500\$NormalizedData, n=6)}. The user can save the graph in a folder thanks to the input path.result. If \texttt{path.result=NULL} the results will still be plotted but not saved in a folder. Write \texttt{?DATAnormalization} in your console for more information about the function. \subsubsection{Fission data (case 2)}\label{sec:DATAnormalizationFission} In this section we use the fission yeast subdataset \textbf{RawCounts\_Leong2014\_FISSIONsub500wt} (see the subsection \nameref{sec:dataLeong2014}) in order to explain \textbf{DATAnormalization} in \textbf{case 2}. The following lines of code realize the normalization step from the results of the function \textbf{DATAprepSE()} (subsection \ref{sec:DATAprepSEfission}) <>= resNORMyeast500 <- DATAnormalization(SEres=resSEfission500, Normalization="rlog", Blind.rlog.vst=FALSE, Plot.Boxplot=FALSE, Colored.By.Factors=TRUE, Color.Group=NULL, Plot.genes=FALSE, path.result=NULL) @ If \texttt{Plot.Boxplot=TRUE} a boxplot showing the distribution of the normalized expression\newline (\texttt{Normalization="rlog"} means that the rlog method is used) of genes for each sample is returned (similar to the figure of the subsection \nameref{sec:DATAnormalizationMus1}). If \texttt{Colored.By.Factors=TRUE}, the color of the boxplots would be different for different time points. In case 2, the option \texttt{Color.Group} is not used. The x-labels indicate time information and individual information separated by a dot. For more details on the use of the inputs \texttt{Column.gene}, \texttt{Time.position}, \texttt{Group.position} and \texttt{Individual.position}, see Section \nameref{sec:StructureDataset}. If the user wants to see the 10 first rows of the normalized data, he can write in his console \texttt{head(resNORMyeast500\$NormalizedData, n=10)}. The user chooses the path of the folder where the graph can be saved. If \texttt{path.result=NULL}, results are plotted but not saved. Write \texttt{?DATAnormalization} in your console for more information about the function. \subsubsection{Leukemia data (Case 3 with two biological conditions)} \label{sec:DATAnormalizationLeuk} In this section we use the Chronic lymphocytic leukemia (CLL) subdataset\newline \textbf{RawCounts\_Schleiss2021\_CLLsub500} (see Subsection \nameref{sec:dataSchleiss2021}) in order to explain \textbf{DATAnormalization()} in \textbf{case 3} when there only two biological conditions. The following lines of code realize the normalization step from the results of the function \textbf{DATAprepSE()} (subsection \ref{sec:DATAprepSEleuk}) <>= resNORMleuk500<-DATAnormalization(SEres=resSEleuk500, Normalization="vst", Blind.rlog.vst=FALSE, Plot.Boxplot=FALSE, Colored.By.Factors=TRUE, Color.Group = NULL, path.result=NULL) @ If \texttt{Plot.Boxplot=TRUE} a boxplot showing the distribution of the normalized expression\newline (\texttt{Normalization="vst"} means that the vst method is used) of genes for each sample is returned (similar to the figure of the subsection \nameref{sec:DATAnormalizationMus1}). If \texttt{Colored.By.Factors=TRUE}, the color of the boxplots would be different for different biological conditions. By default (if \texttt{Color.Group=NULL}), a color will be automatically applied for each biological condition. You can change the colors by creating the following data.frame <>= data.col.Leuk<-data.frame(Name=c("NP","P"), Col=c("black","red")) @ and setting \texttt{Color.Group=data.col.Leuk}. The x-labels correspond to the new name of the samples which are just biological information, time information and individual information separated by a dot. If the user wants to see the 6th first rows of the normalized data, he can write in his console \texttt{head(resNORMleuk500\$NormalizedData, n=6)}. The user can save the graph in a folder thanks to the input path.result. If \texttt{path.result=NULL} the results will still be plotted but not saved in a folder. Write \texttt{?DATAnormalization} in your console for more information about the function. \subsubsection{Mouse data 2 (case 3 with more than two biological conditions)} \label{sec:DATAnormalizationMus2} In this section we use the mouse subdataset \textbf{RawCounts\_Weger2021\_MOUSEsub500} (see Subsection \nameref{sec:dataAntoszewski2022}) in order to explain \textbf{DATAnormalization()} in \textbf{case 3} when there are more than two biological conditions. The following lines of code realize the normalization step. <>= resNORMmus2500 <- DATAnormalization(SEres=resSEmus2500, Normalization="vst", Blind.rlog.vst=FALSE, Plot.Boxplot=FALSE, Colored.By.Factors=TRUE, Color.Group=NULL, path.result=NULL) @ If \texttt{Plot.Boxplot=TRUE} a boxplot showing the distribution of the normalized expression\newline (\texttt{Normalization="vst"} means that the vst method is used) of genes for each sample is returned (similar to the figure of the subsection \nameref{sec:DATAnormalizationMus1}). If \texttt{Colored.By.Factors=TRUE}, the color of the boxplots would be different for different biological conditions. By default (if \texttt{Color.Group=NULL}), a color will be automatically applied for each biological condition. You can change the colors by creating the following data.frame <>= data.col.Mus2 <- data.frame(Name=c("BmKo","BmWt","CrKo","CrWt"), Col=c("red","blue","orange","darkgreen")) @ and setting \texttt{Color.Group=data.col.Mus2}. The x-labels correspond to the new name of the samples which are just biological information, time information and individual information separated by a dot. If the user wants to see the 6th first rows of the normalized data, he can write in his console \texttt{head(resNORMmus2500\$NormalizedData, n=6)} The user can save the graph in a folder thanks to the input path.result. If \texttt{path.result=NULL} the results will still be plotted but not saved in a folder. Write \texttt{?DATAnormalization} in your console for more information about the function. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Factorial analysis: PCA with PCAanalysis() and clustering with HCPCanalysis()} \label{sec:Factorial} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Mouse data (case 1)}\label{sec:FactorialMus1} In this section we use the mouse subdataset \textbf{RawCounts\_Antoszewski2022\_MOUSEsub500} in order to explain \textbf{PCAanalysis()} and \textbf{HCPCanalysis()} in \textbf{case 1}. \subsubsubsection{PCA (case 1)}\label{sec:PCAMus1} When samples belong only to different biological conditions, the lines of code below return from the results of the function \textbf{DATAnormalization()} \begin{itemize} \item the results of the function \Rfunction{PCA()} \item one 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}) where samples are colored with different colors for different biological conditions. \end{itemize} <>= ResPCAMus500 <- PCAanalysis(SEresNorm=resNORMmus1500, sample.deletion=NULL, Supp.del.sample=FALSE, gene.deletion=NULL, Plot.PCA=TRUE, Mean.Accross.Time=FALSE, Color.Group = NULL, Cex.label=0.8, Cex.point=0.7, epsilon=0.2, Phi=25,Theta=140, D3.mouvement=FALSE, path.result=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.PCA=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} By default (if \texttt{Color.Group=NULL}), a color will be automatically assigned to each biological condition. The user can change the colors by creating the following data.frame <>= data.col.mus50<-data.frame(Name=c("N1wtT1wt","N1haT1wt","N1haT1ko","N1wtT1ko"), Col=c("black","red","green","blue")) data.col.mus50 @ and setting \texttt{Color.Group=data.col.mus50}. If the user wants to delete, for instance, the genes 'ENSMUSG00000064842' and 'ENSMUSG00000051951' (respectively the second and forth gene) and/or delete the samples 'N1wtT1wt\_r2' and 'N1haT1wt\_r5', he can set \begin{itemize} \item \texttt{gene.deletion=c("ENSMUSG00000064842","ENSMUSG00000051951")} and/or\newline \texttt{sample.deletion=c("N1wtT1wt\_r2","N1haT1wt\_r5")} \item \texttt{gene.deletion=c(2,4)} and/or \texttt{sample.deletion=c(3,6)}.\newline The integers in \texttt{gene.deletion} and \texttt{sample.deletion} represent respectively the row numbers and the column numbers of \texttt{RawCounts} where the selected genes and samples are located. \end{itemize} Write \texttt{?PCAanalysis} in your console for more information about the function. \subsubsubsection{HCPC (case 1)}\label{sec:HCPCMus1} The user can realize the clustering with HCPC using the function \textbf{HCPCanalysis()} as below. The lines of code below return from the results of the function \textbf{DATAnormalization()} %%%%% \begin{itemize} \item The results of the function \Rfunction{HCPC()} \item A dendrogram \item A graph indicating by color for each sample, its cluster and the biological condition associated to the sample \item One 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}). These PCA graphs are identical to the outputs of \textbf{PCAanalysis()} but samples are colored with different colors for different clusters. \end{itemize} <>= ResHCPCMus500 <- HCPCanalysis(SEresNorm=resNORMmus1500, gene.deletion=NULL, sample.deletion=NULL, Supp.del.sample=FALSE, Plot.HCPC=FALSE, Cex.label=0.8, Cex.point=0.7, epsilon=0.2, Phi=25, Theta=140, D3.mouvement=FALSE, path.result=NULL) @ The optimal number of clusters is automatically computed by the R function \Rfunction{HCPC()}. Write \texttt{?HCPCanalysis} in your console for more information about the function. \subsubsection{Fission data (case 2)}\label{sec:FactorialFission} In this section we use the fission yeast subdataset \textbf{RawCounts\_Leong2014\_FISSIONsub500wt} (see the subsection \nameref{sec:dataLeong2014}) in order to explain \textbf{PCAanalysis()} and \textbf{HCPCanalysis()} in \textbf{case 2}. \subsubsubsection{PCA (case 2)}\label{sec:PCAFission} In \textbf{case 2}, the lines of code below return from the results of the function \textbf{DATAnormalization()} \begin{itemize} \item the results of the function \Rfunction{PCA()} \item one 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}) where samples are colored with different colors for different time points. Furthermore, lines are drawn in gray between each pair of consecutive points for each individual. \item the same graphs as above but without lines (not shown). \end{itemize} <>= ResPCAYeast500 <- PCAanalysis(SEresNorm=resNORMyeast500, gene.deletion=NULL, sample.deletion=NULL, Supp.del.sample=FALSE, Plot.PCA=FALSE, Mean.Accross.Time=FALSE, Cex.label=0.8, Cex.point=0.7, epsilon=0.3, Phi=25,Theta=140, D3.mouvement=FALSE, path.result=NULL) @ These figures show that the temporal behavior is similar between individuals. If the user wants for instance, to perform the PCA analysis without the genes 'SPAC212.11' and 'SPNCRNA.70' (first and third gene) and/or without the samples 'wt\_t0\_r2' and 'wt\_t1\_r1', he can set \begin{itemize} \item either \texttt{gene.deletion=c("SPAC212.11","SPNCRNA.70")} and/or\newline \texttt{sample.deletion=c("wt\_t0\_r2","wt\_t1\_r1")}, \item or \texttt{gene.deletion=c(1,3)} and/or \texttt{sample.deletion=c(3,5)}.\newline The integers in \texttt{gene.deletion} and \texttt{sample.deletion} represent respectively the row numbers (resp. the column numbers) of \texttt{RawCounts} corresponding to genes (resp. samples) that need to be removed from \texttt{RawCounts}. \end{itemize} In \textbf{case 2}, the user cannot select a color for each time. A specific palette is automatically used. Write \texttt{?PCAanalysis} in your console for more information about the function. \subsubsubsection{HCPC (case 2)}\label{sec:HCPCFission} The lines of code below return from the results of the function \textbf{DATAnormalization()} \begin{itemize} \item The results of the function \Rfunction{HCPC()} \item A dendrogram \item A graph indicating by color for each sample, its cluster and the time point associated to the sample. \item one 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}). These PCA graphs are identical to the outputs of \textbf{PCAanalysis()} but samples are colored with different colors for different HCPC clusters. \end{itemize} <>= ResHCPCYeast500 <- HCPCanalysis(SEresNorm=resNORMyeast500, gene.deletion=NULL, sample.deletion=NULL, Supp.del.sample=FALSE, Plot.HCPC=TRUE, Cex.label=0.9, Cex.point=0.7, epsilon=0.2, Phi=25,Theta=140, D3.mouvement=FALSE, path.result=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.HCPC=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} This function has similar inputs as \textbf{PCAanalysis()} and the optimal number of clusters is automatically computed by the R function \Rfunction{HCPC()}. Write \texttt{?HCPCanalysis} in your console for more information about the function. \subsubsection{Leukemia data (case 3 with only two biological conditions)} \label{sec:FactorialLeuk} In this section we use the Chronic lymphocytic leukemia (CLL) subdataset\newline \textbf{RawCounts\_Schleiss2021\_CLLsub500} (see Subsection \nameref{sec:dataSchleiss2021}) in order to explain \textbf{PCAanalysis()} and \textbf{HCPCanalysis()} in \textbf{case 3} when there only two biological conditions. \subsubsubsection{PCA (case 3 two conditions)}\label{sec:PCALeuk} When samples belong to different biological conditions and different time points, the previous lines of code return from the results of the function \textbf{DATAnormalization()}: %%%%%%% \begin{itemize} \item The results of the function \Rfunction{PCA()} \item one 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}) where samples are colored with different colors for different biological conditions. Furthermore, lines are drawn between each pair of consecutive points for each sample (if \texttt{Mean.Accross.Time=FALSE}, otherwise it will be only between means). \item one 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}) for each biological condition, where samples are colored with different colors for different time points. Furthermore, lines are drawn between each pair of consecutive points for each sample (if \texttt{Mean.Accross.Time=FALSE}, otherwise it will be only between means). \item the same graphs describe above but without lines. \end{itemize} <>= ResPCALeuk500 <- PCAanalysis(SEresNorm=resNORMleuk500, gene.deletion=NULL, sample.deletion=NULL, Supp.del.sample=FALSE, Plot.PCA=FALSE, Mean.Accross.Time=FALSE, Color.Group = NULL, Cex.label=0.9, Cex.point=0.8,epsilon=0.2, Phi=25,Theta=140, D3.mouvement=FALSE, path.result=NULL, Name.folder.pca=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.PCA=TRUE} \item similar to those described in subsection \nameref{sec:PCAFission}. \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} By default (if \texttt{Color.Group=NULL}), a color will be automatically assigned to each biological condition. The user can change the colors by creating the following data.frame <>= data.col.Leuk<-data.frame(Name=c("NP","P"), Col=c("black","red")) @ and setting \texttt{Color.Group=data.col.Leuk}. The user cannot change the color associated to each time point. If the user wants to delete, for instance, the genes 'ABCA7' and 'ADAM28' (respectively the second and sixth gene) and/or delete the samples 'CLL\_P\_r1\_t1' and 'CLL\_P\_r2\_t2', he can set \begin{itemize} \item \texttt{gene.deletion=c("ABCA7","ADAM28")} and/or\newline \texttt{sample.deletion=c("CLL\_P\_r1\_t1", "CLL\_P\_r2\_t2")} \item \texttt{gene.deletion=c(2,6)} and/or \texttt{sample.deletion=c(3,13)}.\newline The integers in \texttt{gene.deletion} and \texttt{sample.deletion} represent respectively the row numbers and the column numbers of \texttt{RawCounts} where the selected genes and samples are located. \end{itemize} Write \texttt{?PCAanalysis} in your console for more information about the function. \subsubsubsection{HCPC (case 3 two conditions)}\label{sec:HCPCLeuk} The lines of code below return from the results of the function \textbf{DATAnormalization()}: %%%%%%%%%%% \begin{itemize} \item The results of the function \Rfunction{HCPC()} \item A dendrogram \item A graph indicating by color for each sample, its cluster, the biological condition and the time point associated to the sample. \item One 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}). These PCA graphs are identical to the outputs of \texttt{PCAanalysis()} with all samples but samples are colored with different colors for different clusters. \end{itemize} <>= ResHCPCLeuk500 <- HCPCanalysis(SEresNorm=resNORMleuk500, gene.deletion=NULL, sample.deletion=NULL, Supp.del.sample=FALSE, Plot.HCPC=FALSE, Phi=25,Theta=140, Cex.point=0.7, epsilon=0.2, Cex.label=0.9, D3.mouvement=FALSE, path.result=NULL, Name.folder.hcpc=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.HCPC=TRUE} \item similar to those described in subsection \nameref{sec:HCPCFission}. \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} The optimal number of clusters is automatically computed by the R function \Rfunction{HCPC()}. Write \texttt{?HCPCanalysis} in your console for more information about the function. \subsubsection{Mouse data 2 (case 3 with more than two biological conditions)} \label{sec:FactorialMus2} In this section we use the mouse subdataset \textbf{RawCounts\_Weger2021\_MOUSEsub500} (see Subsection \nameref{sec:dataWeger2021}) in order to explain \textbf{PCAanalysis()} and \textbf{HCPCanalysis()} in \textbf{case 3} when there are more than two biological conditions. \subsubsubsection{PCA (case 3 with more than two conditions)}\label{sec:PCAMus2} When samples belong to different biological conditions and different time points, the previous lines of code return from the results of the function \textbf{DATAnormalization()}: %%%%%%%%% \begin{itemize} \item The results of the function \Rfunction{PCA()} \item one 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}) where samples are colored with different colors for different biological conditions. Furthermore, lines are drawn between each pair of consecutive points for each sample (if \texttt{Mean.Accross.Time=FALSE}, otherwise it will be only between means). \item one 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}) for each biological condition, where samples are colored with different colors for different time points. Furthermore, lines are drawn between each pair of consecutive points for each sample (if \texttt{Mean.Accross.Time=FALSE}, otherwise it will be only between means). \item the same graphs describe above but without lines. \end{itemize} <>= ResPCAMus2500 <- PCAanalysis(SEresNorm=resNORMmus2500, gene.deletion=NULL, sample.deletion=NULL, Supp.del.sample=FALSE, Plot.PCA=FALSE, Mean.Accross.Time=FALSE, Color.Group=NULL, Cex.label=0.6, Cex.point=0.7, epsilon=0.2, Phi=25,Theta=140, D3.mouvement=FALSE, path.result=NULL, Name.folder.pca=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.PCA=TRUE} \item similar to those described in subsection \nameref{sec:PCAFission}. \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} By default (if \texttt{Color.Group=NULL}), a color will be automatically applied for each biological condition. You can change the colors by creating the following data.frame <>= data.col.Mus2<-data.frame(Name=c("BmKo","BmWt","CrKo","CrWt"), Col=c("red","blue","orange","darkgreen")) @ and setting \texttt{Color.Group=data.col.Mus2}. The user can not change the color associated to each time point. If you want to delete, for instance, the genes 'ENSMUSG00000025921' and 'ENSMUSG00000026113' (respectively the second and sixth gene) and/or delete the samples 'BmKo\_t2\_r1' and 'BmKo\_t5\_r2', set \begin{itemize} \item \texttt{gene.deletion=c("ENSMUSG00000025921", "ENSMUSG00000026113")} and/or\newline \texttt{sample.deletion=c("BmKo\_t2\_r1", "BmKo\_t5\_r2")} \item \texttt{gene.deletion=c(2,6)} and/or \texttt{sample.deletion=c(3,13)}.\newline The integers in \texttt{gene.deletion} and \texttt{sample.deletion} represent respectively the row numbers and the column numbers of \texttt{RawCounts} where the selected genes and samples are located. \end{itemize} Write \texttt{?PCAanalysis} in your console for more information about the function. \subsubsubsection{HCPC (case 3 with more than two conditions)} \label{sec:HCPCMus2} The lines of code below return from the results of the function \textbf{DATAnormalization()}: %%%%%%%%%%%% \begin{itemize} \item The results of the function \Rfunction{HCPC()} \item A dendrogram \item A graph indicating by color for each sample, its cluster, the biological condition and the time point associated to the sample. \item One 2D PCA graph, one 3D PCA graph and the same 3D PCA graph in a rgl window (only if \texttt{D3.mouvement=FALSE}). These PCA graphs are identical to the outputs of \texttt{PCAanalysis()} with all samples but samples are colored with different colors for different clusters. \end{itemize} The user can realize the clustering with HCPC using the function \textbf{HCPCanalysis()} as below. <>= ResHCPCMus2500 <- HCPCanalysis(SEresNorm=resNORMmus2500, gene.deletion=NULL, sample.deletion=NULL, Supp.del.sample=FALSE, Plot.HCPC=FALSE, Phi=25,Theta=140, Cex.point=0.6, epsilon=0.2, Cex.label=0.6, D3.mouvement=FALSE, path.result=NULL, Name.folder.hcpc=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.HCPC=TRUE} \item similar to those described in subsection \nameref{sec:HCPCFission}. \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} The optimal number of clusters is automatically computed by the R function \Rfunction{HCPC()}. Write \texttt{?HCPCanalysis} in your console for more information about the function. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Temporal clustering analysis with MFUZZanalysis()} \label{sec:MFUZZanalysis} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Fission data (case 2)}\label{sec:MFUZZfission} In this section we use the fission yeast subdataset \textbf{RawCounts\_Leong2014\_FISSIONsub500wt} (see the subsection \nameref{sec:dataLeong2014}) in order to explain \textbf{MFUZZanalysis()} in \textbf{case 2}. The following function realizes the temporal clustering analysis. It takes as input, a number of clusters (\texttt{DataNumberCluster}) that can be chosen automatically if \texttt{DataNumberCluster=NULL} and the results of the function \textbf{DATAnormalization()}. The lines of code below return %%%% \begin{itemize} \item the summary of the results of the function \Rfunction{mfuzz()}. \item the scaled height plot, computed with the \Rfunction{HCPC()} function, and shows the cluster chosen automatically (if \texttt{DataNumberCluster=NULL}). If \texttt{Method="hcpc"}, the function plots the scaled within-cluster inertia, but if \texttt{Method="kmeans"}, the function plots the scaled within-cluster inertia. As the number of genes can be very high, we recommend to select \texttt{Method="hcpc"} which is by default. \item the output graphs from the R package Mfuzz, showing the most common temporal behavior among all genes. \end{itemize} <>= ResMfuzzYeast500 <- MFUZZanalysis(SEresNorm=resNORMyeast500, DataNumberCluster=NULL, Method="hcpc", Membership=0.5, Min.std=0.1, Plot.Mfuzz=TRUE, path.result=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.Mfuzz=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} Other temporal information are shown in the alluvial graph of the subsection \nameref{sec:DEanalysis} that can be compared with the previous graphs. Write \texttt{?MFUZZanalysis} in your console for more information about the function. \subsubsection{Leukemia data (case 3 with only two biological condition)} \label{sec:MFUZZleuk} In this section we use the Chronic lymphocytic leukemia (CLL) subdataset\newline \textbf{RawCounts\_Schleiss2021\_CLLsub500} (see Subsection \nameref{sec:dataSchleiss2021}) in order to explain \textbf{MFUZZanalysis()} in \textbf{case 3} when there only two biological conditions. The following function realizes the temporal clustering analysis. It takes as input, a number of clusters (\texttt{DataNumberCluster}) that can be chosen automatically if \texttt{DataNumberCluster=NULL} and the results of the function \textbf{DATAnormalization()}. The lines of code below return for each biological condition %%%% \begin{itemize} \item the summary of the results of the function \Rfunction{mfuzz()} \item the scaled height plot, computed with the \Rfunction{HCPC()} function, and shows the cluster chosen automatically (if \texttt{DataNumberCluster=NULL}). If \texttt{Method="hcpc"}, the function plots the scaled within-cluster inertia, but if \texttt{Method="kmeans"}, the function plots the scaled within-cluster inertia. As the number of genes can be very high, we recommend to select \texttt{Method="hcpc"} which is by default. \item the output graphs from the R package Mfuzz showing the most common temporal behavior among all genes for each biological condition. The plots below correspond to the biological condition 'P'. \end{itemize} <>= ResMfuzzLeuk500 <- MFUZZanalysis(SEresNorm = resNORMleuk500, DataNumberCluster=NULL, Method="hcpc", Membership=0.7, Min.std=0.1, Plot.Mfuzz=FALSE, path.result=NULL, Name.folder.mfuzz=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.Mfuzz=TRUE} \item similar to those described in subsection \nameref{sec:MFUZZfission} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} Other temporal information are shown in the alluvial graph of the subsection \nameref{sec:DEanalysis} that can be compared with the previous graphs. Write \texttt{?MFUZZanalysis} in your console for more information about the function. \subsubsection{Mouse data 2 (case 3 with more than two biological conditions)} \label{sec:MFUZZmus2} In this section we use the mouse subdataset \textbf{RawCounts\_Weger2021\_MOUSEsub500} (see Subsection \nameref{sec:dataAntoszewski2022}) in order to explain \textbf{MFUZZanalysis()} in \textbf{case 3} when there are more than two biological conditions. The following function realizes the temporal clustering analysis. It takes as input, a number of clusters (\texttt{DataNumberCluster}) that can be chosen automatically if \texttt{DataNumberCluster=NULL} and the results of the function \textbf{DATAnormalization()}. The lines of code below return for each biological condition %%%% \begin{itemize} \item the summary of the results of the function \Rfunction{mfuzz()} \item the scaled height, computed with the \Rfunction{HCPC()} function, and shows the cluster chosen automatically (if \texttt{DataNumberCluster=NULL}). If \texttt{Method="hcpc"}, the function plots the scaled within-cluster inertia, but if \texttt{Method="kmeans"}, the function plots the scaled within-cluster inertia. As the number of genes can be very high, we recommend to select \texttt{Method="hcpc"} which is by default. \item the output graph from the R package Mfuzz which represents the most common temporal behavior among all genes for the biological condition `BmKo.' \end{itemize} <>= ResMfuzzLeuk500 <- MFUZZanalysis(SEresNorm=resNORMmus2500, DataNumberCluster=NULL, Method="hcpc", Membership=0.5, Min.std=0.1, Plot.Mfuzz=FALSE, path.result=NULL, Name.folder.mfuzz=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.Mfuzz=TRUE} \item similar to those described in subsection \nameref{sec:MFUZZfission}. \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} Other temporal information are shown in the alluvial graph of the subsection \nameref{sec:DEanalysis} that can be compared with the previous graphs. Write \texttt{?MFUZZanalysis} in your console for more information about the function. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Plot expression of data with DATAplotExpressionGenes()} \label{sec:PlotExpression} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Mouse data (case 1)}\label{sec:PlotExpressionMus1} In this section we use the mouse subdataset \textbf{RawCounts\_Antoszewski2022\_MOUSEsub500} (see the subsection \nameref{sec:dataAntoszewski2022}) in order to explain \textbf{DATAplotExpressionGenes()} in \textbf{case 1}. The lines of code below allow to plot, from the results of the function \textbf{DATAnormalization()}, the expression profile of the 10th gene showing for each biological condition: a box plot, a violin plot, and error bars (standard deviation). Each box plot, violin plot and error bars is associated to the distribution of the expression of all sample belonging to the corresponding biological condition. <>= resEVOmus1500 <- DATAplotExpressionGenes(SEresNorm=resNORMmus1500, Vector.row.gene=c(10), Color.Group=NULL, Plot.Expression=TRUE, path.result=NULL) @ If the user wants to select several gene (so several output graphs), for instance the 28th, the 38th, the 39th and the 50th, he needs to set \texttt{Vector.row.gene=c(28,38:39,50)}. The graphs are \begin{itemize} \item displayed if \texttt{Plot.Expression=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} By default (if \texttt{Color.Group=NULL}), a color will be automatically applied for each biological condition. The user can change the colors by creating the following data.frame <>= data.col.mus50<-data.frame(Name=c("N1wtT1wt","N1haT1wt","N1haT1ko","N1wtT1ko"), Col=c("black","red","green","blue")) @ and setting \texttt{Color.Group= data.col.mus50}. Write \texttt{?DATAplotExpressionGenes} in your console for more information about the function. \subsubsection{Fission data (case 2)}\label{sec:PlotExpressionFission} In this section we use the fission yeast subdataset \textbf{RawCounts\_Leong2014\_FISSIONsub500wt} (see the subsection \nameref{sec:dataLeong2014}) in order to explain \textbf{DATAplotExpressionGenes()} in \textbf{case 2}. The lines of code below allow to plot, from the results of the function \textbf{DATAnormalization()}, %%%%% \begin{itemize} \item the evolution of the 17th gene expression of all three replicates across time (purple lines) \item the evolution of the mean and the standard deviation of the 17th gene expression across time (black lines). \end{itemize} <>= resEVOyeast500<-DATAplotExpressionGenes(SEresNorm=resNORMyeast500, Vector.row.gene=c(17), Plot.Expression=TRUE, path.result=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.Expression=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} If the user wants to select several genes, for instance the 1st, the 2nd, the 17th and the 19th, he needs to set \texttt{Vector.row.gene=c(1,2,17,19)}. Write \texttt{?DATAplotExpressionGenes} in your console for more information about the function. %%%%%%%%%%%%%% %%%%%%%%%%%%%% \subsubsection{Leukemia data (case 3 with only two biological conditions)} \label{sec:PlotExpressionLeuk} In this section we use the Chronic lymphocytic leukemia (CLL) subdataset\newline \textbf{RawCounts\_Schleiss2021\_CLLsub500} (see Subsection \nameref{sec:dataSchleiss2021}) in order to explain \textbf{DATAplotExpressionGenes()} in \textbf{case 3} when there only two biological conditions. The lines of code below allow to plot, from the results of the function \textbf{DATAnormalization()}, for each biological condition: the evolution of the 25th gene expression of the three replicates across time and the evolution of the mean and the standard deviation of the 25th gene expression across time. The color of the different lines are different for different biological conditions. <>= resEVOleuk500 <- DATAplotExpressionGenes(SEresNorm=resNORMleuk500, Vector.row.gene=c(25), Color.Group = NULL, Plot.Expression=FALSE, path.result=NULL, Name.folder.profile=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.Expression=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} By default (if \texttt{Color.Group=NULL}), a color will be automatically assigned to each biological condition. The user can change the colors by creating the following data.frame <>= data.col.Leuk<-data.frame(Name=c("NP","P"), Col=c("black","red")) @ and setting \texttt{Color.Group=data.col.Leuk}. If the user wants to select several genes, for instance the 97th, the 192th, the 194th and the 494th, he needs to set \texttt{Vector.row.gene=c(97,192,194,494)}. Write \texttt{?DATAplotExpressionGenes} in your console for more information about the function. %%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Mouse 2 data (case 3 with more thant two biological condition)} \label{sec:PlotExpressionMus2} In this section we use the mouse subdataset \textbf{RawCounts\_Weger2021\_MOUSEsub500} (see Subsection \nameref{sec:dataWeger2021}) in order to explain \textbf{DATAplotExpressionGenes()} in \textbf{case 3} when there are more than two biological conditions. The previous lines of code allow to plot, from the results of the function \textbf{DATAnormalization()}, for each biological condition: the evolution of the 17th gene expression of the three replicates across time and the evolution of the mean and the standard deviation of the 17th gene expression across time. The color of the different lines are different for different biological conditions. <>= resEVOmus2500 <- DATAplotExpressionGenes(SEresNorm=resNORMmus2500, Vector.row.gene=c(17), Color.Group=NULL, Plot.Expression=FALSE, path.result=NULL) @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.Expression=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} By default (if \texttt{Color.Group=NULL}), a color will be automatically assigned to each biological condition. The user can change the colors by creating the following data.frame <>= data.col.MusBmCrKoWt<-data.frame(Name=c("BmKo","BmWt","CrKo","CrWt"), Col=c("black","blue","green","red")) @ and setting \texttt{Color.Group=data.col.MusBmCrKoWt}. If the user wants to select several genes, for instance the 97th, the 192th, the 194th and the 494th, he needs to set \texttt{Vector.row.gene=c(97,192,194,494)}. Write \texttt{?DATAplotExpressionGenes} in your console for more information about the function. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Statistical analysis of the transcriptional response for the four dataset (supervised analysis).} \label{sec:SupervisedAnalysis} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{DE analysis with DEanalysisGlobal()}\label{sec:DEanalysis} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Mouse data (case 1)}\label{sec:DEanalysisMus1} In this section we use the mouse subdataset \textbf{RawCounts\_Antoszewski2022\_MOUSEsub500} (see the subsection \nameref{sec:dataAntoszewski2022}) in order to explain \textbf{DEanalysisGlobal()} in \textbf{case 1}. The function below %%%%%%%% \begin{itemize} \item returns a data.frame (output \texttt{DE.results}). See subection \nameref{sec:DEanalysisMus1Summary} \item plots the following graphs %%%%%%% \begin{itemize} \item an upset graph, realized with the R package UpSetR \cite{Conway2017UpSetR}, which corresponds to a Venn diagram barplot. If there are only two biological conditions, the graph will not be plotted. See subsection \nameref{sec:DEmus1GRAPHS} \item a barplot showing the number of specific genes per biological condition. See subection \nameref{sec:DEmus1GRAPHS}. \end{itemize} \end{itemize} Uncomment lines (\texttt{Results\_DEanalysis\_sub500}) contains the results of \texttt{DEanalysisGlobal()} of three dataset which included \texttt{RawCounts\_Antoszewski2022\_MOUSEsub500}. The results of \texttt{DEanalysisGlobal()} with \texttt{RawCounts\_Antoszewski2022\_MOUSEsub500} are saved in \texttt{Results\_DEanalysis\_sub500\$DE\_Antoszewski2022\_MOUSEsub500}. <>= ResDEMus500 <- DEanalysisGlobal(SEres=resSEmus1500, pval.min=0.05, pval.vect.t=NULL, log.FC.min=1, LRT.supp.info=FALSE, Plot.DE.graph=TRUE, path.result=NULL, Name.folder.DE=NULL) # data("Results_DEanalysis_sub500") # ResDEMus500<-Results_DEanalysis_sub500$DE_Antoszewski2022_MOUSEsub500 @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.DE.graph=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} Write \texttt{?DEanalysisGlobal} in your console for more information about the function. \subsubsubsection{Data.frame summarising all the DE analysis (case 1)} \label{sec:DEanalysisMus1Summary} The output data.frame \texttt{DE.results} can be observed with the following lines of code, <>= Res.DE.results.Mus1 <- SummarizedExperiment::rowData(ResDEMus500) @ as the glossary of the column names with <>= ResGlossary.Mus1 <- S4Vectors::metadata(ResDEMus500)$DESeq2obj$List.Glossary @ and then write \texttt{Res.DE.results.Mus1} and \texttt{ResGlossary.Mus1} in the R console. The data.frame contains \begin{itemize} \item gene names \item pvalues, log2 fold change and DE genes between each pairs of biological conditions. \item a binary column (1 and 0) where 1 means the gene is DE between at least one pair of biological conditions. \item \(N_{bc}=4\) binary columns (with \(N_{bc}\) the number of biological conditions), which give the genes specific for each biological condition. A '1' in one of these columns means the gene is specific to the biological condition associated to the given column. 0 otherwise. A gene is called specific to a given biological condition BC1, if the gene is DE between BC1 and any other biological conditions, but not DE between any pair of other biological conditions. \item \(N_{bc}=4\) columns filled with -1, 0 and 1, one per biological condition. A '1' in one of these columns means the gene is up-regulated (or over-expressed) for the biological condition associated to the given column. A gene is called up-regulated for a given biological condition BC1 if the gene is specific to the biological condition BC1 and expressions in BC1 are higher than in the other biological conditions. A '-1' in one of these columns means the gene is down-regulated (or under-expressed) for the biological condition associated to the given column. A gene is called down-regulated for a given biological condition BC1 if the gene is specific to the biological condition BC1 and expressions in BC1 are lower than in the other biological conditions. A '0' in one of these columns means the gene is not specific to the biological condition associated to the given column. \end{itemize} \subsubsubsection{Description of graphs}\label{sec:DEmus1GRAPHS} The function plots three graphs The upset graph plots the number of genes corresponding to each possible intersection in a Venn barplot. We say that %%%%%%% \begin{itemize} \item a set of pairs of biological conditions forms an intersection when there is at least one gene which is DE for each of these pairs of biological conditions, but not for the others. \item a gene corresponds to a given intersection if this genes is DE for each pair of biological conditions in the intersection, but not for the others. \end{itemize} For instance, the second column gives the number of genes corresponding to the intersection \[\biggl\{\{\mbox{N1wtT1wt, N1haT1wt}\}, \{\mbox{N1wtT1wt, N1haT1ko}\}, \{\mbox{N1wtT1wt, N1wtT1ko}\}\biggl\} \ .\] In other words, this is the number of genes DE between \begin{itemize} \item N1wtT1wt versus N1haT1wt \item N1wtT1wt versus N1haT1ko \item N1wtT1wt versus N1wtT1ko \end{itemize} and not DE between any other pair of biological conditions. Note that this columns actually gives the number of specific genes to the biological condition N1wtT1wt. The first barplot plots for each biological condition BC1 %%%%%%%% \begin{itemize} \item The number of up- and down-regulated genes which are specific to the biological condition BC1. \item The number of genes categorized as ``Other.'' A gene belongs to the ``Other'' category if it is DE between BC1 and at least one other biological condition and not specific. \end{itemize} The second barplot is the same graph without ``Other'' will be plotted too. If there are only two biological condition, only the second barplot will be plotted. \clearpage \subsubsection{Fission data (case 2)}\label{sec:DEanalysisFission} In this section we use the fission yeast subdataset \textbf{RawCounts\_Leong2014\_FISSIONsub500wt} (see the subsection \nameref{sec:dataAntoszewski2022}) in order to explain \textbf{DEanalysisGlobal()} in \textbf{case 2}. The function below %%%%%%%%%%% \begin{itemize} \item returns a data.frame (output \texttt{DE.results}). See subsection \nameref{sec:DEanalysisFissionSummary} \item plots the following graphs % \begin{itemize} \item an alluvial graph. See subsection \nameref{sec:DEyeastGRAPHS} \item a graph showing the number of DE genes as a function of time for each temporal group. By temporal group, we mean the sets of genes which are first DE at the same time. See subsection \nameref{sec:DEyeastGRAPHS} \item a barplot showing the number of DE genes (up- or down-regulated) for each time. See subsection \nameref{sec:DEyeastGRAPHS}. \item two upset graphs, realized with the R package UpSetR \cite{Conway2017UpSetR}, showing the number of DE genes belonging to each DE temporal pattern. By temporal pattern, we mean the set of times ti such that the gene is DE between ti and the reference time t0. See subsection \nameref{sec:DEyeastGRAPHS}. \end{itemize} % \end{itemize} The lines are commented because they take too much time. Uncomment them in order to use the function \Rfunction{DEanalysisGlobal()}. \texttt{Results\_DEanalysis\_sub500} contains the results of \Rfunction{DEanalysisGlobal()} of three dataset which included \texttt{RawCounts\_Leong2014\_FISSIONsub500wt}. The results of \Rfunction{DEanalysisGlobal()} with \texttt{RawCounts\_Leong2014\_FISSIONsub500wt} are saved in \texttt{Results\_DEanalysis\_sub500\$DE\_Leong2014\_FISSIONsub500wt}. <>= # DEyeast500wt <- DEanalysisGlobal(SEres=resSEfission500, # pval.min=0.05, # pval.vect.t=NULL, # log.FC.min=1, # LRT.supp.info=FALSE, # Plot.DE.graph =FALSE, # path.result=NULL, # Name.folder.DE=NULL) data("Results_DEanalysis_sub500") DEyeast500wt <- Results_DEanalysis_sub500$DE_Leong2014_FISSIONsub500wt @ The graphs are \begin{itemize} \item displayed if \texttt{Plot.DE.graph=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} Write \texttt{?DEanalysisGlobal} in your console for more information about the function. \subsubsubsection{Data.frame summarising all the DE analysis (case 2)} \label{sec:DEanalysisFissionSummary} The output data.frame \texttt{DE.results} can be observed with the following lines of code, <>= Res.DE.results.Fission <- SummarizedExperiment::rowData(DEyeast500wt) @ as the glossary of the column names with <>= ResGlossary.Fission <- S4Vectors::metadata(DEyeast500wt)$DESeq2obj$List.Glossary @ and then write \texttt{Res.DE.results.Fission} and \texttt{ResGlossary.Fission} in the R console. The data.frame contains \begin{itemize} \item gene names \item pvalues, log2 fold change and DE genes between each time ti versus the reference time t0. \item a binary column (1 and 0) where 1 means the gene is DE at at least between one time ti versus the reference time t0. \item a column where each element is succession of 0 and 1. The positions of `1' indicate the set of times ti such that the gene is DE between ti and the reference time t0. So each element of the column is what we called previously, a temporal pattern. \end{itemize} \subsubsubsection{Description of graphs}\label{sec:DEyeastGRAPHS} The function returns the following plots %%%%%%% \begin{enumerate} \item An alluvial graph. The x-axis of the alluvial graph is labeled with all times except t0. For each vertical barplot, there are two strata: 1 and 0 whose sizes indicate respectively the number of DE genes and of non DE genes, between the time corresponding to the barplot and the reference time t0. \newline %%%%%% The alluvial graph is composed of curves each corresponding to a single gene, which are gathered in alluvia. An alluvium is composed of all genes having the same curve: for example, an alluvium going from the stratum 0 at time t1 (for instance) to the stratum 1 at time t2 corresponds to the set of genes which are not differentially expressed (DE) at t1 and are DE at time t2. Each alluvium connects pairs of consecutive barplots and its thickness gives the number of genes in the set. %%%%%%%%%%%% The color of each alluvium indicates the temporal group, defined as the set of genes which are all first DE at the same time with respect to the reference time t0. \item A graph giving the time evolution of the number of DE genes within each temporal group. The x-axis labels indicate all times except t0. %%%%%%%% \item A barplot showing the number of DE genes per time. The x-axis labels indicate all times except t0. %%%%%%%%% For each DE gene, we compute the sign of the log2 fold change between time ti and time t0. If the sign is positive (resp. negative), the gene is categorized as up-regulated (resp. down-regulated). In the graph, the up-regulated (resp. down-regulated) genes are indicated in red (resp. in blue). \item An upset graph, realized with the R package UpSetR \cite{Conway2017UpSetR}, plotting the number of genes in each DE temporal pattern in a Venn barplot. By DE temporal pattern, we mean a subset of times in t1,\ldots,tn. We say that a gene belongs to a DE temporal pattern if the gene is DE versus t0 only at the times in this DE temporal patterns. %%%%%%%%%%%% For each gene in a given DE temporal pattern, we compute the number of DE times where it is up-regulated and we use a color code in the Venn barplot to indicate the number of genes in a temporal pattern that are up-regulated a given number of times. \item The same upset graph is also given without colors. \end{enumerate} \clearpage \subsubsection{Leukemia data (case 3 with only two biological conditions)} \label{sec:DEanalysisLeuk} In this section we use the Chronic lymphocytic leukemia (CLL) subdataset\newline \textbf{RawCounts\_Schleiss2021\_CLLsub500} (see Subsection \nameref{sec:dataSchleiss2021}) in order to explain \textbf{DEanalysisGlobal()} in \textbf{case 3} when there only two biological conditions. The lines of code below %%%%%%% \begin{itemize} \item returns a data.frame (output \texttt{DE.results}). See subsection \nameref{sec:DEsummaryLeuk} \item plots the following graphs % \begin{itemize} \item \emph{Results from the temporal statistical analysis} (\textbf{case 2} for each biological condition). See subsection \nameref{sec:DEleukGRAPHStemporel} \item \emph{Results from the statistical analysis by biological condition} (\textbf{case 1} for each fixed time). See subsection \nameref{sec:DEleukGRAPHSbiocond}. \item \emph{Results from the combination of temporal and biological statistical analysis}. See subsection \nameref{sec:DEleukGRAPHSboth} \end{itemize} \end{itemize} <>= ResDELeuk500 <- DEanalysisGlobal(SEres=resSEleuk500, pval.min=0.05, pval.vect.t=NULL, log.FC.min=1, LRT.supp.info=FALSE, Plot.DE.graph =TRUE, path.result=NULL, Name.folder.DE=NULL) # data("Results_DEanalysis_sub500") # ResDELeuk500<-Results_DEanalysis_sub500$DE_Schleiss2021_CLLsub500 @ \texttt{Results\_DEanalysis\_sub500} contains the results of \Rfunction{DEanalysisGlobal()} of three dataset which included \texttt{RawCounts\_Schleiss2021\_CLLsub500}. The results of \Rfunction{DEanalysisGlobal()} with \texttt{RawCounts\_Schleiss2021\_CLLsub500} are saved in \texttt{Results\_DEanalysis\_sub500\$DE\_Schleiss2021\_CLLsub500}. The graphs are \begin{itemize} \item displayed if \texttt{Plot.DE.graph=TRUE} \item saved in a folder if the user selects a folder path in \texttt{path.result}. If \texttt{path.result=NULL} the results will not be saved in a folder. \end{itemize} Write \texttt{?DEanalysisGlobal} in your console for more information about the function. \subsubsubsection{Data.frame summarising all the DE analysis (case 3 with only two biological conditions)} \label{sec:DEsummaryLeuk} The output data.frame \texttt{DE.results} can be observed with the following lines of code, <>= Res.DE.results.Leuk <- SummarizedExperiment::rowData(ResDELeuk500) @ as the glossary of the column names with <>= ResGlossary.Leuk <- S4Vectors::metadata(ResDELeuk500)$DESeq2obj$List.Glossary @ and then write \texttt{Res.DE.results.Leuk} and \texttt{ResGlossary.Leuk} in the R console. The data.frame contains \begin{itemize} \item gene names \item \emph{Results from the temporal statistical analysis} (\textbf{case 2} for each biological condition) % \begin{itemize} \item pvalues, log2 fold change and DE genes between each time ti versus the reference time t0, for each biological condition. \item \(N_{bc}=2\) binary columns (1 and 0), one per biological condition (with \(N_{bc}\) the number of biological conditions). A `1' in one of these two columns means the gene is DE at at least between one time ti versus the reference time t0, for the biological condition associated to the given column. \item \(N_{bc}=2\) columns where each element is succession of 0 and 1, one per biological condition. The positions of `1,' in one of these two columns, indicate the set of times ti such that the gene is DE between ti and the reference time t0, for the biological condition associated to the given column. So each element of the column is what we called previously, a temporal pattern. \end{itemize} % \item \emph{Results from the statistical analysis by biological condition} (\textbf{case 1} for each fixed time) % \begin{itemize} \item pvalues, log2 fold change and DE genes between each pairs of biological conditions, for each fixed time. \item \(T=9\) binary columns (1 and 0), one per time (with \(T\) the number of time measurements). A `1,' in one of these columns, means the gene is DE between at least one pair of biological conditions, for the fixed time associated to the given column. \item \(N_{bc} \times T=2\times 9=18\) binary columns, which give the genes specific for each biological condition at each time ti. A `1' in one of these columns means the gene is specific to the biological condition at a fixed time associated to the given column. `0' otherwise. A gene is called specific to a given biological condition BC1 at a time ti, if the gene is DE between BC1 and any other biological conditions at time ti, but not DE between any pair of other biological conditions at time ti. \item \(N_{bc} \times T=2\times 9=18\) columns filled with -1, 0 and 1. A `1' in one of these columns means the gene is up-regulated (or over-expressed) for the biological condition at a fixed time associated to the given column. A gene is called up-regulated for a given biological condition BC1 at time ti if the gene is specific to the biological condition BC1 at time ti and expressions in BC1 at time ti are higher than in the other biological conditions at time ti. A `-1' in one of these columns means the gene is down-regulated (or under-expressed) for the biological condition at a fixed time associated to the given column. A gene is called down-regulated for a given biological condition at a time ti BC1 if the gene is specific to the biological condition BC1 at time ti and expressions in BC1 at time ti are lower than in the other biological conditions at time ti. A `0' in one of these columns means the gene is not specific to the biological condition at a fixed time associated to the given column. \item \(N_{bc}=2\) binary columns (1 and 0). A `1' in one of these columns means the gene is specific at at least one time ti, for the biological condition associated to the given column. `0' otherwise. \end{itemize} \item \emph{Results from the combination of temporal and biological statistical analysis} % \begin{itemize} \item \(N_{bc} \times T=2\times 9=18\) binary columns, which give the signatures genes for each biological condition at each time ti. A `1' in one of these columns means the gene is signature gene to the biological condition at a fixed time associated to the given column. `0' otherwise. A gene is called signature of a biological condition BC1 at a given time ti, if the gene is specific to the biological condition BC1 at time ti and DE between ti versus the reference time t0 for the biological condition BC1. \item \(N_{bc}=2\) binary columns (1 and 0). A `1' in one of these columns means the gene is signature at at least one time ti, for the biological condition associated to the given column. `0' otherwise. % \end{itemize} % \end{itemize} \subsubsubsection{Graphs from the results of the temporal statistical analysis} \label{sec:DEleukGRAPHStemporel} The function plots % \begin{itemize} \item \(N_{bc}=2\) alluvial graphs, one per biological condition (with \(N_{bc}\) the number of biological conditions). The x-axis of the graph is labeled with all times except t0. For each vertical barplot, there are two strata: 1 and 0 whose sizes indicate respectively the number of DE genes and of non DE genes, between the time corresponding to the barplot and the reference time t0. %% The alluvial graph is composed of curves each corresponding to a single gene, which are gathered in alluvia. An alluvium is composed of all genes having the same curve: for example, an alluvium going from the stratum 0 at time t1 (for instance) to the stratum 1 at time t2 corresponds to the set of genes which are not differentially expressed (DE) at t1 and are DE at time t2. Each alluvium connects pairs of consecutive barplots and its thickness gives the number of genes in the set. %% The color of each alluvium indicates the temporal group, defined as the set of genes which are all first DE at the same time with respect to the reference time t0. %%%%%%%%%%%%%%%%%%% \item \(N_{bc}=2\) graphs showing the number of DE genes as a function of time for each temporal group, one per biological condition. By temporal group, we mean the sets of genes which are first DE at the same time. The graphs plot the time evolution of the number of DE genes within each temporal group, one per biological condition. The x-axis labels indicate all times except t0. %%%%%%%%%%%%%%%%%%% \item a barplot showing the number of DE genes up-regulated and down-regulated per time, for each biological condition. The graph shows the number of DE genes per time, for each biological condition. The x-axis labels indicate all times except t0. For each DE gene, we compute the sign of the log2 fold change between time ti and time t0. If the sign is positive (resp. negative), the gene is categorized as up-regulated (resp. down-regulated). In the graph, the up-regulated (resp. down-regulated) genes are indicated in red (resp. in blue). %%%%%%%%%%%%%%%%%%% \item \(2\times N_{bc}=4\) upset graphs, realized with the R package UpSetR \cite{Conway2017UpSetR}, showing the number of DE genes belonging to each DE temporal pattern, for each biological condition. By temporal pattern, we mean the set of times ti such that the gene is DE between ti and the reference time t0. The graphs, realized with the R package UpSetR \cite{Conway2017UpSetR}, plot the number of genes in each DE temporal pattern in a Venn barplot, one per biological condition. By DE temporal pattern, we mean a subset of times in t1,\ldots,tn. We say that a gene belongs to a DE temporal pattern if the gene is DE versus t0 only at the times in this DE temporal patterns. For each gene in a given DE temporal pattern, we compute the number of DE times where it is up-regulated and we use a color code in the Venn barplot to indicate the number of genes in a temporal pattern that are up-regulated a given number of times. The same graph is also given without colors. %%%%%%%%%%%%%%%%%%% \item an alluvial graph for DE genes which are DE at at least one time for each group. The graph shows common and no common genes between which are DE at at least one time for the two biological conditions. \end{itemize} \subsubsubsection{Graphs from the results of the biological condition analysis} \label{sec:DEleukGRAPHSbiocond} The function plots %%%%%%%%%%%% \begin{itemize} \item a barplot showing the number of specific genes per biological condition, for each time. A gene is called specific to a given biological condition BC1 at a time ti, if the gene is DE between BC1 and any other biological conditions at time ti, but not DE between any pair of other biological conditions at time ti. \item only if there are more than two biological conditions (which is not the case here), \(\binom{N_{bc}}{2}=\binom{4}{2}=6\) upset graph, realized with the R package UpSetR \cite{Conway2017UpSetR}, which corresponds to Venn diagram barplots for each time ti. Each graph the number of genes corresponding to each possible intersection in a Venn barplot at a given time. We say that \begin{itemize} \item a set of pairs of biological conditions forms an intersection at a given time ti when there is at least one gene which is DE for each of these pairs of biological conditions at time ti, but not for the others at time ti. \item a gene corresponds to a given intersection if this genes is DE for each pair of biological conditions in the intersection at time ti, but not for the others at time ti. \end{itemize} \item only if there are more than two biological conditions (which is not the case here), an alluvial graph for DE genes which are specific at at least one time for each group. The plots allows to detect genes that are specific only for one biological condition. \end{itemize} \subsubsubsection{Graphs from the results of the combination of temporal and biological statistical analysis} \label{sec:DEleukGRAPHSboth} The function plots % \begin{itemize} \item a barplot showing the number of signature genes and DE genes (but not signature) per time, for each biological condition. A gene is called signature of a biological condition BC1 at a given time ti, if the gene is specific to the biological condition BC1 at time ti and DE bewteen ti versus the reference time t0 for the biological condition BC1. %%%%%%%%%%%%%%%%%% \item a barplot showing the number of genes which are DE at at least one time, specific at at least one time and signature at at least one time, for each biological condition. The barplot summarizes well the combination of temporal and biological analysis because the figure gives for each biological condition the number of genes which are DE at at least one time, the number of specific genes at at least one time and the number of signature at at least one time. \item only if there are more than two biological conditions (which is not the case here), an alluvial graph for DE genes which are signature at at least one time for each group. The plots allows to detect genes that are signature only for one biological condition. \end{itemize} \clearpage \subsubsection{Mouse data 2 (case 3 with more than two biological conditions)} \label{sec:DEanalysisMus2} In this section we use the mouse subdataset \textbf{RawCounts\_Weger2021\_MOUSEsub500} (see Subsection~\ref{sec:dataWeger2021}) in order to explain \textbf{DEanalysisGlobal()} in \textbf{case 3} when there are more than two biological conditions. For the speed of the algorithm, we will take only three biological conditions and three times <>= Sub3bc3T <- RawCounts_Weger2021_MOUSEsub500[, seq_len(73)] SelectTime <- grep("_t0_", colnames(Sub3bc3T)) SelectTime <- c(SelectTime, grep("_t1_", colnames(Sub3bc3T))) SelectTime <- c(SelectTime, grep("_t2_", colnames(Sub3bc3T))) Sub3bc3T <- Sub3bc3T[, c(1, SelectTime)] resSEmus25003b3t <- DATAprepSE(RawCounts=Sub3bc3T, Column.gene=1, Group.position=1, Time.position=2, Individual.position=3) @ The lines of code below %%%%%%% \begin{itemize} \item returns a data.frame (output \texttt{DE.results}). See subsection \nameref{sec:DEsummaryLeuk} \item plots the following graphs (similar to those described in section \nameref{sec:DEanalysisLeuk}) % \begin{itemize} \item \emph{Results from the temporal statistical analysis} (\textbf{case 2} for each biological condition). See subsection \nameref{sec:DEleukGRAPHStemporel} \item \emph{Results from the statistical analysis by biological condition} (\textbf{case 1} for each fixed time). See subsection \nameref{sec:DEleukGRAPHSbiocond}. \item \emph{Results from the combination of temporal and biological statistical analysis}. See subsection \nameref{sec:DEleukGRAPHSboth} \end{itemize} \end{itemize} <>= ResDEMusBmCrKoWt500<-DEanalysisGlobal(SEres=resSEmus25003b3t, pval.min=0.05, pval.vect.t=NULL, log.FC.min=1, LRT.supp.info=FALSE, Plot.DE.graph=FALSE, path.result=NULL, Name.folder.DE=NULL) @ As we are in the similar case described in section \nameref{sec:DEanalysisLeuk}, the results will not be described here. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Volcano plots, ratio intensity (MA) plots and Heatmaps with DEplotVolcanoMA() and DEplotHeatmaps()} \label{sec:DEvolcanoMAheatmap} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Mouse data (case 1)}\label{sec:DEvolcanoMAheatmapMus1} In this section we use the mouse subdataset \textbf{RawCounts\_Antoszewski2022\_MOUSEsub500} (see the subsection \nameref{sec:dataAntoszewski2022}) in order to explain \textbf{DEplotVolcanoMA()} and \textbf{DEplotHeatmaps()} in \textbf{case 1}. \subsubsubsection{Volcano and MA plots (case 1)}\label{sec:DEvolcanoMAMus1} The following lines of code allow to plot \begin{itemize} \item \(\dbinom{N_{bc}}{2}=\frac{N_{bc}(N_{bc}-1)}{2}=6\) volcano plots (with \(N_{bc}=4\) the number of biological conditions). \item \(\frac{N_{bc}(N_{bc}-1)}{2}=6\) MA plots. \end{itemize} allowing to separate non DE genes, DE genes below a threshold of log2 fold change and DE genes above a threshold of log2 fold change. <>= resVolcanoMAMouse <- DEplotVolcanoMA(SEresDE=ResDEMus500, NbGene.plotted=2, SizeLabel=3, Display.plots=TRUE, Save.plots=FALSE) @ If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} If the user wants to display the graph, the user must set \texttt{Display.plots=TRUE}. Write \texttt{?DEplotVolcanoMA} in your console for more information about the function. %%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% \subsubsubsection{Heatmaps (case 1)}\label{sec:DEheatmapMus1} The following lines of code allow to plot a correlation heatmap between samples and a heatmap accross samples and genes for a subset of genes that can be selected by the user. <>= resplotHeatmapMus <- DEplotHeatmaps(SEresDE=ResDEMus500, ColumnsCriteria=2,#c(2,4), Set.Operation="union", NbGene.analysis=20, SizeLabelRows=5, SizeLabelCols=5, Display.plots=TRUE, Save.plots=FALSE) @ For the rle heatmap (so the heatmap accross samples and subset of genes), the subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{rowData(ResDEMus500)} \item Then the selected genes will be % \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{ResDEMus500} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMus500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{ResDEMus500} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMus500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{ResDEMus500} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMus500)} by \texttt{ColumnsCriteria} is >0. \end{itemize} % \item Finally, the subset of genes will be the \texttt{NbGene.analysis} genes, among the selected genes, which have the highest sum of absolute log2 fold change. \end{enumerate} If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} If the user wants to display the graph, the user must set \texttt{Display.plots=TRUE}. Write \texttt{?DEplotHeatmaps} in your console for more information about the function. \clearpage \subsubsection{Fission data (case 2)}\label{sec:DEvolcanoMAheatmapFission} In this section we use the fission yeast subdataset \textbf{RawCounts\_Leong2014\_FISSIONsub500wt} (see the subsection \nameref{sec:dataLeong2014}) in order to explain \textbf{DEplotVolcanoMA()} and \textbf{DEplotHeatmaps()} in \textbf{case 2}. \subsubsubsection{Volcano and MA plots (case 2)}\label{sec:DEvolcanoMAFission} The following lines of code allow to plot \begin{itemize} \item \(T-1=6-1=5\) volcano plots (with \(T=6\) the number time points) \item \(T-1=5\) MA plots with. \end{itemize} allowing to separate non DE genes, DE genes below a threshold of log2 fold change and DE genes above a threshold of log2 fold change. <>= resVolcanoMAFission <- DEplotVolcanoMA(SEresDE=DEyeast500wt, NbGene.plotted=2, SizeLabel=3, Display.plots=TRUE, Save.plots=TRUE) @ If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} If the user wants to display the graph, the user must set \texttt{Display.plots=TRUE}. Write \texttt{?DEplotVolcanoMA} in your console for more information about the function. \subsubsubsection{Heatmaps (case 2)}\label{sec:DEheatmapFission} The following lines of code allow to plot a correlation heatmap between samples and a heatmap across samples and genes for a subset of genes that can be selected by the user. <>= resplotHeatmapFission <- DEplotHeatmaps(SEresDE=DEyeast500wt, ColumnsCriteria=2, Set.Operation="union", NbGene.analysis=20, Color.Group=NULL, SizeLabelRows=5, SizeLabelCols=5, Display.plots=TRUE, Save.plots=FALSE) @ For the rle heatmap (so the heatmap accross samples and subset of genes), the subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{rowData(DEyeast500wt)} \item Then the selected genes will be % \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{DEyeast500wt} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(DEyeast500wt)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{DEyeast500wt} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(DEyeast500wt)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{DEyeast500wt} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(DEyeast500wt)} by \texttt{ColumnsCriteria} is >0. \end{itemize} % \item Finally, the subset of genes will be the \texttt{NbGene.analysis} genes, among the selected genes, which have the highest sum of absolute log2 fold change. \end{enumerate} If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} If the user wants to display the graph, the user must set \texttt{Display.plots=TRUE}. Write \texttt{?DEplotHeatmaps} in your console for more information about the function. \clearpage \subsubsection{Leukemia data (case 3 with only two biological conditions)} \label{sec:DEvolcanoMAheatmapLeuk} In this section we use the Chronic lymphocytic leukemia (CLL) subdataset \textbf{RawCounts\_Schleiss2021\_CLLsub500} (see Subsection \nameref{sec:dataSchleiss2021}) in order to explain \textbf{DEplotVolcanoMA()} and \textbf{DEplotHeatmaps()} in \textbf{case 3} when there only two biological conditions. \subsubsubsection{Volcano and MA plots (case 3 two conditions)} \label{sec:DEvolcanoMALeuk} The following lines of code allow to plot \begin{itemize} \item \(\dbinom{N_{bc}}{2} \times T + (T-1)\times N_{bc} =\frac{N_{bc}(N_{bc}-1)}{2}\times T + (T-1)\times N_{bc}=25\) volcano plots (with \(N_{bc}=2\) the number of biological conditions and \(T=9\) the number of time points). \item \(\dbinom{N_{bc}}{2} \times T + (T-1)\times N_{bc}=25\) MA plots. \end{itemize} allowing to separate non DE genes, DE genes below a threshold of log2 fold change and DE genes above a threshold of log2 fold change. <>= resVolcanoMACLL<-DEplotVolcanoMA(SEresDE=ResDELeuk500, NbGene.plotted=2, SizeLabel=3, Display.plots=FALSE, Save.plots=FALSE) @ The graphs are a combination of the results described in subsection \nameref{sec:DEvolcanoMAMus1} and subsection \nameref{sec:sec:DEvolcanoMAFission}. If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} If the user wants to display the graph, the user must set \texttt{Display.plots=TRUE}. Write \texttt{?DEplotVolcanoMA} in your console for more information about the function. \paragraph{Heatmaps (case 3 two conditions)}\label{sec:DEheatmapLeuk} The following lines of code allow to plot a correlation heatmap between samples and a heatmap across samples and genes for a subset of genes that can be selected by the user. <>= resplotHeatmapCLL<-DEplotHeatmaps(SEresDE=ResDELeuk500, ColumnsCriteria=c(12, 13), Set.Operation="union", NbGene.analysis=20, SizeLabelRows=5, SizeLabelCols=5, Display.plots=FALSE, Save.plots=FALSE) @ Both graphs are similar to those described in subsection \nameref{sec:DEheatmapFission} and \nameref{sec:DEheatmapMus1}. For the rle heatmap (so the heatmap accross samples and subset of genes), the subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{rowData(ResDELeuk500)} \item Then the selected genes will be % \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{ResDELeuk500} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(ResDELeuk500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{ResDELeuk500} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(ResDELeuk500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{ResDELeuk500} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(ResDELeuk500)} by \texttt{ColumnsCriteria} is >0. \end{itemize} % \item Finally, the subset of genes will be the \texttt{NbGene.analysis} genes, among the selected genes, which have the highest sum of absolute log2 fold change. \end{enumerate} If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} If the user wants to display the graph, the user must set \texttt{Display.plots=TRUE}. Write \texttt{?DEplotHeatmaps} in your console for more information about the function. \clearpage \subsubsection{Mouse data 2 (case 3 with more than two biological conditions)} \label{sec:DEvolcanoMAheatmapMus2} In this section we use the mouse subdataset \textbf{RawCounts\_Weger2021\_MOUSEsub500} (see Subsection \nameref{sec:dataWeger2021}) in order to explain \textbf{DEplotVolcanoMA()} and \textbf{DEplotHeatmaps} in \textbf{case 3} when there are more than two biological conditions. \subsubsubsection{Volcano and MA plots (case 3 with more than two biological conditions)} \label{sec:DEvolcanoMAMus2} The following lines of code allow to plot \begin{itemize} \item \(\dbinom{N_{bc}}{2} \times T + (T-1)\times N_{bc} =\frac{N_{bc}(N_{bc}-1)}{2}\times T + (T-1)\times N_{bc}=56\) volcano plots (with \(N_{bc}=4\) the number of biological conditions and \(T=6\) the number of time points). \item \(\dbinom{N_{bc}}{2} \times T + (T-1)\times N_{bc}=56\) MA plots. \end{itemize} allowing to separate non DE genes, DE genes below a threshold of log2 fold change and DE genes above a threshold of log2 fold change. <>= resVolcanoMAMouse2<-DEplotVolcanoMA(SEresDE=ResDEMusBmCrKoWt500, NbGene.plotted=2, SizeLabel=3, Display.plots=FALSE, Save.plots=FALSE) @ The graphs are a combination of the results described in subsection \nameref{sec:DEvolcanoMAMus1} and subsection \nameref{sec:sec:DEvolcanoMAFission}. If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} If the user wants to display the graph, the user must set \texttt{Display.plots=TRUE}. Write \texttt{?DEplotVolcanoMA} in your console for more information about the function. \subsubsubsection{Heatmaps (case 3 with more than two biological conditions)}\label{sec:DEheatmapMus2} The following lines of code allow to plot a correlation heatmap between samples and a heatmap across samples and genes for a subset of genes that can be selected by the user. <>= resplotHeatmapMouse2<-DEplotHeatmaps(SEresDE=ResDEMusBmCrKoWt500, ColumnsCriteria=2:5, Set.Operation="union", NbGene.analysis=20, SizeLabelRows=5, SizeLabelCols=5, Display.plots=FALSE, Save.plots=FALSE) @ Both graphs are similar to those described in subsection \nameref{sec:DEheatmapFission} and \nameref{sec:DEheatmapMus1}. For the rle heatmap (so the heatmap accross samples and subset of genes), the subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{rowData(ResDEMusBmCrKoWt500)} \item Then the selected genes will be % \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{ResDEMusBmCrKoWt500} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMusBmCrKoWt500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{ResDEMusBmCrKoWt500} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMusBmCrKoWt500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{ResDEMusBmCrKoWt500} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMusBmCrKoWt500)} by \texttt{ColumnsCriteria} is >0. \end{itemize} % \item Finally, the subset of genes will be the \texttt{NbGene.analysis} genes, among the selected genes, which have the highest sum of absolute log2 fold change. \end{enumerate} If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} If the user wants to display the graph, the user must set \texttt{Display.plots=TRUE}. Write \texttt{?DEplotHeatmaps} in your console for more information about the function. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Gene Ontology (GO) analysis with GSEAQuickAnalysis() and GSEApreprocessing()} \label{sec:GOenrichment} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Mouse data (case 1)}\label{sec:GOenrichmentMus1} In this section we use the mouse subdataset \textbf{RawCounts\_Antoszewski2022\_MOUSEsub500} (see the subsection \nameref{sec:dataAntoszewski2022}) in order to explain \textbf{GSEAQuickAnalysis()} and \textbf{GSEApreprocessing()} in \textbf{case 1}. \subsubsubsection{Gene ontology with the R package gprofiler2} \label{sec:GOenrichmentMus1Gprofiler2} The lines of code below realize an enrichment analysis with the R package gprofiler2 for a selection of genes. Beware, an internet connection is needed. The function returns %%%%%%% \begin{itemize} \item a data.frame (output \texttt{metadata(resGSEAgprofiler2Mus)\$Rgprofiler2\$GSEAresults}) giving information about all detected gene ontologies the list of associated genes. \item a lollipop graph (see section \nameref{sec:stepsGO}). The y-axis indicates the \texttt{MaxNumberGO} most significant gene ontologies and pathways associated to the selected DE genes. The gene ontologies and patways are sorted into descending order. The x-axis indicates the -log10(pvalues). The higher is a lollipop the more significant is a gene ontology or pathway. A lollipop is yellow if the pvalues is smaller than 0.05 (significant) and blue otherwise. \item A mahattan plot (see section \nameref{sec:stepsGO}) indicating all genes ontologies ordered according to the functional database (G0::BP, G0::CC, G0::MF and KEGG) \end{itemize} <>= resGSEAgprofiler2Mus<-GSEAQuickAnalysis(Internect.Connection=FALSE, SEresDE=ResDEMus500, ColumnsCriteria=2, ColumnsLog2ordered=NULL, Set.Operation="union", Organism="mmusculus", MaxNumberGO=10, Background=FALSE, Display.plots=FALSE, Save.plots=FALSE) # # head(4Vectors::metadata(resGSEAgprofiler2Mus)$Rgprofiler2$GSEAresults) @ As \textbf{GSEAQuickAnalysis()} requires an internet connection, the input \texttt{Internect.Connection} is set by default to \texttt{FALSE}. Once the user is sure to have an internet connection, the user may set \texttt{Internect.Connection=TRUE} in order to realize the enrichment analysis. The subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{rowData(ResDEMus500)} \item Then the subset of genes will be \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{ResDEMus500} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMus500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{ResDEMus500} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMus500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{ResDEMus500} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMus500)} by \texttt{ColumnsCriteria} is >0. \end{itemize} \end{enumerate} If \texttt{ColumnsLog2ordered} is a vector of integers, it corresponds to the columns number of \texttt{rowData(ResDEMus500)}, the output of \Rfunction{DEanalysisGlobal()}, which must contains \(log_2\) fold change values. The rows of \texttt{rowData(ResDEMus500)} (corresponding to genes) will be decreasingly ordered according to the sum of absolute \(log_2\) fold change (the selected columns must contain \(log_2\) fold change values) before the enrichment analysis. The enrichment analysis will take into account the genes order as the first genes will be considered to have the highest biological importance and the last genes the lowest. If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} Write \texttt{?GSEAQuickAnalysis} in your console for more information about the function. \subsubsubsection{Preprocessing for GSEA, DAVID, WebGestalt, gProfiler, Panther, ShinyGO, Enrichr and GOrilla} \label{sec:GOenrichmentMus1preprocessing} The following lines of code will generate all files, for a selection of genes, in order to use the following software and online tools : GSEA, DAVID, WebGestalt, gProfiler, Panther, ShinyGO, Enrichr and GOrilla. <>= resGSEApreprocessingFission<-GSEApreprocessing(SEresDE=ResDEMus500, ColumnsCriteria=2, Set.Operation="union", Rnk.files=FALSE, Save.files=FALSE) @ The subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{rowData(ResDEMus500)} \item Then the subset of genes will be % \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{ResDEMus500} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMus500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{ResDEMus500} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMus500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{ResDEMus500} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMus500)} by \texttt{ColumnsCriteria} is >0. \end{itemize} % \end{enumerate} If the user wants to save the files, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} Write \texttt{?GSEApreprocessing} in your console for more information about the function. \subsubsection{Fission data (case 2)}\label{sec:GOenrichmentFission} In this section we use the fission yeast subdataset \textbf{RawCounts\_Leong2014\_FISSIONsub500wt} (see the subsection \nameref{sec:dataLeong2014}) in order to explain \textbf{GSEAQuickAnalysis()} and \textbf{GSEApreprocessing()} in \textbf{case 2}. \subsubsubsection{Gene ontology with the R package gprofiler2} \label{sec:GOenrichmentFissionGprofiler2} The lines of code below realize an enrichment analysis with the R package gprofiler2 for a selection of genes. Beware, an internet connection is needed. The function returns %%%%%%%%%%%%% \begin{itemize} \item a data.frame (output \texttt{metadata(resGSEAgprofiler2Fission)\$Rgprofiler2\$GSEAresults}) giving information about all detected gene ontologies the list of associated genes. \item a lollipop graph (see section \nameref{sec:stepsGO}). The y-axis indicates the \texttt{MaxNumberGO} most significant gene ontologies and pathways associated to the selected DE genes. The gene ontologies and pathways are sorted into descending order. The x-axis indicates the \(-log10(pvalues)\). The higher is a lollipop the more significant is a gene ontology or pathway. A lollipop is yellow if the pvalues is smaller than 0.05 (significant) and blue otherwise. \item A Manhattan plot (see section \nameref{sec:stepsGO}) indicating all genes ontologies ordered according to the functional database (G0::BP, G0::CC, G0::MF and KEGG) \end{itemize} <>= resGSEAgprofiler2Fission<-GSEAQuickAnalysis(Internect.Connection=FALSE, SEresDE=DEyeast500wt, ColumnsCriteria=2, ColumnsLog2ordered=NULL, Set.Operation="union", Organism="spombe", MaxNumberGO=20, Background=FALSE, Display.plots=FALSE, Save.plots=FALSE) # # head(resGSEAgprofiler2Fission$GSEAresults) @ As \textbf{GSEAQuickAnalysis()} requires an internet connection, the input \texttt{Internect.Connection} is set by default to \texttt{FALSE}. Once the user is sure to have an internet connection, the user may set \texttt{Internect.Connection=TRUE} in order to realize the enrichment analysis. The subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{rowData(DEyeast500wt)} \item Then the subset of genes will be % \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{DEyeast500wt} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(DEyeast500wt)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{DEyeast500wt} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(DEyeast500wt)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{DEyeast500wt} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(DEyeast500wt)} by \texttt{ColumnsCriteria} is >0. \end{itemize} % \end{enumerate} If \texttt{ColumnsLog2ordered} is a vector of integers, it corresponds to the columns number of \texttt{rowData(DEyeast500wt)}, the output of \Rfunction{DEanalysisGlobal()}, which must contains \(log_2\) fold change values. The rows of \texttt{rowData(DEyeast500wt)} (corresponding to genes) will be decreasingly ordered according to the sum of absolute \(log_2\) fold change (the selected columns must contain \(log_2\) fold change values) before the enrichment analysis. The enrichment analysis will take into account the genes order as the first genes will be considered to have the highest biological importance and the last genes the lowest. If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \texttt{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} Write \texttt{?GSEAQuickAnalysis} in your console for more information about the function. \subsubsubsection{Preprocessing for GSEA, DAVID, WebGestalt, gProfiler, Panther, ShinyGO, Enrichr and GOrilla} \label{sec:GOenrichmentFissionPreprocessing} The following lines of code will generate all files, for a selection of genes, in order to use the following software and online tools : GSEA, DAVID, WebGestalt, gProfiler, Panther, ShinyGO, Enrichr and GOrilla. <>= resGSEApreprocessingFission<-GSEApreprocessing(SEresDE=DEyeast500wt, ColumnsCriteria=2, Set.Operation="union", Rnk.files=FALSE, Save.files=FALSE) @ The subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{rowData(DEyeast500wt)} \item Then the subset of genes will be \end{enumerate} \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{DEyeast500wt} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(DEyeast500wt)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{DEyeast500wt} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(DEyeast500wt)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{DEyeast500wt} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(DEyeast500wt)} by \texttt{ColumnsCriteria} is >0. \end{itemize} If the user wants to save the files, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \texttt{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} Write \texttt{?GSEApreprocessing} in your console for more information about the function. \subsubsection{Leukemia data (case 3 with only two biological conditions)} \label{sec:GOenrichmentLeuk} In this section we use the Chronic lymphocytic leukemia (CLL) subdataset \textbf{RawCounts\_Schleiss2021\_CLLsub500} (see Subsection \nameref{sec:dataSchleiss2021}) in order to explain \textbf{GSEAQuickAnalysis()} and \textbf{GSEApreprocessing()} in \textbf{case 3} when there only two biological conditions. \subsubsubsection{Gene ontology with the R package gprofiler2} \label{sec:GOenrichmentLeukGprofiler2} The lines of code below realize an enrichment analysis with the R package gprofiler2 for a selection of genes. Beware, an internet connection is needed. The function returns %%%%%%%%%%%%%%%%% \begin{itemize} \item a data.frame (output \texttt{metadata(resGSEAgprofiler2Leuk)\$Rgprofiler2\$GSEAresults}) giving information about all detected gene ontologies the list of associated genes. \item a lollipop graph (see section \nameref{sec:stepsGO}). The y-axis indicates the \texttt{MaxNumberGO} most significant gene ontologies and pathways associated to the selected DE genes. The gene ontologies and patways are sorted into descending order. The x-axis indicates the -log10(pvalues). The higher is a lollipop the more significant is a gene ontology or pathway. A lollipop is yellow if the pvalues is smaller than 0.05 (significant) and blue otherwise. \item A manhattan plot (see section \nameref{sec:stepsGO}) indicating all genes ontologies ordered according to the functional database (G0::BP, G0::CC, G0::MF and KEGG) \end{itemize} <>= resGSEAgprofiler2Leuk<-GSEAQuickAnalysis(Internect.Connection=FALSE, SEresDE=ResDELeuk500, ColumnsCriteria=c(12,13), ColumnsLog2ordered=NULL, Set.Operation="union", Organism="hsapiens", MaxNumberGO=20, Background=FALSE, Display.plots=FALSE, Save.plots=FALSE) # # head(resGSEAgprofiler2Leuk$GSEAresults) @ As \textbf{GSEAQuickAnalysis()} requires an internet connection, the input \texttt{Internect.Connection} is set by default to \texttt{FALSE}. Once the user is sure to have an internet connection, the user may set \texttt{Internect.Connection=TRUE} in order to realize the enrichment analysis. The subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{rowData(ResDELeuk500)} \item Then the subset of genes will be \end{enumerate} \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{ResDELeuk500} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(ResDELeuk500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{ResDELeuk500} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(ResDELeuk500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{ResDELeuk500} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(ResDELeuk500)} by \texttt{ColumnsCriteria} is >0. \end{itemize} If \texttt{ColumnsLog2ordered} is a vector of integers, it corresponds to the columns number of \texttt{rowData(ResDELeuk500)}, the output of \Rfunction{DEanalysisGlobal()}, which must contains \(log_2\) fold change values. The rows of \texttt{rowData(ResDELeuk500)} (corresponding to genes) will be decreasingly ordered according to the sum of absolute \(log_2\) fold change (the selected columns must contain \(log_2\) fold change values) before the enrichment analysis. The enrichment analysis will take into account the genes order as the first genes will be considered to have the highest biological importance and the last genes the lowest. If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} Write \texttt{?GSEAQuickAnalysis} in your console for more information about the function. \subsubsubsection{Preprocessing for GSEA, DAVID, WebGestalt, gProfiler, Panther, ShinyGO, Enrichr and GOrilla} \label{sec:GOenrichmentLeukpreprocessing} The following lines of code will generate all files, for a selection of genes, in order to use the following software and online tools : GSEA, DAVID, WebGestalt, gProfiler, Panther, ShinyGO, Enrichr and GOrilla. <>= resGSEApreprocessingLeuk<-GSEApreprocessing(SEresDE=ResDELeuk500, ColumnsCriteria=c(12,13), Set.Operation="union", Rnk.files=FALSE, Save.files=FALSE) @ The subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{ResDELeuk500\$DE.results} \item Then the subset of genes will be % \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{ResDELeuk500} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(ResDELeuk500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{ResDELeuk500} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(ResDELeuk500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{ResDELeuk500} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(ResDELeuk500)} by \texttt{ColumnsCriteria} is >0. \end{itemize} % \end{enumerate} If the user wants to save the files, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} Write \texttt{?GSEApreprocessing} in your console for more information about the function. \subsubsection{Mouse data 2 (case 3 with more than two biological conditions)} \label{sec:GOenrichmentMus2} In this section we use the mouse subdataset \textbf{RawCounts\_Weger2021\_MOUSEsub500} (see Subsection \nameref{sec:dataWeger2021}) in order to explain \textbf{GSEAQuickAnalysis()} and \textbf{GSEApreprocessing()} in \textbf{case 3} when there are more than two biological conditions. \subsubsubsection{Gene ontology with the R package gprofiler2} \label{sec:GOenrichmentMus2Gprofiler2} The lines of code below realize an enrichment analysis with the R package gprofiler2 for a selection of genes. Beware, an internet connection is needed. The function returns %%%%%%%%%%%%%%%%% \begin{itemize} \item a data.frame (output \texttt{metadata(resGSEAgprofiler2Mouse2)\$Rgprofiler2\$GSEAresults}) giving information about all detected gene ontologies the list of associated genes. \item a lollipop graph (see section \nameref{sec:stepsGO}). The y-axis indicates the \texttt{MaxNumberGO} most significant gene ontologies and pathways associated to the selected DE genes. The gene ontologies and patways are sorted into descending order. The x-axis indicates the -log10(pvalues). The higher is a lollipop the more significant is a gene ontology or pathway. A lollipop is yellow if the pvalues is smaller than 0.05 (significant) and blue otherwise. \item A manhattan plot (see section \nameref{sec:stepsGO}) indicating all genes ontologies ordered according to the functional database (G0::BP, G0::CC, G0::MF and KEGG) \end{itemize} <>= resGSEAgprofiler2Mouse2<-GSEAQuickAnalysis(Internect.Connection=FALSE, SEresDE=ResDEMusBmCrKoWt500, ColumnsCriteria=2:5, ColumnsLog2ordered=NULL, Set.Operation="union", Organism="mmusculus", MaxNumberGO=20, Background=FALSE, Display.plots=FALSE, Save.plots=FALSE) # # head(resGSEAgprofiler2Mouse2$GSEAresults) @ As \textbf{GSEAQuickAnalysis()} requires an internet connection, the input \texttt{Internect.Connection} is set by default to \texttt{FALSE}. Once the user is sure to have an internet connection, the user may set \texttt{Internect.Connection=TRUE} in order to realize the enrichment analysis. The subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{resGSEAgprofiler2Mouse2\$DE.results} \item Then the subset of genes will be \end{enumerate} \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{ResDEMusBmCrKoWt500} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMusBmCrKoWt500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{ResDEMusBmCrKoWt500} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMusBmCrKoWt500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{ResDEMusBmCrKoWt500} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMusBmCrKoWt500)} by \texttt{ColumnsCriteria} is >0. \end{itemize} If \texttt{ColumnsLog2ordered} is a vector of integers, it corresponds to the columns number of \texttt{rowData(ResDEMusBmCrKoWt500)}, the output of \Rfunction{DEanalysisGlobal()}, which must contains \(log_2\) fold change values. The rows of \texttt{rowData(ResDEMusBmCrKoWt500)} (corresponding to genes) will be decreasingly ordered according to the sum of absolute \(log_2\) fold change (the selected columns must contain \(log_2\) fold change values) before the enrichment analysis. The enrichment analysis will take into account the genes order as the first genes will be considered to have the highest biological importance and the last genes the lowest. If the user wants to save the graphs, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} Write \texttt{?GSEAQuickAnalysis} in your console for more information about the function. \subsubsubsection{Preprocessing for GSEA, DAVID, WebGestalt, gProfiler, Panther, ShinyGO, Enrichr and GOrilla} \label{sec:GOenrichmentMus2preprocessing} The following lines of code will generate all files, for a selection of genes, in order to use the following software and online tools : GSEA, DAVID, WebGestalt, gProfiler, Panther, ShinyGO, Enrichr and GOrilla. <>= resGSEApreprocessing<-GSEApreprocessing(SEresDE=ResDEMusBmCrKoWt500, ColumnsCriteria=2:5, Set.Operation="union", Rnk.files=FALSE, Save.files=TRUE) @ The subset of genes is selected as follow \begin{enumerate} \item the user select one or more binary column with the input \texttt{ColumnsCriteria} which corresponds to the column number of \texttt{ResDEMusBmCrKoWt500\$DE.results} \item Then the subset of genes will be \end{enumerate} \begin{itemize} \item If \texttt{Set.Operation="union"} then the rows extracted from the different datasets included in \texttt{ResDEMusBmCrKoWt500} are those such that the sum of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMusBmCrKoWt500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="intersect"} then the rows extracted from the different datasets included in \texttt{ResDEMusBmCrKoWt500} are those such that the product of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMusBmCrKoWt500)} by \texttt{ColumnsCriteria} is >0. \item If \texttt{Set.Operation="setdiff"} then the rows extracted from the different datasets included in \texttt{ResDEMusBmCrKoWt500} are those such that only one element of the selected columns of \texttt{SummarizedExperiment::rowData(ResDEMusBmCrKoWt500)} by \texttt{ColumnsCriteria} is >0. \end{itemize} If the user wants to save the files, the input \texttt{Save.plots} must be \begin{itemize} \item either \texttt{Save.plots=TRUE}, and the graph will be saved in the same location than the input \texttt{path.result} of the function \Rfunction{DEanalysisGlobal()}. \item either a strings of characters giving the path to a folder where the graphs will be saved. The user then choose the path of the folder where results can be saved. \end{itemize} Write \texttt{?GSEApreprocessing} in your console for more information about the function. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session info}\label{sec:sessioninfo} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Here is the output of \Rfunction{sessionInfo()} on the system on which this document was compiled. %%%% <>= utils::toLatex(sessionInfo()) @ % % % %--------------------------------------------------------- \newpage %\section*{Bibliography} \bibliographystyle{apalike}%plain \bibliography{MultiRNAflowBiblio} % %--------------------------------------------------------- \end{document}