--- title: STATegRa User's Guide author: The STATegra Consortium date: "2014-10-01" bibliography: bibliography.bib output: BiocStyle::html_document: toc: true fig_retina: false vignette: > %% \VignetteEngine{knitr::rmarkdown} %% \VignetteIndexEntry{STATegRa User's Guide} --- ```{r, echo=F, message=F, results="asis"} # this block is invisible BiocStyle::markdown() require(STATegRa) require(Biobase) require(gridExtra) require(ggplot2) g_legend<-function(a.gplot){ tmp <- ggplot_gtable(ggplot_build(a.gplot)) leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box") legend <- tmp$grobs[[leg]] return(legend)} ``` # Introduction Recent developments in high-throughput technologies for studying biological systems enables the researcher to simultaneously obtain several different types of data ("omics") over the course of an experiment. There exist many techniques for analysing the behaviour of these omics individually, but combining multiple classes of omics data can be used to give a better understanding of the biological system in question; the whole is greater than the sum of its parts. Integration of different types of omics data is an increasingly important technique for studying biological systems. The first step in this kind of integration analysis is to identify patterns in data shared by all the omics classes, and use these patterns to identify outliers. The most common techniques for this sort of analysis are clustering and principal components analysis. The `STATegRa` package provides several different techniques for the evaluation of reproducibility among samples and across experimental conditions by combining the information contained multiple omics datasets. This is intended as a starting point for further integration analysis of any multi-omics dataset. The `STATegRa` package implements two main utilities for this purpose: component analysis and clustering. [Component Analysis](#omics-component-analysis) ~ Three different techniques for analysing the common and distinctive variability between two different, multi-omics datasets are provided, along with various utility functions and plotting tools to evaluate the results. [Clustering](#omics-clustering) ~ Methods are provided to cluster together features across different omics types, with a view to finding interesting similarities between features (rather than similarities between samples). Furthermore, the next important step is the identification of genes, which are differential expressed between the experimental conditions under study, according to the different data types considered as a whole. The `STATegRa` package provides the possibility to identify differential expressed genes which are supported by at least one or more data types. One main utility is implemented for this purpose: holistOmics. [HolistOmics](#HolistOmics) ~ HolistOmics, which is specifically devised for ‘holistically’ combining different omics data, identifies differentially expressed genes between the experimental conditions under study. It applies the NonParametric Combination (NPC) methodology [@Pesarin2010] in order to combine all available information to identify genes which are supported by at least one or more data types. This guide provides an overview of the different techniques included in the package, some worked examples of using the tools and some guidance on interpretation of the results obtained. # Getting Started The `r Biocpkg("STATegRa")` package can be obtained from the [Bioconductor repository](http://www.bioconductor.org/). Load the `STATegRa` package into an `R` session by typing: ```{r, eval=F} library(STATegRa) # Load STATegRa package ``` General information about usage of the package and the algorithms used can be found in the package vignette. In addition, every public function in the package is documented and help can be found in the normal R fashion: ```{r, eval=F} help(package="STATegRa") ## Package help ?omicsCompAnalysis ## Specific function help ``` # Omics Component Analysis ## Overview The joint analysis of multiple omic datasets, both containing different classes of data and from different experimental conditions, could provide a "global" view on the biological system of interest. The major challenge in this type of analysis is to distinguish between the underlying mechanisms affecting all datasets, and the particular mechanisms which affect each omic dataset separately. Three different methods are provided to this end: *DISCO-SCA* [@van2012disco; @schouteden2013sca; @schouteden2014performing], *JIVE* [@lock2013joint] and *O2PLS* [@trygg2003o2]. Each method provides the user with a decomposition of the variability of the composite data into common and distinctive variability. All of them are based on singular value decomposition (SVD) of the data matrix, however they use different models to accomplish this. The *DISCO-SCA* [@van2012disco] approach consists of two steps. First a Simultaneous Components Analysis (SCA) is performed, then the scores obtained are rotated into a *DIS*tinctive and *CO*mmon structure (hence, *DISCO*). Therefore, by applying SCA approach, each block of data $X_k$ of size $I\times J_k$ becomes $$X_k=TP_k^T+E_k$$ with $T$ the $I\times R$ matrix of components scores that is shared between all blocks and $P_k$ the $J_k\times R$ matrix of components loadings for block $k$. Then, a rotation criterion is used where the target is the rotation which specifies distinctive components as components having zero scores in the positions that correspond to the data blocks the component does not underlie, and the remaining entries are arbitrary. The rotation matrix $B$ is found by minimizing $min(B)||W \circ (P_{target}-[P_1^TP_2^T]B)||^2 \mbox{ such that } B^TB=I=BB^T$, where $W$ is a binary matrix having ones in the positions of the entries in the target and zero elsewhere. The *JIVE* approach [@lock2013joint] model is as following: Let $X_1,X_2$ be two blocks of data and $X=[X_1,X_2]$ represent the joint data, then the JIVE decomposition is defined as: $$X_i=J_i+A_i+\epsilon_i \mbox{, }i=1,2$$ where $J=[J_1,J_2]$ is the $p\times n$ matrix of rank $r] data-set. The dataset used for this section was obtained from the dataset described in [@van2012disco] and is available from TCGA processed data. We extracted the [classification](https://tcga-data.nci.nih.gov/docs/publications/gbm_exp/TCGA_unified_CORE_ClaNC840.txt) and the [unified gene expression](https://tcga-data.nci.nih.gov/docs/publications/gbm_exp/unifiedScaledFiltered.txt) from TCGA. The miRNA was downloaded from TCGA directly. The full dataset can be loaded by typing: ```{r} data("STATegRa_S1") ls() ``` `Block1` includes mRNA data and `Block2` includes miRNA data. ### Computing the distance between genes by using mRNA data: the bioDistclass class Firstly, we generate an `ExpressionSet` object for both the miRNA and mRNA data. ```{r} # Block1 - Expression data mRNA.ds <- createOmicsExpressionSet(Data=Block1, pData=ed, pDataDescr=c("classname")) # Block2 - miRNA expression data miRNA.ds <- createOmicsExpressionSet(Data=Block2, pData=ed, pDataDescr=c("classname")) ``` Secondly, we compute the distance between all genes in `Block1` (mRNA data) using Spearman correlation. ```{r} # Create Gene-gene distance computed through mRNA data bioDistmRNA <- bioDistclass(name="mRNAbymRNA", distance=cor(t(exprs(mRNA.ds)), method="spearman"), map.name="id", map.metadata=list(), params=list()) ``` The `bioDistmRNA` object, generated with the `bioDistclass` function, is a `bioDistclass` object that contains both the original data and the computed distance between features. ### Loading the map between miRNA and genes: the bioMap class In this section we load and store the map between miRNA and mRNA. Data file (`STATegRa_S2`) contains, as a processed matrix, the information available from TargetScan [@targetscan], which provided a set of miRNA target predictions for humans. ```{r} data(STATegRa_S2) ls() ``` This data is stored in a `bioMap` class object generated through the `bioMap` function as follows: ```{r} MAP.SYMBOL <- bioMap(name = "Symbol-miRNA", metadata = list(type_v1="Gene", type_v2="miRNA", source_database="targetscan.Hs.eg.db", data_extraction="July2014"), map=mapdata) ``` ### miRNA-Surrogate gene Distances: the bioDist function The `bioDist` function returns a `bioDistclass` object. The input is a reference feature list (genes in this example), surrogate data (miRNA, in `Block2`) and the bioMap object between reference and surrogate features. ```{r} bioDistmiRNA <- bioDist(referenceFeatures=rownames(Block1), reference="Var1", mapping=MAP.SYMBOL, surrogateData=miRNA.ds, referenceData=mRNA.ds, maxitems=2, selectionRule="sd", expfac=NULL, aggregation="sum", distance="spearman", noMappingDist=0, filtering=NULL, name="mRNAbymiRNA") ``` ### Computing weighted distances: the bioDistW function Having `bioDistmiRNA` and `bioDistmRNA` `bioDistclass` objects containing distances between genes, we aim to use weighted combinations of them to compute an single distance matrix. First we make a list of `bioDistclass` objects: ```{r} bioDistList <- list(bioDistmRNA, bioDistmiRNA) ``` Secondly we make a matrix listing containing the weighted combinations to be generated. Each row is interpreted as a combination to generate, with the elements of the row interpreted as the weight for each of the input omics. ```{r} sample.weights <- matrix(0, 4, 2) sample.weights[, 1] <- c(0, 0.33, 0.67, 1) sample.weights[, 2] <- c(1, 0.67, 0.33, 0) sample.weights ``` This matrix corresponds to generating four combinations, with the first consisting of $0\times mRNA + 1\times miRNA$ and so on. Finally, the `bioDistW` function computes the weighted combinations in the weights matrix and stores it into a `bioDistWclass` list. ```{r} bioDistWList <- bioDistW(referenceFeatures=rownames(Block1), bioDistList=bioDistList, weights=sample.weights) length(bioDistWList) ``` ## Plots ### Plotting the feature distance of each weighted combination Each `bioDistWclass` object contains a distance matrix computed through a weighted combination of distances derived from different omics. By considering the distances between these distance matrices we can project in two dimensions using Multi-Dimensional Scaling. By this approach we can visualize the effect of the different weights on the feature-to-feature distance structure. To generate such a plot: ```{r, warning=F} bioDistWPlot(referenceFeatures=rownames(Block1), listDistW=bioDistWList, method.cor="spearman") ``` ### Plotting associated features The purpose of this analysis is to generate an overall distance measure between features, so it follows that given a feature of interest we will want to find other features that are near it. For this example we use the gene IDH1, which was shown to be of relevance in the original data analysis. In order to find all other genes for which at least one weighted combination has a correlation greater than $0.7$, we do: ```{r} IDH1.F <- bioDistFeature(Feature="IDH1", listDistW=bioDistWList, threshold.cor=0.7) ``` The `bioDistFeature` function generates a matrix of associated genes (columns) depending on weighted combinations (rows); rows are named by the `bioDistWclass`'s name slot. The `IDH1.F` matrix can be plotted with the `bioDistFeaturePlot` function, as shown below. (This function is a wrapper around `heatmap.2` from `r CRANpkg("gplots")` with appropriate options, but `IDH1.F` is a normal R matrix and can be used with most other matrix-plotting tools). ```{r, message=F} bioDistFeaturePlot(data=IDH1.F) ``` ### OmicsClustering Requirements The requirements for running OmicsClustering are minimal. The data considered has to be compatible with the distance measure selected. The relevant aspect is that the mapping between features needs to be `informative enough`. We do not consider the use of OmicsClustering when the mapping between features involves less than 15-25% of the reference set of features; this number was obtained from preliminary analysis over few data sets however further investigation is being conducted. # HolistOmics ```{r, echo=F, message=F} # clear the environment before starting the holistOmics example rm(list=ls()) ``` ## Overview ### The Problem In recent years, the advantages of omics technologies made possible to measure several data modalities (ChIP-sequencing, RNA-sequencing, protein expression) on a system of interest, in both treated and untreated samples. In such settings one of the most common and important biological questions is the identification of genes, which are differential expressed between the two conditions, according to the different data modalities considered as a whole. ### The holistOmics Approach HolistOmics is a novel application of the Non Parametric Combination (NPC) methodology [@Pesarin2010], specifically tailored for the idiosyncrasies of omics data. First, each datatype is analyzed independently using the appropriate method. Currently, holistOmics analyses static(one time point) RNAseq data, using voom+limma method, and static microarray data, using limma. In the future holistOmics will be extended to analyse more types of data as well as data with several time points measurements. The resulting p-values are combined employing Fisher, Liptak and Tippett combining functions. Tippett function returns findings which are supported by at least one omics modality. Liptak function returns findings which are supportd by most modalities. Fisher function has an intermediate behavior between those of Tippett and Liptak. Several important features make the application appropriate to solve the problem under discussion. A) HolistOmics is able to perform integrative analysis of different data modalities employing **minimal assumptions**, as permutation is employed throughout the process. B) Its final output is a **p-value** that is an easily interpretable metric. C) It is characterized by **great flexibility** as it returns results, which are supported by at least one modality and assigns a lower p-value to findings which are supported by more modalities. D) It frees the researcher from the necessity to define and **model the dependence relations** among different data modalities [@Pesarin2010]. A worked example with code is given below. ## Usage ### Loading the data For this example we use an existing TCGA^[The Cancer Genome Atlas, ] data-set. We downloaded sixteen tumour samples and the sixteen matching normal, for Breast invasive carcinoma, BRCA, batch 93. We used three types of data modalities, “RNAseq”, “RNAseqV2” and “Expression-Gene”. The “Data Level” was set to Level 3. RNAseq corresponds to RNA sequensing data, IlluminaHiSeq-RNASeq platform, RNAseqV2 corresponds to RNA sequensing data, IlluminaHiSeq-RNASeqV2 platform and Expression-Gene corresponds to array based expression data, AgilentG4502A-07-3 platform. More information concerning data types can be fould here: . For RNAseq and RNAseqV2 data we keep the "raw counts". For each data type, we pooled all data to one matrix, where rows corresponded to genes and columns to samples. Then, we aligned columns and rows across matrices so as each gene and each sample to correspond to the same position. In this example the first 100 genes will be used. Finally, each matrix converted to an ExpressionSet object and the resulting objects were saved in a list. The datasets can be loaded by typing: ```{r} data("TCGA_BRCA_Batch_93") ls() ``` `TCGA_BRCA_Data` is a list, which includes three objects of the ExpressionSets class. The matrixes containing the expression values can be visualized by typing: ```{r, results='hide'} exprs(TCGA_BRCA_Data$RNAseq) # displays the RNAseq data exprs(TCGA_BRCA_Data$RNAseqV2) # displays the RNAseqV2 data exprs(TCGA_BRCA_Data$Microarray) # displays the Exp-Gene data ``` Each row corresponds to a gene and each column to a sample. Normal samples are located from column 1 to 16, whereas tumor samples are located from column 17 to 32. The class of each sample can be visualized by typing ```{r, results='hide'} pData(TCGA_BRCA_Data$RNAseq) # class of RNAseq samples pData(TCGA_BRCA_Data$RNAseqV2) # class of RNAseqV2 samples pData(TCGA_BRCA_Data$Microarray) # class of Exp-Gene samples ``` Please note that genes and samples are the same across all types of data. 0 corresponds to normal samples and 1 to tumor samples. ### Setting the dataTypes variable The dataTypes includes a character vector with possible values: 'RNA-seq', 'Microarray'. Three ExpressionSets are analyzed in this example, RNAseq, RNAseqV2 and Microarray. The dataTypes includes three strings, the first two are "RNAseq" and the third is "Microarray". The ExpressionSets, which correspond to "RNAseq", are analyzed employing voom+limma method whereas ExpressionSets, which correspond to "Microarray", are analyzed using limma method. ```{r} dataTypes <- c("RNAseq", "RNAseq", "Microarray") ``` ### Setting the comb.method variable The comb.method is a character vector with possible values: 'Fisher', 'Liptak', 'Tippett'. All combining functions can be used simultaneously. ```{r} comb.method = c("Fisher", "Liptak", "Tippett") ``` ### Setting numPerm, numCores and verbose variables numPerm is the number of permutations to perform. The number of permutations can be adjusted depending on the number of available samples. In general 1000 permutation is usually a good point to start. ```{r} numPerm = 1000 ``` HolistOmics can perform the desired permutations in parallel. numCores specifies the number of cores to be used. In this example one core is used, but a higher number could also be set. Using one core this example runs in ~54 seconds whereas using two or eight cores it takes ~52 and ~24 seconds, respectively. ```{r} numCores = 1 ``` verbose is a logical. If it is set to TRUE, holistOmics prints out the step that it performs. Here it is set to TRUE. ```{r} verbose = TRUE ``` ### Run holistOmics analysis. To run holistOmics analysis one main function is available. Run the analysis by typing: ```{r} results <- holistOmics(dataInput = TCGA_BRCA_Data, dataTypes = dataTypes, comb.method = comb.method, numPerm = numPerm, numCores = numCores, verbose = verbose) ``` Results is a data.frame of p-values. Each row corresponds to a gene name. Each column corresponds to a method used in the analysis. Columns: 1. RNAseqPrograms, includes the p-values which are produced by voom+limma when the first data modality (RNAseq) is analyzed. 2. RNAseqPrograms1, includes the p-values, which are produced by voom+limma when the second data modality (RNAseqV2) is analyzed. 3. MicroarrayPrograms, includes the p-values, which are produced by limma when the third data modality (Microarray) is analyzed. 4. RNAseqPerm, includes the p-values, which are produced using permutation and refer to the first data modality, RNAseq. 5. RNAseqPerm.1,includes the p-values, which are produced using permutation and refer to the second data modality, RNAseqV2. 6. MicroarrayPerm, includes the p-values, which are produced using permutation and refer to the third data modality, Microarray. 7. Fisher, includes the p-values, which are produced by combining the p-values from the forth, fifth and sixth columns employing Fisher function. 8. Liptak, includes the p-values, which are produced by combining the p-values from the forth, fifth and sixth columns employing Liptak function. 9. Tippett includes the p-values, which are produced by combining the p-values from the forth, fifth and sixth columns employing Tippett function. # References