--- title: "MEAL case example" subtitle: "Carlos Ruiz, Carles Hernández, Juan R. González" author: | | Center for Research in Environmental Epidemiology (CREAL), Barcelona, Spain | Bioinformatics Research Group in Epidemiolgy (BRGE) | () date: "`r Sys.Date()`" package: "`r pkg_ver('MEAL')`" output: BiocStyle::html_document: number_sections: true toc: yes fig_caption: yes fig_height: 3 fig_width: 4 bibliography: ./vignette.bib vignette: > %\VignetteIndexEntry{MEAL case example} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, echo = FALSE} fn = local({ i = 0 function(x) { i <<- i + 1 paste('Figure ', i, ': ', x, sep = '') } }) ``` # Introduction In this vignette, a workflow to analyse methylation and expression data will be presented. It will be also shown how to consider SNP data, since it has been demonstrated that the existence of meQTL and eQTL could confound the results. The example data will be taken from `r Rpackage("MEALData")` package (latest version available at [BRGE web](http://www.creal.cat/jrgonzalez/software.htm)), a data package obtained from GEO [GSE53261](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53261). The dataset contains data from untransformed human fibroblasts. The script with the processing steps from the GEO to the final datasets can be found in the folder data_raw of `r Rpackage("MEALData")` package. Let's start by loading the required packages. ```{r Load_Packages, message = FALSE} library(MEAL) library(MultiDataSet) library(MEALData) library(GenomicRanges) ``` There are four objects in `r Rpackage("MEALData")`: betavals, pheno, eset and snps. Let's take a look to each of these objects. betavals and pheno belongs to the methylation experiment. betavals is a matrix of beta values corresponding to an Infinium HumanMethylation 450K BeadChip and pheno a data.frame with the phenotypes. ```{r Methylation_Data} betavals[1:5, 1:5] dim(betavals) summary(pheno) ``` betavals contains data from 451448 probes and 61 samples. pheno contains the same number of samples and three variables: gender, source and inv. gender and source come from the original data while inv correspond to the inversion genotypes for the inversion located in chromosome 17q21.31 [@Li2014] computed using invClust [@Caceres2015]. There is approximately the same number of cells coming from male and female donnors. Source is composed of Coriell cell lines and cells with unknown source. A quick overview shows that there are no homozygous for the inverted phenotype due to its reduced frequency. eset is an `ExpressionSet` with the expression data corresponding to an Illumina HumanRef-8 v3.0 expression beadchip. ```{r Expression_Data} eset summary(pData(eset)) ``` There are 21916 expression probes and 61 samples. The phenotype data contains the variables gender and inv that are equivalent to that of pheno. Finally, let's take a look at snps data. snps is a list containing two objects, the genotypes and the mapping. ```{r Snps_Data} str(snps) ``` Genotypes are enclosed in a matrix and can be retrieved by: ```{r} snps$genotypes[1:5, 1:5] ``` SNP probes mapping can be retrieved by typing `snps$map`. Here, the original object has been modified to fit the requirements of MEAL objects: SNPs annotation data.frame must contain a column called `position` and a column called `chromosome` with the name of the chromosome. ```{r} head(snps$map) ``` Let's illustrate the use of this package by analyzing the effect of the inversion genotypes in methylation and expression. First, a genome wide analysis will be performed and then the region funcionalities will be introduced. It should be noticed that methylation and expression analyses will include hypothesis testing and visualitzation. # Methylation Analysis This demonstration will show the main functions needed to perform a methylation analysis. A more exhaustive description of these and other auxiliary functions can be found at `r Rpackage("MEAL")` vignette. Let's start the analysis. The first step is to create the `MethylationSet` with the betas, the phenotype and the snps. ```{r Meth_Object, message = FALSE} methSet <- prepareMethylationSet(matrix = betavals, phenotypes = pheno) table(fData(methSet)$chr) ``` As it can be seen, more than 10000 probes are placed in sexual chromosomes. Given that they vary a lot depending on sex, some authors recommend to not include them in the analyses to avoid introducing confounding results. This step can be easily done with the following code: ```{r Filtering_Meth} annot <- fData(methSet) autosomiccpgs <- rownames(annot)[!annot$chr %in% c("chrX", "chrY")] methSet <- methSet[autosomiccpgs, ] methSet ``` Once the object is filtered, we are now ready to run the analysis with `DAPipeline`. It can work with an `ExpressionSet` or a `MethylationSet`. In both cases, for each feature, it computes the effect of the variables indicated in the parameter variable\_names. `DAPipeline` also allows the inclusion of adjustment variables (such as cell count) with the parameter covariable\_names. In addition, if a `MethylationSet` is used, algorithms to detect differentially methylated regions will be run (Bumphunter and DMRcate). In our case, we will set variable\_names to "inv". If we would like to analyse more than one variable, a vector containing all the variables names could be used. Finally, probe\_method is set to "ls" (least squares) to speed up the demonstration. If we would like to use robust linear regression, the option is "robust". ```{r Meth_Analysis} methRes <- DAPipeline(set = methSet, variable_names = "inv", probe_method = "ls") methRes ``` The analysis generates an `AnalysisResults` object containing the results. The printing of the object shows that 3 probes have been detected as differentially methylated. There are two plots that gives us a quick overview of the results of the analysis. A Manhattan plot provides an overview of the distribution of the probes (figure 1). We can observe that the differentially methylated probes are clustered in chromosome 17. On the other hand, with the QQplot we can assess the validity of the model (figure 2). Almost all of the points are on the theoretical line and only the differentially methylated probes are separated, thus no further adjustement seems to be required. ```{r Meth_Manhattan, fig.cap = fn("Manhattan plot of methylation results. Some differentially methylated probes are detected in the 17q21.31 region.")} plotEWAS(methRes) ``` ```{r Meth_QQplot, fig.cap = fn("QQplot of methylation analysis. All the p-values are on the theoretical line but the three probes differentially methylated.")} plotQQ(methRes) ``` With this analysis, we see that the inversion genotypes can affect the methylation values of the probes inside the inversion region. \newpage # Expression Analysis The expression analysis can be performed in the same way than methylation. The main difference is that an `ExpressionSet` will be used intead of a `MethylationSet`. In order to assure consistent results, `ExpressionSet`s passed to `DAPipeline` needs to follow some constraints: its fData should contain columns chromosome, start and end. Let's check our object: ```{r Exp show} eset fvarLabels(eset) ``` We have this data in our object but the column containing the chromosomes is called **chr** instead of **chromosome**. In the following lines, we will fix it and we will run the analysis. ```{r Exp Analysis} colnames(fData(eset))[colnames(fData(eset)) == "chr"] <- "chromosome" expRes <- DAPipeline(eset, variable_names = "inv", probe_method = "ls") expRes ``` ```{r Expr_Manhattan, fig.cap = fn("Manhattan plot of expression results. No probe is differentially expressed after multiple testing correction.")} plotEWAS(expRes) ``` ```{r ExprQQplot, fig.cap = fn("QQplot of expression results. Most of the p-values are below the theoretical line so there is a great deflation.")} plotQQ(expRes) ``` There are no significant probes after multiple testing (figure 6) and the QQplot shows a great deflation (figure 7). This situation usually happens when there are hidden variables (such as batch) that we are not taking into account so we could try surrogate variables analysis (SVA). SVA tries to infer these hidden variables, which we can include in our model. `DAPipeline` performs this process automatically by setting `sva = TRUE`. ```{r Exp Analysis SVA} expResSVA <- DAPipeline(eset, variable_names = "inv", probe_method = "ls", sva = TRUE) expResSVA ``` ```{r ExprManhattanSVA, fig.cap = fn("Manhattan plot of expression results using SVA. Again, no probe is differentially expressed after multiple testing correction.")} plotEWAS(expResSVA) ``` ```{r ExprQQSVA, fig.cap = fn("QQ plot of expression results using SVA.")} plotQQ(expResSVA) ``` We can see that there are no probes with a significant expression change after bonferroni correction (figure 8). The QQ-plot still shows a bit of deflation but the points are closer to the theoretical values (figure 9). Consequently, SVA has improved the analysis a bit but not enough to detect differentially expressed probes. ```{r ExprVolcanoSVA, fig.cap = fn("Volcano of expression results using SVA. All the probes are inside the non significant region.")} plotVolcano(expResSVA) ``` ```{r} head(probeResults(expResSVA)[[1]]) ``` In figure 10, a Volcano plot of the results is shown. Given that none of the points has a log fold change bigger than 1, none of the points is labeled. It should be noticed that although an `ExpressionSet` can be the input for `DAPipeline` and an `AnalysisResults` object is generated, not all functions will be available. The two functions that might not work properly are `plotRegionR2` and `plotRegion`. ## Region Analysis Region analysis is also available for expression data. Now, the region analysis will be run directly on the `ExpressionSet` so the SNPs effect will not be considered: ```{r Exprs Region Analysis} range <- GRanges(seqnames = Rle("chr17"), ranges = IRanges(43000000, end = 45000000)) exprsRegion <- DARegionAnalysis(set = eset, range = range, variable_names = "inv", probe_method = "ls") exprsRegion ``` As in the methylation analysis, the returned object is an `AnalysisRegionResults`. ```{r ExprsRDA, fig.cap = fn("RDA plot for the range analysis for expression data. The different genotypes are very closed but separated."), fig.width=8, fig.height=5} plotRDA(exprsRegion) ``` The RDA plot shows that NI/NI and NI/I samples are grouped but they are very close (figure 11). However, the R^2^ and the p-value discard the possible correlation between expression and the inversion. ```{r RDA hits expression} rdahits <- topRDAhits(exprsRegion) rdahits ``` The `topRDAhits` function cannot detect any probe correlated with the inversion phenotype, reinforcing the idea that expression and inversion haplotypes are not correlated. \newpage # Methylation and expression integration In this last step, the correlation between expression and methylation will be studied. There are two functions in `MEAL` that can compute the relationship between these two omics datasets. `correlationMethExprs` looks for expression probes that are correlated to differentially methylated probes. On the other hand, `multiCorrMethExprs` tests if, in a chromosomic region, the methylation pattern can explain the expression pattern. Both functions requires a `MultiDataSet`, so let's create a new `MultiDataSet` with expression and methylation data. ```{r New Multi Meth Exp} multi2 <- new("MultiDataSet") multi2 <- add_genexp(multi2, eset) multi2 <- add_methy(multi2, methSet) ``` It should be noticed that `add_genexp` share the same constraints for the `ExpressionSet`, so only objects with fData containing chromosomes, start and end columns will be accepted. Another interesting feature is that `MultiDataSet` adding functions ensure that sample names are the same for all sets. To do so, when a new set is added, samples of the new set and samples of the multiset are filtered to only contain the common samples. ## Methylation expression correlation There is a general procedure to integrate methylation and expression. The method starts with a list of differentially methylated CpGs (DMPs) obtained from a single methylation analysis (e.g. comparing cases vs. controls, exposed vs. non-exposed ...). Then, DMPs are paired to nearby expression probes and the relationship between them is tested. In MEAL, the function `correlationMethExprs` applies this method as it was described in [@Steenaard2015]. DMPs and expression probes are paired by proximity. Expression probes are paired to a DMP if they are completely inside a range of $\pm$ 250Kb from the DMP. This distance of 250Kb is set by default but it can be changed with the parameter flank (whose units are bases). The correlation between methylation and expression is done by a linear regression. To account for technical (e.g. batch) or biological (e.g. sex, age) artifacts, a model including those variables (z) is fitted: $$ x_{ij} = \sum_{k = 1}^K{\beta_{ik} z_{kj}} + r_{ij}, i = 1, ..., P $$ where $x_{ij}$ is the methylation or expression level of probe i (P is the total number of probes) for individual j, $\beta_{ik}$ is the effect of variable k at probe i, $z_{kj}$ is the value of variable k (K is the total number of covariates) for indivudal j and $r_{ij}$ is the residual value of probe i and individual j. The residuals of methylation and expression are used to assess the correlation. These analyses take the most significant CpGs so we will retrieve the top 50 CpGs and we will look for correlated expression probes. Using `add_methy` in a `MultiDataSet` containing methylation, the methylation data is substituted, and the same applies for `add_genexp` and expression: ```{r Corr Meth Exp} topcpgs <- feats(methRes) methExprs <- correlationMethExprs(multi2, sel_cpgs = topcpgs) head(methExprs) ``` The results shows the CpG name, the expression probe, the change of the relationship and se, and the p-value and adjusted p-value (using B&H). Results are ordered by the adjusted p-value, so we can than in our data there are no correlated CpGs-expression probes. # Conclusions This case example shows the main functionalities of `r Biocpkg("MEAL")` package. It has been shown how to evaluate the effect of a phenotype on the methylation and on the expression at genome wide and at region level. Finally, the integration between methylation and expression is tested. First, we have looked if there are expression probes correlated to the differentially methylated probes. Then, we have checked if methylation and expression are correlated between them and with the inversion haplotypes in the region of the inversions. The results suggests that the inversion can modify the methylation pattern but there is no effect onk the expression profile. # References