--- title: "AffiXcan" author: name: Alessandro Lussana affiliation: > Department of Molecular Biotechnologies and Health Sciences, MBC, University of Turin, Italy email: alessandro.lussana@protonmail.com abstract: > This package includes a set of functions to train and to apply statistical models to estimate GReX (genetically regulated expression). package: AffiXcan 1.3.6 output: BiocStyle::html_document: df_print: paged toc_float: true pdf_document: default vignette: > %\VignetteIndexEntry{AffiXcan} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- # Background Understanding and predicting how genetic variation influences gene expression is of great interest in modern biological and medical sciences. Taking advantage of whole genome sequencing (WGS) and RNA sequencing (RNA-seq) technologies many efforts have been made to link single nucleotide polymorphisms (SNPs) to expression variation in cells and tissues. The present methods to estimate the genetic contribution to gene expression do not take into account functional information in identifying _expression quantitative trait loci_ (eQTL), i.e. those genetic variants that contribute to explaining the variation of gene expression. Relying on SNPs as predictors allows to make significant models of gene expression only for those genes for which SNPs with a fairly good effect size exist, but this condition is not satisfied for the majority of genes, despite their expression having a non-zero heritability (h²). To address this issue, new, different strategies to analyze genetic variability of regulatory regions and their influence on transcription are needed. __AffiXcan__ (total binding AFFInity-eXpression sCANner) implements a functional approach based on the [TBA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0143627) (Total Binding Affinity) score to make statistical models of gene expression, being able to make significant predictions on genes for which SNPs with strong effect size are absent. Furthermore, such a functional approach allows to make mechanistic interpretations in terms of transcription factors binding events that drive differential transcriptional expression. These features are of considerable importance for eQTL discovery and to improve the capability to estimate a [GReX](#grex) (genetically regulated expression) for a greater number of genes, at the same time giving insights on the possible molecular mechanisms that are involved in differential expression of genes. In the effort to grant reproducibility and resources' availability, AffiXcan package includes also the functions to train the GReX models. It is our purpose to expand and enhance these functions and, in the future, to provide data packages to impute GReX on many different tissues with ready-to-use trained models. # GReX ## What is GReX? GReX (genetically regulated expression) is the component of gene expression (here defined as the transcript level, e.g. RPKM) explained by an individual's genetics. The abundance of a transcript in a cell is determined by many factors, including genetics, environmental factors, and disease. It can have an impact on the cell's physiology and alter the expression of other transcripts or proteins, their activity and regulation. Since transcription is initiated by the binding of transcription factors to DNA, a portion of gene expression can be directly explained by variants in _cis_ regulatory regions. ## Why GReX? The estimation of GReX can be useful to perform TWAS when the real total expression profile is unknown or can not be measured, for example in those tissues - like the brain - that are inaccessible to _in vivo_ safe biopsies, or in ancient genomes. GReX can be also exploited to estimate the constitutive susceptibility of a genome to a certain status, the existence of which is at least partially influenced by gene expression. ## Estimate GReX Some efforts have been made to develop computational methods to predict GReX from genotype data using mathematical models. [Gamazon et al.](http://www.nature.com/articles/ng.3367) developed a method consisting of multiple-SNP prediction of expression levels, where the estimated GReX for a gene is given by an additive model in which SNPs are the independent variables. [AffiXcan](https://github.com/alussana/AffiXcan) takes into account the contribution of all polymorphisms of given genomic regions that are associated to the expression of a gene for a specific individual. This is done using affinity scores - [TBA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0143627) (Total Binding Affinity) - between those regions and a set of transcription factors. A principal component analysis (PCA) is performed on these scores and for each expressed gene a linear model is fitted. We observed that the GReX of the majority of genes for which AffiXcan manages to generate a significant model is not predictable by the method cited above. Arguably, this is due to the nature of [TBA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0143627) score, that allows to take into account the additive small effect of all variants in a genomic region. Furthermore, the goodness of prediction achieved by AffiXcan on both shared and non-shared genes was significantly greater. For brief insights on AffiXcan's results in preliminary tests, see [AffiXcan performance](#affixcan-performance) section. # AffiXcan Workflow AffiXcan's estimation of GReX is based on a functional approach that involves a score to quantify the affinity between a Position-specific Weight Matrix (PWM) and a DNA segment: the [Total Binding Affinity](https://journals.plos.org/plosone/article?id=10.1371/\ journal.pone.0143627) (TBA). TBA can be computed using [vcf_rider](http://github.com/vodkatad/vcf_rider) program, starting from phased genotypes in vcf format. ## Training the models Here are described the input files needed by AffiXcan to perform the training phase. The function __affiXcanTrain()__ returns an object that can be later used by __affiXcanImpute()__ to estimate GReX. See help("affiXcanTrain") for usage. ### TBA matrices As a first step, AffiXcan performs a principal component analysis (PCA) on the [TBA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0143627) (Total Binding Affinity) scores for each regulatory region. The user has to provide the paths to __rds__ files that contain TBA matrices, in the form of [MultiAssayExperiment](https://bioconductor.org/packages/release/bioc/html/\ MultiAssayExperiment.html) objects. A toy example of one of these objects is shown below: ```{r} suppressMessages(library(MultiAssayExperiment)) tba <- readRDS(system.file("extdata","training.tba.toydata.rds", package="AffiXcan")) ## This is a MultiAssayExperiment object names(assays(tba)) ``` In this case __assays(tba)__ returns a list of 4 matrices, each of which contains the log₂(TBA) values for a different regulatory region. The matrices must be named with unambiguous identifiers of the regulatory regions. For illustrative purposes a small portion of the TBA matrix of the region named "ENSG00000139269.2" is displayed below. Rows are individual's IDs and columns are PWMs: ```{r} assays(tba)$ENSG00000139269.2[1:5,1:4] ``` Centering and scaling (optional) of TBA values is done before computing principal components. The user has to specify the minimum percentage of variance to be explained by the principal components selected by AffiXcan to train the model's coefficients, in order to achieve a good compromise between sensibility and overfitting. ### Expression matrix AffiXcan needs real expression values to train the models. The user has to specify a [SummarizedExperiment](https://bioconductor.org/packages/release/\ bioc/html/SummarizedExperiment.html) object and the name (here, "values") of the object in __assays()__ that contains the expression matrix. A toy example with only two genes is shown below. In the expression matrix, rows are expressed genes and columns are individual's IDs: ```{r} suppressMessages(library(SummarizedExperiment)) load("../data/exprMatrix.RData") assays(exprMatrix)$values[,1:5] ``` ### Gene - Region(s) associations The user has to provide a table with the association between expressed genes and regulatory regions. Every expressed gene must be associated to at least one regulatory region. To fit the model for one gene, AffiXcan includes the selected principal components of all regulatory regions associated to that gene, e.g.: GReX_geneA ~ PC1_regionA1 + PC2_regionA1 + PC3_regionA1 + PC4_regionA1 + PC1_regionA2 + PC2_regionA2 ... The associations table's header must contain the strings "REGULATORY_REGION" and "EXPRESSED_REGION". An example is shown below: ```{r} load("../data/regionAssoc.RData") regionAssoc[1:3,] ``` Here it can be observed that the expressed gene "ENSG00000139269.2" is associated to three different regulatory regions. The expressed genes' names must be the same as found in the [expression matrix](#expression-matrix) and the regulatory regions' names must be consistent with those used for the [TBA matrices](#training-the-models). ### Pupulation structure covariates Finally, AffiXcan computes p-values for each model. Optionally, population structure covariates for each individual can be passed to __affiXcanTrain__ to be included in the models to assess if the estimation of GReX is significantly independent from the population's genetic structure. Here is shown an example of an R object that can be used for this purpose and that contains the first three PCs of the population structure: ```{r} load("../data/trainingCovariates.RData") head(trainingCovariates) ``` If no population structure covariates are specified, the models' p-value are simply computed from the f statistic of the model summary(model)$fstatistic Benjamini-Hochberg correction for multiple testing is eventually performed on the models' P-values. ## Imputing GReX Here are described the input files needed by AffiXcan to perform the prediction phase. The function __affiXcanImpute()__ uses the output of __affiXcanTrain()__ to compute the imputed GReX values in a population of individuals. See help("affiXcanImpute") for usage. ### TBA matrices TBA values for regulatory regions referring to the population for which we want to estimate GReX are needed. The user has to provide paths to __rds__ files that contain TBA matrices, in the form of [MultiAssayExperiment](https://\ bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) objects. This type of data is described in the [training phase](#training-the-models) section. To apply the models consistently, TBA must be calculated on the same regions and using the same PWM set as done for the training phase. The unambiguous regions' IDs used to name the TBA matrices stored in MultiAssayExperiment objects need to match those used in the training phase. ### Eigenvectors AffiXcan performs a matrix product between TBA values and eigenvectors to obtain the selected principal components that will be used as variables when estimating GReX. Eigenvectors are computed by __affiXcanTrain()__ when performing principal components analysis (PCA) on the training dataset. The user has to specify the object in which the results of the training phase are stored. ### Models' Coefficients For every gene the selected principal components of the TBA are multiplied by the model's coefficients, previously trained on the training dataset by __affiXcanTrain()__. The user has to specify the object in which the results of the training phase are stored. ## Final Output __affiXcanImpute()__ returns a [SummarizedExperiment](https://bioconductor.org/\ packages/release/bioc/html/SummarizedExperiment.html) object containing a matrix with the imputed GReX values. To access it we can use assays()$GReX as shown below. Here it is a toy example to impute the GReX of a single gene in a cohort of 115 individuals. In the GReX matrix the rows are genes and the columns are individual's IDs: ```{r} suppressMessages(library("AffiXcan")) trainingTbaPaths <- system.file("extdata","training.tba.toydata.rds", package="AffiXcan") data(exprMatrix) data(regionAssoc) data(trainingCovariates) assay <- "values" training <- affiXcanTrain(exprMatrix=exprMatrix, assay=assay, tbaPaths=trainingTbaPaths, regionAssoc=regionAssoc, cov=trainingCovariates, varExplained=80, scale=TRUE) testingTbaPaths <- system.file("extdata","testing.tba.toydata.rds", package="AffiXcan") exprmatrix <- affiXcanImpute(tbaPaths=testingTbaPaths, affiXcanTraining=training, scale=TRUE) grexMatrix <- assays(exprmatrix)$GReX as.data.frame(grexMatrix)[,1:5] ``` # Cross-Validation __affiXcanTrain()__ can be used in k-fold cross-validation mode by specifying the argument _kfold_ > 0. For example, with _kfold_ = 5, a 5-fold cross-validation will be performed. Cross-validation mode is not conceived to generate final GReX models. Therefore, the output of __affiXcanTrain()__ in cross-validation mode can not be used by __affiXcanImpute()__ to make GReX imputation on new data, since it consists of a report useful to evaluate the prediction performance for each gene in each fold. In the following example a 5-fold cross-validation is performed on dummy data: ```{r} trainingTbaPaths <- system.file("extdata","training.tba.toydata.rds", package="AffiXcan") data(exprMatrix) data(regionAssoc) data(trainingCovariates) assay <- "values" training <- affiXcanTrain(exprMatrix=exprMatrix, assay=assay, tbaPaths=trainingTbaPaths, regionAssoc=regionAssoc, cov=trainingCovariates, varExplained=80, scale=TRUE, kfold=5) ``` # Parallelization AffiXcan processes can take a certain amount of time to be completed, but all the functions support parallelization. [BiocParallel package](http://\ bioconductor.org/packages/BiocParallel/) is required for parallel evaluation. The user can construct a BiocParallelParam object and pass it as the BPPARAM argument when calling __affiXcanTrain()__ or __affiXcanImpute()__, or leave BPPARAM as default for automatic parallel evaluation. # AffiXcan Performance This section has the only purpose to briefly show the predictive performance obtained using AffiXcan in preliminary tests, and its comparison against the multiple-SNP prediction method described in [Gamazon et al.](http://www.nature.com/articles/ng.3367). Much further work, also regarding other datasets and multivariate models, is still in progress. ## Cross-validation on GEUVADIS dataset AffiXcan models were cross-validated on a cohort of 344 individuals of European descent for whom phased genotype data and expression data (RNA-seq of EBV-transformed lymphocites) are available in the [GEUVADIS public dataset](https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/). The cohort was randomly splitted in a training set of 229 individuals and a testing set of 115 individuals. The training phase was performed on the training set, then the trained models were applied on the testing set to impute GReX. Each gene was associated to only one regulatory region, which consisted in a genomic window spanning upstream and downstream the Transcription Start Site (TSS). The minimum percentage of variance of TBA to be explained by the selected principal components was set to 80. ## Predictive Performance The number of genes (~3000) for which a significant model was generated by AffiXcan was almost identical to the number of genes for which a GReX could be imputed using the method described in [Gamazon et al.](http://www.nature.com/articles/ng.3367) The predictive performance was assessed observing the squared correlation (R²) between the imputed GReX values and the real total expression values for each gene. The overall mean of R² values obtained with AffiXcan was greater than the one obtained with the multiple-SNP method (0.099 vs 0.070) ![](predictive_performance.png) In the graph: The R² values from AffiXcan's predictions, for the >3000 genes for which a GReX could be imputed, are sorted in increasing order. ## Predictive Performance Comparison Remarkably, the overlap between the genes for which an imputed GReX could be computed by the two methods is only slightly greater than one third (1123) of the amount computed by each method. Arguably, this is due to the implementation of [TBA score](https://journals.plos.org/plosone/article?id=10.1371/journal.\ pone.0143627) to take into account the contribution of all genetic variants in a regulatory region, rather then only those SNPs with a greater effect size on gene expression. Supposedly, AffiXcan manages to generate a significant model to estimate GReX in genes where the transcriptional expression is influenced by many variants, each contributing to GReX with a small effect size, where the multiple-SNP prediction method fails to have good statistical predictors. ![](r_squares_correlation.png) In the graph: for each gene for which a GReX could be imputed by both methods, a blue circle is plotted with the coordinates (R² from multiple-SNP method's prediction, R² from AffiXcan's prediction) Observing the squared correlation (R²) between the imputed GReX values and the real total expression values on the shared genes, a Wilcoxon-Mann-Whitney paired test was performed to asses if the two distributions of R² values were significantly different. R² values from AffiXcan proved to be significantly higher: ![](r_squares_differences.png) In the graph: histogram of the differences between R² values: R² from AffiXcan's prediction - R² from multiple-SNP method's prediction (computed for each gene for which a GReX could be imputed by both methods). ## Conclusion In conclusion, AffiXcan could increase the amount of genes for which a GReX can be estimated by a factor >1.6, at the same time enhancing the goodness of prediction.