---
title: "AffiXcan"
author:
name: Alessandro Lussana
affiliation: >
Department of Molecular Biotechnologies and Health Sciences, MBC, University
of Turin, Italy
email: alessandro.lussana@protonmail.com
abstract: >
This package includes a set of functions to train and to apply statistical
models to estimate GReX (genetically regulated expression).
package: AffiXcan 1.3.6
output:
BiocStyle::html_document:
df_print: paged
toc_float: true
pdf_document: default
vignette: >
%\VignetteIndexEntry{AffiXcan}
%\VignetteEngine{knitr::rmarkdown}
%\usepackage[utf8]{inputenc}
---
# Background
Understanding and predicting how genetic variation influences gene
expression is of great interest in modern biological and medical sciences.
Taking advantage of whole genome sequencing (WGS) and RNA sequencing (RNA-seq)
technologies many efforts have been made to link single nucleotide polymorphisms
(SNPs) to expression variation in cells and tissues.
The present methods to estimate the genetic contribution to gene expression do
not take into account functional information in identifying _expression
quantitative trait loci_ (eQTL), i.e. those genetic variants that contribute to
explaining the variation of gene expression. Relying on SNPs as predictors
allows to make significant models of gene expression only for those genes for
which SNPs with a fairly good effect size exist, but this condition is not
satisfied for the majority of genes, despite their expression having a non-zero
heritability (h2). To address this issue, new, different strategies
to analyze genetic variability of regulatory regions and their influence on
transcription are needed.
__AffiXcan__ (total binding AFFInity-eXpression sCANner) implements a functional
approach based on the
[TBA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0143627)
(Total Binding Affinity) score to make statistical models of gene expression,
being able to make significant predictions on genes for which SNPs with strong
effect size are absent. Furthermore, such a functional approach allows to make
mechanistic interpretations in terms of transcription factors binding events
that drive differential transcriptional expression. These features are of
considerable importance for eQTL discovery and to improve the capability to
estimate a [GReX](#grex) (genetically regulated expression) for a greater number
of genes, at the same time giving insights on the possible molecular mechanisms
that are involved in differential expression of genes.
In the effort to grant reproducibility and resources' availability, AffiXcan
package includes also the functions to train the GReX models. It is our purpose
to expand and enhance these functions and, in the future, to provide data
packages to impute GReX on many different tissues with ready-to-use trained
models.
# GReX
## What is GReX?
GReX (genetically regulated expression) is the component of gene expression
(here defined as the transcript level, e.g. RPKM) explained by an individual's
genetics.
The abundance of a transcript in a cell is determined by many factors,
including genetics, environmental factors, and disease. It can have an impact
on the cell's physiology and alter the expression of other transcripts or
proteins, their activity and regulation. Since transcription is initiated by
the binding of transcription factors to DNA, a portion of gene expression can be
directly explained by variants in _cis_ regulatory regions.
## Why GReX?
The estimation of GReX can be useful to perform TWAS when the real total
expression profile is unknown or can not be measured, for example in those
tissues - like the brain - that are inaccessible to _in vivo_ safe biopsies,
or in ancient genomes.
GReX can be also exploited to estimate the constitutive susceptibility of a
genome to a certain status, the existence of which is at least partially
influenced by gene expression.
## Estimate GReX
Some efforts have been made to develop computational methods to predict GReX
from genotype data using mathematical models.
[Gamazon et al.](http://www.nature.com/articles/ng.3367) developed a method
consisting of multiple-SNP prediction of expression levels, where the estimated
GReX for a gene is given by an additive model in which SNPs are the independent
variables.
[AffiXcan](https://github.com/alussana/AffiXcan) takes into account the
contribution of all polymorphisms of given genomic regions that are associated
to the expression of a gene for a specific individual. This is done using affinity scores -
[TBA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0143627)
(Total Binding Affinity) - between those regions and a set of transcription
factors. A principal component analysis (PCA) is performed on these scores and
for each expressed gene a linear model is fitted.
We observed that the GReX of the majority of genes for which AffiXcan manages to
generate a significant model is not predictable by the method cited above.
Arguably, this is due to the nature of
[TBA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0143627)
score, that allows to take into account the additive small effect of all
variants in a genomic region. Furthermore, the goodness of prediction achieved
by AffiXcan on both shared and non-shared genes was significantly greater. For
brief insights on AffiXcan's results in preliminary tests, see
[AffiXcan performance](#affixcan-performance) section.
# AffiXcan Workflow
AffiXcan's estimation of GReX is based on a functional approach that involves a
score to quantify the affinity between a Position-specific Weight Matrix (PWM) and a DNA
segment: the [Total Binding Affinity](https://journals.plos.org/plosone/article?id=10.1371/\
journal.pone.0143627) (TBA). TBA can be computed using
[vcf_rider](http://github.com/vodkatad/vcf_rider) program, starting from phased
genotypes in vcf format.
## Training the models
Here are described the input files needed by AffiXcan to perform the training
phase. The function __affiXcanTrain()__ returns an object that can be later used
by __affiXcanImpute()__ to estimate GReX. See help("affiXcanTrain") for usage.
### TBA matrices
As a first step, AffiXcan performs a principal component analysis (PCA) on the
[TBA](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0143627)
(Total Binding Affinity) scores for each regulatory region. The user has to
provide the paths to __rds__ files that contain TBA matrices, in the form of
[MultiAssayExperiment](https://bioconductor.org/packages/release/bioc/html/\
MultiAssayExperiment.html) objects. A toy example of one of these objects is
shown below:
```{r}
suppressMessages(library(MultiAssayExperiment))
tba <- readRDS(system.file("extdata","training.tba.toydata.rds",
package="AffiXcan")) ## This is a MultiAssayExperiment object
names(assays(tba))
```
In this case __assays(tba)__ returns a list of 4 matrices, each of which
contains the log2(TBA) values for a different regulatory region.
The matrices must be named with unambiguous identifiers of the regulatory
regions. For illustrative purposes a small portion of the TBA matrix of the
region named "ENSG00000139269.2" is displayed below. Rows are individual's IDs
and columns are PWMs:
```{r}
assays(tba)$ENSG00000139269.2[1:5,1:4]
```
Centering and scaling (optional) of TBA values is done before computing
principal components. The user has to specify the minimum percentage of variance
to be explained by the principal components selected by AffiXcan to train the
model's coefficients, in order to achieve a good compromise between sensibility
and overfitting.
### Expression matrix
AffiXcan needs real expression values to train the models. The user has to
specify a [SummarizedExperiment](https://bioconductor.org/packages/release/\
bioc/html/SummarizedExperiment.html) object and the name (here, "values") of
the object in __assays()__ that contains the expression matrix. A toy example
with only two genes is shown below. In the expression matrix, rows are expressed
genes and columns are individual's IDs:
```{r}
suppressMessages(library(SummarizedExperiment))
load("../data/exprMatrix.RData")
assays(exprMatrix)$values[,1:5]
```
### Gene - Region(s) associations
The user has to provide a table with the association between expressed genes and
regulatory regions. Every expressed gene must be associated to at least one
regulatory region. To fit the model for one gene, AffiXcan includes the selected
principal components of all regulatory regions associated to that gene, e.g.:
GReX_geneA ~ PC1_regionA1 + PC2_regionA1 + PC3_regionA1 + PC4_regionA1 +
PC1_regionA2 + PC2_regionA2 ...
The associations table's header must contain the strings "REGULATORY_REGION" and
"EXPRESSED_REGION". An example is shown below:
```{r}
load("../data/regionAssoc.RData")
regionAssoc[1:3,]
```
Here it can be observed that the expressed gene "ENSG00000139269.2" is
associated to three different regulatory regions. The expressed genes' names
must be the same as found in the [expression matrix](#expression-matrix) and the
regulatory regions' names must be consistent with those used for the
[TBA matrices](#training-the-models).
### Pupulation structure covariates
Finally, AffiXcan computes p-values for each model. Optionally, population
structure covariates for each individual can be passed to __affiXcanTrain__ to
be included in the models to assess if the estimation of GReX is significantly
independent from the population's genetic structure.
Here is shown an example of an R object that can be used for this purpose and
that contains the first three PCs of the population structure:
```{r}
load("../data/trainingCovariates.RData")
head(trainingCovariates)
```
If no population structure covariates are specified, the models' p-value are
simply computed from the f statistic of the model summary(model)$fstatistic
Benjamini-Hochberg correction for multiple testing is eventually performed on
the models' P-values.
## Imputing GReX
Here are described the input files needed by AffiXcan to perform the prediction
phase. The function __affiXcanImpute()__ uses the output of __affiXcanTrain()__
to compute the imputed GReX values in a population of individuals.
See help("affiXcanImpute") for usage.
### TBA matrices
TBA values for regulatory regions referring to the population for which we want
to estimate GReX are needed. The user has to provide paths to __rds__ files that
contain TBA matrices, in the form of [MultiAssayExperiment](https://\
bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) objects.
This type of data is described in the [training phase](#training-the-models)
section.
To apply the models consistently, TBA must be calculated on the same regions and
using the same PWM set as done for the training phase. The unambiguous regions'
IDs used to name the TBA matrices stored in MultiAssayExperiment objects need to
match those used in the training phase.
### Eigenvectors
AffiXcan performs a matrix product between TBA values and eigenvectors to obtain
the selected principal components that will be used as variables when estimating
GReX. Eigenvectors are computed by __affiXcanTrain()__ when performing principal
components analysis (PCA) on the training dataset. The user has to specify the
object in which the results of the training phase are stored.
### Models' Coefficients
For every gene the selected principal components of the TBA are multiplied by
the model's coefficients, previously trained on the training dataset by
__affiXcanTrain()__. The user has to specify the object in which the results of
the training phase are stored.
## Final Output
__affiXcanImpute()__ returns a [SummarizedExperiment](https://bioconductor.org/\
packages/release/bioc/html/SummarizedExperiment.html) object containing a matrix
with the imputed GReX values. To access it we can use assays()$GReX as shown
below. Here it is a toy example to impute the GReX of a single gene in a cohort
of 115 individuals. In the GReX matrix the rows are genes and the columns are
individual's IDs:
```{r}
suppressMessages(library("AffiXcan"))
trainingTbaPaths <- system.file("extdata","training.tba.toydata.rds",
package="AffiXcan")
data(exprMatrix)
data(regionAssoc)
data(trainingCovariates)
assay <- "values"
training <- affiXcanTrain(exprMatrix=exprMatrix, assay=assay,
tbaPaths=trainingTbaPaths, regionAssoc=regionAssoc, cov=trainingCovariates,
varExplained=80, scale=TRUE)
testingTbaPaths <- system.file("extdata","testing.tba.toydata.rds",
package="AffiXcan")
exprmatrix <- affiXcanImpute(tbaPaths=testingTbaPaths,
affiXcanTraining=training, scale=TRUE)
grexMatrix <- assays(exprmatrix)$GReX
as.data.frame(grexMatrix)[,1:5]
```
# Cross-Validation
__affiXcanTrain()__ can be used in k-fold cross-validation mode by specifying
the argument _kfold_ > 0. For example, with _kfold_ = 5, a 5-fold cross-validation
will be performed.
Cross-validation mode is not conceived to generate final GReX models. Therefore,
the output of __affiXcanTrain()__ in cross-validation mode can not be used by
__affiXcanImpute()__ to make GReX imputation on new data, since it consists of
a report useful to evaluate the prediction performance for each gene in each
fold.
In the following example a 5-fold cross-validation is performed on dummy data:
```{r}
trainingTbaPaths <- system.file("extdata","training.tba.toydata.rds",
package="AffiXcan")
data(exprMatrix)
data(regionAssoc)
data(trainingCovariates)
assay <- "values"
training <- affiXcanTrain(exprMatrix=exprMatrix, assay=assay,
tbaPaths=trainingTbaPaths, regionAssoc=regionAssoc, cov=trainingCovariates,
varExplained=80, scale=TRUE, kfold=5)
```
# Parallelization
AffiXcan processes can take a certain amount of time to be completed, but all
the functions support parallelization. [BiocParallel package](http://\
bioconductor.org/packages/BiocParallel/) is required for parallel evaluation.
The user can construct a BiocParallelParam object and pass it as the BPPARAM
argument when calling __affiXcanTrain()__ or __affiXcanImpute()__, or leave
BPPARAM as default for automatic parallel evaluation.
# AffiXcan Performance
This section has the only purpose to briefly show the predictive performance
obtained using AffiXcan in preliminary tests, and its comparison against the
multiple-SNP prediction method described in
[Gamazon et al.](http://www.nature.com/articles/ng.3367). Much further work,
also regarding other datasets and multivariate models, is still in progress.
## Cross-validation on GEUVADIS dataset
AffiXcan models were cross-validated on a cohort of 344 individuals of European
descent for whom phased genotype data and expression data (RNA-seq of
EBV-transformed lymphocites) are available in the
[GEUVADIS public dataset](https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/).
The cohort was randomly splitted in a training set of 229 individuals and a
testing set of 115 individuals. The training phase was performed on the training
set, then the trained models were applied on the testing set to impute GReX.
Each gene was associated to only one regulatory region, which consisted in a
genomic window spanning upstream and downstream the Transcription Start Site
(TSS). The minimum percentage of variance of TBA to be explained by the selected
principal components was set to 80.
## Predictive Performance
The number of genes (~3000) for which a significant model was generated by
AffiXcan was almost identical to the number of genes for which a GReX could be
imputed using the method described in
[Gamazon et al.](http://www.nature.com/articles/ng.3367)
The predictive performance was assessed observing the squared correlation
(R2) between the imputed GReX values and the real total expression
values for each gene. The overall mean of R2 values obtained with
AffiXcan was greater than the one obtained with the multiple-SNP method
(0.099 vs 0.070)
![](predictive_performance.png)
In the graph: The R2 values from AffiXcan's predictions, for the
>3000 genes for which a GReX could be imputed, are sorted in increasing order.
## Predictive Performance Comparison
Remarkably, the overlap between the genes for which an imputed GReX could be
computed by the two methods is only slightly greater than one third (1123) of
the amount computed by each method. Arguably, this is due to the implementation
of [TBA score](https://journals.plos.org/plosone/article?id=10.1371/journal.\
pone.0143627) to take into account the contribution of all genetic variants in
a regulatory region, rather then only those SNPs with a greater effect size on
gene expression. Supposedly, AffiXcan manages to generate a significant model
to estimate GReX in genes where the transcriptional expression is influenced by
many variants, each contributing to GReX with a small effect size, where the
multiple-SNP prediction method fails to have good statistical predictors.
![](r_squares_correlation.png)
In the graph: for each gene for which a GReX could be imputed by both methods,
a blue circle is plotted with the coordinates (R2 from multiple-SNP
method's prediction, R2 from AffiXcan's prediction)
Observing the squared correlation (R2) between the imputed GReX
values and the real total expression values on the shared genes, a
Wilcoxon-Mann-Whitney paired test was performed to asses if the two
distributions of R2 values were significantly different.
R2 values from AffiXcan proved to be significantly higher:
![](r_squares_differences.png)
In the graph: histogram of the differences between R2 values:
R2 from AffiXcan's prediction - R2 from multiple-SNP
method's prediction (computed for each gene for which a GReX could be imputed by
both methods).
## Conclusion
In conclusion, AffiXcan could increase the amount of genes for which a GReX can
be estimated by a factor >1.6, at the same time enhancing the goodness of
prediction.