---
title: "INTACT: Integrate TWAS and Colocalization Analysis for Gene Set Enrichment Analysis"
shorttitle: "INTACT"
author:
- name: Jeffrey Okamoto
  affiliation: 
  - &id Department of Biostatistics, University of Michigan
  email: jokamoto@umich.edu
- name: Xiaoquan Wen
  email: xwen@umich.edu
  affiliation: *id
abstract: >
  This package integrates colocalization probabilities from 
    colocalization analysis with transcriptome-wide association study (TWAS) 
    scan summary statistics to implicate genes that may be biologically relevant
    to a complex trait. Given gene set annotations, this package can estimate 
    gene set enrichment using posterior probabilities from the 
    TWAS-colocalization integration step.
package: INTACT
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{INTACT: Integrate TWAS and Colocalization Analysis for Gene Set Enrichment}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
toc_float: true
---

# Installation

To install this package, run the following code chunk (in R 4.3 or later):

```{r,eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("INTACT")
```

# Methodology Reference

For a comprehensive description of the probabilistic framework behind INTACT 
please refer to: 


Okamoto, Jeffrey, et al. "Probabilistic integration of transcriptome-wide 
association studies and colocalization analysis identifies key molecular 
pathways of complex traits." *The American Journal of Human Genetics* 110.1 
(2023): 44-57.

# Introduction

Integrative genetic association methods have shown great promise in post-GWAS
(genome-wide association study) analyses, in which one of the most challenging 
tasks is identifying putative causal genes and uncovering molecular mechanisms 
of complex traits. Prevailing computational approaches include 
transcriptome-wide association studies (TWASs) and colocalization analysis. 
TWASs aim to assess the correlation between predicted gene expression of a 
target gene and a GWAS trait. Common output for TWASs include gene-level 
z-statistics. Colocalization analysis attempts to determine whether genetic 
variants that are causal for a molecular phenotype (such as gene expression) 
overlap with variants that are causal for a GWAS trait. Common output for 
colocalization analysis often include gene-level colocalization probabilities,
which provide evidence regarding whether there exists a colocalized variant 
for the expression of a target gene and GWAS trait. Recent studies suggest that 
TWASs and colocalization analysis are individually imperfect, but their joint 
usage can yield robust and powerful inference results. INTACT is a computational 
framework to integrate probabilistic evidence from these distinct types of 
analyses and implicate putative causal genes. This procedure is flexible and can 
work with a wide range of existing integrative analysis approaches. It has the 
unique ability to quantify the uncertainty of implicated genes, enabling 
rigorous control of false-positive discoveries. INTACT-GSE is an efficient 
algorithm for gene set enrichment analysis based on the integrated probabilistic 
evidence. This package is intended for performing integrative genetic 
association analyses in tandem with other Bioconductor packages such as 
`biomaRt` or `GO.db`, which could be used to obtain gene set annotations for
gene set enrichment analysis.

# Included Data Sets

To illustrate the functionality of the `INTACT` package, we include a 
simulated data set `simdat`. See the methodology reference for an 
explanation of the simulation design. The data is organized as a 1197 row by 
3 column data frame, where rows correspond to genes, the GLCP column provides  
gene-level colocalization probabilities, and the TWAS_z column provides TWAS 
scan z-scores. 

Additionally, we include a simulated gene set list `gene_set_list`, 
which contains two gene sets. The first gene set has 503 gene members and is 
significantly enriched among the genes included in `simdat`, based on the 
probabilistic INTACT output. The second gene set has 200 gene members and is 
not enriched among the `simdat` genes. We include `gene_set_list` to 
show how to perform gene set enrichment estimation using INTACT-GSE.

# INTACT: Integrating TWAS Scan and Colocalization Analysis Results

The first main functionality of this package is integrating results from a 
transcriptome-wide association study (TWAS) scan and a colocalization analysis. 
The TWAS scan results must be in the form of gene-level z scores, and the 
colocalization analysis results should be in the form of gene-level 
colocalization probabilities. These are provided as output by most popular TWAS 
and colocalization methods. Some TWAS methods that we utilize in our work are 
[PTWAS](https://github.com/xqwen/ptwas), 
[PrediXcan](https://github.com/hakyimlab/PrediXcan), and 
[SMR](https://yanglab.westlake.edu.cn/software/smr/#Overview). We 
recommend using [fastENLOC](https://github.com/xqwen/fastenloc) for 
colocalization analysis, as it estimates enrichment of QTL among 
GWAS hits and does not require specification of prior probabilities.

Below, we include an example of how to use INTACT to integrate TWAS scan and 
colocalization results for a simulated data set `simdat`. 

```{r}
library(INTACT)
data(simdat)

rst <- INTACT::intact(GLCP_vec=simdat$GLCP, 
                      z_vec=simdat$TWAS_z)
```

The `intact` function takes a vector of gene-level colocalization 
probabilities `GLCP_vec` and TWAS scan z-scores `z_vec`. It outputs 
gene-level posterior probabilities of putative causality. The example included 
above uses default settings for the prior function and truncation threshold 
$t$ (`prior_fun = linear` and `t=0.05`). There are three additional 
prior functions implemented in the `INTACT` software, including `expit`,
`step`, and `hybrid`. The `expit` and `hybrid` options have an 
additional curvature shrinkage parameter `D`, with a default value of 0.1. 
The default truncation parameter value for the step prior function is 0.5, 
while the default value is 0.05 for all other prior functions. Below are three 
additional examples of how to integrate the TWAS z scores and colocalization 
probabilies from the simulated data using different prior function, truncation 
threshold, and curvature shrinkage settings:


```{r}
rst1 <- INTACT::intact(GLCP_vec=simdat$GLCP, 
                       prior_fun=INTACT::expit, 
                       z_vec = simdat$TWAS_z,
                       t = 0.02,D = 0.09)
rst2 <- INTACT::intact(GLCP_vec=simdat$GLCP, 
                       prior_fun=INTACT::step, 
                       z_vec = simdat$TWAS_z,
                       t = 0.49)
rst3 <- INTACT::intact(GLCP_vec=simdat$GLCP, 
                       prior_fun=INTACT::hybrid, 
                       z_vec = simdat$TWAS_z,
                       t = 0.49,D = 0.05)
```

If the user wishes to specify TWAS Bayes factors instead of z-scores, they can 
do so through the argument `twas_BFs`. The Bayes factors should be a numeric 
vector with genes in the same order as the colocalization probabilities vector. 
If the user wishes to specify gene-specific TWAS priors, they can do so through 
the argument `twas_priors`. If no input is supplied, INTACT computes a 
scalar prior using the TWAS data (see the methodology reference for more 
details).

We provide an additional function `fdr_rst` that is useful if the user wishes
to perform Bayesian FDR control on the INTACT output. An example of how to 
apply this function at a target control level of 0.05 is shown below.

```{r}
fdr_example <- fdr_rst(rst1, alpha = 0.05)
head(fdr_example)
```

# INTACT-GSE: Gene Set Enrichment Estimation Using INTACT results

The `INTACT` package provides the `intactGSE` function to perform gene 
set enrichment estimation and inference using integrated TWAS scan z-scores and 
colocalization probabilities. This function requires a data frame 
`gene_data` containing gene names and corresponding colocalization 
probabilities and TWAS z-scores for each gene. Column names should be "gene", 
"GLCP", and "TWAS_z'. If the user wishes to specify TWAS Bayes factors instead 
of z-scores, use the column name "TWAS_BFs". If the user wishes to specify 
gene-specific TWAS priors, use the column name "TWAS_priors".

In addition to `gene_data`, the user must provide a list of gene sets 
`gene_sets`. The format of `gene_sets` must match the included example 
`gene_set_list`: it must named list of gene sets for which enrichment is to 
be estimated. List items should be character vectors of gene IDs. Gene ID format 
should match the gene column in `gene_data`.

The user can specify the same prior-related arguments as in the `intact` 
function, including    `prior_fun`,  `t`, and `D`(only when the 
prior function is specified as `expit` or `hybrid`). 

The user can specify the method by which the standard error of the enrichment 
estimate is computed. Options include a numerical differentiation of the score 
function (default): `NDS`; a profile likelihood approach: 
`profile_likelihood`, and bootstrapping: `bootstrap`. For hypothesis 
testing, the user can specify a significance threshold, which is 0.05 by 
default. 

An example of how to estimate gene set enrichment in the gene sets provided in 
`gene_set_list` (using default settings) is shown below:

```{r}
data(gene_set_list)
INTACT::intactGSE(gene_data = simdat,gene_sets = gene_set_list)
```

The output of `intactGSE` includes one row per gene set and eight columns: 
the gene set name, the enrichment parameter $\alpha_1$ estimate, the enrichment
parameter estimate standard error, the z-score, the p-value, the left and right
CIs, and the convergence flag (if CONVERGED = 1, then the enrichment estimation
algorithm converged. If not, CONVERGED = 0). Some data sets are not informative 
for gene set enrichment estimation; in this case, the algorithm will fail to 
converge. We emphasize that failure of the algorithm to converge does not 
provide information regarding the enrichment (or lack thereof) for a given 
gene set.


Finally, we include three additional examples of how to estimate enrichment for 
the same data sets using non-default prior parameters:

```{r}
INTACT::intactGSE(gene_data = simdat,prior_fun = INTACT::step,
                  t = 0.45,gene_sets = gene_set_list)
INTACT::intactGSE(gene_data = simdat,prior_fun = INTACT::expit,
                  t = 0.08,D = 0.08, gene_sets = gene_set_list)
INTACT::intactGSE(gene_data = simdat,prior_fun = INTACT::hybrid,
                  t = 0.08,D = 0.08, gene_sets = gene_set_list)
```

# Integrating Additional Gene Product Data

Although INTACT achieves higher power than colocalization alone and better 
false discovery control than TWAS alone, it is not robust to scenarios in which 
a causal gene impacts the complex trait, but not through gene expression. 
Motivated by recent protein and splicing QTL studies suggesting these scenarios 
are quite common, we introduce Multi-INTACT, an empirical Bayes framework that 
extends the INTACT model to jointly consider multiple molecular gene products 
(e.g. encoded RNA transcript levels and protein levels). Multi-INTACT comprises
two stages: a scanning stage and model selection stage. The scanning stage 
integrates GWAS results and QTL data for two or more molecular gene products 
to implicate putative causal genes. The model selection stage aims to determine
which of the measured molecular gene products for a gene implicated in the 
scanning stage directly affects the complex trait-of-interest. We employ a 
Bayesian procedure to compare possible underlying molecular mechanisms using 
probabilistic evidence. While leveraging additional molecular gene product 
information, Multi-INTACT shares strengths with INTACT such as computational 
efficiency and probabilistic uncertainty quantification for causal gene 
nomination. This implementation of Multi-INTACT can jointly consider up to 
2 molecular gene products at a time.

<!-- For a full description of the Multi-INTACT algorithm, refer to [] -->

# Included Data Sets to Demonstrate Multi-INTACT Functionality

To illustrate the functionality of the Multi-INTACT functions, we include a 
simulated data set `multi_simdat`. See the methodology reference for an 
explanation of the simulation design. The data is organized as a 1197 row by 
6 column data frame, where rows correspond to genes, the `GLCP_1` and `GLCP_2` 
columns provide pairwise gene-level colocalization probabilities (between the 
complex trait/gene expression and complex trait/protein levels, respectively),
`z_1` and `z_2` columns provide TWAS and protein-TWAS (PWAS) z-scores,
respectively, and `chisq` provides multivariate Wald chi-square test statistics 
from the joint regression of the complex trait on predicted protein and 
expression levels.

If individual-level data are available, recommend using the function `wald.test` 
from the `aod` package in order to generate the chi-square test statistics. If 
individual-level data are not available, it is still possible to approximate the 
statistic. We include simulated data sets `ld_sumstats`, `exprwt_sumstats`,
`protwt_sumstats`, `z_sumstats`, to show how to approximate the multivariate 
Wald statistic using summary-level data.

# Estimating the Chi-square Statistic From Summary-level Data

To estimate the chi-square statistic, for each candidate gene, we will need a 
set of SNPs used to predict each gene product level, an estimated linkage 
disequilibrium (LD) correlation matrix for these SNPs, and prediction weights
for each gene product type, and TWAS z-scores for each gene product type. 
The example data `ld_sumstats` contains the LD matrix, `exprwt_sumstats` and
`protwt_sumstats` contain TWAS and PWAS prediction weights, and `z_sumstats` 
contains the TWAS and PWAS z-scores for the candidate gene.

To estimate the chi-square statistic, run:

```{r}
data(z_sumstats)
data(exprwt_sumstats)
data(protwt_sumstats)
data(ld_sumstats)

INTACT::chisq_sumstat(z_vec = z_sumstats,
              w = cbind(protwt_sumstats,exprwt_sumstats),
              R = ld_sumstats)
```

We highly recommend running this step in parallel if you have many candidate 
gene-trait pairs.

# Running Multi-INTACT

Once chi-square statistics, pairwise colocalization data, and marginal 
z-scores for each gene product are available, we are ready to run Multi-INTACT.
To compute gene probabilities of putative causality (GPPCs) and gene 
product relevance probabilities (GPRPs), run:

```{r}
data(multi_simdat)

rst <- INTACT::multi_intact(df = multi_simdat)
```

The output from the `multi_intact` function is a list object containing 3 items.
The first is a data frame with the GPPC, GPRP for expression (`GPRP_1`), and for 
protein (`GPRP_2`). The second is a numeric 3-vector containing conditional 
prior parameter estimates (denoted $h_e,h_p,h_{e+p}$ in the methodology 
reference). The third is a Boolean indicating whether the EM algorithm 
converged. 

The a preview of the output is shown below:
```{r,echo=FALSE}
print(head(rst[[1]]))
print(rst[[2]])
print(rst[[3]])
```

The GPPC is a form of probabilistic evidence that a target gene 
exerts a causal effect on the complex trait through at least one of the provided 
gene product data types. The GPRPs provide probabilistic evidence to 
determine the gene product(s) that exert a direct effect on the complex trait. 
For example,in the output above, based on the Multi-INTACT output, it is very 
likely that the gene ENSG00000038274 has a causal effect, and it is likely that
encoded protein levels exert a direct effect. There is relatively little 
evidence that the gene's expression levels exert a direct effect (although it 
remains possible that gene expression exerts and effect that is mediated by 
protein levels). The prior parameter estimates represent prior probabilistic 
evidence of three possible underlying causal models (in which only expression, 
only protein levels, or both expression and protein levels exert an effect, 
respectively). Estimation of these prior parameters is required to estimate 
GPRPs. We visualize the complete results below:

```{r,echo=FALSE}
library(ggplot2)

output <- rst[[1]]

print(ggplot(output, aes(x=GPPC)) +
        geom_histogram(binwidth=0.01) +
        ylab("Number of simulated genes") +
        theme_bw() +
        theme(text = element_text(size=10,face="bold"))) 

gprp1 <- data.frame("GPRP" = output$GPRP_1,"Gene_product" = "Expression")
gprp2 <- data.frame("GPRP" = output$GPRP_2,"Gene_product" = "Protein")
gprp <- rbind.data.frame(gprp1,gprp2)

print(ggplot(gprp,aes(x=GPRP,fill=Gene_product)) +
        geom_histogram(binwidth=0.01)+
        facet_wrap(~Gene_product) +
        ylab("Number of simulated genes") + 
        scale_fill_manual(values=c("blue","red")) +
        theme_bw() +
        theme(text = element_text(size=10,face="bold")))

```

From the histograms above, we can see that the probabilistic evidence of a 
causal effect on the complex trait is low for most genes (top plot). In the
bottom plot, we display the distribution of GPRPs. Recall that all genes have 2 
GPRPs (`GPRP_1` for protein evidence, and `GPRP_2` for expression), so we
visualize the distribution of each GPRP type in each panel. Additionally, we can 
see that the distribution of probabilistic evidence of a direct protein effect 
is similar to that of probabilistic evidence of a direct expression effect. 
These observed qualities are by design, as this is simulated data.

If you want to see the model posteriors for expression-only 
(`posterior_1`), protein-only (`posterior_2`), and expression-and-protein 
(`posterior_12`), run:


```{r}
rst <- INTACT::multi_intact(df = multi_simdat,return_model_posteriors = TRUE)
```
```{r,echo=FALSE}
print(head(rst[[1]]))
```

If you only want to compute GPPCs (and not run the EM algorithm to compute 
GPRPs), run:

```{r}
rst <- INTACT::multi_intact(df = multi_simdat,em_algorithm = FALSE)
```
```{r,echo=FALSE}
print(head(rst))
```

Session information is included below:

```{r}
sessionInfo()
```