--- title: "gQTLBase: infrastructure for storage and interrogation of eQTL, mQTL, dsQTL etc. archives" author: "Vincent J. Carey, stvjc at channing.harvard.edu" date: "January 2015" output: BiocStyle::html_document: highlight: pygments number_sections: true theme: united toc: true BiocStyle::pdf_document: toc: true number_sections: true --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` # Introduction It is well-recognized that cis-eQTL searches with dense genotyping yields billions of test results. While many are consistent with no association, it is hard to draw an objective threshold, and targeted analysis may reveal signals of interest that do not deserve penalization for genome-wide search. We recently performed a comprehensive cis-eQTL search with the GEUVADIS FPKM expression measures. The most prevalent transcript types in this dataset are ```{r lktt,echo=FALSE} litt = structure(c(15280L, 4853L, 1450L, 1153L, 476L, 114L), .Dim = 6L, .Dimnames = structure(list( c("protein_coding", "pseudogene", "antisense", "lincRNA", "processed_transcript", "IG_V_gene")), .Names = "") ) litt ``` *cis*-associated variation in abundance of these entities was assessed using 20 million 1000 genomes genotypes with radius 1 million bases around each transcribed region. There are 185 million SNP-transcript pairs in this analysis. This package (`r Biocpkg("gQTLBase")`) aims to simplify interactive interrogation of this resource. ## Brief view of how the tests were done The following function takes as argument 'chunk' a list with elements chr (character token for indexing chromosomes in genotype data in VCF) and genes (vector of gene identifiers). It implicitly uses a `TabixFile` reference to acquire genotypes on the samples managed in the `r Biocexptpkg("geuvPack")` package. ```{r eval=FALSE} gettests = function( chunk, useS3=FALSE ) { library(VariantAnnotation) snpsp = gtpath( chunk$chr, useS3=useS3) tf = TabixFile( snpsp ) library(geuvPack) if (!exists("geuFPKM")) data(geuFPKM) clipped = clipPCs(regressOut(geuFPKM, ~popcode), 1:10) set.seed(54321) ans = cisAssoc( clipped[ chunk$genes, ], tf, cisradius=1000000, lbmaf=0.01 ) metadata(ans)$prepString = "clipPCs(regressOut(geuFPKM, ~popcode), 1:10)" ans } ``` `cisAssoc` returns a `GRanges` instance with fields relevant to computing FDR for cis association. A `r CRANpkg("BatchJobs")` registry is created as follows: ```{r bjreg,eval=FALSE} flatReg = makeRegistry("flatReg", file.dir="flatStore", seed=123, packages=c("GenomicRanges", "GGtools", "VariantAnnotation", "Rsamtools", "geuvPack", "GenomeInfoDb")) ``` For any list 'flatlist' of pairs (chr, genes), the following code asks the scheduler to run gettests on every element, when it can. Using the Channing cumulus cloud, the job ran on 40 hosts at a cost of 170 USD. ```{r dosub,eval=FALSE} batchMap(flatReg, gettests, flatlist) submitJobs(flatReg) ``` This creates a 'sharded' archive of 7GB of results managed by a Registry object. ## Illustration on a subset We have extracted 3 shards from the job for illustration with the `r Biocpkg("gQTLBase")` package. ```{r getstuff,results="hide", echo=FALSE} suppressPackageStartupMessages({ library(BiocGenerics) library(Homo.sapiens) library(stats4) library(IRanges) library(GGtools) library(gQTLBase) library(geuvStore2) options(BBmisc.ProgressBar.style="off") }) ``` ```{r doone} library(gQTLBase) library(geuvStore2) mm = makeGeuvStore2() mm ``` `mm` here is an instance of the `ciseStore` class. This is a BatchJobs `Registry` wrapped with additional information concerning the map from identifiers or ranges to jobs in the registry. There are various approaches available to get results out of the store. At present we don't want a full API for result-level operations, so work from BatchJobs directly: ```{r getr} loadResult(mm@reg, 1)[1:3] ``` On a multicore machine or cluster, we can visit job results in parallel. The `storeApply` function uses `r CRANpkg("BatchJobs")` `reduceResultsList` to transform job results by a user-supplied function. The reduction events occur in parallel through `r Biocpkg("BiocParallel")` `bplapply` over a set of job id chunks whose character can be controlled through the `n.chunks` parameter. We'll illustrate by taking the length of each result. ```{r setuplen} library(BiocParallel) library(parallel) mp = MulticoreParam(workers=max(c(1, detectCores()-4))) register(mp) ``` ```{r getlen, cache=TRUE} lens = storeApply(mm, length) summary(unlist(lens)) ``` It is possible to limit the scope of application by setting the `ids` parameter in `storeApply`. ## Interrogating by probe For a known GEUVADIS Ensembl identifier (or vector thereof) we can acquire all cis association test results as follows. ```{r doex,cache=TRUE} pvec = mm@probemap[1:4,1] # don't want API for map, just getting examples litex = extractByProbes( mm, pvec ) length(litex) litex[1:3] ``` We also have extractByRanges. # Towards estimation of distributions relevant to FDR computation ## Small-footprint collection of association statistics In the gQTLstats package, we will use the plug-in FDR algorithm of Hastie, Tibshirani and Friedman *Elements of Statistical Learning* ch. 18.7, algorithm 18.3. We will not handle hundreds of millions of scores directly in a holistic way, except for the estimation of quantiles of the observed association scores. This particular step is carried out using `r CRANpkg("ff")` and `r CRANpkg("ffbase")` packages. We illustrate with our subset of GEUVADIS scores. ```{r getfff} allassoc = storeToFf(mm, "chisq") length(allassoc) object.size(allassoc) allassoc[1:4] ``` ## Other statistical functions Refer to the `r Biocpkg("gQTLstats")` package for additional functions that generate quantile estimates, histograms, and FDR estimates based on `ciseStore` contents and various filtrations thereof.