--- title: "geuvStore: sharded storage for cis-association statistics" author: "Vincent J. Carey, stvjc at channing.harvard.edu" date: "February 2015" output: BiocStyle::pdf_document: toc: yes number_sections: yes BiocStyle::html_document: highlight: pygments number_sections: yes theme: united toc: yes --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` # Introduction The `r Biocpkg("geuvStore")` package demonstrates an approach to management of large numbers of statistics generated in integrative genomic analyses. The specific use case demonstrated here is cis-eQTL discovery. The following considerations motivated the design used here. * Cluster computing will typically be used to perform cis-eQTL searches. Scalable performance is greatly aided by the `r CRANpkg("BatchJobs")` infrastructure, which will create an archive of results. - This archive includes a database that holds information on job status (including time and memory required to complete) and result location. We consider this information worth saving. - The collection of results is, by default, "sharded" into a reasonable number of folders holding serialized R objects. We find this approach useful for supporting parallelizable retrieval of results. * It makes sense to store results of cis-association analyses so that queries based on genomic addresses are rapidly resolved. Thus all the results are stored in `GRanges` instances, and queries based on `GRanges` are efficiently resolvable if an optional index is prepared before use. # Illustration ## Construction of mediator and indices The most basic entity mediating access to the information is the `BatchJobs` registry object. This is typically not created in a portable format, but includes directory information that we modify during package installation. ```{r lkone} suppressPackageStartupMessages(library(geuvStore)) prst = partialRegistry() prst ``` Note that the show method mentions 2283 "jobs". The job set was defined using a partition of 22830 genes into sets of 10 genes each. Association statistics were recorded between expression levels of each gene (as recorded in the GEUVADIS FPKM report) and all SNP with MAF $> .01$ lying within a radius of 1 million bp upstream or downstream from the gene region. This package provides access to a selection of 92 jobs. We use the `ciseStore` class to mediate between the user and the results data. This includes optional mappings based on gene identifiers (in the case of this example, these are Ensembl gene IDs) and `GRanges`. We have stored the maps, but they can be computed in real time if need be. ```{r lktwo,cache=TRUE} library(gQTLBase) # prstore = ciseStore(prst, addProbeMap=TRUE, addRangeMap=TRUE) prstore = makeGeuvStore() prstore ``` ## Extraction of content For a vector of gene identifiers, all available results are extracted. ```{r lookup1} head( extractByProbes(prstore, probeids=c("ENSG00000183814.10", "ENSG00000174827.9")) ) ``` For a request based on genomic coordinates, a `GRanges` can be used to query. `findOverlaps` is used, and all results for genes whose regions overlap the query ranges are returned. ```{r lookup2} head( extractByRanges(prstore, GRanges("1", IRanges(146000000, width=1e6))) ) ``` ## Applicative programming The `storeApply` function will be evaluated on all store elements. It dispatches to `r Biocpkg("BiocParallel")` `bplapply`, and the registered 'BiocParallelParam' is used implicitly. Some list flattening may be required after use. ```{r doapp} lens = storeApply(prstore, length) summary(unlist(lens)) ``` ## Visualization support As of March 5, 2015 "biocLite('vjcitn/gQTLbrowser')" will acquire a package including an interactive visualization function. "example('gQTLbrowse')" will load a queriable interface into the browser, with tooltips on the Manhattan plot for the selected gene. ## Origins for data up to 1.3.0 This will be replaced with better audited sva-based correction, Nov 2015 ```{r echo=FALSE,eval=FALSE} # /udd/stvjc/VM/GEUVADIS/SEPT_2014/FLAT_1M/CHANNING_SGE/newflat.R library(BatchJobs) gtpath = function (chrdigit, useS3 = FALSE) { tmplate = "/proj/rerefs/reref00/EBI/GEUVADIS/VCF/GEUVADIS.chr%%N%%.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz" if (useS3) tmplate = "http://1000genomes.s3.amazonaws.com/release/20110521/ALL.chr%%N%%.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz" gsub("%%N%%", chrdigit, tmplate) } flatReg = makeRegistry("flatReg", file.dir="flatStore", seed=123, packages=c("GenomicRanges", "GGtools", "VariantAnnotation", "Rsamtools", "geuvPack", "GenomeInfoDb")) chunkGenes = function(chrint, chunk.size=10) { cat(chrint) stopifnot(is(chrint, "integer")) library(geuvPack) if (!exists("geuFPKM")) data(geuFPKM) seqlevelsStyle(geuFPKM) = "NCBI" # snps indexed this way too curgenes = rownames( geuFPKM[ which(seqnames(geuFPKM)==chrint), ] ) glist = chunk( curgenes, chunk.size=chunk.size ) lapply(glist, function(x) list(chr=chrint, genes=x) ) } gettests = function( chunk, useS3=FALSE ) { library(VariantAnnotation) snpsp = gtpath( chunk$chr, useS3=useS3) tf = TabixFile( snpsp ) library(geuvPack) if (!exists("geuFPKM")) data(geuFPKM) seqlevelsStyle(geuFPKM) = "NCBI" clipped = clipPCs(regressOut(geuFPKM, ~popcode), 1:10) set.seed(54321) ans = cisAssoc( clipped[ chunk$genes, ], tf, cisradius=1000000, lbmaf=0.01 ) metadata(ans)$prepString = "clipPCs(regressOut(geuFPKM, ~popcode), 1:10)" ans } autosomes = as.integer(1:22) genechunks = lapply(autosomes, chunkGenes, chunk.size=10) flatlist = unlist(genechunks, recursive=FALSE) batchMap(flatReg, gettests, flatlist) submitJobs(flatReg, 1:2) ```