---
title: "geuvStore: sharded storage for cis-association statistics"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "February 2015"
output:
  BiocStyle::pdf_document:
    toc: yes
    number_sections: yes
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{geuvStore: sharded storage for cis-association statistics}
-->

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Introduction

The `r Biocpkg("geuvStore")` package demonstrates an approach
to management of large numbers of statistics generated in
integrative genomic analyses.  The specific use case demonstrated
here is cis-eQTL discovery.  The following considerations 
motivated the design used here.

* Cluster computing will typically be used to perform cis-eQTL
searches.  Scalable performance is greatly aided by the
`r CRANpkg("BatchJobs")` infrastructure, which will create an
archive of results.  

    - This archive includes a database that
      holds information on job status (including time and memory
      required to complete) and result location.  We consider
      this information worth saving.

    - The collection of results is, by default, "sharded" into
      a reasonable number of folders holding serialized R objects.
      We find this approach useful for supporting parallelizable
      retrieval of results.

* It makes sense to store results of cis-association analyses
so that queries based on genomic addresses are rapidly resolved.
Thus all the results are stored in `GRanges` instances, and
queries based on `GRanges` are efficiently resolvable if an optional
index is prepared before use.

# Illustration

## Construction of mediator and indices

The most basic entity mediating access to the information
is the `BatchJobs` registry object.  This is typically
not created in a portable format, but includes directory
information that we modify during package installation.
```{r lkone}
suppressPackageStartupMessages(library(geuvStore))
prst = partialRegistry()
prst
```
Note that the show method mentions 2283 "jobs".  The job
set was defined using a partition of 22830 genes into
sets of 10 genes each.  Association statistics were recorded
between expression levels of each gene (as recorded in the
GEUVADIS FPKM report) and all SNP with MAF $> .01$ lying within a radius of
1 million bp upstream or downstream from the gene region.
This package
provides access to a selection of 92 jobs.  

We use the `ciseStore` class to mediate between the user
and the results data.  This includes optional mappings based
on gene identifiers (in the case of this example, these
are Ensembl gene IDs) and `GRanges`.  We have
stored the maps, but they can be computed in real time
if need be.

```{r lktwo,cache=TRUE}
library(gQTLBase)
# prstore = ciseStore(prst, addProbeMap=TRUE, addRangeMap=TRUE)
prstore = makeGeuvStore()
prstore
```

## Extraction of content

For a vector of gene identifiers, all available results 
are extracted.  
```{r lookup1}
head(
extractByProbes(prstore, 
   probeids=c("ENSG00000183814.10", "ENSG00000174827.9"))
)
```

For a request based on genomic coordinates, a
`GRanges` can be used to query.  `findOverlaps` is
used, and all results for genes whose regions
overlap the query ranges are returned.
```{r lookup2}
head(
extractByRanges(prstore, GRanges("1", IRanges(146000000, width=1e6)))
)
```

## Applicative programming

The `storeApply` function will be evaluated on all
store elements.  It dispatches to `r Biocpkg("BiocParallel")`
`bplapply`, and the registered 'BiocParallelParam'
is used implicitly.
Some list flattening may be required after
use.

```{r doapp}
lens = storeApply(prstore, length)
summary(unlist(lens))
```

## Visualization support

As of March 5, 2015 "biocLite('vjcitn/gQTLbrowser')" will
acquire a package including an interactive visualization function.
"example('gQTLbrowse')" will load a queriable interface into
the browser, with tooltips on the Manhattan plot for the selected gene.

## Origins for data up to 1.3.0

This will be replaced with better audited sva-based correction, Nov 2015

```{r echo=FALSE,eval=FALSE}
# /udd/stvjc/VM/GEUVADIS/SEPT_2014/FLAT_1M/CHANNING_SGE/newflat.R

library(BatchJobs)

gtpath = function (chrdigit, useS3 = FALSE) 
{
    tmplate = "/proj/rerefs/reref00/EBI/GEUVADIS/VCF/GEUVADIS.chr%%N%%.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz"
    if (useS3) 
        tmplate = "http://1000genomes.s3.amazonaws.com/release/20110521/ALL.chr%%N%%.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
    gsub("%%N%%", chrdigit, tmplate)
}


flatReg = makeRegistry("flatReg",  file.dir="flatStore",
        seed=123, packages=c("GenomicRanges",
            "GGtools", "VariantAnnotation", "Rsamtools", 
            "geuvPack", "GenomeInfoDb"))


chunkGenes = function(chrint, chunk.size=10) {
  cat(chrint)
  stopifnot(is(chrint, "integer"))
  library(geuvPack)
  if (!exists("geuFPKM")) data(geuFPKM)
  seqlevelsStyle(geuFPKM) = "NCBI"  # snps indexed this way too
  curgenes = rownames( geuFPKM[ which(seqnames(geuFPKM)==chrint), ] )
  glist = chunk( curgenes, chunk.size=chunk.size )
  lapply(glist, function(x) list(chr=chrint, genes=x) )
}

gettests = function( chunk, useS3=FALSE ) {
  library(VariantAnnotation)
  snpsp = gtpath( chunk$chr, useS3=useS3) 
  tf = TabixFile( snpsp )
  library(geuvPack)
  if (!exists("geuFPKM")) data(geuFPKM)
  seqlevelsStyle(geuFPKM) = "NCBI"
  clipped = clipPCs(regressOut(geuFPKM, ~popcode), 1:10)
  set.seed(54321)
  ans = cisAssoc( clipped[ chunk$genes, ], tf, cisradius=1000000, lbmaf=0.01 )
  metadata(ans)$prepString = "clipPCs(regressOut(geuFPKM, ~popcode), 1:10)"
  ans
  }

autosomes = as.integer(1:22)
genechunks = lapply(autosomes, chunkGenes, chunk.size=10)

flatlist = unlist(genechunks, recursive=FALSE)

batchMap(flatReg, gettests, flatlist)

submitJobs(flatReg, 1:2)

```