%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%

%\VignetteIndexEntry{external -- decomposable smlSet representation}
%\VignetteDepends{GGtools}
%\VignetteKeywords{package-based data management}
%\VignettePackage{GGtools}

\documentclass[12pt]{article}

\usepackage{amsmath,pstricks}
% With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running
% pdflatex will fail. Note that using auto-pst-pdf requires to set environment
% variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in
% texmf.cnf for Tex Live on Unix/Linux/Mac.
\usepackage{auto-pst-pdf}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Externalized expression plus genotype data structures}
\author{VJ Carey}
\maketitle

\section{Introduction}

The smlSet class of \textit{GGBase} has served for some time as
a holistic representation of an expression genetics experiment.
As SNP panels have grown to 8 million plus, it is impractical to
employ unified containers for such experiments.

This document describes the first phase of externalizing large
genotype data resources for expression genetics data structures.
Given an smlSet instance, the \texttt{externalize()} function
will create an R source package with chromosome-specific files
of genotype data in \texttt{snpStats::SnpMatrix} instances.
When this package is installed and loaded, the \texttt{getSS} function
will collect information necessary to construct an smlSet with
a specified number of chromosomes.

\section{HapMap Phase II for expression genetics}

The Bioconductor GGdata package has represented the Phase II HapMap
data for CEPH/CEU cohort as a single smlSet instance.  Loading
the data for use on an 8GB macbook pro takes

\begin{verbatim}
> library(GGdata)
Loading required package: illuminaHumanv1.db
> unix.time(data(hmceuB36))
   user  system elapsed 
 40.309   0.729  41.581 
\end{verbatim}

At this point we can access 4 million SNP for each subject very easily,
but we seldom need to do this atomically; we would be just as happy to
step across chromosomes and so the slow load and large memory footprint
\begin{verbatim}
> gc() # after above
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  7166465 382.8   10591793 565.7  7167437 382.8
Vcells 73027380 557.2   80745864 616.1 73028403 557.2
\end{verbatim}
are inessential results of a holistic representation approach that is
not effective with very large SNP panels.

The externalize() function can be called

\begin{verbatim}
> externalize(hmceuB36, "ggdPack")
now install ggdPack 
NULL
\end{verbatim}

After installation we can use
\begin{verbatim}
library(ggdPack)
\end{verbatim}
which prints the message
\begin{verbatim}
To get a tailored smlSet, use getSS("ggdPack", [chrvec]) 
available chromosomes are named  1 10 ... X Y 
\end{verbatim}


We use \texttt{getSS}
\begin{verbatim}
> unix.time(gg1 <- getSS("ggdPack", "1"))
   user  system elapsed 
  3.514   0.090   3.605 
> gg1
SnpMatrix-based genotype set:
number of samples:  90 
number of chromosomes present:  1 
annotation: illuminaHumanv1.db 
Expression data dims: 47293 x 90 
Phenodata: An object of class "AnnotatedDataFrame"
  sampleNames: NA06985 NA06991 ... NA12892 (90 total)
  varLabels: famid persid ... male (7 total)
  varMetadata: labelDescription
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  3504115 187.2    4953636 264.6  3885278 207.5
Vcells 10444243  79.7   12242489  93.5 11388329  86.9
\end{verbatim}
and see rapid access and a relatively small footprint.

We can easily visit all chromosomes, as this naive code illustrates:
\begin{verbatim}
> allc = as.character(1:22)
> unix.time(ll <- lapply(allc, function(x) getSS("ggdPack", x)))
   user  system elapsed 
 68.533   1.493  70.131 
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   7053675  376.8   10591793  565.7   8189107  437.4
Vcells 163291667 1245.9  173709474 1325.3 165150995 1260.1
\end{verbatim}
The memory footprint of growing the entire list is unpleasant.
Also, there are excessive copies of the expression data (one for
each chromosome).

Instead, if we process and then discard:
\begin{verbatim}
> unix.time(ll <- lapply(allc, function(x) {tmp <- getSS("ggdPack", x); NULL})) 
   user  system elapsed 
 44.805   0.826  45.743 
> gc()
          used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 3148343 168.2    5241317 280.0  4169255 222.7
Vcells 1828070  14.0    9260197  70.7 11572531  88.3
\end{verbatim}
all is well.

\end{document}