%\VignetteIndexEntry{genoset}
%\VignetteDepends{}
%\VignetteKeywords{genoset}
%\VignetteKeywords{bafset}
%\VignetteKeywords{cnset}
%\VignetteKeywords{BAFSet}
%\VignetteKeywords{CNSet}
%\VignettePackage{genoset}
\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{Sweave}

\newcommand{\Rfunction}[1]{{\texttt{#1}}} 
\newcommand{\Robject}[1]{{\texttt{#1}}} 
\newcommand{\Rpackage}[1]{{\textit{#1}}} 
\newcommand{\Rclass}[1]{{\textit{#1}}} 

\textwidth=6.5in
\textheight=8.5in
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\title{An Introduction to the ``genoset'' Package}
\author{Peter M. Haverty}
\date{\today}

\begin{document}

\maketitle

<<<options,echo=FALSE>>=
  options(width=90)
@ 

\section{Introduction}
The genoset package offers an extension of the BioConductor eSet object for genome
arrays. The package offers three classes. The first class is the
``GenoSet'' class which can hold an arbitrary number of equal-sized
matrices in its assayData slot. The principal addition of the GenoSet
class is a ``locData'' slot that holds a RangedData object from the
IRanges package. The locData slot allows for quick subsetting by genome position.

Two classes extend GenoSet: CNSet and BAFSet.  CNSet is the basic copy number object. It
keeps its data in the ``cn'' slot, similar to the ``exprs'' slot of the ExpressionSet.
BAFSet is intended to store ``LRR'' or Log-R Ratio and ``BAF'' or B-Allele Frequency data for
SNP arrays.  LRR and BAF come from the terms coined by Illumina. LRR is copynumber data
processed on a per-snp basis to remove some variability using the expected log-ratio of normal
samples with the same genotype. BAF represents the fraction of signal coming from the ``B''
allele, relative to the ``A'' allele, where A and B are arbitrarily assigned. BAF has the
expected value of 0 or 1 for HOM alleles and 0.5 for HET alelles.  Deviation from these
expected values can be interpreted as Allelic Imbalance, which is a sign of gain, loss, or
copy-neutral LOH.

I will use a CNSet to demonstrate the features that apply to both classes.  Some
BAFSet-specific features will be handled later in the document.

\subsection{Creating Objects}

GenoSet, CNSet and BAFSet objects can be created using the functions
with the same name.  Let's load up some fake data to experiment
with. Don't worry too much about how the fake data gets made.

Notice how CNSet requires a 'cn' element, BAFSet
requires 'lrr' and 'baf', and genoset can take assayData elements with
any name.  Also notice how assayData elements can be matrices or
DataFrames with Rle columns.

<<objectcreation>>=
library(genoset)
data(genoset)
genoset.ds = GenoSet( locData=locData.rd, foo=fake.lrr, pData=fake.pData, annotation="SNP6", universe="hg19" )

baf.ds = BAFSet( locData=locData.rd, lrr=fake.lrr, baf=fake.baf, pData=fake.pData, annotation="SNP6", universe="hg19" )

cn.ds = CNSet( locData=locData.rd, cn=fake.lrr, pData=fake.pData, annotation="SNP6", universe="hg19" )
cn.ds

rle.baf.ds = BAFSet( locData=locData.rd,
  lrr=DataFrame(K=Rle(c(rep(1.5,300),rep(2.3,700))),L=Rle( c(rep(3.2,700),rep(2.1,300)) ), M=Rle(rep(1.1,1000)),row.names=rownames(fake.lrr)),
  baf=DataFrame(K=Rle(c(rep(0.05,600),rep(0.5,400))),L=Rle( c(rep(0,700),rep(0.5,300)) ), M=Rle(rep(1,1000)),row.names=rownames(fake.baf)),
  pData=fake.pData,
  annotation="SNP6"
  )
@ 

Let's have look at what's inside these objects.   CNSets extend
ExpressionSets and so have all of the same functions and slots, except 
that the exprs slot and getter and setter functions have been replaced with ``cn''.  The locData slot has
been added.  This slot holds a RangedData object that keeps track of the feature positions and
allows for quick subsetting. The RangedData object is always sorted such that the chromosomes
are in contiguous blocks, which provides a constraint on the ordering of the CNSet.

<<<objectinspection>>= >>=
slotNames(genoset.ds)
phenoData(genoset.ds)
pData(genoset.ds)
universe(genoset.ds)
annotation(genoset.ds)
locData(genoset.ds)
@ 

The assayData slot is a collection of equally sized matrix-like objects.
Access to required elements can be done by special accessor functions. Others require more typing.

<<objectassaydata>>=
assayDataElementNames(baf.ds)
head( baf(baf.ds) )
head( lrr(baf.ds) )
head( cn(cn.ds) )
head( assayDataElement(genoset.ds,"foo") )
@ 

\subsection{Accessing Genome Information}
Now lets look at some special functions for accessing genome
information from a genoset object. Methods for RangedData have also
been added to give the same API. We can access per-probe information
as well as summaries of chromosome boundaries in base-pair or
row-index units.   There are a number of functions for getting
portions of the locData data. ``chr'' and ``pos'' return the
chromosome and position information 
for each feature. ``genoPos'' is like ``pos'', but it
returns the base positions counting from the first base in the genome, with the
chromosomes in order by number and then alphabetically for the letter
chromosomes. ``chrInfo'' returns the genoPos of the first and last feature on each chromosome in
addition to the offset of the first feature from the start of the genome. ``chrInfo'' results are
used for drawing chromosome boundaries on genome-scale plots. ``pos''
and ``genoPos'' are defined as the floor of the average of each
features start and end positions.

<<accessgenomeinfo>>=
chrNames(genoset.ds)
chrOrder(c("chr12","chr12","chrX","chr8","chr7","chrY"))
chrInfo(genoset.ds)
chrIndices(genoset.ds)
elementLengths(genoset.ds)

head(ranges(genoset.ds))
head(chr(genoset.ds))
head(start(genoset.ds))
head(end(genoset.ds))
head(pos(genoset.ds))
tail(pos(genoset.ds))
tail(genoPos(genoset.ds))
@ 

\subsection{Using the Subset Features}

GenoSet objects can be subset using matrix notation. The ``features''
index can be a set of ranges. chrIndices with a chromosome argument is
a convenient way to get the indices needed to subset by chromosome.

Subset by chromosome   
<<subsetbychr>>=
chr12.ds = cn.ds[ chrIndices(cn.ds,"chr12"), ]
chr12.ds
@ 

Subset by a collection of gene locations
<<subsetbygene>>=
gene.rd = RangedData(ranges=IRanges(start=c(35e6,127e6),end=c(35.5e6,129e6),                       
                       names=c("HER2","CMYC")), space=c("chr17","chr8"))
gene.ds = cn.ds[ gene.rd, ]
gene.ds
@ 

GenoSet objects can also be subset by a group of samples and/or
features, just like an ExpressionSet, or a matrix for that matter.

<<subsetsamples>>=
cn.ds[1:4,1:2]
cn.ds[ c("p1","p2"), cn.ds$b %in% c("E","F")]
@ 

eSet-derived classes tend to have special functions to get and set
specific assayDataElement members (the big data matrices). For
example, ExpressionSet has the ``exprs'' function. Likewise, BAFSet
has ``baf'' and ``lrr'' and CNSet has ``cn''. It is common to put
other optional matrices in assayData too (e.g. genotypes, quality
scores, etc.). These can be get and set with the ``assayDataElement''
function, but typing that out can get old.  GenoSet and derived
classes use the ``k'' argument to the matrix subsetting bracket to
subset from a specific assayDataElement. In addition to saving some
typing, you can directly use a set of ranges to subset the
assayDataElement.

<<subsetassaydata>>=
all( cn.ds[ 1:4,1:2,"cn"] == cn(cn.ds)[1:4,1:2] )
all( cn.ds[ 1:4,1:2,"cn"] == assayDataElement(cn.ds,"cn")[1:4,1:2] )
cn.ds[ gene.rd, 1:2, "cn" ]
@ 

\subsection{Genome Order}
GenoSet and RangedData objects can be set to and checked for genome
order. Weak genome order requires that features be ordered within each
chromosome. Strong genome order requires a certain order of chromosomes
as well. Features must be ordered so that features from the same
chromosome are in contiguous blocks.

Certain methods on GenoSet objects expect the rows to be in genome
order. In this package, the use of a RangedData object for feature
location info (the locData slot) keeps data for each chromosome in a
contiguous block of rows. Users are free to rearrange rows within
chromosome as they please. The function ``isGenomeOrder'' can check if
a RangedData object, like the locData slot, is in genome order. Here
``genome order'' means that the rows within each chromosome are
ordered by start position. ``Strict genome order'' requires that the
chromosome be in a certain order too. The proper order of chromosomes
is desirable for full-genome plots and is specified by the
``chrOrder'' function. The function ``genomeOrder'' returns a list of
integer indices to use to set a GenoSet or RangedData to weak, or
optionaly, strict genome order. The object creation methods ``Genoset'',
``BAFSet'', and ``CNSet'' order their objects in strict genome order. 
Methods, like those in the next section, that require genome order use
``genomeOrder'' to set it themselves, so users are not required to
keep track of whether or not they have changed the row order.

<<genomeorder, eval=TRUE>>=
chrOrder(names(genoset.ds))
isGenomeOrder(genoset.ds, strict=FALSE)
isGenomeOrder(genoset.ds, strict=TRUE)
genoset.ds = toGenomeOrder(genoset.ds, strict=TRUE)
isGenomeOrder(genoset.ds, strict=TRUE)
@

\subsection{Gene Level Summaries}
GenoSets contain feature level data. Often it is desirable to get
summaries of assayData matrices over an arbitrary set of ranges, like
genes or cytobands. The function ``rangeSampleMeans'' serves this
purpose. Given a RangedData of arbitrary genome ranges, a
GenoSet-based object, and the name of a member of that objects
assayData slot, ``rangeSampleMeans'' will return a matrix of values
for each of those ranges and for each sample. For each range,
``rangeSampleMeans'' uses ``boundingIndices'' to select the features 
bounding that range. The bounding features are the features with locations equal to the start
and end of the range or those outside the range and closest to the
ends of the range. This bounding ensures that the full extent of the
range is accounted for, and more importantly, at least two features
are included for each gene, even if the range falls between two
features. ``rangeColMeans'' is used to do a fast average of each of a
set of such bounding indices for each sample. These functions are
optimized for speed.  For example, with 2.5M features and ~750
samples, it takes 0.12 seconds to find the features bounded by all
Entrez Genes (one RefSeq each). Calculating the mean value for each
gene and sample takes ~9 seconds for a matrix of array data and ~30
seconds for a DataFrame of compressed Rle objects.

As an example, let's say you want to get the copynumber of your two
favorite genes from the subsetting example:

Get the gene-level summary:
<<genelevel, eval=TRUE>>=
boundingIndices( c(127e6,127.5e6), c(127e6,128e6), start(chr12.ds) )
rangeSampleMeans(gene.rd, baf.ds, "lrr")
@ 

\section{Processing Data}
Segmentation is the process of identifying blocks of the genome in
each sample that have the same copynumber value. It is a smoothing
method that attempts to replicate the biological reality where chunks
of chromosome have been deleted or amplified.

Genoset contains a convenience function for segmenting data for each sample/chr using the
DNACopy package (CBS).  GenoSet adds features to split jobs among
processor cores. When the library ``multicore'' is loaded, the
argument n.cores can control the number of processor cores
utilized. Additionally, GenoSet stores segment values so that they can be
accessed quickly at both the feature and segment level. We use a
``DataFrame'' object from IRanges where each column is a
Run-Length-Encoded ``Rle'' object. This dramatically reduces the
amount of memory required to store the segments. Note how the segmented values
become just another member of the assayData slot.

Try running CBS directlyq
<<cbsdirect>>=
library(DNAcopy)
cbs.cna = CNA(cn(cn.ds), chr(cn.ds), pos(cn.ds) )
cbs.smoothed.CNA = smooth.CNA( cbs.cna )
cbs.segs = segment( cbs.cna )
@ 

Or use the convenience function runCBS
<<runCBS>>=
assayDataElement(cn.ds,"cn.segs") = runCBS(cn(cn.ds),locData(cn.ds))
@ 

Try it with multicore
<<cbsmulticore,eval=FALSE>>=
library(multicore)
assayDataElement(cn.ds,"cn.segs") = runCBS(cn(cn.ds),locData(cn.ds), n.cores=3)
assayDataElement(cn.ds,"cn.segs")[1:5,1:3]
segTable( assayDataElement(cn.ds,"cn.segs"), locData(cn.ds) )
@ 

Other segmenting methods can also be used of course. See "segs2Rle",
"segs2RleDataFrame", and "segs2RangedData" for ways to turn their
results into a DataFrame of Rle.

This function makes use of the multicore package to run things in
parallel, so plan ahead when picking ``n.cores''. Memory usage can be
a bit hard to predict.

\subsection{Correction of Copy Number for local GC Content}
Copy number data generally shows a GC content effect that appears as
slow ``waves'' along the genome (Diskin et al., NAR, 2008). The
function ``gcCorrect'' can be used to remove this effect resulting in
much clearer data and more accurate segmentation. GC content is best
measured as the gc content in windows around each feature, about 2Mb
in size. The function ``loadGC'' can calculate GC content in bins of a
specified size using the genome sequence object from a package like
BSgenome. Be careful to use a version of BSgenome that matches the
``universe'', or mapping, of your data set. loadGC adds a ``gc''
column to the locData slot.

<<GC, eval=FALSE>>=
library(BSgenome.Hsapiens.UCSC.hg19)
cn.ds2 = cn.ds
cn.ds2 = loadGC(cn.ds2,expand=1e6,bsgenome=Hsapiens)
head(locData(cn.ds2)$gc)
genoPlot(genoPos(cn.ds),locData(cn.ds2)$gc,locs=locData(cn.ds))
cn(cn.ds2) = gcCorrect(cn(cn.ds2),locData(cn.ds2)$gc)
@ 

\subsection{Plots}
CNSet has some handy plots for plotting data along the genome.
Segmented data ``knows'' it should be plotted as lines, rather than
points. One often wants to plot just one chromosome, so a convenience
argument for chromosome subsetting is provided. ``genoPlot'' can plot 
single samples from a GenoSet (or derived) object. ``genoPlot'' marks 
chromosome boundaries and labels positions in ``bp'', ``kb'', ``Mb'', 
or ``Gb'' units as appropriate. Simple data and numeric position data 
can be plotted too.

<<fig=FALSE, echo=FALSE>>=
png(file="genoset-plotgenome.png",width=12,height=5,res=150,units="in")
@ 

<<plotgenome, echo=TRUE>>=
genoPlot(cn.ds, 1)
genoPlot(cn.ds, 1, element="cn.segs", add=TRUE, col="red")
@ 

<<fig=FALSE, echo=FALSE>>=
invisible(dev.off())
@ 

The result is shown in Fig.~\ref{plotgenome}.
\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.85\textwidth]{genoset-plotgenome.png}  
     \caption{\label{plotgenome}%
     Segmented copy number across the genome for 1st sample}
  \end{center}
\end{figure}

<<fig=FALSE, echo=FALSE>>=
png(file="genoset-plotchr.png",width=12,height=5,res=150,units="in")
@ 

<<plotchr, echo=TRUE>>=
genoPlot(cn.ds,1,chr="chr12")
genoPlot(cn.ds,1,chr="chr12",element="cn.segs",add=TRUE, col="red")
@ 

<<fig=FALSE, echo=FALSE>>=
invisible(dev.off())
@ 

The result is shown in Fig.~\ref{plotchr}.
\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.85\textwidth]{genoset-plotchr.png}  
     \caption{\label{plotchr}%
     Segmented copy number across chromosome 12 for 1st sample}
  \end{center}
\end{figure}

Plot data without a GenoSet object using numeric or Rle data:
<<plotchrsimple, eval=FALSE>>=
chr12.ds = cn.ds[chr(cn.ds) == "chr12",]
genoPlot(pos(chr12.ds),cn(chr12.ds)[,1],locs=locData(chr12.ds))  # Numeric data and location
genoPlot(pos(chr12.ds),assayDataElement(chr12.ds,"cn.segs")[,1],add=TRUE, col="red") # Rle data and numeric position
@ 

\section{BAFSet Objects}
BAFSet objects are similar in most ways to CNSet objects, but require
at least two types of data, LRR and BAF, in AssayData. ``BAF''
represent the B-Allele Frequency or the fraction of array signal
coming from one allele. BAF provides a measure of Loss of
Heterozygosity (LOH).``LRR'' is the Log R Ratio, or log2(
tumor/expected ), which is basically the traditional copynumber
logratio by another name. Both terms come from Illumina and are
discussed in Peiffer et al., 2008.

\subsection{Processing Data}
BAF data can be converted into the ``Modified BAF'' or mBAF metric, introduced by Staaf, et
al., 2008. mBAF folds the values around the 0.5 axis and makes the HOM positions NA.  The
preferred way to identify HOMs is to use genotype calls from a matched
normal (AA, AC, AG, etc.), but NA'ing values over a certain value works OK.  A hom.cutoff of 0.90 is
suggested for Affymetrix arrays and 0.95 for Illumina arrays, following Staaf, et al.

Return data as a matrix:
<<mbafcutoff>>=
mbaf.data = baf2mbaf(baf(baf.ds),hom.cutoff = 0.90)
assayDataElement(baf.ds,"mbaf") = mbaf.data
@ 

... or use compress it to a DataFrame of Rle. This uses ~1/3 the space
on our random test data.
<<mbaftorle>>=
mbaf.data = DataFrame( sapply(colnames( mbaf.data),
  function(x) { Rle( mbaf.data[,x] ) },
  USE.NAMES=TRUE, simplify=FALSE ) )
as.numeric(object.size( assayDataElement(baf.ds,"mbaf")))  / as.numeric( object.size(mbaf.data))
@ 

Using the HOM SNP calls from the matched normal works much better. A
matrix of genotypes can be used to set the HOM SNPs to NA. A list of
sample names matches the columns of the genotypes to the columns of
your baf matrix. The names of the list should match column names in
your baf matrix and the values of the list should match the column
names in your genotype matrix.  If this method is used and some
columns in your baf matrix do not have an entry in this list, then
those baf columns are cleaned of HOMs using the hom.cutoff, as above.

Both mBAF and LRR can and should be segmented. Consider storing mBAF
as a DataFrame of Rle as only the ~1/1000 HET positions are being used
and all those NA HOM positions will compress nicely.

<<segment>>=
assayDataElement(baf.ds,"baf.segs") = runCBS( assayDataElement(baf.ds,"mbaf"), locData(baf.ds) )
assayDataElement(baf.ds,"lrr.segs") = runCBS( lrr(baf.ds), locData(baf.ds) )
@ 

\subsection{Plots}

BAFSets (and GenoSets) have a genoPlot function as well that additionally requires a ``element'' argument, which
can be ``lrr'' or ``baf'' or any other element you happen to have put in ``assayData''.

<<fig=FALSE, echo=FALSE>>=
png(file="genoset-plotlrr.png",width=7,height=4,res=150,units="in")
@ 
<<plotlrr>>=
genoPlot(baf.ds,1,chr="chr12",element="lrr",main="LRR of chr12")
genoPlot(baf.ds,1,chr="chr12",element="lrr.segs",add=TRUE,col="red")
@
<<fig=FALSE, echo=FALSE>>=
invisible(dev.off())
@ 

The result is shown in Fig.~\ref{plotlrr}.
\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.85\textwidth]{genoset-plotlrr.png}  
     \caption{\label{plotlrr}%
     Segmented copy number across the genome for 1st sample}
  \end{center}
\end{figure}


<<fig=FALSE, echo=FALSE>>=
png(file="genoset-plotbaf.png",width=7,height=8,res=150,units="in")
@ 
<<plotbaf, echo=TRUE>>=
par(mfrow=c(2,1))
genoPlot(baf.ds,1,chr="chr12",element="baf", main="BAF of chr12")
genoPlot(baf.ds,1,chr="chr12",element="mbaf", main="mBAF of chr12")
genoPlot(baf.ds,1,chr="chr12",element="baf.segs", add=TRUE,col="red")
@ 
<<fig=FALSE, echo=FALSE>>=
invisible(dev.off())
@ 


The result is shown in Fig.~\ref{plotbaf}.
\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.85\textwidth]{genoset-plotbaf.png}  
     \caption{\label{plotbaf}%
     Segmented copy number across the genome for 1st sample}
  \end{center}
\end{figure}

\subsection{Cross-sample summaries}
You can quickly calculate summaries across samples to identify regions
with frequent alterations.  A bit more care is necessary to work one
sample at a time if your data ``matrix'' is a DataFrame.

<<crosssample>>=
gain.list = lapply(sampleNames(baf.ds),
  function(sample.name) {
    as.logical( assayDataElement(baf.ds,"lrr.segs")[,sample.name] > 0.3 )
})
gain.mat = do.call(cbind, gain.list)
gain.freq = rowMeans(gain.mat,na.rm=TRUE)
@ 

GISTIC (by Behroukhim and Getz of the Broad Institute) is the standard 
method for assessing significance of such summaries. You'll find 
``segTable'' convenient for getting your data formatted for input. I 
find it convenient to load GISTIC output as a RangedData for 
intersection with gene locations.

\section{Big Data and bigmemory}
Genome-scale data can be huge and keeping everything in memory can get
you into trouble quickly, especially if you like using multicore.

GenoSet supports using big.matrix objects from the bigmemory package
as assayDataElements.  GenoSet uses one minor modification of the
big.matrix object to facilitate saving to and reloading from
disk. Specifically, genoset places the path to the big.matrix
``description'' file in an attribute of the object called ``desc''.
This is handled by the convertToBigMatrix function.

Additionally, the function readGenoSet takes care of re-attaching all
big.matrix objects to their on-disk components. readGenoSet takes the
path to a RData file as its argument. It assumes the RData file
contains only only object, which is an object of class eSet or an
object of a derived class.  readGenoSet works for any GenoSet, not
just those that contain big.matrix matrices.

<<bigmatrix, eval=FALSE>>=
cn.ds = convertToBigMatrix(cn.ds)
save(cn.ds,file="cn.RData")
reloaded.cn.ds = readGenoSet("cn.RData")
@ 
  
\end{document}