%\VignetteIndexEntry{Visualising very long data vectors with the Hilbert curve}
%\VignettePackage{HilbertVis}

\documentclass{article}

\usepackage{Sweave}
\usepackage[a4paper]{geometry}
\usepackage{hyperref,graphicx}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} 
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{{\small\texttt{#1}}}

\author{Simon Anders\\[1em]European Bioinformatics Institute,\\ Hinxton, Cambridge, UK\\[1em]
\texttt{sanders@fs.tum.de}}

\title{\textsf{\textbf{Visualising very long data vectors\\ with the Hilbert curve}\\
Description of the Bioconductor packages\\ \texttt{HilbertVis} and \texttt{HilbertVisGUI}.}}

\date{2009-06-29}

\begin{document}

\maketitle

\section{Introduction}

In a number of different fields, intermediate results often take the form of
extremely long data vectors with many millions of entries. In order to grasp the
overall statistical structure and the distribution of features, specialdef
visualisation techniques can be most helpful. The Bioconductor package
``HilbertVis'' offers such a tool.

In the present vignette, the package is demonstrated with random example 
data. To see the use of \texttt{HilbertVis} with real biological data (namely
ChIP-Seq data), see the publication \cite{HilbertVisBioinf} or the vignette 
\cite{ShortRead_and_HilbertCurveDisplay}.

\section{The problem}

<<setSeed,echo=FALSE>>=
set.seed(271828)
@

The command \texttt{makeRandomTestData} (which is provided only for demonstration purposes) 
produces a very long vector filled with random data
<<>>=
library( HilbertVis )
vec <- makeRandomTestData( )
@

This is a very long vector, with 10,000,000 entries:
<<>>=
length(vec)
@

An obvious first way to inspect it is to make a needle plot it with a command 
like
\begin{verbatim}
plot( vec, type="h" ) 
\end{verbatim}

\begin{figure}
\centering
\includegraphics{HilbertVis-plotLongVector}
\caption{A simple plot of the example data vector, made with
the function \texttt{plotLongVector}.}
\label{plotLongVector}
\end{figure}


This command takes very long to complete because so many needles are plotted on
top of each other. The following command produces the same output in a more
reasonable speed
<<plotLongVector,width=9,height=4,fig=TRUE,include=FALSE>>=
plotLongVector( vec )
@

From an inspection of this plot (Fig.\ \ref{plotLongVector}), we may only conclude that the data vector
contains a lot of peaks which seem to be distributed densely but randomly. But
are all these peaks similar? Are they really evenly distributed? Are the
individual needles in the plot single peaks, or several peaks which are plotted
on top of each other?

\begin{figure}
\centering
\includegraphics{HilbertVis-zoomIn}
\caption{Zooming into a randomly chosen region of the data of Fig.\
\ref{plotLongVector}. }
\label{zoomIn}
\end{figure}

The straight-forward solution is to zoom in and plot a region picked at random:
<<zoomIn,fig=TRUE,width=9,height=4,include=FALSE>>=
start <- 1500000
plotLongVector( vec[ start + 0:100000 ], offset=start )
@
In order to see whether this region (Fig.\ \ref{zoomIn}) is now typical, and whether all peaks look
like these, we would have to make many more such plots.

\section{Hilbert curves}

The problem with Fig.\ \ref{plotLongVector} is that each pixel corresponds to
several thousands of vector elements. Even if we made the figure much wider and
print it finer, the horizontal axis can only hold a very limited amount of
discernible values. The idea of the Hilbert curve visualisation is to use the
space in both the horizontal and vertical direction in order to show more
details.

\begin{figure}
\centering
\includegraphics{HilbertVis-HilbertTest}
\caption{Hilbert curve plot of the example data vector from Fig.\ \ref{plotLongVector}}
\label{HilbertTest}
\end{figure}

The Hilbert curve plot is an approach to display an as detailed picture of the
whole data vector as possible by letting each pixel of a large square represent 
a quite short part of the chromosome, coding with its colour for the maximum 
value in this short stretch, where the pixels are arranged such that 
neighbouring parts of the data vector appear next to each other in the square. 
Furthermore, parts which are not directly neighbouring but are in close 
distance should not be separated much in the square either. Fig.\ \ref{HilbertTest} 
shows our data vector visualised in this way.

\begin{figure}
\centering
\includegraphics{HilbertVis-HilbertCurves}
\caption{The first four levels of the Hilbert curve fractal.}
\label{HilbertCurves}
\end{figure}

In order to understand this plot you need to know how the pixels are arranged 
to fulfill the requirements just outlined as well as possible. To my knowledge, 
the first to come up with the solution also 
used here was D.\ A.\ Keim in Ref.\ \cite{HilbertVisualization_first} (where 
he used the technique to visualise long time-series data of stock-market prices). 
He went back to an old idea of Peano \cite{PeanoCurve_first} and Hilbert 
\cite{HilbertCurve_first}, space-filling curves. Peano astonished the 
mathematics community at the end of the 19th century by presenting a continuous 
mapping of a line to a square, i.e., showed that a line can be folded up such 
that it passes through every point within a square, thus blurring the seemingly 
clear-cut distinction between one- and two-dimensional objects. Such a 
space-filling curve is a fractal, i.e., it has infinitely many corners and 
repeats its overall form in all levels of its details. Fig.\ \ref{HilbertCurves} 
shows the first four level of the construction of Hilbert's variant of Peano's 
curve. Observe how at level $k$ a line of length $2^{2k}$ passes through each 
``pixel'' of a square of dimension $2^k \times 2^k$, and how this curve is 
produced connecting four copies (in different orientations) of the curve at 
the previous level, $k-1$.

For Fig.\ \ref{HilbertTest}, a Hilbert curve at level 9 has been used, i.e., the 
image has $2^9\times 2^9 = 262,144$ pixels. As \texttt{vec} has a length of 10 million 
elements, each pixel now represents around 38 elements. While we were unsure about 
how to interpret the needle-like peaks in Fig.\ \ref{plotLongVector}, Fig.\ 
\ref{HilbertTest} gives a much clearer idea. Each of small blue or blue-red 
features is one peak, and its area is the width of the peak. (The colour 
gives the height.)

We can now see a feature of the data vector that was not noticeable in 
Figs.\ \ref{plotLongVector} and \ref{zoomIn}: While most of the peaks have an
area of of one to ten pixels corresponding to peak width up to approx.\ 400), 
there is a region, in the lower left-hand corner, with much wider peaks. This 
spatial inhomogeneity is very obvious in the Hilbert plot but easy to miss with 
more conventional plots.

Before I show an example with biological data, I explain in the next section 
how the plots in this section were made.

\section{Usage}

Figure \ref{HilbertCurves} has been produced with the function 
\texttt{plotHilbertCurve} which is provided just for demonstration purposes. 
Typically, it takes one parameter, the level of the curve.
Here it is used within a loop to arrange the four plots in the figure.
<<HilbertCurves,fig=TRUE,include=FALSE>>=
library( grid )
pushViewport( viewport( layout=grid.layout( 2, 2 ) ) )
for( i in 1:4 ) {
   pushViewport( viewport( 
       layout.pos.row=1+(i-1)%/%2, layout.pos.col=1+(i-1)%%2 ) )
   plotHilbertCurve( i, new.page=FALSE )
   popViewport( )
}   
@

To get the Hilbert plot of a data vector (Fig.\ \ref{HilbertTest}),  
use the function \texttt{hilbertImage}:
<<>>=
hMat <- hilbertImage( vec )
@

\texttt{hMat} is now a matrix of dimensions $2^9\times 2^9$ (i.e., $512\times
512$) that contains the 
intensity values for the plot. The function \texttt{hilbertImage} takes some optional arguments,
for example \texttt{level}, in case you want another level (i.e. another matrix 
size) than the default $9$.

To plot the matrix, we could just pass it to the standard functions \texttt{image} 
or \texttt{levelplot}. The function \texttt{showHilbertImage} is a convenient wrapper around
\texttt{levelplot} that sets a suitable colour palette and supresses labels and tickmarks:
<<eval=FALSE>>=
showHilbertImage( hMat ) 
@
<<HilbertTest,echo=FALSE,fig=TRUE,include=FALSE>>=
show( showHilbertImage( hMat ) )
@

If you look at such an image on screen, you will notice that the \texttt{image} 
function does not display one matrix element as one pixel but rather scales the
image according to the window size. This typically results in distracting
artifacts, and hence, it is preferable to use the \texttt{display} function from
the \texttt{EBImage} package that allows to suppress scaling. The 'mode' argument of
\texttt{showHilbertImage} allows to use the \texttt{EBImage} package instead of 
the \texttt{lattice} package to display the matrix:
<<eval=FALSE>>=
showHilbertImage( hMat, mode="EBImage" )
@

\section{Interactive tool for data exploration}

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{HilbertDisplay_GUI}
\caption{The graphical user interface provided by \texttt{HilbertVisGUI}.}
\label{GUI}
\end{figure}

Everything so far only required the package \texttt{HilbertVis}. Provided you have
GTK+ and gtkmm installed on your system (see Appendix for more explanation) you 
can also install \texttt{HilbertVisGUI} which provides a graphical tool to
interactively explore Hilbert curve plots (Fig.\ \ref{GUI}). 

To start the tool, load the package
<<eval=FALSE>>=
library( HilbertVisGUI )
@
and simply write
<<eval=FALSE>>=
hilbertDisplay( vec )
@

First notice the red pointer in the field labelled ``Displayed part of sequence''.
As you move the mouse cursor through the Hilbert plot the pointer shows you where
in the data vector (``sequence'') you are. This visual feed-back helps to understand
how the Hilbert curve ``folds'' the vector into the image. You can also read off
the precise coordinate corresponding to the pixel under the mouse cursor (or rather:
the midpoint of the range of coordinates (``bin'') represented by this pixel) in the line
above the gauge. 

By clicking onto the image, you can zoom into the image. The program will select
that quarter of the displayed curve that contains your mouse click and display it.
The left gauge (labelled ``full sequence'') represents the full data vector, with
the currently displayed part highlighted in red.

If you pass several vectors to \texttt{hilbertDisplay}, you can use the buttons 
labelled ``Previous'' and ``Next'' to switch back and forth between the data
vectors in order to compare them. 

By switching the ``Effect of the left mouse button'' to ``Linear plot'', you can get
an ordinary 1D plot of the region around the pixel you clicked on. Click onto a
peak and observe how you can inspect the peak's shape in the plot.

To do the linear plot, \texttt{hilbertDisplay} calls the R function 
\texttt{simpleLinPlot} defined as well in the \texttt{HilbertVisGUI} package. 
This is a simple wrapper around the function \texttt{plotLongVector} discussed
earlier. You may want to have a look at its definition.

You can replace this function by supplying your own plotting function as the 
argument \texttt{plotFun} to \texttt{hilbertDisplay}. Your function must take two 
arguments that should be called \texttt{data} and \texttt{info}, as above, and will 
be filled in by \texttt{hilbertDisplay} with the displayed vector and information 
about where the user clicked and which part of the vector is being displayed. Try the 
following example to see the format of this data:
<<eval=FALSE>>=
dumpDataInsteadOfPlotting <- function( data, info ) {
   str( data )
   print( info )
}
hilbertDisplay( vec, plotFunc=dumpDataInsteadOfPlotting )
@

Zoom in a bit, then switch to "linear plot" and click somewhere. \texttt{dumpDataInsteadOfPlotting} 
will be called and output such as the following appears on your R console:

\begin{verbatim}
num [1:10000000] 0 0 0 0 0 0 0 0 0 0 ...
$binLo
[1] 9555093

$bin
[1] 9555112

$binHi
[1] 9555131

$dispLo
[1] 7500001

$dispHi
[1] 10000001

$seqIdx
[1] 1

$seqName
[1] "vec"
\end{verbatim} %$

The function \texttt{hilbertDisplay} has a number of useful optional arguments which 
are explained in its help page. Have a look there for details.


\section{Three-channel Hilbert plots}

Let us construct a second data vector that differs from \texttt{vec} only 
in a subtle way:
<<>>=
vec2 <- vec
vec2[ 7000000 : 8000000 ] <- vec[ 7010000 : 8010000 ]
@

If we compared the Hilbert plots of \texttt{vec} and \texttt{vec2} side-by-side
we might miss the difference. Loading them both into the interactive tool, with
<<eval=FALSE>>=
hilbertDisplay( vec, vec2 )
@
makes spotting the difference easier: When clicking the ``Previous'' and ``Next'' 
buttons one notices that most peaks stay put while a few move a bit.

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{ThreeChTest}
\caption{A three-channel plot (only using the red and the green channel).}
\label{ThreeChannel}
\end{figure}

Another possibility is to overlay these two plots, displaying \texttt{vec} in red
and \texttt{vec2} in green. The function \texttt{hilbertDisplayThreeChannels} start
the interactive tool in this mode. Different from \texttt{hilbertDisplay}, this
function expects the values for the channel to be normalised to values between 0
(corresponding to black) and 1 (corresponding to the channel's colour in full saturation).
Hence, we divide both vectors by the maximum of \texttt{vec}.
<<eval=FALSE>>=
hilbertDisplayThreeChannel( 
   vec / max(vec), vec2 / max(vec), rep( 0, length(vec) ) )
@
The third argument is the vector for the blue channel. As we don't have a third vector,
we simply pass a vector of the same length, filled with zeroes only.

In this display style, those peaks common to both vectors appear in yellow, while the
those different are red and green (Fig.\ \ref{ThreeChannel}).

It might be desirable to have also a non-interactive
version of this functionality. This can be done by calling \texttt{hilbertImage}
for each data vector and then using the functionality of the \texttt{EBImage} package
to overlay the three images in three colour channels. As the API of \texttt{EBImage} is 
due to be overhauled soon I do not describe this here as it would be outdated soon.

\section{Reading in GFF and wiggle data}

An important use for the \texttt{HilbertVis} package is genomic data. Imagine a vector the
length of a human chromosome that gives for each base pair a score of some genome-position-dependent
quantity. Such data is typically displayed as a ``wiggle track'' in genome browsers such
 as the UCSC browser at \url{http://genome.ucsc.edu/} \cite{UCSC_genome_browser}.
 
The common formats to supply data for wiggle tracks are the General Feature Format (GFF) \cite{GFF_Spec},
and the wiggle track format \cite{Wiggle_Spec}. The function \texttt{makeWiggleVector} is
supplied to read in such files and provide their content as a ``wiggle vector'', i.e., a
vector with one element per base pair that gives the score value for this base pair. Such a
wiggle vector can then be visualised in the same way as the data vector \texttt{vec} in the
examples above. For details, see the help page for \texttt{makeWiggleVector}.

Another source of genomic data are aligned short reads from high-throughput sequencing 
experiments. See \cite{ShortRead_and_HilbertCurveDisplay} for an explanation how to deal with such
data.

\section{Rle vectors}

Long data vectors with are step-wise constant are more suitably dealt with in
run-length encoding (RLE), i.e., instead of storing all values, one has two
vectors, one that has the lengths of the constant stretches, one with the
corresponding values. The \texttt{Rle} class in the \texttt{IRanges} package
provides this functionality.

The functions \texttt{hilbertDisplay} and \texttt{hilbertDisplayThreeChannel}
can both work with \texttt{Rle} vectors as well as with ordinary vectors (since
version 1.3.1 of \texttt{HilbertVisGUI}).
 
\appendix

\section{Installation}

The packages \texttt{HilbertVis} and \texttt{HilbertVisGUI} are meant to be used 
with the statistics programming language R. While \texttt{HilbertVis} can be installed
on any R system (with R version at least 2.7), you need to have \texttt{gtkmm}
available on your system in order to install \texttt{HilbertVisGUI}. You need 
\texttt{HilbertVisGUI} only if you want to use the interactive tool described in 
Section \label{SecGUI}. Don't worry, it is very easy to install gtkmm and the installer will tell you
what to do.

There is also a stand-alone version of HilbertVis that can be used independent of
R. See \url{http://www.ebi.ac.uk/~anders/hilbert} for more information.

\subsection{Installing HilbertVis}

To install \texttt{HilbertVis}, simply use the standard procedure explained on 
the Bioconductor web site, i.e., type in an R session
\begin{verbatim}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install( "HilbertVis" )
\end{verbatim}

\subsection{Installing HilbertVisGUI}

The GUI component uses the \texttt{gtkmm} library, the C++ bindings for the GTK+
widget system. \footnote{See \url{http://www.gtk.org/} for more information on the GTK+ 
project.}

Install \texttt{HilbertVisGUI} with \texttt{biocManager}, i.e., by typing

\begin{verbatim}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install( "HilbertVisGUI" )
\end{verbatim}

On platforms with binary packages (i.e., Linux and the like), the installer will check whether gtkmm is
available. If it cannot find it, it will display an explanation how to install it. 

On platforms with binary packages (Mac and Windows), this explanation appears once you load the package for the first
time. You will be asked to simply download an automatic installer from a given web address.


\section{Session Info}

<<>>=
sessionInfo()
@

\section{References}

\bibliographystyle{simon2}
\bibliography{hilbert}

\end{document}