Skip to content.

bioconductor.org

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.

Sections

Lab2b.Rnw

% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Lab 2b} %\VignetteDepends{Biobase, ellipse, annotate, geneplotter, hgu95av2} %\VignetteKeywords{Microarray, visualization} \documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\title{An Introduction to Some Graphics in Bioconductor}

\begin{document}

\maketitle

\section*{Introduction}

We first need to set up the basic data regarding the genome of interest. The \Robject{chromLocation} class describes the necessary components for plotting expression data in its chromosomal location. To do the plotting we need to know how many chromosomes there are, what their lengths are and various other pieces of information. Additionally mappings from the probe identifiers to chromosomes and specific locations on chromosomes is also needed. These data are all contained in the Bioconductor meta-data packages and one of these packages can be used to create an instance of the \Robject{chromLocation} class.More details on how to do this and examples are given in the vignettes for the \Rpackage{geneplotter} package.

<>= library(annotate) library(geneplotter) library(hgu95av2) newChrom <- buildChromLocation("hgu95av2") newChrom

@

\section*{Whole Genome Plots}

We have a number of different plotting features available to us. The functions \Rfunction{cPlot} and \Rfunction{cColor} provide plots where the chromosomes are laid out (3' to 5') in a linear fashion and the genes are displayed as short perpendicular lines. Lines that go up indicate genes that are located on the sense strand while lines that go down indicate genes that are located on the antisense strand. The perpendicular lines can be colored differently. In some cases you could color those genes high in one group red and those that were highly expressed in another group blue.

You can plot all chromosomes for an organism or any selected subset of the chromosomes. Chromosomes can be rendered using their relative lengths or scaled to fill the whole plot. Again, more details are available in the manual pages and the vignettes of the \Rpackage{geneplotter} package.

<>= cPlot(newChrom) data(eset) cColor(geneNames(eset), "red", newChrom) @

Specific genes can be plotted in color if desired. We can now add the locations of the genes contained in the test data set \Robject{eset}. Alternatively, you can restrict attention to just a specific set of chromosomes. <>= cPlot(newChrom,c("1","2"),fg="yellow",scale="relative") cColor(geneNames(eset), "red", newChrom) @

In this next plot we show the differences between the two scalings that can be used for the chromosomes. These are relative and max, where all chromosomes are presented as being the same length.

<>=

par(mfrow=c(2,1)) for (sc in c("max","relative")) cPlot(newChrom,fg="blue",scale=sc)

@

There are many other sources of information that could be easily added to these plots. Particularly things such as indicating syntenic regions, regions of high GC content, or CpG islands could all be added to these plots.

\section{Single Chromosome Plotting}

Another type of plot that is sometimes of use is provided by \Rfunction{alongChrom}. The purpose of this plot is to show gene expression values in certain selected regions of a chromosome. Researchers might be interested in regions with known genetic defects or suspected amplifications and deletions.

%%FIXME:plot two looks odd
need to make sure that this is fixed << alongChrom,fig=TRUE >>=

cols <- c("red", "green", "blue") cols <- cols[eset$cov3]

par(mfrow=c(3,2)) alongChrom(eset, "1", newChrom, xloc="equispaced", plotFormat="cumulative", col=cols,lwd=2) alongChrom(eset, "1", newChrom, xloc="physical", col=cols,lwd=2) alongChrom(eset, "1", newChrom, xloc="equispaced", plotFormat="local", col=cols,lwd=2) alongChrom(eset, "1", newChrom, xloc="equispaced", plotFormat="local", col=cols, type="p", pch=16) alongChrom(eset, "1", newChrom, xlim = c(87511280,127717880), xloc="equispaced", plotFormat="local", col=cols, type="p", pch=16) alongChrom(eset, "1", newChrom, xloc="equispaced", plotFormat="image")

@

\section{Heatmaps and other tools}

As of release 1.7.0 of R there is a \Rfunction{heatmap} function available. Heatmaps are interesting data displays made popular for genomic data by \citet{Eisen99}. The ideas are older and can be seen in work of Bertin (and other sources).

Basically a heatmap is a false color display where the rows and the columns have been permuted to show \textit{interesting} patterns. In the R implementation the ordering is carried out by sorting the data using a hierarchical clustering algorithm.

In the code segment below we extract the expression values from the test data set, \Robject{eset}, and then get meaningful names for the genes. We do this by finding the mapping between the probe identifiers that are stored with the data and the gene symbols.

The dendrograms on the top and side are created by the clutering function. By default this is hierarchical clustering with complete linkage.

<>= data(eset) Exprs <- exprs(eset) gN <- geneNames(eset) syms <- getSYMBOL(gN, "hgu95av2") syms <- ifelse(is.na(syms),gN, syms) row.names(Exprs) <- syms

heatmap(Exprs[1:100,], col=rev(dChip.colors(50))) @

We can change the behavior by simply defining a function that will do single linkage clustering. Notice the usual oddities of single-linkage clustering. The plot suggests that there is a group of about 5 genes that are quite close to each other (at the top) and one off by itself at the bottom. Additionally, one of the samples seems rather far from the others. Single linkage clustering is a good mechanism for finding outliers -- and hence is probably a plot worth examining. <>= slf <- function(d) hclust(d, method="single") heatmap(Exprs[1:50,], col=rev(dChip.colors(50)), hclustfun=slf) @

As for the other R functions, there are a host of options that you can set and control. The clustering mechanism, the distance metric used. You can supply your own row or column dendrograms and directly manipulate the plots.

Choosing a good set of colors is an important part of the construction of any visualization tool. We recommend that you look at the work of Cynthia Brewer (\url{www.colorbrewer.org}) as a source of reasonable palettes. Her work is also available in the package \Rpackage{RColorBrewer} available from CRAN.

\section*{Visualizing Distances}

For microarray data and many other types of experimental data you will want to apply different machine learning algorithms during the analysis. All such algorithms (either clustering or classification) depend on some notion of distance between the objects being clustered or classified. There is \textbf{no} \textit{a priori} best distance. You will need to carefully consider the metric that is best for the data and the classifier being used.

The resulting pairwise distances can be hard to comprehend and use. In this section we consider different methods of visualizing distances. We consider three different techniques that have proven useful. \begin{itemize} \item Plotting the pairwise distance matrix using image or the \Rpackage{ellipse} package. \item Using multidimensional scaling to reduce the dimensionality. \item Using the \Rfunction{heatmap} function on the pairwise distances. \end{itemize}

Again, we will carry use the example data in \Robject{eset} to demonstrate some of these functionalities.

First we examine using the \Rfunction{image} function and the \Rpackage{ellipse} package.

<>=

##Euclidean distances d1 <- dist(t(scale(Exprs))) dN <- dimnames(Exprs)[[2]] nS <- length(dN) d1M <- as.matrix(d1) dimnames(d1M) <- list(dN, dN) par(mfrow=c(1,1)) image(1:nS,1:nS,d1M, col=dChip.colors(50), axes=FALSE, xlab="", ylab="", main="Between Sample Distances") axis(1, at =(1:nS), label=dN, tick=FALSE ) axis(2, at =(1:nS), label=dN, tick=FALSE )

@ We have been fairly fancy in setting up this plot, but that is one of the advantages of using R as our engine for these analyses. There are many more options that you could explore. Notice that array \texttt{R} seems to be quite different from the other arrays. The colors are encoded so that red means a large distance so we see that array \texttt{R} is a far from the others.

We can also see this using the ellipse package.

<>= library(ellipse) r <- cor(Exprs) d2M <- 1-r plotcorr(r, main="Correlation Matrix for eset data")

@ Here we see that the correlation between all samples (for these genes) is positive. The tightness of the plotted ellipse indicates the strength of the correlation. Now we see that array \texttt{R} has a lower correlation with other arrays (and hence a larger distance as we saw in the previous plot).

Finally, it is worth drawing a heatmap using the distance matrix. Here we need to make sure that the rows (and or columns) are not scaled. Scaling within rows is the default behavior for heatmap, but for distance matrices it is not appropriate.

<>= heatmap(d1M, col=dChip.colors(50), scale="none") @ <>= heatmap(r, col=dChip.colors(50), scale="none") @

\end{document}

News
2009-10-26

BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.

2009-01-07

R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.