% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{} %\VignetteDepends{} %\VignetteKeywords{} %\VignettePackage{} \documentclass[12pt]{article} \usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Data representation and visualization for genomic data analysis with Bioconductor: Brixen 2013} \author{VJ Carey, Channing Division of Network Medicine, Harvard Medical School} \maketitle \tableofcontents \clearpage \section{Prologue} This course will be taught using R 3.0 and Bioconductor 2.12. You are expected to install all necessary packages prior to attendance. A script for performing and verifying the installation will be provided. <>= att = c("ggbio", "TxDb.Hsapiens.UCSC.hg19.knownGene", "GenomicFeatures", "gwascat", "graph", "snpStats", "Matrix", "lattice", "survival", "mgcv", "yeastCC", "VariantAnnotation", "Rsamtools", "ALL", "yeast2probe", "yeast2.db", "org.Sc.sgd.db", "RSQLite", "DBI", "AnnotationDbi", "Biobase", "BSgenome.Scerevisiae.UCSC.sacCer2", "BSgenome", "Biostrings", "GenomicRanges", "IRanges", "BiocGenerics", "randomForest", "reshape2", "ggplot2", "BiocInstaller", "weaver", "codetools", "digest") lo = c("biomaRt", "biovizBase", "bitops", "cluster", "colorspace", "dichromat", "gridExtra", "gtable", "Hmisc", "labeling", "MASS", "munsell", "nlme", "plyr", "proto", "RColorBrewer", "RCurl", "rtracklayer", "scales", "stringr", "XML", "zlibbioc", "org.Hs.eg.db", "GenomicRanges", "HilbertVis", "lattice") sapply(att, require, character.only=TRUE) sapply(lo, require, character.only=TRUE) @ \section{Preliminaries with R and statistical analysis} \subsection{Statistical analysis of some small datasets} \subsubsection{Ambiguities of simple statistical summaries} The \texttt{anscombe} data are specially constructed to illustrate limitations of standard statistical summaries. <>= data(anscombe) anscombe[1:3,] <>= par(mfrow=c(2,2)) for (i in 1:4) plot(anscombe[,i], anscombe[,i+4], xlab=paste("x", i, sep=""), ylab=paste("y", i, sep="")) @ Show that the correlation and (simple linear) regression coefficients relating $y_i$ to $x_i$, $i = 1, \ldots, 4$ are identical to 3 decimal places. Use quadratic regression to model the second configuration and comment on the distribution of residuals. \subsubsection{Multivariate analysis of plant anatomy} \subsubsection*{Three visualizations of iris flower measurements} Fisher's iris data are readily available. The scatterplot matrix is easily formed and suggests that the measurements can be used to discriminate species. <>= head(iris) <>= pairs(iris[,-5], col=factor(iris[,5]), pch=19) @ It is not straightforward to annotate this plot. With some reshaping of the data, \textit{ggplot2} may be more communicative. <>= oopt = options() owidth = getOption("width") options(width=45) <>= irsep = cbind(iris[,c(1,2,5)],feat="Sepal") irpet = cbind(iris[,c(3,4,5)],feat="Petal") names(irsep)[1:2] = c("Length", "Width") names(irpet)[1:2] = c("Length", "Width") ir = rbind(irsep, irpet) head(ir) library(ggplot2) ggplot(ir) + geom_point(aes(x=Length, y=Width, colour=paste(Species, feat), shape=paste(Species, feat))) @ For a final view of the data, we will reshape the data frame and use \textit{ggplot2} again. <>= library(reshape2) miris = melt(iris) ggplot(miris) + geom_boxplot(aes(x=variable, y=value, colour=Species)) @ \subsubsection*{Categorical data analysis} It is sometimes useful from an interpretive perspective to consider discretizations of measured quantities. For example, in the iris data, we may define thresholds for long, short and intermediate petal lengths. There is an ad hoc aspect to choosing the thresholds, and in some cases interpretation may be sensitive in a substantively important way to details of the choice. Relationships among categorical variables can be described using statistical analysis of contingency tables. Observations are cross-classified according to categories occupied by different attributes. The discipline of loglinear modeling establishes likelihood-based procedures for comparing different formal models for relationships among variables. A hypothesis that can be tested in loglinear modeling of variables X, Y, Z is ``X is conditionally independent of Y given Z''. Networks of relationships can often be usefully characterized by elaborating statements of this type, and the discipline of graphical modeling pursues this idea. To illustrate this on a very small scale, we will form tertiles for the iris sepal width measurements and test for independence of this discretized score with iris species. The mosaic plot can give an indication of goodness of fit for a specific loglinear model for a contingency table. Exercise: Check the documentation of the functions used here, focusing on the interpretation of the margin parameter, and develop more comprehensive models for discretized versions of the iris measurements. <>= tertiles = function(x) cut(x, quantile(x, c(0,1/3,2/3,1))) ti <- with(iris, table(SepWidTert=tertiles(Sepal.Width), Species)) ti mosaicplot(ti, margin=list(1,2), shade=TRUE) chisq.test(ti) chisq.test(ti[,-1]) @ \subsubsection*{Principal components} The simplest approach works from a matrix representation of the numerical data. <>= di = data.matrix(iris[,1:4]) pdi = prcomp(di) pairs(pdi$x, col=factor(iris$Species)) @ Interpret: <>= pairs(pdi$x, col=factor(iris$Species)) cor(pdi$x[,1], iris[,1:4]) cor(pdi$x[,2], iris[,1:4]) @ \subsubsection{Machine learning: unsupervised} We'll use hierarchical clustering to look for structure in the iris measurements. <>= c1 = hclust(dist(di)) plot(c1) @ Choice of features and distance for comparing and clustering objects are key determinants of results of cluster analyses. The exercises involve assessment of distances used in this simple hierarchical clustering. Consider how to assess the sensitivity of the cluster assignments to choice of feature set and distance function. Exercise: Evaluate \texttt{names(c1)}. Explain the value of \texttt{c1\$merge[1,]} (consult the help page for \texttt{hclust}). Explain: \begin{verbatim} > which(c1$height>.25)[1] [1] 44 > c1$merge[44,] [1] -69 -88 > dist(iris[c(69,88),-5]) 69 88 0.2645751 \end{verbatim} \subsubsection{Machine learning: supervised} We'll take a training sample from the iris data, use ``random forests'' to generate a prediction procedure, and check concordance of predictions and given labels for a test set. <>= set.seed(1234) s1 = sample(1:nrow(iris), size=nrow(iris)*.75, replace=FALSE) train = iris[s1,] test = iris[-s1,] library(randomForest) rf1 = randomForest(Species~., data=train) rf1 @ Here we will use a generic `predict' method with \texttt{newdata} argument. <>= table(predict(rf1, newdata=test), test$Species) @ \textbf{Exercise:} The misclassified test cases must be 'hard'. Write the R code to find them and examine the raw values in relation to the correctly classified cases (perhaps summarized statistically). <>= options(width=owidth) @ \subsection{Data input} Various bioinformatic workflows generate data in various formats. A key hurdle for analysts is conversion from the workflow export into analyzable structures. Bioconductor addresses lots of such problems. For data of modest volume "CSV" is a common hub format -- scientists often use MS Excel to examine and manage data, and it is easy to export Excel spreadsheets into textual comma-separated values. R's \texttt{read.csv} will handle any genuine CSV file. The ``data import and export'' manual at the R project web site should be carefully studied. We will deal with specific problems throughout the course. For genomic data, particularly annotations (browser tracks) the \textit{rtracklayer} package performs many useful tasks of import and export. @ \section{Representing genome-scale data with R} For most practical purposes, ``genomes" are big data, and even if we consider management of genomic sequence data to be well under control, management of the refinement of reference sequences and annotations at the species or individual level rapidly involves us in substantial problems of complexity and volume. The internet is a unifying principle, and some approaches to large-scale distributed computing (Amazon EC2, Google Compute) confer some unification on the approaches that will often be taken to deal with very large-scale genomic data processing. We adopt the R programming language for the representation and analysis of genome-scale data for many reasons. Some are listed here. \begin{itemize} \item Many statisticians implement new algorithms for inference in R for the purpose of testing, refining, illustrating, and distributing new methods. We want access to these methods for our genomic research. \item R is freely distributed and redistributable, and is maintained in a freely redistributable form on a number of important hardware and software platforms. \item Contributions to the R data analysis environment foster interoperability with other software tools including relational databases, tools for network visualization and inference, sequence management software, and high-performance computation, so it is seldom the case that use of R is an important restriction on data-analytic versatility. \end{itemize} In any event, we'll consider some very basic issues of representation here. \subsection{Introduction to S4 object-oriented programming} Object-oriented programming is a style of programming that emphasizes the formalized unification of heterogeneous data in ``objects'', defining relationships among classes of objects, and promoting software designs in which function behavior can vary according to the classes of objects on which the function is evaluated. In R's ``S4'' object-oriented methodology, we use \begin{itemize} \item \texttt{setClass( [classname], [representation], ...) } to define a class of objects and the classes of entities from which instances of the class are composed, \item \texttt{setGeneric( [genericName], [interface], ...) } to establish a generic function whose behavior will depend upon the classes of objects on which it will be evaluated, \item \texttt{setMethod( [methodName], [signature], ...) } to establish the method for a specific ordered combination of objects (the ``signature'') on which the generic function is to be evaluated. \end{itemize} For full details, see monographs by Chambers (Software for Data Analysis) and Gentleman (R Programming for Bioinformatics). \subsection{Genomic sequence: BSgenome} In this section we use the \textit{Biostrings} infrastructure to check the genomic location of a yeast microarray probe. First, we obtain the reference genomic sequence for sacCer2 version of yeast. <>= library(BSgenome.Scerevisiae.UCSC.sacCer2) class(Scerevisiae) Scerevisiae @ We focus attention on chrIV for now. <>= c4 = Scerevisiae$chrIV class(c4) c4 @ Now we obtain the probe and annotation data for the Affy yeast2 array. <>= library(yeast2.db) library(yeast2probe) yeast2probe[1:3,] @ We'll pick one probe set, and then the sequence of one probe. <>= ypick = yeast2probe[yeast2probe$Probe.Set.Name=="1769311_at",] dim(ypick) ypick[1:3,] a = "ATGAGCACTATGTTTTCTGTTGGAT" ra = reverseComplement(DNAString(a)) @ Now we use the simple lookup of Biostrings. <>= matchPattern(ra, c4) get("1769311_at", yeast2CHRLOC) get("1769311_at", yeast2CHRLOCEND) @ Exercise. Discuss how to identify probes harboring SNP in the affy u133plus2 array. \subsection{ExpressionSet and self-description} The \texttt{ExpressionSet} class unifies information on a microarray experiment. <>= library(ALL) data(ALL) getClass(class(ALL)) experimentData(ALL) @ A very high-level aspect of self-description is the facility for binding information on the publication of expression data to the data object itself. The \texttt{abstract} method gives even more detail. Fundamental operations on \texttt{ExpressionSet} instances are \begin{itemize} \item \texttt{[} : subsetting, so that \texttt{X[G, S]} restricts the object to genes or probes enumerated in \texttt{G} and samples enumerated in \texttt{S}, \item \texttt{exprs()}: return $G \times N$ matrix of expression values \item \texttt{pData()}: return $N \times R$ data.frame instance with sample-level data \end{itemize} To enumerate additional formal methods, try \begin{verbatim} showMethods(classes="eSet", inherited=FALSE, showEmpty=FALSE, where="package:Biobase") \end{verbatim} \subsection{SummarizedExperiment, VCF} Microarrays were used primarily with named probes. You would find differential intensity for \texttt{1007\_s\_at} and look that token up. With short read sequencing or very dense arrays, it is more relevant to have access to the genomic location of the read or probe for interpretation. The \texttt{SummarizedExperiment} class addresses this concern. A large-scale illustration of this class is in the \textit{dsQTL} package, but it is very large so I do not require that you download it. <>= library(dsQTL) data(DSQ_17) DSQ_17 rowData(DSQ_17)[1:3] @ You can use \verb+example("SummarizedExperiment-class")+ to get a working minimal example. We use range-based operations to subset features; column subscripting still selects samples. The \texttt{VCF} class extends \texttt{SummarizedExperiment} to manage information on variant calls on a number of samples. <>= library(VariantAnnotation) fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation") vcf <- readVcf(fl, genome="hg19") vcf @ \subsection{Management of information on HTS experiments} We'll get into details on this later. For now, consider what aspects of the structures you've encountered so far appear relevant to the structure and management of HTS data. \section{Visualization for genome-scale data} Base R graphics, \textit{grid}, and \textit{lattice} packages are a potent collection of resources for displaying data for statistical interpretation. In this lab, we will focus on two new approaches for data visualization in the context of genome-scale data analysis. We'll look briefly at the ``grammar of graphics'' (a basic abstraction attributed mainly to Leland Wilkinson), deployed for R in the \textit{ggplot2} package. We'll then focus on package \textit{ggbio}, which defines a grammar for genomic visualization. \subsection{Layered grammar of graphics} Briefly, layers of data graphics consist of \begin{itemize} \item data (variables and observations, in tabular form) and aesthetic mappings (which variable will be used as x, which as y, which to choose glyphs or colors) \item statistics (transformations of variables such as binnings, smooths, boxplot quantities) \item geometric objects (choice of points, lines, polygons) to communicate aspects of data \item position adjustments (jittering, dodging) \end{itemize} Layers are brought to view with selections of scales, coordinate systems, and facets that reflect groupings. In R, there is a strong connection between convenience of visualization or specification of desired specification and underlying data structure. Data reshaping is a high-level activity particularly when dealing with measurements over time. We'll contrast two approaches to visualizing expression trajectories in yeast colonies. \subsubsection{The yeast data} <>= chopstr = function(str, sep=" ", targlinel=60, do.cat=TRUE) { # # takes a prose string as returned by abstract() and cuts it # up into wordgroups of approximate character length targlinel # words = strsplit(str, sep)[[1]] numw = length(words) charcnt = cumsum(nchar(words)) allchr = max(charcnt) nlines = ceiling(allchr/targlinel) targs = seq(targlinel, allchr, by = targlinel) tobreak = sapply(1:length(targs), function(x) which.min(abs(targs[x]-charcnt))) starts = c(1, tobreak) ends = c(tobreak-1, numw) seqs = lapply(1:length(starts), function(i) seq(starts[i], ends[i])) lines = sapply(lapply(seqs, function(x) words[x]), paste, collapse=" ") if (do.cat) { cat(unlist(lines), sep="\n") return(invisible(NULL)) } lines } <>= library(yeastCC) data(spYCCES) chopstr(abstract(spYCCES)) @ \texttt{spYCCES} is an \texttt{ExpressionSet} instance. This is a container for various forms of information about a collection of gene expression studies comprising a family of microarrays. We have <>= spYCCES dim(spYCCES) dim(pData(spYCCES)) exprs(spYCCES)[1:5,1:5] table(spYCCES$syncmeth) @ Focus attention on samples synchronized with alpha pheromone. <>= alp = spYCCES[, spYCCES$syncmeth == "alpha" ] @ \subsubsection{A trajectory, a model, and an error} The following code computes a smoothing model for the relationship between YAL040C expression and time elapsed after alpha synchronization. <>= library(mgcv) m1 = gam(exprs(alp)["YAL040C", ] ~ s(time), data=pData(alp)) @ \begin{center} \setkeys{Gin}{width=0.65\textwidth} <>= par(mfrow=c(1,1)) plot(alp$time, exprs(alp)["YAL040C",]) xn = seq(min(alp$time), max(alp$time), len=100) lines(predict(m1, newdata=list(time=xn))) @ \end{center} Exercise. The preceding display is in error. Fix it. \subsection{The ggplot2 approach} There are two key phases to plotting with ggplot2. We initialize: \begin{verbatim} 'ggplot()' initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden. \end{verbatim} and then we specify how to render by building up layers of representations of information in the data. \subsubsection{A smoothed enhancement to a trajectory scatterplot} \setkeys{Gin}{width=0.65\textwidth} \begin{center} <>= library(ggplot2) df = data.frame(yal040c = exprs(alp)["YAL040C",], time=alp$time) p1 = ggplot(data=df, aes(y=yal040c, x=time)) p1 + geom_point() + stat_smooth() @ \end{center} Exercise. Alter the smoothing approach used. Consider \begin{verbatim} stat_smooth(method="lm", formula=y~cos(x/9.5)) \end{verbatim} and perhaps an approach based on B-splines. What are the relative merits of the various approaches? \subsubsection{Summarizing a collection of conditional distributions} We will filter the yeast expression data to 800 genes for whom Spellman and colleagues found evidence of cell cycle-related regulation. We'll impute the mean expression value of zero for missing values. We'll create a simple partition with hierarchical clustering, and display four of the clusters. <>= alpcc = alp[orf800,] exprs(alpcc)[is.na(exprs(alpcc))] = 0 hc2 = cutree(hclust(dist(exprs(alpcc))),h=2) table(hc2)[1:8] library(reshape2) clquad = function(inds) { kp = t(exprs(alpcc)[which(hc2 %in% inds),]) colnames(kp) = paste0("cl", hc2[hc2 %in% inds]) mex = melt(kp) mex$time = as.numeric(gsub(".*_", "", mex[,1])) colnames(mex)[2:3] = c("clust", "expr") mm = ggplot(data=mex, aes(y=expr, x=factor(time))) mm + geom_boxplot() + facet_grid(clust~.) } clquad(c(6,46,15,14)) @ To dig into one of the clusters, we have a slight variation. <>= clpts = function(ind) { if (length(ind)>1) stop("want scalar ind") kp = t(exprs(alpcc)[which(hc2 %in% ind),]) mex = melt(kp) mex$time = as.numeric(gsub(".*_", "", mex[,1])) colnames(mex)[2:3] = c("gene", "expr") mm = ggplot(data=mex, aes(y=expr, x=factor(time))) mm + geom_point(aes(colour=gene)) } clpts(6) @ Exercise. How can you modify the code above to generate a function that will produce the spaghetti plot of expression trajectories? Consider <>= clli = function(ind, enh=geom_point, ...) { if (length(ind)>1) stop("want scalar ind") kp = t(exprs(alpcc)[which(hc2 %in% ind),]) mex = melt(kp) mex$time = as.numeric(gsub(".*_", "", mex[,1])) colnames(mex)[2:3] = c("gene", "expr") mm = ggplot(data=mex, aes(y=expr, x=factor(time), group=gene, colour=gene)) mm + enh(...) } @ By front-loading more information on aesthetics in the ggplot construction, we can use simple substitutions for enh to get different renderings of the data. The "..." parameter in clli employs a facility for transmission of arbitrary extra bindings to one enclosed call. This allows us to use stat\_smooth as an effective binding for enh. Try it and see what you need to do to avoid a big grey smear. In summary, we have cut up and reshaped the yeast data in various ways to obtain various views of expression clusters and trajectories using the grammar of graphics. We will now turn to the specific application of this grammar to genomic data. \subsection{ggbio: grammar of graphics for genome biology} \subsubsection{Karyogram layout} A ``whole genome'' view of GWAS hits can be easily obtained. Linear view of chromosomes with dark hash where a disease-associated SNP has been reported. <>= library(gwascat) library(ggbio) <>= gwr = as(gwrngs, "GRanges") ap1 = autoplot(gwr, layout="karyogram") <>= ap1 @ Exercise. Explain the following variation. <>= catp = cut(gwr$Pvalue_mlog, c(5, 7, 9, 11, 600)) gwr$catp = catp autoplot(gwr, layout="karyogram", aes(color=catp, fill=catp)) @ \subsubsection{GWAS visualization, with gene models} Here we will stack the scores for GWAS for various phenotypes against nearby transcript models. \setkeys{Gin}{width=0.85\textwidth} <>= library(TxDb.Hsapiens.UCSC.hg19.knownGene) <>= txdb = TxDb.Hsapiens.UCSC.hg19.knownGene selr3 = GRanges("chr17", IRanges(3.8022e7, width=1e5)) ap4 = autoplot(txdb, which=selr3) mp = traitsManh(gwrngs, selr=selr3) <>= tracks(mp, ap4) @ With the new AnnotationHub package it will be easy to obtain functionally annotated regions for visualizing contexts of associations. \section{Visualization of structure in long linear sequences} The following display is Figure 2 of \cite{karch}. Chromosome 3L of fruitfly has been analyzed and labeled with respect to a system of chromatin states. Labels map to colors, locations along the genome map to the rectangle through the Hilbert Curve transformation described and implemented by Anders \cite{anders}. The visualization helps to understand localization and distribution of the labeled states. \setkeys{Gin}{width=0.85\textwidth} \includegraphics{karchHilb} We will use Bioconductor to create a similar visualization with a simpler labeling of a human chromosome. We will use the classic annotation system to get gene locations (Entrez gene start and end addresses) and use a simple color mapping to allow us get some feel for gene length and proximity along chromosome 17. \subsection{Acquiring feature addresses} First we get the gene addresses as a mildly annotated GRanges. We use the following function. You could use TxDb facilities or OrganismDbi, but with these, decisions need to be made about what specific addressing scheme corresponds to 'gene start and stop'. <>= getglocs = function(chrnum) { require(org.Hs.eg.db) require(GenomicRanges) ucsccn = paste0("chr", chrnum) on17 = get(as.character(chrnum), revmap(org.Hs.egCHR)) st17 = abs(oki <- na.omit(sapply(sv <- mget(on17, org.Hs.egCHRLOC), "[", 1))) en17 = abs(na.omit(sapply(mget(on17, org.Hs.egCHRLOCEND), "[", 1))) ans = GRanges(ucsccn, IRanges(st17,en17)) ans$entrezid = names(sv)[-attr(oki,"na.action")] sl = org.Hs.egCHRLENGTHS[as.character(chrnum)] bad = which(end(ans)>sl) if (length(bad)>0) { baddies = ans[bad] ans = ans[-bad] warning("org.Hs.eg queries yield entrez id with location beyond CHRLENGTH end") warning("check the output GRanges metadata for more info") metadata(ans)[["badranges"]] = baddies } names(sl) = ucsccn seqlengths(ans)[ucsccn] = sl sort(ans) } @ Here's where we get the GRanges instance: <>= gr17 = getglocs(17) gr17 @ \subsection{Rles facilitate coloring} We now define a function that will convert a GRanges to an Rle representing a map from addresses to \{0,1\}. The idea is that we want to score intervals for which ranges are defined with `1', and the complementary intervals on the chromosome are scored `0'. <>= gr2brle = function(gr) { # transform GRanges to binary Rle instance # regions covered by GRanges get value 1, all others zero # seqlengths must succeed sn1 = as.character(seqnames(gr)[1]) zingap = gaps(ranges(gr), start=1, end=seqlengths(gr)[sn1]) zingap = GRanges(sn1, zingap) grs = gr values(grs) = DataFrame(score=rep(1, length(grs))) zingap$score = 0 ful = sort(c(grs, zingap)) Rle(ful$score, width(ful)) } @ We will use this on a simple tripartition of our gene ranges. The idea is that we will be able to see gene boundaries by noting changes in color, and we will also have a way of visualizing occurrence of overlaps. We are ignoring strand here, the elaboration of the graph to include information on strand (through color) is certainly feasible as an exercise. <>= nr = length(gr17) s1 = seq(1, nr, by=3) s2 = seq(2, nr, by=3) s3 = seq(3, nr, by=3) g17a = gr2brle(gr17[s1]) g17b = gr2brle(gr17[s2]) g17c = gr2brle(gr17[s3]) @ This gives, for example, <>= g17a @ \subsection{HilbertVis maps the intervals to the rectangle} We transform Rles to arrays representing the map from line to rectangle. <>= require(HilbertVis) im17a = hilbertImage(g17a) im17b = hilbertImage(g17b) im17c = hilbertImage(g17c) dim(im17a) @ \subsection{Lattice levelplot performs the color mapping and visualization} Here we combine the tripartition of gene regions, transform the scores to $[0,1]$, and use a lattice function to visualize the mapping from gene `segments' to the rectangle. We have used a convenient list of colors to use and set the color for gene deserts to grey. <>= require(lattice) sim17 = (im17a + 2*im17b + 3*im17c)/6 cs = topo.colors(200) # an arbitrary palette of colors levelplot(sim17, useRaster=TRUE, col.regions=c("#4D4D4D",cs), xlab=" ", ylab=" ", scales=list(draw=FALSE), main="chr17 gene locations") @ \textbf{Exercises.} \begin{enumerate} \item How can you determine which deserted region is the centromere? \item What scores correspond to overlapping genes in the \texttt{sim17} object? Can you count the overlapping regions on the display? How can we count the number of gene regions that actually overlap, using the annotation information on gene addresses? \item What would we need to change in the code above to allow more focused visualization of the region around a given gene, say ORMDL3? \item Describe how to add information on transcription factor binding sites to this display. This use case can lead to an abstraction for building up a planar display of linear coincidences of genomic elements. Sketch it as a family of data structures and functions with general applicability. \end{enumerate} Bonus: A different color scheme, study after drinking some fine Italian wine! \vspace*{1cm} \includegraphics{geneSeg} \section{Session information} <>= sessionInfo() @ \bibliography{bvis} \end{document}