Skip to content.

bioconductor.org

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.

Sections

Lab1.Rnw

% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Seattle Lab 1} %\VignetteDepends{Biobase} %\VignetteKeywords{Microarray} \documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\bibliographystyle{plainnat}

\begin{document}

\section*{An Introduction to Bioconductor}

In this laboratory we will introduce some of the basic interactions with Bioconductor.

<>=

library(Biobase) library(annotate) library(golubEsets)

@

The package \texttt{golubEsets} contains three data sets that were obtained from the web and slightly massaged. They represent the data analysed in \citet{Golub99} to perform class prediction using microarray data. The data were collected on Affymetrix Hu 6800 chip and which contains probes for 7129 genes.

An \texttt{exprSet} basically consists of the gene expression matrix (optionally a set of standard errors for those estimates), the related experimental metadata (who did what when and to what), and the phenotypic data. Here phenotype is interpreted quite broadly -- it represents any physical characteristics of the sample.

<>=

golubTrain

golubTrain[,1:10]

golubTrain[1:100,]

@

Notice that when subsetting we have arranged it so that the \textit{rows} correspond to genes and the \textit{columns} correspond to samples.

The phenotypic data are stored in a separate, but linked, data frame. You can obtain it and interact with it using specific methods.

<>=

pD <- phenoData(golubTrain)

pD

pd <- pData(pD)

pd

@

An object of class \texttt{phenoData} is a combination of a dataframe containing the various data elements and a list that explains what each variable represents. This information is usually relegated to a help page but we felt that it was important to keep it more closely associated with the data.

The \verb+$+ operator performs the job of extracting particular variables from an object of class \texttt{phenoData}. It also can be used directly on the \texttt{exprSet}.

<>=

table(pD$ALL.AML)

##different data table(golubTest$ALL.AML)

@

The S4 methods package has introduced substantial new capabilities into R. To obtain the manual pages for S4 classes you should use the following syntax \texttt{class?exprSet}. Please do that now and we will look at help page.

\section*{Annotation}

Now we will look at how to map from Affymetrix identifiers to the different biological meta data available from the Bioconductor Web site. Other data can be obtained from other sources and used in a similar fashion.

Because these data sources are very large, are constantly evolving and are similar across species we have adopted the strategy of distributing data in R packages. Each mapping is contained in an \textit{environment}. For our purposes an environment is simply a \textit{hash} table. It provides a very fast way of looking up values for specific text keys. This implementation is not the best and we hope to develop real hash tables for R but that will take some time.

Load the data and obtain the Affymetrix identifiers.

<>=

library(hu6800) ls(pos=2)

affynames <- ls(env=hu6800CHR) length(affynames)

hu6800() @

After loading the data we see that there are a number of different data sets that have been loaded. These provide the many different mappings from Affymetrix identifiers to other biological data, such as chromosomal location which is found in \texttt{hu6800CHR}.

We see that there are 7129 identifiers (which matches the \texttt{exprSets} we have.

Finally, quality control data and counts etc. of the mappings found is available by calling the function \texttt{hu6800()} and by examining the help page, \texttt{?hu6800}.

<>=

mygene <- affynames[4001] get(mygene, env = hu6800ACCNUM) get(mygene, env = hu6800LOCUSID) get(mygene, env = hu6800SYMBOL) get(mygene, env = hu6800GENENAME) get(mygene, env = hu6800SUMFUNC) get(mygene, env = hu6800UNIGENE) get(mygene, env = hu6800CHR) get(mygene, env = hu6800CHRLOC) get(mygene, env = hu6800CHRORI) get(mygene, env = hu6800MAP) get(mygene, env = hu6800PMID) get(mygene, env = hu6800GO)

multiget(affynames[1:10], env = hu6800PMID)

@ We have arbitrarily selected the probe set with Affymetrix identifier \verb+"U18237_at"+. And then obtained all the mappings for it. A variety of information about this gene can now easily be obtained.

In some cases we need to obtain data on several genes at once. To do that we wrote a special function called \texttt{multiget}. Notice that in some cases there are a number of PubMed abstracts associated with the different genes.

We can do some other interesting things as well. <>=

whChrom <- multiget(affynames, env=hu6800CHR)

table(unlist(whChrom))

vv<-sapply(whChrom, length) table(vv)

whChrom[vv==2] @

So we see the distribution of probe sets according to chromosome. Also note that there are 10 genes that have two chromosomes assigned. We might want to explore these further by examining resources at the NCBI.

\subsection*{Querying PubMed}

For more details please look at the short article in R News,

The National Library of Medicine (NLM) provides a great deal of information on biological data. We have only just started to explore these data. We need further functionality on MeSH and other data that are there; but for now we concentrate on PubMed.

A mapping has been done between LocusLink and PubMed. This associates specific articles with specific genes. These can be queried interactively using tools available in R and other systems. Full text abstracts are readily available, in some cases other full text articles are available (we just don't have the tools to make use of them). We next demonstrate how you could interact with these.

<>=

pmids<-get(mygene, env = hu6800PMID) pubmed(pmids, disp="browser")

# With pm.getabst absts2<-pm.getabst(mygene,"hu6800") pm.titles(absts2) sapply(absts2, function(x) pm.abstGrep("[Pp]rotein", x)) @

We obtain the abstracts and then search for the word \texttt{protein}.

Another source of data is GenBank. We can explore interactions with it in much the same way as with PubMed.

<>= g10<-affynames[1:10] gbacc<-multiget(g10,hu6800ACCNUM) genbank(gbacc,disp="browser") gb<-genbank(gbacc[1],disp="data") @

Finally we consider interactions with LocusLink. Here, we use local data (LocusLink does not provide a good interface for automatic querying). We generate HTML output. This is often an easy way to communicate your results with other researchers working on the same analysis.

<>= llid<-multiget(affynames[1:10],hu6800LOCUSID) map<-multiget(affynames[1:10],hu6800MAP) symb<-multiget(affynames[1:10],hu6800SYMBOL)

res<-data.frame(affynames[1:10],cbind(unlist(symb),unlist(map))) names(res)<-c("Affy ID","Gene symbol", "Chromosomal location") ll.htmlpage(llid, filename="LocusLink.html", title="HTML report", othernames=res, table.head=c("LocusID",names(res)), table.center = TRUE) @

\end{document}

News
2010-05-21

Advanced R Programming for Bioinformatics course material now available

2010-04-23

Bioconductor 2.6, consisting of 389 packages and designed to work with R version 2.11, was released today.