%\VignetteIndexEntry{Introduction to genome projects}
\documentclass[12pt]{article}
\usepackage{Sweave}
\usepackage{fullpage}
\usepackage{hyperref}

\newcommand{\R}{\textsf{R}}
\newcommand{\Rcmd}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{\texttt{#1}}

\title{Genome project tables in the genomes package}
\author{Chris Stubben}

\begin{document}
\maketitle

%% for cutting and pasting use continue =""
%% change margins on every chunk
<<>>=
library(genomes)
options(warn=-1, width=75, digits=2, scipen=3, "prompt" = "R> ", "continue" = " ")
options(SweaveHooks=list(fig=function() par(mar=c(5,4.2,1,1))))
@

The number of genome sequencing projects submitted to public sequence
databases is growing rapidly. In addition to the raw sequence data, the
amount of associated metadata describing each project is also increasing.
The \pkg{genomes} package collects genome project metadata from NCBI
(\url{http://www.ncbi.nlm.nih.gov}) and provides tools to summarize,
compare and plot the data in the \R~programming environment.

Genome tables are a defined class (\emph{genomes}), and each table is a
data frame in which rows are genome projects and columns are the fields
describing the associated metadata. At a minimum, each table should have
columns listing the project name, status, and release date. Several
methods operate on genome tables, including \Rcmd{print}, \Rcmd{summary},
\Rcmd{plot} and \Rcmd{update}.

There are a number of ways to install this package. If you are running
the most recent \R~version, you can use the \Rcmd{biocLite} command.

<<>>=
source("http://bioconductor.org/biocLite.R")
biocLite("genomes")
@

Since the format of the online genome tables may change (and
\Rcmd{update} commands may then fail), I recommend downloading the
development version to get fixes between the six-month releases.
<<>>=
install.packages("genomes", repos="http://www.bioconductor.org/packages/devel/bioC")
@

Genome tables from the Genome Project database at NCBI include
prokaryotic projects (\Rcmd{lproks}), eukaryotic projects
(\Rcmd{leuks}), metagenomes (\Rcmd{lenvs}) and viruses (\Rcmd{virus}).
The \Rcmd{print} method displays the first few rows and columns of the
table (either select fewer than seven rows or convert the object to a
\Rcmd{data.frame} to print all columns). The \Rcmd{summary} function
displays the download date, a count of projects by status, and a list of
recent submissions. The \Rcmd{plot} method displays a cumulative plot of
genomes by release date (Figure \ref{lproks}; use \Rcmd{lines} to add
additional tables).

<<>>=
data(lproks)
lproks
summary(lproks)
plot(lproks, log='y', las=1)
data(leuks)
data(lenvs)
lines(leuks, col="red")
lines(lenvs, col="green3")
legend("topleft", c("Microbes", "Eukaryotes", "Metagenomes"),
  lty=1, bty='n', col=c("blue", "red", "green3"))
@

\begin{figure}[t]
\centering
\includegraphics[height=3in,width=3in]{genome-tables-lproks.pdf}
\caption{Cumulative plot of genome projects by release date at NCBI.}
\label{lproks}
\end{figure}

Most importantly, the \Rcmd{update} method downloads the latest version
of the table from NCBI and displays a message listing the number of
project IDs added and removed (not run).

<<>>=
update(lproks)
@

Several additional functions assist in selecting, sorting and grouping
genomes. The \Rcmd{species} and \Rcmd{genus} functions extract the
species or genus from a scientific name. The \Rcmd{table2} function
formats and sorts a contingency table by counts.

<<>>=
spp <- species(lproks$name)
table2(spp)
@

The \Rcmd{month} and \Rcmd{year} functions extract the month or year
from the release date (Figure \ref{complete}).
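
As a rough illustration of what extracting a year or month from a
release date involves, the following base \R~sketch uses \Rcmd{format}
on a small vector of dates. The dates are hypothetical placeholders, the
variable names are invented for this example, and the code assumes that
release dates are stored as class \Rcmd{Date}; it is not the package's
implementation of \Rcmd{year} or \Rcmd{month}.

<<>>=
## Hypothetical release dates for illustration only (not taken from lproks)
released <- as.Date(c("1995-07-28", "2001-10-03", "2001-12-15"))
## Base R: format each date as a four-digit year or a two-digit month
yr <- as.numeric(format(released, "%Y"))
mo <- as.numeric(format(released, "%m"))
## Count the dates falling in each year
table(yr)
@

Tabulating the extracted years in this way gives the kind of per-year
counts that a bar plot can display.
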
<<>>=
complete <- subset(lproks, status == "Complete")
x <- table(year(complete$released))
barplot(x, col="blue", ylim=c(0,max(x)*1.04), space=0.5, las=1,
  axis.lty=1, xlab="Year", ylab="Genomes per year")
box()
@

\begin{figure}[t]
\centering
\includegraphics[height=3in,width=5in]{genome-tables-complete.pdf}
\caption{Number of complete microbial genomes released each year at NCBI.}
\label{complete}
\end{figure}

Because subsets of tables are often needed, the binary operator
\Rcmd{\%like\%} allows pattern matching using wildcards. The
\Rcmd{plotby} function can then be used to plot the release dates by
status using labeled points, in this case to identify complete and draft
sequences of \emph{Yersinia pestis} (Figure \ref{yersinia}).

<<>>=
## Yersinia pestis
yp <- subset(lproks, name %like% 'Yersinia pestis*')
plotby(yp, labels=TRUE, cex=.5, lbty='n')
@

\begin{figure}[t]
\centering
\includegraphics[height=3in,width=3in]{genome-tables-yersinia.pdf}
\caption{Cumulative plot of \emph{Yersinia pestis} genomes by release date.}
\label{yersinia}
\end{figure}

Several functions have recently been added that allow \R~users to run
Entrez queries. For example, users can retrieve genome summaries or
neighbors using a valid Entrez search query, list the taxonomy names
matching taxonomy IDs, find the publication dates of PubMed IDs, or
return the release dates for given accession numbers. Full details about
these functions and many others can be found in the \pkg{genomes} help
pages.

\end{document}