%\VignetteIndexEntry{Introduction to genome projects}
\documentclass[12pt]{article}
\usepackage{Sweave}
\usepackage{fullpage}
\usepackage{hyperref}

\newcommand{\R}{\textsf{R}}
\newcommand{\Rcmd}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{\texttt{#1}}

\title{Genome project tables in the genomes package}
\author{Chris Stubben}

\begin{document}

\maketitle

%% for cutting and pasting use continue =""
%% change margins on every chunk
<<>>=
library(genomes)
options(warn=-1, width=75, digits=2, scipen=3, "prompt" = "R> ", "continue" = " ")
options(SweaveHooks=list(fig=function() par(mar=c(5,4.2,1,1))))
@

The \pkg{genomes} package collects genome project metadata from NCBI using
E-utility scripts (esearch, esummary, efetch and elink) or the NCBI Genomes
FTP site. The package also includes tools to summarize, compare and plot the
data in the \R~programming environment.

Genome tables are a defined class (\emph{genomes}). Each table is a data
frame in which rows are genome projects and columns are the fields describing
the associated metadata. Several methods operate on genome tables, including
\Rcmd{print}, \Rcmd{summary}, \Rcmd{plot} and \Rcmd{update}.

Genome tables from the Genomes FTP site at NCBI include prokaryotic
(\Rcmd{proks}), eukaryotic (\Rcmd{euks}) and virus genomes (\Rcmd{virus}).
The \Rcmd{print} method displays the first few rows and columns of the table
(to print all columns, either select fewer than seven rows or convert the
object to a \Rcmd{data.frame}). The \Rcmd{summary} method displays the
download date, a count of projects by status, and a list of recent
submissions. The \Rcmd{plot} method displays a cumulative plot of genomes by
release date.

<<>>=
data(proks)
proks
summary(proks)
plot(proks, log='y', las=1)
@

Most importantly, the \Rcmd{update} method downloads the latest version of
the table from NCBI and displays a message listing the number of project IDs
added and removed (not run).
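The added and removed counts can be read as set differences between the
project IDs in the stored table and those in a fresh download. The chunk
below is a minimal sketch in base \R~under that assumption. The vectors
\Rcmd{old.ids} and \Rcmd{new.ids} are hypothetical values made up for
illustration, and the code is not necessarily how \Rcmd{update} is
implemented in the package.

<<>>=
## hypothetical project IDs, for illustration only
old.ids <- c(101, 102, 103, 104)
new.ids <- c(102, 103, 104, 105, 106)
## IDs found only in the new download count as added
added <- setdiff(new.ids, old.ids)
## IDs found only in the stored table count as removed
removed <- setdiff(old.ids, new.ids)
message(length(added), " project IDs added, ", length(removed), " removed")
@

The call on the packaged table itself is a single line: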
<<eval=FALSE>>=
update(proks)
@

A number of additional functions assist in selecting, sorting and grouping
genomes. The \Rcmd{species} and \Rcmd{genus} functions extract the species or
genus from a scientific name. The \Rcmd{month} and \Rcmd{year} functions
extract the month or year from the release date. The \Rcmd{table2} function
formats a contingency table and sorts it by counts.

<<>>=
spp <- species(proks$name)
table2(spp)
@

Because subsets of tables are often needed, the binary operator
\Rcmd{\%like\%} allows pattern matching using wildcards. The \Rcmd{plotby}
function can then be used to plot the release dates by status using labeled
points, in this case to identify complete and draft sequences of
\emph{Yersinia pestis} released before 2012 (Figure \ref{yersinia}).

<<yersinia, fig=TRUE, include=FALSE>>=
## Yersinia pestis
yp <- subset(proks, name %like% 'Yersinia pestis*' & year(released) < 2012)
plotby(yp, labels=TRUE, cex=.5, lbty='n', curdate=FALSE)
@

\begin{figure}[t]
\centering
\includegraphics[height=5in,width=5in]{genome-tables-yersinia.pdf}
\caption{Cumulative plot of \emph{Yersinia pestis} genomes released before 2012.}
\label{yersinia}
\end{figure}

\end{document}