%\VignetteIndexEntry{Introduction to genome projects}
\documentclass[12pt]{article}
\usepackage{Sweave}
\usepackage{fullpage}
\usepackage{hyperref}

\newcommand{\R}{\textsf{R}}
\newcommand{\Rcmd}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{\texttt{#1}}

\title{Genome project tables in the genomes package}
\author{Chris Stubben}

\begin{document}
\maketitle

%% for cutting and pasting use continue =""
%% change margins on every chunk
<<>>=
library(genomes)
options(warn=-1, width=75, digits=2, scipen=3, "prompt" = "R> ", "continue" = " ")
options(SweaveHooks=list(fig=function() par(mar=c(5,4.2,1,1))))
@

The \pkg{genomes} package collects genome project metadata from NCBI
using E-utility scripts (esearch, esummary, efetch and elink) or from
the ENA using the ENA Browser REST URL. The package also includes
genome tables from NCBI and provides tools to summarize, compare and
plot the data in the \R~programming environment. Genome tables are a
defined class (\emph{genomes}) and each table is a data frame where
rows are genome projects and columns are the fields describing the
associated metadata. A number of methods are available that operate on
genome tables including \Rcmd{print}, \Rcmd{summary}, \Rcmd{plot} and
\Rcmd{update}.

There are a number of ways to install this package. If you are running
the most recent \R~version, you can use the \Rcmd{biocLite} command.

<<>>=
source("http://bioconductor.org/biocLite.R")
biocLite("genomes")
@

Since the format of online genome tables may change (and then
\Rcmd{update} commands may fail), I recommend downloading the
development version to pick up fixes made between the six-month
releases.

<<>>=
install.packages("genomes", repos="http://www.bioconductor.org/packages/devel/bioC", type="source")
@

Genome tables from the Genome database at NCBI include prokaryotic
(\Rcmd{proks}), eukaryotic (\Rcmd{euks}) and virus genomes
(\Rcmd{virus}).
The \Rcmd{print} method displays the first few rows and columns of the
table (either select fewer than seven rows or convert the object to a
\Rcmd{data.frame} to print all columns). The \Rcmd{summary} function
displays the download date, a count of projects by status, and a list
of recent submissions. The \Rcmd{plot} method displays a cumulative
plot of genomes by release date.

<<>>=
data(proks)
proks
summary(proks)
plot(proks, log='y', las=1)
@

Most importantly, the \Rcmd{update} method downloads the latest
version of the table from NCBI and displays a message listing the
number of project IDs added and removed (not run).

<<>>=
update(proks)
@

A number of additional functions assist in selecting, sorting and
grouping genomes. The \Rcmd{species} and \Rcmd{genus} functions can be
used to extract the species or genus from a scientific name. The
\Rcmd{table2} function formats and sorts a contingency table by
counts.

<<>>=
spp <- species(proks$name)
table2(spp)
@

The \Rcmd{month} and \Rcmd{year} functions can be used to extract the
month or year from the release date (Figure \ref{complete}).

<<>>=
complete <- subset(proks, status == "Complete")
x <- table(year(complete$released))
barplot(x, col="blue", ylim=c(0,max(x)*1.04), space=0.5, las=1, axis.lty=1, xlab="Year", ylab="Genomes per year")
box()
@

\begin{figure}[t]
\centering
\includegraphics[height=3in,width=5in]{genome-tables-complete.pdf}
\caption{Number of complete microbial genomes released each year at NCBI}
\label{complete}
\end{figure}

Because subsets of tables are often needed, the binary operator
\Rcmd{like} allows pattern matching using wildcards. The
\Rcmd{plotby} function can then be used to plot the release dates by
status using labeled points, in this case to identify complete and
draft sequences of \emph{Yersinia pestis} released before 2012
(Figure \ref{yersinia}).
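
As a rough illustration of the wildcard matching just described, the
chunk below approximates it with base \R~only. It is a sketch under
stated assumptions, not the package's implementation: it assumes the
wildcards are shell-style globs, which \Rcmd{glob2rx} converts to a
regular expression, and the vectors \Rcmd{nm} and \Rcmd{released} are
made-up values rather than rows from \Rcmd{proks}. The chunk that
follows it applies the actual \Rcmd{\%like\%} operator to the full
table.

<<>>=
## Hypothetical names and release dates, for illustration only
nm <- c("Yersinia pestis strain A", "Yersinia pestis strain B",
        "Escherichia coli strain C")
released <- as.Date(c("2001-10-04", "2013-06-01", "1997-09-05"))

## glob2rx() translates a wildcard pattern into a regular expression
rx <- glob2rx("Yersinia pestis*")

## Keep names that match the pattern and were released before 2012
keep <- grepl(rx, nm) & as.numeric(format(released, "%Y")) < 2012
nm[keep]
@

Combining the two conditions with \Rcmd{\&} mirrors the
\Rcmd{subset} call used on the real table.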
<<>>=
## Yersinia pestis
yp <- subset(proks, name %like% 'Yersinia pestis*' & year(released) < 2012)
plotby(yp, labels=TRUE, cex=.5, lbty='n', curdate=FALSE)
@

\begin{figure}[t]
\centering
\includegraphics[height=5in,width=5in]{genome-tables-yersinia.pdf}
\caption{Cumulative plot of \emph{Yersinia pestis} genomes by release date.}
\label{yersinia}
\end{figure}

Several functions have recently been added that allow \R~users to
query NCBI databases or the European Nucleotide Archive. These
functions will be described in a separate vignette.

\end{document}