%\VignetteIndexEntry{How to forge a BSgenome data package} %\VignetteKeywords{Genome, BSgenome, DNA, Sequence, UCSC, BSgenome data package} %\VignettePackage{BSgenome} % % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \SweaveOpts{keep.source=TRUE} \documentclass[10pt]{article} %\usepackage{amsmath} %\usepackage[authoryear,round]{natbib} % % NOTE -- There is an obscure issue with the use of \url from the hyperref % package that will trigger a MiKTeX/pdflatex error: % ! pdfTeX error (ext4): \pdfendlink ended up in different nesting level than \pd % fstartlink. % \AtBegShi@Output ...ipout \box \AtBeginShipoutBox % \fi \fi % l.96 \end{document} % % ! ==> Fatal error occurred, no output PDF file produced! % Transcript written on BSgenomeForge1.log. % The error is hard to reproduce. I've observed it on the r34270 version of this % vignette and with the following version of the MiKTeX/pdflatex command: % MiKTeX-pdfTeX 2.7.3147 (1.40.9) (MiKTeX 2.7) \usepackage{hyperref} \usepackage{underscore} \textwidth=6.5in \textheight=8.5in \parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\R}{\textsf{R}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\term}[1]{\emph{#1}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rmethod}[1]{\textit{#1}} \newcommand{\Rfunarg}[1]{\textit{#1}} \bibliographystyle{plainnat} \begin{document} \title{How to forge a BSgenome data package} \author{Herv\'e Pag\`es \\ Gentleman Lab \\ Fred Hutchinson Cancer Research Center \\ Seattle, WA} \date{\today} \maketitle \tableofcontents \section{Introduction} This document describes the process of forging a \term{BSgenome data package}. It is intended for Bioconductor users who want to make a new \term{BSgenome data package}, not for regular users of these packages. Start {\R} (make sure you are using the latest release version), load the \Rpackage{BSgenome} package, and use the \Rfunction{available.genomes} function to get the list of \term{BSgenome data packages} available in the current release version of Bioconductor. So you confirm that none of those genomes suits your needs? And you want to make your own package? If your answer is yes to these 2 questions, then you've come to the right place. Requirements: \begin{itemize} \item Some basic knowledge of the Unix/Linux command line is required. The commands that you will most likely need are: \code{cd}, \code{mkdir}, \code{mv}, \code{rmdir}, \code{tar}, \code{gunzip}, \code{unzip}, \code{ftp} and \code{wget}. Also you will need to create and edit some text files. \item You need access to a good Unix/Linux build machine with a decent amount of RAM (>= 4GB), especially if your genome is big. For smaller genomes, 2GB or even 1GB of RAM might be enough. \item You need the latest release versions of {\R} plus the \Rpackage{Biostrings} and \Rpackage{BSgenome} packages installed on the build machine. To check your installation, start {\R} and try to load the \Rpackage{BSgenome} package. \item Finally, you need to obtain the \term{source data files} of the genome that you want to build a package for. There are 2 groups of \term{source data files}: (1) the files containing the sequence data (those files are required), and (2) the files containing the mask data (those files are optional). For most organisms, these files have been made publicly available on the internet by genome providers like UCSC, NCBI, FlyBase, TAIR, etc. The next section of this document explains how to obtain and prepare these files. \end{itemize} Refer to the \textit{R Installation and Administration} manual \footnote{\url{http://cran.r-project.org/doc/manuals/R-admin.html}} if you need to install {\R} or upgrade your {\R} version, and to the \textit{Installation Instructions} page \footnote{\url{http://bioconductor.org/install/}} on the Bioconductor website if you need to install or update the \Rpackage{Biostrings} or \Rpackage{BSgenome} packages. Questions, comments or bug reports about this document or about the BSgenomeForge functions are welcome. Please address them to the author (\code{hpages@fhcrc.org}) or post them on The Bioconductor Project Mailing List (\code{bioconductor@stat.math.ethz.ch}). Don't forget to visit the Bioconductor website \footnote{\url{http://bioconductor.org/}} and subscribe to this list if you've not already done so. In this document, we call \term{target package} the \term{BSgenome data package} package that we want to forge. \section{Obtain and prepare the source data files} As mentioned earlier, there are 2 groups of \term{source data files}: (1) the files containing the sequence data (required), and (2) the files containing the mask data (optional). \subsection{Sequence data (group 1)} Group 1 must be FASTA files. You need 1 file per sequence that you want to put in the \term{target package}. The name of each file must be of the form \textit{}\textit{}\textit{} where \textit{} is the name of the sequence in it and \textit{} and \textit{} are a prefix and a suffix (eventually empty) that are the same for all the files. For example the FASTA files available at UCSC in the ``Data set by chromosome'' section for Stickleback (\url{http://hgdownload.cse.ucsc.edu/goldenPath/gasAcu1/chromosomes/}) could already be considered to have names of this form (\texttt{chrI.fa.gz}, \texttt{chrII.fa.gz}, \texttt{chrIII.fa.gz}, ..., \texttt{chrXXI.fa.gz}, \texttt{chrM.fa.gz} and \texttt{chrUn.fa.gz}). However, because the files will need to be uncompressed after download and before they can be used by the forging process, the suffix will need to be set to \texttt{.fa}, not \texttt{.fa.gz}. Also the prefix here is empty, not \texttt{chr}, because \texttt{chr} is considered to be part of the sequence names (a chromosome naming convention commonly used at UCSC). Note that, alternatively, you can download and extract the big \texttt{chromFa.tar.gz} tarball located in the ``Full data set'' section (aka the \textit{bigZips} folder) for Stickleback (\url{http://hgdownload.cse.ucsc.edu/goldenPath/gasAcu1/bigZips/}): it should contain the same files as the ``Data set by chromosome'' folder. You can use the \Rfunction{fasta.info} function from the \Rpackage{Biostrings} package to see what's in a FASTA file: <<>>= library(Biostrings) file <- system.file("extdata", "ce2chrM.fa", package="BSgenome") fasta.info(file) @ \subsection{Mask data (group 2)} The mask data are not available for all organisms. What you download exactly depends of course on what's available and also on what built-in masks you want to have in the \term{target package}. 4 kinds of built-in masks are currently supported by BSgenomeForge: \begin{itemize} \item the masks of assembly gaps, aka ``the AGAPS masks''; \item the masks of intra-contig ambiguities, aka ``the AMB masks''; \item the masks of repeat regions that were determined by the RepeatMasker software, aka ``the RM masks''; \item the masks of repeat regions that were determined by the Tandem Repeats Finder software (where only repeats with period less than or equal to 12 were kept), aka ``the TRF masks''. \end{itemize} For the AGAPS masks, you need UCSC ``gap'' or NCBI ``agp'' files. It can be one file per chromosome or a single big file containing the assembly gap information for all the chromosomes together. In the former case, the name of each file must be of the form \textit{}\textit{}\textit{}. Like for the FASTA files in group 1, \textit{} must be the name of the sequence (sequence names for FASTA files and AGAPS masks must match) and \textit{} and \textit{} must be a prefix and a suffix (eventually empty) that are the same for all the files (this prefix/suffix doesn't need to be, and typically is not, the same as for the FASTA files in group 1). You don't need any file for the AMB masks. For the RM masks, you need RepeatMasker \texttt{.out} files. Like for the AGAPS masks, it can be one file per chromosome or a single big file containing the RepeatMasker information for all the chromosomes together. In the former case, the name of each file must also be of the form \textit{}\textit{}\textit{}. Same comments apply as for the AGAPS masks above. For the TRF masks, you need Tandem Repeats Finder \texttt{.bed} files. Again, it can be one file per chromosome or a single big file. In the former case, the name of each file must also be of the form \textit{}\textit{}\textit{}). Same comments apply as for the AGAPS masks above. Again, for some organisms none of the masks above are available or only some of them are. \subsection{An example} Here is how the \term{source data files} for the \Rpackage{BSgenome.Rnorvegicus.UCSC.rn4} package were obtained and prepared: \begin{itemize} \item Group 1: \begin{itemize} \item Single sequences: file \texttt{chromFa.tar.gz} was downloaded from the UCSC \textit{bigZips} folder \footnote{\url{http://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/}} for \code{rn4} and extracted with: \begin{verbatim} tar zxf chromFa.tar.gz \end{verbatim} \item Multiple sequences: files \texttt{upstream1000.fa.gz}, \texttt{upstream2000.fa.gz} and \texttt{upstream5000.fa.gz} were downloaded from the same \textit{bigZips} folder and uncompressed with: \begin{verbatim} for file in upstream*.fa.gz; do gunzip $file ; done \end{verbatim} \end{itemize} \item Group 2: \begin{itemize} \item AGAPS masks: all the \texttt{chr*\_gap.txt.gz} files (UCSC ``gap'' files) were downloaded from the UCSC \textit{database} folder \footnote{\url{http://hgdownload.cse.ucsc.edu/goldenPath/rn4/database/}} for \code{rn4}. This was done with the standard Unix/Linux \code{ftp} command: \begin{verbatim} ftp hgdownload.cse.ucsc.edu # login as "anonymous" cd goldenPath/rn4/database prompt mget chr*_gap.txt.gz \end{verbatim} Then all the downloaded files were uncompressed with: \begin{verbatim} for file in chr*_gap.txt.gz; do gunzip $file ; done \end{verbatim} \item RM masks: file \texttt{chromOut.tar.gz} was downloaded from the UCSC \textit{bigZips} folder and extracted with: \begin{verbatim} tar zxf chromOut.tar.gz \end{verbatim} \item TRF masks: file \texttt{chromTrf.tar.gz} was downloaded from the UCSC \textit{bigZips} folder and extracted with: \begin{verbatim} tar zxf chromTrf.tar.gz \end{verbatim} \end{itemize} \end{itemize} \subsection{The \textit{} and \textit{} folders} From now we assume that you've downloaded (checking the md5sums is always a good idea), extracted, and eventually renamed all the \term{source data files}, and that they are located in the \textit{} folder for group 1 and in the \textit{} folder for group 2. Note that all the \term{source data files} should be located directly in the \textit{} and \textit{} folders, not in subfolders of these folders. For example, depending on the genome, UCSC provides either a big \texttt{chromFa.tar.gz} or \texttt{chromFa.zip} file that contains the sequence data for all the chromosomes. But it could be that, after extraction of this big file, the individual FASTA files for each chromosome end up being located one level down the \textit{} folder (granted that you were in this folder when you extracted the file). If this is the case, then you will need to move them one level up (use \code{mv -i */*.fa .} for this, then remove all the empty subfolders with \code{rmdir *}). \section{Prepare the BSgenome data package seed file} \subsection{Overview} The \term{BSgenome data package seed file} will contain all the information needed by the \Rfunction{forgeBSgenomeDataPkg} function to forge the \term{target package}. The format of this file is DCF (Debian Control File), which is also the format used for the \texttt{DESCRIPTION} file of any {\R} package. The valid fields of a \term{seed file} are divided in 3 categories: \begin{enumerate} \item Standard \texttt{DESCRIPTION} fields. These fields are actually the mandatory fields found in any \texttt{DESCRIPTION} file. They will be copied to the \texttt{DESCRIPTION} file of the \term{target package}. \item Non-standard \texttt{DESCRIPTION} fields. These fields are specific to \term{seed files} and they will also be copied to the \texttt{DESCRIPTION} file of the \term{target package}. In addition, the values of those fields will be stored in the \Rclass{BSgenome} object that will be contained in the \term{target package}. This means that the users of the \term{target package} will be able to retrieve these values via the accessor methods defined for \Rclass{BSgenome} objects. See the man page for the \Rclass{BSgenome} class (\code{{?}`BSgenome-class`}) for a description of these methods. \item Additional fields that don't fall in the 2 first categories. \end{enumerate} The 3 following subsections give an extensive descriptions of all the valid fields of a \term{seed file}. Alternatively, the reader in a hurry can go directly to the last subsection of this section for an example of \term{seed file}. \subsection{Standard \texttt{DESCRIPTION} fields} \begin{itemize} \item \code{Package}: Name to give to the \term{target package}. The convention used for the packages built by the Bioconductor project is to use names made of 4 parts separated by a dot. Part 1 is always \code{BSgenome}. Part 2 is the abbreviated name of the organism (when the name of the organism is made of 2 words, we put together the first letter of the first word in upper case followed by the entire second word in lower case e.g. \code{Rnorvegicus}). Part 3 is the name of the organisation who provided the genome (e.g. \code{UCSC}). Part 4 is the release string or number used by this organisation to identify this version of the genome (e.g. \code{rn4}). \item \code{Title}: The title of the package. E.g. \code{Rattus norvegicus full genome (UCSC version rn4)}. \item \code{Description}, \code{Version}, \code{Author}, \code{Maintainer}, \code{License}: Like the 2 previous fields, these are mandatory fields found in any \texttt{DESCRIPTION} file. Please refer to the \textit{The DESCRIPTION file} section of the \textit{Writing R Extensions} manual \footnote{\url{http://cran.r-project.org/doc/manuals/R-exts.html\#The-DESCRIPTION-file}} for more information about these fields. If you plan to distribute the package that you are going to forge, please pickup the license carefully and make sure that it is compatible with the license of the \term{source data files} if any. \end{itemize} \subsection{Non-standard \texttt{DESCRIPTION} fields} \begin{itemize} \item \code{organism}: The full name of the organism (e.g. \code{Rattus norvegicus}). \item \code{species}: The name of the species. For the packages built by the Bioconductor project from a UCSC genome, this field corresponds to the \code{SPECIES} column of the \textit{List of UCSC genome releases} table \footnote{\url{http://genome.ucsc.edu/FAQ/FAQreleases\#release1}}. \item \code{provider}: The provider of the \term{source data files} e.g. \code{UCSC}, \code{NCBI}, \code{BDGP}, \code{FlyBase}, etc... Should preferably match part 3 of the package name (field \code{Package}). \item \code{provider\_version}: The provider-side version of the genome. Should preferably match part 4 of the package name (field \code{Package}). For the packages built by the Bioconductor project from a UCSC genome, this field corresponds to the \code{UCSC VERSION} field of the \textit{List of UCSC genome releases} table. \item \code{release\_date}: When this assembly of the genome was released. For the packages built by the Bioconductor project from a UCSC genome, this field corresponds to the \code{RELEASE DATE} field of the \textit{List of UCSC genome releases} table. \item \code{release\_name}: The release name or build number of this assembly of the genome. For the packages built by the Bioconductor project from a UCSC genome, this field corresponds to the \code{RELEASE NAME} field of the \textit{List of UCSC genome releases} table. \item \code{source\_url}: The permanent URL where the \term{source data files} used to forge this package can be found. \item \code{organism\_biocview}: The official biocViews term for this organism. This is generally the same as the \code{organism} field except that spaces should be replaced by underscores. The value of this field matters only if the \term{target package} is going to be added to a Bioconductor repository since it will determine under which subview of the \textit{Bioconductor Task View} for \textit{Organism} \footnote{\url{http://bioconductor.org/packages/release/Organism.html}} the package will appear. Note that this is the only field in this category that won't be stored in the \Rclass{BSgenome} object that will be contained in the \term{target package}. \end{itemize} \subsection{Other fields} \begin{itemize} \item \code{BSgenomeObjname}: Should match part 2 of the package name (field \code{Package}). \item \code{seqnames}: An {\R} expression returning the names of the single sequences to forge (in a character vector). E.g. \code{paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"), "\_random", sep="")), sep="")}. \item \code{circ\_seqs}: [OPTIONAL] An {\R} expression returning the names of the circular sequences (in a character vector). This must be a subset of the \code{seqnames} specified above. E.g. \code{"chrM"} for rn4 or \code{c("chrM", "2micron")} for the sacCer2 genome (Yeast) from UCSC. The default value for \code{circ\_seqs} is \code{NULL} (no circular sequence). \item \code{mseqnames}: [OPTIONAL] An {\R} expression returning the names of the multiple sequences to forge (in a character vector). E.g. \code{paste("upstream", c("1000", "2000", "5000"), sep="")}. The default value for \code{mseqnames} is \code{NULL} (no multiple sequence). \item \code{nmask\_per\_seq}: [OPTIONAL] The number of masks per sequence (0 to 4). The default value for \code{nmask\_per\_seq} is \code{0} (no mask). \item \code{PkgDetails}: [OPTIONAL] Some arbitrary text that will be copied to the \code{Details} section of the man page of the \term{target package}. \item \code{SrcDataFiles1}, \code{SrcDataFiles2}: [OPTIONAL] Some arbitrary text that will be copied to the \code{Note} section of the man pages of the \term{target package}. \code{SrcDataFiles1} should describe briefly where the \term{source data files} for the sequences are coming from. \code{SrcDataFiles2} should do the same for the masks. Permanent URLs are a must. \item \code{PkgExamples}: [OPTIONAL] Some {\R} code (eventually with comments) that will be added to the \code{Examples} section of the man page of the \term{target package}. \item \code{seqs\_srcdir}, \code{masks\_srcdir}: The path to the \textit{} and \textit{} folders, respectively. \item \code{seqfiles\_prefix}, \code{seqfiles\_suffix}: [OPTIONAL] The common prefix and suffix that need to be added to all the sequence names (fields \code{seqnames} and \code{mseqnames}) to get the name of the corresponding FASTA file. Default values are the empty prefix for \code{seqfiles\_prefix} and \code{.fa} for \code{seqfiles\_suffix}. \item \code{AGAPSfiles\_type}: [OPTIONAL] Must be \code{gap} (the default) if the \term{source data files} containing the AGAPS masks information are UCSC ``gap'' files, or \code{agp} if they are NCBI ``agp'' files. \item \code{AGAPSfiles\_name}: [OPTIONAL] Omit this field if you have one \term{source data file} per single sequence for the AGAPS masks and use the \code{AGAPSfiles\_prefix} and \code{AGAPSfiles\_suffix} fields below instead. Otherwise, use this field to specify the name of the single big file. \item \code{AGAPSfiles\_prefix}, \code{AGAPSfiles\_suffix}: [OPTIONAL] Omit these fields if you have one single big \term{source data file} for all the AGAPS masks and use the \code{AGAPSfiles\_name} field above instead. Otherwise, use these fields to specify the common prefix and suffix that need to be added to all the single sequence names (field \code{seqnames}) to get the name of the file that contains the corresponding AGAPS mask information. Default values are the empty prefix for \code{AGAPSfiles\_prefix} and \code{\_gap.txt} for \code{AGAPSfiles\_suffix}. \item \code{RMfiles\_name}, \code{RMfiles\_prefix}, \code{RMfiles\_suffix}: [OPTIONAL] Those fields work like the \code{AGAPSfiles*} fields above but for the RM masks. Default values are the empty prefix for \code{RMfiles\_prefix} and \texttt{.fa.out} for \code{RMfiles\_suffix}. \item \code{TRFfiles\_name}, \code{TRFfiles\_prefix}, \code{TRFfiles\_suffix}: [OPTIONAL] Those fields work like the \code{AGAPSfiles*} fields above but for the TRF masks. Default values are the empty prefix for \code{TRFfiles\_prefix} and \texttt{.bed} for \code{TRFfiles\_suffix}. \end{itemize} \subsection{An example} The \term{seed files} used for the packages forged by the Bioconductor project are included in the \Rpackage{BSgenome} package: <<>>= library(BSgenome) seed_files <- system.file("extdata", "GentlemanLab", package="BSgenome") list.files(seed_files, pattern="-seed$") rn4_seed <- list.files(seed_files, pattern="rn4", full.names=TRUE) cat(readLines(rn4_seed), sep="\n") @ From now we assume that you have managed to prepare the \term{seed file} for your package. \section{Forge the \term{target package}} To forge the package, start \R, load the \Rpackage{BSgenome} package, and call the \Rfunction{forgeBSgenomeDataPkg} function on your \term{seed file}. For example, if the path to your \term{seed file} is \texttt{"path/to/my/seed"}, do: <>= library(BSgenome) forgeBSgenomeDataPkg("path/to/my/seed") @ Depending on the size of the genome and your hardware, this can take between 2 minutes and 1 or 2 hours. By default \Rfunction{forgeBSgenomeDataPkg} will create the source tree of the \term{target package} in the current directory. Once \Rfunction{forgeBSgenomeDataPkg} is done, ignore the warnings (if any), quit \R, and build the source package (tarball) with \begin{verbatim} R CMD build \end{verbatim} where \code{} is the path to the source tree of the package. Then check the package with \begin{verbatim} R CMD check \end{verbatim} where \code{} is the path to the tarball produced by \code{R CMD build}. Finally install the package with \begin{verbatim} R CMD INSTALL \end{verbatim} and use it! \section{Session information} The output in this vignette was produced under the following conditions: <<>>= sessionInfo() @ \end{document}