%\VignetteIndexEntry{How to forge a BSgenome data package}
%\VignetteKeywords{Genome, BSgenome, DNA, Sequence, UCSC, BSgenome data package}
%\VignettePackage{BSgenome}

%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\SweaveOpts{keep.source=TRUE}

\documentclass[10pt]{article}

%\usepackage{amsmath}
%\usepackage[authoryear,round]{natbib}

%
% NOTE -- There is an obscure issue with the use of \url from the hyperref
% package that will trigger a MiKTeX/pdflatex error:
%   ! pdfTeX error (ext4): \pdfendlink ended up in different nesting level than \pd
%   fstartlink.
%   \AtBegShi@Output ...ipout \box \AtBeginShipoutBox
%                                                     \fi \fi
%   l.96 \end{document}
%
%   !  ==> Fatal error occurred, no output PDF file produced!
%   Transcript written on BSgenomeForge1.log.
% The error is hard to reproduce. I've observed it on the r34270 version of this
% vignette and with the following version of the MiKTeX/pdflatex command:
%   MiKTeX-pdfTeX 2.7.3147 (1.40.9) (MiKTeX 2.7)
\usepackage{hyperref}

\usepackage{underscore}

\textwidth=6.5in
\textheight=8.5in
\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\R}{\textsf{R}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\term}[1]{\emph{#1}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rmethod}[1]{\textit{#1}}
\newcommand{\Rfunarg}[1]{\textit{#1}}

\bibliographystyle{plainnat}

\begin{document}

\title{How to forge a BSgenome data package}

\author{Herv\'e Pag\`es \\
  Gentleman Lab \\
  Fred Hutchinson Cancer Research Center \\
  Seattle, WA}
\date{\today}
\maketitle

\tableofcontents


\section{Introduction}

This document describes the process of forging a \term{BSgenome data package}.
It is intended for Bioconductor users who want to make a new \term{BSgenome
data package}, not for regular users of these packages.

Start {\R} (make sure you are using the latest release version), load the
\Rpackage{BSgenome} package, and use the \Rfunction{available.genomes}
function to get the list of \term{BSgenome data packages} available
in the current release version of Bioconductor.
So you confirm that none of those genomes suits your needs? And you want to
make your own package? If your answer is yes to these 2 questions, then you've
come to the right place.

Requirements:
\begin{itemize}
\item Some basic knowledge of the Unix/Linux command line is required. The
      commands that you will most likely need are: \code{cd}, \code{mkdir},
      \code{mv}, \code{rmdir}, \code{tar}, \code{gunzip}, \code{unzip},
      \code{ftp} and \code{wget}. Also you will need to create and edit some
      text files.
\item You need access to a good Unix/Linux build machine with a decent amount
      of RAM (>= 4GB), especially if your genome is big. For smaller genomes,
      2GB or even 1GB of RAM might be enough.
\item You need the latest release versions of {\R} plus the
      \Rpackage{Biostrings} and \Rpackage{BSgenome} packages installed on the
      build machine. To check your installation, start {\R} and try to load
      the \Rpackage{BSgenome} package.
\item Finally, you need to obtain the \term{source data files} of the genome
      that you want to build a package for.
      There are 2 groups of \term{source data files}: (1) the files containing
      the sequence data (those files are required), and (2) the files
      containing the mask data (those files are optional).
      For most organisms, these files have been made publicly available on the
      internet by genome providers like UCSC, NCBI, FlyBase, TAIR, etc.
      The next section of this document explains how to obtain and prepare
      these files.
\end{itemize}

Refer to the \textit{R Installation and Administration} manual
\footnote{\url{http://cran.r-project.org/doc/manuals/R-admin.html}}
if you need to install {\R} or upgrade your {\R} version,
and to the \textit{Bioconductor - Install} page
\footnote{\url{http://bioconductor.org/install/}}
on the Bioconductor website if you need to install or update the
\Rpackage{Biostrings} or \Rpackage{BSgenome} packages.

Questions, comments or bug reports about this document or about the
BSgenomeForge functions are welcome. Please address them to the author
(\code{hpages@fhcrc.org}) or post them on The Bioconductor Project Mailing List
(\code{bioconductor@r-project.org}). Don't forget to visit the
Bioconductor website \footnote{\url{http://bioconductor.org/}}
and subscribe to this list if you've not already done so.

In this document, we call \term{target package} the \term{BSgenome
data package} that we want to forge.


\section{Obtain and prepare the source data files}

As mentioned earlier, there are 2 groups of \term{source data files}: (1) the
files containing the sequence data (required), and (2) the files containing
the mask data (optional).

\subsection{Sequence data (group 1)}

Group 1 must be FASTA files. You need 1 file per sequence that you
want to put in the \term{target package}. The name of each file must be of
the form \textit{<prefix>}\textit{<seqname>}\textit{<suffix>} where
\textit{<seqname>} is the name of the sequence in it and \textit{<prefix>}
and \textit{<suffix>} are a prefix and a suffix (eventually empty) that are
the same for all the files.

For example the FASTA files available at UCSC in the ``Data set by
chromosome'' section for Stickleback
(\url{http://hgdownload.cse.ucsc.edu/goldenPath/gasAcu1/chromosomes/})
could already be considered to have names of this form (\texttt{chrI.fa.gz},
\texttt{chrII.fa.gz}, \texttt{chrIII.fa.gz}, ..., \texttt{chrXXI.fa.gz},
\texttt{chrM.fa.gz} and \texttt{chrUn.fa.gz}). However, because the files
will need to be uncompressed after download and before they can be used
by the forging process, the suffix will need to be set to \texttt{.fa},
not \texttt{.fa.gz}. Also the prefix here is empty, not \texttt{chr},
because \texttt{chr} is considered to be part of the sequence names
(a chromosome naming convention commonly used at UCSC).
Note that, alternatively, you can download and extract the big
\texttt{chromFa.tar.gz} tarball located in the ``Full data set'' section
(aka the \textit{bigZips} folder) for Stickleback
(\url{http://hgdownload.cse.ucsc.edu/goldenPath/gasAcu1/bigZips/}):
it should contain the same files as the ``Data set by chromosome'' folder.

You can use the \Rfunction{fasta.info} function from the \Rpackage{Biostrings}
package to see what's in a FASTA file:
<<>>=
library(Biostrings)
file <- system.file("extdata", "ce2chrM.fa", package="BSgenome")
fasta.info(file)
@

\subsection{Mask data (group 2)}

The mask data are not available for all organisms. What you download exactly
depends of course on what's available and also on what built-in masks you
want to have in the \term{target package}.
4 kinds of built-in masks are currently supported by BSgenomeForge:
\begin{itemize}
\item the masks of assembly gaps, aka ``the AGAPS masks'';
\item the masks of intra-contig ambiguities, aka ``the AMB masks'';
\item the masks of repeat regions that were determined by the RepeatMasker
      software, aka ``the RM masks'';
\item the masks of repeat regions that were determined by the Tandem Repeats
      Finder software (where only repeats with period less than or equal to
      12 were kept), aka ``the TRF masks''.
\end{itemize}

For the AGAPS masks, you need UCSC ``gap'' or NCBI ``agp'' files. It can be one
file per chromosome or a single big file containing the assembly gap information
for all the chromosomes together. In the former case, the name of each file must
be of the form \textit{<prefix>}\textit{<seqname>}\textit{<suffix>}. Like for
the FASTA files in group 1, \textit{<seqname>} must be the name of the sequence
(sequence names for FASTA files and AGAPS masks must match) and
\textit{<prefix>} and \textit{<suffix>} must be a prefix and a suffix
(eventually empty) that are the same for all the files (this prefix/suffix
doesn't need to be, and typically is not, the same as for the FASTA files in
group 1).

You don't need any file for the AMB masks.

For the RM masks, you need RepeatMasker \texttt{.out} files. Like for the AGAPS
masks, it can be one file per chromosome or a single big file containing the
RepeatMasker information for all the chromosomes together.
In the former case, the name of each file must also be of the form
\textit{<prefix>}\textit{<seqname>}\textit{<suffix>}. Same comments apply
as for the AGAPS masks above.

For the TRF masks, you need Tandem Repeats Finder \texttt{.bed} files.
Again, it can be one file per chromosome or a single big file.
In the former case, the name of each file must also be of the form
\textit{<prefix>}\textit{<seqname>}\textit{<suffix>}). Same comments apply
as for the AGAPS masks above.

Again, for some organisms none of the masks above are available or only some
of them are.

\subsection{An example}

Here is how the \term{source data files} for the
\Rpackage{BSgenome.Rnorvegicus.UCSC.rn4} package were obtained and prepared:

\begin{itemize}
\item Group 1:
      \begin{itemize}
      \item Single sequences: file \texttt{chromFa.tar.gz} was downloaded
            from the UCSC \textit{bigZips} folder
            \footnote{\url{http://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/}}
            for \code{rn4} and extracted with:
            \begin{verbatim}
                tar zxf chromFa.tar.gz
            \end{verbatim}
      \item Multiple sequences: files \texttt{upstream1000.fa.gz},
            \texttt{upstream2000.fa.gz} and \texttt{upstream5000.fa.gz} were
            downloaded from the same \textit{bigZips} folder and uncompressed
            with:
            \begin{verbatim}
                for file in upstream*.fa.gz; do gunzip $file ; done
            \end{verbatim}
      \end{itemize}
\item Group 2:
      \begin{itemize}
      \item AGAPS masks: all the \texttt{chr*\_gap.txt.gz} files (UCSC ``gap''
            files) were downloaded from the UCSC \textit{database} folder
            \footnote{\url{http://hgdownload.cse.ucsc.edu/goldenPath/rn4/database/}}
            for \code{rn4}. This was done with the standard Unix/Linux
            \code{ftp} command:
            \begin{verbatim}
                ftp hgdownload.cse.ucsc.edu # login as "anonymous"
                cd goldenPath/rn4/database
                prompt
                mget chr*_gap.txt.gz
            \end{verbatim}
            Then all the downloaded files were uncompressed with:
            \begin{verbatim}
                for file in chr*_gap.txt.gz; do gunzip $file ; done
            \end{verbatim}
      \item RM masks: file \texttt{chromOut.tar.gz} was downloaded from
            the UCSC \textit{bigZips} folder and extracted with:
            \begin{verbatim}
                tar zxf chromOut.tar.gz
            \end{verbatim}
      \item TRF masks: file \texttt{chromTrf.tar.gz} was downloaded from
            the UCSC \textit{bigZips} folder and extracted with:
            \begin{verbatim}
                tar zxf chromTrf.tar.gz
            \end{verbatim}
      \end{itemize}
\end{itemize}

\subsection{The \textit{<seqs\_srcdir>} and \textit{<masks\_srcdir>} folders}

From now we assume that you've downloaded (checking the md5sums is always a
good idea), extracted, and eventually renamed all the \term{source data files},
and that they are located in the \textit{<seqs\_srcdir>} folder for group 1
and in the \textit{<masks\_srcdir>} folder for group 2.

Note that all the \term{source data files} should be located directly in the
\textit{<seqs\_srcdir>} and \textit{<masks\_srcdir>} folders, not in subfolders
of these folders. For example, depending on the genome, UCSC provides either a
big \texttt{chromFa.tar.gz} or \texttt{chromFa.zip} file that contains the
sequence data for all the chromosomes. But it could be that, after extraction
of this big file, the individual FASTA files for each chromosome end up being
located one level down the \textit{<seqs\_srcdir>} folder (granted that you
were in this folder when you extracted the file). If this is the case, then
you will need to move them one level up (use \code{mv -i */*.fa .} for this,
then remove all the empty subfolders with \code{rmdir *}).


\section{Prepare the BSgenome data package seed file}

\subsection{Overview}

The \term{BSgenome data package seed file} will contain all the information
needed by the \Rfunction{forgeBSgenomeDataPkg} function to forge the
\term{target package}.

The format of this file is DCF (Debian Control File), which is also the format
used for the \texttt{DESCRIPTION} file of any {\R} package. The valid fields of
a \term{seed file} are divided in 3 categories:
\begin{enumerate}
\item Standard \texttt{DESCRIPTION} fields. These fields are actually
      the mandatory fields found in any \texttt{DESCRIPTION} file.
      They will be copied to the \texttt{DESCRIPTION} file of the \term{target
      package}.

\item Non-standard \texttt{DESCRIPTION} fields. These fields are specific
      to \term{seed files} and they will also be copied to the
      \texttt{DESCRIPTION} file of the \term{target package}.
      In addition, the values of those fields will be stored in the
      \Rclass{BSgenome} object that will be contained in the \term{target
      package}. This means that the users of the \term{target package} will be
      able to retrieve these values via the accessor methods defined for
      \Rclass{BSgenome} objects. See the man page for the \Rclass{BSgenome}
      class (\code{{?}`BSgenome-class`}) for a description of these methods.

\item Additional fields that don't fall in the 2 first categories.
\end{enumerate}

The 3 following subsections give an extensive descriptions of all the valid
fields of a \term{seed file}.

Alternatively, the reader in a hurry can go directly to the last subsection of
this section for an example of \term{seed file}.

\subsection{Standard \texttt{DESCRIPTION} fields}

\begin{itemize}
\item \code{Package}: Name to give to the \term{target package}. The convention
      used for the packages built by the Bioconductor project is to use names made
      of 4 parts separated by a dot.
      Part 1 is always \code{BSgenome}.
      Part 2 is the abbreviated name of the organism (when the name of
      the organism is made of 2 words, we put together the first letter of the
      first word in upper case followed by the entire second word in lower
      case e.g. \code{Rnorvegicus}).
      Part 3 is the name of the organisation who provided the genome
      (e.g. \code{UCSC}).
      Part 4 is the release string or number used by this organisation
      to identify this version of the genome (e.g. \code{rn4}).

\item \code{Title}: The title of the package. E.g. \code{Rattus norvegicus full
      genome (UCSC version rn4)}.

\item \code{Description}, \code{Version}, \code{Author}, \code{Maintainer},
      \code{License}: Like the 2 previous fields, these are mandatory fields
      found in any \texttt{DESCRIPTION} file.
      Please refer to the \textit{The DESCRIPTION file} section
      of the \textit{Writing R Extensions} manual
      \footnote{\url{http://cran.r-project.org/doc/manuals/R-exts.html\#The-DESCRIPTION-file}}
      for more information about these fields.
      If you plan to distribute the package that you are going to forge, please
      pickup the license carefully and make sure that it is compatible with the
      license of the \term{source data files} if any.
\end{itemize}

\subsection{Non-standard \texttt{DESCRIPTION} fields}

\begin{itemize}
\item \code{organism}: The full name of the organism (e.g. \code{Rattus
      norvegicus}).

\item \code{species}: The name of the species.
      For the packages built by the Bioconductor project from a UCSC genome,
      this field corresponds to the \code{SPECIES} column of the
      \textit{List of UCSC genome releases} table
      \footnote{\url{http://genome.ucsc.edu/FAQ/FAQreleases\#release1}}.

\item \code{provider}: The provider of the \term{source data files} e.g.
      \code{UCSC}, \code{NCBI}, \code{BDGP}, \code{FlyBase}, etc...
      Should preferably match part 3 of the package name (field \code{Package}).

\item \code{provider\_version}: The provider-side version of the genome.
      Should preferably match part 4 of the package name (field \code{Package}).
      For the packages built by the Bioconductor project from a UCSC genome,
      this field corresponds to the \code{UCSC VERSION} field of the
      \textit{List of UCSC genome releases} table.

\item \code{release\_date}: When this assembly of the genome was released.
      For the packages built by the Bioconductor project from a UCSC genome,
      this field corresponds to the \code{RELEASE DATE} field of the
      \textit{List of UCSC genome releases} table.

\item \code{release\_name}: The release name or build number of this assembly
      of the genome.
      For the packages built by the Bioconductor project from a UCSC genome,
      this field corresponds to the \code{RELEASE NAME} field of the
      \textit{List of UCSC genome releases} table.

\item \code{source\_url}: The permanent URL where the \term{source data files}
      used to forge this package can be found.

\item \code{organism\_biocview}: The official biocViews term for this organism.
      This is generally the same as the \code{organism} field except that spaces
      should be replaced by underscores. The value of this field matters only if
      the \term{target package} is going to be added to a Bioconductor
      repository since it will determine under which subview of the
      \textit{Bioconductor Task View} for \textit{Organism}
      \footnote{\url{http://bioconductor.org/packages/release/Organism.html}}
      the package will appear.
      Note that this is the only field in this category that won't be stored
      in the \Rclass{BSgenome} object that will be contained in the \term{target
      package}.
\end{itemize}

\subsection{Other fields}

\begin{itemize}
\item \code{BSgenomeObjname}: Should match part 2 of the package name (field
      \code{Package}).

\item \code{seqnames}: An {\R} expression returning the names of the single
      sequences to forge (in a character vector).
      E.g. \code{paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"),
      "\_random", sep="")), sep="")}.

\item \code{circ\_seqs}: [OPTIONAL] An {\R} expression returning the names of
      the circular sequences (in a character vector). This must be a subset
      of the \code{seqnames} specified above.
      E.g. \code{"chrM"} for rn4 or \code{c("chrM", "2micron")} for the sacCer2
      genome (Yeast) from UCSC.
      The default value for \code{circ\_seqs} is \code{NULL} (no circular
      sequence).

\item \code{mseqnames}: [OPTIONAL] An {\R} expression returning the names of
      the multiple sequences to forge (in a character vector).
      E.g. \code{paste("upstream", c("1000", "2000", "5000"), sep="")}.
      The default value for \code{mseqnames} is \code{NULL} (no multiple
      sequence).

\item \code{nmask\_per\_seq}: [OPTIONAL] The number of masks per sequence (0
      to 4). The default value for \code{nmask\_per\_seq} is \code{0}
      (no mask).

\item \code{PkgDetails}: [OPTIONAL] Some arbitrary text that will be copied
      to the \code{Details} section of the man page of the \term{target
      package}.

\item \code{SrcDataFiles1}, \code{SrcDataFiles2}: [OPTIONAL] Some arbitrary
      text that will be copied to the \code{Note} section of the man pages of
      the \term{target package}. \code{SrcDataFiles1} should describe briefly
      where the \term{source data files} for the sequences are coming from.
      \code{SrcDataFiles2} should do the same for the masks. Permanent URLs are
      a must.

\item \code{PkgExamples}: [OPTIONAL] Some {\R} code (eventually with comments)
      that will be added to the \code{Examples} section of the man page of the
      \term{target package}.

\item \code{seqs\_srcdir}, \code{masks\_srcdir}: The path to the
      \textit{<seqs\_srcdir>} and \textit{<masks\_srcdir>} folders,
      respectively.

\item \code{seqfiles\_prefix}, \code{seqfiles\_suffix}: [OPTIONAL] The common
      prefix and suffix that need to be added to all the sequence names
      (fields \code{seqnames} and \code{mseqnames}) to get the name of the
      corresponding FASTA file.
      Default values are the empty prefix for \code{seqfiles\_prefix} and
      \code{.fa} for \code{seqfiles\_suffix}.

\item \code{AGAPSfiles\_type}: [OPTIONAL] Must be \code{gap} (the default) if
      the \term{source data files} containing the AGAPS masks information are
      UCSC ``gap'' files, or \code{agp} if they are NCBI ``agp'' files.

\item \code{AGAPSfiles\_name}: [OPTIONAL] Omit this field if you have one
      \term{source data file} per single sequence for the AGAPS masks and use
      the \code{AGAPSfiles\_prefix} and \code{AGAPSfiles\_suffix} fields below
      instead.
      Otherwise, use this field to specify the name of the single big file.

\item \code{AGAPSfiles\_prefix}, \code{AGAPSfiles\_suffix}: [OPTIONAL] Omit
      these fields if you have one single big \term{source data file} for all
      the AGAPS masks and use the \code{AGAPSfiles\_name} field above instead.
      Otherwise, use these fields to specify the common prefix and suffix that
      need to be added to all the single sequence names (field \code{seqnames})
      to get the name of the file that contains the corresponding AGAPS mask
      information.
      Default values are the empty prefix for \code{AGAPSfiles\_prefix} and
      \code{\_gap.txt} for \code{AGAPSfiles\_suffix}.

\item \code{RMfiles\_name}, \code{RMfiles\_prefix}, \code{RMfiles\_suffix}:
      [OPTIONAL] Those fields work like the \code{AGAPSfiles*} fields above
      but for the RM masks.
      Default values are the empty prefix for \code{RMfiles\_prefix} and
      \texttt{.fa.out} for \code{RMfiles\_suffix}.

\item \code{TRFfiles\_name}, \code{TRFfiles\_prefix}, \code{TRFfiles\_suffix}:
      [OPTIONAL] Those fields work like the \code{AGAPSfiles*} fields above
      but for the TRF masks.
      Default values are the empty prefix for \code{TRFfiles\_prefix} and
      \texttt{.bed} for \code{TRFfiles\_suffix}.

\end{itemize}

\subsection{An example}

The \term{seed files} used for the packages forged by the Bioconductor project
are included in the \Rpackage{BSgenome} package:
<<>>=
library(BSgenome)
seed_files <- system.file("extdata", "GentlemanLab", package="BSgenome")
list.files(seed_files, pattern="-seed$")
rn4_seed <- list.files(seed_files, pattern="rn4", full.names=TRUE)
cat(readLines(rn4_seed), sep="\n")
@

From now we assume that you have managed to prepare the \term{seed file} for
your package.


\section{Forge the \term{target package}}

To forge the package, start \R, load the \Rpackage{BSgenome} package, and call
the \Rfunction{forgeBSgenomeDataPkg} function on your \term{seed file}.
For example, if the path to your \term{seed file} is
\texttt{"path/to/my/seed"}, do:
<<eval=false>>=
library(BSgenome)
forgeBSgenomeDataPkg("path/to/my/seed")
@

Depending on the size of the genome and your hardware, this can take between
2 minutes and 1 or 2 hours. By default \Rfunction{forgeBSgenomeDataPkg} will
create the source tree of the \term{target package} in the current directory.

Once \Rfunction{forgeBSgenomeDataPkg} is done, ignore the warnings (if any),
quit \R, and build the source package (tarball) with
\begin{verbatim}
    R CMD build <pkgdir>
\end{verbatim}
where \code{<pkgdir>} is the path to the source tree of the package.

Then check the package with
\begin{verbatim}
    R CMD check <tarball>
\end{verbatim}
where \code{<tarball>} is the path to the tarball produced by
\code{R CMD build}.

Finally install the package with
\begin{verbatim}
    R CMD INSTALL <tarball>
\end{verbatim}
and use it!

\section{Session information}

The output in this vignette was produced under the following conditions:

<<>>=
sessionInfo()
@

\end{document}