%\VignetteIndexEntry{geneXtendeR Vignette}
%\VignettePackage{geneXtendeR}

\documentclass[12pt]{article}

\usepackage{float}
\usepackage{cite}
\title{geneXtendeR}
\author{Bohdan B. Khomtchouk}

<<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@ 


\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\begin{figure}[H]
\centering
\includegraphics[width=50mm,scale=0.8]{figures/geneXtendeRlogo.png}
\end{figure}

\section*{Introduction}

This vignette describes \texttt{geneXtendeR} (Khomtchouk et al. 2016), an R/Bioconductor package for optimized annotation of genomic features (primarily peaks called from a ChIP-seq experiment, but any coverage island regions would work) with the nearest gene. ``Extending" refers to performing gene-feature overlaps after adding to the gene-span a user-specified region upstream of the start of the gene model and a fixed (500 bp) region downstream of the gene, resulting in assigning to a gene the features that do not physically overlap with it but are sufficiently close.  Extending is an automated iterative procedure in \texttt{geneXtendeR}, allowing the user to repeatedly align peaks to multiple gene transfer format (GTF) files to assess what global gene-spans optimize the genomewide alignment of peaks with their closest genes.  This facilitates the process of deciphering which differentially enriched peaks are dysregulating which specific genes.  This, in turn, aids experimental follow-up and validation in designing primers for a set of prospective genes during qPCR (Barbier et al. 2016).  

\subsection*{Rationale}

With an abundance of Bioconductor software currently available for peak annotation to nearby features (e.g., \texttt{ChIPpeakAnno} (Zhu et al. 2010)) as well as the existence of various command line tools (e.g., \texttt{BEDTools} \emph{closest} function (Quinlan and Hall, 2010), \texttt{HOMER} (Heinz et al. 2010)), what makes \texttt{geneXtendeR} different?  The simple answer is: \texttt{geneXtendeR} is designed for assessing the variability of peak overlap with cis-regulatory elements and proximal-promoter regions.  It is well-known that peak coordinates (peak start position, peak end position) exhibit a considerable degree of variance depending on the peak caller used (e.g., SICER (Zang et al. 2009), MACS2 (Zhang et al. 2008), etc.), both in terms of length distribution of peaks as well as the total number of peaks called, even when run at identical default parameter values (Koohy et al. 2014; Thomas et al. 2017).  Tuning algorithm-specific parameters produces even greater variance amongst peak callers, thereby complicating the issue further.  This variance becomes a factor when annotating peak lists genome-wide with their nearest genes as, depending on the peak caller, peaks can be either shifted in genomic position (towards 5' or 3' end) or be of different lengths.  As such, \texttt{geneXtendeR} represents a first step towards tailoring (or customizing) the functional annotation of a ChIP-seq peak dataset according to the details of the peak coordinates (chromosome number, peak start position, peak end position).  

The primary focus of \texttt{geneXtendeR} is to optimize the process of functional annotation of a ChIP-seq peak list whereby instead of just annotating peaks with their nearest genomic features (as statically defined by a given genome build's coordinates), \texttt{geneXtendeR} investigates how peaks dynamically align to various user-specified gene extensions (e.g., 500 bp upstream extensions, 2000 bp upstream extensions, etc. for all genes in the genome).  This shows where peaks localize across the genome with respect to their nearest gene, as well as what gene ontologies (BP, CC, and MF) are impacted at these various extension levels.  This, in turn, informs the user what gene extensions ideally capture the GO terms involved in the biology of their experiment.  For example, if a user's study is investigating the role of epigenetic enzymes in alcohol addiction and dependence, then functionally annotating a peak list using gene extensions that maximize the number of brain-related ontologies (for both BP, CC, and MF categories) makes sense.

With regards to histone modification ChIP-seq analysis, \texttt{geneXtendeR} computes optimal gene extensions tailored to the broadness of the specific epigenetic mark (e.g., H3K9me1, H3K27me3), as determined by a user-supplied ChIP-seq peak input file.  To accomplish this level of custom-tailored data analysis, \texttt{geneXtendeR} first optimally extends the boundaries of every gene in a genome by some genomic distance (in DNA base pairs) for the purpose of flexibly incorporating cis-regulatory elements, such as promoter regions, as well as downstream elements that are important to the function of the gene relative to an epigenetic histone modification ChIP-seq dataset.  This action effectively transforms genes into ``gene-spheres", a new term that we coin to emphasize the 3D-nature of heterochromatin.  A gene-sphere is composed of cis-regulatory elements (e.g., proximal promoters +/- $\approx 3$ kb from TSS), distal regulatory elements (e.g., enhancers), transcription start/end sites (TSS/TES), exons, introns, and downstream elements of a gene.  As such, \texttt{geneXtendeR} maximizes the signal-to-noise ratio of locating genes closest to and directly under peaks. By performing a computational expansion of this nature, ChIP-seq reads that would initially not map strictly to a specific gene can now be optimally mapped to the regulatory regions of the gene, thereby implicating the gene as a potential candidate, and thereby making the ChIP-seq analysis more successful. Such an approach becomes particularly important when working with epigenetic histone modifications that have inherently broad peaks with a diffuse range of signal enrichment (e.g., H3K9me1, H3K27me3). 


\section*{Quick start}

First, install the \texttt{geneXtendeR} package via:

<<eval=FALSE>>=
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("geneXtendeR")
@

<<>>=
library(geneXtendeR)
@

This automatically loads the \texttt{rtracklayer} R package, which contains the \texttt{readGFF()} command used to retrieve GTF files of any model organism.  As such, load in a GTF file into your R environment, e.g.:

<<results=hide, eval=FALSE>>=
rat <- readGFF("ftp://ftp.ensembl.org/pub/release-84/gtf/
                      rattus_norvegicus/Rattus_norvegicus.Rnor_6.0.84.chr.gtf.gz")
@

URLs may be obtained as direct links from: \url{http://useast.ensembl.org/info/data/ftp/index.html}.  Click on the ``GTF" link under the ``Gene sets" column for a particular species and then right-click (or command-click on Mac OS X) the name of the file containing the species name/version number and file extension \texttt{chr.gtf.gz} (e.g., Homo\_sapiens.GRCh38.84.chr.gtf.gz, Mus\_musculus.GRCm38.84.chr.gtf.gz, etc.), and copy the link address.  Then, paste the link address into the \texttt{readGFF()} as shown above.  This will create an R dataframe object containing the respective GTF file.

Next, the user must input their peak data from a peak caller (e.g., SICER, MACS2, etc.).  The peak data must contain only three tab-delimited columns (chromosome number, peak start, and peak end) and a header containing: ``chr", ``start", and ``end".  See \texttt{?samplepeaksinput} for an example.  Once the peak input data (e.g., ``somepeaksfile.txt") has been assembled properly (i.e., to contain only the three tab-delimited columns and header above), it must be properly formatted prior to the execution of downstream analyses.  

First, the user must set their working directory to point to the location of their peak data file.  Then type the following command:

<<>>=
peaksInput("somepeaksfile.txt")
@

This command properly formats the user's peaks file in preparation for subsequent analyses, producing a resultant ``peaks.txt" file in the user's working directory\footnote{Similarly, users can transform their peaks file into a file of merged peaks (see \texttt{peaksMerge()}) and use the resultant ``peaks.txt" file instead for the subsequent analysis.}.

To see how the above command works using a built-in example, the \texttt{geneXtendeR} package provides a peak input dataset\footnote{This peaks dataset comes from a ChIP-seq investigation of brain tissue (prefrontal cortex) in alcohol addiction and dependence (Barbier et al. 2016), see References section for details.} called ``somepeaksfile.txt", which can be loaded into memory like this:

<<>>=
fpath <- system.file("extdata", "somepeaksfile.txt", package="geneXtendeR")
peaksInput(fpath)
@

This creates a properly formatted (i.e., properly sorted) ``peaks.txt" file in the user's working directory.

Now, we may use the R object that we created with \texttt{readGFF()} earlier to create a bar chart visualization showing the number of peaks that are sitting directly on top of genes across a series of upstream extensions (of each gene in a genome):

\begin{figure}[H]
\begin{center}
<<fig=TRUE>>=
barChart(rat, 0, 10000, 500)
@
\end{center}
\end{figure}

This command first generates 21 individual whole-genome files: 0, 500, 1000, ..., and 10000 bp upstream extension files for the rat (\emph{Rattus norvegicus}) genome, each having an automatic 500 bp downstream extension.  In other words, each gene in the rat genome is extended upstream and downstream by some user-specified distance, thereby creating a ``gene-sphere."  As such, this bar chart command visualizes the raw count of the number of peaks that are sitting on top of genes at each individual upstream cutoff.  Clearly, the wider the gene-sphere, the more peaks-on-top-of-genes are found throughout the genome.  However, the law of diminishing returns begins to kick in at increasing upstream extension levels (see \texttt{linePlot()} for a visual representation):

\begin{figure}[H]
\begin{center}
<<fig=TRUE>>=
linePlot(rat, 0, 10000, 500)
@
\end{center}
\end{figure}

In this line plot, there is a sharp rise in the number of peaks-on-top-of-genes from a 0 bp upstream extension to a 1500 bp upstream extension, and from a 2000 bp upstream extension to a 3000 bp upstream extension.  This steady rise up until 3000 bp is followed by a steady decline at subsequent extension levels followed by some noisy fluctuations.  It may be interesting to investigate what is going on in the interval from 2000 bp to 3000 bp: 

\begin{figure}[H]
\begin{center}
<<fig=TRUE>>=
linePlot(rat, 2000, 3000, 100)
@
\end{center}
\end{figure}

Visually, there is a relative spike in the number of peaks-on-top-of-genes at the 2400 bp upstream extension (as compared to the 2300 bp extension).  This spike then drops back down at subsequent extension levels and fluctuates in a noisy manner.  However, a cumulative line plot shows that this ``spike" is more of a visual effect than anything else, since the graph is almost perfectly linear:

\begin{figure}[H]
\begin{center}
<<fig=TRUE>>=
cumlinePlot(rat, 2000, 3000, 100)
@
\end{center}
\end{figure}

Hence, one very useful function in \texttt{geneXtendeR} is called \texttt{hotspotPlot()}, which allows users to examine the ratio of statistically significant peaks\footnote{Note that statistical significance is set apriori by the user at the peak calling stage (prior to \texttt{geneXtendeR}) to give the user the freedom to choose how to filter out peak coordinates that only pass specific p-value and FDR cutoffs from a peak caller.  Peak caller output (e.g., from SICER) gives both p-value and FDR measures for each peak, thereby making it easy to extract only the peak coordinates that pass a specific set of statistical cutoff criteria.} to the total number of peaks at each genomic interval (e.g., 0-500 bp upstream of every gene in the genome, 500-1000 bp upstream of every gene in the genome, etc.). 

\begin{figure}[H]
\begin{center}
<<fig=TRUE>>=
allpeaks <- system.file("extdata", "totalpeaksfile.txt", package="geneXtendeR")
sigpeaks <- system.file("extdata", "significantpeaksfile.txt", package="geneXtendeR")
hotspotPlot(allpeaks, sigpeaks, rat, 0, 10000, 500)
@
\end{center}
\end{figure}

This line plot shows that the concentration of significant peaks in this dataset (Barbier et al. 2016) is highest between 0 and 1000 bp upstream of a gene, with over 90\% of peaks in these regions being statistically significant.  In contrast, between 1000 bp and 2500 bp, only about half of the total peaks contained in these intervals are significant.  Statistical significance then fluctuates noisly at further upstream genomic intervals, but with at least a quarter (25\%) of the total peaks in these further upstream regions being statistically significant.  As such, the take-home message is that genomic regions within the first 1000 bp upstream of their respective genes are most likely to contain significant peaks (relative to the total peak count in these regions) and are therefore hotspots, but regions beyond this also contain a fair share of statistically significant peaks.

One interesting area to investigate is the variance in the broadness of significant (or total) peaks across different genomic intervals\footnote{One can either observe the global distribution of peak lengths within specific genomic intervals (see \texttt{?peakLengthBoxplot()}), or observe the global distribution of peak lengths across all intervals (see \texttt{?allPeakLengths()}).}.  In other words, asking questions like ``are statistically significant peaks that are located very close to their nearest gene (e.g., 0-500 bp away) wider or narrower than peaks located 500-1000 bp away from their nearest gene?".  To answer this question we can do:

\begin{figure}[H]
\begin{center}
<<fig=TRUE>>=
sigpeaks <- system.file("extdata", "significantpeaksfile.txt", package="geneXtendeR")
peaksInput(sigpeaks)
meanPeakLengthPlot(rat, 0, 10000, 500)
@
\end{center}
\end{figure}

This line plot displays the mean (average) length of all significant peaks found within each genomic interval.  Clearly, the ``average peak" is slightly narrower in 0-500 bp intervals than in 500-1000 bp intervals yet, overall, peak lengths tend to fluctuate more or less stochastically at various intervals.  To get the exact peak length, we can do:

<<>>=
sigpeaks <- system.file("extdata", "significantpeaksfile.txt", package="geneXtendeR")
peaksInput(sigpeaks)
meanPeakLength(rat, 0, 500)
@

So the mean peak length in the interval 0-500 bp is approximately 1958 bp.  Although we see that there is no specific interval with peaks of extraordinary average lengths, it is still possible to see peak length outliers in certain cases (especially when looking at total peak sets):  

\begin{figure}[H]
\begin{center}
<<fig=TRUE>>=
allpeaks <- system.file("extdata", "totalpeaksfile.txt", package="geneXtendeR")
peaksInput(allpeaks)
meanPeakLengthPlot(rat, 0, 10000, 500)
@
\end{center}
\end{figure}

We see that the 4000-4500 bp and 8500-9000 bp intervals both look quite different in terms of their mean peak lengths relative to the other intervals.  To see if the mean might be influenced by a strong outlier(s), we can do:

\begin{figure}[H]
\begin{center}
<<fig=TRUE>>=
allpeaks <- system.file("extdata", "totalpeaksfile.txt", package="geneXtendeR")
peaksInput(allpeaks)
peak_lengths <- peakLengthBoxplot(rat, 4000, 4500)
@
\end{center}
\end{figure}

This box-and-whisker plot shows a clear outlier, which is an example of a very broad peak.  We can find the exact length of this outlier peak using: 

<<>>=
peak_lengths <- peakLengthBoxplot(rat, 4000, 4500)
max(peak_lengths)
@

So this outlier peak measures 114999 bp in total length, therefore making it an extremely broad peak.  To see what nearest gene it resides to, we can first extract the peak's index by:

<<>>=
peak_lengths <- peakLengthBoxplot(rat, 4000, 4500)
match(114999, peak_lengths)
@

which returns the index of where this peak length is found.  Then the following command finds all unique peaks that reside between 4000 and 4500 bp upstream of their nearest gene:

<<>>=
distinct(rat, 4000, 4500)
@

where we see that index \texttt{126} belongs to gene \texttt{AABR07044065.1}\footnote{This peak may not be statistically significant, but how could it be if it's so huge?  In situations like this, it may be a good idea to check what is known about the gene already: \url{http://panthertest2.usc.edu/genes/gene.do?acc=RAT\%7CEnsembl=ENSRNOG00000028578\%7CUniProtKB=A0A0G2K0W2}.  Clearly, not much is known yet.}.  Checking the arithmetic difference between column 3 and column 2 for this specific row verifies 114999, as these two columns represent the peak start position and peak end positions.  Now let's identify what the other columns represent by running the \texttt{distinct()} function again (but this time on a smaller interval to have less output printed to the screen):    

<<>>=
fpath <- system.file("extdata", "somepeaksfile.txt", package="geneXtendeR")
peaksInput(fpath)
distinct(rat, 2300, 2400)
@

This data table shows 29 separate entries sorted by chromosome and start position.  \texttt{V1-V3} denote the chromosome/start/end positions of the peaks, \texttt{V4-V6} denote the respective values for the genes, \texttt{V7} is the gene ID (e.g., Ensembl ID), \texttt{V8} is the gene name, and \texttt{V9} is the distance of each respective peak to its nearest gene.  It should be noted that the X chromosome is designated by the integer 100, the Y chromosome by the integer 200, and the mitochondrial chromosome by the integer 300.  This is done for sorting purposes (see \texttt{?peaksInput} for details).  In short, the \texttt{distinct()} command finds what peaks-on-top-of-genes would be missed if a 2300 bp upstream extension is used instead of a 2400 bp extension.  In other words, these 29 genes all reside between 2300-2400 bp upstream of their nearest gene.

It may be of interest to note the differential gene ontologies between these two upstream extensions:

<<>>=
library(org.Rn.eg.db)
library(GO.db)
x <- diffGO(rat, 2300, 2400, BP, org.Rn.eg.db)
head(x, 20)
@

This dataframe shows the first 20 unique gene ontology terms, their IDs, and respective gene symbols.  Clearly, gene name \emph{Gprc5b} has several BP ontologies related explicitly to the brain, while \emph{Taldo1} does not.  Considering that the ChIP-seq peaks dataset used as input into \texttt{geneXtendeR} comes from a ChIP-seq study investigating the prefrontal cortex, this suggests that a 2400 bp extension may be more suitable for this brain dataset.  However, such decisions are left entirely to the discretion and judgment of the user in deciding the relative importance of specific genes and their respective GO terms (BP, CC, or MF) to the goals of the computational analysis (as well as plans for experimental follow-up and validation).  See Discussion section for details.  

It is also critical to note that the \texttt{diffGO()} function returns ALL known gene ontologies, NOT a gene ontology enrichment analysis (more about this in Discussion section).  The goal is to provide users with knowledge regarding all possible known roles of any given gene.  For example, by knowing that a potential gene candidate has previously been linked with known brain-related ontologies, a user may be prompted to look more closely into the relevant literature behind this gene and its implications to the biological question under study (before embarking on making a decision about its potential impact and suitability as a good candidate for experimental validation).  

Furthermore, a user may plot the differential gene ontology results as an interactive network:

<<>>=
library(networkD3)
library(org.Rn.eg.db)
library(dplyr)
makeNetwork(rat, 2300, 2400, BP, org.Rn.eg.db)
@

\begin{figure}[H]
\centering
\includegraphics{figures/vignette_network_birdseye.png}
\caption{Orange color denotes gene names, purple color denotes GO terms.  A user can hover the mouse cursor over any given node to display its respective label directly within R Studio.  Likewise, users can dynamically drag and reorganize the spatial orientation of nodes, as well as zoom in and out of them for visual effect.}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics{figures/vignette_network_Gprc5b.png}
\caption{Orange color denotes gene names, purple color denotes GO terms.  A user can hover the mouse cursor over any given node to display its respective label directly within R Studio.  Likewise, users can dynamically drag and reorganize the spatial orientation of nodes, as well as zoom in and out of them for visual effect.}
\end{figure}

In addition, users can generate word clouds comprised from words present in their GO terms:

<<results=hide, eval=FALSE>>=
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
makeWordCloud(rat, 2300, 2400, BP, org.Rn.eg.db)
@

\begin{figure}[H]
\centering
\includegraphics{figures/vignette_wordcloud.png}
\caption{Word cloud generated from words comprising gene ontology terms of category BP.  This word cloud shows the words that are used within BP gene ontology terms of peaks found to be present between 2300 and 2400 bp upstream of their nearest genes.}
\end{figure}

It may also be of interest to visually examine the most frequently used words found within GO terms:  

<<results=hide, eval=FALSE>>=
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
plotWordFreq(rat, 2300, 2400, BP, org.Rn.eg.db, 10)
@

\begin{figure}[H]
\centering
\includegraphics{figures/vignette_wordfreq.png}
\caption{This barplot shows the top 10 words used within gene ontology terms (specific to BP) of peaks found to be present between 2300 and 2400 bp upstream of their nearest genes.}
\end{figure}


Once the user has chosen the specific upstream extension to be used, the peak file is ready to be fully annotated:

<<>>=
annotate(rat, 2400)
@

which generates a fully annotated peaks outfile (in the user's working directory) containing various genomic features and labeled headers. 

\section*{Discussion}

Even though \texttt{geneXtendeR} is designed to compute (and analyze/display) optimal gene extensions tailored to the characteristics of a specific peak input file, \texttt{geneXtendeR} will not explicitly impose on the user the optimal extension to select, since this information is highly study-dependent and, as such, is ultimately reserved to the user's discretion.  For example, a user may choose a conservatively lower upstream extension (e.g., for studies investigating narrow peaks such as H3K4me3 or H3K9ac that exhibit a compact and localized enrichment pattern, where high upstream extensions may begin to lose biological relevance).  An example of such a user-driven decision would be the selection of a 1500 bp upstream extension instead of a 3500 bp extension in situations like this:

\begin{figure}[H]
\begin{center}
<<fig=TRUE, echo=FALSE>>=
linePlot(rat, 0, 10000, 500)
@
\end{center}
\end{figure}

This line plot is derived from the input peak dataset used from the H3K9me1 study examined earlier (Barbier et al. 2016).  If the study had examined a narrower chromatin mark (e.g., H3K4me3) then the decision process for choosing an optimal extension may have been different.  

In certain cases, additional extensions are unlikely to add significant value to the annotation of the peak file.  Taking the example of the 0-10000 bp line plot, an upstream extension beyond 3500 bp globally across every gene in a genome would most likely not accurately reflect the biology of the peak input file (since such large global upstream extensions are likely to reach considerably beyond known proximal promoter elements, especially for relatively narrow histone marks or transcription factors).  Such assumptions may be validated directly by the user by investigating the p-value and FDR of specific peaks using a combination of \texttt{HT-seq} (to count the reads) and \texttt{edgeR}/\texttt{DESeq2} (to assess statistical significance).  As such, \texttt{geneXtendeR} is designed to be used as part of a biological workflow involving subsequent statistical analysis:

\begin{figure}[H]
\centering
\includegraphics{figures/workflow.pdf}
\caption{Sample biological workflow using \texttt{geneXtendeR} in combination with existing statistical software to analyze peak significance.  Subsequent gene ontology enrichment or network analysis may be conducted on genes associated with statistically significant peaks.}
\end{figure}

It is entirely possible (and probable) for significant peaks to be present at relatively high upstream extension levels (i.e., large gene-spheres), albeit these significant peaks may be associated with biology not directly relevant to the study at-hand, due mainly to the sheer magnitude of the distance of the peak from traditional gene boundaries (where traditional gene boundaries may be loosely defined as +/- $\approx 3$ kb from TSS and +/- $\approx 0.5$kb from TES).  Consequently, it is likely for peaks-on-top-of-genes to exhibit higher levels of noise at higher upstream extension levels.  Nevertheless, this does not mean that potential enhancer activity should be discounted.  For instance, it is not uncommon to see a steady rise or even a surge in the number of peaks-on-top-of-genes at higher upstream extension levels:

\begin{figure}[H]
\begin{center}
<<fig=TRUE>>=
linePlot(rat, 7000, 8500, 100)
@
\end{center}
\end{figure}

This line plot shows that there are over 30 peaks in this dataset (across the rat genome) that reside between 8100 and 8200 bp upstream of their nearest gene.  In far-out cases like this, it is particularly recommended to examine the statistical significance of peaks to get a sense for the possibility of potential enhancer activity/regulation.  Of course, such computational findings would require experimental follow-up and/or database mining for known motifs.  Assessment of such statistical significance values is beyond the scope of \texttt{geneXtendeR}, in order to allow the user freedom to choose the most appropriate statistical package/technique for their analysis.  As before, first use the \texttt{distinct()} function to create a table of unique genes located under peaks between the two upstream extension levels:

%<<fig.width=6.2, fig.height=5>>=
<<results=hide, eval=FALSE>>=
distinct(rat, 8100, 8200)
@

Then, assess the statistical significance of these peaks using a combination of \texttt{HT-seq} (Anders et al. 2015) and \texttt{edgeR} (Robinson et al. 2010), or \texttt{HT-seq} and \texttt{DESeq2} (Love et al. 2014), or some other appropriate combination of existing software tools.  Genes associated with the resultant statistically significant peaks may then be further assessed with gene ontology enrichment analysis to help answer a variety of interesting research questions.  It should once again be noted that the \texttt{diffGO()} function does NOT perform gene ontology enrichment analysis.  Instead, it returns all known gene ontologies for each gene.  The purpose and utility of this is described in the previous section.

Moreover, DNA sequences under peaks may be checked for the presence of known regulatory motifs (e.g., using \texttt{TRANSFAC} (Matys et al. 2006) or \texttt{MEME}/\texttt{JASPAR} (Sandelin et al. 2004, Bailey et al. 2009)), or for the presence of biological repeats (e.g., using \texttt{RepeatMasker} (Smit et al. 2015)). Pending a prospective GO enrichment and network analysis, functional validation may be followed up in the lab to test any potential regulatory sites or prospective enhancer elements, thereby bringing the computational analysis pipeline back to the bench.

In addition to the computational workflows discussed above, \texttt{geneXtendeR}'s wide array of functions makes it possible to conduct some rather interesting and creative combinations of genomic analysis.  Let's say, for example, that a user wants to explore all known ontological differences across specific disparate sectors of the genome (e.g., 0-500 bp vs. 2000-3000 bp, but removing 501-1999 bp from consideration).  In other words, look at all peaks (across the entire genome) that reside between 0-500 bp upstream of their nearest gene (and 2000-3000 bp upstream of their nearest gene), and extract unique gene ontologies that differ between these two variable-length sectors (where one is 500 bp long and the other is 1000 bp in length).  This can be accomplished rather conveniently using \texttt{dplyr}:

<<>>=
library(dplyr)
library(org.Rn.eg.db)
library(GO.db)
a <- diffGO(rat, 0, 500, BP, org.Rn.eg.db)
b <- diffGO(rat, 2000, 3000, BP, org.Rn.eg.db)
dplyr::filter(b, TERM %in% a$TERM)

@

This displays all biological process (BP) ontologies present in \texttt{b} that are not present in \texttt{a}.  Similarly, one can look at all BP, CC, or MF ontologies present in \texttt{a} that are not present in \texttt{b}. 

\section*{Concluding remarks}

\texttt{geneXtendeR} is continually evolving, so any suggestions or new feature requests are always appreciated.  Likewise, any bug reports may be posted to \url{https://github.com/Bohdan-Khomtchouk/geneXtendeR/issues} or emailed to the package maintainer directly.    


\begin{thebibliography}{31}

\bibitem{anders}
Anders S, Pyl PT, Huber W:  \textit{HTSeq--a Python framework to work with high-throughput sequencing data.}  Bioinformatics.  2015, 31(2): 166--169. 

\bibitem{bailey}
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS:  \textit{MEME SUITE: tools for motif discovery and searching.}  Nucleic Acids Research.  2009, 37 (2): W202--W208. 

\bibitem{barbier}
Barbier E, Johnstone AL, Khomtchouk BB, Tapocik JD, Pitcairn C, Rehman F, Augier E, Borich A, Schank JR, Rienas CA, Van Booven DJ, Sun H, N\"{a}tt D, Wahlestedt C, Heilig M: \textit{Dependence-induced increase of alcohol self-administration and compulsive drinking mediated by the histone methyltransferase PRDM2}.  Molecular Psychiatry. 2016, Nature Publishing Group. doi: 10.1038/mp.2016.131.

\bibitem{heinz}
Heinz S, Benner C, Spann N, Bertolino E et al.: \textit{Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities.} Mol Cell 2010, 38(4): 576--589.

\bibitem{khomtchouk}
Khomtchouk BB, Van Booven DJ, Wahlestedt C: \textit{geneXtendeR: R/Bioconductor package for functional annotation of histone modification ChIP-seq data in a 3D genome world}.  bioRxiv. 2016, 1--15. 

\bibitem{koohy}
Koohy H, Down TA, Spivakov M, Hubbard T: \textit{A Comparison of Peak Callers Used for DNase-Seq Data}.  PLoS One. 2014, 9(8): e105136. 

\bibitem{deseq2}
Love MI, Huber W, Anders S:  \textit{Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.}  Genome Biology.  2014, 15:550.

\bibitem{matys}
Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E:  \textit{TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.}  2006.  Nucleic Acids Research.  34 (Database issue): D108--110.

\bibitem{quinlan}
Quinlan AR, Hall IM: \textit{BEDTools: a flexible suite of utilities for comparing genomic features}. Bioinformatics. 2010, 26(6): 841--842.

\bibitem{edgeR}
Robinson MD, McCarthy DJ, Smyth GK: \textit{edgeR: a Bioconductor package for differential expression analysis of digital gene expression data}. Bioinformatics. 2010, 26: 139--140.

\bibitem{sandelin}
Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B:  \textit{JASPAR: an open-access database for eukaryotic transcription factor binding profiles.}  Nucleic Acids Research.  2004, 32 (Database issue): D91--D94.  

\bibitem{smit}
Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0.  2013-2015  \url{<http://www.repeatmasker.org>}.

\bibitem{thomas}
Thomas R, Thomas S, Holloway AK, Pollard KS: \textit{Features that define the best ChIP-seq peak calling algorithms}.  Briefings in Bioinformatics.  2017, 18(3): 441--450.

\bibitem{zang}
Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W:  \textit{A clustering approach for identification of enriched domains from histone modification ChIP-Seq data.}  Bioinformatics.  2009, 25(15): 1952--1958.

\bibitem{zhang}
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS:  \textit{Model-based analysis of ChIP-Seq (MACS).}  Genome Biology.  2008, 9(9): R137.

\bibitem{zhu}
Zhu L, Gazin C, Lawson N, Pages H, Lin S, Lapointe D, Green M: \textit{ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data}.  BMC Bioinformatics.  2010, 11(1), pp. 237.

\end{thebibliography}


\end{document}