---
title: "fcScan: features cluster Scan"
author:
- name: Abdullah El-Kurdi
affiliation:
- &id1 American University of Beirut, Beirut, Lebanon
email: ak161@aub.edu.lb
- name: Ghiwa Khalil
affiliation:
- *id1
- name: Georges Khazen
affiliation: Lebanese American University, Byblos, Lebanon
email: gkhazen@lau.edu.lb
- name: Pierre Khoueiry
affiliation:
- *id1
email: pk17@aub.edu.lb
vignette: >
%\VignetteIndexEntry{fcScan}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF8}
output: BiocStyle::html_document
---
```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```
# Version Info
```{r, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({library('fcScan')})
```
**R version**: `r R.version.string`
**Bioconductor version**: `r BiocManager::version()`
**Package version**: `r packageVersion("fcScan")`
```{r, setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
Biological information encoded in the DNA sequence is often organized into
independent modules or clusters. For instance, the eukaryotic system of
expression relies on combinations of homotypic or heterotypic transcription
factors (TFs) which play an important role in activating and repressing
target genes. Identifying clusters of genomic features is a frequent task and
includes application such as genomic regions enriched
for the presence of a combination of transcription factor binding sites or
enriched for the presence of mutations often referred as regions with increase
mutational hotspots.
fcScan is designed to detect clusters of genomic features,
based on user defined search criteria. Such criteria include:
* A Path to BED/VCF files, a dataframe or a GRanges object as input (required)
* A window size specifying the maximum cluster size allowed (required)
* The combination of features required and/or to exclude (required)
* The order of features required within identified clusters (optional)
* The allowed distance or overlap between identified clusters (optional)
* The strand(s) to build the clusters on (optional)
* The seqname(s) to build the clusters on (optional)
* The sites orientation to build the clusters on (optional)
* The distance between site in the cluster (optional)
fcScan is designed to handle large number of genomic features (i.e
data generated by High Throughput Sequencing).
# Dependencies
fcScan depends on the following packages:
* **stats**
* **plyr**
* **utils**
* **rtracklayer**
* **SummarizedExperiment**
* **IRanges**
* **VariantAnnotation**
* **GenomicRanges**
# Overview
Currently, fcScan has one main function, the `getCluster` function. Additional
functionality will be added for future releases including cross-species
identification of orthologous clusters.
# Input arguments for getCluster
## Input data (x)
The input for `getCluster` is given through the parameter `x`.
This function accepts input in data frame, GRanges as well as vector of files
in BED and VCF (compressed or not) formats. BED and VCF files are loaded
by packages `rtracklayer` and `VariantAnnotation` respectively. There is no
limit to the number of files the user can define.
When input is data frame or GRanges objects are given, they should contain
the following "named" 5 columns:
* **seqnames** : Contains the seqname name
* **start**: Contains the start coordinates
* **end**: Contains the end coordinates
* **strand**: Contains the strand relative to each site
* **site** : Contains the category name relative to genomic site
```{r, echo=FALSE, results='asis'}
df <- data.frame(seqnames = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
start = c(10, 17, 25, 27, 32, 41),
end = c(15, 20, 30, 35, 40, 48),
strand = c("+", "+", "+", "+", "+", "+"),
site = c("a", "b", "b", "a", "c", "b"))
knitr::kable(df)
```
The *seqnames*, *start* and *end* columns define the name of the chromosome,
the starting and ending position of the feature in the chromosome, respectively.
The *strand* column defines the strand.
The *site* column contains the ID of the site and will be used for clustering.
The *start* and *end* columns are numeric while the remaining columns are
characters.
Note: when input is data frame, the data is considered `zero-based`
as in BED format.
## Window Size (w)
Window size is set using `w`. It defines the `maximum` size of
the clusters.
## Condition (c)
The clustering condition `c` defines the number and name of genomic features to
search for based on features names defined in the `site` column
`c = c("a" = 1, "b" = 2)`
This searches for clusters with 3 sites, One *a* site and two *b* sites.
Another way of writing the condition, only if input is a vector to file paths,
is the following `x = ("a.bed", "b.bed"), c = c(1,2)`
Given 2 files, *a.bed* and *b.bed*, this condition states that the user is
looking for clusters made from 1 "a" site and 2 "b" sites. In this case, the
order of sites defined in `c` is relative to the order of files.
When input is a data frame or GRanges object (instead of files), the
user needs to **explicitly** define the site names along with the desired
number relative to each site.
For instance, giving the condition as `c = c(1,2)` for a data frame or GRanges
is not allowed.
`x = dataframe, c = c("a" = 1, "b" = 2)` where `a` and `b` are valid site
names in `site` column in the dataframe/GRanges
Users can exclude clusters containing a specific site(s).
This is done by specifying zero `0` in the condition as
`c = c("a" = 1, "b" = 2, "c" = 0)`. In this case, any
cluster(s) containing `c` site will be excluded even if it has 1 `a` and 2 `b`
sites.
## Seqnames (seqnames) and Strand (s).
By default, clustering will be performed on both strands and on all seqnames
unless specified by the user using the `s` and `seqnames` arguments to limit
the search on a specific strand and/or seqname.
Users can choose to cluster on one specific seqname `(seqnames = "chr1")`, or
on several seqnames `(seqnames = c("chr1","chr3","chr4"))`
(Default for `seqnames` is **NULL**) meaning that clustering on all seqnames
will be performed.
For `s`, the values allowed are:
* **+** : Build clusters on positive strand
* **-** : Build clusters on negative strand
* **\*** : Clusters are not strand specific
(Default is set to **\***)
## Overlap (overlap)
The gap/overlap between adjacent clusters, and not sites, can be controled
using the `overlap` option.
When `overlap` is a positive integer, adjacent clusters will be separated by
a minimum of the given value.
When `overlap` is negative, adjacent clusters will be allowed to overlap by
a maximum of the given value. (Default is set to **0**)
## Greedy vs Non-Greedy (greedy)
`greedy` allows the user to control the number of genomic features found
in clusters.
When `greedy = FALSE`, `getCluster` will build clusters having the required
window size and will label *TRUE* the ones that contain the **exact** number
of sites provided in the condition argument.
Clusters having the user defined window size but not satisfying
the condition will be labelled as *FALSE*.
When `greedy = TRUE`, additional sites defined in condition will be added
to the cluster as long as the cluster size is below the defined window size.
(Default is set to **FALSE**)
## Order (order)
The `order` option defines the desired order of the sites within identified
clusters. For instance, `order = c("a","b","b")`
will search for clusters containing 1 `a` and 2 `b` sites and checks if
they are in the specified order. Clusters with 1 `a` and 2 `b` sites that
do not contain the specified order will be rejected.
When greedy is set to `TRUE`, order can be satisfied if a subcluster contains
the desired order. For example if a cluster has `a, a, b, b, b` sites, it
satisfies the required order (a, b, b) and therefore will be considered as
a correct cluster . (Default is set to **NULL**)
## Sites Orientation (site_orientation)
The `site_orientation` option defines the orientation or strandness of sites
in the found clusters. This option cannot be used if `order` is `NULL`.
`site_orientation` should be specified for each site in `order`.
For instance, if `order = c("a","b","b")`, we can define `site_orientation`
for each site respectively as follow: `site_orientation = c("+","-","-")`.
The cluster will be correct if it satisfies the required order and sites
orientation. (Default is set to **NULL**)
## Distance between sites in clusters (site_overlap)
The `site_overlap` option defines the maximum or minimum distance allowed
between sites in the cluster.
When `site_overlap` is a positive integer, sites within a cluster should have
`minimum distance and above`, then the cluster is considered `TRUE` cluster.
When `site_overlap` is a negative integer, sites within a cluster should
have `max distance and below`, then the cluster is considered `TRUE` cluster.
When `site_overlap` is zero, distance between sites is not taken into
consideration. (Default is set to **0**)
## Clustering option (cluster_by)
The `cluster_by` option allows the usage of different ways in scanning for
sites. Using this option, scanning does not 'jump' clusters found.
`cluster_by` can be `startsEnds`, `endsStarts`, `starts`, `ends`,
or `middles`. If we choose `startsEnds`, the clustering begins at the start of
the first site and it ends at the end of the last site within the window size.
If we choose `endsStarts`, scanning begins at the end of the
first site and it ends at the start of the last site within the window size.
if we choose `starts`, the clustering begins at the start of the first site
and it ends at the start of the last site within the window size.
if we choose `ends`, the clustering begins at the end of the first site
and it ends at the end of the last site within the window size.
Also, if we choose `middles`, the clustering begins at the middle
of the first site and it ends at the middle of the last site within the
window size. (Default is set to **startsEnds**)
## Overlapping Clusters option (allow_clusters_overlap)
The `allow_clusters_overlap` allows the user whether to include overlapping
clusters or not. (Default is set to FALSE)
## Include partially overlapping sites (include_partial_sites)
This option can be only used in case where the "cluster_by" choosen is
"startsEnds" or "ends", otherwise it is ignored. If it is set to `TRUE`,
partially overlapping sites within window size (at the end of the cluster)
are allowed. However, if it is set to `FALSE`, partially overlapping sites
are not included in the cluster found. (Default is set to **FALSE**)
## Control addition of partially overlapping sites (partial_overlap_percentage)
This option allows the user to choose which partially overlapping site are
included. The amount of overlap between the last site found and the window
size controls the addition or removal of those sites. For example, if
`partial_overlap_percentage` is set to 0.6, this means that the window size
should overlap this site by 60% of its size. However, if it fails, this site
won't be included in the cluster found. This option can be only used in case
where the "cluster_by" chosen is "startsEnds" or "ends",
otherwise it is ignored. (Default is set to **NULL**)
## Multi-threading (threads)
This option allows to distribute the run over multiple threads. It is
recommended to use multiple threads when the input dataset is large.
When `threads` is set to 0, `getCluster` reserves the adequate number of
threads and performs the run. This will dramatically increase the speed of
`getCluster`. The user can choose the number of threads between 1 and the
maximum number of available threads. (Default is set to **1**)
## Verbose (verbose)
The `verbose` option allows the printing of additional messages and the
list of clusters that failed for lack of correct combination of sites.
This option is used mainly for debugging purposes. (Default is set to FALSE)
# Output of getCluster
The output of `getCluster` is a GRanges object with fields:
* **seqnames**: The seqname on which a cluster is found
* **ranges**: The ranges of the cluster
* **strand**: The strand of the cluster, if any
* **sites**: The combination of sites that define the cluster
* **isCluster**: A logical indicating if the cluster is TRUE or FALSE
* **status**: Describes the reason behind the rejection of a cluster
The algorithm returns all clusters containing the correct count of
sites/features, unless verbose is set to `TRUE`. If the combination,
overlap and order options are satisfied, the cluster is considered a
`TRUE` cluster.
The *status* of a cluster can be either PASS, excludedSites, orderFail,
siteOrientation or siteOverlap
`PASS` is a cluster that satisfied the desired combination, overlap, order
and sites orientation.
`orderFail` is a cluster that had the required combination but did not
satisfy the required order of sites.
`excludedSites` is a cluster that had the required combination and order
but it has one or more sites to exclude.
`siteOrientation` is a cluster that had the required combination and order
but it has one or more sites with different orientation than requested.
`siteOverlap` is a cluster that meets all the requirements
but the distance between its sites is not respected.
NOTE: If the user is using `greedy = FALSE` and `order` contains values more
than in the condition parameter (`c`), an error will be raised.
However, if `greedy = TRUE`, then using `order` with more values than the
condition parameter is allowed since the cluster may contain more
sites than the required `c` condition as long as the window size is satisfied.
# Example on Clustering
Example using `getCluster`:
`getCluster` looks for desired genomic regions of interest like
transcription factor binding sites, within a window size
and specific condition.
This function accepts a data frame and GRanges object. getCluster also accepts
BED or VCF (or mix of both) files as input.
The output of `getCluster` is a GRanges object that
contains the genomic coordinates (seqnames, ranges and strand) and
three metadata columns:
1. sites: contains clusters of sites that conforms with the condition `c`
specified in `getCluster`.
2. isCluster: `TRUE` if the cluster found conform with the condition `c`
and the `order` (if indicated in condition) and `FALSE`
if the cluster fails to conform with the condition or order.
3. status: `PASS` if `isCluster` equals `TRUE`. However, if `isCluster` is
`FALSE`, status shows why the found cluster is not a `TRUE` cluster. If the
order of sites is not respected in the found cluster, status would return
`OrderFail`. in Addition, if the cluster found contains non desired sites,
it returns `excludedSites`. Moreover, if the sites orientation is not
respected in found cluster, status would return `siteOrientation`.
Finally , if the distance between sites in the cluster found does not conform
to the condition specified by the user, status would return `siteOverlap`.
In this example, we ask `getCluster` to look for clusters that contains
one site "s1", one site "s2" and zero "s3" sites. In addition, we requested
clusters to have sites in the order s1,s2 and having orientation
"+","+" respectively with minimum distance between sites of 2bp.
```{r, eval=TRUE}
x1 = data.frame(seqnames = rep("chr1", times = 17),
start = c(1,10,17,25,27,32,41,47,60,70,87,94,99,107,113,121,132),
end = c(8,15,20,30,35,40,48,55,68,75,93,100,105,113,120,130,135),
strand = c("+","+","+","+","+","+","+","+","+",
"+","+","+","+","+","+","+","-"),
site = c("s3","s1","s2","s2","s1","s2","s1","s1","s2","s1","s2",
"s2","s1","s2","s1","s1","s2"))
clusters = getCluster(x1, w = 20, c = c("s1" = 1, "s2" = 1, "s3" = 0),
greedy = TRUE, order = c("s1","s2"), site_orientation=c("+","+"),
site_overlap = 2, verbose = TRUE)
clusters
```
Another example but using GRanges as input:
in this example, we ask `getCluster` to look for clusters that contains
one site `s1` and two sites `s2` within a window size of 25 bp. Also,
we requested clusters to be searched as `+` strand.
```{r, eval=TRUE}
suppressMessages(library(GenomicRanges))
x = GRanges(
seqnames = Rle("chr1", 16),
ranges = IRanges(start = c(10L,17L,25L,27L,32L,41L,47L,
60L,70L,87L,94L,99L,107L,113L,121L,132L),
end = c(15L,20L,30L,35L,40L,48L,55L,68L,75L,93L,100L,105L,
113L,120L,130L,135L)),
strand = Rle("+",16),
site = c("s1","s2","s2","s1","s2","s1","s1","s2",
"s1","s2","s2","s1","s2","s1","s1","s2"))
clusters = getCluster(x, w = 25, c = c("s1"=1,"s2"=2), s = "+")
clusters
```
# Session Info
```{r}
sessionInfo()
```