%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Using hiAnnotator} ## Introduction `hiAnnotator` contains set of functions which allow users to annotate a GRanges object with custom set of annotations. The basic philosophy of this package is to take two GRanges objects (query & subject) with common set of space/seqnames (i.e. chromosomes) and return associated annotation per space/seqname and rows from the query matching space/seqnames and rows from the subject (i.e. genes or cpg islands). This package comes with three types of annotation functions which calculates if a position from query is: within a feature, near a feature, or count features in defined window sizes. Moreover, each function is equipped with parallel backend to utilize the `foreach` package. The package is also equipped with wrapper functions, which finds appropriate columns needed to make a GRanges object from a common data frame. The work horse functions performing most of the calculations are from `GenomicRanges` package which comes from the Bioconductor repository. Most of the functions in the `hiAnnotator` package are wrapper around following functions: `nearest()`, and `findOverlaps()`. Below are few simple steps to get you started. ## Quick *hiAnnotator* tutorial. ### Step 1: Loading the tools First load this package and the parallel backend of choice. See loading parallel backend section at the bottom of the page for more choices. ```{r} library(hiAnnotator) ``` ### Step 2: Loading & formatting the data The package comes with example dataframes: `sites` and `genes`. In the rest of this tutorial we will use sites as query and genes as subject. Using the `makeGRanges()` function supplied with the package, one can easily go from a dataframe to a GRanges object without too much hassle. ```{r} data(sites) ## sites object doesn't have a start & stop column to denote genomic range, hence soloStart parameter must be TRUE or a nasty error will be thrown! alldata.rd <- makeGRanges(sites, soloStart = TRUE) data(genes) ## adding freeze populates SeqInfo slot of GRanges object. genes.rd <- makeGRanges(genes, freeze = "hg18") ``` The package also comes with wrapper functions to download annotation tracks off of UCSC genome browser using `rtracklayer` package. ```{r, eval = FALSE} refflat <- getUCSCtable("refFlat", "RefSeq Genes") genes <- makeGRanges(refflat) ``` ### Step 3: Annotating the data With the data loaded and formatted, next series of functions highlight various ways they can be annotated. One thing to keep in mind is that, only the `intersect` of spaces/chromosomes/seqnames between query & subject will be annotated, rest will be ignored and will have NAs in the output. #### getNearestFeature: find the nearest neighbor Given a query object, the function retrieves the nearest feature and its properties from a subject and then appends them as new columns within the query object. When used in genomic context, the function can be used to retrieve a nearest gene 5' or 3' end relative to a genomic position of interest. By default, nearest distance to either boundary is calculated unless specifically defined using the `side` parameter. ```{r} nearestGenes <- getNearestFeature(alldata.rd, genes.rd, "NearestGene") head(nearestGenes) # nearestGenes <- getNearestFeature(alldata.rd,genes.rd,"NearestGene", parallel=TRUE) ## get nearest 5' genes nearestGenes <- getNearestFeature(alldata.rd, genes.rd, "NearestGene", side = "5p") head(nearestGenes) ## get nearest 3' genes nearestGenes <- getNearestFeature(alldata.rd, genes.rd, "NearestGene", side = "3p") head(nearestGenes) ## get midpoint of genes nearestGenes <- getNearestFeature(alldata.rd, genes.rd, "NearestGene", side = "midpoint") head(nearestGenes) ### get two nearest upstream and downstream genes relative the query nearestTwoGenes <- get2NearestFeature(alldata.rd, genes.rd, "NearestGene") head(nearestTwoGenes) ``` #### getFeatureCounts: find out how many features are in a neighborhood Given a query object and window size(s), the function finds all the rows in subject which are <= window size/2 distance away. If weights are assigned to each positions in the subject, then tallied counts are multiplied accordingly. If annotation object is large, spanning greater than 100 million rows, then `getFeatureCountsBig()` is used which uses midpoint and drops any weights column if specified to get the job done. The time complexity of this function can be found in `?findOverlaps`. ```{r} geneCounts <- getFeatureCounts(alldata.rd, genes.rd, "NumOfGene") head(geneCounts) # geneCounts <- getFeatureCounts(alldata.rd, genes.rd, "NumOfGene", parallel=TRUE) ``` If dealing with really large set of input objects, the function can break up the data using the `chunkSize` parameter. This is handy when trying to annotated ChipSeq data on an average laptop/machine. There is also `getFeatureCountsBig()` function which uses an alternative method to get the counts using `findInterval`. ```{r, eval=FALSE, echo=TRUE} geneCounts <- getFeatureCounts(alldata.rd, genes.rd, "NumOfGene", doInChunks = TRUE, chunkSize = 100) head(geneCounts) geneCounts <- getFeatureCountsBig(alldata.rd, genes.rd, "NumOfGene") head(geneCounts) ``` #### getSitesInFeature: find out a position within a feature When used in genomic context, the function annotates genomic positions of interest with information like if they were in a gene or cpg island or whatever annotation that was supplied in the subject. ```{r} ## Shows which feature(s) a position was found in. InGenes <- getSitesInFeature(alldata.rd, genes.rd, "InGene") head(InGenes) ## Simply shows TRUE/FALSE InGenes <- getSitesInFeature(alldata.rd, genes.rd, "InGene", asBool = TRUE) head(InGenes) # InGenes <- getSitesInFeature(alldata.rd, genes.rd, "InGene", asBool=TRUE, parallel=TRUE) ``` #### doAnnotation This is a wrapper function which calls one of the functions shown above depending on annotType parameter: within, nearest, twoNearest, counts, countsBig. You can also pass any function to call on the resulting object for any post processing steps. ```{r, eval=FALSE, echo=TRUE} doAnnotation(annotType = "within", alldata.rd, genes.rd, "InGene") doAnnotation(annotType = "counts", alldata.rd, genes.rd, "NumOfGene") doAnnotation(annotType = "countsBig", alldata.rd, genes.rd, "ChipSeqCounts") doAnnotation(annotType = "nearest", alldata.rd, genes.rd, "NearestGene") doAnnotation(annotType = "twoNearest", alldata.rd, genes.rd, "TwoNearestGenes") geneCheck <- function(x, wanted) { x$isWantedGene <- x$InGene %in% wanted; return(x) } doAnnotation(annotType = "within", alldata.rd, genes.rd, "InGene", postProcessFun = geneCheck, postProcessFunArgs = list("wanted" = c("FOXJ3", "SEPT9", "RPTOR")) ) ``` ### Plotting Results `hiAnnotator` comes with a handy plotting function `plotdisFeature` which summarizes and plots the distribution of newly annotated data. Function can be used to easily visualize things like distribution of integration sites around gene TSS, density of genes within various window sizes, etc. ```{r, eval=TRUE, echo=TRUE} res <- doAnnotation(annotType = "within", alldata.rd, genes.rd, "InGene", asBool = TRUE) plotdisFeature(res, "virus", "InGene") res <- doAnnotation(annotType = "nearest", alldata.rd, genes.rd, "NearestGene", side = '5p') plotdisFeature(res, "virus", "X5pNearestGeneDist") data(sites.ctrl) sites$type <- "expr" sites <- rbind(sites,sites.ctrl) alldata.rd <- makeGRanges(sites, soloStart = TRUE) res <- doAnnotation(annotType = "within", alldata.rd, genes.rd, "InGene", asBool = TRUE) plotdisFeature(res, "virus", "InGene") plotdisFeature(res, "virus", "InGene", typeRatio = TRUE) ``` ## Ways to load parallel backends 1) Load one of the following libraries depending on machine/OS: `doMC`, `doSMP`, `doSNOW`, `doMPI` 2) Register the parallel backend using `registerDoXXXX()` function depending on the library. See the examples below: ```{r par_examples, eval=FALSE, echo=TRUE} ## Example 1: library(doSMP) w <- startWorkers(2) registerDoSMP(w) getNearestFeature(..., parallel = TRUE) ## Example 2: library(doMC) registerDoMC(2) getNearestFeature(..., parallel = TRUE) ## Example 3: library(doSNOW) cl <- makeCluster(2, type = "SOCK") registerDoSNOW(cl) getNearestFeature(..., parallel = TRUE) ## Example 4: library(doParallel) cl <- makeCluster(2) registerDoParallel(cl) getNearestFeature(..., parallel = TRUE) ``` 3) Few backends launch worker processes in the background, so be sure to close them. Read the documentation of respective `do*` package to get more information. Few examples are shown below. For doSMP library, use `stopWorkers(w)` For doSNOW & doParallel library, use `stopCluster(cl)`