--- title: "Getting started with GenomicDistributions" author: "Nathan Sheffield" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{1. Getting started with GenomicDistributions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE} # These settings make the vignette prettier knitr::opts_chunk$set(results="hold", collapse=FALSE, message=FALSE) #refreshPackage("GenomicDistributions") #devtools::build_vignettes("code/GenomicDistributions") #devtools::test("code/GenomicDistributions") ``` # Introduction to GenomicDistributions If you have a set of genomic ranges, the GenomicDistributions R package can help you visualize properties of your region set. GenomicDistributions produces these nine types of plot: - *chromosome distribution plot* - visualizes how your regions are distributed over chromosomes - *width distribution plot* - visualizes the distribution of range widths - *feature distance distribution plot* - visualizes how your regions are distributed in distance to the nearest feature of interest, like Transcription Start Sites (TSSs). - *partition distribution plot* - visualizes how your regions are distributed across a genomic partitioning, such as frequency of overlapping a gene body, exon, promoter, intronic, or intergenic segment. - *specificity of accessibility plot* - visualizes tissue specificity of a set of genomic ranges. In addition to your input genomic ranges, this plot type requires a tissue specificity data matrix, which contains a set of genomic regions that have been annotated for tissue specificity signal levels for tissues of interest. - *neighboring regions distance* - visualizes the distance between chromosomes neighboring regions. - *GC content plot* - visualizes a probability density function of GC content percentage over the genomic ranges in the query. GenomicDistributions can work with any reference genome, as long as you have some annotation data for it (like chromosome sizes and locations of genes). To make things easier for the common use cases, I've included in the package basic metadata for the most commonly used features from the reference genomes I use most (hg19, hg38, and mm10). If you need to produce similar plots with different features, partitions, or reference assemblies, that's also possible, and not much more difficult; GenomicDistributions is very modular and will work with other bioconductor packages to process that data, but it requires one or two additional steps to curate your reference data. In this vignette, we'll go through examples of each of the plots using my common built-in features and partitions. If you want more control, there's another advanced vignette that will introduce you how to define your own features, partitions, and chromosome sizes for custom analysis. ## Philosophy of modular *calc* and *plot* functions Before we start, I want to explain the design philosophy for functions in this package. Many R plotting packages combine calculations and plotting into one function. This may seem convenient, but if you want to plot the calculation results in a different way or combine them with something else, you can't because you only have access to the final plot. GenomicDistributions divides these tasks so you can use the intermediate data to design your own custom plot, or use the calculated results directly for other analysis. In GenomicDistributions, each plot type has two functions: a *calculate* function and a *plot* function. The *calculate* functions take your GRanges object and return a table of summary results. You can use these summary statistics how you like -- aggregate them across multiple region sets, insert them into other plots you have, and so forth; or, you can simply plug that result directly into the corresponding *plot* function, which returns a *ggplot2* object. Separating the calculation and plotting functions like this gives you more control over your results. ## Installing GenomicDistributions Install `GenomicDistributions` like this: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GenomicDistributions") ``` ## Loading genomic range data Start by loading up the package and getting your query set of regions as a GenomicRanges object. I've included an example bed file to demonstrate how these plots look. You can load it up like this: ```{r, echo=TRUE, results="hide", message=FALSE, warning=FALSE} library("GenomicDistributions") queryFile = system.file("extdata", "vistaEnhancers.bed.gz", package="GenomicDistributions") query = rtracklayer::import(queryFile) ``` # GenomicDistributions plot types ## Chromosome distribution plots *Chromosome distribution plots* help you visualize how your regions are distributed across chromosomes. To produce these, you'll need to specify the chromosome lengths for your reference assembly. There are a few ways to do this. For the common reference assemblies that I use (hg19, hg38, mm9, and mm10), I've included the metadata in the package. If you're working with one of these genomes, making a plot of the distribution across chromosomes takes just a couple of lines of code: ```{r chrom-plots-single} # First, calculate the distribution: x = calcChromBinsRef(query, "hg19") # Then, plot the result: plotChromBins(x) ``` What if we want to do the same thing but on 2 query sets at the same time? No problem: ```{r chrom-plots-multi} # Let's fudge a second region set by shifting the first one over query2 = GenomicRanges::shift(query, 1e6) queryList = GRangesList(vistaEnhancers=query, shifted=query2) x2 = calcChromBinsRef(queryList, "hg19") plotChromBins(x2) ``` These functions just do a naive binning across the genome. If you want to tweak the way the bins are handled, or use a different reference assembly, that's also possible and is only slightly more complicated. There are other functions you can use for that, which are outlined in another vignette. ## Feature distance distribution plots *Feature distance distribution plots* will show you how your regions are distributed with respect to the nearest feature of interest. To illustrate, we'll use Transcription Start Sites (TSS) as our example feature of interest (but really, you can use any region set). For TSS plots, since this is such a common use case, we can use a handy built-in function that does everything for us. It's just one line of code to check distances from query to your TSSs (for common genomes), and then a second line of code to plot those distances: ```{r tss-distribution, fig.cap="TSS plot. Distribution of query regions relative to TSSs", fig.small=TRUE} # Calculate the distances: TSSdist = calcFeatureDistRefTSS(query, "hg19") # Then plot the result: plotFeatureDist(TSSdist, featureName="TSS") ``` This plot uses log-scale increasing bins to show how your regions are distributed. Now, let's make a similar plot with multiple region sets input: ```{r tss-distribution-multi, fig.cap="TSS plots with multiple region sets"} TSSdist2 = calcFeatureDistRefTSS(queryList, "hg19") plotFeatureDist(TSSdist2, featureName="TSS") ``` You can also plot a tiled version that aligns them all vertically: ```{r tiled-distance-plot, fig.cap="Tiled feature distance plot. Distribution of query regions relative to arbitrary features", fig.small=TRUE} plotFeatureDist(TSSdist2, featureName="TSS", tile=TRUE) ``` If you want to check distances to other features, that's no problem; `calcFeatureDistRefTSS()` is really just a wrapper for the workhorse function, `calcFeatureDist()`. To show how this works, get some features you want to check the distance to. Here, let's just shift our query set by a normally distributed random number: ```{r Build features} featureExample = GenomicRanges::shift(query, round(rnorm(length(query), 0,1000))) ``` Now, with these features, we just use the `calcFeatureDist` function to calculate the distances. This function uses the fast rolling joins from `data.table` under the hood, so it completes very quickly. The result of this gets piped right into the plotting function as before: ```{r distance-to-features-plot, fig.cap="Feature distance plot. Distribution of query regions relative to arbitrary features", fig.small=TRUE} fdd = calcFeatureDist(query, featureExample) plotFeatureDist(fdd) ``` ## Partition distribution plots Genomic partition distribution plots show you how your regions are distributed across genome annotation classes. This is most commonly used to show the distribution over element types, such as promoters, exons, introns, or intergenic regions. GenomicDistributions provides 3 types of partition distribution plots: *percentages*, *expected*, and *cumulative*. ### Percentage partition distribution plots The most basic partition plot just provides a barplot of percentage region overlaps. You can produce one or two-set plots like so: ```{r percentage-partition-plot, fig.cap="Partition distribution plot. Percentage distribution of query regions across genomic features", fig.small=TRUE} gp = calcPartitionsRef(query, "hg19") plotPartitions(gp) ``` ```{r multiple-percentage-partition-plot, fig.cap="Partition distribution plot for multiple query region sets.", fig.small=TRUE} gp2 = calcPartitionsRef(queryList, "hg19") plotPartitions(gp2) ``` If you wish, you can also plot the raw overlaps across a defined set of partitions, simply by setting the logical `numbers` to TRUE: ```{r multiple-raw-partition-plot, fig.cap="Raw partition distribution plot for multiple regionsets", fig.small=TRUE} # Plot the results: plotPartitions(gp2, numbers=TRUE) ``` ### Expected partition distribution plots A more useful variation of this plot corrects the values for the expected genome distribution. This accounts for the fact that the distribution of intergenic vs promoter space in the genome is not uniform. Here, we produce one or two-set plots showing the log~10~($\frac{observed}{expected}$) of the distribution of your regions across genomic partitions: ```{r corrected-partition-plot, fig.cap="Expected partition distribution plot. Distribution of query regions across genomic features relative to the expected distribution of those features.", fig.small=TRUE} ep = calcExpectedPartitionsRef(query, "hg19") plotExpectedPartitions(ep) ``` And can you do 2 at a time? Indeed: ```{r multiple-corrected-partition-plot, fig.cap="Expected partition distribution plots for multiple query region sets.", fig.small=TRUE} ep2 = calcExpectedPartitionsRef(queryList, "hg19") plotExpectedPartitions(ep2) ``` These plots would show all values aligning at 0 if your given regions perfectly match the expected genomic distribution. In this case, we see an 1.5-fold enrichment of introns over the background genome distribution, along with a concomitant decrease in other partitions. ### Cumulative partition distribution plots The final type of partition plot is the *cumulative partition distribution* plots. These are a relatively less widespread type of plot that we recently developed. The cumulative partition plot provides an information-dense look into the genomic distribution of reads relative to genomic features. This analysis uses a cumulative distribution to visualize how quickly the final region count is accumulated in features of a given type. To calculate, we overlap each region with a feature set of genomic annotations, as before. The individual feature elements are then sorted by read count, and for each feature, we traverse the sorted list and calculate the cumulative sum of regions found in that feature divided by the total number of regions. We plot the fraction against the $log_{10}$ transformed cumulative size of all loci for each feature. This allows the identification of enriched features while correcting for total features and total genomic space. Now, we'll plot single and multiple cumulative distributions of regions in genomic partitions: ```{r cumulative-partition-plot, fig.cap="Cumulative partition distribution plot. Cumulative distribution of query regions across genomic features.", fig.small=TRUE} cp = calcCumulativePartitionsRef(query, "hg19") plotCumulativePartitions(cp) ``` Can you plot 2 of these? You can: ```{r multiple-cumulative-partition-plot, fig.cap="Cumulative partition distribution plots for multiple query region sets.", fig.small=TRUE} cp2 = calcCumulativePartitionsRef(queryList, "hg19") plotCumulativePartitions(cp2) ``` If you're not familiar with these plots, interpreting them can take a minute to familiarize yourself. Each curve represents a partition type. Then terminal (rightmost) endpoint of each curve is the percentage of query regions that overlap this partition type in total. The shape of the curve, and when it arises on the x-axis, demonstrates the size in terms of raw nucleotide coverage that must be accumulated in order to achieve the corresponding percent coverage. In this plot, we see that intergenic regions reach a higher total point than intronic regions -- which is also reflected in the raw partition plot above. What we see additionally in this plot is that the intergenic spaces also cover a much larger portion of the genome, as indicated by their more more left shifted in the plot. If you want to see how your regions are distributed among other partitions, like CpG islands, enhancers, or something else, GenomicDistributions also has functions to produce priority lists from any GRanges objects. ## Specificity of accessibility plots One feature we are often interested in is the tissue specificity of a set of genomic ranges. To visualize this, GenomicDistributions provides functions to calculate and plot the tissue specificity. In addition to your input genomic ranges, this plot type requires a tissue specificity data matrix. This matrix should contain a set of genomic regions that have been annotated for tissue specificity signal levels for tissues of interest. We have produced example reference data matrixes based on ENCODE accessibility information, which can be downloaded from [big.databio.org](http://big.databio.org/open_chromatin_matrix/). Here, we'll demonstrate how this works with some tiny built-in example reference data. ```{r specificity-plot-single, fig.height = 6, fig.cap="Specificity of chromatin accessibility across cell types."} exampleCellMatrixFile = system.file("extdata", "example_cell_matrix.txt", package="GenomicDistributions") cellMatrix = data.table::fread(exampleCellMatrixFile) op = calcOpenSignal(query, cellMatrix) plotOpenSignal(op) ``` To plot multiple datasets at a time: ```{r specificity-plot-multi, fig.height = 7, fig.cap="Specificity of chromatin accessibility across cell types."} op2 = calcOpenSignal(queryList, cellMatrix) plotOpenSignal(op2) ``` ## Neighboring regions distance plots Genomic Distributions can also generate some basic stats about your regionset. One of those includes the distance between chromosomes neighboring regions. Distances are calculated for each specific chromosome and then lumped together to get a more comprehensive view of the entire regionset. Distances have been log10 transformed to ease comparison between different regionsets and account for outliers. ```{r neighbor-distance-plots, fig.cap="Distances between neighboring intervals of a regionset", fig.small=TRUE} # Calculate the distances nd = calcNeighborDist(query) # Plot the distribution plotNeighborDist(nd) ``` You can also look at the distances distribution for more than one regionset: ```{r multiple-neighbor-distance-plots, fig.cap="Neighboring regions distance for multiple regionsets", fig.small=TRUE} # Create a list of GRanges objects s = system.file("extdata", "setB_100.bed.gz", package="GenomicDistributions") setB = rtracklayer::import(s) queryList2 = GRangesList(vistaEnhancers=query, setB=setB) # Calculate the distances dist2 = calcNeighborDist(queryList2) # Plot the distribution plotNeighborDist(dist2) ``` ## GC content plots *GC content plots* display a probability density function of GC content percentage over the genomic ranges in the query. These plots will not be shown in the vignette because they require large BSgenome packages, which contain the full reference sequence. ```{r gc-content-plots, fig.cap="GC content plot. Probability density function of GC percentage", eval=FALSE} # Calculate the GC content gc1 = calcGCContentRef(query, "hg19") # Plot the GC distribution plotGCContent(gc1) ``` You can also plot the probability distribution for more than one query: ```{r gc-content-plots-multi, fig.cap="Multiple GC content plots.", eval=FALSE} gc2 = calcGCContentRef(queryList, "hg19") plotGCContent(gc2) ``` ## Width distribution plots *Width distribution plots* display a quantile-trimmed histogram (qthist) of widths of the genomic ranges in the query. These plots are quantile-trimmed to preserve the visual shape of the distribution in the face of extreme outliers which sometimes occur in genomic ranges. ```{r width-distribution-single, fig.cap="Width distribution plot. Frequency of widths in this query"} # Calculate the widths qt1 = calcWidth(query) # Plot the width distribution plotQTHist(qt1) ``` You can also plot the frequency distribution for more than one query. ```{r width-distribution-multi, fig.cap="Multiple width distribution plots.", fig.small=TRUE} qt2 = calcWidth(queryList) plotQTHist(qt2) ``` Both the quantile and the bins can be manually adjusted. The quantile you indicate should be the proportion of the data that you want in each end bar of the histogram. You can also adjust the colors of these plots: ```{r width-distribution-colors, fig.cap="Width distribution plot with color options.", fig.small=TRUE} plotQTHist(qt1, bins=7, quantile = .015, EndBarColor = 'black', MiddleBarColor='darkblue') ``` # Conclusion GenomicDistributions provides a variety of ways to visualize your data. 3 key features make it very useful: First, it uses a modular, systematic way to name *calc* and *plot* functions, and the function names and interfaces are all consistent. Second, it is optimized for speed. You will find these functions to be much faster than other commonly accepted alternatives due to way we've written the functions under the hood. And finally, it provides a single, convenient package for many types of plots, so you don't have to look around in multiple places.