--- title: "`EpiCompare`: Getting started" author: "

Authors: `r auths <- eval(parse(text = gsub('person','c',read.dcf('../DESCRIPTION', fields = 'Authors@R'))));paste(auths[names(auths)=='given'],auths[names(auths)=='family'], collapse = ', ')`

" date: "

Vignette updated: `r format( Sys.Date(), '%b-%d-%Y')`

" csl: nature.csl output: BiocStyle::html_document: vignette: > %\VignetteIndexEntry{EpiCompare} %\usepackage[utf8]{inputenc} %\VignetteEngine{knitr::rmarkdown} --- # Overview The *EpiCompare* package is designed to facilitate the comparison of epigenomic datasets for quality control and benchmarking purposes. The package combines several downstream analysis tools for epigenomic data and generates a single report that collates all results of the analysis. This allows users to conduct downstream analysis of multiple epigenomic datasets simultaneously and compare the results in a simple and efficient way. ## Introduction For many years, ChIP-seq has been the standard method for epigenomic profiling, but it suffers from a host of limitations. Recently, many other epigenomic technologies (e.g. CUT&Run, CUT&Tag and TIP-seq etc.), designed to overcome these constraints, have been developed. To better understand the performance of these novel approaches, it is important that we systematically compare these technologies and benchmark against a “gold-standard”. There are many tools in R (e.g. *ChIPseeker*) that can be used to conduct downstream analysis and comparison of epigenomic datasets. However, these are often scattered across different packages and difficult to use for researchers with none or little computational experience. *EpiCompare* is designed to provide a simple and comprehensive way of analysing and comparing epigenomic datasets. It combines many useful downstream analysis tools, which can easily be controlled by users and it collates the results in a single report. This allows researchers to systematically compare different epigenomic technologies. While the main functionality of *EpiCompare* is to contrast epigenomic technologies, it can also be used to compare datasets generated using different experimental conditions and data analysis workflows of one technology. This can help researchers to establish a consensus regarding the optimal use of the method. Currently, *EpiCompare* only works for human genome as it uses human-based hg19 and/or hg38 genome references. # Data The *EpiCompare* package contains a small subset of histone mark H3K27ac profile data obtained/generated from: * ENCODE (data accession: ENCFF044JNJ) * CUT&Tag from Kaya-Okur et al., (2019). (PMID: 31036827) * CUT&Run from Meers et al., (2019). (PMID: 31232687) It also contains human genome hg19 and hg38 blacklisted regions obtained from ENCODE. The ENCODE blacklist includes regions of the genome that have anomalous and/or unstructured signals independent of the cell-line or experiment. Removal of ENCODE blacklist is recommended for quality measure. These dataset will be used to showcase *EpiCompare* functionality # Installation To install the package, run the following: ```R if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("EpiCompare") ``` # Running *EpiCompare* In this example analysis, we will compare CUT&Run and CUT&Tag of histone mark H3K27ac against ENCODE ChIP-seq. ## Load package and example datasets Once installed, load the package: ```{r setup_vignette} library(EpiCompare) ``` Load example datasets used in this analysis: ```{r} data("encode_H3K27ac") # ENCODE ChIP-seq data("CnT_H3K27ac") # CUT&Tag data("CnR_H3K27ac") # CUT&Run data("hg19_blacklist") # hg19 genome blacklist data("CnT_H3K27ac_picard") # CUT&Tag Picard summary output data("CnR_H3K27ac_picard") # CUT&Run Picard summary output ``` ## Prepare input data ### Peaklist *EpiCompare* accepts datasets both as `GRanges` object and as paths to BED files. Peakfiles (`GRanges` or paths) that you would like to analyse must be listed and named (see below). ```R # To import BED files as GRanges object peak_GRanges <-ChIPseeker::readPeakFile("/path/to/peak/file.bed",as = "GRanges") # EpiCompare also accepts paths (to BED files) as input peak_path <- "/path/to/BED/file1.bed" # Create named peak list peaklist <- list(peak_GRanges, peak_path) names(peaklist) <- c("sample1", "sample2") ``` In this example, we will use built-in data, which have been converted into `GRanges` object previously (`CnT_H3K27ac` and `CnR_H3K27ac`). ```{r} peaklist <- list(CnT_H3K27ac, CnR_H3K27ac) # create list of peakfiles names(peaklist) <- c("CnT", "CnR") # set names ``` ### Blacklist ENCODE blacklist contains regions of the genome that have anomalous and/or unstructured signals independent of the cell-line or experiment. Removal of these regions from peakfiles is recommended for quality measure. *EpiCompare* has three built-in blacklist files, hg19, hg38 and mm10, downloaded from ENCODE. Run `?hg19_blacklist` for more information. In this example analysis, since all peakfiles (`encode_H3K27ac`, `CnT_H3K27ac`, `CnR_H3K27ac`) were generated using human genome reference build hg19, `hg19_blacklist` will be used. For hg38, use `data(hg38_blacklist)`. Please ensure that you specify the correct blacklist. ### Picard summary files Note that this is *OPTIONAL*. If you want the report to include metrics on DNA fragments (e.g. mapped fragments and duplication rate), please input summary files from Picard. [Picard *MarkDuplicates*](https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates) can be used to mark duplicate reads that are found within the alignment. This tool outputs a metrics file with the ending `.MarkDuplicates.metrics.txt`. To import this text file into R as data frame, use: ```R picard <- read.table("/path/to/Picard/.MarkDuplicates.metrics.txt", header = TRUE, fill = TRUE) ``` In this example. we will use built-in data, which have been converted into data frame previously (`CnT_H3K27ac_picard` and `CnR_H3K27ac_picard`). The files must be listed and named: ```{r} # create list of Picard summary picard <- list(CnT_H3K27ac_picard, CnR_H3K27ac_picard) names(picard) <- c("CnT", "CnR") # set names ``` ### Reference file This is OPTIONAL. If reference peak file is provided, `stat_plot` and `chromHMM_plot` of overlapping peaks are included in the report (see *Optional plots* section below). Reference file must be listed and named. In this example, we will use built-in data (`encode_H3K27ac`), which has been converted to `GRanges` previously: ```{r} reference_peak <- list("ENCODE_H3K27ac" = encode_H3K27ac) ``` ### Output Directory When running `EpiCompare()`, please ensure that you specify `output_dir`. All outputs (figures and HTML report) will be saved in the specified `output_dir`. ## Run EpiCompare Running EpiCompare is done using the function, `EpiCompare()` . Users can choose which analyses to run and include in the report by setting parameters to `TRUE` or `FALSE`. ```{r} EpiCompare(peakfiles = peaklist, genome_build = "hg19", blacklist = hg19_blacklist, picard_files = picard, reference = reference_peak, upset_plot = FALSE, stat_plot = FALSE, chromHMM_plot = FALSE, chromHMM_annotation = "K562", chipseeker_plot = FALSE, enrichment_plot = FALSE, tss_plot = FALSE, interact = FALSE, save_output = FALSE, output_filename = "EpiCompare_test", output_timestamp = FALSE, output_dir = tempdir()) ``` ### Optional plots By default, these plots will not be included in the report unless set `TRUE`. * `upset_plot` : Upset plot showing the number of overlapping peaks between samples. *EpiCompare* uses `UpSetR` package. * `stat_plot` : A `reference` peakfile must be included for this plot. The plot displays distribution of statistical significance (q-values) of sample peaks that are overlapping/non-overlapping with the `reference` dataset. * `chromHMM_plot` : [ChromHMM](http://compbio.mit.edu/ChromHMM/#:~:text=ChromHMM%20is%20software%20for%20learning,and%20spatial%20patterns%20of%20marks.) annotation of peaks. If `reference` is provided, ChromHMM annotation of overlapping and non-overlapping peaks with the `reference` is also included in the report. * `chipseeker_plot` : `ChIPseeker` functional annotation of peaks. * `enrichment_plot` : KEGG pathway and GO enrichment analysis of peaks. * `tss_plot` : Peak frequency around (+/- 3000bp) transcriptional start site. Note that it may take awhile to generate this plot for large sample sizes. ### Other options * `chromHMM_annotation` : Cell-line annotation for ChromHMM. Default is K562. Options are: + "K562" = K-562 cells + "Gm12878" = Cellosaurus cell-line GM12878 + "H1hesc" = H1 Human Embryonic Stem Cell + "Hepg2" = Hep G2 cell + "Hmec" = Human Mammary Epithelial Cell + "Hsmm" = Human Skeletal Muscle Myoblasts + "Huvec" = Human Umbilical Vein Endothelial Cells + "Nhek" = Normal Human Epidermal Keratinocytes + "Nhlf" = Normal Human Lung Fibroblasts * `interact` : By default, all heatmaps (percentage overlap and ChromHMM heatmaps) in the report will be interactive. If set FALSE, all heatmaps will be static. N.B. If `interact=TRUE`, interactive heatmaps will be saved as html files, which may take time for larger sample sizes. * `output_filename` : By default, the report is named EpiCompare.html. You can specify the filename of the report here. * `output_timestamp` : By default FALSE. If TRUE, the filename of the report includes the date. ## Output *EpiCompare* outputs * An HTML report consisting of three sections: + 1. **General Metrics**: Metrics on peaks (percentage of blacklisted and non-standard peaks, and peak widths) and fragments (duplication rate) of samples. + 2. **Peak Overlap**: Percentage and statistical significance of overlapping and non-overlapping peaks. Also includes upset plot. + 3. **Functional Annotation**: Functional annotation (ChromHMM, ChIPseeker and enrichment analysis) of peaks. Also includes peak enrichment around TSS. * `EpiCompare_file` containing all plots generated by EpiCompare if `save_output = TRUE`. Both outputs are saved in the specified `output_dir`. # Future Enhancements In the current version, *EpiCompare* only recognizes certain BED formats. We hope to improve this. Moreover, if there are other downstream analysis tools that may be suitable in *EpiCompare*, feel free to report this through [Github](https://github.com/neurogenomics/EpiCompare). # Code used to generate the example report An example report comparing ATAC-seq, DNase-seq and ChIP-seq of K562 can be found [here](https://neurogenomics.github.io/EpiCompare/articles/example_report.html). Code used to generate this:

```R ## Load data # ATAC-seq data: https://www.encodeproject.org/files/ENCFF558BLC/ atac1_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCFF558BLC.bed", as="GRanges") # Dnase-seq data: https://www.encodeproject.org/files/ENCFF274YGF/ dna1_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCFF274YGF.bed", as="GRanges") # Dnase-seq data: https://www.encodeproject.org/files/ENCFF185XRG/ dna2_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCFF185XRG.bed", as="GRanges") # ChIP-seq data: https://www.encodeproject.org/files/ENCFF038DDS/ chip_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCODE_H3K27ac_hg38_ENCFF038DDS.bed", as="GRanges") ## Peaklist peaklist <- list("ATAC_ENCFF558BLC" = atac1_hg38_unique, "Dnase_ENCFF274YGF" = dna1_hg38, "ChIP_ENCFF038DDS" = chip_hg38) ## Reference reference <- list("Dnase_ENCFF185XRG_reference"=dna2_hg38) ## Blacklist data("hg38_blacklist") ## Run Epicompare EpiCompare(peakfiles = peaklist, genome_build = "hg38", genome_build_output = "hg38", blacklist = hg38_blacklist, reference = reference, picard_files = NULL, upset_plot = T, stat_plot = T, save_output = F, chromHMM_plot = T, chromHMM_annotation = "K562", chipseeker_plot = T, enrichment_plot = T, tss_plot = T, precision_recall_plot =T, corr_plot = T, interact = T, output_dir = "/path/") ```

# Session Information

```{r Session_Info_vignette} utils::sessionInfo() ```