---
title: "Introduction to HiCExperiment"
author: "Jacques Serizay"
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Introduction to HiCExperiment}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r opts, eval = TRUE, echo=FALSE, results="hide", warning=FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
)
suppressPackageStartupMessages({
    library(dplyr)
    library(GenomicRanges)
    library(HiContactsData)
    library(HiCExperiment)
})
```

# Introduction

The Hi-C experimental approach allows one to query contact frequencies for all possible pairs of genomic loci simultaneously, in a genome-wide manner. The output of this next-generation sequencing-supported technique is a file describing every pair (a.k.a. contact, or interaction) between two genomic loci. This so-called "pairs" file can be binned and transformed into a numerical matrix. In such a matrix, each cell contains the raw or normalized interaction **frequency** between a pair of genomic loci (whose locations can be retrieved using the corresponding column and row indices).

[HiC-Pro](https://github.com/nservant/HiC-Pro), [distiller](https://github.com/open2c/distiller-nf) and [Juicer](https://github.com/aidenlab/juicer/) are the three main pipelines used to align, filter and process paired-end fastq reads into pairs files and contact matrices. Each pipeline defines its own file formats to store these two types of files.

- Pairs files are (gzipped) human-readable text files that are a variant of the BEDPE format; however, the column order varies depending on the pipeline being used.
- Contact matrix file formats vary greatly depending on the pipeline:
    - `HiC-Pro` generates two human-readable files: a `regions` file describing each genomic interval, and a `matrix` file quantifying interaction frequency between pairs of loci from the `regions` file, using a standard triplet sparse matrix format.
    - `Juicer` generates a `.hic` file, a highly compressed binary file storing sparse contact matrices from multiple resolutions in a single file.
    - `distiller` uses the `.(m)cool` format, a sparse, compressed, binary genomic matrix data model built on HDF5.

Each file format can store roughly the same information, but `.hic` and `.(m)cool` files are much more compressed and, unlike HiC-Pro-derived files, can also hold matrices at multiple resolutions.

The [4DN consortium](https://data.4dnucleome.org/help/about/about-dcic), which aims to decipher the role nuclear organization plays in gene expression and cellular function, officially supports both the `.hic` and `.(m)cool` formats. Furthermore, the `.(m)cool` format has recently gained a lot of traction with the release of a series of `python` packages (`cooler`, `cooltools`, `pairtools`, `coolpuppy`) by the [Open2C organization](https://open2c.github.io/), which facilitate the investigation of Hi-C data stored in `.(m)cool` files in a `python` environment.

The R `HiCExperiment` package aims at unlocking Hi-C investigation within the rich, genomics-oriented Bioconductor environment. It provides a set of classes and import functions to parse Hi-C files (both contact matrices and pairs) in R, allowing random access and efficient genome-based subsetting of contact matrices. It leverages pre-existing base Bioconductor classes, notably the `GInteractions` and `ContactMatrix` classes ([Lun, Perry & Ing-Simmons, F1000 Research 2016](https://f1000research.com/articles/5-950/v2)).

# Installation

The `HiCExperiment` package can be installed from Bioconductor using the following command:

```{r eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("HiCExperiment")
```

All R dependencies will be installed automatically.

# The `HiCExperiment` class

```{r load_lib}
library(HiCExperiment)
showClass("HiCExperiment")
hic <- contacts_yeast()
hic
```

```{r graph, eval = TRUE, echo=FALSE, out.width='100%'}
knitr::include_graphics(
    "https://raw.githubusercontent.com/js2264/HiCExperiment/devel/man/figures/HiCExperiment_data-structure.png"
)
```

# Basics: importing `.(m)cool`, `.hic` or HiC-Pro-generated files as `HiCExperiment` objects

## Import methods

The implemented `import()` methods allow one to import Hi-C matrix files in R as `HiCExperiment` objects.

```{r import, eval = FALSE}
hic <- import(
    "path/to/contact_matrix.cool",
    focus = "chr:start-end",
    resolution = ...
)
```

To give real-life examples, we use the `HiContactsData` package to get access to a range of toy datasets available from the `ExperimentHub`.

```{r evaled_import}
library(HiContactsData)
cool_file <- HiContactsData('yeast_wt', format = 'cool')
import(cool_file, format = 'cool')
```

## Supporting file classes

There are currently three main standards to store Hi-C matrices in files:

- `.(m)cool` files
- `.hic` files
- `.matrix` and `.bed` files: generated by HiC-Pro.

Three supporting classes were specifically created to ensure that each of these file structures would be properly parsed into `HiCExperiment` objects:

- `CoolFile`
- `HicFile`
- `HicproFile`

For each object, an optional `pairsFile` can be associated and linked to the contact matrix file when imported as a `HiCExperiment` object.

```{r many_imports}
## --- CoolFile
pairs_file <- HiContactsData('yeast_wt', format = 'pairs.gz')
coolf <- CoolFile(cool_file, pairsFile = pairs_file)
coolf
import(coolf)
import(pairsFile(coolf), format = 'pairs')

## --- HicFile
hic_file <- HiContactsData('yeast_wt', format = 'hic')
hicf <- HicFile(hic_file, pairsFile = pairs_file)
hicf
import(hicf)

## --- HicproFile
hicpro_matrix_file <- HiContactsData('yeast_wt', format = 'hicpro_matrix')
hicpro_regions_file <- HiContactsData('yeast_wt', format = 'hicpro_bed')
hicprof <- HicproFile(hicpro_matrix_file, bed = hicpro_regions_file)
hicprof
import(hicprof)
```

# Import arguments

## Querying subsets of Hi-C matrix files

The `focus` argument is used to import only the contacts within a genomic locus of interest.

```{r focus}
availableChromosomes(cool_file)
hic <- import(cool_file, format = 'cool', focus = 'I:20001-80000')
hic
focus(hic)
```

_Note:_ Querying subsets of HiC-Pro formatted matrices is currently not supported. HiC-Pro formatted matrices are always fully loaded in memory on import.

One can also extract a count matrix from a Hi-C matrix file that is *not* centered at the diagonal. To do this, specify a pair of coordinates in the `focus` argument using a character string formatted as `"...|..."`:

```{r asym}
hic <- import(cool_file, format = 'cool', focus = 'II:1-500000|II:100001-300000')
focus(hic)
```

## Multi-resolution Hi-C matrix files

`import()` works with `.mcool` and multi-resolution `.hic` files as well: in this case, the user can specify the `resolution` at which count values are recovered.

```{r mcool}
mcool_file <- HiContactsData('yeast_wt', format = 'mcool')
availableResolutions(mcool_file)
availableChromosomes(mcool_file)
hic <- import(mcool_file, format = 'cool', focus = 'II:1-800000', resolution = 2000)
hic
```

# HiCExperiment accessors

## Slots

Slots of a `HiCExperiment` object can be accessed using the following getters:

```{r slots}
fileName(hic)
focus(hic)
resolutions(hic)
resolution(hic)
interactions(hic)
scores(hic)
tail(scores(hic, 1))
tail(scores(hic, 'balanced'))
topologicalFeatures(hic)
pairsFile(hic)
metadata(hic)
```

Several extra functions are available as well:

```{r extra}
seqinfo(hic) ## To recover the `Seqinfo` object from the `.(m)cool` file
bins(hic) ## To bin the genome at the current resolution
regions(hic) ## To extract unique regions of the contact matrix
anchors(hic) ## To extract "first" and "second" anchors for each interaction
```

## Slot setters

### Scores

Add any `scores` metric using a numerical vector.

```{r scores}
scores(hic, 'random') <- runif(length(hic))
scores(hic)
tail(scores(hic, 'random'))
```

### Features

Add `topologicalFeatures` using `GRanges` or `Pairs`.

```{r features}
topologicalFeatures(hic, 'viewpoints') <- GRanges("II:300001-320000")
topologicalFeatures(hic)
topologicalFeatures(hic, 'viewpoints')
```

## Coercing `HiCExperiment`

Using the `as()` function, a `HiCExperiment` object can be seamlessly coerced into a `GInteractions`, `ContactMatrix`, `matrix` or `data.frame` object.

```{r as}
as(hic, "GInteractions")
as(hic, "ContactMatrix")
as(hic, "matrix")[1:10, 1:10]
as(hic, "data.frame")[1:10, ]
```

# Importing pairs files

Pairs files typically contain chimeric pairs (filtered after mapping), corresponding to loci that have been religated together after restriction enzyme digestion. Several standards exist for such files.

- The `.pairs` file format, supported by the 4DN consortium
- The pairs format generated by Juicer
- The `.(all)validPairs` file format, defined in the HiC-Pro pipeline

The `import` function automatically detects which of these formats a pairs file uses and imports it in R:

```{r pairs}
import(pairs_file, format = 'pairs')
```

# Further documentation

Please check `?HiCExperiment` in R for a full description of available slots, getters and setters, and comprehensive examples of interaction with a `HiCExperiment` object.

# Session info

```{r session}
sessionInfo()
```