HiCBricks 1.2.0
Note by the developer HiCBricks is currently in version 1.0. If users find some features to be missing, experience any issue with the package or find any problems in its usage please do open an issue on the github page.
HiCBricks is a library designed for handling large high-resolution Hi-C datasets. Over the years, the Hi-C field has experienced a rapid increase in the size and complexity of datasets. HiCBricks is meant to overcome the challenges related to the analysis of such large datasets within the R environment.
HiCBricks leverages HDF (Hierarchical Data Format) files to allow efficient handling of large Hi-C contact matrices. HiCBricks implements a Hi-C specific HDF data structure, referred to as a Brick object and presents accessor functions allowing users to access and manipulate Hi-C data without all the difficultis of dealing with the complex structure of HDF files.
The HiCBricks package, its set of data retrieval functions along with the Brick objects are meant to serve as building blocks for custom Hi-C analysis procedures as well as the development of new R or Bioconductor packages.
HiCBricks uses HDF files to efficiently manage large Hi-C datasets. Hereafter, we refer to the structured data contained into HiCBricks HDF files as Brick objects. The Brick object implement a HDF structure consisting of three layers:
n x n
dimensional matrix. On the other hand, the trans (inter-chromosomal) contacts for each chromosome pair is a rectangular n x m
dimensional matrix. The contact matrix loaded into a brick object can include the whole genome, or only specfic chromosomes selected by the end-user.(chromosome, start, end)
associated to each row or column of the contact matrix. Hi-C data are usually aggregated and summarized over relatively large genomic intervals (bins) to achieve a more robust quantification of signal (read counts or normalized read counts). These can be defined as either fixed size bins or variable size bins. The latter may be useful for example to handle very high-resolution Hi-C contacts at single restriction fragment resolution.GenomicRanges
object, it can be stored in the Brick object.In order to load Hi-C data matrix into a Brick object we have to:
The key advantage of HDF files is that once they have been created and populated with data, the data can be accessed very efficiently without the need to reload the whole matrix into memory each time.
Currently, HiCBricks functionalities allows importing data from text files with complete 2D matrices and binary cooler files (.mcool or .cool), as described below in section 2.1 and 2.3, respectively.
In section 2.3 below we also explain how to extract data from a binary .hic file format (produced by the Juicer1 Durand NC, Shamim MS, Machol I, Rao SSP, Huntley MH, Lander ES, Aiden EL. 2016. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst 3: 95–98. Hi-C data analysis framework) to import them into a Brick object.
All files in this vignette will be created in a temporary directory, the location of which is available through
tempdir()
. Any files stored in this temporary directory will be deleted upon closure of the R session. Users are therefore advised to peruse the code before execution and to replacetempdir()
with their own project directory locations when working with their own datasets.
First of all load the HiCBricks library:
library("HiCBricks")
Then get the path to the test datasets provided with the package, and in particular the path to the “bin table” specifying the genomics coordinates associated to each genomic bin used to summarize the Hi-C data. These would correspond to row and columns of the Hi-C contact matrix. In this example the bin table
provided is for human chromosome 19 divided in equally sized bins of 40 Kb.
Bintable.path <- system.file("extdata",
"Bintable_40kb.txt", package = "HiCBricks")
Chromosomes <- "chr19"
Note that in this example the bin table is a space delimited text file with 3 columns for chromosome, start and end coordinates of each genomic bin, respectively. This format is not mandatory as:
col.index
which takes as input the column index of the columns containing the chr, start and end coordinates. This parameter defaults to c(1,2,3)
. When users have bin tables with many additional columns, they can take advantage of this parameter and provide the column indices relevant for this function, i.e. the indices corresponding to the chr, start and end columns. The indices must be provided in the specific order of chr,start,end.There must be a one to one match between elements in the bin table and in the contact matrix. The “Bintable_40kb.txt” file contains 800 rows, i.e. the genomic intervals for 800 bins with 40 Kb size.
Then, the backbone of the Brick object data structure is created using only the bin table. The actual contacts frequencies will be added later. The CreateBrick
function returns a named character vector containing the complete path to the HDF file that will store the data.
Path_to_cached_file <- CreateBrick(ChromNames = Chromosomes,
BinTable = Bintable.path, bin.delim = " ",
Output.Filename = file.path(tempdir(),"test.hdf"), exec = "cat",
remove.existing = TRUE)
## The 'showWarnings' argument has been deprecated and will be removed.
## Use suppressMessages() and suppressWarnings() to limit messages printed to screen
HiCBricks does not overwrite your files without you saying so Keeping in line with Bioconductor standards and community guidelines, HiCBricks will never replace a user’s files without their explicit permission. When a user defines
remove.existing=TRUE
, they are giving their permission to HiCBricks function to remove any Brick objects they may have created previously. Otherwise, without this parameter theCreateBrick
will not overwrite existing files. You will encounter this parameter in another form,remove.prior
when populating HDF files with Hi-C data.
Technical note This code chunk is the first place users will encounter the
exec
parameter. Theexec
argument allows users complete flexibility in specifying a call to a system command to manipulate and read the bin table input file. Theexec
parameter takescat
as the default value, i.e. a call to the system utilitycat
used for reading text files. Similarly, if the file is compressed, theexec
parameter can take as values popular compression utilities such aszcat
orbzcat
for reading the compressed file. For advanced users, theexec
parameter can contain full lengthawk
orcut
commands to sort and filter out columns or rows which are not required from the file. Thus, rather than read the same file twice using two different programs, users can provide a system command which will process the file and then pipe the processed output to the R session.
As an illustrative example, to clarify the expected structure of input data for a complete 2D matrix in text format, we are generating a fake contact matrix, containing containing random values, then loading these as if they were contact data.
First we create an empty n x n matrix (with n=800)
with uniformly sampled random numbers.
Test.mat <- matrix(runif(800*800),nrow = 800, ncol = 800)
Then the matrix is written to a txt file, without headers or rownames, and using whitespace as the field delimiter in this example.
Matrix.file <- file.path(tempdir(),"Test_matrix.txt")
write.table(x = Test.mat, file = Matrix.file, sep = " ", quote = FALSE,
row.names = FALSE, col.names = FALSE)
We are now ready to load the 2D matrix data from the Matrix.file
into the Brick object. When doing so, we have to pass to the Brick_load_matrix
function the path to the HDF file (so far containing only the bin table). Then we have to specify which chromosome (for intra-chromosomal n x n contact matrix) or which pair of chromosomes (for inter-chromosomal n x m contact matrix) are contained in the matrix.file
. In case of intra-chromosomal contacts like in this example, chr1
and chr2
arguments will have the same value.
Brick.file <- Path_to_cached_file
Brick_load_matrix(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19",
matrix.file = Matrix.file,
delim = " ",
exec = "cat",
remove.prior = TRUE)
## Read 800 lines after Skipping 0 lines
## Inserting Data at location: 1
## Data length: 800
## Loaded 5171640 bytes of data...
## Read 800 records...
## [1] TRUE
Note that also in this case we specify a column delimiter (delim= " ")
and an exec
parameter command for reading the input file. If you have a very large 2D cis matrix, you may want to load data up to a certain diagonal, i.e. up to bin pairs separated by a certain distance on the genome. The rationale in this case is that the contact frequency is rapidly decaying with distance, thus a you may find a very sparse matrix at longer distances.
In this example we load up to 100 diagonals with Brick_load_cis_matrix_till_distance
Brick_load_cis_matrix_till_distance(Brick = Brick.file,
chr = "chr19",
matrix.file = Matrix.file,
delim = " ",
distance = 100,
remove.prior = TRUE)
## Inserting Data at location: 1, 1
## Data length: 800
## Loaded 4.88 MB of data...
## [1] TRUE
HiCBricks functions allow converting files with the mcool or cool (cooler) formats into HDF files with the Brick format. On a technical side, it must be noted that cooler file formats are also based on HDF, but the conversion to HiCBricks
data structure allowed us to design more efficient data accessors. mcool
, is a data formats adopted by the 4D nucleome project to disseminate data. These files contain Hi-C contact matrices in a sparse format, storing the non-zero values of the upper triangular matrix in the HDF file. mcool
files can include multiple normalisations and resolutions within the same file. HiCBricks, on the other hand stores a single normalisation and single resolution in any given HDF file.
In this exercise we will download one mcool file from the 4DN data portal at https://data.4dnucleome.org/. For the purposes of this vignette we will use a sample H1 human embryonic stem cell line (H1-hESC) Hi-C data.
HiCBricks currently accepts mcool and cool files following cool format version 2 and all prior versions. I do not make any guarantees for future versions, since development of the cooler package is independent of the HiCBricks package. If you are a user from the future and have encountered a format specific issue while reading mcool files with HiCBricks, please open an issue on the HiCBricks github repository
Please note, that these are very large files, and they may require a significant amount of time to download, depending on the speed of network connection.
For convenience, you can download the file using “curl” directly within the R prompt.
require(curl)
Consortium.home = "https://data.4dnucleome.org/files-processed"
File = file.path(Consortium.home,"4DNFI7JNCNFB",
"@@download","4DNFI7JNCNFB.mcool")
curl_download(url = File,
destfile = file.path(temp.dir(),"H1-hESC-HiC-4DNFI7JNCNFB.mcool"))
This file contains normalised Hi-C data for H1-hESC cells obtained using the DpnII restriction enzyme. Note that there are multiple types of normalisations available within the sample. We can check what normalisation weights are available using Brick_list_mcool_normalisations(names.only = TRUE)
.
HiCBricks accepts from the mcool files, normalization factors for “Knight-Ruitz”, “Vanilla-coverage”, “Vanilla-coverage-square-root” or “Iterative-Correction”. For more details about the different normalizations see Rao et al., 20142 A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES and Aiden EL. Cell, 2014.
The 4D nucleome project disseminates its data with several different resolutions, i.e. different bin sizes. Brick_list_mcool_resolutions
provides information regarding the different resolutions available within the mcool files.
mcoolName=file.path(temp.dir(),"H1-hESC-HiC-4DNFI7JNCNFB.mcool")
Brick_list_mcool_resolutions(mcool = mcoolName)
Then users can query the mcool files to find out if a given normalisation factor is present for a given resolution.
Brick_mcool_normalisation_exists(mcool = mcoolName,
norm.factor = "Iterative-Correction",
binsize = 10000)
HiCBricks allows users to create HDF files with the Brick format from mcool and cool files. Users are limited to a single resolution and normalisation factor per Brick object. When creating a Brick object users are able to load all chromosomes in a single Brick object, but this is not recommended because,
For these reasons, when dealing with high-resolution Hi-C contact matrices it may be more efficient to separate the matrices saving them chromosome by chromosome into different brick objects.
Similar to the previous section, we will start from the mcool file content to:
In the previous example for 2D text matrices, we used the CreateBrick
function to create Brick objects for 2D matrices. Instead, to create Brick objects from mcool files we will use the CreateBrick_from_mcool
function.
When creating Brick objects from scratch users were required to provide a bin table. Instead when starting from mcool files the users do not need to provide a bin table as such information is already embedded within the mcool files. Therefore, users can just provide the resolution they want to load, and the corresponding bin table information will be fetched from the mcool files.
Using the chrs
parameter users can limit the structure created to the relevant chromosomes or, if left NULL (default value), the structure for all chromosome and chromosome-pairs will be created. Please note, that if two chromosomes are specified in the chrs
parameter, then 4 interaction maps will actually be created into the Brick object. These will be two for the cis
(intra-chromosomal) interaction maps for each chromosome and two for the trans
(inter-chromosomal) interaction maps corresponding to each chromosome pair.
Brick.name <- file.path(tempdir(),
"H1-hESC-HiC-4DNFI7JNCNFB-10000-ICE-normalised-chr1.brick")
Output.brick <- CreateBrick_from_mcool(Brick = Brick.name,
mcool = mcoolName, binsize = 10000, chrs = "chr1", remove.existing = TRUE)
After the Brick object has been created, we will populate the HDF file with values coming from the Hi-C interaction matrix stored within the mcool file. For this, we use the Brick_load_data_from_mcool
function.
Brick_load_data_from_mcool(Brick = Output.brick,
mcool = mcoolName,
chr1 = "chr1",
chr2 = "chr1",
binsize = 10000,
cooler.batch.size = 1000000,
matrix.chunk = 2000,
dont.look.for.chr2 = TRUE,
remove.prior = TRUE,
norm.factor = "Iterative-Correction")
There are a few options allowing users to manipulate data read and write speed. cooler.batch.size
determines the data read buffer size. matrix.chunk
determines data write buffer, i.e. the size of the matrix square that will be loaded per iteration through an mcool file. If you are loading cis (intra-chromosomal) contact matrices, it is recommended to set the dont.look.for.chr2
parameter to TRUE to speed up the loading of data. In cases of trans (inter-chromosomal) contact matrices, this option should be set to FALSE. remove.prior
defaults to FALSE to prevent users from overwriting data already loaded into an HDF file.
Note that if the end users wishes to load raw read counts from the cooler file to the Brick object, this can be achieved by specifying the norm.factor=NULL
parameter.
The method to create a Brick object from a .hic
file is still a work in progress and will be a part of the package in a future release. Meanwhile, if users wish to create a .hic
file into a Brick object, they should first use the hic2cool
utility to create an mcool
file and then read that mcool
file into a Brick object. This utility is available at 4D nucleome github repository.
As mentioned above, there are three different types of information stored within Brick objects.
Both the bin table and the annotations are represented as GRanges
objects, i.e. lists of genomic intervals as defined in the Bioconductor package GenomicRanges
.This is the de-facto standard for working with genomic coordinates in the R/Bioconductor environment.
As mentioned in the introduction to this section, HiCBricks
objects contain two different types of GRanges
objects. Hereafter we will refer to them as ranges objects.
The first is the bin table, which is central towards the proper functioning of HiCBricks functions, and is mostly inaccessible for user modifications. The second are annotation for the user’s reference and is completely accessible for the user.
Each of the ranges objects are stored under a unique id. The bin table always holds the id, “Bintable” whereas other annotations ranges objects can hold user-defined unique identifiers. Users can list unique identifiers of all ranges objects inside a Brick object using Brick_list_rangekeys
. In the example below there are two ranges objects that are listed. The first, is the bin table, whereas the second is an example custom annotation stored in the test HDF file.
Similar functions are available inside
HiCBricks
to list attributes or features of elements within Brick objects. See the package manual pages for functions with names like Brick_list_. These are the list functions. Afterwards, we will also come across the get or fetch functions. While list functions list features and attributes of elements within Brick objects, get and fetch functions help users get the same elements.
Brick.file <- system.file("extdata", "test.hdf", package = "HiCBricks")
Brick_list_rangekeys(Brick.file)
## [1] "Bintable" "test_ranges"
If we want to retrieve the bin table, we can retrieve its content using the Brick_get_bintable
function
Brick_get_bintable(Brick.file)
## GRanges object with 1479 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## chr19:1:40000 chr19 1-40000 *
## chr19:40001:80000 chr19 40001-80000 *
## chr19:80001:120000 chr19 80001-120000 *
## chr19:120001:160000 chr19 120001-160000 *
## chr19:160001:200000 chr19 160001-200000 *
## ... ... ... ...
## chr19:58960001:59000000 chr19 58960001-59000000 *
## chr19:59000001:59040000 chr19 59000001-59040000 *
## chr19:59040001:59080000 chr19 59040001-59080000 *
## chr19:59080001:59120000 chr19 59080001-59120000 *
## chr19:59120001:59160000 chr19 59120001-59160000 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Otherwise, you can retrieve the object using Brick_get_ranges
the method called by Brick_get_bintable
.
Brick_get_ranges(Brick = Brick.file,
rangekey = "Bintable")
## GRanges object with 1479 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## chr19:1:40000 chr19 1-40000 *
## chr19:40001:80000 chr19 40001-80000 *
## chr19:80001:120000 chr19 80001-120000 *
## chr19:120001:160000 chr19 120001-160000 *
## chr19:160001:200000 chr19 160001-200000 *
## ... ... ... ...
## chr19:58960001:59000000 chr19 58960001-59000000 *
## chr19:59000001:59040000 chr19 59000001-59040000 *
## chr19:59040001:59080000 chr19 59040001-59080000 *
## chr19:59080001:59120000 chr19 59080001-59120000 *
## chr19:59120001:59160000 chr19 59120001-59160000 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
While fetching the ranges object we can also subset the retrieved GRanges by the chromosome of interest.
Brick_get_ranges(Brick = Brick.file,
rangekey = "Bintable",
chr = "chr19")
## GRanges object with 1479 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## chr19:1:40000 chr19 1-40000 *
## chr19:40001:80000 chr19 40001-80000 *
## chr19:80001:120000 chr19 80001-120000 *
## chr19:120001:160000 chr19 120001-160000 *
## chr19:160001:200000 chr19 160001-200000 *
## ... ... ... ...
## chr19:58960001:59000000 chr19 58960001-59000000 *
## chr19:59000001:59040000 chr19 59000001-59040000 *
## chr19:59040001:59080000 chr19 59040001-59080000 *
## chr19:59080001:59120000 chr19 59080001-59120000 *
## chr19:59120001:59160000 chr19 59120001-59160000 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Users may sometimes find it useful to identify the corresponding contact matrix row or column for a particular genomic coordinate.
Brick_return_region_position(Brick = Brick.file,
region = "chr19:5000000:10000000")
## [1] 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
## [19] 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
## [37] 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
## [55] 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
## [73] 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
## [91] 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
## [109] 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250
This does a within overlap operation and returns the corresponding bins indexes, i.e. the indexes of contact matrix rows or columns covering the user-specified region. This function may fail if the region of interest is smaller than the genomic bins corresponding to the region of interest in the contact matrix.
To have a more fine-grain control, users may choose to use Brick_fetch_range_index
which is also called by the above described Brick_return_region_position
function. Note that in this case, multiple regions can be provided as input at once, by specifying vectors with multiple values as chr
, start
and end
input parameters. Differently from above, the output of this function is a GRanges
object, with an entry for each input query regions. The matching genomic bins are stored in the Indexes column. This column is of class IntegerList
.
Users, unfamiliar with the
IntegerList
and other classes like it are encouraged to check out the IRanges package on Bioconductor.
Brick_fetch_range_index(Brick = Brick.file,
chr = "chr19",
start = 5000000,
end = 10000000)
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | Indexes
## <Rle> <IRanges> <Rle> | <IntegerList>
## chr19:5000000:10000000 chr19 5000000-10000000 * | 125,126,127,...
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
There are three ways to subset matrices.
It is possible to get the interactions between genomic loci separated by a certain distance, which is indicated as number of genomic bins separating the data points of interest: e.g. distance=4 in the following example.
Note that distances always range between 0 and number of rows or columns in contact matrix - 1. When distance is 0, we are fetching the vector of values corresponding to the interaction frequency between any given bin and itself. Similarly, 1 fetches values corresponding to the interaction frequency between any given bin and its immediate neighbour
Values <- Brick_get_values_by_distance(Brick = Brick.file,
chr = "chr19",
distance = 4)
Users have the flexibility to apply custom operations on data while they are retrieved from the Brick object. In the example below, the Hi-C contact frequencies from the specified diagonal are transformed in log10 scale and their median value is computed. Custom operations are applied by providing function definitions in the parameter FUN
.
Failsafe_median_log10 <- function(x){
x[is.nan(x) | is.infinite(x) | is.na(x)] <- 0
return(median(log10(x+1)))
}
Brick_get_values_by_distance(Brick = Brick.file,
chr = "chr19",
distance = 4,
FUN = Failsafe_median_log10)
## [1] 1.470637
Moreover, these functions can also subset the interaction values by a certain region of interest, such as TADs, by using the constrain.region
argument. Human readable coordinates can be provided to this particular paramenter in the form of chr:start:end
. Note that HiCBricks requires the delimiter of the coordinates to always be :
.
Failsafe_median_log10 <- function(x){
x[is.nan(x) | is.infinite(x) | is.na(x)] <- 0
return(median(log10(x+1)))
}
Brick_get_values_by_distance(Brick = Brick.file,
chr = "chr19",
distance = 4,
constrain.region = "chr19:1:5000000",
FUN = Failsafe_median_log10)
## [1] 1.664331
HiCBricks, by user-specified human readable coordinates, as defined above. Once again, we ask users to make note of the human readable coordinate format in the x.coords
and y.coords
parameters: the only accepted delimiter is :
.
Sub.matrix <- Brick_get_matrix_within_coords(Brick = Brick.file,
x.coords="chr19:5000000:10000000",
force = TRUE,
y.coords = "chr19:5000000:10000000")
dim(Sub.matrix)
## [1] 125 125
The range of genomic bins to be fetched can also be provided as rows and columns indexes. In this case users must be careful about how genomic coordinates are translated into bin indexes. For example, users may think the following code would return the the same values as above, but this is not the case.
x.axis <- 5000000/40000
y.axis <- 10000000/40000
Sub.matrix <- Brick_get_matrix(Brick = Brick.file,
chr1 = "chr19", chr2 = "chr19",
x.vector = c(x.axis:y.axis),
y.vector = c(x.axis:y.axis))
dim(Sub.matrix)
## [1] 126 126
Notice, that this selection has one more row and column. This is because the region of interest spans from 5000001:10000000, which starts from the x.axis + 1
and not from x.axis
.
Finally, it is also possible to fetch entire rows and columns from the contact matrix. Users can do so by specifying the exact bin name corresponding to names of the matrix rows or columns as indicated in the bintable. If these are names, it is required to specify by = "ranges"
.
Coordinate <- c("chr19:1:40000","chr19:40001:80000")
Brick.file <- system.file("extdata",
"test.hdf",
package = "HiCBricks")
Test_Run <- Brick_fetch_row_vector(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19",
by = "ranges",
vector = Coordinate)
As an alternative, users can also choose to fetch data by position by = positions
. In this case, the Coordinate vector provides indexes to the rows or columns to be fetched.
Coordinate <- c(1,2)
Brick.file <- system.file("extdata",
"test.hdf",
package = "HiCBricks")
Test_Run <- Brick_fetch_row_vector(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19",
by = "position",
vector = Coordinate)
If regions is provided, it will subset the corresponding row/col by the specified region. regions
must be in coordinate format as shown below.
Coordinate <- c(1,2)
Brick.file <- system.file("extdata",
"test.hdf",
package = "HiCBricks")
Test_Run <- Brick_fetch_row_vector(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19",
by = "position",
vector = Coordinate,
regions = c("chr19:1:1000000", "chr19:40001:2000000"))
There are several metrics which are computed at the time of matrix loading into the HDF file. Principally,
Users can list the names of all the possible matrix metadata columns.
Brick_list_matrix_mcols()
## bin.cov row.sums sparse
## "bin.coverage" "row.sums" "sparsity"
And then fetch one such metadata column
Brick.file <- system.file("extdata",
"test.hdf",
package = "HiCBricks")
MCols.dat <- Brick_get_matrix_mcols(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19",
what = "row.sums")
head(MCols.dat, 100)
## [1] 0.000 0.000 0.000 0.000 0.000 0.000 1198.362 1403.718
## [9] 1528.578 1738.523 1442.291 1130.102 1373.675 1404.193 1706.945 1758.811
## [17] 1591.738 1709.158 1826.415 1544.001 1459.131 1436.963 1739.197 1477.874
## [25] 1672.059 1667.533 2252.888 2135.970 1742.534 0.000 1829.932 1967.429
## [33] 2045.234 1767.151 1911.042 1792.624 1967.596 1615.469 1830.918 1910.503
## [41] 1903.489 2054.635 1749.439 1910.534 2087.551 1818.214 1650.842 1103.898
## [49] 2080.651 1994.859 1614.737 1680.188 2111.672 2034.416 1853.433 2001.805
## [57] 2034.529 2113.670 1968.811 1960.784 1891.692 1482.099 1617.554 1837.311
## [65] 1951.701 1970.485 1905.317 1716.711 1873.868 1862.430 2024.424 1717.958
## [73] 1885.324 1792.588 1535.255 1741.546 1735.451 1844.859 1932.773 1818.196
## [81] 1892.870 1983.289 2099.391 2001.333 1621.172 1596.689 1625.296 1846.462
## [89] 1770.970 1946.828 1986.571 1941.454 2026.139 1912.795 1923.558 1913.859
## [97] 2062.051 1843.883 1843.824 1815.478
There are several utility functions that a user may take advantage of to do various checks.
Brick_matrix_isdone(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19")
## [1] TRUE
Brick_matrix_issparse(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19")
## [1] FALSE
Brick_load_cis_matrix_till_distance
has been used.Brick_matrix_maxdist(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19")
## [1] 1479
Brick_matrix_exists(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19")
## [1] TRUE
Brick_matrix_minmax(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19")
## [1] 0.0000 674.4584
Brick_matrix_dimensions(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19")
## [1] 1479 1479
Brick_matrix_filename(Brick = Brick.file,
chr1 = "chr19",
chr2 = "chr19")
## [1] IMR90_RepA_ICEd_40000_chr19.mat.gz
## Levels: IMR90_RepA_ICEd_40000_chr19.mat.gz
Local score differentiator (LSD) is a TAD calling procedure based on the directionality index introduced by Dixon et al., 20123 Topological domains in mammalian genomes identified by analysis of chromatin interactions. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS and Ren B. Nature 2012. LSD is based on the computation of the directionality index (DI), as described in the original article, but the genome is partitioned into TADs based on the local directionality index distribution. Briefly, transition points between negative and positive values marking TAD boundaries are identified as the local extreme values in the first derivative of DI computed within a local window of user defined size. This has been implemented into HiCBricks as a test case example to show how custom data analysis procedures can be integrated inside the HiCBricks framework by taking advantage of its accessor functions.
Brick.file <- system.file("extdata",
"test.hdf", package = "HiCBricks")
Chromosome <- "chr19"
di_window <- 10
lookup_window <- 30
TAD_ranges <- Brick_local_score_differentiator(Brick = Brick.file,
chrs = Chromosome,
di.window = di_window,
lookup.window = lookup_window,
strict = TRUE,
fill.gaps=TRUE,
chunk.size = 500)
## [1] Computing DI for chr19
## [2] Computing DI Differences for chr19
## [2] Done
## [3] Fetching Outliers chr19
## [3] Done
## [4] Creating Domain list for chr19
## [4] Done
Briefly, the lookup.window
value corresponds to the local window used to identify local extreme changes in the DI values. Setting strict
to TRUE, adds another additional filter wherein the directionality index is required to be less than or greater than 0 at potential transition points identifying a domain boundary. LSD works by identifying domain starts and ends separately. If a particular domain start was not identified, but the adjacent domain end was identified, fill.gaps
if set to TRUE, will infer the adjacent bin from the adjacent domain end as a domain start bin and create a domain region with both start and end. Any domains identified by fill.gaps
are annotated under the level column in the resulting GRanges
object with the value 2. chunk.size
corresponds to the size of the square to retrieve and process per iteration.
As shown previously, we can store these TAD calls inside the Brick object.
Name <- paste("LSD",
di_window,
lookup_window,
Chromosome,sep = "_")
Brick_add_ranges(Brick = Path_to_cached_file,
ranges = TAD_ranges,
rangekey = Name)
## [1] TRUE
As shown previously, we can list the unique identifiers of the stored ranges (rangekeys) using the Brick_list_rangekeys
function and then retrieve them using the Brick_get_ranges
function.
Brick_list_rangekeys(Brick = Path_to_cached_file)
## [1] "Bintable" "LSD_10_30_chr19"
TAD_ranges <- Brick_get_ranges(Brick = Path_to_cached_file,
rangekey = Name)
Using HiCBricks
functions, users can generate sophisticated plot to visualize Hi-C contact matrices. The representation of contact matrices as a grid with numeric values mapped to a color gradient is commonly referred to as a “heatmap” plot. HiCBricks
allows users to plot one sample or two samples heatmaps. The following examples show multiple options available to generate heatmaps.
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-normal.pdf"),
Bricks = Brick.file,
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
palette = "Reds",
width = 10,
height = 11,
return.object=TRUE)
Please note the palette argument. It requires the user to provide a palette name from either the RColorBrewer or viridis packages colour palettes. At this time, HiCBricks does not allow the usage of user defined colour palettes.
In the examples above we are plotting the Hi-C signal (normalized read counts) which is expected to have a rapid decay when moving away from the diagonal. Log transformation of Hi-C signal is a popular choice to limit the range of values and have a more informative heatmap representation. We can apply a log10 transformation to the data before plotting, as in the example below, which results in a denser (less white spaces) heatmap.
Failsafe_log10 <- function(x){
x[is.na(x) | is.nan(x) | is.infinite(x)] <- 0
return(log10(x+1))
}
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-normal-colours-log10.pdf"),
Bricks = Brick.file,
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
legend.title = "Log10 Hi-C signal",
palette = "Reds",
width = 10,
height = 11,
return.object=TRUE)
Please note how we created a new custom function for log10 transformation. This as well as other user-defined custom functions to be applied on the data can be provided with the argument FUN
.
Sometimes, the Hi-C signal distribution is biased by outlier values which may stretch the range of values in the color gradient. To limit the range of the color gradient we can cap its maximum value to a specified percentile in the distribution with the value.cap
argument. This will avoid a skewed distribution of colors due to a few outliers with very high signal.
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-normal-colours-log10-valuecap-99.pdf"),
Bricks = Brick.file,
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.99,
legend.title = "Log10 Hi-C signal",
palette = "Reds",
width = 10,
height = 11,
return.object=TRUE)
value.cap takes as input a value ranging from 0,1 specifying the percentile at which the threshold will be applied.
Sometimes, it is desirable to plot the heatmap as a 45 degree rotated heatmap, i.e. in it’s triangular form. This can simply be obtained with the rotate=TRUE
parameter.
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-normal-colours-log10-rotate.pdf"),
Bricks = Brick.file,
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.99,
distance = 60,
legend.title = "Log10 Hi-C signal",
palette = "Reds",
width = 10,
height = 11,
rotate = TRUE,
return.object=TRUE)
To improve the appearance of the plot shown in the example above, we can modify the width
and height
as the rotated plots width is larger than their height.
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-normal-colours-log10-rotate-2.pdf"),
Bricks = Brick.file,
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.99,
distance = 60,
legend.title = "Log10 Hi-C signal",
palette = "Reds",
width = 15,
height = 5,
rotate = TRUE,
return.object=TRUE)
We can also add more features to this plot, such as the TADs identified in the previous examples.
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-normal-colours-log10-rotate-2-tads.pdf"),
Bricks = Brick.file,
tad.ranges = TAD_ranges,
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
colours = "#230C0F",
FUN = Failsafe_log10,
value.cap = 0.99,
legend.title = "Log10 Hi-C signal",
palette = "Reds",
width = 10,
height = 11,
return.object=TRUE)
Note that by using the distance
parameter we can limit the maximum distance from the diagonal up to which we plot the heatmap.
Please ignore the warning messages that appear with this code chunk. The warnings relate to parts of the TADs that are outside the bounds of the plotting area. The function does not remove these regions before plotting, therefore an error is generated.
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-normal-colours-log10-rotate-3-tads.pdf"),
Bricks = Brick.file,
tad.ranges = TAD_ranges,
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
colours = "#230C0F",
FUN = Failsafe_log10,
value.cap = 0.99,
distance = 60,
legend.title = "Log10 Hi-C signal",
palette = "Reds",
width = 15,
height = 5,
line.width = 0.8,
cut.corners = TRUE,
rotate = TRUE,
return.object=TRUE)
## Warning: Removed 3 rows containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_path).
HiCBricks allows to easily plot more complex figures, including bipartite heatmaps, i.e. showing two Hi-C samples in two halves of the same heatmap. Bipartite heatmaps can be plotted as squares plots or as 45 degrees rotated (triangular) heatmaps, with or without additional features such as TAD borders.
Due to space limitations placed on example datasets within Bioconductor packages, in this vignette example we will use the same dataset as before to showcase how two-sample heatmaps can be drawn using the HiCBricks package.
To plot a two sample heatmap, we need only to include an additional Brick file in the Brick
parameter. The data from the two brick files will be plotted in the upper and lower triangle, respectively. The first Brick
file will go to the upper triangle, whereas the second Brick
file will go to the lower triangle.
NOTE: The main diagonal will be set to the 0 in both plots as the main diagonal overlaps between the two contact matrices.
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-bipartite-colours-log10-valuecap-99.pdf"),
Bricks = c(Brick.file, Brick.file),
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.99,
legend.title = "Log10 Hi-C signal",
palette = "YlGnBu",
width = 10,
height = 11,
return.object=TRUE)
Since this Hi-C map is sparse there are few informative data points at larger distances. In this case the end-user may want to limit the plot by not showing longer distance interactions. This is achieved using the distance
parameter. Remember, that we can use any of the RColorBrewer
or viridis
colour palettes. For example, we can use the Red to Gray (name RdGy
) palette from RColorBrewer
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-bipartite-colours-log10-valuecap-99-2.pdf"),
Bricks = c(Brick.file, Brick.file),
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.99,
legend.title = "Log10 Hi-C signal",
palette = "RdGy",
distance = 30,
width = 10,
height = 11,
return.object=TRUE)
Finally, we can once again rotate the two sample heatmaps.
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-bipartite-colours-log10-valuecap-99-rotate.pdf"),
Bricks = c(Brick.file, Brick.file),
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.99,
legend.title = "Log10 Hi-C signal",
palette = "YlGnBu",
distance = 30,
width = 15,
height = 4,
rotate = TRUE,
return.object=TRUE)
HiCBricks also allows the possibility to plot TADs on the Bipartite heatmaps with categorical colours for each of the TAD calls. Although users may provide more than one category per sample, they should be aware that when TADs overlap, the TAD which is plotted at the end will always be the one that appears at the top, while other overlapping TADs will be hidden at the bottom.
As an example we will prepare a set of TAD calls and store them in the Brick object to compare them.
Brick.file <- system.file("extdata",
"test.hdf", package = "HiCBricks")
Chromosome <- "chr19"
di_windows <- c(5,10)
lookup_windows <- c(10, 20)
for (i in seq_along(di_windows)) {
di_window <- di_windows[i]
lookup_window <- lookup_windows[i]
TAD_ranges <- Brick_local_score_differentiator(Brick = Brick.file,
chrs = Chromosome,
di.window = di_window,
lookup.window = lookup_window,
strict = TRUE,
fill.gaps=TRUE,
chunk.size = 500)
Name <- paste("LSD",
di_window,
lookup_window,
Chromosome,sep = "_")
Brick_add_ranges(Brick = Path_to_cached_file, ranges = TAD_ranges,
rangekey = Name)
}
## [1] Computing DI for chr19
## [2] Computing DI Differences for chr19
## [2] Done
## [3] Fetching Outliers chr19
## [3] Done
## [4] Creating Domain list for chr19
## [4] Done
## [1] Computing DI for chr19
## [2] Computing DI Differences for chr19
## [2] Done
## [3] Fetching Outliers chr19
## [3] Done
## [4] Creating Domain list for chr19
## [4] Done
To plot these TAD calls, they need to be formatted correctly before plotting. This involves assigning categorical values to each of the TAD calls we want to plot. We will assign two categorical variables, one will map the TADs to their respective Hi-C map, whereas the other will map the TADs to their respective category.
Chromosome <- "chr19"
di_windows <- c(5,10)
lookup_windows <- c(10, 20)
TADs_list <- list()
for (i in seq_along(di_windows)) {
di_window <- di_windows[i]
lookup_window <- lookup_windows[i]
Name <- paste("LSD",
di_window,
lookup_window,
Chromosome,sep = "_")
TAD_ranges <- Brick_get_ranges(Brick = Path_to_cached_file,
rangekey = Name)
# Map TADs to their Hi-C maps
TAD_ranges$group <- i
# Map TADs to a specific categorical value for the colours
TAD_ranges$colour.group <- paste("LSD", di_window, lookup_window,
sep = "_")
TADs_list[[Name]] <- TAD_ranges
}
TADs_ranges <- do.call(c, unlist(TADs_list, use.names = FALSE))
As described in the manual, the two parameters, group.col
and
tad.colour.col
are relevant towards assigning any TAD to its respective Hi-C map or category, respectively. These two parameters take as input, the column names corresponding to their respective columns in the TADs_ranges
object.
Meanwhile, colours
and colours.names
are the relevant parameter for the colours of the TAD boundaries. colours
is a required parameter in case TAD boundaries are provided, whereas colours.names
can be left empty in case the user intends to provide unique(TAD_ranges$colour.group)
as the colour.names
.
Brick.file <- system.file("extdata",
"test.hdf", package = "HiCBricks")
Colours <- c("#B4436C", "#F78154")
Colour.names <- unique(TADs_ranges$colour.group)
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-bipartite-colours-log10-valuecap-99-tads.pdf"),
Bricks = c(Brick.file, Brick.file),
x.coords = "chr19:5000001:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.97,
legend.title = "Log10 Hi-C signal",
palette = "YlGnBu",
tad.ranges = TADs_ranges,
group.col = "group",
tad.colour.col = "colour.group",
colours = Colours,
colours.names = Colour.names,
distance = 30,
width = 9,
height = 11,
return.object=TRUE)
Brick.file <- system.file("extdata",
"test.hdf", package = "HiCBricks")
Colours <- c("#B4436C", "#F78154")
Colour.names <- unique(TADs_ranges$colour.group)
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-bipartite-colours-log10-valuecap-99-rotate-tads.pdf"),
Bricks = c(Brick.file, Brick.file),
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.97,
legend.title = "Log10 Hi-C signal",
palette = "YlGnBu",
tad.ranges = TADs_ranges,
group.col = "group",
tad.colour.col = "colour.group",
colours = Colours,
colours.names = Colour.names,
distance = 30,
width = 15,
height = 4,
rotate = TRUE,
return.object=TRUE)
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_path).
Note, that while creating rotated plots with TADs, if the parameter cut.corners
is not set to TRUE, then the default behaviour is to plot continuous lines. To truncate lines at the corners of TADs, users should set this parameter to value TRUE
.
Brick.file <- system.file("extdata",
"test.hdf", package = "HiCBricks")
Colours <- c("#B4436C", "#F78154")
Colour.names <- unique(TADs_ranges$colour.group)
Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-bipartite-colours-log10-valuecap-99-rotate-tads-2.pdf"),
Bricks = c(Brick.file, Brick.file),
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.97,
legend.title = "Log10 Hi-C signal",
palette = "YlGnBu",
tad.ranges = TADs_ranges,
group.col = "group",
tad.colour.col = "colour.group",
colours = Colours,
colours.names = Colour.names,
distance = 30,
width = 15,
height = 4,
cut.corners = TRUE,
rotate = TRUE,
return.object=TRUE)
## Warning: Removed 5 rows containing missing values (geom_path).
## Warning: Removed 5 rows containing missing values (geom_path).
There are several features that are not ideal in the above plots that can be fixed according to end-users preferences by adjusting additional parameters.
line.width
parameter.legend.key.width
and legend.key.height
parameters.Brick_vizart_plot_heatmap(File = file.path(tempdir(),
"chr19-5MB-10MB-bipartite-final.pdf"),
Bricks = c(Brick.file, Brick.file),
x.coords = "chr19:5000000:10000000",
y.coords = "chr19:5000001:10000000",
FUN = Failsafe_log10,
value.cap = 0.99,
legend.title = "Log10 Hi-C signal",
palette = "YlGnBu",
tad.ranges = TADs_ranges,
group.col = "group",
tad.colour.col = "colour.group",
colours = Colours,
colours.names = Colour.names,
distance = 30,
width = 15,
height = 4,
legend.key.width = unit(10, "mm"),
legend.key.height = unit(5, "mm"),
line.width = 1.2,
cut.corners = TRUE,
rotate = TRUE,
return.object=TRUE)
## Warning: Removed 5 rows containing missing values (geom_path).
## Warning: Removed 5 rows containing missing values (geom_path).
There are several additional parameters which can be used to modify tex elements in plots.
x.axis
and y.axis
to FALSE.x.axis.title
and y.axis.title
parameter.title
parameter.legend.title
parameter.x.axis.num.breaks
and y.axis.num.breaks
parameters.Finally, the parameters to modify text size in these individual elements are the following ones:
text.size
controls the font size across all plot elements, but is superseded by individual parameters.x.axis.text.size
and y.axis.text.size
control the text size on the x and y axis.legend.title.text.size
controls the font size of the legend title.legend.text.size
controls the font size of individual legend elements.title.size
controls the size of the plot title.