---
title: "HiCDOC"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
author: "Cyril Kurylo & Matthias Zytnicki & Sylvain Foissac & Elise MaignÃ©"
output:
    BiocStyle::html_document:
       fig_width: 7
       fig_height: 5
       toc_float: true
bibliography: library.bib
vignette: >
  %\VignetteIndexEntry{HiCDOC}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignettePackage{HiCDOC}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
options(warn=-1)
```

# Introduction

The aim of HiCDOC is to detect significant A/B compartment changes, using Hi-C
data with replicates.

HiCDOC normalizes intrachromosomal Hi-C matrices, uses unsupervised learning to
predict A/B compartments from multiple replicates, and detects significant
compartment changes between experiment conditions.

It provides a collection of functions assembled into a pipeline:

1. [Filter](#filtering-data):
    1. Remove chromosomes which are too small to be useful.
    2. Filter sparse replicates to remove uninformative replicates with few
    interactions.
    3. Filter positions (*bins*) which have too few interactions.
2. [Normalize](#normalizing-biases):
    1. Normalize technical biases (inter-matrix normalization) using
    cyclic loess normalization [@multihiccompare], so that
    matrices are comparable.
    2. Normalize biological biases (intra-matrix normalization) using
    Knight-Ruiz matrix balancing [@kr], so that
    all the bins are comparable.
    3. Normalize the distance effect, which results from higher interaction
    proportions between closer regions, with a MD loess.
3. [Predict](#predicting-compartments-and-differences):
    1. Predict compartments using
    constrained K-means [@kmeans].
    2. Detect significant differences between experiment conditions.
4. [Visualize](#visualizing-data-and-results):
    1. Plot the interaction matrices of each replicate.
    2. Plot the overall distance effect on the proportion of interactions.
    3. Plot the compartments in each chromosome, along with their concordance
    (confidence measure) in each replicate, and significant changes between
    experiment conditions.
    4. Plot the overall distribution of concordance differences.
    5. Plot the result of the PCA on the compartments' centroids.
    6. Plot the boxplots of self interaction ratios (differences between self
    interactions and the medians of other interactions) of each compartment,
    which is used for the A/B classification.

# Installation

HiCDOC can be installed from Bioconductor:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("HiCDOC")
```

The package can then be loaded:

```{r}
library(HiCDOC)
```

# Importing Hi-C data

HiCDOC can import Hi-C data sets in various different formats:
- Tabular `.tsv` files.
- Cooler `.cool` or `.mcool` files.
- Juicer `.hic` files.
- HiC-Pro `.matrix` and `.bed` files.

## Tabular files

A tabular file is a tab-separated multi-replicate sparse matrix with a header:

chromosome    position 1    position 2    C1.R1    C1.R2    C2.R1    ...
Y             1500000       7500000       145      184      72       ...
...

The number of interactions between `position 1` and `position 2` of
`chromosome` are reported in each `condition.replicate` column. There is no
limit to the number of conditions and replicates.

To load Hi-C data in this format:

```{r tabFormat, eval = FALSE}
hic.experiment <- HiCDOCDataSetFromTabular('path/to/data.tsv')
```

## Cooler files

To load `.cool` or `.mcool` files generated by [Cooler][cooler-documentation]
[@cooler]:

```{r coolFormat, eval = FALSE}
# Path to each file
paths = c(
    'path/to/condition-1.replicate-1.cool',
    'path/to/condition-1.replicate-2.cool',
    'path/to/condition-2.replicate-1.cool',
    'path/to/condition-2.replicate-2.cool',
    'path/to/condition-3.replicate-1.cool'
)

# Replicate and condition of each file. Can be names instead of numbers.
replicates <- c(1, 2, 1, 2, 1)
conditions <- c(1, 1, 2, 2, 3)

# Resolution to select in .mcool files
binSize = 500000

# Instantiation of data set
hic.experiment <- HiCDOCDataSetFromCool(
    paths,
    replicates = replicates,
    conditions = conditions,
    binSize = binSize # Specified for .mcool files.
)
```

## Juicer files

To load `.hic` files generated by [Juicer][juicer-documentation] [@juicer]:

```{r hicFormat, eval = FALSE}
# Path to each file
paths = c(
    'path/to/condition-1.replicate-1.hic',
    'path/to/condition-1.replicate-2.hic',
    'path/to/condition-2.replicate-1.hic',
    'path/to/condition-2.replicate-2.hic',
    'path/to/condition-3.replicate-1.hic'
)

# Replicate and condition of each file. Can be names instead of numbers.
replicates <- c(1, 2, 1, 2, 1)
conditions <- c(1, 1, 2, 2, 3)

# Resolution to select
binSize <- 500000

# Instantiation of data set
hic.experiment <- HiCDOCDataSetFromHiC(
    paths,
    replicates = replicates,
    conditions = conditions,
    binSize = binSize
)
```

## HiC-Pro files

To load `.matrix` and `.bed` files generated by [HiC-Pro][hicpro-documentation]
[@hicpro]:

```{r hicproFormat, eval = FALSE}
# Path to each matrix file
matrixPaths = c(
    'path/to/condition-1.replicate-1.matrix',
    'path/to/condition-1.replicate-2.matrix',
    'path/to/condition-2.replicate-1.matrix',
    'path/to/condition-2.replicate-2.matrix',
    'path/to/condition-3.replicate-1.matrix'
)

# Path to each bed file
bedPaths = c(
    'path/to/condition-1.replicate-1.bed',
    'path/to/condition-1.replicate-2.bed',
    'path/to/condition-2.replicate-1.bed',
    'path/to/condition-2.replicate-2.bed',
    'path/to/condition-3.replicate-1.bed'
)

# Replicate and condition of each file. Can be names instead of numbers.
replicates <- c(1, 2, 1, 2, 1)
conditions <- c(1, 1, 2, 2, 3)

# Instantiation of data set
hic.experiment <- HiCDOCDataSetFromHiCPro(
    matrixPaths = matrixPaths,
    bedPaths = bedPaths,
    replicates = replicates,
    conditions = conditions
)
```

# Running the HiCDOC pipeline

An example dataset can be loaded from the HiCDOC package:

```{r reloadExample}
data(exampleHiCDOCDataSet)
```

Once your data is loaded, you can run all the filtering, normalization, and
prediction steps with the command : **`HiCDOC(exampleHiCDOCDataSet)`**.
This one-liner runs all the steps detailed below.

## Filtering data

Remove small chromosomes of length smaller than 100 positions 
(100 is the default value):

```{r filterSmallChromosomes}
hic.experiment <- filterSmallChromosomes(exampleHiCDOCDataSet, threshold = 100)
```

Remove sparse replicates filled with less than 30% non-zero interactions
(30% is the default value):

```{r filterSparseReplicates}
hic.experiment <- filterSparseReplicates(hic.experiment, threshold = 0.3)
```

Remove weak positions with less than 1 interaction in average 
(1 is the default value):

```{r filterWeakPositions}
hic.experiment <- filterWeakPositions(hic.experiment, threshold = 1)
```

## Normalizing biases

### Technical biases

Normalize technical biases such as sequencing depth (inter-matrix 
normalization) so that  matrices are comparable :

```{r normalizeTechnicalBiases}
suppressWarnings(hic.experiment <- normalizeTechnicalBiases(hic.experiment))
```

This normalization uses uses cyclic loess normalization from [multiHiCcompare package] [@multihiccompare].
**Note** : For large dataset, it is highly recommended to set a value for 
`cycleLoessSpan` parameter to reduce computing time and necessary memory. See
`?HiCDOC::normalizeTechnicalBiases`


### Biological biases
Normalize biological biases, such as GC content, number of restriction sites,
etc. (intra-matrix normalization):

```{r normalizeBiologicalBiases}
hic.experiment <- normalizeBiologicalBiases(hic.experiment)
```

### Distance effect

Normalize the linear distance effect resulting from more interactions between 
closer genomic regions (20000 is the default value for `loessSampleSize`):

```{r normalizeDistanceEffect}
hic.experiment <- 
    normalizeDistanceEffect(hic.experiment, loessSampleSize = 20000)
```

## Predicting compartments and differences

Predict A and B compartments and detect significant differences, here using 
the default values as parameters:

```{r detectCompartments}
hic.experiment <- detectCompartments(hic.experiment)
```


# Visualizing data and results

Plot the interaction matrix of each replicate:

```{r plotInteractions}
p <- plotInteractions(hic.experiment, chromosome = "Y")
p
```

Plot the overall distance effect on the proportion of interactions:

```{r plotDistanceEffect}
p <- plotDistanceEffect(hic.experiment)
p
```

List and plot compartments with their concordance (confidence measure) in each
replicate, and significant changes between experiment conditions:

```{r extractCompartments}
compartments(hic.experiment)
```


```{r extractConcordances}
concordances(hic.experiment)
```


```{r extractDifferences}
differences(hic.experiment)
```


```{r plotCompartmentChanges}
p <- plotCompartmentChanges(hic.experiment, chromosome = "Y")
p
```

Plot the overall distribution of concordance differences:

```{r plotConcordanceDifferences}
p <- plotConcordanceDifferences(hic.experiment)
p
```

Plot the result of the PCA on the compartments' centroids:

```{r plotCentroids}
p <- plotCentroids(hic.experiment, chromosome = "Y")
p
```

Plot the boxplots of self interaction ratios (differences between self
interactions and the median of other interactions) of each compartment:

```{r plotSelfInteractionRatios}
p <- plotSelfInteractionRatios(hic.experiment, chromosome = "Y")
p
```


## Sanity checks

Sometimes, basic assumptions on the data are not met.
We try to detect inconsistencies, and warn the user.


### PCA checks

We perform a principal component analysis on the centroids.
Each centroid represent an ideal bin, located either on compartment A or B,
in each sample, and each chromosome.
Given a chromosome, if all the centroids of the A compartment do not have the
same sign on the first principal component, we issue a warning for this
chromosome.
Likewise for the B compartment.

We also check that the inertia on the first principal component is at least
75%.

These checks make sure that centroids of the same compartments do cluster
together.
If the conditions are too different from each other, they may cluster together.
This case is detected by this check.


### Compartment assignemnt

We use "self-interaction" in order to classify centroids to A and B
compartments.
The self-interaction of a bin is the ratio of the number of pairs that link this
bin with other bins of the same compartment, divided by the number of pairs
The self-interactions should be different from compartments A and B.
We perform a Wilcoxon t-test.
If it is not significant, then a warning is issued.


### Warnings

If at least of the PCA checks fail, the warnings are printed on the PCA plot.
If the compartment assignment check fail, the warning is printed on the
corresponding plot.

When accessing the compartments and the concordances, chromosomes which
fail to pass the checks are removed (unless the appropriate parameter is set).


# Session info

```{r sessionInfo}
sessionInfo()
```


# References

[cooler-documentation]: https://cooler.readthedocs.io/en/latest/
[juicer-documentation]: https://github.com/aidenlab/juicer/wiki/Data
[hicpro-documentation]: https://github.com/nservant/HiC-Pro