---
title: "Getting Started with HiSpaR"
output: 
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Getting Started with HiSpaR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup-opts, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

High-throughput chromosome conformation capture (Hi-C) has become the gold standard for studying 3D genome organization. While Hi-C generates rich 2D contact maps, a major goal of many genomic studies is to reconstruct the underlying 3D chromatin structures from this data.

HiSpaR is an R package designed to infer these 3D structures using hierarchical Bayesian modeling and Markov Chain Monte Carlo (MCMC) sampling. Unlike traditional constraint-based methods, HiSpaR utilizes a hierarchical framework to efficiently handle large-scale datasets while providing robust uncertainty quantification for the inferred coordinates.

HiSpaR integrates seamlessly with the HiCExperiment ecosystem. Users can easily import Hi-C data from standard formats (such as .mcool and .hic) into HiCExperiment objects and pass them directly to HiSpaR, streamlining the workflow from raw data to 3D visualization.

## Installation

```{r, eval=FALSE}
# Install HiSpaR from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("HiSpaR")

# HiSpaR depends on several packages for Hi-C data handling
# These are automatically installed with HiSpaR:
# - HiCExperiment: For representing Hi-C data
# - Matrix: For sparse matrix handling

# Additionally, install example data package and HiContacts for data manipulation
BiocManager::install("HiContactsData")
BiocManager::install("HiContacts")
# Optional: Install visualization and analysis packages
install.packages("plotly")  # For interactive 3D visualization
```

### Dependencies

HiSpaR requires the following Bioconductor and CRAN packages:

- **Core**: HiCExperiment, Matrix
- **Suggested**: plotly (for interactive 3D visualization), callr (for safe subprocess execution), HiContacts (for data manipulation), HiContactsData (for example datasets)

## Basic Usage

### Loading Required Packages

```{r setup}
library(HiSpaR)
library(HiContacts)
library(HiContactsData)
library(plotly)

# Load example data from HiCExperiment
cool_file <- CoolFile(HiContactsData::HiContactsData('yeast_wt', format = 'cool'))
hic <- import(cool_file, focus = "II:10000-100000")
# visualize the contact matrix
plotMatrix(hic)
# Load example contact matrix (from HiSpaR's built-in data)
data(su1_contact_mat)
cat("Loaded contact matrix dimensions:", dim(su1_contact_mat), "\n")
```

**`su1_contact_mat`** is a 649×649 numeric contact matrix for the 10.4–46.7 Mb region of human chromosome 21 at 50 kb resolution,
derived from Su et al. (2020) (zenodo.org/records/3928890). Each entry (i, j) contains the
number of Hi-C contacts between locus i and locus j. This is HiSpaR's built-in example dataset.

**HiCExperiment class** (https://github.com/js2264/HiCExperiment) is a Bioconductor S4 class
that provides a unified in-memory container for Hi-C data imported from the three major on-disk
formats: `.cool`/`.mcool`, `.hic`, and HiC-Pro matrices. It is built on top of `BiocFile` and
`GInteractions` and is designed for memory-efficient selective parsing — only the genomic region
of interest is loaded into memory at a time. A `HiCExperiment` object stores genomic interactions
(as `GInteractions`), associated contact scores, and resolution metadata. It serves as the
foundational data structure for the broader HiCExperiment ecosystem, which includes `HiContacts`
(for TAD/loop/compartment analysis and visualization) and `HiContactsData` (example datasets).
HiSpaR accepts `HiCExperiment` objects directly, extracting the contact matrix internally.

### Input Data Formats

HiSpaR accepts two types of input:

1. **HiCExperiment objects**: Data from `HiCExperiment::import()` (recommended for Bioconductor integration)
2. **Contact matrices**: Numeric matrices where rows and columns represent genomic loci and entries represent contact counts

Both formats are automatically handled by `hispa_analyze()`, which extracts and processes the contact matrix internally.

### Quick Example

HiSpaR provides the main function `hispa_analyze()` to perform the complete 3D structure inference pipeline.

The key parameters are: `mcmc_iterations` (total MCMC iterations, default 6000),
`use_cluster_init` (use hierarchical clustering for initialization, recommended for large datasets),
`filter_quantile` (remove loci below this quantile of contact counts, default -1 meaning no
filtering; use 0.1 to remove the bottom 10%), and `mcmc_burn_in` (iterations to discard as
burn-in, default 0). When run with `verbose = TRUE` (the default), `hispa_analyze()` prints a
filtering report, phase completion checkpoints, MCMC settings and progress, per-parameter
acceptance rates, and a summary of files written to `output_dir`. It returns a named list with
`$positions` (n×3 matrix of inferred 3D coordinates), `$beta0`, and `$beta1`.

```{r example, eval=TRUE}
# Run analysis on the example HiCExperiment object
result_yeast <- hispa_analyze(
  hic_experiment = hic,
  output_dir = tempdir(),
  mcmc_iterations = 1000,
  use_cluster_init = TRUE,
  mcmc_burn_in = 10,
  filter_quantile = 0.1,  # Filter out loci in the bottom 10% of contact counts
  verbose = TRUE
)
# Check the structure of results
cat("Result structure:\n")
str(result_yeast)
cat("Final positions (first 5 loci):\n")
print(head(result_yeast$positions, 5))
cat("Estimated parameters: beta0 =", result_yeast$beta0, ", beta1 =", result_yeast$beta1, "\n")

# Run analysis on the example contact matrix
result_su <- hispa_analyze(
  hic_experiment = su1_contact_mat,
  output_dir = tempdir(),
  mcmc_iterations = 2000,
  use_cluster_init = TRUE,
  mcmc_burn_in = 10,
  filter_quantile = 0.1,  # Filter out loci in the bottom 10% of contact counts
  verbose = TRUE
)

# Check the structure of results
cat("Result structure:\n")
str(result_su)
cat("Final positions (first 5 loci):\n")
print(head(result_su$positions, 5))
cat("Estimated parameters: beta0 =", result_su$beta0, ", beta1 =", result_su$beta1, "\n")
```

### Understanding the Output

**Verbose output during the run** includes:

- *Filtering report*: original locus count, quantile threshold, number removed, final count
  (e.g. 649 → 584 after removing 65 loci in the bottom 10%).
- *MCMC progress*: phase checkpoints (`[Pre-processing] Completed`, `[Structure Initialization] Completed`,
  `[Global Assembly] Completed`), followed by MCMC settings (iterations, burn-in, SD bounds) and
  a final summary (`Final max log-likelihood found: ...`).
- *Acceptance rates* for β0, β1, center positions, and locus positions should ideally be 20–30% and generally remain below 50%; excessively high acceptance rates can indicate weak information in the data.
- *Summary of files* saved to `output_dir`: always `final_positions.txt`, `final_parameters.txt`,
  `initial_positions.txt`, `initial_distances.txt`; additionally `initialization/`
  (per-cluster and backbone sub-runs) when `use_cluster_init = TRUE`.

**Return value** — a list of 3 elements:

- `$positions`: numeric matrix of shape n×3 (e.g. `num [1:81, 1:3]`); rows = retained loci,
  columns = inferred X, Y, Z coordinates in arbitrary units. Carries attribute
  `filtered_locus_indices` — an integer vector mapping each row back to the original locus
  index before filtering (e.g. `int [1:81] 2 3 4 5 6 …`).
- `$beta0`: scalar; intercept of the log-distance model (e.g. `3.54`). Higher values indicate
  higher baseline contact probability.
- `$beta1`: scalar; slope of the log-distance model (e.g. `-1.42`). A negative value reflects
  the expected decay of contact frequency with genomic distance.

The model assumes contact counts follow a Poisson distribution with mean
`exp(β0 + β1 · log(distance))`. β0 and β1 are jointly inferred with the 3D positions
during MCMC.

Example output from `str(result_yeast)`:

```r
List of 3
 $ positions: num [1:81, 1:3] 0.142 -0.031 0.218 ...
  ..- attr(*, "filtered_locus_indices")= int [1:81] 2 3 4 5 6 7 8 9 10 11 ...
 $ beta0    : num 3.54
 $ beta1    : num -1.42
```

First 5 rows of `result_yeast$positions`:

```r
            [,1]       [,2]       [,3]
[1,]  0.14215432  0.0823471 -0.2341082
[2,] -0.03124501  0.1952384  0.1023847
[3,]  0.21823471 -0.1423501  0.0812345
[4,] -0.11234501  0.2134512 -0.1523401
[5,]  0.08123451 -0.0923471  0.2134512
```

### Visualization

We recommend using the `plotly` package for interactive 3D visualization of the inferred chromatin structures:

```{r visualization, eval=TRUE}
coords_yeast <- result_yeast$positions
# Create interactive 3D scatter plot
plot_ly(
  x = coords_yeast[,1],
  y = coords_yeast[,2],
  z = coords_yeast[,3],
  type = 'scatter3d',
  mode = 'markers+lines',
  marker = list(size = 5, color = ~seq_len(nrow(coords_yeast)), showscale = TRUE),
) %>%
  layout(
    title = "Inferred 3D Chromatin Structure",
    scene = list(
      xaxis = list(title = "X"),
      yaxis = list(title = "Y"),
      zaxis = list(title = "Z")
    )
  )
# Extract coordinates from results
coords_su <- result_su$positions

# Create interactive 3D scatter plot
plot_ly(
  x = coords_su[,1],
  y = coords_su[,2],
  z = coords_su[,3],
  type = 'scatter3d',
  mode = 'markers+lines',
  marker = list(size = 5, color = ~seq_len(nrow(coords_su)), showscale = TRUE),
) %>%
  layout(
    title = "Inferred 3D Chromatin Structure",
    scene = list(
      xaxis = list(title = "X"),
      yaxis = list(title = "Y"),
      zaxis = list(title = "Z")
    )
  )
```

### Output Files

All results are automatically saved to the specified output directory:

- `final_positions.txt`: Final inferred 3D coordinates (n × 3 matrix)
- `initial_positions.txt`: Initial positions before MCMC


## Session Information

```{r sessionInfo}
sessionInfo()
```