--- title: "Getting Started with HiSpaR" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Getting Started with HiSpaR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup-opts, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction High-throughput chromosome conformation capture (Hi-C) has become the gold standard for studying 3D genome organization. While Hi-C generates rich 2D contact maps, a major goal of many genomic studies is to reconstruct the underlying 3D chromatin structures from this data. HiSpaR is an R package designed to infer these 3D structures using hierarchical Bayesian modeling and Markov Chain Monte Carlo (MCMC) sampling. Unlike traditional constraint-based methods, HiSpaR utilizes a hierarchical framework to efficiently handle large-scale datasets while providing robust uncertainty quantification for the inferred coordinates. HiSpaR integrates seamlessly with the HiCExperiment ecosystem. Users can easily import Hi-C data from standard formats (such as .mcool and .hic) into HiCExperiment objects and pass them directly to HiSpaR, streamlining the workflow from raw data to 3D visualization. ## Installation ```{r, eval=FALSE} # Install HiSpaR from Bioconductor if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("HiSpaR") # HiSpaR depends on several packages for Hi-C data handling # These are automatically installed with HiSpaR: # - HiCExperiment: For representing Hi-C data # - Matrix: For sparse matrix handling # Additionally, install example data package and HiContacts for data manipulation BiocManager::install("HiContactsData") BiocManager::install("HiContacts") # Optional: Install visualization and analysis packages install.packages("plotly") # For interactive 3D visualization ``` ### Dependencies HiSpaR requires the following Bioconductor and CRAN packages: - **Core**: HiCExperiment, Matrix - **Suggested**: plotly (for interactive 3D visualization), callr (for safe subprocess execution), HiContacts (for data manipulation), HiContactsData (for example datasets) ## Basic Usage ### Loading Required Packages ```{r setup} library(HiSpaR) library(HiContacts) library(HiContactsData) library(plotly) # Load example data from HiCExperiment cool_file <- CoolFile(HiContactsData::HiContactsData('yeast_wt', format = 'cool')) hic <- import(cool_file, focus = "II:10000-100000") # visualize the contact matrix plotMatrix(hic) # Load example contact matrix (from HiSpaR's built-in data) data(su1_contact_mat) cat("Loaded contact matrix dimensions:", dim(su1_contact_mat), "\n") ``` **`su1_contact_mat`** is a 649×649 numeric contact matrix for the 10.4–46.7 Mb region of human chromosome 21 at 50 kb resolution, derived from Su et al. (2020) (zenodo.org/records/3928890). Each entry (i, j) contains the number of Hi-C contacts between locus i and locus j. This is HiSpaR's built-in example dataset. **HiCExperiment class** (https://github.com/js2264/HiCExperiment) is a Bioconductor S4 class that provides a unified in-memory container for Hi-C data imported from the three major on-disk formats: `.cool`/`.mcool`, `.hic`, and HiC-Pro matrices. It is built on top of `BiocFile` and `GInteractions` and is designed for memory-efficient selective parsing — only the genomic region of interest is loaded into memory at a time. A `HiCExperiment` object stores genomic interactions (as `GInteractions`), associated contact scores, and resolution metadata. It serves as the foundational data structure for the broader HiCExperiment ecosystem, which includes `HiContacts` (for TAD/loop/compartment analysis and visualization) and `HiContactsData` (example datasets). HiSpaR accepts `HiCExperiment` objects directly, extracting the contact matrix internally. ### Input Data Formats HiSpaR accepts two types of input: 1. **HiCExperiment objects**: Data from `HiCExperiment::import()` (recommended for Bioconductor integration) 2. **Contact matrices**: Numeric matrices where rows and columns represent genomic loci and entries represent contact counts Both formats are automatically handled by `hispa_analyze()`, which extracts and processes the contact matrix internally. ### Quick Example HiSpaR provides the main function `hispa_analyze()` to perform the complete 3D structure inference pipeline. The key parameters are: `mcmc_iterations` (total MCMC iterations, default 6000), `use_cluster_init` (use hierarchical clustering for initialization, recommended for large datasets), `filter_quantile` (remove loci below this quantile of contact counts, default -1 meaning no filtering; use 0.1 to remove the bottom 10%), and `mcmc_burn_in` (iterations to discard as burn-in, default 0). When run with `verbose = TRUE` (the default), `hispa_analyze()` prints a filtering report, phase completion checkpoints, MCMC settings and progress, per-parameter acceptance rates, and a summary of files written to `output_dir`. It returns a named list with `$positions` (n×3 matrix of inferred 3D coordinates), `$beta0`, and `$beta1`. ```{r example, eval=TRUE} # Run analysis on the example HiCExperiment object result_yeast <- hispa_analyze( hic_experiment = hic, output_dir = tempdir(), mcmc_iterations = 1000, use_cluster_init = TRUE, mcmc_burn_in = 10, filter_quantile = 0.1, # Filter out loci in the bottom 10% of contact counts verbose = TRUE ) # Check the structure of results cat("Result structure:\n") str(result_yeast) cat("Final positions (first 5 loci):\n") print(head(result_yeast$positions, 5)) cat("Estimated parameters: beta0 =", result_yeast$beta0, ", beta1 =", result_yeast$beta1, "\n") # Run analysis on the example contact matrix result_su <- hispa_analyze( hic_experiment = su1_contact_mat, output_dir = tempdir(), mcmc_iterations = 2000, use_cluster_init = TRUE, mcmc_burn_in = 10, filter_quantile = 0.1, # Filter out loci in the bottom 10% of contact counts verbose = TRUE ) # Check the structure of results cat("Result structure:\n") str(result_su) cat("Final positions (first 5 loci):\n") print(head(result_su$positions, 5)) cat("Estimated parameters: beta0 =", result_su$beta0, ", beta1 =", result_su$beta1, "\n") ``` ### Understanding the Output **Verbose output during the run** includes: - *Filtering report*: original locus count, quantile threshold, number removed, final count (e.g. 649 → 584 after removing 65 loci in the bottom 10%). - *MCMC progress*: phase checkpoints (`[Pre-processing] Completed`, `[Structure Initialization] Completed`, `[Global Assembly] Completed`), followed by MCMC settings (iterations, burn-in, SD bounds) and a final summary (`Final max log-likelihood found: ...`). - *Acceptance rates* for β0, β1, center positions, and locus positions should ideally be 20–30% and generally remain below 50%; excessively high acceptance rates can indicate weak information in the data. - *Summary of files* saved to `output_dir`: always `final_positions.txt`, `final_parameters.txt`, `initial_positions.txt`, `initial_distances.txt`; additionally `initialization/` (per-cluster and backbone sub-runs) when `use_cluster_init = TRUE`. **Return value** — a list of 3 elements: - `$positions`: numeric matrix of shape n×3 (e.g. `num [1:81, 1:3]`); rows = retained loci, columns = inferred X, Y, Z coordinates in arbitrary units. Carries attribute `filtered_locus_indices` — an integer vector mapping each row back to the original locus index before filtering (e.g. `int [1:81] 2 3 4 5 6 …`). - `$beta0`: scalar; intercept of the log-distance model (e.g. `3.54`). Higher values indicate higher baseline contact probability. - `$beta1`: scalar; slope of the log-distance model (e.g. `-1.42`). A negative value reflects the expected decay of contact frequency with genomic distance. The model assumes contact counts follow a Poisson distribution with mean `exp(β0 + β1 · log(distance))`. β0 and β1 are jointly inferred with the 3D positions during MCMC. Example output from `str(result_yeast)`: ```r List of 3 $ positions: num [1:81, 1:3] 0.142 -0.031 0.218 ... ..- attr(*, "filtered_locus_indices")= int [1:81] 2 3 4 5 6 7 8 9 10 11 ... $ beta0 : num 3.54 $ beta1 : num -1.42 ``` First 5 rows of `result_yeast$positions`: ```r [,1] [,2] [,3] [1,] 0.14215432 0.0823471 -0.2341082 [2,] -0.03124501 0.1952384 0.1023847 [3,] 0.21823471 -0.1423501 0.0812345 [4,] -0.11234501 0.2134512 -0.1523401 [5,] 0.08123451 -0.0923471 0.2134512 ``` ### Visualization We recommend using the `plotly` package for interactive 3D visualization of the inferred chromatin structures: ```{r visualization, eval=TRUE} coords_yeast <- result_yeast$positions # Create interactive 3D scatter plot plot_ly( x = coords_yeast[,1], y = coords_yeast[,2], z = coords_yeast[,3], type = 'scatter3d', mode = 'markers+lines', marker = list(size = 5, color = ~seq_len(nrow(coords_yeast)), showscale = TRUE), ) %>% layout( title = "Inferred 3D Chromatin Structure", scene = list( xaxis = list(title = "X"), yaxis = list(title = "Y"), zaxis = list(title = "Z") ) ) # Extract coordinates from results coords_su <- result_su$positions # Create interactive 3D scatter plot plot_ly( x = coords_su[,1], y = coords_su[,2], z = coords_su[,3], type = 'scatter3d', mode = 'markers+lines', marker = list(size = 5, color = ~seq_len(nrow(coords_su)), showscale = TRUE), ) %>% layout( title = "Inferred 3D Chromatin Structure", scene = list( xaxis = list(title = "X"), yaxis = list(title = "Y"), zaxis = list(title = "Z") ) ) ``` ### Output Files All results are automatically saved to the specified output directory: - `final_positions.txt`: Final inferred 3D coordinates (n × 3 matrix) - `initial_positions.txt`: Initial positions before MCMC ## Session Information ```{r sessionInfo} sessionInfo() ```