--- title: "CTCF introduction" author: - name: Mikhail Dozmorov affiliation: - Virginia Commonwealth University email: mikhail.dozmorov@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('CTCF')`" vignette: > %\VignetteIndexEntry{Introduction to CTCF} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL, ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html warning = FALSE ) ``` # CTCF [![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable) `r BiocStyle::Biocpkg('CTCF')` defines an AnnotationHub resource representing genomic coordinates of [FIMO](https://meme-suite.org/meme/doc/fimo.html)-predicted CTCF binding sites for human and mouse genomes, including the [Telomere-to-Telomere](https://github.com/marbl/CHM13) and [mm39](http://genomeref.blogspot.com/2020/07/grcm39-new-mouse-reference-genome.html) genome assemblies. It also includes experimentally defined CTCF-bound cis-regulatory elements from [ENCODE SCREEN](https://screen.encodeproject.org/). TL;DR - for human hg38 genome assembly, use `hg38.MA0139.1.RData` ("AH104729"). For mouse mm10 genome assembly, use `mm10.MA0139.1.RData` ("AH104755"). For [ENCODE SCREEN](https://screen.encodeproject.org/) data, use `hg38.SCREEN.GRCh38_CTCF.RData` ("AH104730") or `mm10.SCREEN.mm10_CTCF.RData` ("AH104756") objects. The CTCF GRanges are named as `.`. The FIMO-predicted data includes extra columns with motif name, score, p-value, q-value, and the motif sequence. ## Installation instructions [Install](https://www.bioconductor.org/install/#install-R) the latest release of R, then get the latest version of Bioconductor by starting R and entering the commands: ```{r 'install1', eval = TRUE, message=FALSE, warning=FALSE} # if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install(version = "3.16") ``` Then, install additional packages using the following code: ```{r 'install2', eval = TRUE, message=FALSE, warning=FALSE} # BiocManager::install("AnnotationHub", update = FALSE) # BiocManager::install("GenomicRanges", update = FALSE) # BiocManager::install("plyranges", update = FALSE) ``` ## Example ```{r get-data} suppressMessages(library(AnnotationHub)) ah <- AnnotationHub() query_data <- subset(ah, preparerclass == "CTCF") # Explore the AnnotationHub object query_data # Get the list of data providers query_data$dataprovider %>% table() ``` We can find CTCF sites identified using JASPAR 2022 database in hg38 human genome ```{r} subset(query_data, species == "Homo sapiens" & genome == "hg38" & dataprovider == "JASPAR 2022") # Same for mm10 mouse genome # subset(query_data, species == "Mus musculus" & genome == "mm10" & dataprovider == "JASPAR 2022") ``` The `hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2` object contains CTCF sites detected using the [all three CTCF PWMs](https://jaspar.genereg.net/search?q=CTCF&collection=all&tax_group=all&tax_id=9606&type=all&class=all&family=all&version=all). To retrieve, we'll use: ```{r} # hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2 CTCF_hg38_all <- query_data[["AH104727"]] CTCF_hg38_all ``` The `hg38.MA0139.1` object contains CTCF sites detected using the most popular [MA0139.1](https://jaspar.genereg.net/matrix/MA0139.1/) CTCF PWM. To retrieve: ```{r} # hg38.MA0139.1 CTCF_hg38 <- query_data[["AH104729"]] CTCF_hg38 ``` It is always advisable to sort GRanges objects and keep standard chromsomes: ```{r} suppressMessages(library(plyranges)) CTCF_hg38_all <- CTCF_hg38_all %>% keepStandardChromosomes() %>% sort() CTCF_hg38 <- CTCF_hg38 %>% keepStandardChromosomes() %>% sort() ``` Save the data in a BED file, if needed. ```{r eval=FALSE} # Note that rtracklayer::import and rtracklayer::export perform unexplained # start coordinate conversion, likely related to 0- and 1-based coordinate # system. We recommend converting GRanges to a data frame and save tab-separated write.table(CTCF_hg38_all %>% sort() %>% as.data.frame(), file = "CTCF_hg38_all.bed", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE) write.table(CTCF_hg38 %>% sort() %>% as.data.frame(), file = "CTCF_hg38.bed", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE) ``` Create an [IGV](https://software.broadinstitute.org/software/igv/) XML session file out of the saved BED files using the `r BiocStyle::Biocpkg('tracktables')` package. See `vignette("tracktables", package = "tracktables")` for more details. ```{r eval=FALSE} library(tracktables) # BiocManager::install("tracktables") # Sample sheet metadata SampleSheet <- data.frame(SampleName = c("CTCF all", "CTCF MA0139.1"), Description = c("All CTCF matrices from JASPAR2022", "MA0139.1 CTCF matrix from JASPAR2022")) # File sheet linking files with sample names FileSheet <- data.frame(SampleName = c("CTCF all", "CTCF MA0139.1"), bigwig = c(NA, NA), interval = c("CTCF_hg38_all.bed", "CTCF_hg38.bed"), bam = c(NA, NA)) # Creating an IGV session XML file MakeIGVSession(SampleSheet, FileSheet, igvdirectory = getwd(), "CTCF_from_JASPAR2022", "hg38") ``` Note that the [FIMO](https://meme-suite.org/meme/doc/fimo.html) tool detects CTCF binding sites using the 1e-4 p-value threshold by default (the more significant p-value corresponds to the more confidently detected CTCF motif). We found that this threshold may be too permissive. Using the [ENCODE SCREEN](https://screen.encodeproject.org/) database as ground truth, we found 1e-6 as the optimal threshold providing approximately 80% true positive rate. However, less significant CTCF motifs may be cell type-specific or have weaker CTCF binding and therefore be missed by conventional peak callers. If cell type-specific CTCF binding is of interest, we recommend exploring less significant CTCF sites. ```{r echo=FALSE} knitr::include_graphics("../man/figures/Figure_human_pvalues_threshold.png") ``` To filter the GRanges object and keep high-confidence CTCF sites, use: ```{r} # Check length before filtering print(paste("Number of CTCF motifs at the default 1e-4 threshold:", length(CTCF_hg38))) # Filter and check length after filtering CTCF_hg38_filtered <- CTCF_hg38 %>% plyranges::filter(pvalue < 1e-6) print(paste("Number of CTCF motifs at the 1e-6 threshold:", length(CTCF_hg38_filtered))) # Similarly, filter CTCF_hg38_all_filtered <- CTCF_hg38_all %>% plyranges::filter(pvalue < 1e-6) ``` Given some databases provide multiple CTCF PWMs, one CTCF site may be detected multiple times resulting in overlapping CTCF sites. For example, the proportion of overlapping CTCF sites in the `CTCF_hg38_all_filtered` object containing CTCF sites detected by three matrices nearly 40%: ```{r} # Proportion of overlapping enrtries tmp <- findOverlaps(CTCF_hg38_all, CTCF_hg38_all) prop_overlap <- sort(table(queryHits(tmp)) %>% table(), decreasing = TRUE) sum(prop_overlap[which(names(prop_overlap) != "1")]) / length(CTCF_hg38_all) ``` The proportion of overlapping CTCF sites in the `CTCF_hg38_filtered` object containing CTCF sites detected by the [MA0139.1](https://jaspar.genereg.net/matrix/MA0139.1/) matrix is less than 2.5% ```{r} tmp <- findOverlaps(CTCF_hg38, CTCF_hg38) prop_overlap <- sort(table(queryHits(tmp)) %>% table(), decreasing = TRUE) sum(prop_overlap[which(names(prop_overlap) != "1")]) / length(CTCF_hg38) ``` Reducing them (merging overlapping CTCF sites), combined with 1E-6 cutoff filtering, yields the number of CTCF sites comparable to previously reported. ```{r} print(paste("Number of CTCF_hg38 motifs at the 1e-6 threshold AND reduced:", length(CTCF_hg38_filtered %>% reduce()))) print(paste("Number of CTCF_hg38_all motifs at the 1e-6 threshold AND reduced:", length(CTCF_hg38_all_filtered %>% reduce()))) ``` However, regulatory elements with CTCF proteins co-occupying adjacent/overlapping CTCF binding motifs were shown to be functionally and structurally different from those with single CTCF motifs. We provide non-reduced CTCF data and advise considering overlap of CTCF sites depending on the study's goal. # liftOver of CTCF coordinates As genome assemblies for model organisms continue to improve, CTCF sites for previous genome assemblies become obsolete. Typically, the actual genome sequence changes little, leading to changes in genomic coordinates. The [liftOver](https://genome.ucsc.edu/cgi-bin/hgLiftOver) method allows for conversion of genomic coordinates between genome assemblies. Some carefully curated CTCF sites are available only for older genome assemblies. Examples include the data from [CTCFBSDB](https://insulatordb.uthsc.edu/), available for hg18 and mm8 genome assemblies. To investigate whether liftOver of CTCF sites from older genome assemblies is a viable option, we tested for overlap between CTCF sites directly detected in specific genome assemblies with those lifted over. We detected CTCF sites using the [MA0139.1](https://jaspar.genereg.net/matrix/MA0139.1/) PWM from JASPAR 2022 database in hg18, hg19, hg38, and T2T genome assemblies and converted their genomic coordinates using the corresponding liftOver chains ([download_liftOver.sh](https://github.com/dozmorovlab/CTCF.dev/blob/main/scripts/download_liftOver.sh) and [convert_liftOver.sh](https://github.com/dozmorovlab/CTCF.dev/blob/main/scripts/convert_liftOver.sh) scripts). We observed high Jaccard overlap among CTCF sites detected in the original genome assemblies or lifted over. **Jaccard overlaps among CTCF binding sites detected in the original and liftOver human genome assemblies.** CTCF sites were detected using JASPAR 2022 MA0139.1 PWM. The correlogram was clustered using Euclidean distance and Ward.D clustering . White-red gradient indicate low-to-high Jaccard overlaps. Jaccard values are shown in the corresponding cells. ```{r echo=FALSE} knitr::include_graphics("../man/figures/Figure_liftOverJaccard.png") ``` Our results suggest that liftOver is a viable alternative to obtain CTCF genomic annotations for different genome assemblies. We provide [CTCFBSDB](https://insulatordb.uthsc.edu/) data converted to hg19 and hg38 genome assemblies. # CTCF Position Weight Matrices **CTCF PWM information.** "Motif" - individual motif IDs or the total number of motifs per database; "Length" - motif length of the range of lengths; "URL" - direct links to motif pages. Jaspar, Hocomoco, Jolma 2013 PWMs were downloaded from the [MEME database](https://meme-suite.org/meme/doc/download.html). ```{r echo=FALSE} mtx <- read.csv("../man/tables/Table_PWMs.csv") knitr::kable(mtx) ``` **CTCF motif logos.** PWMs from (A) MEME, (B) CTCFBSDB, (C) CIS-BP human, and (D) CIS-BP mouse databases. Clustering and alignment of motifs was performed using the `r `BiocStyle::Biocpkg("motifStack")` R package. ```{r echo=FALSE} knitr::include_graphics("../man/figures/Supplementary_Figure_1.png") ``` See [../inst/scripts/make-data.R](../inst/scripts/make-data.R) how the CTCF GRanges objects were created. # CTCF predicted and experimental data **Predefined CTCF binding data.** "Database" - source of data; "Number" - number of binding sites; "Assembly" - genome assembly; "URL" - direct link to data download. ```{r echo=FALSE} mtx <- read.csv("../man/tables/Table_Predefined.csv") knitr::kable(mtx) ``` # All GRanges objects included in the package **Summary of CTCF binding data provided in the package.** CTCF sites for each genome assembly and PWM combination were detected using FIMO. "ID" - object names formatted as `.`; "Assembly" - genome assembly, T2T - telomere to telomere (GCA_009914755.4) genome assembly; "All (p-value threshold Xe-Y)" - the total number of CTCF binding sites in the corresponding BED file at the Xe-Y threshold; "Non-overlapping (p-value threshold Xe-Y)" - number of non-overlapping CTCF binding sites (overlapping regions are merged) at the Xe-Y threshold. ```{r echo=FALSE} mtx <- read.csv("../man/tables/Table_log.csv") knitr::kable(mtx) ``` ## Citation Below is the citation output from using `citation('CTCF')` in R. Please run this yourself to check for any updates on how to cite __CTCF__. ```{r 'citation', eval = requireNamespace('CTCF')} print(citation("CTCF"), bibtex = TRUE) ``` This package was developed using `r BiocStyle::Biocpkg('biocthis')`. ## Code of Conduct Please note that the CTCF project is released with a [Contributor Code of Conduct](http://bioconductor.org/about/code-of-conduct/). By contributing to this project, you agree to abide by its terms. # Session information ```{r} sessionInfo() ```