chipseqDBData 1.0.0
This package contains several ChIP-seq datasets for use in differential binding (DB) analyses.
These datasets are mainly used in the chipseqDB workflow (Lun and Smyth 2015) and the csaw user’s guide (Lun and Smyth 2016). This vignette will briefly demonstrate how to obtain each dataset and investigate some of the processing statistics.
We obtain the H3K9ac dataset from ExperimentHub using the H3K9acData() function. This downloads sorted BAM files and their associated index files to a local cache. The function returns a DataFrame of file paths and sample descriptions for use in downstream workflows.
library(chipseqDBData)
h3k9ac.paths <- H3K9acData()
h3k9ac.paths
## DataFrame with 4 rows and 3 columns
## Name Description
## <character> <character>
## 1 h3k9ac-proB-8113 pro-B H3K9ac (8113)
## 2 h3k9ac-proB-8108 pro-B H3K9ac (8108)
## 3 h3k9ac-matureB-8059 mature B H3K9ac (8059)
## 4 h3k9ac-matureB-8086 mature B H3K9ac (8086)
## Path
## <character>
## 1 /tmp/Rtmp8t5mql/file7e818de859b/h3k9ac-proB-8113.bam
## 2 /tmp/Rtmp8t5mql/file7e818de859b/h3k9ac-proB-8108.bam
## 3 /tmp/Rtmp8t5mql/file7e818de859b/h3k9ac-matureB-8059.bam
## 4 /tmp/Rtmp8t5mql/file7e818de859b/h3k9ac-matureB-8086.bam
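Because the returned object behaves like an ordinary table, samples can be selected by name or description before the paths are passed on. The sketch below mimics the output above with a base data.frame so that it runs without downloading anything. The Name and Description values are copied from the output, the Path values are placeholders rather than real cache locations, and the variable names (mock.paths, cell.type, proB.paths) are illustrative only. A real DataFrame from H3K9acData() is assumed to support the same `$` and `[` operations.

```r
# Stand-in for the table returned by H3K9acData(); Name and Description
# are copied from the output above, Path values are placeholders.
mock.paths <- data.frame(
    Name=c("h3k9ac-proB-8113", "h3k9ac-proB-8108",
        "h3k9ac-matureB-8059", "h3k9ac-matureB-8086"),
    Description=c("pro-B H3K9ac (8113)", "pro-B H3K9ac (8108)",
        "mature B H3K9ac (8059)", "mature B H3K9ac (8086)"),
    stringsAsFactors=FALSE
)
mock.paths$Path <- file.path("placeholder", paste0(mock.paths$Name, ".bam"))

# Derive a cell type label from the description and select the pro-B samples.
cell.type <- ifelse(grepl("^pro-B", mock.paths$Description), "proB", "matureB")
proB.paths <- mock.paths$Path[cell.type == "proB"]
proB.paths
```

A grouping vector such as cell.type is also the kind of input typically needed when setting up a design matrix for a DB analysis.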
Note that the time-consuming download only occurs upon the first use of the function. Later uses will simply re-use the same files, thus avoiding the need to re-download these large files. (Some readers may notice that the paths point to the temporary directory, which is destroyed at the end of each R session. Here, the temporary directory contains only soft-links to the persistent BAM files in the local cache. This is a low-cost illusion to ensure that the index files have the same prefixes as the BAM files.)
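The soft-link arrangement described in parentheses can be illustrated with base R alone. This is a minimal sketch under stated assumptions, not the package's actual implementation: the cache layout, the opaque file names and all variable names are hypothetical, and file.symlink() is assumed to be available (i.e., a Unix-like system).

```r
# Hypothetical cache directory with opaque file names (assumed layout,
# not necessarily what ExperimentHub or chipseqDBData actually uses).
cache.dir <- tempfile("cache")
dir.create(cache.dir)
cached.bam <- file.path(cache.dir, "1234_resource")
cached.bai <- file.path(cache.dir, "5678_resource")
writeLines("dummy BAM contents", cached.bam)
writeLines("dummy index contents", cached.bai)

# Link both files into a fresh directory under matching names, so that
# tools looking for "<file>.bam.bai" next to "<file>.bam" can find the index.
link.dir <- tempfile("links")
dir.create(link.dir)
bam.link <- file.path(link.dir, "sample.bam")
bai.link <- paste0(bam.link, ".bai")
linked <- file.symlink(c(cached.bam, cached.bai), c(bam.link, bai.link))

# Reading through the link returns the contents of the cached file.
via.link <- readLines(bam.link)

# Removing the link directory deletes only the links, not the cached files.
unlink(link.dir, recursive=TRUE)
still.cached <- file.exists(c(cached.bam, cached.bai))
```

The links cost almost nothing to create, which is why they can be regenerated in every session while the large files stay in place.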
The same approach is used for all of the other datasets, e.g., CBPData(), NFYAData().
Be aware that the initial download time will depend on the size and number of the BAM files in each dataset.
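Once a dataset has been downloaded, its footprint on disk can be checked from the returned paths. The helper below, total.size.gb(), is a hypothetical convenience function rather than part of chipseqDBData. It is demonstrated on two small temporary files so that it runs without any download, and it is assumed that the same call would work on a vector such as h3k9ac.paths$Path.

```r
# Hypothetical helper: total on-disk size in gigabytes for a set of files.
total.size.gb <- function(paths) {
    sum(file.size(paths), na.rm=TRUE) / 1e9
}

# Demonstration on two small temporary files standing in for BAM files.
demo.files <- c(tempfile(fileext=".bam"), tempfile(fileext=".bam"))
writeBin(raw(1000), demo.files[1])
writeBin(raw(2000), demo.files[2])
total.size.gb(demo.files)
```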
We use functions from the Rsamtools package to examine the mapping statistics. These include the number of mapped reads, the number of marked reads (i.e., potential PCR duplicates) and the number of high-quality alignments (mapped reads with a mapping quality score of at least 10).
library(Rsamtools)
diagnostics <- list()
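for (i in seq_len(nrow(h3k9ac.paths))) {
    # Extract the mapping quality and SAM flag of every read in this BAM file.
    stats <- scanBam(h3k9ac.paths$Path[i],
        param=ScanBamParam(what=c("mapq", "flag")))
    flag <- stats[[1]]$flag
    mapq <- stats[[1]]$mapq
    # A read is mapped if the 'unmapped' bit (0x4) is not set.
    mapped <- bitwAnd(flag, 0x4)==0
    diagnostics[[h3k9ac.paths$Name[i]]] <- c(
        Total=length(flag),
        Mapped=sum(mapped),
        HighQual=sum(mapq >= 10 & mapped),
        # Duplicate-marked reads have the 0x400 bit set.
        DupMarked=sum(bitwAnd(flag, 0x400)!=0)
    )
}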
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$DupMarked/diag.stats$Mapped*100
diag.stats
## Total Mapped HighQual DupMarked Prop.mapped
## h3k9ac-proB-8113 10724526 8832006 8815503 434884 82.35335
## h3k9ac-proB-8108 10413135 7793913 7786335 252271 74.84694
## h3k9ac-matureB-8059 16675372 4670364 4568908 396785 28.00756
## h3k9ac-matureB-8086 6347683 4551692 4535587 141583 71.70635
## Prop.marked
## h3k9ac-proB-8113 4.923955
## h3k9ac-proB-8108 3.236770
## h3k9ac-matureB-8059 8.495805
## h3k9ac-matureB-8086 3.110558
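The flag tests used in the loop above can be checked on a handful of hand-written values. The flags and mapping qualities below are made up for illustration and do not come from any of the BAM files; only the bit tests and the MAPQ threshold of 10 are taken from the code above.

```r
# Hand-written SAM flags: 0 = mapped forward, 4 = unmapped,
# 16 = mapped reverse, 1024 = mapped duplicate, 1040 = reverse duplicate.
flag <- c(0L, 4L, 16L, 1024L, 1040L)
mapq <- c(42L, 0L, 5L, 30L, 60L)

mapped <- bitwAnd(flag, 0x4) == 0       # bit 0x4 set means unmapped
marked <- bitwAnd(flag, 0x400) != 0     # bit 0x400 set means duplicate

# Same summary as computed per BAM file in the loop above.
toy.stats <- c(
    Total=length(flag),
    Mapped=sum(mapped),
    HighQual=sum(mapq >= 10 & mapped),
    DupMarked=sum(marked)
)
toy.stats
```

Note that the third read is mapped but falls below the quality threshold, so it counts towards Mapped and not HighQual.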
More comprehensive quality checks are beyond the scope of this document, but can be performed with other packages such as ChIPQC.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Rsamtools_2.0.0 Biostrings_2.52.0 XVector_0.24.0
## [4] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0 IRanges_2.18.0
## [7] S4Vectors_0.22.0 BiocGenerics_0.30.0 chipseqDBData_1.0.0
## [10] knitr_1.22 BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.6
## [3] purrr_0.3.2 htmltools_0.3.6
## [5] BiocFileCache_1.8.0 yaml_2.2.0
## [7] interactiveDisplayBase_1.22.0 blob_1.1.1
## [9] rlang_0.3.4 pillar_1.3.1
## [11] later_0.8.0 glue_1.3.1
## [13] DBI_1.0.0 BiocParallel_1.18.0
## [15] rappdirs_0.3.1 bit64_0.9-7
## [17] dbplyr_1.4.0 GenomeInfoDbData_1.2.1
## [19] zlibbioc_1.30.0 stringr_1.4.0
## [21] ExperimentHub_1.10.0 memoise_1.1.0
## [23] evaluate_0.13 Biobase_2.44.0
## [25] httpuv_1.5.1 curl_3.3
## [27] AnnotationDbi_1.46.0 Rcpp_1.0.1
## [29] xtable_1.8-4 promises_1.0.1
## [31] BiocManager_1.30.4 mime_0.6
## [33] bit_1.1-14 AnnotationHub_2.16.0
## [35] digest_0.6.18 stringi_1.4.3
## [37] bookdown_0.9 dplyr_0.8.0.1
## [39] shiny_1.3.2 tools_3.6.0
## [41] bitops_1.0-6 magrittr_1.5
## [43] RCurl_1.95-4.12 tibble_2.1.1
## [45] RSQLite_2.1.1 crayon_1.3.4
## [47] pkgconfig_2.0.2 assertthat_0.2.1
## [49] rmarkdown_1.12 httr_1.4.0
## [51] R6_2.4.0 compiler_3.6.0
Galvis, L. A., A. Z. Holik, K. M. Short, J. Pasquet, A. T. Lun, M. E. Blewitt, I. M. Smyth, M. E. Ritchie, and M. L. Asselin-Labat. 2015. “Repression of Igf1 expression by Ezh2 prevents basal cell differentiation in the developing lung.” Development 142 (8):1458–69.
Kasper, L. H., C. Qu, J. C. Obenauer, D. J. McGoldrick, and P. K. Brindle. 2014. “Genome-wide and single-cell analyses reveal a context dependent relationship between CBP recruitment and gene expression.” Nucleic Acids Res. 42 (18):11363–82.
Lun, A. T., and G. K. Smyth. 2015. “From reads to regions: a Bioconductor workflow to detect differential binding in ChIP-seq data.” F1000Res. 4:1080.
———. 2016. “csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows.” Nucleic Acids Res. 44 (5):e45.
Revilla-I-Domingo, R., I. Bilic, B. Vilagos, H. Tagoh, A. Ebert, I. M. Tamir, L. Smeenk, et al. 2012. “The B-cell identity factor Pax5 regulates distinct transcriptional programmes in early and late B lymphopoiesis.” EMBO J. 31 (14):3130–46.
Tiwari, V. K., M. B. Stadler, C. Wirbelauer, R. Paro, D. Schubeler, and C. Beisel. 2011. “A chromatin-modifying function of JNK during stem cell differentiation.” Nat. Genet. 44 (1):94–100.