--- title: "'Motif2Site': an R package to detect binding sites from ChIP-seq and recenter them" author: - name: Peyman Zarrineh affiliation: - The University of Manchester email: peyman.zarrineh@manchester.ac.uk package: "`r BiocStyle::pkg_ver('Motif2Site')`" date: "`r format(Sys.Date(), '%B %d, %Y')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Motif2Site} %\VignettePackage{Motif2Site} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: Motif2Site.bib biblio-style: apalike abstract: >

'Motif2Site' is an R package to detect transcription factor binding sites from motif sets and ChIP-seq experiments. The motif sets are either user provided bed files or user provided DNA nucleotide sequence motifs. It also combines andcompares binding sites of a transcription factor across various experiments and conditions. --- # Introduction Transcription factors often bind to specific DNA nucleotide patterns referred to as sequence motifs. Although sequence motifs are well-characterized for many transcription factors, detecting the actual binding sites is not always a straightforward task. Chromatin immunoprecipitation (ChIP) followed by DNA sequencing (ChIP-seq) is the major technology to detect binding regions of transcription factors. However, the binding regions detected by ChIP-seq may contains several or no DNA sequence motifs. **Motif2Site** is a novel R package which uses ChIP-seq information and detect transcription factor binding sites from a user provided DNA sequence motifs set. **Motif2Site** gets two different input, motif and ChIP-seq alignment information, to detect binding sites. First input is ChIP-seq alignment short reads in the bam or bed format. For each aligned short read the center of the alignment is calculated using `r Biocpkg("GenomicAlignments")` [@Lawrence2013] and `r Biocpkg("Rsamtools")` [@Morgan2020] packages. Motif information is the second input which is provided by user either as a bed file or a DNA string with a mismatch number. In the case of DNA string input, `r Biocpkg("Biostrings")` [@Pages2019] and `r Biocpkg("BSgenome")` [@Herve2019] packages are used to find motif locations on the genome in `r Biocpkg("GenomicRanges")` [@Lawrence2013] format. Negative binomial distribution is used to model count data by using `r Biocpkg("edgeR")` package [@Robinson2010]. `r CRANpkg("mixtools")` [@Benaglia2009] is used to deconvolve binding intensities of closely spaced binding sites. **Motif2Site** also combines binding sites across different experiments. It calls differential binding sites using TMM normalization and GLM test using `r Biocpkg("edgeR")` package [@Robinson2010]. # Major functions of Motif2Site First install and load the libraries needed to run the examples of this document: ```{r, results="hide", warning=FALSE, message=FALSE} library(GenomicRanges) library(Motif2Site) library(BSgenome.Scerevisiae.UCSC.sacCer3) library(BSgenome.Ecoli.NCBI.20080805) ``` The functions, implemented in **Motif2Site**, perform three tasks: **1.** To assist users to select better sequence motif input. **2.** To detect binding sites from sequence motifs and ChIP-seq datasets **3.** To combine and compare binding sites across different experiments, conditions, or tissues. Each of these functions are explained in a separate section. # Selecting sequence motif **Motif2Site** uses DNA motif information as one of its input. To facilitate choosing proper sequence motif, `compareBedFiless2UserProvidedRegions` and `compareMotifs2UserProvidedRegions` functions compare motif regions with a user provided confident regions in terms of precision and recall. As the first example, an artificially generated "YeastSampleMotif.bed" bed file in yeast is considered as a confident binding regions set. `compareMotifs2UserProvidedRegions` function compares these regions with locations of 'TGATTSCAGGANT' 1-mismatch, 'TGATTCCAGGANT' 0-mismatch, 'TGATWSCAGGANT' 2-mismatches on the yeast genome in terms of precision and recall. ```{r, warning=FALSE, message=FALSE, fig.width=10, fig.height=6} yeastExampleFile = system.file("extdata", "YeastSampleMotif.bed", package="Motif2Site") YeastRegionsChIPseq <- Bed2Granges(yeastExampleFile) SequenceComparison <- compareMotifs2UserProvidedRegions( givenRegion=YeastRegionsChIPseq, motifs = c("TGATTSCAGGANT", "TGATTCCAGGANT", "TGATWSCAGGANT"), mismatchNumbers = c(1,0,2), genome="Scerevisiae", genomeBuild="sacCer3" ) SequenceComparison ``` In the following example, an artificially generated "YeastSampleMotif.bed" bed file in yeast is considered as a confident binding regions set. `compareBedFiless2UserProvidedRegions` function compares these regions with two bed files "YeastBedFile1.bed" and "YeastBedFile2.bed" in terms of precision and recall. ```{r, warning=FALSE, message=FALSE, fig.width=8, fig.height=6} # Yeast artificial dataset for comparison bed files yeastExampleFile = system.file("extdata", "YeastSampleMotif.bed", package="Motif2Site") YeastRegionsChIPseq <- Bed2Granges(yeastExampleFile) bed1 <- system.file("extdata", "YeastBedFile1.bed", package="Motif2Site") bed2 <- system.file("extdata", "YeastBedFile2.bed", package="Motif2Site") BedFilesVector <- c(bed1, bed2) SequenceComparison <- compareBedFiless2UserProvidedRegions( givenRegion=YeastRegionsChIPseq, bedfiles = BedFilesVector, motifnames = c("YeastBed1", "YeastBed2") ) SequenceComparison ``` # Detecting binding sites `DetectBindingSitesMotif` function detects binding sites from provided sequence motif information. Here, Artificial ChIP-seq data for FUR transcription factor in Escherichia coli was generated in two conditions fe2+ and dpd inspired by [@Seo2014]. In the following examples, artificial sequence motif locations have been and provided as a bed file called 'FurMotifs.bed'. The alignment of ChIP-seq short reads is the other input of this function. The alignment files can be passed to the function as bam or bed files. In the following examples both IP and background alignment files have been passed as single-end bed files to this function. ```{r, message=FALSE, warning=FALSE} # FUR candidate motifs in NC_000913 E. coli FurMotifs = system.file("extdata", "FurMotifs.bed", package="Motif2Site") # ChIP-seq FUR fe datasets binding sites from user provided bed file # ChIP-seq datasets in bed single end format IPFe <- c(system.file("extdata", "FUR_fe1.bed", package="Motif2Site"), system.file("extdata", "FUR_fe2.bed", package="Motif2Site")) Inputs <- c(system.file("extdata", "Input1.bed", package="Motif2Site"), system.file("extdata", "Input2.bed", package="Motif2Site")) FURfeBedInputStats <- DetectBindingSitesBed(BedFile=FurMotifs, IPfiles=IPFe, BackgroundFiles=Inputs, genome="Ecoli", genomeBuild="20080805", DB="NCBI", expName="FUR_Fe_BedInput", format="BEDSE" ) FURfeBedInputStats # ChIP-seq FUR dpd datasets binding sites from user provided bed file # ChIP-seq datasets in bed single end format IPDpd <- c(system.file("extdata", "FUR_dpd1.bed", package="Motif2Site"), system.file("extdata", "FUR_dpd2.bed", package="Motif2Site")) FURdpdBedInputStats <- DetectBindingSitesBed(BedFile=FurMotifs, IPfiles=IPDpd, BackgroundFiles=Inputs, genome="Ecoli", genomeBuild="20080805", DB="NCBI", expName="FUR_Dpd_BedInput", format="BEDSE" ) FURdpdBedInputStats ``` `DetectBindingSitesMotif` function also works with DNA string motifs. In the following example, FUR binding sites in fe2+ condition are detected from 'GWWTGAGAA' with 1-mismatch motif. The dataset is generated only for 'NC_000913' build. Therefore, in this example the coordinates of this regions are provided as a 'GivenRegion' field. Providing this field ensures that binding sites are only detected in the given regions. This will accelerate the peak calling, and also improve the prediction accuracy. ```{r, message=FALSE, warning=FALSE} # Granages region for motif search NC_000913_Coordiante <- GRanges(seqnames=Rle("NC_000913"), ranges = IRanges(1, 4639675)) # ChIP-seq FUR fe datasets binding sites from user provided string motif # ChIP-seq datasets in bed single end format FURfeStringInputStats <- DetectBindingSitesMotif(motif = "GWWTGAGAA", mismatchNumber = 1, IPfiles=IPFe, BackgroundFiles=Inputs, genome="Ecoli", genomeBuild="20080805", DB= "NCBI", expName="FUR_Fe_StringInput", format="BEDSE", GivenRegion = NC_000913_Coordiante ) FURfeStringInputStats ``` # Combining binding sites across experiments `recenterBindingSitesAcrossExperiments` function combines the binding sites of different tissues or conditions into a single count matrix. In the FUR example, at the first step it combines fe2+ and dpd binding sites. At the next step, it recalculates the p-adjusted values. To ensure the high quality of the combined binding site set, an stringent cross-experiment FDR cutoff is applied (default 0.001). The accepted binding sites should fullfill this cutoff at least in one experiment. Another FDR cutoff value (default 0.05) is used to assign binding or non-binding labels to each binding site for each experiment. ```{r, message=FALSE, warning=FALSE} # Combine FUR binding sites from bed input into one table corMAT <- recenterBindingSitesAcrossExperiments( expLocations=c("FUR_Fe_BedInput","FUR_Dpd_BedInput"), experimentNames=c("FUR_Fe","FUR_Dpd"), expName="combinedFUR" ) corMAT FurTable <- read.table(file.path("combinedFUR","CombinedMatrix"), header = TRUE, check.names = FALSE ) FurBindingTotal <- GRanges(seqnames=Rle(FurTable[,1]), ranges = IRanges(FurTable[,2], FurTable[,3]) ) FurFe <- FurBindingTotal[which((FurTable$FUR_Fe_binding =="Binding")==TRUE)] FurDpd <- FurBindingTotal[which((FurTable$FUR_Dpd_binding =="Binding")==TRUE)] findOverlaps(FurFe,FurDpd) ``` `pairwisDifferential` function uses edgeR TMM normalization and GLM test to detect differential binding sites across two experiments. In the following example it takes combined FUR count matrix and detect differential binding sites across fe2+ and dpd. ```{r, message=FALSE, warning=FALSE} # Differential binding sites across FUR conditions fe vs dpd diffFUR <- pairwisDifferential(tableOfCountsDir="combinedFUR", exp1="FUR_Fe", exp2="FUR_Dpd", FDRcutoff = 0.05, logFCcuttoff = 1 ) FeUp <- diffFUR[[1]] DpdUp <- diffFUR[[2]] TotalComparison <- diffFUR[[3]] head(TotalComparison) ``` ```{r echo=FALSE, results='hide',message=FALSE} # Remove folders unlink("FUR_Fe_BedInput", recursive = TRUE) unlink("FUR_Dpd_BedInput", recursive = TRUE) unlink("FUR_Fe_StringInput", recursive = TRUE) unlink("combinedFUR", recursive = TRUE) ``` # Session Info ```{r sessionInfo} sessionInfo() ``` # References