--- title: "Bioconductor Package Vignette" author: "Tsu-Pei Chiu, Federico Comoglio and Remo Rohs" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{DNAshapeR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction DNAshapeR predicts DNA shape features in an ultra-fast, high-throughput manner from genomic sequencing data. The package takes either nucleotide sequence or genomic intervals as input, and generates various graphical representations for further analysis. DNAshapeR further encodes DNA sequence and shape features for statistical learning applications by concatenating feature matrices with user-defined combinations of k-mer and DNA shape features that can be readily used as input for machine learning algorithms. In this vignette, you will learn: * how to load/install DNAshapeR * how to predict DNA shape features * how to visualize DNA shape predictions * how to encode sequence and shape features, and apply them ## Load DNAshapeR ```{r eval=TRUE} library(DNAshapeR) ``` ## Predict DNA shape features The core of DNAshapeR, the DNAshape method (Zhou, et al., 2013), uses a sliding pentamer window where structural features unique to each of the 512 distinct pentamers define a vector of minor groove width (MGW), Roll, propeller twist (ProT), and helix twist (HelT) at each nucleotide position. MGW and ProT define base-pair parameters whereas Roll and HelT represent base pair-step parameters. DNAshapeR can predict DNA shape features from custom FASTA files or directly from genomic coordinates in the form of a GRanges object within Bioconductor (see for more information). ### From FASTA file To predict DNA shape features from a FASTA file ```{r eval=TRUE} library(DNAshapeR) fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR") pred <- getShape(fn) ``` ### From genomic intervals (e.g. TFs binding sites, CpG islands, replication origins, ...) To predict DNA shape from genomic intervals stored as GRanges object, a reference genome is required. Several reference genomes are available within BioConductor as BSgenome objects (see for more information). For example, the sacCer3 release of the *S.Cerevisiae* genome can be retrieved. Given a reference genome, the **getFasta** function first extracts the DNA sequences based on the provided genomic coordinates, and then performs shape predictions within a user-defined window (of size equal to width, 100 bp in the example below) computed from the center of each genomic interval: ```{r eval=FALSE} # Install Bioconductor packages source("http://bioconductor.org/biocLite.R") biocLite("BSgenome.Scerevisiae.UCSC.sacCer3") library(BSgenome.Scerevisiae.UCSC.sacCer3) # Create a query GRanges object gr <- GRanges(seqnames = c("chrI"), strand = c("+", "-", "+"), ranges = IRanges(start = c(100, 200, 300), width = 100)) getFasta(gr, Scerevisiae, width = 100, filename = "tmp.fa") fn <- "tmp.fa" pred <- getShape(fn) ``` ### From public domain projects The genomic intervals can also be obtained from public domain projects, including ENCODE, NCBI, Ensembl, etc. The AnnotationHub package (see for more information) provides an interface to retrieve genomic intervals from these multiple online project resources.The genomic intervals of interest can be selected progressively through the functions of **sebset** and **query** with keywords, and can be subjected as an input of GRanges object to **getFasta** function. ```{r eval=FALSE} # Install Bioconductor packages library(BSgenome.Hsapiens.UCSC.hg19) library(AnnotationHub) ah <- AnnotationHub() ah <- subset(ah, species=="Homo sapiens") ah <- query(ah, c("H3K4me3", "Gm12878", "Roadmap")) getFasta(ah[[1]], Hsapiens, width = 150, filename = "tmp.fa") fn <- "tmp.fa" pred <- getShape(fn) ``` ## Visualize DNA shape prediction DNAshapeR can be used to generate various graphical representations for further analyses. The prediction result can be visualized in the form of scatter plots (Comoglio, et al., 2015), heat maps (Yang, et al., 2014), or genome browser tracks (Chiu, et al., 2014). ### Ensemble representation: metashape plot The prediction result can be visualized in the metaprofiles of DNA shape features. ```{r fig.width=7, fig.height=7, fig.align='center', eval=TRUE} plotShape(pred$MGW) #plotShape(pred$ProT) #plotShape(pred$Roll) #plotShape(pred$HelT) ``` ### Ensemble representation: heat map The prediction result can be visualized in the heatmap of DNA shape features. ```{r fig.width=7, fig.height=7, fig.align='center', eval=TRUE} library(fields) heatShape(pred$ProT, 20) #heatShape(pred$MGW, 20) #heatShape(pred$Roll[1:500, 1:1980], 20) #heatShape(pred$HelT[1:500, 1:1980], 20) ``` ### Individual representation: genome browser-like tracks The prediction result can be visualized in the form of genome browser tracks. *Note that the input data should only contain one sequence. ```{r fig.width=7, fig.height=7, fig.align='center', eval=TRUE} fn2 <- system.file("extdata", "SingleSeqsample.fa", package = "DNAshapeR") pred2 <- getShape(fn2) trackShape(fn2, pred2) # Only for single sequence file ``` ## Encode sequence and shape features DNAshapeR can be used to generate feature vectors for a user-defined model. These models can consist of either sequence features (1-mer, 2-mer, 3-mer), shape features (MGW, Roll, ProT, HelT), or any combination of those two. For 1-mer features, sequence is encoded in form of four binary numbers (i.e., in terms of 1-mers, 0001 for adenine, 0010 for cytosine, 0100 for guanine, and 1000 for thymine) at each nucleotide position (Zhou, et al., 2015). The encoding function of the DNAshapeR package enables the determination of higher order sequence features, for example, 2-mers and 3-mers (16 and 64 binary features at each position, respectively). The user can also choose to include second order shape features in the generated feature vector. The second order shape features are product terms of values for the same category of shape features (MGW, Roll, ProT or HelT) at adjacent positions. The feature encoding function of DNAshapeR enables the generation of any subset of these features, either only a selected shape category or first order shape features, and any combination with shape or sequence features. The result of feature encoding for each sequence is a chimera feature vector. ### Encoding process A feature type vector should be defined before encoding. The vector can be any combination of characters of “k-mer”, “n-shape”, “n-MGW”, “n-ProT”, “n-Roll”, “n-HelT” (k, n are integers) where “1-shape” refers to first order and “2-shape” to second order shape features. ```{r eval=TRUE} library(Biostrings) fn3 <- system.file("extdata", "SELEXsample_short.fa", package = "DNAshapeR") pred3 <- getShape(fn3) featureType <- c("1-mer", "1-shape") featureVector <- encodeSeqShape(fn3, pred3, featureType) head(featureVector) ``` ### Showcase of statistical machine learning application Feature encoding of multiple sequences thus results in a feature matrix, which can be used as input for variety of statistical machine learning methods. For example, an application is the quantitative modeling of SELEX-seq derived protein-DNA binding by linear regression as demonstrated below. First, pre-computed binding affinity values are combined with experimental information in a data frame structure. ```{r eval=TRUE} fn4 <- system.file("extdata", "SELEXsample_short.s", package = "DNAshapeR") experimentalData <- read.table(fn4) df <- data.frame(affinity=experimentalData$V1, featureVector) ``` Then, a machine learning package (which can be any learning tools) is used to train a multiple linear regression (MLR) model based on 3-fold cross-validation. In this example, we used the caret package (see for more information). ```{r eval=TRUE} library(caret) trainControl <- trainControl(method = "cv", number = 3, savePredictions = TRUE) model <- train (affinity~ ., data = df, trControl=trainControl, method="lm", preProcess=NULL) model ``` ## Session Info ```{r eval=TRUE} sessionInfo() ``` ## References Chiu, T.P., et al. GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res 2014. Comoglio, F., et al. High-resolution profiling of Drosophila replication start sites reveals a DNA shape and chromatin signature of metazoan origins. Cell reports 2015;11(5):821-834. Yang, L., et al. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res 2014;42 (Database issue):D148-155. Zhou, T., et al. DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic Acids Res 2013;41 (Web Server issue):W56-62. If you use DNAshapeR for your work, please cite the following publication: Chiu, T-P., et al. DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding (2015). Submitted