%\VignetteIndexEntry{An introduction to exomePeak} %\VignetteDepends{} %\VignetteKeywords{peak detection, differential methylation} %\VignettePackage{exomePeak} \documentclass[]{article} \usepackage{times} \usepackage{hyperref} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\exomePeak}{\Rpackage{exomePeak}} \newcommand{\bam}{\texttt{BAM}} \title{An Introduction to \Rpackage{exomePeak}} \author{Jia Meng, PhD} \date{Modified: 21 Dec, 2015. Compiled: \today} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle <>= options(width=60) @ \section{Introduction} The \exomePeak{} R-package has been developed based on the MATLAB \Rpackage{exomePeak} package, for the analysis of RNA epitranscriptome sequencing data with affinity-based shotgun sequencing approach, such as MeRIP-Seq or m6A-Seq. The exomePeak package is under active development, please don't hesitate to contact me @ jia.meng@hotmail if you have any questions. The inputs of the main function \Rpackage{exomepeak} are the IP BAM files and input control BAM files with optional gene annotation file. The \Rpackage{exomePeak} package fullfills the following two key functions: \begin{itemize} \item Conduct peak calling to identify the RNA methylation sites under a specific condition \item Conduct peak calling to identify the RNA methylation sites and then identify the differential methylation site which can be differentially regulated at epitranscriptomic layer by RNA modifications. \end{itemize} Gene annotation can be provided as a GTF file, a \Rclass{TxDb} object, or automatically downloaded from UCSC through the internet. We will in the next see how the two main functions can be accomplished in a single command. \section{Peak Calling} Let us firstly load the package and get the toy data (came with the package) ready. <>= library("exomePeak") gtf=system.file("extdata", "example.gtf", package="exomePeak") f1=system.file("extdata", "IP1.bam", package="exomePeak") f2=system.file("extdata", "IP2.bam", package="exomePeak") f3=system.file("extdata", "IP3.bam", package="exomePeak") f4=system.file("extdata", "IP4.bam", package="exomePeak") f5=system.file("extdata", "Input1.bam", package="exomePeak") f6=system.file("extdata", "Input2.bam", package="exomePeak") f7=system.file("extdata", "Input3.bam", package="exomePeak") @ The first main function of \Rfunction{exomePeak} R-package is to call peaks (enriched binding sites) to detect RNA methylation sites on the exome. Inputs are the gene annotation GTF file, IP and Input control samples in BAM format. This function is used when data from only one condition is available. <>= result = exomepeak(GENE_ANNO_GTF=gtf, IP_BAM=c(f1,f2,f3,f4), INPUT_BAM=c(f5,f6,f7)) names(result) @ The results will be saved in the specified output directory, including the identified peaks and consistent peaks in BED or XLS (tab-delimited) format. The BED format can be visualized in genome browser directly and the peaks may span one or multiple introns. The difference between peak and consistent peak is that, the peaks in the con{\_|peak file are consitently enriched in all the IP replicates, so indicates higher re-producability. The first 12 columns in both the BED and the XLS are the same as a standard BED12 format: http://genome.ucsc.edu/FAQ/FAQformat.html{\#}format1 \begin{itemize} \item chrom - The name of the chromosome (e.g. chr3, chrY, chr2{\_}random) or scaffold (e.g. scaffold10671). \item chromStart - The starting position of the methylation site in the chromosome or scaffold. \item chromEnd - The ending position of the RNA methylation site in the chromosome or scaffold. \item name - Defines the name of gene on which the RNA methylation site locates \item score - p-value of the peak \item strand - Defines the strand - either '+' or '-'. This is inferred based on gene annotation. \item thickStart - The same as "chromStart". The starting position of the methylation site in the chromosome or scaffold. \item thickEnd - The same as "chromEnd". The ending position of the methylation site in the chromosome or scaffold. \item itemRgb - always 0 \item blockCount - The number of blocks (exons) the RNA methylation site spans. \item blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount. \item blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount. \end{itemize} The meaning of the last 3 columns of the xls file is: \begin{itemize} \item lg.p - log10(p-value) of the peak, indicating the significance of the peak as an RNA methylation site \item lg.fdr - log10(fdr) of the peak, indicating the significance of the peak as an RNA methylation site after multiple hypothesis correction \item fold{\_}enrichment - fold enrichment within the peak in the IP sample compared with the input sample. \end{itemize} <>= recommended_peaks = result$con_peaks # consistent peaks (Recommended!) peaks_info = mcols(recommended_peaks) # information of the consistent peaks head(peaks_info) @ or to get all the peak detected (some of them do not consistently appear on all replicates: <>= all_peaks = result$all_peaks # get all peaks peaks_info = mcols(all_peaks) # information of all peaks head(peaks_info) @ \section{Peak Calling and Differential Methylation Analysis} When there are MeRIP-Seq data available from two experimental conditions, the \Rfunction{exomePeak} function may can unveil the dynamics in post-transcriptional regulation of the RNA methylome. In the following example, the function will report the sites that are post-transcriptional differentially methylated between the two tested conditions (TREATED vs. UNTREATED). Again, let us firstly load the package and get the toy data (came with the package) ready. <>= library("exomePeak") gtf=system.file("extdata", "example.gtf", package="exomePeak") f1=system.file("extdata", "IP1.bam", package="exomePeak") f2=system.file("extdata", "IP2.bam", package="exomePeak") f3=system.file("extdata", "IP3.bam", package="exomePeak") f4=system.file("extdata", "IP4.bam", package="exomePeak") f5=system.file("extdata", "Input1.bam", package="exomePeak") f6=system.file("extdata", "Input2.bam", package="exomePeak") f7=system.file("extdata", "Input3.bam", package="exomePeak") f8=system.file("extdata", "treated_IP1.bam", package="exomePeak") f9=system.file("extdata", "treated_Input1.bam", package="exomePeak") @ Please note that, this time we have two additional bam files obtained under a different "Treated" condition, i.e., \Robject{f8} and \Robject{f9}. <>= result = exomepeak(GENE_ANNO_GTF=gtf, IP_BAM=c(f1,f2,f3,f4), INPUT_BAM=c(f5,f6,f7), TREATED_IP_BAM=c(f8), TREATED_INPUT_BAM=c(f9)) @ The algorithm will firstly identify reads enriched binding sites or peaks, and then check whether the sites are differentially methylated between the two experimental conditions. The results will be saved in the specified output directory, including the identified (consistent) peaks in BED and tab-delimited formats, along with the differential information indicating whether the site is hyper- or hypo-methylated under the treated condition. Similar to the peak calling case, the BED format can be visualized in genome browser directly and the peaks may span one or multiple introns. Similar to the peak calling case, the function will report a set of consistent differentially methylated peaks (con{\_}sig{\_}diff{\_}peak.xls) saved in the specified folder, which is the recommended set. \begin{itemize} \item diff{\_}peak.xls - all the detected peaks and their differential methylation information \item sig{\_}diff{\_}peak.xls - all the differentially methylated peaks \item con{\_}sig{\_}diff{\_}peak.xls - all the consistently differentially methylated peaks. There are peaks are consistently differentially methylated among all replicates, indicating highly confidence. This set of differential methylation peaks is highly suggested. \end{itemize} Along with the XLS files, the matched BED files are also generated for visualization purpose. Similar to before, \begin{itemize} \item The first 12 columns in both the BED and the XLS are the same following the standard BED12 format: http://genome.ucsc.edu/FAQ/FAQformat.html{\#}format1 For more details, please previous section for detailed description of the first 12 columns. \item lg.p, lg.fdr, fold{\_}enrchment are results from peak detection step, i.e., log10(pvalue), log10(fdr) and fold enrichment of the detected peak as a true methylation site. Specifically, when dealing with two experimental conditions, the fold enrichment indicates whether reads are more enriched in the pooled IP sample under both conditions than in the pooled Input sample under both conditions. The enrichment{\_}change needs to be greater than 1 to be considered being enriched as an RNA methylation site. \item diff.lg.fdr, diff.lg.p, diff.log2.fc are results from differential methylation analysis, i.e., log10(fdr), log10(pvalue) and log2(odds ratio) of the peak as a differential methylation site between the two experimental conditions tested. If diff.log2.fc is larger than 0, the site is hypermethylated under the treated condition, otherwise if smaller than 0, it is hypomethylated under the treated condition. \end{itemize} The function also returns 3 GRangesList object, containing all the peaks, the differentially methylated peaks with the given threshold on the merged data, consistently differentially methylated peaks. The consistent differentially methylated peaks in the last appear to be differential for all the replicates and is thus recommended. The information of the identified peaks and the differential analysis are stored as metadata, which can be extracted. <>= names(result) is.na(result$con_sig_diff_peaks) # no reported consistent differnetial peaks @ Unfortunately, there is no reported consistent differnetial peaks on the toy data, to get the information of all the peaks and the differential analysis information: <>= diff_peaks = result$diff_peaks # consistent differential peaks (Recommended!) peaks_info = mcols(diff_peaks) # information of the consistent peaks head(peaks_info[,1:3]) # peak calling information head(peaks_info[,4:6]) # differential analysis information @ \section{Download Gene Annotation Directly from Internet} Gene annotation may be alternatively downloaded directly from internet, but will take a really long time due to the downloading time and huge transcriptome needed to be scanned. <>= result = exomepeak(GENOME="hg19", IP_BAM=c(f1,f2,f3,f4), INPUT_BAM=c(f5,f6,f7), TREATED_IP_BAM=c(f8), TREATED_INPUT_BAM=c(f9)) @ Please make sure to use the right genome assembly. \section{Handling Paired-end Reads} Unfortunately, exomePeak does not currently support pair-end sequencing. While the newer version of exomePeak package is under active development, when dealing with pair-end sequencing data, it is suggested to stack the two read files together and then treat the pooled reads as single-end sequencing data, i.e., align the pooled reads to reference genome as single-end reads, and then use exomePeak package to call peaks or do differential methylation analysis. \section{Session Information} <>= sessionInfo() @ \end{document}