--- title: "InPAS Vignette" author: "Jianhong Ou, Lihua Julie Zhu" date: "`r doc_date()`" package: "`r pkg_ver('InPAS')`" abstract: > Alternative polyadenylation (APA) is one of the important post-transcriptional regulation mechanisms which occurs in most human genes. InPAS facilitates the discovery of novel APA sites from RNAseq data. It leverages cleanUpdTSeq to fine tune identified APA sites. vignette: > %\VignetteIndexEntry{InPAS Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document --- ```{r preloadLibrary, echo=FALSE, results="hide", warning=FALSE} suppressPackageStartupMessages({ library(InPAS) library(BSgenome.Mmusculus.UCSC.mm10) library(TxDb.Mmusculus.UCSC.mm10.knownGene) library(org.Hs.eg.db) library(cleanUpdTSeq) }) ``` # Introduction Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms which occurs in most human genes. APA in a gene can result in altered expression of the gene, which can lead pathological effect to the cells, such as uncontrolled cell cycle and growth. However, there are only limited ways to identify and quantify APA in genes, and most of them suffers from complicated process for library construction and requires large amount of RNAs. RNA-seq has become one of the most commonly used methods for quantifying genome-wide gene expression. There are massive RNA-seq datasets available publicly such as GEO and TCGA. For this reason, we develop the InPAS algorithm for identifying APA from conventional RNA-seq data. The workflow for InPAS is: - Calculate coverage from BEDGraph files - Predict cleavage sites - Estimate 3UTR usage # How to To use InPAS, BSgenome and TxDb object have to be loaded before run. ```{r loadLibrary} library(InPAS) library(BSgenome.Mmusculus.UCSC.mm10) library(TxDb.Mmusculus.UCSC.mm10.knownGene) path <- file.path(find.package("InPAS"), "extdata") ``` Users can prepare annotaiton by **utr3Annotation** with a TxDb and org annotation. The 3UTR annotation prepared by **utr3Annotation** includes the gaps to next annotation region. ```{r prepareAnno} library(org.Hs.eg.db) samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(samplefile) utr3.sample.anno <- utr3Annotation(txdb=txdb, orgDbSYMBOL="org.Hs.egSYMBOL") utr3.sample.anno ``` Users can load mm10 and hg19 annotation from pre-prepared data. Here we loaded the prepared mm10 3UTR annotation file. ```{r loadAnno} ##step1 annotation # load from dataset data(utr3.mm10) ``` The coverage is calculated from BEDGraph file. The RNA-seq BAM files could be converted to BEDGraph files by bedtools genomecov tool (parameter: -bg -split). PWM and a classifier of polyA signal can be used for adjusting CP sites prediction. ```{r step2HugeDataF} load(file.path(path, "polyA.rds")) library(cleanUpdTSeq) data(classifier) bedgraphs <- c(file.path(path, "Baf3.extract.bedgraph"), file.path(path, "UM15.extract.bedgraph")) hugeData <- FALSE ##step1 Calculate coverage coverage <- coverageFromBedGraph(bedgraphs, tags=c("Baf3", "UM15"), genome=BSgenome.Mmusculus.UCSC.mm10, hugeData=hugeData) ## we hope the coverage rate of should be greater than 0.75 in the expressed gene. ## which is used because the demo data is a subset of genome. coverageRate(coverage=coverage, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, genome=BSgenome.Mmusculus.UCSC.mm10, which = GRanges("chr6", ranges=IRanges(98013000, 140678000))) ##step2 Predict cleavage sites CPs <- CPsites(coverage=coverage, genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, search_point_START=200, cutEnd=.2, long_coverage_threshold=3, background="10K", txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, PolyA_PWM=pwm, classifier=classifier, shift_range=50, step=10) head(CPs) ##step3 Estimate 3UTR usage res <- testUsage(CPsites=CPs, coverage=coverage, genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, method="fisher.exact", gp1="Baf3", gp2="UM15") ##step4 view the results as(res, "GRanges") filterRes(res, gp1="Baf3", gp2="UM15", background_coverage_threshold=3, adj.P.Val_cutoff=0.05, dPDUI_cutoff=0.3, PDUI_logFC_cutoff=0.59) ``` The process described above can be done in one step. ```{r OneStep} if(interactive()){ res <- inPAS(bedgraphs=bedgraphs, tags=c("Baf3", "UM15"), genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, gp1="Baf3", gp2="UM15", txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, search_point_START=200, short_coverage_threshold=15, long_coverage_threshold=3, cutStart=0, cutEnd=.2, hugeData=FALSE, shift_range=50, PolyA_PWM=pwm, classifier=classifier, method="fisher.exact", adj.P.Val_cutoff=0.05, dPDUI_cutoff=0.3, PDUI_logFC_cutoff=0.59) } ``` InPAS can handle single group data. ```{r SingleGroupData} inPAS(bedgraphs=bedgraphs[1], tags=c("Baf3"), genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, gp1="Baf3", gp2=NULL, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, search_point_START=200, short_coverage_threshold=15, long_coverage_threshold=3, cutStart=0, cutEnd=.2, hugeData=FALSE, PolyA_PWM=pwm, classifier=classifier, method="singleSample", adj.P.Val_cutoff=0.05, step=10) ``` # Session Info ```{r sessionInfo, results='asis'} sessionInfo() ```