%\VignetteIndexEntry{IVAS : Identification of genetic Variants affecting Alternative Splicing} %\VignetteDepends{GenomicFeatures} %\VignettePackage{IVAS} \documentclass[a4paper,11pt]{article} \usepackage{graphicx} \usepackage{a4wide} \graphicspath{{./result/}} <>= BiocStyle::latex(use.unsrturl=FALSE) @ \title{\Rpackage{IVAS} : Identification of genetic Variants affecting Alternative Splicing} \author{Seonggyun Han and Sangsoo Kim} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents \section{Introduction} Alternative splicing controls relative expression ratios of mature mRNA isoforms from a single gene. Mapping studies of Splicing Quantitative Trait Loci (SQTL), a genetic variant affecting the alternative splicing, are important steps to understand gene regulations and protein activity \cite{KeyanZhao}. We present an effective and user-friendly computational tool to detect SQTLs using transcript expression data from RNA-seq and genotype data, both measured on the same sample. As RNA sequencing (RNA-seq) provides insight into relatively precise measurments of expression level of transcript isoforms from a gene, it is a useful tool to analyze complicated biological phenomenons of RNA transcripts including the alternative splicing \cite{JosephK}. The mapping analysis uses two statistical models : Linear regression model \cite{Breslow} and/or Generalized linear mixed model \cite{Chambers}. \section{The input data set} The next subsection introduces the input data. To run this tool, two experimental data sets (an expression data frame from RNA-seq and a genotype data frame) are required. Moreover, we also need a data frame for positions of SNP markers and GTF file for transcript models. As any other genome-wide analyses, it is recommended to use as many samples as possible, usually of population scale, in order to guarantee a statistically significant result. \subsection{The genotype data} The genotype data should be prepared as a simple matrix data. Each column represents an individual and its name should match that of the expression matrix described below (2.2) \begin{table}[h!] \begin{tabular}{c c c c c} &ind1&ind2&ind3&ind4\\ SNP1&AA&AA&AT&TT\\ SNP2&CG&CC&GG&CG\\ SNP3&TT&TT&AT&TT\\ \end{tabular} \end{table} \subsection{The expression data} The expression matrix must comprise expresion values of transcripts from RNA-seq. We may obtain them by using alignment tools such as cufflinks. Each column represents an individual and its name should match that of the genotype matrix described above (2.1) \begin{table}[h!] \begin{tabular}{c c c c c} &ind1&ind2&ind3&ind4\\ transcript1&10.5&15.4&6.7&12.4\\ transcript2&6.4&7.2&4.5&9.2\\ transcript3&15.4&14.5&13.2&17.8\\ \end{tabular} \end{table} \subsection{The SNP marker position data} To search SNPs affecting alternative splicing, a data frame comprising genomic location of each SNP is required. It consists of following columns: SNP(SNP marker name), CHR(chromosome number), and locus(SNP position). \begin{table}[h!] \begin{tabular}{c c c} SNP&CHR&locus\\ SNP1&1&4964005\\ SNP2&1&23513047\\ \end{tabular} \end{table} \subsection{The transcripts model data} We need a reference GTF(General Feature Format) file including information about gene structures such as the positions of exons, introns, and transcripts of genes. The GTF file must be \Robject{TxDb} object from the \Biocpkg{GenomicFeatures} package \cite{Michael}. \section{The example dataset : data from Geuvadis RNA sequencing project of 1000 Genome samples} This example uses filtered data from an origin data generated by Geuvadis RNA sequencing project, available at \texttt{http://www.geuvadis.org/web/geuvadis/RNAseq-project} \cite{TuuliLappalainen}. The example expression data includes transcripts of 11 randomly selected genes. The genotype data comprises SNPs in those genes. \section{Loading data} For this analysis, you need to load the \Rpackage{IVAS} package, SNP data, expression data, SNP position data, and \Robject{TxDb} object from GTF. \\ Loading \Rpackage{IVAS} package : <>= library(IVAS) @ Loading expression data : <>= data(sampleexp) @ Loading SNP data : <>= data(samplesnp) @ Loading SNP position data : <>= data(samplesnplocus) @ Loading \Robject{TxDb} object : <>= sampleDB <- system.file("extdata", "sampleDB", package="IVAS") sample.Txdb <- loadDb(sampleDB) @ If you want to create the \Robject{TxDb} object from a GTF file, you need to use the \Rfunction{makeTxDbFromGFF} function in the \Biocpkg{GenomicFeatures} package. \section{Running the \Rpackage{IVAS} package for detecting SQTLs} \subsection{Separating the \Robject{TxDb} object based on a single chromosome.} The \Rfunction{chrseparate} function separates the \Robject{TxDb} object based on a single chromosome for mapping analysis of expression and genotype in a single chromosome. \\ To use this function : <>= filtered.txdb <- chrseparate(sample.Txdb,19) filtered.txdb @ In this example, We filter the \Robject{TxDb} object with only the chromosome 6. \subsection{Finding alternative exons of a single gene} The \Rfunction{findAlternative} function finds flanking introns of alternative exons from a single gene. To run this function, it requires trans.exon.range, trans.intron.range, and txTable. The trans.exon.range is a range of exons based on transcripts using the \Rfunction{exonsBy} function in \Biocpkg{GenomicFeatures} package. The trans.intron.range is a range of introns based on transcripts using the \Rfunction{intronsBy} function in \Biocpkg{GenomicFeatures} package. The txTable has information about trancripts(start site, chromosome number, etc). <>= trans.exon.range <- exonsBy(filtered.txdb,by="tx") trans.intron.range <- intronsByTranscript(filtered.txdb) txTable <- select(filtered.txdb, keys=names(trans.exon.range), columns=c("TXID","TXNAME","GENEID","TXSTART","TXEND"),keytype="TXID") Altvalue <- findAlternative("ENSG00000170889",txTable,trans.exon.range, trans.intron.range,19) Altvalue @ In this example, we will search SQTLs in the ENSG00000170889. \subsection{Finding SNPs which belongs to alternative exons and flanking introns} The \Rfunction{overlapsnp} function searches SNPs in alternative exons and the flanking introns. <>= ch.snp.locus <- as.matrix(samplesnplocus[samplesnplocus[,2] == 19,]) ch.snps <- matrix(ch.snp.locus[is.element(ch.snp.locus[,1],rownames(samplesnp)),], ncol=3,byrow=FALSE) ch.snps.range <- GRanges(seqnames=Rle(19),ranges=IRanges(start=as.integer(ch.snps[,3]), end=as.integer(ch.snps[,3])),metadata=ch.snps[,1]) overlapsnp <- findOversnp(Altvalue,ch.snps.range) overlapsnp @ \subsection{Finding SQTLs} Using the output data from the \Rfunction{Altvalue} function and the \Rfunction{overlapsnp} function, significant SNPs affecting the alternative splicing can be identified by the sqtlfinder function. <>= sqtl.result <- sqtlfinder(Altvalue,overlapsnp,sampleexp,samplesnp,"lm") sqtl.result @ In this example, We will run the function with the linear regression model. \section{Identification of SQTLs in multiple genes} The functions in the \Rpackage{IVAS} package perform mapping analysis using a single gene. However, the \Rfunction{MsqtlFinder} function in the \Rpackage{IVAS} package enables one to analyze multiple genes using the multi-thread version of \Rfunction{foreach} function. Moreover, it joins the results on the genes from sqtlfinder function and calculates the FDR using P-values of the results. <>= final.result <- MsqtlFinder(sampleexp,samplesnp,samplesnplocus,sample.Txdb,"lm",1) @ \Rfunction{MsqtlFinder} shows chromosome numbers during mapping analysis. The last argument denotes the number of threads(a single processor in this example) \section{Visualizing the result} To visualize the results into boxplot, the \Rpackage{IVAS} package provides the saveBplot function. Using the dataframe from the output of \Rfunction{sqtlfinder} or \Rfunction{MsqtlFinder} function, we can make the boxplot. <>= saveBplot(sqtl.result,sampleexp,samplesnp,samplesnplocus,filtered.txdb,"./result") @ \begin{center} \includegraphics[width=0.6\textwidth]{ENSG00000170889_54704610-54704756_rs3810232(13)_lm.png} \end{center} The output png files are saved in "result" folder. \section{Session Information} <>= sessionInfo() @ \begin{thebibliography}{9} \bibitem{KeyanZhao} Keyan Zhao, Zhi-xiang Lu, Juw Won Park, Qing Zhou, Yi Xing. 2013. GLiMMPS: robust statistical model for regulatory variation of alternative splicing using RNA-seq data. Genome Biol 14, R74. \bibitem{JosephK} Joseph K. Pickrell, John C. Marioni, Athma A. Pai, Jacob F. Degner, Barbara E. Engelhardt, Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens, Yoav Gilad, Jonathan K. Pritchard. 2010. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768-722. \bibitem{Breslow} N.E. Breslow and D.G. Clayton. 1993. Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 88 421: 9-25. \bibitem{Michael} Michael Lawrence, et al. 2013. Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol. 9(8): e1003118. \bibitem{Chambers} Chambers, J. M. 1992. Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth, and Brooks Cole. \bibitem{TuuliLappalainen} Tuuli Lappalainen, et al. 2013. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506-511. \end{thebibliography} \end{document}