%\VignetteIndexEntry{Notes for eSet developers} %\VignetteDepends{chopsticks} %\VignetteKeywords{Expression Analysis} %\VignettePackage{Biobase} % % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \documentclass[12pt]{article} \usepackage{amsmath,pstricks} % With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running % pdflatex will fail. Note that using auto-pst-pdf requires to set environment % variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in % texmf.cnf for Tex Live on Unix/Linux/Mac. \usepackage{auto-pst-pdf} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Lab exercises. Rare variant concepts and tools with Bioconductor.} \author{VJ Carey} \maketitle \section{Acknowledgment} The material covered in this lab depends heavily on software designs and creations of Martin Morgan and Sean Davis; the \Rpackage{IRanges}, \Rpackage{GenomicRanges}, \Rpackage{BSgenome} and \Rpackage{AnnotationDbi} infrastructures are also critical for allowing these problems to be stated and solved concisely, so great thanks are due to Patrick Aboyoun, Marc Carlson, Mike Lawrence and Herv\'e Pages. \section{Resources} Using the 0.1.6 version of samtools, we created a pileup of 1000 genomes reads from NA19240's solexa image: \begin{verbatim} samtools pileup -cvf \ human_b36_female.fa NA19240.chrom17.SLX.maq.SRP000032.2009_07.bam > \ n240_17.pup \end{verbatim} This text file is available to you for parsing as follows: <>= library(ind1KG) library(Rsamtools) @ <>= readPileup <- selectMethod("readPileup", "connection") @ <>= pup17 <- gzfile(system.file("pileups/n240_17.pup.gz", package="ind1KG")) c17p.i <- readPileup(pup17, variant="indel") levels(seqnames(c17p.i)) seqlevels(c17p.i) = gsub("17", "chr17", seqlevels(c17p.i)) c17p.i @ Information on dbSNP SNP locations is available in <
  • >= library(SNPlocs.Hsapiens.dbSNP.20090506) c6 <- getSNPlocs("chr6") head(c6, 5) @ Resource-oriented exercise: After reviewing the `Setup' material in the next section, estimate the frequencies of dbSNP-catalogued SNP per base pair in intronic vs. exonic DNA on chromosome 17. Estimate frequencies stratified by GC content (i.e., tabulate by 0, 1, 2 bases G or C in SNP). Special design exercise (attempt only after all other exercises below have been solved correctly): Consider alternative representations of the SNP location/value data. The allele data could be represented as a single \Rclass{DNAString}, and the location information as an \Rclass{IRanges} instance. Assess the resource consumption and query resolution performance of these representations in comparison to the existing data.frame. Consider also a representation rooted in an RDBMS such as SQLite. \section{Exercises} \subsection{Check for coding indels and SNP} \begin{itemize} \item Setup: According to data distributed by Shendure, NA19240 has two copies of a triple insert in the coding region of CDRT4. Verify. <>= library(org.Hs.eg.db) egid <- get("CDRT4", revmap(org.Hs.egSYMBOL)) kgid <- get(egid, org.Hs.egUCSCKG) library(GenomicFeatures) library(TxDb.Hsapiens.UCSC.hg18.knownGene) txdb = TxDb.Hsapiens.UCSC.hg18.knownGene txloc <- transcripts(txdb) cdrt4txloc <- txloc[elementMetadata(txloc)$tx_name %in% kgid] subsetByOverlaps(c17p.i, cdrt4txloc) @ On the basis of current annotation, none of these indels are in exons: <>= cdrt4txid <- as.character(elementMetadata(cdrt4txloc)$tx_id) cdrt4exloc <- exonsBy(txdb)[cdrt4txid] subsetByOverlaps(c17p.i, cdrt4exloc) @ \item Exercise: Write a function with parameters identifying a \Rclass{GRanges} instance generated from a pileup, a gene symbol, a variant type, and a specification of feature scope, that reports on the variants present in the gene. Discuss infelicities of data structure in the code segment above that should be ameliorated to simplify solution of this exercise. \item Exercise: Whether or not you solve the previous exercise, characterize the variants in gene MYH3 for NA19240 in some concise way. It is advisable to focus on SNP; show that there are coding SNP present for this individual that are not identified in dbSNP. \item Exercise: Introduce and justify a mechanism for filtering variant reporting using quality information. \end{itemize} \subsection{Compare Solexa calls with Sanger sequencing} \begin{itemize} \item Setup: The 4 million phase II HapMap genotype calls for NA19240 are available to you in package hmyriB36. A selection confined to chromosome 6 is available in the \Rpackage{ind1KG} package. <>= library(ind1KG) library(chopsticks) data(yri240_6) yri240_6$hm head(yri240_6$supp, 10) @ \item Exercise: Assess how many of the MAQ-based SNP calls using the chromosome 6 pileup data are found at dbSNP locations. Is the distribution of quality scores for variants identified at dbSNP locations similar to that of putatively de novo variants? \end{itemize} \subsection{\textit{de novo} SNPs in probes: effects on expression microarrays} Exercise: Acquire the probe sequences for the Illumina Human v1 expression array, perhaps by inverting the nuids found in the lumiHumanIDMapping metadata package. <>= library(lumiHumanIDMapping) con <- lumiHumanIDMapping_dbconn() dbListTables(con) dbGetQuery(con, "select * from HumanWG6_V1 limit 5") @ Determine the genomic positions of all probes interrogating genes on chromosome 17 using \Rpackage{Biostrings} \Rfunction{matchPDict} against the consensus genomic sequence for chromosome 17. Find all probes (on chr17) corresponding to sequence for which NA19240 is found by MAQ to harbor a variant (use the pileup noted previously). We will call these probes ``associated with sequence variants''. Compute expression Z-scores for expression levels obtained for NA19240 using mean and standard deviation based on log expression for the 89 individuals in hmyriB36 excluding NA19240. Can the distribution of expression Z-scores for probes associated with sequence variants be distinguished from the distribution of expression Z-scores for probes not associated with sequence variants. Extra credit extension: Some probes define sequence associated with splice junctions. These 50mers will not align to consensus genomic sequence, but will align once introns are removed. Can you identify probes associated with splice junctions that are also associated with sequence variants? Does the expression Z-score for splice-junction-associated probes differ in distribution from the general distribution of expression Z-scores? Additional exercises. Retrieve the SOLiD or 454-based short read archives for NA19240 and check the consistency of conclusions obtained in prior Solexa-based exercises with results based on these platforms. \section{Session information} <>= sessionInfo() @ \end{document}