--- title: "Using the sangerseqR package" author: "Jonathon T. Hill, PhD" output: BiocStyle::html_document: toc: true toc_depth: 2 toc_float: true css: "bioconductor2.css" vignette: > %\VignetteIndexEntry{Using the sangerseqR package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r, include=FALSE} library(knitr, quietly=TRUE) library(sangerseqR, quietly=TRUE) library(Biostrings, quietly=TRUE) opts_chunk$set(tidy=TRUE, tidy.opts=list(width.cutoff=70)) ``` # Introduction The `sangerseqR` package provides basic functions for importing and working with sanger sequencing data files. It currently functions with Scf and ABIF files. The Scf file specification is an open source, although somewhat limited, data file type. Several tools designed to view and or edit chromatogram data can convert file types to Scf. The ABIF file specification is a proprietary data storage file specification for sequencing data generated by Applied Biosystems machines. More information on each filetype can be found at the following sites: - ABIF: https://projects.nfstc.org/workshops/resources/articles/ABIF_File_Format.pdf - SCF: http://staden.sourceforge.net/manual/formats_unix_2.html The objects and functions included in this package were developed as part of the Poly Peak Parser web application (http://yost.genetics.utah.edu/software.php), which automates the process of seperating ambiguous double peaks from Sanger sequencing individuals containing heterozygous indels. This package contains a complete working copy of the Poly Peak Parser web application that can be run locally using the `PolyPeakParser()` function. In addition to the web program, this package also provides general objects and functions for working with Sanger sequencing data. This vignette will walk you through a typical workflow using two sequence files: 1) homozygous.scf and 2) heterozygous.ab1. As their names indicate, the first example contains results typical of sequencing from a PCR product of a homozygous individual or from a plasmid. The second example contains results from sequencing the same region in an individual with a small indel. # Loading Data The first step of most workflows will be to upload data from a sequencing results file. This can be done using one of three included functions: `read.abif()`, `read.scf()` and `readsangerseq()`. The first two functions directly import all of the fields into `abif` and `scf` class objects, respectively. These classes are meant as intermediate classes and exist to allow the user to inspect the file contents, as file contents may vary between basecallers and sequencing machines. Users should generally use the `readsangerseq()` function. This function automatically detects and reads in the file type and then extracts the fields necessary to create a `sangerseq` class object, which is used by all of the other functions in this package. ## read.abif `read.abif` takes a single argument for the filename of the abif file to be read. The resulting object contains three major parts. The header, containing information on the file structure, the directory, containing information on each of the data fields included in the file, and the data fields. Here is an example: ```{r} hetab1 <- read.abif(system.file("extdata", "heterozygous.ab1", package="sangerseqR")) str(hetab1, list.len=20) ``` As you can see, the file is very long and contains a lot of Data fields (130 in this example). However, most of these contain run information and only a few are directly relevant to data analysis: | Data Field | Description | |:--------- |:--------- | |DATA.9-DATA.12 | Vectors containing the signal intensities for each channel. | |FWO.1 | A string containing the base corresponding to each channel. For example, if it is "ACGT", then DATA.9 = A, DATA.10 = C, DATA.11 = G and DATA.12 = T. | |PLOC.2 | Peak locations as an index of the trace vectors. | |PBAS.1, PBAS.2 | Primary basecalls. PBAS.1 may contain bases edited in the original basecaller, while PBAS.2 always contains the basecaller's calls. | |P1AM.1 | Amplitude of primary basecall peaks. | |P2BA.1 | (optional) Contains the secondary basecalls. | |P2AM.1 | (optional) Amplitude of the secondary basecall peaks. | ## read.scf Like `read.abif`, `read.scf` takes a single argument with the filename. However, the data structure of the resulting `scf` object is far less complicated, containing only a header with file structure information, a matrix of the trace data (`@sample_points`), a matrix of relative probabilities of each base at each position (`@sequence_probs`), basecall positions (`@basecall_positions`), basecalls (`@basecalls`) and optionally a comments sections with the run data (`@comments`). The last slot (`@private`) is rarely used and impossible to interpret without knowing how it was created. ```{r} homoscf <- read.scf(system.file("extdata", "homozygous.scf", package="sangerseqR")) str(homoscf) ``` ## readsangerseq The `readsangerseq` function is a convenience function equivalent to `sangerseq(read.abif(file))` or `sangerseq(read.scf(file))`. It should generally be used when the contents of the file do not need to be directly accessed because it returns a `sangerseq` object, described below. # Sangerseq Class Objects The `sangerseq` class is the backbone of the sangerseqR package and contains the chromatogram data necesary to perform all other functions. It can be created in two ways: from an `abif` or `scf` object using the `sangerseq` method or directly from an abif or scf file using `readsangerseq`. ```{r} #from a sequence file object homosangerseq <- sangerseq(homoscf) #directly from the file hetsangerseq <- readsangerseq(system.file("extdata", "heterozygous.ab1", package="sangerseqR")) str(hetsangerseq) ``` The slots are as follows: | Slot | Description | |:--------- |:--------- | | `primarySeqID` | Identification of the primary Basecalls. | | `primarySeq` | The primary Basecalls formatted as a DNAString object. | | `secondarySeqID` | Identification of the secondary Basecalls. | | `secondarySeq` | The secondary Basecalls formatted as a DNAString object. | | `traceMatrix` | A numerical matrix containing 4 columns corresponding to the normalized signal values for the chromatogram traces. Column order = A,C,G,T. | | `peakPosMatrix` | A numerical matrix containing the position of the maximum peak values for each base within each Basecall window. If no peak was detected for a given base in a given window, then "NA". Column order = A,C,G,T. | | `peakAmpMatrix` | A numerical matrix containing the maximum peak amplitudes for each base within each Basecall window. If no peak was detected for a given base in a given window, then 0. Column order = A,C,G,T. | Accessor functions also exist for each slot in the sangerseq object. Most of the accessors return the data in its native format, but the `primarySeq` and `secondarySeq` accessors can optionally return the data as a character string or a `DNAString` class object from the `Biostrings` package by setting ` string=TRUE` or ` string=FALSE`, respectively. The `DNAString` class contains several convenient functions for manipulating the sequence, including generating the reverse compliment and performing alignments. The `Biostrings` package is automatically loaded with the `sangerseq` package, so all methods should be available. ```{r} #default is to return a DNAString object Seq1 <- primarySeq(homosangerseq) reverseComplement(Seq1) #can return as string primarySeq(homosangerseq, string=TRUE) ``` # Creating Chromatograms Basic chromatogram plots can be made using the `chromatogram` function. These plots are optimized for printing, so they contain several rows to plot all of the data simultaneously. The downside of this is that it can give an error if the graphics device dimensions are not large enough. If this occurs, we suggest you provide a filename in the command to save it to a pdf automatically sized to fit everything. Several parameters can also be set to affect how the plot appears. These are documented in the chromatogram help file. ```{r, fig.height=10} chromatogram(hetsangerseq, width=80, height=3, trim5=50, trim3=100, showcalls='both') ``` # Making Basecalls As shown in the chromatogram, secondary basecalls are sometimes provided in ab1 files (Scf files are unable to show them). However, the exact nature of these calls is inconsistent. In the heterozygous.ab1 file used here, it is any peak near the primary peak, no matter how small. For example, base 100 (first base on the second line) has a primary call of "C" and a secondary call of "A", even though the A peak is very small and likely noise. In homozygous sequencing results, these calls should simply be ignored and are hidden in the chromatogram by default (` showcalls="primary"`). When heterozygous regions of the sequence are present, the `makeBaseCalls` can be used to determine whether a particular peak is homozygous or heterozygous and call the appropriate bases. Let's use the chromatogram we created in the previous section as an example. The chromatogram contains a homozygous region from bases 1 to approximately 160, but then breaks down into a series of double peaks for the remainder of the chromatogram. This is due to an indel in one allele of the sequenced region. `makeBaseCalls` can be used to show this more clearly or to add the secondary basecalls if the data file does not contain them. The function essentially divides the sequence into a series of basecall windows and identifies the tallest peak for each fluorescence channel within the window. These peaks are converted to signal ratio to the tallest peak. A cutoff ratio is then applied to determine if a peak is signal or noise. Peaks below this ratio are ignored. Remaining peaks in each window are used to make primary and secondary basecalls. ```{r} hetcalls <- makeBaseCalls(hetsangerseq, ratio=0.33) hetcalls ``` The resulting file now contains the maximum peak in the ` @primarySeq` slot and the second tallest peak, if it is above the cutoff, in the ` @secondarySeq` slot. If only one peak is above the cutoff ratio, then this call matches the primary basecall. If three peaks were above the cutoff ratio, then the peak with the maximum amplitude is the primary basecall and an ambiguous base code is used as the secondary basecall. The resulting chromatogram also shows this: ```{r, fig.height=10} chromatogram(hetcalls, width=80, height=3, trim5=50, trim3=100, showcalls='both') ``` Chromatogram of heterozygous sequencing results after making basecalls. Primary and secondary basecalls now match for homozygous peaks # Parsing Alleles Although `makeBaseCalls` has fixed the primary and secondary peak calls. It still does not tell us anything about the nature of the mutation. For this, we need to set the allele phase using a reference base sequence from an online source or from another sequencing run on a homozygous sample. The examples used in this vignette are from heterozygous and homozygous siblings, so we will use the primary basecalls from the homozygous sibling (loaded earlier) as our reference. The beginnings and ends of these sequences do not need to match, but the reference should ideally encompass the sequenced region. `setAllelePhase` will then separate the primary and secondary basecalls into reference and non-reference bases at each position and set (`@primarySeq`) to the reference and `@secondarySeq` to the non-reference allele. ```{r} ref <- subseq(primarySeq(homosangerseq, string=TRUE), start=30, width=500) hetseqalleles <- setAllelePhase(hetcalls, ref, trim5=50, trim3=300) hetseqalleles ``` At this point, we could plot the chromatogram again, but it is more informative to align the resulting sequences to see how the alleles differ. Since `sangerseqR` depends on `Biostrings`, `pairwiseAlignment` can be used. ```{r} pa <- pairwiseAlignment(primarySeq(hetseqalleles)[1:400], secondarySeq(hetseqalleles)[1:400], type="global-local") writePairwiseAlignments(pa) ``` # Conclusion In this vignette, we have walked you through the basic functions in the `sangerseqR` package. This work is a work in progress and we hope to improve its functionality. For example, improving the base calling algorithm and adding an interactive chromatogram function. If you have any suggestions or requested features, please email Jonathon Hill at jhill@byu.edu.