--- title: "Introduction to qckitfastq" author: "August Guang and Wenyue Xing" output: pdf_document package: qckitfastq bibliography: qckitfastq.bib link-citations: true vignette: > %\VignetteIndexEntry{Quality control analysis and visualization using qckitfastq} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, setup, include=FALSE} library(knitr) library(kableExtra) ``` # Introduction `qckitfastq` is part of the planned `qckit` suite of packages from the [Computational Biology Core at Brown University](https://compbiocore.github.io/). The long-term goal of the qckit suite is to not only provide comprehensive quality control metrics for common genomics sequencing workflows, but to 1) also provide quality control visualizations for multiple samples through updated modules in multiQC [@Ewels2016], and 2) provide a public quality control database which allows benchmarking of QC metrics for an experiment against other similar experiments. Users of `qckitfastq` will have the option to enter SRA metadata and create visualizations from our database as well as submit their qc results (if submitting an SRA archive) to our database once these features have been implemented. The purpose of this particular package is to run quality control on FASTQ files from genome sequencing. ## Why use qckitfastq? Indeed there are many other quality control packages for FASTQ files existing already, including ShortReads [@shortread] and seqTools [@seqtools] for R and the popular FASTQC [@Andrews2010] Java-based program. qckitfastq offers a few advantages compared to these 3 programs for users who need such features: 1. access to raw sequence and quality data 2. access to data frames of the quality control results in addition to plots 3. quality control analyses of the entire FASTQ file 4. fast file processing To break it down further, seqTools and ShortReads do not offer as comprehensive set of quality control metrics as qckitfastq and FASTQC. seqTools further provides limited access to raw data and intermediate analysis results. ShortRead provides users with access to the raw sequencing data and intermedite analysis results, but is inefficient on datasets exceeding 10 million reads. FASTQC meanwhile truncates any reads longer than 75bp as well as estimates overall quality only based on the first 100,000 reads of any FASTQ file. `qckitfastq` does not contain any of these limitations. ```{r, comparison_table, include=FALSE,eval=FALSE} qckitfastq <- c("yes","yes","yes+","yes") seqTools <- c("no","yes","yes","yes") ShortRead <- c("no","no","no","yes") FASTQC <- c("yes","yes*","yes*","yes*") metrics <- data.frame(qckitfastq,seqTools,ShortRead,FASTQC) rownames(metrics) <- c("Read Length Distribution", "Per Base Read Quality", "Nucleotide Read Content", "GC Content") kable(metrics) # need to do per read sequence quality # + indicates that the program... # Here, '*' indicates that the program truncates the file or computes on only the first x samples ``` # Running qckitfastq `qckitfastq` provides the following metrics: * [Read length distribution](#read-length-distribution) * [Per base read quality](#per-base-quality) * [Per read read quality](#per-read-quality) * [GC content](#gc-content) * [Nucleotide read content](#nucleotide-read-content) * [Kmer count](#kmer-count) * [Overrepresented reads](#overrep-reads) * [Overrepresented kmers](#overrep-kmers) * [Adapter content](#adapter-content) The simplest way to run `qckitfastq` is by executing `run_all`, a single command that will produce a report of all of the included metrics in a user-provided directory with some default parameters. However, each metric can also be run separately for closer examination. # Individual metrics For each individual metric that `qckitfastq` provides, the user can choose to save either the data frame for the metric, the associated plot, or both. Our example in this vignette has 25,000 reads, each 100bp long. The majority of metrics are run on the path to the FASTQ file. Some functions for quality control in this package are simply wrappers around `seqTools` due to the fact that their functions are fast. We provide these wrappers for the sake of completeness in quality control metrics. These wrappers require processing the FASTQ data through the `seqTools::fastqq` command first: ```{r, loading_file} library(qckitfastq) infile <- system.file("extdata", "10^5_reads_test.fq.gz", package = "qckitfastq") fseq <- seqTools::fastqq(infile) ``` ## Read length distribution {#read-length-distribution} *read_length* generates a data frame of read lengths and the number of reads with that particular read length. *plot_read_length* generates a distribution plot of the length of all reads. The generated plot would show the sequence length of all the sequences throughout the file. The plot is considered an indication of good data quality is all sequences have roughly the same sequence length with minimal deviations. The following plot shows that all reads in the file have sequence length of 100. ```{r, read_length} read_len <- read_length(fseq) kable(head(read_len)) %>% kable_styling() plot_read_length(read_len) ``` ## Per base quality {#per-base-quality} The *per_base_quality* function calculates quality score per base summary statistics of 10th, 25th, median, 75th and 90th quantiles across sequences. Currently it treats all quality score encodings as Phred+33. We can use the result to create a quality score distribution per position plot using *plot_per_base_quality*. As a basic heuristic, quality scores above 28 can be categorized as good (green), those from 20 to 28 can be categorized as medium (yellow), and under 20 is bad (red). ```{r, per_base_sequence_quality} bs <- per_base_quality(infile) kable(head(bs)) %>% kable_styling() plot_per_base_quality(bs) ``` ## Per read quality score {#per-read-quality} The *per_read_quality* function compute the mean quality score per read. We can then use *plot_per_read_quality* to generate a histogram of this statistics. The histogram is considered an indication of good data quality if the majority of reads have mean quality scores greater than 30. If a significant portion of reads have quality scores less than 30, then the data most likely has issues that need to be examined. ```{r, per_read_quality} prq <- per_read_quality(infile) kable(head(prq)) %>% kable_styling() plot_per_read_quality(prq) ``` ## Per read GC content {#gc-content} The *GC_content* function computes the GC nucleotide content percentage per read, and *plot_GC_content* plots the distribution of GC content. As a general rule, an indication of good data quality is when the GC content percentage in each read is between 30 and 50% and roughly follows a normal distribution. ```{r, gc_content} gc_df <- GC_content(infile) kable(head(gc_df)) %>% kable_styling() plot_GC_content(gc_df) ``` ## Per position nucleotide read content {#nucleotide-read-content} *read_content* calculates the total numbers across reads of each nucleotide base by position. *plot_read_content* plots the percentage of nucleotide content per position. We also provide an additional function *read_base_content* that allows the user to get the nucleotide base content by position across reads for a specific nucleotide (options are a,c,t,g,n). As a general rule, the plot would be considered an indication of good data quality when the percentage of each nucleotide sequence content is about evenly distributed across all bases. However, there are some types of analyses for which this will not be the case. For example, RNA-Seq will have an uneven sequence content distribution in the first 10 bases, and RRBS will have almost no cytosines and very high thymine content because the library prep protocol converts most C to T [@Meissner2005]. Knowledge of the library prep protocol is thus important for evaluating quality in terms of nucleotide sequence content. ```{r, nucleotide_read_content} scA <- read_base_content(fseq, content = "A") kable(head(scA)) %>% kable_styling() rc <- read_content(fseq) kable(head(rc)) %>% kable_styling() plot_read_content(rc) ``` ## Per position kmer count {#kmer-count} *kmer_count* function produces the per position kmer count with given path to the FASTQ file and the kmer length specified. ```{r, kmer_count} km <- kmer_count(infile,k=6) kable(head(km)) %>% kable_styling() ``` ## Overrepresented reads {#overrep-reads} *overrep_reads* produces a data frame consisting of overrepresented reads and their counts in decreasing order of counts. Here overrepresented is defined as unique reads that have counts larger than 0.1% of the total reads in the file. *plot_overrep_reads* produces a density plot of the counts and marks the top 5 overrepresented reads in red. ```{r, overrep_reads} overrep_reads<-overrep_reads(infile) knitr::kable(head(overrep_reads,n = 5)) %>% kable_styling() plot_overrep_reads(overrep_reads) ``` ## Overrepresented kmers {#overrep-kmers} *overrep_kmer* generates a data frame of overrepresented kmers with its maximum log2(observed/expected) ratio and the position of the maximum obs/exp ratio in descending order. Only those kmers with a ratio greater than 2 are returned in the data frame. We can also create a boxplot of the obs/exp ratio thatn includes plotting the top 2 (or n) kmer outliers. ```{r, overrep_kmer} overkm <-overrep_kmer(infile,7) knitr::kable(head(overkm,n=10)) %>% kable_styling() plot_overrep_kmer(overkm) ``` ## Adapter content {#adapter-content} *adapter_content* computes the counts of a pre-determined set of adapter sequences and reports back any exceeding 0.1% of the total reads, sorted from most abundant to least abundant. *plot_adapter_content* creates a bar plot of the top 5 most common adapters. In this instance we will use a different file to compute adapter content, as the first example has no adapter contamination. This function is only available for macOS/Linux due to a dependency on RSeqAn/C++14, which is not supported on current Bioconductor Windows build machines. ```{r, adapter_content} if(.Platform$OS.type != "windows") { infile2 <- system.file("extdata", "test.fq.gz", package = "qckitfastq") ac_sorted <- adapter_content(infile2) kable(head(ac_sorted)) %>% kable_styling() plot_adapter_content(ac_sorted) } ``` ```{r, eval=FALSE, include=FALSE} ### Benchmarking #To demonstrate the utility of our functions on large datasets... #(need to benchmark against ShortRead) #library(seqTools) #library(ShortRead) #library(rbenchmark) #sampler <- FastqSampler('E-MTAB-1147/fastq/ERR127302_1.fastq.gz', 20000) ``` # References