--- title: "Sample Quality Check for NGS Data using SeqSQC package" author: "Qian Liu, Qianqian Zhu" date: "`r BiocStyle::doc_date()`" abstract: > SeqSQC is a bioconductor package for sample-level quality check in next generation sequencing (NGS) study. It is designed to automate and accelerate the sample cleaning of NGS data in any scale, including identifying problematic samples with high missing rate, gender mismatch, contamination, abnormal inbreeding coefficient, cryptic relatedness, and discordant population information. SeqSQC takes Variant Calling Format (VCF) files and sample annotation file containing sample population and gender information as input and report problematic samples to be removed from downstream analysis. Through incorporation a benchmark data assembled from the 1000 Genomes Project, it can accommodate NGS study of small sample size and low number of variants. output: BiocStyle::html_document: toc: true Vignette: > %\VignetteIndexEntry{Sample Quality Check for Next-Generation Sequencing Data with SeqSQC} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Quick start The QC process in SeqSQC included five different steps: missing rate check, sex check, inbreeding coefficient check, Identity-by-descent (IBD) check, and population outlier check. Problematic samples identified in each step could be removed from later steps. The entire sample level QC was wrapped up in one function: `sampleQC`. By executing this function, a problematic sample list with criteria from each QC step as well as a QC report with interactive plots in html format will be generated. Here we use an exemplar NGS dataset as a study cohort to demonstrate the execution of the wrap-up function `sampleQC`. The example dataset, with four EUR (European) samples and one AFR (African) sample, is assembled from the 1000 Genomes Project. We labeled the one AFR sample as EUR to mimic a population outlier. Samples from the 1000 Genome Project are whole-genome sequencing dataset. To mimic whole exome sequencing (WES) data, we kept in the vcf file only the variants in capture regions of Agilent SureSelect Human Exon v5, one of the most popular WES capture platforms to date. The code chunk below assumes that you have a vcf file called `infile` with samples from a single population, a sample annotation file called `sample.annot` with sample name, population (e.g., AFR, EUR, EAS (East Asian), SAS (South Asian), or ASN (Asian)) and gender info (male/female), and a bed file called `cr` which contains capture regions. User need to specify the name for output files, incuding the gds file containing genotypes, and the `SeqSQC`(discussed in section below) object with QC results. ```{r setup, eval=FALSE} source("https://bioconductor.org/biocLite.R") biocLite("SeqSQC") ``` ```{r load packages} library(SeqSQC) ``` The wrap-up function `sampleQC` transforms the vcf file into `SeqSQC` object, or take the `SeqSQC` object generated from `LoadVfile` directly as input, evaluates all sample QC steps (as a convenient all inclusive method), outputs the QC results into your designated directory, and generates a sample QC report in html format. We recommend saving the output `SeqSQC` object into an RData. This is an example when we have input as vcf file, sample annotation file, and capture region bed file. Note that our example vcf file only include variants in chromosome 1, and the capture region only include the ```{r loadData, eval=TRUE} infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC") sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC") cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC") outdir <- tempdir() outfile <- file.path(outdir, "testWrapUp") ``` ```{r wrapup_vcf, eval=FALSE} seqfile <- sampleQC(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot, format.data = "NGS", format.file = "vcf", QCreport = FALSE) save(seqfile, file="seqfile.RData") ``` This is an example when we directly use `SeqSQC` object as input, and evaluate all sample QC steps. ```{r wrapup_seqfile, eval=FALSE} load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) seqfile <- sampleQC(vfile = seqfile, output = "testWrapUp", QCreport = FALSE) save(seqfile, file="seqfile.Rdata") ``` Detailed descriptions of data structure and functionality usage of SeqSQC are provided as below. Users are recommended to use these specific QC functions for each QC step if they want to skip any of them by using `sampleQC()`. # Data preparation ## Input data SeqSQC takes VCF file as input. ```{r loadVfile, eval=TRUE, message=FALSE} seqfile <- LoadVfile(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot) ``` ## SeqSQC class We define an object class `SeqSQC` to store the genotype data, variant / sample annotation data, and the sample QC results. ```{r show} load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC")) gfile <- system.file("extdata", "example.gds", package="SeqSQC") seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile)) slotNames(seqfile) ``` A SeqSQC object is a list of two objects. The first object is the filepath of the GDS (discussed in section below) file which stores the genotype information from the original VCF file. ```{r class} gdsfile(seqfile) ``` The second object is a list of sample information and QC results, which include the dimension (# of samples and variants), sample annotation, and QC results for sample missing rate, sex check, inbreeding outlier check, IBD check, and population outlier check. ```{r class2} QCresult(seqfile) head(QCresult(seqfile)$sample.annot) ``` ## GDS class The genotype information and variant annotation from the input VCF file will be stored in Genomic Data Structure (GDS) format (`r Biocpkg("gdsfmt")`). Compared to VCF format, the GDS format could increase the storage efficiency by 5 fold and data access speed by 2-3 fold. We only include the bi-allelic single nucleotide variants (SNVs) from the VCF input for sample QC analysis. Other information from the VCF file, including the chromosome, position, snp rs id, reference allele, alternative allele, quality score and filter can also be passed into the gds file. The functions `SeqOpen` and `closefn.gds` can be used to open and close the gds file in `SeqSQC` object. It is recommended to close the gds file once it has been opened. ```{r gds} dat <- SeqOpen(seqfile) dat closefn.gds(dat) ``` # Standard workflow ## Sample missing rate check Samples with missing rate > 0.1 are considered problematic. The function `MissingRate` and `plotQC(QCstep = "MissingRate")` calculate and plot the sample missing rate respectively. The result from sample missing rate check contains three columns: sample name, sample missing rate, and an indicator of whether the sample has a missing rate greater than 0.1. The value of the `outlier` column is set to NA for benchmark samples. When running the QC process through the wrap-up function `sampleQC`, problematic samples identified in each QC step are automatically remove before getting to the next step. However when a QC step is executed separately, users need to specify the list of problematic samples to be removed using the `remove.samples` option. ```{r missingrate, eval=FALSE} seqfile <- MissingRate(seqfile, remove.samples=NULL) ``` ```{r mrresult, eval=TRUE} res.mr <- QCresult(seqfile)$MissingRate tail(res.mr) ``` The function `plotQC(QCstep = "MissingRate")` generates the plot for the sample missing rate, where the problematic samples with missing rate greater than 0.1 are highlighted in the plot. The default plot generated is not interactive. Users can generate the interactive plot in each QC step by specifying the `interactive` argument to be TRUE. The interactive plot allows users to visually inspect the QC result by putting the cursor on samples of interest. ```{r plot.mr} plotQC(seqfile, QCstep = "MissingRate") ``` In this plot, all five samples in the study cohort have missing rate of zero. ## Sex check After filtering out the pseudo-autosomal region in X chromosome, we calculate the sample inbreeding coefficient with variants on X chromosome for all samples in the study cohort and for benchmark samples of the same population as the study cohort. The function `SexCheck` predicts the sample gender and `plotQC(QCstep = "SexCheck")` draws the plot for X chromosome inbreeding coefficients. The result contains sample name, reported gender, X chromosome inbreeding coefficient, and predicted gender. Samples are predicted to be female or male if the inbreeding coefficient is below 0.2, or greater than 0.8. The samples with discordant reported gender and predicted gender are considered as problematic. When the inbreeding coefficient is within the range of [0.2, 0.8], "0" will be shown in the column of `pred.sex` to indicate ambiguous gender, which won't be considered as problematic. ```{r sexcheck, eval=FALSE} seqfile <- SexCheck(seqfile, remove.samples=NULL) ``` ```{r scresult, eval=TRUE} res.sexc <- QCresult(seqfile)$SexCheck tail(res.sexc) ``` The function `plotQC(QCstep = "SexCheck")` generates the plot for the inbreeding coefficient on X chromosome where samples are labeled with different color according to their self-reported gender. If there is any sample detected to be gender mismatched by SeqSQC, it will be highlighted with a red circle. ```{r plot.sexc} plotQC(seqfile, QCstep = "SexCheck") ``` In this plot, none of the five samples in the study cohort have dis-concordant self-reported and predicted genders. ## Inbreeding check Using LD-pruned autosomal variants, we calculate the inbreeding coefficient for each sample in the study cohort and for benchmark samples of the same population as the study cohort. Samples with inbreeding coefficients that are five standard deviations beyond the mean are considered problematic. The function `Inbreeding` and `plotQC(QCstep = "Inbreeding")` calculates and plots the inbreeding coefficients respectively. The result contains sample name, inbreeding coefficient, and an indicator of whether the inbreeding coefficient is five standard deviation beyond the mean. For Benchmark samples the indicator column is set to be "NA". ```{r inbreeding, eval=FALSE} seqfile <- Inbreeding(seqfile, remove.samples=NULL) ``` ```{r inbresult, eval=TRUE} res.inb <- QCresult(seqfile)$Inbreeding tail(res.inb) ``` The function `plotQC(QCstep = "Inbreeding")` generates the plot for the inbreeding coefficient. Problematic samples with extreme inbreeding coefficients will be highlighted in the plot. ```{r plot.inb} plotQC(seqfile, QCstep = "Inbreeding") ``` In this plot, none of the five samples in the study cohort have extreme inbreeding coefficients. ## IBD check Using LD-pruned autosomal variants, we calculate the IBD coefficients for all sample pairs. we then predict related sample pairs in study cohort by using the support vector machine (SVM) method with linear kernel and the known relatedness embedded in benchmark data as training set. All predicted related pairs are also required to have coefficient of kinship ≥ 0.08. The sample with higher missing rate in each related pair is selected for removal from further analysis. The function `IBD` calculates the IBD coefficients for each sample pair and predicts the relatedness for samples in the study cohort. The function `plotQC(QCstep = "IBD")` draws the descent coefficients, K0 and K1, for each pair. The result contains sample names, the descent coefficients k0, k1 and kinship, self-reported relationship and predicted relationship for each pair of samples. Sample pairs with discordant self-reported and predicted relationship are considered as problematic. ```{r ibd, eval=FALSE} seqfile <- IBD(seqfile, remove.samples=NULL) ``` ```{r ibdresult, eval=TRUE} res.ibd <- QCresult(seqfile)$IBD head(res.ibd) ``` The function `plotQC(QCstep = "IBD")` draws the descent coefficients k0 and k1 for each sample pair. The relationship for each sample pair are labelled by different colors. If there is any problematic sample pair detected to be cryptically related by SeqSQC, it will be highlighted with a red circle. ```{r plot.ibd} plotQC(seqfile, QCstep = "IBD") ``` In this plot, none of the five samples in the study cohort have cryptic relationship with each other. ## Population outlier check Using LD-pruned autosomal variants we calculate the eigenvectors and eigenvalues for principle component analysis (PCA). We use the benchmark samples as training dataset, and predict the population group for each sample in the study cohort using the top four principle components. Samples with discordant predicted and self-reported population groups are considered problematic. The function `PCA` performs the PCA analysis and identifies population outliers in study cohort. The result contains sample name, reported population, data resource (benchmark / study cohort), the first four eigenvectors and predicted population. In the example data, we identified one population outlier. The sample `HG02585` was reported as EUR but predicted to be AFR. `HG02585` is indeed a AFR sample. We put it as an intended population outlier in the example data. ```{r pca, eval=FALSE} seqfile <- PCA(seqfile, remove.samples=NULL) ``` ```{r pcaresult, eval=TRUE} res.pca <- QCresult(seqfile)$PCA tail(res.pca) ``` The function `plotQC(QCstep = "PCA")` generates the plot of first two principle components for each sample. Samples with different population are labelled by different colors. If there is any population outlier detected by SeqSQC, it will be highlighted with a red circle. ```{r plot.pca} plotQC(seqfile, QCstep = "PCA") ``` In the following interactive plot, the intended population outlier `HG02585` was grouped closer to the AFR benchmark samples, and far from the other four EUR samples. Users could easily pick it out and remove it from downstream analysis. ```{r plot.pca.inter, eval=TRUE, warning=FALSE} plotQC(seqfile, QCstep = "PCA", interactive=TRUE) ``` # Summary of QC results ## Problematic sample list The SeqSQC function `ProblemList` generates a data frame including all problematic samples, and the reasons for removal recommendation. We recommend users to execute this function after finishing all QC steps in **standard workflow**, so that they can get a full list of problematic samples. ```{r problist} problemList(seqfile) save(seqfile, file="seqfile.Rdata") ``` ## reporting of results After finish all or some of the QC steps, the users can use `RenderReport` to generate the report in html format with optional interactive plots. ```{r report, eval=FALSE} RenderReport(seqfile, output="report.html", interactive=TRUE) ``` # How to get help for SeqSQC Any SeqSQC question can be posted to the *Bioconductor support site*, which serves as a searchable knowledge base of questions and answers: Posting a question and tagging with “SeqSQC” will automatically send an alert to the package authors to respond on the support site. # Session info ```{r} sessionInfo() ```