--- title: "Tutorial of the SARC R Package. Flagging the confidence of CNVs from WES/ WGS cohorts." output: rmarkdown::html_document: keep_tex: true toc: true toc_float: collapsed: true smooth_scroll: true number_sections: true toc_depth: 4 fig_width: 8 fig_height: 5 fig_caption: true df_print: kable highlight: tango citation_package: natbib self_contrained: false lib_dir: libs date: '2023-05-15' author: "by Krutik Patel, PhD" vignette: > %\VignetteIndexEntry{SARC} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # **SARC** ## **S**tatistical **A**nalysis of **R**egions with **C**NVs This tools was designed to evaluate Copy number variations (CNVs) which have been detected from CNV detection pipelines from NGS data. CNV detection pipelines from WES and (to a lesser extent) WGS, lead to high numbers of false-positives. The SARC package aims to flag high confidence and low confidence CNVs from such detection pipelines. This would aid in patient diagnostics and clinical work. Uniquely, the SARC package only requires a `coverage`/`cov` file and an `cnv` file. Both files should ideally be loaded as `data.frames`. The `cov` file can be created from many BAM files from WES/ WGS, and the read depth coverage of each sample should be normalized by library size of the sample prior to processing. The end of the vignette will point to how to perform read coverage reading and normalisation with examples. The `cnv` file is in many ways similar to a standard CNV `BED` file - as it serves as a list of CNVs identified in the samples. Unlike a traditional `BED` file, the `cnv` file should contain additional columns such as CNV value, type of CNV, genes and batch of data. Additionally, the column names of the `cnv` file should ideally include: SAMPLE, CHROM, START, END, TYPE, VALUE. The order of the columns is not necessary. Additional columns such as TOOL, BATCH and GENE can enhance plotting. ### **This is not a detection tool - it is a downstream ratification tool for use AFTER a CNV detection pipeline!** ```{r Load library, echo=TRUE, message=FALSE, warning=FALSE} #INSTALL BiocManager::install("SARC") #load library library(SARC) ``` ```{r Load more library, echo=FALSE, message=FALSE, warning=FALSE} #load additional packages for vignette building library(knitr) library(kableExtra) ``` ## Data set-up The cnv file should contain generic information at the minimum. Sample names (**SAMPLE**), Chromosome (*"CHROM"*), Start site (*"START"*), End sit (*"END"*), Deletion or Duplication (*"TYPE"*) and CNV value (*"VALUE"*) are required for the tool to function. Column names should be in all caps and match the names of the test_cnv example below. START, END and VALUE should be integers! ```{r test_cnv, echo=TRUE} data("test_cnv") #For speed just use a few detected CNVs test_cnv <- test_cnv[1:3,] head(test_cnv) %>% kable %>% kable_styling("striped", full_width=FALSE) %>% scroll_box(width = "800px", height = "200px") ``` The cov file should comprise of coverage from normalised BAM files. Additionally, it is good practice to separate cov files by the technology used to sequence the FASTQ files. Importantly, there will be four columns before the samples - ID, Chromosome, Start and End. ID, Start and End should be integers. The rest of the column names will be the names of the samples - and these should match the samples names found in the cnv file. ```{r test_cov, echo=TRUE} data("test_cov") head(test_cov[, 1:10]) %>% kable %>% kable_styling("striped", full_width=FALSE) %>% scroll_box(width = "800px", height = "200px") ``` **All samples in the cnv file should be found in the cov file.** But you can have samples in the cov file that are not found in the cnv file. The SARC package relies on `RaggedExperiments` objects to store the `cov` file as its assay and will store lists of `dataframes` and `Grange` objects within `metadata`. The `SARC` tool will generate more `cnv` dataframes with additional stats for each CNV in the list of CNVs. These `dataframes` will be stored within the first list stored in the `metadata`. For cohorts with multiple coverage files (from multiple sequencing platforms or meta-analyses studies), we recommend creating multiple `RaggedExperiments` objects, rather than combining them all into one. ```{r pressure, echo=TRUE} #Create a start site and end site for each CNV detected SARC <- regionSet(cnv = test_cnv, test_cov) #Create smaller coverage files for each CNV SARC <- regionSplit(RE = SARC, #the raggedexpression object we made cnv = metadata(SARC)[['CNVlist']][[1]], startlist = metadata(SARC)[[2]], #list of start sites, per CNV endlist = metadata(SARC)[[3]]) #list of end sites, per CNV ``` ## Statistical analysis First we use mean scores to check if the mean reads at this region matches the CNV values from a CNV detection tool. This will make a new cnv file which will be stored in the RE object. ```{r means, echo=TRUE} #Calculate the mean read coverage SARC <- regionMean(RE = SARC, cnv = test_cnv, splitcov = metadata(SARC)[[4]]) #list of cnv specific coverage ``` We then calculate the quantile distributions. We assume a true duplication will be on the higher end of the distribution (in contrast to the other samples) and true deletions will be on the lower end. Depending on the number of samples in the COV file - the thresholds should be altered. ```{r qd, echo=TRUE} #Calculate the distribution of the mean reads SARC <- regionQuantiles(RE = SARC, cnv = metadata(SARC)[['CNVlist']][[2]], #new cnv file meancov = metadata(SARC)[[5]], #list of cnv specific coverage + means q1 = 0.1, #lower threshold q2 = 0.9) #upper threshold ``` Anova can then be used to identify if a region with a suspected CNV has a significantly rare read depth at the region - in contrast to all other samples. The more samples, the more powerful this test is. ```{r anova, echo=TRUE} #Calculate rarity of each suspected CNV - in contrast to other samples SARC <- prepAnova(RE = SARC, start = metadata(SARC)[[2]], end = metadata(SARC)[[3]], cnv = metadata(SARC)[['CNVlist']][[3]]) #newest cnv dataframe SARC <- anovaOnCNV(RE = SARC, cnv = metadata(SARC)[['CNVlist']][[3]], anovacov = metadata(SARC)[[8]]) head(metadata(SARC)[['CNVlist']][[4]]) %>% kable %>% kable_styling("striped", full_width=FALSE) %>% scroll_box(width = "800px", height = "200px") ``` ## Plotting A major complaint of CNV analysis by diagnostic staff is the over-reliance on the Integrative Genome Browser (IGV). While a great tool, it can be a tedious task to search many hundreds of samples manually. The SARC package provides an alternative (but not a complete substitute) to visualize which genes and exons are being (or not being) effected by a CNV. This is also a good way of visualizing the false-positives quickly, without using IGV. ```{r set up plot, echo=TRUE, message=FALSE, warning=FALSE} #prepare new objects for the CNV plots to work library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(Homo.sapiens) #load genome specific libraries from BioConductor TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene tx <- transcriptsBy(Homo.sapiens, columns = "SYMBOL") txgene <- tx@unlistData SARC <- plotCovPrep(RE = SARC, cnv = metadata(SARC)[['CNVlist']][[4]], #newest cnv dataframe startlist = metadata(SARC)[[2]], endlist = metadata(SARC)[[3]], n1 = 0, #left-padding n2 = 0) #right-padding SARC <- regionGrangeMake(RE = SARC, covprepped = metadata(SARC)[[9]]) SARC <- addExonsGenes(RE = SARC, covgranges = metadata(SARC)[[10]], #list of grange objects - one for each detected CNV txdb = txdb, #Species specific database txgene = txgene) #genes/ exons from the database SARC <- setupCNVplot(RE = SARC, namedgranges = metadata(SARC)[[11]], #grange objects which have genes/ exons added covprepped = metadata(SARC)[[9]]) ``` If CNVs are very small and cannot to attributed to at least one known gene in the TxDB object they will be too difficult to process any further. A message will state which CNVs were not associated with any genes. These CNVs should be removed from the cnv file also as to not cause errors. Each detected CNV can be plotted. Automated plotting can be done quite simply with a for-loop - and is recommended. The samples with the detected CNV is highlighted with a thin purple line. The Sample with the detected CNV, the type of CNV and the value of the CNV from the detection pipeline are pasted as the subtitle. ```{r plot CNVs, echo=TRUE, warning=FALSE} #Calculate rarity of each suspected CNV - in contrast to other samples plotCNV(cnv = metadata(SARC)[['CNVlist']][[4]], setup = metadata(SARC)[[12]], FilteredCNV = 1) ``` ## Flagging Some clinicians would rather the confidence level flagged rather than completely removed - so SARC will not remove any CNVs, and this is left to the user. ```{r flag, echo=TRUE, eval = TRUE} #Use statistical analyses to flag CNVs as high or low confidence of being true SARC <- cnvConfidence(RE = SARC, cnv = metadata(SARC)[['CNVlist']][[4]]) head(metadata(SARC)[['CNVlist']][[5]]) %>% kable %>% kable_styling("striped", full_width=FALSE) %>% scroll_box(width = "800px", height = "200px") ``` ## Extras If the RaggedExperiment Object is confusing it can be traversed as so. ```{r check, echo=TRUE, eval = TRUE} #This will show all the dataframes (cnv) files made #Generally it is recommended to use the most recently made cnv file to keep all the additional columns print(SARC) #This will show all the list objects. Their names should roughly correlate with the names of the parameters in the functions. names(metadata(SARC)) ``` Plot distributions to see why some CNVs were false positives. The sample with the suspected CNV will be highlighted in red. In cases where many samples are present, *plotly=TRUE* may lead to a clearer visual. *sample* refers to the CNV - so can easily be looped. ```{r check distribution of reads, echo=TRUE, eval=TRUE} SARC <- setQDplot(RE = SARC, meancov = metadata(SARC)[[5]]) seeDist(meanList = metadata(SARC)[[13]], cnv = metadata(SARC)[['CNVlist']][[5]], sample = 1, plotly=FALSE) ``` Add all genes and exons the CNVs effect. This could be useful to identify if the variant contributes to a patients symptoms. ```{r exon numbers and gene names, echo=TRUE, eval=TRUE} SARC <- pasteExonsGenes(RE = SARC, setup = metadata(SARC)[[12]], #list of dataframes from the setupCNVplot function cnv = metadata(SARC)[['CNVlist']][[5]]) #cnv file to add an extra column to metadata(SARC)[['CNVlist']][[6]] %>% kable %>% kable_styling("striped", full_width=FALSE) %>% scroll_box(width = "800px", height = "200px") ``` A more powerful test is the Dunnet test. This contrasts the read-depths between the control samples (samples with suspected CNVs) and the test samples (samples without suspected CNVs) at the same region of the DNA. However this is very slow - and it is only recommended when there are few samples (<100) to test. ```{r dunnet, echo=TRUE, eval = FALSE} DNA <- phDunnetonCNV(RE = DNA, cnv = metadata(SARC)[['CNVlist']][[4]], anovacov = metadata(SARC)[[8]]) SARC <- cnvConfidence(MA = SARC, cnv = experiments(SARC)[[7]], ph = TRUE) ``` *plotCNV* can also hone into a single *GENE* of interest, so long as it matches the genes found via TxDB, can be made specific for a *BATCH* of WES/ WGS data, or sites where the DNA was sequenced, and be logged. For WES data, logging is rerecorded as across a short region of DNA, the reads can change greatly, based on the chemistry of the sequencer. For deletions, *log* 10 is helpful, for duplication's it can actually make them harder to see. Number of samples can also be adjusted for plotting. The plots can be automated easily in a loop. Trick - make one plot in Rstudio, keep it in the plotting pane, and adjust the height and width. All other plots made in a loop with *ggsave* will be made to the same height and width. Useful when making plots for papers. ```{r expanded plotCNV, echo=TRUE, eval = FALSE} data("test_cnv2") library(ggplot2) for (i in seq_len(nrow(test_cnv2))) { n <- paste0(cnv$SAMPLE[i], "_", cnv$GENE[i], "_", cnv$CHROM[i], "_", cnv$START[i], "_", cnv$END[i]) savename <- paste0(n, ".jpeg") print(i) plotCNV(cnv = cnv, setup = metadata(X)[[9]], FilteredCNV = i, batch = cnv$BATCH[i], gene = cnv$GENE[i]) ggplot2::ggsave(filename = savename) } ``` ### Alternate species Though this tool is designed for clinical work, it can be further applied for research projects which focus on other species. Simply change the txdb and species database. For example, for mouse based projects, follow the following steps. ```{r mouse, eval=TRUE} require(TxDb.Mmusculus.UCSC.mm10.knownGene) require(Mus.musculus) # load genome specific libraries from BioConductor #prepare new objects for the CNV plots to work TxDb(Mus.musculus) <- TxDb.Mmusculus.UCSC.mm10.knownGene txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene tx <- transcriptsBy(Mus.musculus, columns = "SYMBOL") txgene <- tx@unlistData ``` ### Obtaining Read Depth Using bioconductor package `GenomicAlignments` users can read the coverage of bam files. Here we give some examples on how to read bam files into R. Further detail is available in this page. ```{r read depths, eval=TRUE, message=FALSE} require(GenomicAlignments) bam <- system.file("extdata", "test.bam", package="SARC") coverage <- coverage(bam) coverage$chr1@values ``` ### Normalising by Library size It is highly recommended to normalize read depth by total library size of the sequencing file for more accurate analysis. Calculating RPM (reads per million) is a straight forward method for genomic data. Here is an example using raw read depth and total library sizes. ```{r CPM, eval=TRUE} load(system.file("extdata", "test_cov_raw.rda", package="SARC")) load(system.file("extdata", "librarySizes.rda", package="SARC")) normalised_counts <- t(t(test_cov[, 5:ncol(test_cov)]) / libsizes$PerMillion) normalised_counts <- cbind(test_cov[,1:4], normalised_counts) ``` ### Session Info ```{r session_info, echo=TRUE, eval = TRUE} sessionInfo() ```