%\VignetteIndexEntry{An introduction to SimFFPE} %\VignetteDepends{} %\VignetteKeywords{Read Simulation, FFPE tissue, NGS} %\VignettePackage{SimFFPE} \documentclass{article} <>= BiocStyle::latex() @ \title{An Introduction to \Biocpkg{SimFFPE}} \author{Lanying Wei} \date{Modified: Nov 11, 2020. Compiled: \today} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents <>= options(width=60) @ \newcommand{\bam}{\texttt{BAM}} \newcommand{\fasta}{\texttt{FASTA}} \newcommand{\phred}{\texttt{Phred score profile}} \section{Introduction} The NGS (Next-Generation Sequencing) reads from FFPE (Formalin-Fixed Paraffin-Embedded) samples contain numerous artificial chimeric reads, which can lead to false positive structural variant calls. These ACRs are derived from the combination of two single-stranded DNA (ss-DNA) fragments with short reverse complementary regions (SRCR). The combined ss-DNA may come from adjacent or distant genomic regions. The \Biocpkg{SimFFPE} package simulates these artificial reads as well as normal reads for FFPE samples. The simulation can cover whole genome, or several chromosomes, or large regions, or whole exome, or targeted regions. It also supports enzymatic / random fragmentation and paired-end / single-end sequencing simulations.Fine-tuning can be achieved by adjusting the parameters, and multi-threading is surported. \section{Input} The essential inputs for the simulation include a \fasta{} file of the reference genome, a \phred{} matrix to simulate Phred scores based on the position on the reads, and a DataFrame or GenomicRanges object representing the target regions (optional). The \phred{} can be estimated from existing \bam{} files using \Rfunction{calcPhredScoreProfile} function (two available examples for \phred{} are stored in the 'extdata' directory of \Biocpkg{SimFFPE} package). <>= library(SimFFPE) bamFilePath <- system.file("extdata", "example.bam", package = "SimFFPE") regionPath <- system.file("extdata", "regionsBam.txt", package = "SimFFPE") regions <- read.table(regionPath) PhredScoreProfile <- calcPhredScoreProfile(bamFilePath, targetRegions = regions) ## Example Phred score profile with 100 read length PhredScoreProfilePath <- system.file("extdata", "PhredScoreProfile1.txt", package = "SimFFPE") PhredScoreProfile <- as.matrix(read.table(PhredScoreProfilePath, skip = 1)) colnames(PhredScoreProfile) <- strsplit(readLines(PhredScoreProfilePath)[1], "\t")[[1]] # ## Example Phred score profile with 150 read length PhredScoreProfilePath2 <- system.file("extdata", "PhredScoreProfile2.txt", package = "SimFFPE") PhredScoreProfile2 <- as.matrix(read.table(PhredScoreProfilePath2, skip = 1)) colnames(PhredScoreProfile2) <- strsplit(readLines(PhredScoreProfilePath2)[1], "\t")[[1]] @ The \fasta{} file of reference genome can be read in as \Rclass{DNAStringSet} with \Rfunction{readDNAStringSet} function from \Biocpkg{Biostrings} package. The reference genome example file consists of small regions of 24 chromosomes from human hg19 reference genome. <>= referencePath <- system.file("extdata", "example.fasta", package = "SimFFPE") reference <- readDNAStringSet(referencePath) reference @ %% To simulate reads of certain regions, a DataFrame or GenomicRanges object representing the target regions is required (not required when simulating reads on the whole genome / several chromosomes / large regions). The DataFrame representing the target regions should have three columns, which indicate chromosomes, start positions and end positions respectively (one-based coordinate). <>= regionPath <- system.file("extdata", "regionsSim.txt", package = "SimFFPE") targetRegions <- read.table(regionPath) @ %% \section{Simulation} The simulation includes three steps: 1) Simulate artificial chimeric reads derived from the combination of two ss-DNA segments from adjacent regions on the chromosome. 2) Simulate artificial chimeric reads derived from the combination of two ss-DNA from distant regions (distant regions on the same chromosome, or any regions on different chromosomes). 3) Simulate reads derived from normal sequences. You can also skip any of these steps in the simulation. There are two functions which can be used for the simulation: \Rfunction{readSimFFPE} and \Rfunction{targetReadSimFFPE}. \subsection{Read simulation functions} To simulate reads on whole genome, or several chromosomes, or large regions, please use the \Rfunction{readSimFFPE} function: <>= ## Simulate reads of the first three sequences of the reference genome sourceSeq <- reference[1:3] outFile1 <- paste0(tempdir(), "/sim1") readSimFFPE(sourceSeq, referencePath, PhredScoreProfile2, outFile1, overWrite = TRUE, coverage = 80, readLen = 150, enzymeCut = TRUE, threads = 2) # ## Simulate reads of defined regions on the first two sequences of ## the reference genome sourceSeq2 <- DNAStringSet(lapply(reference[1:2], function(x) x[1:10000])) outFile2 <- paste0(tempdir(), "/sim2") readSimFFPE(sourceSeq2, referencePath, PhredScoreProfile2, outFile2, overWrite = TRUE, coverage = 80, readLen = 150, enzymeCut = TRUE) @ %% To simulate reads on whole exome or targeted regions please use the \Rfunction{targetReadSimFFPE} function: <>= outFile3 <- paste0(tempdir(), "/sim3") targetReadSimFFPE(referencePath, PhredScoreProfile, targetRegions, outFile3, coverage = 120, readLen = 100, meanInsertLen = 180, sdInsertLen = 50, enzymeCut = FALSE) @ %% Additional information can be found on the help pages for the \Rfunction{readSimFFPE} function and the \Rfunction{targetReadSimFFPE} function. \subsection{Fine-tuning of the simulation} Fine-tuning of the simulation is achievable by the adjustments of some parameters of the function \Rfunction{readSimFFPE} and \Rfunction{targetReadSimFFPE}. You can simulate reads in smaller regions during fine-tuning to save the runtime. To illustrate the impact of some of these parameters, screenshots from IGV tools are used (see Figure \ref{fig:1}, \ref{fig:2} and \ref{fig:3}). These parameters include: 1) enzymeCut: Simulate enzymatic fragmentation. With this fragmentation method, chimeric read pairs with improper pair orientations might be mapped to exactly the same location on the reference genome (see Figure \ref{fig:1}). 2) chimericProp: Proportion of artificial chimeric fragments. The higher the value, the greater the proportion of improper paired reads as shown in Figure \ref{fig:1} and \ref{fig:2}, the smaller the proportion of proper paired reads as shown in Figure \ref{fig:3}, and the larger the proportion of proper paired reads with soft-clips as shown in Figure \ref{fig:3}. 3) sameChrProp: Proportion of artifact chimeric fragments that are derived from the combination of two ss-DNA coming from the same chromosome. The higher the value, the greater the proportion of reads with improper pair orientations as shown in Figure \ref{fig:1}, and the smaller the proportion of reads with their mates mapped to different chromosomes as shown in Figure \ref{fig:2}. 4) adjChimProp: Proportion of adjacent ss-DNA combinations among same chromosomal ss-DNA combinations. sameChrProp * adjChimProp determine the proportion of simulated adjcent ss-DNA combinations. 5) sameStrandProb: Proportion of same-strand ss-DNA combinations among adjacent ss-DNA combinations. The higher the value, the greater the proportion of reads with RR and LL orientations in all reads with improper pair orientations as shown in Figure \ref{fig:1}. 6) spikeWidth: The width of chimeric read spike used in the simulation of distant ss-DNA combinations. As shown in Figure \ref{fig:3}, some regions are enriched in reads with paired reads mapped to other chromosomes, and some others are scarce. The lengths of these regions are of similar scale, and the parameter "spikeWidth" is used to simulate this length. 7) highNoiseRate and highNoiseProb: The noise rate for each base in noisy reads and the proportion of these noisy reads. These very noisy reads as well as less noisy reads are shown in Figure \ref{fig:3}. \begin{figure*} \includegraphics[width=1\textwidth]{./ex1.png} \caption{\label{fig:1} Simulated reads with improper pair orientations. These read pairs are mapped in RR, LL, or RL orientations. Black-framed reads are paired reads that are mapped to the same position (enzymatic fragmentation).} \end{figure*} \begin{figure*} \includegraphics[width=1\textwidth]{./ex2.png} \caption{\label{fig:2} Simulated reads with mate reads mapped to different chromosomes. The different read colors indicate the different chromosomes that their mate reads mapped to.} \end{figure*} \begin{figure*} \includegraphics[width=1\textwidth]{./ex3.png} \caption{\label{fig:3} Simulated reads mapped in proper pair. Mismatches and soft-clips are shown as colored vertical lines in reads.} \end{figure*} \appendix \section{Session info} <>= packageDescription("SimFFPE") sessionInfo() @ \end{document}