%\VignetteIndexEntry{RSVSim: an R/Bioconductor package for the simulation of structural variations} %\VignettePackage{RSVSim} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \SweaveOpts{prefix.string=RSVSim} \begin{document} \title{RSVSim \\ an R/Bioconductor package for the simulation of structural variations} \author{Christoph Bartenhagen} \date{\today} \maketitle \tableofcontents <>= options(width=90) options(continue=" ") @ \section{Introduction} The simulation of structural variations (SV) is an important measure to assess the performance of algorithms dealing with SVs and their detection and can help with the design of sequencing experiments. A simulation generates a base exact ground truth, which can be used to test the sensitivity and precision of SV callers. \\ A FASTA-file with the simulated, rearranged genome can be used by common, published read simulators (like \cite{art}), \cite{pirs}) to generate NGS datasets from various platforms that can then be used to asses an SV algorithm . A typical workflow consists of \begin{center} SV simulation $\Rightarrow$ (Paired-End) Read simulation $\Rightarrow$ SV algorithm $\Rightarrow$ Evaluation\\ \end{center} Varying parameters of the SV simulation like SV type, size or location and of the read simulator like number of reads (coverage), insert-size (for paired-end) or read length can give helpful information for future sequencing experiment designs. \\ This package addresses the very first step of SV simulation and provides the following features: \\ \begin{itemize} \item Simulation of deletions, insertions, inversions, tandem duplications and translocations (balanced and unbalanced) of various sizes \item Rearrangement of the human genome (hg19) by default or any other kind of genome available as FASTA file or \Rpackage{BSgenome} package \item Non-overlapping positioning of SV breakpoints within the whole genome or only a subset (e.g. coding, non-coding or low-complexity regions) \item Implementation of, e.g. previously detected or known, SVs at user-supplied coordinates \item Uniform distribution of SV breakpoints or simulation of biases towards repeat regions and regions of high homology according to different SV formation mechanisms (for hg19 only) \item Simulation of smaller mutations (SNPs and indels) close to the SV breakpoint \item Estimation of SV size distribution from real datasets \item Comparison of SV simulation with results from SV detection algorithms \end{itemize} \subsection{Loading the package} After installation, the package can be loaded into R by typing <>= library(RSVSim) @ into the R console. \\ \Rpackage{RSVSim} requires the R-packages \Rpackage{Biostrings}, \Rpackage{IRanges}, \Rpackage{GenomicRanges} and \Rpackage{ShortRead}. Mainly for efficient and convenient storing and access of sequences and genomic coordinates. The packages \Rpackage{BSgenome.Hsapiens.UCSC.hg19}, \Rpackage{GenomicFeatures}, \Rpackage{rtracklayer} and \Rpackage{MASS} are suggested for certain functionalities. \section{Structural variation simulation} The main function for simulation is called \Rpackage{simulateSV}. The simulation works pretty similar for every different SV type by specifying number and size of the variation(s) and (optionally) the regions, where to place the variation (randomly or not). The size can be either one value for every SV type or a vector of values for every single SV. \\ The following sections give a short example for every SV type using a simple toy example with two chromosomes of 40bp each: <>= genome = DNAStringSet( c("AAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTT", "GGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC")) names(genome) = c("chr1","chr2") genome @ The genome has to be a named \Robject{DNAStringSet} or a filename that points to a FASTA-file saved somewhere on disk. By default, when omitting the genome parameter, \Rfunction{simulateSV} will load the human genome (hg19) automatically. This requires an installation of the R-package \Rpackage{BSgenome.Hsapiens.UCSC.hg19}. \\ When using other \Rpackage{BSgenome} packages, it is recommended to extract the desired sequences first and combine them into a named \Rcode{DNAStringSet}. For example, the preparation of the genome of an Ecoli strain (str. 536) would look like: <>= library(BSgenome.Ecoli.NCBI.20080805) genome = DNAStringSet(Ecoli[["NC_008253"]]) names(genome) = "NC_008253" @ Each implemented SV will be reported with its position in the "normal" reference genome, and the breakpoint sequences. The rearranged genome is returned as \Robject{DNAStringSet} and the SV information is stored in its \Rcode{metadata} slot as a named \Robject{list} of \Rcode{data.frames}. All this can also be written to disk by specifying an \Rfunarg{output} directory (which is the current directory by default). The SV tables are saved as CSV files (called deletions.csv, insertions.csv etc.) and the genome in FASTA format (genome\_rearranged.fasta). \\ Note, that the seeds for the randomizations in the following examples were set for demonstration purposes only. The \Rfunarg{seed} parameter can be omitted or used to reproduce the same simulation several times. \\ The parameter \Rfunarg{output} is set to \Rcode{NA} in all examples to avoid wrtiting the output to disc. The parameter \Rfunarg{verbose} is set to \Rcode{FALSE} to suppress progress information about the simulation (which is enabled by default). \subsection{Deletions} A segment is cut out from the genome. The following example generates three deletions of 10bp each: <>= sim = simulateSV(output=NA, genome=genome, dels=3, sizeDels=10, bpSeqSize=6, seed=456, verbose=FALSE) sim metadata(sim) @ Chromosome 2, which harbours two deletions, is now 10bp shorter than chromosome 1. The breakpoint sequence of 6bp shows the 3bp up- and downstream of the deletion breakpoint in the rearranged genome. \subsection{Insertions} A segment is cut or copied from one chromosome A and inserted into another chromosome B. The following example generates three insertions of 5bp each: <>= sim = simulateSV(output=NA, genome=genome, ins=3, sizeIns=5, bpSeqSize=6, seed=246, verbose=FALSE) sim metadata(sim) @ Regarding insertion\_1, for example, the 5bp segment \Rcode{AAAAA} has been removed from chr1:14-18 and inserted into chr2:19-23. There are three breakpoint sequences reported for each insertion: the sequence at the deletion on chrA and at the 5' and 3' end of its insertion on chrB. \\ Setting the parameter \Rfunarg{percCopiedIns} (range: 0-1, i.e. 0\%-100\%) can change the amount of "copy-and-paste-like" insertions. \\ The same example as before, with the difference that two of the three inserted sequences are copied: <>= sim = simulateSV(output=NA, genome=genome, ins=3, sizeIns=5, percCopiedIns=0.66, bpSeqSize=6, seed=246, verbose=FALSE) sim metadata(sim) @ The same sequence \Rcode{AAAAA} from insertion\_1 is now duplicated before insertion into chr2:19-23. Here, no breakpoint sequence is reported for the region on chr1, since this chromosome is not altered. \subsection{Inversions} A segment is cut from one chromosome and its reverse complement is inserted at the same place without loss or a shift of sequence. The example below assigns a different size for each inversion: <>= sim = simulateSV(output=NA, genome=genome, invs=3, sizeInvs=c(2,4,6), bpSeqSize=6, seed=456, verbose=FALSE) sim metadata(sim) @ Inversions have two breakpoint sequences, one for the 5' end and one for the 3' end of the inverted segment. \subsection{Tandem duplications} A segment is duplicated one after the other. The number of duplications is determined randomly. The parameter \Rfunarg{maxDups} sets the maximum. The following example generates an, at most, tenfold tandem duplication of a 6bp sequence: <>= sim = simulateSV(output=NA, genome=genome, dups=1, sizeDups=6, maxDups=3, bpSeqSize=6, seed=3456, verbose=FALSE) sim metadata(sim) @ Here, the breakpoint sequence is the sequence at the end of one duplicated segment and the start of the following one. In this example the duplicated sequence is \Rcode{AAATTT} and it has been repeated another two times. \subsection{Translocations} A segment from the 5' or 3' end of one chromosome A is exchanged with the 5' or 3' end of another chromosome B. If it is not balanced, the segment from chromosome B will be lost, what results in a duplicated sequence from chromosome A. The parameter \Rfunarg{percBalancedTrans} sets the amount of balanced translocation (0-1, i.e. 0\%-100\%); by default, all translocations will be balanced. Segments which are translocated between two different ends (5'$\leftrightarrow$'3' of 3'$\leftrightarrow$5') are always inverted. After random generation of the breakpoint, the translocation spans the chromosome until the closest of both ends (which may include the centromere in the human genome). <>= sim = simulateSV(output=NA, genome=genome,trans=1, bpSeqSize=6, seed=123, verbose=FALSE) sim metadata(sim) @ This example exchanges the last 3bp of chromosome 2 with the first 19bp of chromosome 1. Both sequences were inverted.\\ The same example in an unbalanced fashion: <>= sim = simulateSV(output=NA, genome=genome,trans=1, percBalancedTrans=0, bpSeqSize=6, seed=123, verbose=FALSE) sim metadata(sim) @ The sequence from chr1:1-19 is lost, while there are now two copies of the translocated (inverted) segment from chr2:38-40. The breakpoint sequence for chr2 (denoted by chrA in general) is not reported, since this chromosome is not altered. \section{Simulation of biases towards SV formation mechanisms and repeat regions} By default, the SV breakpoints are placed uniformly across the genome. Several studies have shown, that structural variation formation in the human genome is not a random process but rather the result of mechanisms such as nonallelic homologous recombination (NAHR), nonhomologous recombination (NHR), variable number of tandem repeats (VNTRs) and transposable element insertions (TEIs) (\cite{1000genomes}, \cite{svformation2}). These mechanisms can be further associated with repeat elements and regions of high homology such as LINEs, SINEs, Micro-/Minisatellites and segmental duplications (\cite{svformation1}). \\ Using the default genome hg19 and setting \Rfunarg{repeatBias} to \Rcode{TRUE}, \Rpackage{RSVSim} simulates a bias of breakpoint positioning towards repeat regions. This is done in two steps: \begin{enumerate} \item Weighting SV formation mechanisms (here: NAHR, NHR, VNTR, TEI, Other) for each SV type. The type "Other" can be used for any individual weighting of repeats; by default, "Other" is associated with a random breakpoint. \item Weighting each SV formation mechanism for each kind of repeat. The following types of repeat regions are supported: LINE/L1, LINE/L2, SINE/Alu, SINE/MIR, segmental duplications (SD), tandem repeats (TR; mainly micro-/minisatellites) and Random. The latter, "Random", means any region on the genome. \end{enumerate} For the mechanism NAHR, both breakpoints will lie within a repeat region (with at least 50bp distance to the repeat margins), while for NHR, VNTR, TEI and Other, the repeat must make up for at least 75\% of the SV region. \\ This feature is turned off automatically, when the user specifies his own genome (i.e. any genome other than hg19). \\ The default weights for SV mechanisms for deletions, insertions an duplications are based on figure 4b in \cite{1000genomes}. The weights for inversions refer to figure 3c in \cite{svformation2}. The mechanisms and breakpoint sequences of translocations have not been studied as extensively as for other kinds of SVs. The default weights for translocations were chosen according to some exemplary publications (\cite{svformation3}, \cite{svformation4}), so that NAHR, NHR and random breakpoint positioning contribute equally. In all cases, the results for SVs $>$1.000bp were used. The exact weights are: <>= data(weightsMechanisms, package="RSVSim") show(weightsMechanisms) @ The default weights for repeat regions for every SV mechanism were based on the enrichment analysis in \cite{svformation1} (see their supplemental table 5). The exact values are: <>= data(weightsRepeats, package="RSVSim") show(weightsRepeats) @ The user may provide other weights by passing his own \Rcode{data.frames}, using the function arguments \Rfunarg{weightsMechanisms} and \Rfunarg{weightsRepeats}. The structure of the \Rcode{data.frames} has to be identical to the default ones shown above (i.e. same dimensions, column and row names). The effect of the weights is comparable to the \Rfunarg{prob} argument in the R function \Rfunction{sample}. \\ For example <>= weightsRepeats = data.frame( NAHR = c(0,0,3,1,0,0,0), NHR = c(1.04,0.62,1.16,0,2.06,0,0), TEI = c(1.66,0.25,0,0,0,0,0), VNTR = c(0,0,0,0,0,1,0), Other = c(0,0,0,0,0,0,1) ) rownames(weightsRepeats) = c("L1","L2","Alu","MIR","SD","TR","Random") sim = simulateSV(output=NA, dels=10, repeatBias=TRUE, weightsRepeats=weightsRepeats, verbose=FALSE) @ would exclude tandem repeats, segmental duplications and random regions from the simulation (except for VNTRs) by setting their weights to zero for all mechanisms. NAHRs, would be related to SINEs only. For \Rfunarg{weightsMechanisms}, the default values will be used, because the argument is missing here. Note, that \Rcode{repeatBias=TRUE} has to be set to use this feature. \\ This feature requires the coordinates of repeat regions for hg19. This can be handled in two ways: \begin{itemize} \item \Rpackage{RSVSim} downloads the coordinates once automatically from the UCSC Browser's RepeatMasker track (which may take up to 45 Minutes!). \item The user may specify the filename of a RepeatMasker output file downloaded from their homepage (\cite{repeatmasker}): http://www.repeatmasker.org/species/homSap.html (e.g. hg19.fa.out.gz). Loading this file takes only a few minutes. \end{itemize} In both cases, RSVSim saves the coordinates as \Rcode{RData} object \Rcode{repeats\_hg19.RData} to the RSVSim installation directory for a faster access in the future (if write privileges allow to do so). After that, one of the two steps mentioned above is not necessary anymore and next time, \Rpackage{RSVSim} is going to load the coordinates automatically from the \Rcode{RData} file. \\ When loading the repeats, neighboured ones with a distance up to 50bp will be merged, to obtain larger repeat regions and to allow SVs to span more than one repeat. But, breakpoints will only be placed within repeats of the same type (e.g. LINE/L1-LINE/L1, or SINE/MIR-SINE/MIR etc.). \section{Simulation of additional breakpoint mutations} SV breakpoints are usually not clean but tend to co-occur with other, usually smaller mutations, such as indels or SNPs. \Rpackage{RSVSim} allows to randomly generate additional SNPs and indels within the flanking regions of each breakpoint. Their generation can be configured by the four arguments \Rfunarg{bpFlankSize}, \Rfunarg{percSNPs}, \Rfunarg{indelProb} and \Rfunarg{maxIndelSize}, which specify the size of the flanking regions in bp (i.e. proximity of the mutations to the breakpoint), the fraction on SNPs (in \%), the probability of an indel (insertions and deletions are equally likely) and the maximum size of an indel (size is selected randomly between 1 and \Rfunarg{maxIndelSize}). Each flanking region may only contain one indel. SNPs can affect 0-100\% of the region. By default, this feature is turned off. \\ The following example creates one deletion with 10\% SNPs and 100\% indel probability within 10bp up-/downstream of the breakpoint: <>= sim = simulateSV(output=NA, genome=genome, dels=1, sizeDels=5, bpFlankSize=10, percSNPs=0.25, indelProb=1, maxIndelSize=3, bpSeqSize=10, seed=123, verbose=FALSE) sim metadata(sim) @ In addition to the 5bp deletion of the sequence \Rcode{CCCCC}, two SNPs \Rcode{C$->$T} and \Rcode{C$->$A}, and a deletion of two more \Rcode{C}s were added up- and downstream of the breakpoint. \section{Simulation within a genome subset} It is possible to run the simulation to certain chromosomes only by specifying the chromosome names in the parameter \Rfunarg{chrs}. It has to be taken care that these chromosome names match the names in the \Robject{DNAStringSet} containing the genome sequences (e.g. "chr1", "chr2", ..., "chrY" for the default genome hg19 from the package \Rpackage{BSgenome.Hsapiens.UCSC.hg19}). \\ Furthermore, every SV has it's own parameter to restrict the simulation to a desired set of genomic regions: \Rfunarg{regionsDels}, \Rfunarg{regionsIns}, \Rfunarg{regionsInvs}, \Rfunarg{regionsDups} and \Rfunarg{regionsTrans}. Each one being a \Robject{GRanges} object with a chromosome name, start- and end-position. \\ The following example places randomly four inversions into the second half of chr1 and the first half of chr2: <>= regions = GRanges(IRanges(c(21,1),c(40,20)), seqnames=c("chr1","chr2")) regions sim = simulateSV(output=NA, genome=genome, invs=4, sizeInvs=5, regionsInvs=regions, bpSeqSize=6, seed=2345, verbose=FALSE) sim metadata(sim) @ For translocations, the regions only say where to place the breakpoint, since the translocated region spans the chromosome until the closest of both ends. \\ Some applications may focus on certain parts of the hg19 only, like exons, introns or transcripts. The package \Rpackage{GenomicFeatures} provides functionalities to export such coordinates from the UCSC Genome Browser to R (see for example \Rfunction{makeTxDbFromUCSC}, \Rfunction{exonsBy}, \Rfunction{intronsBy}, \Rfunction{transcriptsBy}). In the following example, 100 deletions would be placed somewhere in the exonic regions on hg19: <>= transcriptDB = makeTxDbFromUCSC(genome = "hg19",tablename = "knownGene") exons = exonsBy(transcriptDB) exons = unlist(exons) exons = GRanges(IRanges(start=start(exons), end=end(exons)), seqnames=seqnames(exons)) simulateSV(output=NA, dels=100, regionsDels=exons, sizeDels=1000, bpSeqSize=50) @ SVs will not be placed within unknown regions or assembly gaps denoted by the letter \Rcode{N}. Such regions are detected and filtered automatically. \subsection{Inserting a set of SVs} The simulation allows to turn off the random generation of breakpoints and to insert a set of (for example previously detected or known) SVs. It works by using the same regions parameters and setting the parameter \Rfunarg{random} to \Rcode{FALSE}. This may also be a vector of five \Rcode{TRUE}/\Rcode{FALSE} values (in the order: deletions, insertions, inversions, tandem duplications, translocations) if some SVs shall be generated randomly and others not. \\ The example below inserts a deletion at chr2:16-25: <>= knownDeletion = GRanges(IRanges(16,25), seqnames="chr2") names(knownDeletion) = "myDeletion" knownDeletion sim = simulateSV(output=NA, genome=genome, regionsDels=knownDeletion, bpSeqSize=10, random=FALSE, verbose=FALSE) sim metadata(sim) @ Note, that the output adopts the names, that were given the \Robject{GRanges} object of the inserted SV(s). \\ It's a little different for insertions and translocations, since they involve two genomic regions. Thus, the \Robject{GRanges} object for \Rfunarg{regionsIns} has to be extended by columns \Rcode{chrB} and \Rcode{startB}, saying, that the sequence within ranges of the \Robject{GRanges} object will be inserted at chrB:startB. \\ The next example inserts the sequence from chr1:16:25 at chr2:26: <>= knownInsertion = GRanges(IRanges(16,25),seqnames="chr1", chrB="chr2", startB=26) names(knownInsertion) = "myInsertion" knownInsertion sim = simulateSV(output=NA, genome=genome, regionsIns=knownInsertion, bpSeqSize=10, random=FALSE, verbose=FALSE) sim metadata(sim) @ The \Robject{GRanges} object for translocations has to be extended by columns \Rcode{chrB}, \Rcode{startB} and \Rcode{endB}, saying, the sequence within the ranges of the object will be exchanged with the sequence from chrB:startB-endB. Typically, one start/end of each region equals the 5'or 3' end of the chromosome. One may add a column \Rcode{balanced} saying \Rcode{TRUE}/\Rcode{FALSE} for every single entry. \\ The next example is a simple simulation of the translocation t(9;22) leading to the BCR-ABL fusion gene. It uses simple breakpoints within 9q34.1 and 22q11.2 for demonstration: <>= trans_BCR_ABL = GRanges(IRanges(133000000,141213431), seqnames="chr9", chrB="chr22", startB=23000000, endB=51304566, balanced=TRUE) names(trans_BCR_ABL) = "BCR_ABL" trans_BCR_ABL @ <>= sim = simulateSV(output=NA, chrs=c("chr9", "chr22"), regionsTrans=trans_BCR_ABL, bpSeqSize=50, random=FALSE) @ The example is not executed here, because it requires the package \Rpackage{BSgenome.Hsapiens.UCSC.hg19}. Setting the argument \Rcode{transInsert=20} adds up to 20 random nucleotides at both breakpoints. \\ It is strongly recommended to only use a set of non-overlapping SVs. \section{Comparing two sets of SVs} A typical use case of SV simulation with is the evaluation of SV detection algorithms. The function \Rfunction{compareSV} looks for overlaps between the output of the simulation, the ground truth (\Rfunarg{simSVs}), and the output of an SV detection program (\Rfunarg{querySVs}) up to a certain tolerance. It computes the sensitivity, precision and the percentage overlap between the breakpoint sequences (if available). \\ An overlap is defined as the overlap between the breakpoints/breakpoint regions in \Rfunarg{simSVs}/\Rfunarg{querySVs} up to the given tolerance in bp. Overlap does not mean the whole affected region between the start and end of the SV. \\ Unfortunately, there is currently no common standard format for SVs. Because the main information about SVs is their position in the genome and, sometimes, the breakpoint sequence (which depends on the SV detection algorithm), \Rfunction{compareSV} expects the SV detections as tables in a simple BED- or BEDPE format (\url{http://code.google.com/p/bedtools}). Deletions, inversions and tandem duplications, which affect one region on the genome, can be either given in both formats. Translocations and insertions, which affect to regions on the genome, require the BEDPE-format. Eventually, the output of the SV detection format has to be converted accordingly (for example in R). \\ The function only compares one SV type at a time, so \Rfunarg{querySVs} and \Rfunarg{simSVs} may not contain a mixture of different kinds of SVs. \\ If the BED-tables for \Rfunarg{querySVs} or the simulation output are saved on disk, \Rfunction{compareSV} also accepts their filenames and loads the tables automatically as \Rcode{data.frame} in R. \\ The following example simulates first five 5bp deletions in the small toy genome defined above: \\ <>= sim = simulateSV(output=NA, genome=genome, dels=5, sizeDels=5, bpSeqSize=10, seed=2345, verbose=FALSE) simSVs = metadata(sim)$deletions simSVs @ An SV detection in BED format (the \Rfunarg{querySVs}) may look like this: Four of five deletions were detected, two with exact and two with an approximate breakpoint. Two additional deletions were detected, which were not part of the simulation. <>= querySVs = data.frame( chr=c("chr1","chr1","chr1","chr2","chr2"), start=c(12,17,32,2,16), end=c(15,24,36,6,20), bpSeq=c("AAAAAAAAAA", "AAAAAAATTT", "TTTTTTTTTT", "GGGGGGGGGG", "GGGGGGCCCC") ) querySVs @ The column with the breakpoint sequence is optional, the column names not important (BED-files have no header). \\ A comparison with 0bp tolerance yields only two overlaps: <>= compareSV(querySVs, simSVs, tol=0) @ A higher breakpoint tolerance of +/- 3bp also includes more imprecise detections: <>= compareSV(querySVs, simSVs, tol=3) @ Note that for \Rcode{deletion1}, the breakpoint sequence matched only by 80\%. \\ The second example compares translocations: <>= sim = simulateSV(output=NA, genome=genome, trans=2, percBalancedTrans=0.5, bpSeqSize=10, seed=246, verbose=FALSE) simSVs = metadata(sim)$translocations simSVs @ Detected translocations have to be given in BEDPE-format (i.e. at least six columns chr1, start1, end1, chr2, start2, end2 for the breakpoints on both chromosomes). In this example, the breakpoints were approximated up to 1bp or 2bp, optional breakpoint sequences are missing: <>= querySVs = data.frame( chr=c("chr1", "chr1", "chr2"), start1=c(15,32,32), end1=c(18,36,33), chr2=c("chr2","chr2","chr1"), start2=c(10,31,32), end2=c(12,33,36) ) querySVs @ Here, all detected SVs span the simulated breakpoints: <>= compareSV(querySVs, simSVs, tol=0) @ \section{Setting structural variation sizes} One may specify just one size in the parameters \Rfunarg{sizeDels}, \Rfunarg{sizeIns}, \Rfunarg{sizeInvs} or \Rfunarg{sizeDups} that applies to every SV of each type. But often, it might be more realistic to assign an individual, arbitrary size to every single SV. In the simplest case, they may be uniformly distributed: <>= set.seed(246) @ <>= sizes = sample(2:5, 5, replace=TRUE) sizes sim = simulateSV(output=NA, genome=genome, dels=5, sizeDels=sizes, bpSeqSize=6, seed=246, verbose=FALSE) sim metadata(sim) @ \subsection{Estimating size distribution from real data} According to studies from the 1000 Genomes Project, for deletions, insertions and duplications, the amount of SVs decreases rather quickly as their size increases (\cite{1000genomes}). The function \Rfunction{estimateSVSizes} simulates SV sizes by fitting a beta distribution, which is flexible enough to realistically model the shape of the size distribution of all four SV types. Its two shape parameters can be derived from a given vector of SV sizes. This requires the R-package \Rpackage{MASS}.\\ The following toy example draws 1.000 SV sizes between 10bp and 1000bp from a beta distribution based on a vector of 15 SV sizes: <>= svSizes = c(10,20,30,40,60,80,100,150,200,250,300,400,500,750,1000) simSizes = estimateSVSizes(n=1000, svSizes=svSizes, minSize=10, maxSize=1000, hist=TRUE) head(simSizes, n=20) @ The \Rfunarg{minSize} and \Rfunarg{maxSize} can be omitted; they are then calculated from the given set of \Rfunarg{svSizes}. It is recommended to uses a \Rfunarg{minSize} and \Rfunarg{maxSize} that is consistent with the minimum/maximum values in \Rfunarg{svSizes}. \\ Setting the parameter \Rcode{hist=TRUE} also plots a histogram of the SV sizes to give an impression of their distribution (see Fig.1). \begin{figure}[p] \begin{center} \includegraphics[height=9cm, width=9cm]{RSVSim-sizesExample2} \caption{Distribution of 1.000 SV sizes drawn from a beta distribution using function \Rfunction{estimateSVSizes}.} \end{center} \end{figure} For deletions, insertions, inversions and tandem duplications, \Rfunction{estimateSVSizes} can use default parameters for the beta distribution. They were estimated from the Database of Genomic Variants (DGV) release 2012-03-29 (\cite{dgv}). Hence, no set of SV sizes is needed for fitting the distribution. In total, 1.129 deletions, 490 insertions, 202 inversions and 145 tandem duplications between 500bp and 10kb were used to estimate the shape. The parameter \Rfunarg{default} can be set to either \Rcode{"deletions"}, \Rcode{"insertions"}, \Rcode{"inversions"} or \Rcode{"tandemDuplications"} to use the according set of shape parameters: <>= delSizes = estimateSVSizes(n=10000, minSize=500, maxSize=10000, default="deletions", hist=TRUE) head(delSizes, n=15) @ <>= delSizes = estimateSVSizes(n=10000, minSize=500, maxSize=10000, default="insertions", hist=TRUE) head(delSizes, n=15) @ <>= invSizes = estimateSVSizes(n=10000, minSize=500, maxSize=10000, default="inversions", hist=TRUE) head(invSizes, n=15) @ <>= delSizes = estimateSVSizes(n=10000, minSize=500, maxSize=10000, default="tandemDuplications", hist=TRUE) head(delSizes, n=15) @ See Fig.2, Fig.3, Fig.4 and Fig.5 to see the estimated distribution based on the SVs in the DGV. When using these default values, it is recommended to simulate SVs that do not differ too much in size (aroung 500bp-10kb). \begin{figure}[p] \begin{center} \includegraphics[height=9cm, width=9cm]{RSVSim-sizesExample3} \caption{Distribution of 10.000 deletion sizes based on deletions from the Database of Genomic Variants.} \end{center} \end{figure} \begin{figure}[p] \begin{center} \includegraphics[height=9cm, width=9cm]{RSVSim-sizesExample4} \caption{Distribution of 10.000 insertion sizes based on insertions from the Database of Genomic Variants.} \end{center} \end{figure} \begin{figure}[p] \begin{center} \includegraphics[height=9cm, width=9cm]{RSVSim-sizesExample4} \caption{Distribution of 10.000 inversion sizes based on inversions from the Database of Genomic Variants.} \end{center} \end{figure} \begin{figure}[p] \begin{center} \includegraphics[height=9cm, width=9cm]{RSVSim-sizesExample5} \caption{Distribution of 10.000 tandem duplication sizes based on tandem duplications from the Database of Genomic Variants.} \end{center} \end{figure} \section{Runtime} The runtime of \Rpackage{RSVSim} mainly depends on the number of simulated breakpoints and the size of the genome. \\ The following test case simulates 50 SVs (10 per SV type) on the complete hg19: <>= simulateSV(output=NA, dels=10, ins=10, inv=10, dups=10, trans=10, sizeDels=10000, sizeIns=10000, sizeInvs=10000, sizeDups=10000, repeatBias=FALSE, bpFlankSize=50, percSNPs=0.25, indelProb=0.5, maxIndelSize=10) @ Ten repetitions of the simulation yield an average time of 6 minutes on a single Intel Xeon CPU with 2.40GHz an R version 2.15.2 (2012-10-26) including loading of the hg19 and writing of the output to disc. \\ Enabling biases for repeat regions for the same test case (i.e. \Rcode{repeatBias=TRUE}), yield an average 7 minutes. \\ For other simulations on hg19, the runtime will scale linearly with the number of SV breakpoints. \\ Note, that the one-time, initial download of the repeat coordinates for hg19 from the UCSC browser may take up to 45 minutes. Alternatively, providing a RepeatMasker output file is much quicker (see section 3 for more details). \begin{thebibliography}{} \bibitem[Chen {\it et al}., 2008]{svformation4} Chen W. {\it et al} (2008) Mapping translocation breakpoints by next-generation sequencing, {\it Genome Res}, {\bf 18(7)}, 1143-1149. \bibitem[Huang {\it et al}., 2011]{art} Huang W. {\it et al} (2011) ART: a next-generation sequencing read simulator, {\it Bioinformatics}, {\bf 28 (4)}, 593-594. \bibitem[Hu {\it et al}., 2012]{pirs} Hu X. {\it et al} (2012) pIRS: Profile-based Illumina pair-end reads simulator, {\it Bioinformatics}, {\bf 28(11)}, 1533-1535. \bibitem[Iafrate {\it et al}., 2004]{dgv} Iafrate A.J. {\it et al} (2004) Detection of large-scale variation in the human genome, {\it Nat Genet.}, {\bf 36(9)}, 949-951. \bibitem[Lam {\it et al}., 2010]{svformation1} Lam H.Y. {\it et al} (2010) Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library, {\it Nat Biotechnol}, {\bf 28(1)}, 47-55. \bibitem[Mills {\it et al}., 2011]{1000genomes} Mills R.E. {\it et al} (2011) Mapping copy number variation by population-scale genome sequencing, {\it Nature}, {\bf 470(7332)}, 59-65. \bibitem[Ou {\it et al}., 2011]{svformation3} Ou Z. {\it et~al} (2011) Observation and prediction of recurrent human translocations mediated by NAHR between nonhomologous chromosomes, {\it Genome Res}, {\bf 21(1)}, 33-46. \bibitem[Pang {\it et al}., 2013]{svformation2} Pang A.W. {\it et al} (2013) Mechanisms of Formation of Structural Variation in a Fully Sequenced Human Genome, {\it Hum Mutat}, {\bf 34(2)}, 345-354. \bibitem[Smit {\it et al}., 1996-2010]{repeatmasker} Smit A. {\it et al} (1996-2010) RepeatMasker Open-3.0., . \end{thebibliography} \section{Session Information} <>= sessionInfo() @ \end{document}