A straightforward and complete next-generation sequencing read simulator

**Sandy** is a bioinformatics tool that provides a simple engine to simulate next-generation sequencing (NGS) reads for genomic and transcriptomic pipelines. Simulated data works as an experimental control, a key step in optimizing NGS analysis, in comparison to hypothetical models. **Sandy** is a straightforward, easy-to-use, fast and highly customizable tool that generates reads requiring only a fasta file as input. **Sandy** can simulate **single-end** and **paired-end** reads from both DNA and RNA sequencing as if produced by the most widely used second- and third-generation platforms. The tool also maintains a built-in database with predefined models extracted from real data for sequencer **quality-profiles** (e.g. [Illumina](https://www.illumina.com/) *hiseq*, *miseq*, *nextseq*), **expression-matrices** generated from [GTExV8](https://www.gtexportal.org/home/) data for 54 human tissues, and **genomic-variations** such as SNVs and Indels from [1KGP](https://www.internationalgenome.org/) and gene fusions from [COSMIC](https://cancer.sanger.ac.uk/cosmic).

For full documentation, please visit .

## Features

* Simulate DNA and RNA sequencing

  Simulate **single-end** (long and short fragments) and **paired-end** sequencing reads for **genome** and **transcriptome** analysis. The simulation can be customized with random seed, sequencing coverage, number of reads, fragment mean, output formats (`fastq`, `sam` and their compressed versions `fastq.gz` and `bam`), sequence identifier (header of entries in `fastq`) and much more.

* Sequencer **quality-profile**

  **Sandy** generates `fastq` quality entries that mimic the [Illumina](https://www.illumina.com/), [PacBio](https://www.pacb.com/) and [Nanopore](https://nanoporetech.com/) sequencers. It can also generate the *phred-score* using a statistical model based on the *poisson* distribution.

* RNA-Seq **expression-matrix**

  It is possible to simulate an RNA-Seq experiment that reflects the abundance of gene expression for transcripts and genes of a given tissue. For this purpose, **expression-matrices** were created from the gene expression data of 54 tissues of the [GTExV8](https://www.gtexportal.org/home/) project.

* Whole-genome sequencing with **genomic-variation**

  The user can tune the reference genome (e.g. [GRCh38.p13.genome.fa.gz](https://www.gencodegenes.org/human/)), adding homozygous or heterozygous **genomic-variations** such as SNVs, Indels, gene fusions and other types of structural variations (e.g. CNVs, retroCNVs). **Sandy** has in its database **genomic-variations** obtained from the [1KGP](https://www.internationalgenome.org/) and from [COSMIC](https://cancer.sanger.ac.uk/cosmic).

* Custom user models

  Users can include their own models for **quality-profile**, **expression-matrix** and **genomic-variation** in order to adapt the simulation to their needs.

* Custom sequence identifier

  The sequence identifier, as the name implies, is a string that identifies a biological sequence (usually nucleotides) within sequencing data. For example, the `fasta` format always places the sequence identifier after the `>` character at the beginning of the line; the `fastq` format always places it after the `@` character at the beginning of the line; the `sam` format uses the first column (called the *query template name*).

  | Sequence identifier | File format |
  | :-- | :-: |
  | \>**MYID and Optional information**<br />ATCGATCG | `fasta` |
  | @**MYID and Optional information**<br />ATCGATCG<br />+<br />ABCDEFGH | `fastq` |
  | **MYID** 99 chr1 123456 20 8M chr1 123478 30 ATCGATCG ABCDEFGH | `sam` |

  Sequence identifiers may be customized in the output using a format string passed by the user. This format is a combination of literal and escaped characters, similar to the one used by the C programming language's `printf` function. For example, when simulating paired-end sequencing, you can add the read length, read position and mate position to all sequence identifiers with the following format:

      %i.%U read=%c:%t-%n mate=%c:%T-%N length=%r

  In this case, results in `fastq` format would be:

      ==> Into R1
      @SR.1 read=chr6:979-880 mate=chr6:736-835 length=100
      ...

      ==> Into R2
      @SR.1 read=chr6:736-835 mate=chr6:979-880 length=100

## Installation

There are two recommended ways to obtain **Sandy**: pulling the official [Docker](https://www.docker.com/) image and installing through [CPAN](https://metacpan.org/).

### Docker

Assuming that `docker` is already installed on your server, simply run the command:

    $ docker pull galantelab/sandy

For more details, see the [docker/README.md](https://github.com/galantelab/sandy/blob/master/docker/README.md) file.

### CPAN

#### Prerequisites

Along with `perl`, you must have the `zlib`, `gcc`, `make` and `cpanm` packages installed:

- Debian/Ubuntu

      % apt-get install perl zlib1g-dev gcc make cpanminus

- CentOS/Fedora

      % yum install perl zlib gcc make perl-App-cpanminus

- Arch Linux

      % pacman -S perl zlib gcc make cpanminus

#### Installing with `cpanm`

Install **Sandy** with the following command:

    % cpanm App::Sandy

If you are concerned about installation speed, you can skip the tests with the `--notest` flag:

    % cpanm --notest App::Sandy

For more details, see the [INSTALL](https://github.com/galantelab/sandy/blob/master/INSTALL) file.

## Acknowledgments

| Institution | Site |
| :-- | :-: |
| Coordination for the Improvement of Higher Level Personnel | [CAPES](http://www.capes.gov.br/) |
| The São Paulo Research Foundation | [FAPESP](https://fapesp.br/en/about) |
| Teaching and Research Institute from Sírio-Libanês Hospital | [Galantelab](https://www.bioinfo.mochsl.org.br/) |

## License

This is free software, licensed under:

    The GNU General Public License, Version 3, June 2007