# Bam Parameters - BamParam

High-throughput sequencing (RNA-Seq) data comes in a multitude of _flavours_, _i.e._ even from a single provider, the protocol (_e.g._ strand-specific, paired-end) and the read characteristics (_e.g._ read length) will vary. The __easyRNASeq simpleRNASeq__ method infers this information from excerpts sampled from the data. However, it is always best to provide it explicitly, as

1. the inference is done on a small excerpt and can fail
2. it is always good to document an analysis

By default, __easyRNASeq simpleRNASeq__ keeps the inferred parameters over the user-provided ones if the two do not agree, and emits the corresponding warnings. Relying on inferred parameters rather than user-provided ones is meant to push users to cross-validate their knowledge of the data characteristics, as these are crucial for adequate processing. Remember GIGO[^2].

If the _automatic inference_ does fail, please let me know, so that I can optimise it. Meanwhile, you can use the __override__ argument to enforce the use of user-passed parameters.

To reproduce the results from _Robinson, Delhomme et al._ [@Robinson:2014p6362], we first need to download an excerpt of the data.

We first retrieve the file listing and md5 codes

```{r bam files, eval=FALSE}
download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/RnaSeqTutorial/",
                         "data/star/md5.txt"),
              destfile="md5.txt")
```

In this part of the vignette, we will _NOT_ process all the data. It would be possible, but for the sake of brevity we will only retrieve the first six datasets. We get these from the sample information contained within this package.

```{r data, eval=FALSE}
data(RobinsonDelhomme2014)
lapply(RobinsonDelhomme2014[1:6,"Filename"],function(f){
    # BAM files
    download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/",
                             "RnaSeqTutorial/data/star/",f),
                  destfile=as.character(f))
    # BAM index files
    download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/",
                             "RnaSeqTutorial/data/star/",f,".bai"),
                  destfile=as.character(paste0(f,".bai")))
})
```

```{r data unit test, eval=TRUE, echo=FALSE}
# THIS IS A subset of the data (chr 19 only) used to speed up
# the vignette creation while still testing capabilities
data(RobinsonDelhomme2014)
lapply(RobinsonDelhomme2014[1:6,"Filename"],function(f){
    # BAM files
    file.copy(
        dir(vDir,pattern=paste0(as.character(f),"$"),full.names=TRUE)
        ,file.path(".",f))
    # BAM index files
    file.copy(
        dir(vDir,pattern=paste0(as.character(f),".bai"),full.names=TRUE)
        ,file.path(".",paste0(f,".bai")))
})
```

These six files - like the rest of the dataset - were sequenced on an Illumina HiSeq 2500 in paired-end mode, using a non-strand-specific library protocol with a read length of 100 bp. The raw data were processed as described in the aforementioned guidelines[^1]: they were filtered for rRNA sequences, trimmed for adapters and clipped for quality. The resulting reads (50-100 bp long) were then aligned using STAR [@Dobin:2013p5293].

Using this information, we finally generate the __BamParam__ object.

```{r bamParam}
bamParam <- BamParam(paired = TRUE,
                     stranded = FALSE)
```

A third parameter, _yieldSize_, can be set to speed up the processing on multi-CPU or multi-core computers. The BAM files are split into chunks of _yieldSize_ reads (1M reads by default) and processed chunk by chunk.

Finally, we create the list of BAM files we just retrieved.

```{r bamFiles}
# the patterns are regular expressions, not globs, so no leading "*" is needed
bamFiles <- getBamFileList(dir(".",pattern="\\.bam$"),
                           dir(".",pattern="\\.bai$"))
```

----
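
The call to `getBamFileList` above is given two separate directory listings, so it relies on every BAM file having an index and on both listings coming back in the same order. A minimal base R sketch of a check that could be run beforehand is shown below. `pairBamWithIndex` is a hypothetical helper, not part of easyRNASeq, and it assumes the `<file>.bam.bai` naming used in this section. The demonstration runs on empty placeholder files in a temporary directory rather than on real alignments.

```r
# Hypothetical helper (not part of easyRNASeq): pair every BAM file found in
# `path` with its index, assuming the index is named "<file>.bam.bai".
pairBamWithIndex <- function(path = ".") {
  bam <- sort(dir(path, pattern = "\\.bam$", full.names = TRUE))
  bai <- paste0(bam, ".bai")
  # Stop early if any BAM file lacks an index
  noIndex <- bam[!file.exists(bai)]
  if (length(noIndex) > 0L) {
    stop("No index found for: ", paste(basename(noIndex), collapse = ", "))
  }
  data.frame(bam = bam, bai = bai, stringsAsFactors = FALSE)
}

# Demonstration on empty placeholder files (not real BAM data)
demoDir <- file.path(tempdir(), "bam-demo")
dir.create(demoDir, showWarnings = FALSE)
demoFiles <- file.path(demoDir,
                       c("a.bam", "a.bam.bai", "b.bam", "b.bam.bai"))
invisible(file.create(demoFiles))

pairs <- pairBamWithIndex(demoDir)
```

On the real data, `pairs$bam` and `pairs$bai` could presumably take the place of the two `dir()` calls, which keeps each file and its index explicitly matched.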