---
title: "Pre-Processing for the Zebrafish RNA-Seq Gene-Level Counts"
author: "Davide Risso"
date: "Last modified: June 7, 2024; Compiled: `r format(Sys.time(), '%B %d, %Y')`"
bibliography: biblio.bib
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteEncoding{UTF-8}
---

# Introduction

This vignette describes the pre-processing steps used to generate the gene-level read counts contained in the Bioconductor package _zebrafishRNASeq_.

# Sample preparation and sequencing

Olfactory sensory neurons were isolated from three pairs of gallein-treated and control embryonic zebrafish pools and purified by fluorescence-activated cell sorting (FACS) [@ferreira2014silencing]. Each RNA sample was enriched in poly(A)+ RNA from 10--30 ng total RNA, and 1 $\mu$L (1:1000 dilution) of Ambion ERCC ExFold RNA Spike-in Control Mix 1 was added to 30 ng of total RNA before mRNA isolation. cDNA libraries were prepared according to the manufacturer's protocol. The six libraries were sequenced in two multiplex runs on an Illumina HiSeq2000 sequencer, yielding approximately 50 million 100 bp paired-end reads per library.

# Read alignment and expression quantitation

We made use of a custom reference sequence, defined as the union of the zebrafish reference genome (Zv9, downloaded from Ensembl [@flicek2012ensembl], v. 67) and the [ERCC spike-in sequences](https://www.thermofisher.com/order/catalog/product/4456739). Reads were mapped with TopHat [@trapnell2009tophat] (v. 2.0.4), with the following parameters:

```
--library-type=fr-unstranded -G ensembl.gtf --transcriptome-index=transcript --no-novel-juncs
```

where _ensembl.gtf_ is a GTF file containing the Ensembl gene annotation.

Gene-level read counts were obtained using the htseq-count Python script [@htseq] in "union" mode with the Ensembl (v. 67) gene annotation.

After verifying that there were no run-specific biases, we used the sum of the counts from the two runs as the expression measure for each library.

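
The final step, summing the counts of the two runs, can be illustrated with a minimal sketch. The objects `run1` and `run2`, the gene and library names, and all count values below are hypothetical toy data, not the original per-run counts, which are not distributed with the package. The sketch assumes that each run is stored as a genes-by-libraries matrix with identical row and column names.

```r
# Hypothetical toy example: per-run count matrices (genes x libraries).
# None of these names or values come from the actual zebrafish data.
genes <- c("geneA", "geneB", "ERCC-00002")
libs <- c("Ctl", "Trt")

run1 <- matrix(c(10L, 0L, 250L,
                 12L, 3L, 240L),
               nrow = 3, dimnames = list(genes, libs))
run2 <- matrix(c(8L, 1L, 260L,
                 15L, 2L, 230L),
               nrow = 3, dimnames = list(genes, libs))

# Both runs must list the same genes and libraries in the same order,
# otherwise the element-wise sum would mix up features or samples.
stopifnot(identical(dimnames(run1), dimnames(run2)))

# One expression measure per gene and library: the sum over the two runs.
counts <- run1 + run2
counts
```

Checking the dimension names before adding guards against silently summing mismatched rows, which matrix addition in R would otherwise allow whenever the dimensions agree.
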
# Loading the zebrafish data into R

To load the gene-level read counts into R, simply type

```{r loadData}
# Load the data package and the gene-level count table
library(zebrafishRNASeq)
data(zfGenes)
head(zfGenes)
```

The ERCC spike-in read counts are in the last rows of the same matrix and can be retrieved in the following way.

```{r ercc, eval=TRUE, results="markup"}
# Spike-in rows are those whose names start with "ERCC"
spikes <- zfGenes[grep("^ERCC", rownames(zfGenes)),]
head(spikes)
```

The typical use of this dataset is the identification of differentially expressed genes between control (Ctl) and treated (Trt) samples.

For additional details, exploratory analysis, and normalization of the zebrafish data, see @risso2014ruv and @risso2014role. The data are used as a case study for the Bioconductor package _RUVSeq_.

# Session info

```{r sessionInfo}
sessionInfo()
```

# References