This data package provides output files from running several transcript abundance quantifiers on 6 samples from the GEUVADIS Project. The files are contained in the inst/extdata directory.
A citation for the GEUVADIS Project is:
Lappalainen, et al., “Transcriptome and genome sequencing uncovers functional variation in humans”, Nature 501, 506-511 (26 September 2013) doi:10.1038/nature12531.
The purpose of this vignette is to detail which versions of software were run, and exactly what calls were made.
A small file, samples.txt, is included in the inst/extdata directory:
dir <- system.file("extdata", package="tximportData")
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples
##   pop center                assay    sample experiment       run
## 1 TSI  UNIGE NA20503.1.M_111124_5 ERS185497  ERX163094 ERR188297
## 2 TSI  UNIGE NA20504.1.M_111124_7 ERS185242  ERX162972 ERR188088
## 3 TSI  UNIGE NA20505.1.M_111124_6 ERS185048  ERX163009 ERR188329
## 4 TSI  UNIGE NA20507.1.M_111124_7 ERS185412  ERX163158 ERR188288
## 5 TSI  UNIGE NA20508.1.M_111124_2 ERS185362  ERX163159 ERR188021
## 6 TSI  UNIGE NA20514.1.M_111124_4 ERS185217  ERX163062 ERR188356
Further details on each sample can be found in an extended table:
samples.ext <- read.delim(file.path(dir,"samples_extended.txt"), header=TRUE)
colnames(samples.ext)
## [1] "Source.Name"
## [2] "Comment.ENA_SAMPLE."
## [3] "Characteristics.Organism."
## [4] "Term.Source.REF"
## [5] "Term.Accession.Number"
## [6] "Characteristics.Strain."
## [7] "Characteristics.population."
## [8] "Comment.1000g.Phase1.Genotypes."
## [9] "Protocol.REF"
## [10] "Protocol.REF.1"
## [11] "Extract.Name"
## [12] "Comment.LIBRARY_SELECTION."
## [13] "Comment.LIBRARY_SOURCE."
## [14] "Comment.SEQUENCE_LENGTH."
## [15] "Comment.LIBRARY_STRATEGY."
## [16] "Comment.LIBRARY_LAYOUT."
## [17] "Comment.NOMINAL_LENGTH."
## [18] "Comment.NOMINAL_SDEV."
## [19] "Protocol.REF.2"
## [20] "Performer"
## [21] "Assay.Name"
## [22] "Technology.Type"
## [23] "Comment.ENA_EXPERIMENT."
## [24] "Comment.READ_INDEX_1_BASE_COORD."
## [25] "Protocol.REF.3"
## [26] "Scan.Name"
## [27] "Comment.SUBMITTED_FILE_NAME."
## [28] "Comment.ENA_RUN."
## [29] "Comment.FASTQ_URI."
## [30] "Protocol.REF.4"
## [31] "Derived.Array.Data.File"
## [32] "Comment..Derived.ArrayExpress.FTP.file."
## [33] "Factor.Value.population."
## [34] "Factor.Value.laboratory."
## [35] "date"
The quantification outputs themselves can be found in sub-directories:
list.files(dir)
## [1] "cufflinks" "kallisto" "rsem"
## [4] "sailfish" "salmon" "samples.txt"
## [7] "samples_extended.txt" "tx2gene.csv"
list.files(file.path(dir,"cufflinks"))
## [1] "isoforms.attr_table" "isoforms.count_table" "isoforms.fpkm_table"
list.files(file.path(dir,"rsem","ERR188021"))
## [1] "ERR188021.genes.results"
list.files(file.path(dir,"kallisto","ERR188021"))
## [1] "abundance.tsv" "run_info.json"
list.files(file.path(dir,"salmon","ERR188021"))
## [1] "cmd_info.json" "quant.sf"
list.files(file.path(dir,"sailfish","ERR188021"))
## [1] "cmd_info.json" "quant.sf"
The human genome and annotations were downloaded from Illumina iGenomes for the UCSC hg19 version. The human genome FASTA file used was in the Sequence/WholeGenomeFasta directory, and the gene annotation GTF file used was the genes.gtf file in the Annotation/Genes directory. This GTF file contains RefSeq transcript IDs and UCSC gene names. The Annotation directory contained a README.txt file with the text:
The contents of the annotation directories were downloaded from UCSC on: June 02, 2014.
The genes.gtf file was filtered to include only chromosomes 1-22, X, Y, and M.
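The command used for this filtering step is not recorded here. As a rough sketch only, a filter of this kind could be written with awk, assuming UCSC-style chromosome names (chr1 to chr22, chrX, chrY, chrM) in the first GTF column. The toy GTF records and the output name genes.filtered.gtf are hypothetical and serve only to make the example self-contained.

```shell
# Sketch only: the actual filtering command is not given in this vignette.
# Build a tiny, made-up GTF so the example runs on its own.
tmp=$(mktemp -d)
printf 'chr1\tunknown\texon\t100\t200\t.\t+\t.\tgene_id "A"; transcript_id "NM_1";\n' > "$tmp/genes.gtf"
printf 'chr6_apd_hap1\tunknown\texon\t100\t200\t.\t+\t.\tgene_id "B"; transcript_id "NM_2";\n' >> "$tmp/genes.gtf"
printf 'chrM\tunknown\texon\t100\t200\t.\t+\t.\tgene_id "C"; transcript_id "NM_3";\n' >> "$tmp/genes.gtf"
printf 'chrUn_gl000211\tunknown\texon\t100\t200\t.\t+\t.\tgene_id "D"; transcript_id "NM_4";\n' >> "$tmp/genes.gtf"

# Keep only records whose first column is exactly chr1..chr22, chrX, chrY or chrM
# (assumed naming scheme; haplotype and unplaced contigs are dropped).
awk -F'\t' '$1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$/' "$tmp/genes.gtf" > "$tmp/genes.filtered.gtf"

# List the chromosomes that survived the filter: chr1 and chrM
cut -f1 "$tmp/genes.filtered.gtf"
```

Anchoring the pattern with ^ and $ matters here, because an unanchored match on chr6 would also keep chr6_apd_hap1.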
TopHat2 version 2.0.11 was run with the call:
tophat -p 20 -o tophat_out/$f genome fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz;
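The calls in this vignette refer to a shell variable $f, which appears to hold a run accession such as ERR188297; how it was set is not recorded. One plausible arrangement, shown here purely as an assumption, is a loop over the run column of samples.txt. The sketch below writes a two-row copy of that file to a temporary directory and prints each command instead of executing it, so it runs without TopHat installed. Note that ${f}_1 is equivalent to the $f\_1 spelling used in the recorded calls: both stop the shell from reading f_1 as the variable name.

```shell
# Sketch only: the loop is an assumption; the vignette records just the per-sample call.
tmp=$(mktemp -d)
cat > "$tmp/samples.txt" <<'EOF'
pop center assay sample experiment run
TSI UNIGE NA20503.1.M_111124_5 ERS185497 ERX163094 ERR188297
TSI UNIGE NA20504.1.M_111124_7 ERS185242 ERX162972 ERR188088
EOF

# The run accession is the sixth whitespace-separated column; skip the header row.
runs=$(awk 'NR > 1 { print $6 }' "$tmp/samples.txt")

# Dry run: print the command for each run instead of executing it.
for f in $runs; do
  echo "tophat -p 20 -o tophat_out/$f genome fastq/${f}_1.fastq.gz fastq/${f}_2.fastq.gz"
done > "$tmp/commands.txt"

cat "$tmp/commands.txt"
```

Removing the echo and its surrounding quotes would turn the dry run into real invocations, provided the index and FASTQ files exist at those relative paths.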
Cufflinks version 2.2.1 was run with the call:
cuffquant -p 40 -b $GENO -o cufflinks/$f genes.gtf tophat_out/$f/accepted_hits.bam;
Cuffnorm was run with the call:
cuffnorm genes.gtf -o cufflinks/ \
cufflinks/ERR188297/abundances.cxb \
cufflinks/ERR188088/abundances.cxb \
cufflinks/ERR188329/abundances.cxb \
cufflinks/ERR188288/abundances.cxb \
cufflinks/ERR188021/abundances.cxb \
cufflinks/ERR188356/abundances.cxb
RSEM version 1.2.11 was run with the call:
rsem-calculate-expression -p 20 --no-bam-output --paired-end <(zcat fastq/$f\_1.fastq.gz) <(zcat fastq/$f\_2.fastq.gz) rsem/index rsem/$f/$f
kallisto version 0.42.4 was run with the call:
kallisto quant --bias -i kallisto_0.42.4/index -o kallisto_0.42.4/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz
Salmon version 0.6.0 was run with the call:
$salmon quant -p 10 --biasCorrect -i salmon_0.6.0/index -l IU -1 <(zcat fastq/$f\_1.fastq.gz) -2 <(zcat fastq/$f\_2.fastq.gz) -o salmon_0.6.0/$f
Sailfish version 0.9.0 was run with the call:
sailfish quant -p 10 --biasCorrect -i sailfish_0.9.0/index -l IU -1 <(zcat fastq/$f\_1.fastq.gz) -2 <(zcat fastq/$f\_2.fastq.gz) -o sailfish_0.9.0/$f
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 tools_3.3.0 stringi_1.0-1 knitr_1.13 stringr_1.0.0
## [6] evaluate_0.9