Overview

This document provides a quality assessment of Genome Analyzer results. The assessment is meant to complement, rather than replace, the quality assessment available from the Genome Analyzer and its documentation. The narrative interpretation is based on the experience of the package maintainer. It is applicable to results from the 'Genome Analyzer' hardware single-end module, configured to scan 300 tiles per lane. The 'control' results referred to below are from analysis of PhiX-174 sequence provided by Illumina.

Run summary

Subsequent sections of the report use the following to identify figures and other information.

Key  File
  1  wgEncodeUwTfbsAg04449CtcfStdAlnRep1.bam
  2  wgEncodeUwTfbsAg04450CtcfStdAlnRep1.bam
  3  wgEncodeUwTfbsAg09309CtcfStdAlnRep1.bam
  4  wgEncodeUwTfbsAg09319CtcfStdAlnRep1.bam
  5  wgEncodeUwTfbsAg10803CtcfStdAlnRep1.bam
  6  wgEncodeUwTfbsAoafCtcfStdAlnRep1.bam
  7  wgEncodeUwTfbsHaspCtcfStdAlnRep1.bam
  8  wgEncodeUwTfbsHbmecCtcfStdAlnRep1.bam
  9  wgEncodeUwTfbsHcfaaCtcfStdAlnRep1.bam
 10  wgEncodeUwTfbsHcpeCtcfStdAlnRep1.bam
 11  wgEncodeUwTfbsHeeCtcfStdAlnRep1.bam
 12  wgEncodeUwTfbsHmfCtcfStdAlnRep1.bam
 13  wgEncodeUwTfbsHpafCtcfStdAlnRep1.bam
 14  wgEncodeUwTfbsHpfCtcfStdAlnRep1.bam
 15  wgEncodeUwTfbsHrpeCtcfStdAlnRep1.bam
 16  wgEncodeUwTfbsAg04449CtcfStdAlnRep2.bam
 17  wgEncodeUwTfbsAg09309CtcfStdAlnRep2.bam
 18  wgEncodeUwTfbsAg09319CtcfStdAlnRep2.bam
 19  wgEncodeUwTfbsAg10803CtcfStdAlnRep2.bam
 20  wgEncodeUwTfbsAoafCtcfStdAlnRep2.bam
 21  wgEncodeUwTfbsHbmecCtcfStdAlnRep2.bam
 22  wgEncodeUwTfbsHcpeCtcfStdAlnRep2.bam
 23  wgEncodeUwTfbsHeeCtcfStdAlnRep2.bam
 24  wgEncodeUwTfbsHmfCtcfStdAlnRep2.bam
 25  wgEncodeUwTfbsHpafCtcfStdAlnRep2.bam
 26  wgEncodeUwTfbsHpfCtcfStdAlnRep2.bam
 27  wgEncodeUwTfbsAg04449InputStdAlnRep1.bam
 28  wgEncodeUwTfbsAg04450InputStdAlnRep1.bam
 29  wgEncodeUwTfbsAg09309InputStdAlnRep1.bam
 30  wgEncodeUwTfbsAg09319InputStdAlnRep1.bam
 31  wgEncodeUwTfbsAg10803InputStdAlnRep1.bam
 32  wgEncodeUwTfbsAoafInputStdAlnRep1.bam
 33  wgEncodeUwTfbsHaspInputStdAlnRep1.bam
 34  wgEncodeUwTfbsHbmecInputStdAlnRep1.bam
 35  wgEncodeUwTfbsHcfaaInputStdAlnRep1.bam
 36  wgEncodeUwTfbsHcpeInputStdAlnRep1.bam
 37  wgEncodeUwTfbsHeeInputStdAlnRep1.bam
 38  wgEncodeUwTfbsHmfInputStdAlnRep1.bam
 39  wgEncodeUwTfbsHpafInputStdAlnRep1.bam
 40  wgEncodeUwTfbsHpfInputStdAlnRep1.bam
 41  wgEncodeUwTfbsHrpeInputStdAlnRep1.bam

Read counts. Filtered and aligned read counts are reported relative to the total number of reads (clusters; if only filtered or aligned reads are available, total read count is reported). Consult Genome Analyzer documentation for official guidelines. From experience, very good runs of the Genome Analyzer 'control' lane result in 25-30 million reads, with up to 95% passing pre-defined filters.

  ShortRead:::.ppnCount(qa[["readCounts"]])
Key      read  filter  aligned
  1   9952444
  2  21170101
  3  14311099
  4  22451182
  5  26964677
  6   9317234
  7  14968206
  8  23428973
  9  20244846
 10  20606447
 11  23118965
 12   9764834
 13  27021265
 14  21074336
 15  25683072
 16  23572200
 17  10263622
 18  25700109
 19  29559218
 20  29974058
 21  15647982
 22  31968244
 23  10628790
 24  21481465
 25  27813170
 26  24317687
 27  16096148
 28  18400427
 29   7925518
 30  23348186
 31  19776716
 32  21424353
 33  26952024
 34  21100439
 35  23650231
 36  20835811
 37  12764649
 38   8720617
 39  20412883
 40  18540990
 41  18348314
  ShortRead:::.plotReadCount(qa)
./image/readCount.jpg
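
The scaling of filtered and aligned counts relative to total reads can be sketched in base R. This is a minimal illustration under stated assumptions, not the ShortRead implementation: the function name `proportionOfReads` and all lane counts below are hypothetical, chosen only to show the arithmetic, including the case where the total read count is missing.

```r
# Hypothetical counts for three lanes (illustrative values, not from this report).
counts <- matrix(
    c(10000000, 20000000, NA,         # read: total reads (clusters)
       9000000, 17000000, NA,         # filter: reads passing filters
       8000000, 15000000, 12000000),  # aligned: reads that aligned
    ncol = 3,
    dimnames = list(c("lane1", "lane2", "lane3"),
                    c("read", "filter", "aligned"))
)

# Express the filter and aligned columns as proportions of the read column.
# When the total is missing, divide by 1 so the raw count is kept instead.
proportionOfReads <- function(m) {
    denom <- ifelse(is.na(m[, 1]), 1, m[, 1])
    m[, -1] <- round(m[, -1] / denom, 3)
    m
}

ppn <- proportionOfReads(counts)
print(ppn)
```

In this sketch, lane3 has no total read count, so its aligned value stays a raw count rather than a proportion.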

Base call frequency over all reads. Base frequencies should accurately reflect the frequencies of the regions sequenced.

  ShortRead:::.plotNucleotideCount(qa)
./image/baseCalls.jpg
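
As a rough illustration of what a base call frequency is, the sketch below tabulates bases over a few toy reads in base R. The reads and the variable names are hypothetical; the report's own figures are produced by ShortRead from the actual run data.

```r
# Toy reads (hypothetical); real reads would come from the sequencing run.
reads <- c("ACGT", "AACC", "GGTT")
alphabet <- c("A", "C", "G", "T", "N")

# Split every read into single bases and count each letter of the alphabet.
bases <- unlist(strsplit(reads, ""))
baseCount <- table(factor(bases, levels = alphabet))

# Convert counts to proportions of all called bases.
baseFrequency <- baseCount / sum(baseCount)
print(round(baseFrequency, 3))
```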

Overall read quality. Lanes with consistently good quality reads have strong peaks at the right of the panel.

  df <- qa[["readQualityScore"]]
  ShortRead:::.plotReadQuality(df[df$type=="read",])
./image/readQuality.jpg
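
One way to think about an overall read quality score is the average per-base quality of each read. The sketch below assumes quality strings encoded as ASCII with an offset of 33; that offset is an assumption (some older Illumina pipelines used 64), and `meanReadQuality` is a hypothetical helper, not a ShortRead function.

```r
# Average quality of each read, decoded from an ASCII-encoded quality string.
# `offset` is the assumed encoding offset (33 here; adjust for other encodings).
meanReadQuality <- function(quality, offset = 33L) {
    vapply(quality,
           function(q) mean(utf8ToInt(q) - offset),
           numeric(1), USE.NAMES = FALSE)
}

# Toy quality strings: 'I' decodes to 40 and '!' decodes to 0 at offset 33.
quality <- c("IIII", "!!!!", "I!I!")
readScore <- meanReadQuality(quality)
print(readScore)
```

A density plot of such scores across all reads in a lane would place consistently good lanes toward the right.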

Read distribution

These curves show how coverage is distributed amongst reads. Ideally, the cumulative proportion of reads will transition sharply from low to high.

Portions to the left of the transition correspond to reads that are represented relatively infrequently, and might roughly reflect sequencing or sample processing errors. About 10-15% of reads in a typical Genome Analyzer 'control' lane fall in this category.

Portions to the right of the transition represent reads that are over-represented compared to expectation. These might include inadvertently sequenced primer or adapter sequences, sequencing or base calling artifacts (e.g., poly-A reads), or features of the sample DNA (highly repeated regions) not adequately removed during sample preparation. About 5% of Genome Analyzer 'control' lane reads fall in this category.

Broad transitions from low to high cumulative proportion of reads may reflect sequencing bias or (perhaps intentional) features of sample preparation resulting in non-uniform coverage. The transition is about 5 times as wide as expected from uniform sampling across the Genome Analyzer 'control' lane.

  df <- qa[["sequenceDistribution"]]
  ShortRead:::.plotReadOccurrences(df[df$type=="read",], cex=.5)
./image/readOccurences.jpg
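
The cumulative curve described in this section can be approximated in base R as follows. This is a sketch on hypothetical toy reads, with illustrative variable names; it is not the code ShortRead uses to build `sequenceDistribution`.

```r
# Toy reads (hypothetical): one read seen three times, one twice, two once.
reads <- c("AAAA", "AAAA", "AAAA", "ACGT", "ACGT", "GGCC", "TTAA")

# How many times does each distinct read occur?
occurrencesPerRead <- as.vector(table(reads))

# How many distinct reads occur once, twice, three times, ...?
distinctByOccurrence <- table(occurrencesPerRead)
nOccurrences <- as.integer(names(distinctByOccurrence))
nReads <- nOccurrences * as.vector(distinctByOccurrence)

# Cumulative proportion of all reads, ordered by number of occurrences.
cumulativeProportion <- cumsum(nReads) / sum(nReads)
print(data.frame(nOccurrences, nReads, cumulativeProportion))
```

Plotting `cumulativeProportion` against `nOccurrences` gives a curve of the kind discussed above: a sharp rise indicates that most reads occur a similar number of times.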

Common duplicate reads might provide clues to the source of over-represented sequences. Some of these reads are filtered by the alignment algorithms; other duplicate reads might point to sample preparation issues.

  ShortRead:::.freqSequences(qa, "read")
sequence                              count  lane
AAAAAAAAAAAAAAAGAAAAAAAAAAAAAAACAAAA   1458    23
TTGTTCACTATGGAGTTGCGGTTAAAAGTAGGCCCT   1329    21
GCTTCTCCAAGGGCAGAGCCAGAGTCCTCTTTTGCC   1070    21
AAAAAAAAAAAAAAAAGGAAAAAAAAAAAAAAAAAA    895    22
AAAAAAAAAAAAAAATAAAAAAAAAAAAAAAAAAAA    665    23
AAAAAAAAAAAAAAAAGTAAAAAAAAAAAAAAAAAA    535    22
TTGTTCACTATGGAGTTGCGGTTAAAAGTAGGCCCT    475     2
AAAAAAAAGAAAAAAAAAAAAAANAAAAAAAAAAGA    460    22
AAAAAAAATAAAAAAAAAATAAAAAAAAAAAAAAAA    455    22
TTGTTCACTATGGAGTTGCGGTTAAAAGTAGGCCCT    442     9
AAAAAACAAAAAAAAACAAAAAAAACAAAACAANAA    439    22
GCTTCTCCAAGGGCAGAGCCAGAGTCCTCTTTTGCC    408     2
TTGTTCACTATGGAGTTGCGGTTAAAAGTAGGCCCT    367    15
GCTTCTCCAAGGGCAGAGCCAGAGTCCTCTTTTGCC    358     9
TTGTTCACTATGGAGTTGCGGTTAAAAGTAGGCCCT    355    25
GCTTCTCCAAGGGCAGAGCCAGAGTCCTCTTTTGCC    343    15
AAAAAAAGAAAACAAAAAACAAAAAAAAGAAAAAAA    308    22
TTGTTCACTATGGAGTTGCGGTTAAAAGTAGGCCCT    295    20
GCTTCTCCAAGGGCAGAGCCAGAGTCCTCTTTTGCC    293    25
AAAAAAAATAAAAAAAAAAAAAANAAAAAAAAAAGA    291    22
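
A table of common duplicate reads like the one above can be approximated for a single lane with base R. The reads below are hypothetical toy data and the variable names are illustrative; this is a sketch of the idea rather than the ShortRead code.

```r
# Toy reads for one lane (hypothetical values).
reads <- c("AAAA", "ACGT", "AAAA", "GGCC", "AAAA", "ACGT")

# Count each distinct read and order from most to least frequent.
readCount <- sort(table(reads), decreasing = TRUE)

# Keep the two most frequent reads as a small summary table.
top <- head(readCount, 2)
frequent <- data.frame(sequence = names(top), count = as.vector(top))
print(frequent)
```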

Common duplicate reads after filtering

  ShortRead:::.freqSequences(qa, "filtered")
NA

Common duplicate reads after alignment

  ShortRead:::.freqSequences(qa, "aligned")
NA

Cycle-specific base calls and read quality

Per-cycle base calls should usually be approximately uniform across cycles. Genome Analyzer 'control' lane results often show a decline in A and an increase in T as cycles progress. This is likely an artifact of the underlying technology.

  perCycle <- qa[["perCycle"]]
  ShortRead:::.plotCycleBaseCall(perCycle$baseCall)
./image/perCycleBaseCall.jpg
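
To make the per-cycle base call summary concrete, the sketch below counts each base at each cycle for a few toy reads. It assumes all reads have the same length; the reads and variable names are hypothetical, and this is not how ShortRead computes `perCycle$baseCall`.

```r
# Toy reads of equal length (hypothetical values).
reads <- c("ACGT", "AAGT", "ACGA")
alphabet <- c("A", "C", "G", "T", "N")

# One row per read, one column per cycle.
calls <- do.call(rbind, strsplit(reads, ""))

# Count each base at each cycle; rows are bases, columns are cycles.
perCycleCount <- vapply(
    seq_len(ncol(calls)),
    function(cycle) tabulate(match(calls[, cycle], alphabet),
                             nbins = length(alphabet)),
    integer(length(alphabet))
)
rownames(perCycleCount) <- alphabet
print(perCycleCount)
```

Roughly constant rows across columns would correspond to the approximately uniform base calls described above.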

Per-cycle quality score. Reported quality scores are 'calibrated', i.e., they incorporate phred-like adjustments following sequence alignment. These typically decline with cycle, in an accelerating manner. Abrupt transitions in quality between cycles toward the end of the read might result when only some of the cycles are used for alignment: the cycles included in the alignment are calibrated more effectively than the cycles excluded from it.

The reddish lines are quartiles (solid: median; dotted: 25th and 75th percentiles); the green line is the mean. Shading is proportional to the number of reads.

  perCycle <- qa[["perCycle"]]
  ShortRead:::.plotCycleQuality(perCycle$quality)
./image/perCycleQuality.jpg
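
The quartile and mean lines in the figure can be illustrated with a small base R sketch. It assumes equal-length quality strings encoded with an ASCII offset of 33 (an assumption; other encodings exist), and the strings and variable names are hypothetical.

```r
# Toy quality strings (hypothetical), one per read, offset 33 assumed.
quality <- c("IIII", "IIH5", "II?+")
offset <- 33L

# One row per read, one column per cycle, holding decoded quality scores.
scores <- do.call(rbind, lapply(quality, function(q) utf8ToInt(q) - offset))

# Quartiles and mean of the quality score at each cycle.
cycleQuartiles <- apply(scores, 2, quantile, probs = c(0.25, 0.5, 0.75))
cycleMean <- colMeans(scores)
print(rbind(cycleQuartiles, mean = cycleMean))
```

In this toy example the median falls from 40 in the first two cycles to 20 in the last, mimicking the decline with cycle described above.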

Depth of coverage

Depth of coverage is the number of times aligned reads overlap a given sequence position.

  ShortRead:::.plotDepthOfCoverage(qa[["depthOfCoverage"]])
./image/depthOfCoverage.jpg
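
The definition of depth of coverage can be shown with a short base R sketch. The alignment start positions, read width, and reference length are hypothetical toy values, and the loop is a simple stand-in for the far more efficient methods used on real alignments.

```r
# Toy alignments (hypothetical): 1-based start positions of fixed-width reads.
start <- c(1L, 3L, 3L)
width <- 4L
referenceLength <- 8L

# Add one to every reference position that each read overlaps.
depth <- integer(referenceLength)
for (s in start) {
    covered <- s:(s + width - 1L)
    depth[covered] <- depth[covered] + 1L
}
print(depth)

# Number of positions at each depth, the basis for a coverage summary.
print(table(depth))
```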

Adapter contamination

Adapter contamination is defined here as non-genetic sequences attached at either or both ends of the reads. The 'contamination' measure is the number of reads with a right or left match to the adapter sequence, divided by the total number of reads. Matches allow mismatch rates of up to 10% on the left and 20% on the right, with a minimum overlap of 10 nt.

  ShortRead:::.ppnCount(qa[["adapterContamination"]])
Not available.
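
The contamination measure can be illustrated for the right end of the read only, using the 20% mismatch rate and 10 nt minimum overlap stated above. This is a simplified stand-in written in base R, not the matching procedure ShortRead uses; `hasRightAdapter`, the adapter sequence, and the reads are all hypothetical and chosen for illustration.

```r
# TRUE when some suffix of `read` (at least `minOverlap` bases long) matches
# the start of `adapter` with a mismatch rate no greater than `maxMismatch`.
hasRightAdapter <- function(read, adapter, maxMismatch = 0.2, minOverlap = 10L) {
    r <- strsplit(read, "")[[1]]
    a <- strsplit(adapter, "")[[1]]
    maxLen <- min(length(r), length(a))
    if (maxLen < minOverlap) return(FALSE)
    for (n in minOverlap:maxLen) {
        suffix <- r[(length(r) - n + 1L):length(r)]
        if (mean(suffix != a[seq_len(n)]) <= maxMismatch) return(TRUE)
    }
    FALSE
}

# Illustrative adapter and 36 nt reads (hypothetical values).
adapter <- "AGATCGGAAGAGC"
reads <- c(
    paste0(strrep("T", 26), "AGATCGGAAG"),  # exact 10 nt adapter overlap
    paste0(strrep("T", 26), "AGATCGGTAG"),  # 10 nt overlap with one mismatch
    strrep("C", 36)                         # no adapter at the right end
)

flagged <- vapply(reads, hasRightAdapter, logical(1),
                  adapter = adapter, USE.NAMES = FALSE)

# Contamination: reads with an adapter match over the total number of reads.
contamination <- mean(flagged)
print(contamination)
```

A full measure would also check the left end of each read, using the 10% mismatch rate.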

Tue Oct 11 17:41:36 2011; ShortRead v. 1.11.42
Report template: Martin Morgan