You will need the following software installed on the system that you want to use for creating tallies.
We start with a fresh install of Ubuntu 12.04 LTS.
First we install the required packages using the OS package manager (apt-get).
sudo apt-get install build-essential samtools python-dev swig cython python-numpy python-matplotlib hdf5-tools libhdf5-serial-1.8.4 libhdf5-serial-dev screen subversion git
Let's create a local Software directory to store our modules.
cd ~
mkdir Software
cd Software
wget http://pysam.googlecode.com/files/pysam-0.7.5.tar.gz
tar xvfz pysam-0.7.5.tar.gz
cd pysam-0.7.5
sudo python setup.py install
cd ~/Software
wget https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.5.4p3.tar.gz
tar xvfz HTSeq-0.5.4p3.tar.gz
cd HTSeq-0.5.4p3
./build_it
sudo python setup.py install
cd ~/Software
wget http://h5py.googlecode.com/files/h5py-2.2.0.tar.gz
tar xvfz h5py-2.2.0.tar.gz
cd h5py-2.2.0/
sudo python setup.py install
Let's get the tally.bam.py Python script and see if we can run it.
cd ~/Software
wget http://www.ebi.ac.uk/~pyl/h5vc/tally.bam.py
Calling the script with the -h option should print the following help message.
python tally.bam.py -h
usage: tally.bam.py [-h] [--bam BAM [BAM ...]] [--reg REG [REG ...]]
[--out OUT] [--group GROUP] [--log LOG]
[--padding_front PADDING_FRONT]
[--padding_back PADDING_BACK] [--readlen READLEN]
[--min_mapq MIN_MAPQ] [--loglvl LOGLVL] [--debug]
[--count_skipped] [--core] [--replace] [--gzip]
A HTSeq script for tallying mismatches, coverage and deletions a set of
samples and regions
optional arguments:
-h, --help show this help message and exit
--bam BAM [BAM ...] bam file to tally in, can be specified more than once;
use "SampleID:PatientID:SampleType:BamFileName" to
encode Sample ID, Patient ID and Sample Type!
Examples: "Pt1Relapse:Patient1:Case:pt1r.bam",
"Pt1Control:Patient1:Control:pt1c.bam"
--reg REG [REG ...] region to tally in (e.g. X:10000-20000), can be
specified more than once; use only the chromosome name
for whole chromosome processing
--out OUT output destination tag - will be prefixed with group
and sample, e.g. VariantTally.case.output.hfs
--group GROUP group_id for hfs5 output
--log LOG msg-log destination
--padding_front PADDING_FRONT
Number of cycles from the start of the read to
consider as untrustworthy - defaults to 10
--padding_back PADDING_BACK
Number of cycles from the end of the read to consider
as untrustworthy - defaults to 10
--readlen READLEN Length of the reads in the bam file (all reads in the
input bam must have the same length - defaults to 101)
--min_mapq MIN_MAPQ minimum mapping quality for a read to be considered -
defaults to 5
--loglvl LOGLVL ping every args.loglvl entries - defaults to 100000
--debug Should debug information be printed, this can lead to
really big log-files
--count_skipped Should cigar operations denoting skipped bases be
counted as deletions (CIGAR code "N")
--core Should the hfs5 "core" driver be used, this loads the
file into memory completely!
--replace Replace output file if it exists!
--gzip Use gzip compression in the hdf5 file
Version: 0.0.1 Written by Paul Theodor Pyl (pyl@embl.de), European Molecular
Biology Laboratory (EMBL). (c) 2012. Released under the terms of the GNU
General Public License v3.
We should also get the merge.tallies.by.chromosome.py Python script and see if we can run it. This script can be used to merge tally files containing the tallies of different samples on the same chromosome(s) and will be important in settings with large cohorts or where samples are added over time.
cd ~/Software
wget http://www.ebi.ac.uk/~pyl/h5vc/merge.tallies.by.chromosome.py
Calling the script with the -h option should print the following help message.
python merge.tallies.by.chromosome.py -h
usage: merge.tallies.by.chromosome.py [-h] [--tally TALLY [TALLY ...]]
[--out OUT] --chrom CHROM
[--group GROUP] [--log LOG] [--gzip]
A HTSeq script for merging tallies
optional arguments:
-h, --help show this help message and exit
--tally TALLY [TALLY ...]
tally to be added; all files must contain the same
chromosome;
--out OUT output destination
--chrom CHROM chromosome to process
--group GROUP group_id for hfs5 output
--log LOG msg-log destination
--gzip Use gzip compression in the hdf5 file
Version: 0.0.1 Written by Paul Theodor Pyl (pyl@embl.de), European Molecular
Biology Laboratory (EMBL). (c) 2012. Released under the terms of the GNU
General Public License v3.
Warning: The following examples take time, compute power, and disk space; you might need to spend a day or two on this :)
We will now perform a test run of the tally script using publicly available data from the European Nucleotide Archive. Specifically we will get some WGS runs from two yeast strains (s288c and YB210) that have been sequenced with Illumina HiSeq 2000 machines.
SRR488141
SRR488139
We suggest using Aspera Connect to get these files, but simple FTP is also fine; it might just take a long time.
s288c runs
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488141/SRR488141_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488141/SRR488141_2.fastq.gz
YB210 runs
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488139/SRR488139_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488139/SRR488139_2.fastq.gz
To use Aspera, first install the Aspera Connect plugin in the Firefox browser shipped with Ubuntu. The ascp executable will then be located in your home directory at ~/.aspera/connect/bin.
s288c runs
~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR488/SRR488141/SRR488141_1.fastq.gz .
~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR488/SRR488141/SRR488141_2.fastq.gz .
YB210 runs
~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR488/SRR488139/SRR488139_1.fastq.gz .
~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR488/SRR488139/SRR488139_2.fastq.gz .
bwa + ENSEMBL yeast genome
We will use bwa as an example of how to align those reads; of course you may choose a different aligner. Our choice is purely for the simplicity of this tutorial, and there might be various reasons for choosing a more complex setup than this.
bwa
cd ~/Software
git clone https://github.com/lh3/bwa.git
cd bwa; make
echo "alias bwa='~/Software/bwa/bwa'" >> ~/.bashrc
source ~/.bashrc
We download the reference from Ensembl.
cd ~
mkdir Reference
cd Reference
wget ftp://ftp.ensembl.org/pub/release-73/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa.gz
gunzip Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa.gz
Before we can use the reference FASTA file to align reads we need to index it for the given aligner.
bwa index Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa
This should not take long.
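To verify that the indexing worked you can list the directory; bwa index should have written its auxiliary files (.amb, .ann, .bwt, .pac and .sa) next to the FASTA file.
ls ~/Reference/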
We align the reads using the bwa mem command with the following extra options:
-t 4: use multithreading with 4 cores (depending on your machine this can be adjusted up or down)
-M: mark split alignments as secondary alignments to avoid errors with downstream tools (GATK doesn't like multiple primary alignments)
-R '@RG\tID:s288c_one\tSM:ilumina': this adds a read group to the .bam file
Assuming the .fastq.gz files were downloaded into separate sub-folders named for the respective strain of yeast (i.e. s288c and YB210) we can use the following commands to align the reads.
s288c
bwa mem -t 4 -M -R '@RG\tID:s288c\tSM:ilumina' Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa s288c/SRR488141_1.fastq.gz s288c/SRR488141_2.fastq.gz | samtools view -Shb - > s288c.bam
samtools sort s288c.bam s288c.sorted
samtools index s288c.sorted.bam
YB210
bwa mem -t 4 -M -R '@RG\tID:YB210\tSM:ilumina' Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa YB210/SRR488139_1.fastq.gz YB210/SRR488139_2.fastq.gz | samtools view -Shb - > YB210.bam
samtools sort YB210.bam YB210.sorted
samtools index YB210.sorted.bam
This step has to be performed manually, since an account with the GATK forums is required before downloading the GATK.
Even though it is highly recommended to perform indel realignment (as the GATK can do), for the purposes of this vignette you may skip this step and continue with the actual tallying.
Once you have downloaded and extracted it, the folder should contain the following files (assuming you did export GATK=/path/to/GATK/):
cd $GATK
ls -l
-rw-r--r-- 1 pyl pyl 14436682 Aug 28 22:32 GenomeAnalysisTK.jar
drwxr-xr-x 2 pyl pyl 4096 Sep 12 12:44 resources
cd ~/Software
wget http://downloads.sourceforge.net/project/picard/picard-tools/1.98/picard-tools-1.98.zip
unzip picard-tools-1.98.zip
cd ~/Reference
java -jar ~/Software/picard-tools-1.98/CreateSequenceDictionary.jar R=Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa O=Saccharomyces_cerevisiae.EF4.73.dna.toplevel.dict
samtools faidx Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa
You can now run the RealignerTargetCreator walker to create a set of intervals in which realignment might be necessary.
You need Java 7 for this; i.e. do sudo apt-get install openjdk-7-jre first.
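Note that the commands below call the GATK through a gatk shorthand, which is not created by any of the previous steps, so you will need to define something like it yourself. A minimal sketch, assuming the jar sits at $GATK/GenomeAnalysisTK.jar as exported above:
# hypothetical convenience wrapper around the GATK jar
gatk() { java -jar "$GATK/GenomeAnalysisTK.jar" "$@"; }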
gatk -T RealignerTargetCreator -nt 4 -R Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -I s288c.sorted.bam -I YB210.sorted.bam -o target.intervals
The parameters are:
-nt 4: use 4 threads in parallel
-o target.intervals: this file will hold the target intervals to realign in
The resulting file is simply a list of intervals in which indels might occur and that need to be realigned.
head -n 3 target.intervals
I:1723-1724
I:2875-2876
I:3889-3890
We can now run the IndelRealigner walker with the following command:
gatk -T IndelRealigner -R Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -I s288c.sorted.bam -I YB210.sorted.bam --targetIntervals target.intervals -nWayOut .realigned.bam
This might take a while …
The tallying is based on the MD tag in the .bam file, which encodes mismatches. This tag can be invalidated by the realignment and therefore has to be recomputed.
samtools calmd s288c.sorted.realigned.bam Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -b | samtools view -bh - > s288c.final.bam
samtools calmd YB210.sorted.realigned.bam Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -b | samtools view -bh - > YB210.final.bam
samtools index s288c.final.bam
samtools index YB210.final.bam
Now we have our realigned files with correct MD tags and indexes and can finally use them to create the tally.
We specify the sample s288c as control and the sample YB210 as case, create the gzipped output file yeast.hfs5 and set the read length, which is 100. We also specify all the chromosomes as regions to tally in. In a human sample we would most likely use a compute cluster and parallelize at the chromosome level, but for the yeast genome this is not necessary.
python tally.bam.py --bam s288c:yeast:Control:s288c.final.bam --bam YB210:yeast:Case:YB210.final.bam --out yeast.hfs5 --group yeast --reg I --reg II --reg III --reg IV --reg V --reg VI --reg VII --reg VIII --reg IX --reg X --reg XI --reg XII --reg XIII --reg XIV --reg XV --reg XVI --reg Mito --gzip --readlen 100 --replace
In some settings we might not want to create the tally file for the whole cohort in one thread on one machine. This might be due to the size of the cohort (e.g. many human whole genome samples) or because we expect more samples to be added over time. In this case we can parallelize the tallying by sample and chromosome, so that in a first step we create a tally file for each chromosome in each sample and then merge the files for the same chromosomes in all samples into the final set of files.
We could also merge the files of different chromosomes together, but most operations we want to perform on tallies happen on a single chromosome (or a region therein) without depending on data from other chromosomes, so this is not necessary.
The following example shows a parallelized version of the workflow for tallying our two yeast strains (bash example).
for chrom in I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI Mito
do
bsub -J $chrom.s288c "python tally.bam.py --reg $chrom --bam s288c:yeast:Control:s288c.final.bam --out s288c.$chrom.hfs5 --group yeast --gzip --readlen 100 --replace"
bsub -J $chrom.YB210 "python tally.bam.py --reg $chrom --bam YB210:yeast:Case:YB210.final.bam --out YB210.$chrom.hfs5 --group yeast --gzip --readlen 100 --replace"
done
We use the bsub command to interact with an LSF cluster; depending on your available compute infrastructure you might have to adapt this part of the code to use a different command (or simply send the processes to the background by appending &, if you have a multicore machine with a performant storage solution, i.e. RAID / SSDs).
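For reference, a minimal sketch of the same loop without a cluster, assuming a single multicore machine (this starts all jobs at once, so you may want to throttle the number of concurrent processes):
for chrom in I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI Mito
do
# start the tally jobs for both samples in the background
python tally.bam.py --reg $chrom --bam s288c:yeast:Control:s288c.final.bam --out s288c.$chrom.hfs5 --group yeast --gzip --readlen 100 --replace &
python tally.bam.py --reg $chrom --bam YB210:yeast:Case:YB210.final.bam --out YB210.$chrom.hfs5 --group yeast --gzip --readlen 100 --replace &
done
wait # wait for all background tallies to finish before merging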
Running this code produces files like s288c.XIV.hfs5 or YB210.IV.hfs5. We will now create one tally file per chromosome by merging the files belonging to the same chromosome.
for chrom in I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI Mito
do
bsub -J $chrom.yeast.merge "python merge.tallies.by.chromosome.py --out yeast.$chrom.hfs5 --chrom $chrom --group yeast --gzip --tally YB210.$chrom.hfs5 --tally s288c.$chrom.hfs5"
done
After those jobs are run we have the files yeast.I.hfs5 to yeast.Mito.hfs5 and can now go on doing analyses on them.
We can now use the tools provided by h5vc and rhdf5 to have a look into our tally file.
library(h5vc)
library(rhdf5)
h5ls("yeast.hfs5")
Here we see a list of all the chromosomes and datasets; since we put all chromosomes into the same file this is pretty long.
group name otype dclass dim
0 / yeast H5I_GROUP
1 /yeast I H5I_GROUP
2 /yeast/I Counts H5I_DATASET INTEGER 12 x 2 x 2 x 230218
3 /yeast/I Coverages H5I_DATASET INTEGER 2 x 2 x 230218
4 /yeast/I Deletions H5I_DATASET INTEGER 2 x 2 x 230218
5 /yeast/I Reference H5I_DATASET INTEGER 230218
6 /yeast II H5I_GROUP
7 /yeast/II Counts H5I_DATASET INTEGER 12 x 2 x 2 x 813184
8 /yeast/II Coverages H5I_DATASET INTEGER 2 x 2 x 813184
9 /yeast/II Deletions H5I_DATASET INTEGER 2 x 2 x 813184
10 /yeast/II Reference H5I_DATASET INTEGER 813184
[...]
81 /yeast XVI H5I_GROUP
82 /yeast/XVI Counts H5I_DATASET INTEGER 12 x 2 x 2 x 948066
83 /yeast/XVI Coverages H5I_DATASET INTEGER 2 x 2 x 948066
84 /yeast/XVI Deletions H5I_DATASET INTEGER 2 x 2 x 948066
85 /yeast/XVI Reference H5I_DATASET INTEGER 948066
Note how the last dimension (the genomic position) differs from chromosome to chromosome, while the first dimensions (bases, samples and strands) stay the same.
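The same information can also be inspected from the command line with the h5ls tool from the hdf5-tools package we installed at the beginning; note that the command-line tool reports dimensions in the file's native (C) order, which may appear reversed relative to the rhdf5 output above.
# list the datasets under /yeast/I together with their dimensions
h5ls yeast.hfs5/yeast/I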
We check if the sample data was correctly added.
getSampleData( "yeast.hfs5", "/yeast/I" )
This data frame is quite small since we only have 2 samples in this cohort. We can see that s288c was called Control and written into the first slot of the sample dimension (i.e. the coverage of s288c on chromosome I is found in the respective 'Coverages' dataset like so: data$Coverages[1,,], assuming we have used h5dapply to extract the data; see the other vignettes for details).
Sample Column Patient Type SampleFiles
1 YB210 2 yeast Case YB210.final.bam
2 s288c 1 yeast Control s288c.final.bam
Now we can go on and find differences between the samples with the tools described in the other vignettes or build a genome browser and have a look at the contents of our tally file (see h5vc.simple.genome.browser for this).