In this vignette we show how to create tally files form bam files within an interactive R session (note that this is not a good idea for large datasets).
Dear Windows users, please be advised that the code in this vignette is not guaranteed to run under Windows because of an issue in the HDF5 implementation for that platform. Evaluation is disabled to avoid problems and you wil not see the plots / outputs for this vignette.
First we load necessary packages and check for the example .bam files shipped with h5vc
.
suppressPackageStartupMessages(library(h5vc)) # h5vc is needed
suppressPackageStartupMessages(library(rhdf5)) # rhdf5 is needed
suppressPackageStartupMessages(library(deepSNV)) # we use deepSNV for tallying
files <- c("NRAS.AML.bam", "NRAS.Control.bam")
bamFiles <- file.path(system.file("extdata", package = "h5vcData"), files)
Those bam files contain reads from a matched pair of samples from an AML patient focusing on the region containing the gene NRAS (Chromosome 1: 115,247,090-115,259,515).
We will now create the tally file and create the groups that represent the study and chromosome we want to work on. Before we do this we have to find out how big our datasets have to be in their genomic-position dimension, to do this we will look into the header of the bam files and extract the sequence length information.
chromdim <- sapply(scanBamHeader(bamFiles), function(x) x$targets)
colnames(chromdim) <- files
head(chromdim)
## NRAS.AML.bam NRAS.Control.bam
## 1 249250621 249250621
## 2 243199373 243199373
## 3 198022430 198022430
## 4 191154276 191154276
## 5 180915260 180915260
## 6 171115067 171115067
Both files have the same header information and are fully compatible, so we can just pick one file and take the chromosome lengths from there.
chrom <- "1"
chromlength <- chromdim[chrom, 1]
study <- "/NRAS"
tallyFile <- file.path(tempdir(), "NRAS.tally.hfs5")
if (file.exists(tallyFile)) {
file.remove(tallyFile)
}
h5createFile(tallyFile)
## [1] TRUE
group <- paste(study, chrom, sep = "/")
h5createGroup(tallyFile, study) # create the toplevel group first
## [1] TRUE
h5createGroup(tallyFile, group)
## [1] TRUE
h5createDataset(tallyFile, paste(group, "Counts", sep = "/"), dims = c(12, 2,
2, chromlength), storage.mode = "integer", level = 9) #Creating the Counts group for chromosome 1 with 12 bases, 2 samples, 2 strands and 249250621 positions
## [1] TRUE
h5createDataset(tallyFile, paste(group, "Deletions", sep = "/"), dims = c(2,
2, chromlength), storage.mode = "integer", level = 9) #Creating the Deletions group for chromosome 1 with 2 samples, 2 strands and 249250621 positions
## [1] TRUE
h5createDataset(tallyFile, paste(group, "Coverages", sep = "/"), dims = c(2,
2, chromlength), storage.mode = "integer", level = 9) #Creating the Coverages group for chromosome 1 with 2 samples, 2 strands and 249250621 positions
## [1] TRUE
h5createDataset(tallyFile, paste(group, "Reference", sep = "/"), dims = c(chromlength),
storage.mode = "integer", level = 9) #Creating the Reference group for chromosome 1 with 249250621 positions
## [1] TRUE
h5ls(tallyFile)
## group name otype dclass dim
## 0 / NRAS H5I_GROUP
## 1 /NRAS 1 H5I_GROUP
## 2 /NRAS/1 Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER 2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER 2 x 2 x 249250621
## 5 /NRAS/1 Reference H5I_DATASET INTEGER 249250621
Note that we use the default layout for tally files with 12 slots for bases (based on c("A","C","G","T")
times the 3
bins for sequencing cycle, i.e. start, middle, end ). The function bam2R
from the deepSNV
package, which we will use to create the tally actually does not stratify like this so we will leave the first and last 4 slots for bases empty for now.
Now we will attach the sampleData
to the group.
sampleData <- data.frame(Sample = c("AML", "Control"), Column = c(1, 2), Patient = c("Pt1",
"Pt1"), Type = c("Case", "Control"), stringsAsFactors = FALSE)
setSampleData(tallyFile, group, sampleData)
getSampleData(tallyFile, group)
## Column Patient Sample Type
## 1 1 Pt1 AML Case
## 2 2 Pt1 Control Control
Note that we set the columns to be 1
and 2
respectively while within the tally file the values 0
and 1
are stored. setSampleData
and getSampleData
automatically remove / add 1
from the value. This is implemented to adress the fact that R counts 1-based while HDF5 internally counts 0-based.
Now it is time to extract tally information from the bam files.
startpos <- 115247090
endpos <- 115259515
Counts <- lapply(bamFiles, function(bamf) {
bam2R(file = bamf, chr = chrom, start = startpos, stop = endpos, head.clip = 10) #we tell it to ignore the first and last 10 sequencing cycles
})
When we now have a look at the resulting matrix we can see that some transformations will be needed to make the data compatible with the h5vc
tally file format. The good news is that we can get counts for both strands at all positions and even get deletions and insertion counts (we will ignore the latter for now). Unfortunately bam2R
choses not to store the coverage separately and instead stored it implicitly in the sum of counts (technically the deletions have to be added in also here). This means we will have counts in the slot corresponding to the reference base which would break some of the downstream functions of h5vc
. We will calculate the coverages by summing up the counts and deletions for each position and strand.
Counts[[1]][5000:5010, ]
## A T C G - N INS DEL HEAD TAIL QUAL a t c g _ n ins del head tail
## [1,] 11 0 0 0 0 5 0 0 0 0 440 0 0 0 0 0 0 0 0 0 0
## [2,] 12 0 0 0 0 4 0 0 0 0 480 0 0 0 0 0 0 0 0 0 0
## [3,] 12 0 0 0 0 4 0 0 0 0 480 0 0 0 0 0 0 0 0 0 0
## [4,] 13 0 0 0 0 6 0 0 3 0 520 0 0 0 0 0 0 0 0 0 0
## [5,] 13 0 0 0 0 8 0 0 2 0 520 0 0 0 0 0 0 0 0 0 0
## [6,] 0 14 0 0 0 7 0 0 0 0 560 0 0 0 0 0 0 0 0 0 0
## [7,] 0 0 0 13 0 9 0 0 1 0 560 0 0 0 0 0 0 0 0 0 0
## [8,] 0 0 15 0 0 9 0 0 2 0 600 0 0 0 0 0 0 0 0 0 0
## [9,] 15 0 0 0 0 9 0 0 0 0 640 0 0 0 0 0 0 0 0 0 0
## [10,] 0 15 0 0 0 9 0 0 0 0 640 0 0 0 0 0 0 0 0 0 0
## [11,] 16 0 0 0 0 8 0 0 0 0 640 0 0 0 0 0 0 0 0 0 0
## qual
## [1,] 0
## [2,] 0
## [3,] 0
## [4,] 0
## [5,] 0
## [6,] 0
## [7,] 0
## [8,] 0
## [9,] 0
## [10,] 0
## [11,] 0
Coverages <- lapply(Counts, function(count) matrix(c(rowSums(count[, c("A",
"C", "G", "T", "DEL")]), rowSums(count[, c("a", "c", "g", "t", "del")])),
ncol = 2, byrow = FALSE, dimnames = list(NULL, c("Fwd", "Rev"))))
Coverages[[1]][5000:5010, ]
## Fwd Rev
## [1,] 11 0
## [2,] 12 0
## [3,] 12 0
## [4,] 13 0
## [5,] 13 0
## [6,] 14 0
## [7,] 13 0
## [8,] 15 0
## [9,] 15 0
## [10,] 15 0
## [11,] 16 0
Deletions <- lapply(Counts, function(count) count[, c("DEL", "del")])
Normally one would at this point load the reference sequence and mask out the counts for the reference base at each position. For simplicity, speed and since we want to illustrate comparative analyses mostly we will estimate the reference base as the base with the most counts in both strands and samples (potentially losing e.g. homozygous germline variants in the process).
Counts <- lapply(Counts, function(count) count[, c("A", "C", "G", "T", "a",
"c", "g", "t")]) # kick out all unnecessary columns
ref <- apply(Counts[[1]][, 1:4] + Counts[[1]][5:8] + Counts[[2]][, 1:4] + Counts[[2]][5:8],
1, which.max)
for (i in seq(length(ref))) {
Counts[[1]][i, ref[i]] <- 0 #Set the forward strand in the first sample to zero
Counts[[1]][i, (ref[i] + 4)] <- 0
Counts[[2]][i, ref[i]] <- 0
Counts[[2]][i, (ref[i] + 4)] <- 0 #Set the reverse strand in the first sample to zero
}
Reference <- ref - 1 #the tally file encodes the reference as A=0,C=1,G=2,T=3
Now we will write the data to the tally file.
for (sample in 1:2) {
h5write(t(Counts[[sample]][, 1:4]), tallyFile, paste(group, "Counts", sep = "/"),
index = list(5:8, sample, 1, startpos:endpos)) #Sample One Forward Strand
h5write(t(Counts[[sample]][, 5:8]), tallyFile, paste(group, "Counts", sep = "/"),
index = list(5:8, sample, 2, startpos:endpos)) #Sample One Reverse Strand
h5write(Coverages[[sample]][, "Fwd"], tallyFile, paste(group, "Coverages",
sep = "/"), index = list(sample, 1, startpos:endpos)) #Sample One Forward Strand
h5write(Coverages[[sample]][, "Rev"], tallyFile, paste(group, "Coverages",
sep = "/"), index = list(sample, 2, startpos:endpos)) #Sample One Reverse Strand
h5write(Deletions[[sample]][, "DEL"], tallyFile, paste(group, "Deletions",
sep = "/"), index = list(sample, 1, startpos:endpos)) #Sample Two Forward Strand
h5write(Deletions[[sample]][, "del"], tallyFile, paste(group, "Deletions",
sep = "/"), index = list(sample, 2, startpos:endpos)) #Sample Two Reverse Strand
}
h5write(Reference, tallyFile, paste(group, "Reference", sep = "/"), index = list(startpos:endpos)) #The Reference
h5ls(tallyFile)
## group name otype dclass dim
## 0 / NRAS H5I_GROUP
## 1 /NRAS 1 H5I_GROUP
## 2 /NRAS/1 Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER 2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER 2 x 2 x 249250621
## 5 /NRAS/1 Reference H5I_DATASET INTEGER 249250621
This very explicit step-by-step implementation can be streamlined quite a bit to prevent much of the code-duplication.
We will use the h5dapply
function provided by h5vc
to extract the data again and have a look at it.
data <- h5dapply(filename = tallyFile, group = group, blocksize = 1e+08, range = c(startpos,
endpos))[[1]] # we use a blocksize larger than the range to get all data in one block (extracted by [[1]])
str(data)
## List of 5
## $ Counts : int [1:12, 1:2, 1:2, 1:12425] 0 0 0 0 0 0 0 0 0 0 ...
## $ Coverages : int [1:2, 1:2, 1:12425] 0 0 0 0 0 0 0 0 0 0 ...
## $ Deletions : int [1:2, 1:2, 1:12425] 0 0 0 0 0 0 0 0 0 0 ...
## $ Reference : int [1:12425(1d)] 0 0 0 0 0 0 0 0 0 0 ...
## $ h5dapplyInfo:List of 4
## ..$ Blockstart: num 1.15e+08
## ..$ Blockend : num 1.15e+08
## ..$ Datasets :'data.frame': 4 obs. of 7 variables:
## .. ..$ group : chr [1:4] "/NRAS/1" "/NRAS/1" "/NRAS/1" "/NRAS/1"
## .. ..$ name : chr [1:4] "Counts" "Coverages" "Deletions" "Reference"
## .. ..$ otype : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5 5 5 5
## .. ..$ dclass : chr [1:4] "INTEGER" "INTEGER" "INTEGER" "INTEGER"
## .. ..$ dim : chr [1:4] "12 x 2 x 2 x 249250621" "2 x 2 x 249250621" "2 x 2 x 249250621" "249250621"
## .. ..$ DimCount: int [1:4] 4 3 3 1
## .. ..$ PosDim : int [1:4] 4 3 3 1
## ..$ Group : chr "/NRAS/1"
We will call variants within this gene now:
vars <- h5dapply(filename = tallyFile, group = group, blocksize = 1000, FUN = callVariantsPaired,
sampledata = getSampleData(tallyFile, group), cl = vcConfParams(returnDataPoints = TRUE),
range = c(startpos, endpos))
vars <- do.call(rbind, vars)
vars
## Chrom Start End Sample altAllele refAllele
## 115258090:115259089 1 115258747 115258747 AML G C
## caseCountFwd caseCountRev caseCoverageFwd
## 115258090:115259089 12 22 27
## caseCoverageRev controlCountFwd controlCountRev
## 115258090:115259089 41 0 0
## controlCoverageFwd controlCoverageRev
## 115258090:115259089 53 37
By cleverly selecting the example data we have found exactly one variant that seems ot be interesting and will now plot the region in question to also check if the mismatchPlot
function will work with the tally data we created.
position <- vars$End[1]
windowsize <- 50
data <- h5dapply(filename = tallyFile, group = group, blocksize = 1e+08, range = c(position -
windowsize, position + windowsize))[[1]]
p <- mismatchPlot(data, getSampleData(tallyFile, group), samples = c("AML",
"Control"), windowsize = windowsize, position = position)
print(p)
It would seem that we have found a reliable variant here.
Below you can see the code used to fill in the Reference
dataset in the example tally file, we do not run it here since we do not have the FASTA file available. This code could easily be adapted to replace the majority-vote reference calling used in the above examples.
library(h5vc)
library(rhdf5)
tallyFile <- system.file("extdata", "example.tally.hfs5", package = "h5vcData")
library(ShortRead)
reference <- readDNAStringSet("GRCh37.72.fa")
names(reference) <- sapply(strsplit(names(reference), " "), function(x) x[1])
regions <- list(`16` = c(2.9e+07, 3e+07), `22` = c(3.8e+07, 4e+07))
for (chrom in names(regions)) {
start <- regions[[chrom]][1]
end <- regions[[chrom]][2]
ref <- Views(reference[[chrom]], start = start, end = end)
ds <- encodeDNAString(ref[[1]])
h5write(ds, tallyFile, paste("/ExampleStudy", chrom, "Reference", sep = "/"),
index = list(start:end))
}
The bam2R
function from the deepSNV
package gives us the number of insertion calls at each position (more precisely, between each position and the next one).
First, we have to create the dataset Insertions
to hold our data and we will use the same layout as we used for Deletions
.
h5createDataset(tallyFile, paste(group, "Insertions", sep = "/"), dims = c(2,
2, chromlength), storage.mode = "integer", level = 9) #Creating the Deletions group for chromosome 1 with 2 samples, 2 strands and 249250621 positions
## [1] TRUE
h5ls(tallyFile)
## group name otype dclass dim
## 0 / NRAS H5I_GROUP
## 1 /NRAS 1 H5I_GROUP
## 2 /NRAS/1 Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER 2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER 2 x 2 x 249250621
## 5 /NRAS/1 Insertions H5I_DATASET INTEGER 2 x 2 x 249250621
## 6 /NRAS/1 Reference H5I_DATASET INTEGER 249250621
Next we will collate the insertion information from the tally returned by bam2R
.
Counts <- lapply(bamFiles, function(bamf) {
bam2R(file = bamf, chr = chrom, start = startpos, stop = endpos, head.clip = 10) #we tell it to ignore the first and last 10 sequencing cycles
})
Insertions <- lapply(Counts, function(count) count[, c("INS", "ins")]) # kick out all unnecessary columns
Now we use the same code as above to also write the Insertions
into the tally File.
for (sample in 1:2) {
h5write(Insertions[[sample]][, "INS"], tallyFile, paste(group, "Insertions",
sep = "/"), index = list(sample, 1, startpos:endpos)) #Sample Two Forward Strand
h5write(Insertions[[sample]][, "ins"], tallyFile, paste(group, "Insertions",
sep = "/"), index = list(sample, 2, startpos:endpos)) #Sample Two Reverse Strand
}
h5write(Reference, tallyFile, paste(group, "Reference", sep = "/"), index = list(startpos:endpos)) #The Reference
h5ls(tallyFile)
## group name otype dclass dim
## 0 / NRAS H5I_GROUP
## 1 /NRAS 1 H5I_GROUP
## 2 /NRAS/1 Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER 2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER 2 x 2 x 249250621
## 5 /NRAS/1 Insertions H5I_DATASET INTEGER 2 x 2 x 249250621
## 6 /NRAS/1 Reference H5I_DATASET INTEGER 249250621
We will now plot the mismatch plot of the variant again, adding the Insertions
as an additional layer in the plot by using the geom_h5vc
function (we specify fill="orange"
so insertions should show up as orange boxes).
data <- h5dapply(filename = tallyFile, group = group, blocksize = 1e+08, range = c(position -
windowsize, position + windowsize))[[1]]
p <- mismatchPlot(data, getSampleData(tallyFile, group), samples = c("AML",
"Control"), windowsize = windowsize, position = position)
p <- p + geom_h5vc(data, getSampleData(tallyFile, group), samples = c("AML",
"Control"), windowsize = windowsize, position = position, dataset = "Insertions",
fill = "orange")
print(p)
The above code examplifies the ease with which we can extend plots from ggplot2
with additional layers of information.
Note that the region we plot here actually doesn't contain any Insertions
since we do not see any orange blocks in this plot.
For quality control this is still an interesting result since it confirms that our variant does not fall next to a small insertion or deletion (which could cause artefacts).