Bioconductor Newsletter
posted by Valerie Obenchain, April 2014
Contents
- Introduction
- Infrastructure
- Networks / Annotations
- Web Site-ish
- Programming bits
- Courses / Conferences
- Looking ahead
Introduction
Welcome to the first Bioconductor newsletter! In this space we will provide project updates from the Seattle team and recent contributions from core developers. Topics will include infrastructure reorganizations, new packages and features, conferences, and other areas of interest.
The newsletter will be on a quarterly schedule with the intent of summarizing major developments of the past 3 months. Comments and suggestions are welcome, please send to Valerie at vobencha@fhcrc.org.
Infrastructure
GenomicAlignments
package
Last fall Hervé did some housecleaning on the infrastructure
packages. All GAlignments*
classes and related functions were
moved from
GenomicRanges
and
Rsamtools
to the new
GenomicAlignments
package. Classes and functions from GenomicRanges
that built on
functionality in Rsamtools
were also moved. The reorg substantially
cleaned up our dependency tree and will be reflected in the upcoming
release, BioC 2.14. (ASCII art courtesy of Hervé).
IRanges
^ ^
| |
| XVector
| ^
| |
GenomicRanges Biostrings
^ ^
| |
Rsamtools
^
|
BSgenome
^
|
GenomicAlignments
^
|
rtracklayer
^
|
GenomicFeatures
Junction tools in GenomicAlignments
New tools for extracting junctions from genomic alignments have been
added to the GenomicAlignments
package. The junctions()
function,
previously known as gaps()
, extracts the portion of a read
corresponding to the N operation in the CIGAR.
New junction-related operations:
- summarize the number of reads that overlap each junction
- retrieve junction motifs based on dinucleotides at junction boundaries (BSgenome package required)
- import junction files generated by TopHat and STAR aligners
- determine if the RNA-seq protocol was stranded
GenomeInfoDb
package
Converting between chromosome names from different institutions presents a challenge when working with ranges and annotations. The new GenomeInfoDb package is a step towards resolving these inconsistencies by providing a mapping interface between styles (e.g., UCSC, NCBI, Ensembl, …) and organisms.
Low-level functions related to seqnames and SeqnamesStyle were
moved from GenomicRanges
and
AnnotationDbi
into GenomeInfoDb
. Some functions were deprecated, some renamed.
See the
NEWS
file for a complete listing of activities.
The package is intended to serve as the base for methods querying or transforming seqlevels. Future work will address the problem of auto-detecting incompatible styles between objects in operations that require matching seqlevels.
Sonali, Marc, Hervé and Martin have all contributed to this work.
Currently 9 organisms are supported:
library(GenomeInfoDb)
names(genomeStyles())
## [1] "Arabidopsis_thaliana" "Caenorhabditis_elegans"
## [3] "Cyanidioschyzon_merolae" "Drosophila_melanogaster"
## [5] "Homo_sapiens" "Oryza_sativa"
## [7] "Populus_trichocarpa" "Saccharomyces_cerevisiae"
## [9] "Zea_mays"
There are methods for style discovery,
seqlevelsStyle("chr3")
## [1] "UCSC"
and for converting between styles. Here we map from UCSC to NCBI:
mapSeqlevels(c("chrII", "chrIII", "chrM"), "NCBI")
## chrII chrIII chrM
## "II" "III" "MT"
Networks / Annotations
AnnotationHub
package
The
AnnotationHub
resource continues to grow steadily. As of March
2014 the hub supports 204 genomes and 98 species from 7 data providers
(Ensembl, NCBI, UCSC, dbSNP, HAEMCODE, Encode, Inparanoid 8). The hub
is a repository of annotation data parsed from original file format
to R objects for convient computing. For example, BED, GTF, and track
files from UCSC are available as GRanges
. VCF files from dbSNP have
been parsed into VCF
objects and FASTA files are available as FaFile
objects.
Marc, Dan and Paul have led this effort.
RefNet
package
A new package to watch for in the Spring release is RefNet
.
This package offers access to molecular interaction data in computable
form and is the first of its kind in Bioconductor. The package offers
access to 2 transcription factor networks and Recon 2 (consensus
metabolic reconstruction compiled from 5 resources). Queries are
made using the
PSICQUIC
or RefNet.db
packages. (Note: RefNet
and RefNet.db
will be
available in BioC 2.14.)
These data can be used to curate or study experiment-specific pathways, explore the function and relationships of differentially expressed genes, and offer insight to a variety of explorations at the molecular level.
RefNet
is extensible and we encourage contributions. Networks, known
interactions, infection pathways, disease-host interactions, or pathways
from manuscripts would be valuable additions. Guidance for contributing
data are included in the package.
Credit for RefNet
, RefNet.db
and PSICQUIC
goes to Paul with
contributions from Mentored Projects mentees.
Web Site-ish
biocViews terms
The biocViews terms used to categorize packages have been been refreshed. New terms were added, old terms removed and packages with old or invalid terms were updated. The updated web-interface allows searching by biocViews term or subject. A check box option displays terms not associated with any package. Thanks to Sonali and Dan for the work on this.
BiocCheck
package
BiocCheck is a new package that encapsulates Bioconductor package guidelines and best practices. It was designed to provide guidance to new package authors and is run by the SinglePackageBuilder during the package review process.
BiocCheck
can also be used from the command line
R CMD BiocCheck package
and should be run after R CMD check
successfully completes. A
full-featured vignette
explains the suggestions made by the package.
Dan is the architect behind this one.
Programming bits
Hits
class
One of my favorite objects is the Hits
class. It is a simple,
straightforward class that can be used for complicated indexing and
combining. It stores a set of “hits” between two vector-like objects and
is most familiar as the output of a findOverlaps
operation.
Here we use Hits
to identify and remove the “shared” regions in a
set of ranges. The shared region in the original set is positions 5 to 10.
Range 3 is completely within this shared range.
library(GenomicRanges)
gr1 <- GRanges("A", IRanges(c(1, 5, 5), c(10, 15, 10)))
gr1
## GRanges with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] A [1, 10] *
## [2] A [5, 15] *
## [3] A [5, 10] *
## ---
## seqlengths:
## A
## NA
First we disjoin the ranges and perform findOverlaps
which
will give us a Hits
object:
gr2 <- disjoin(gr1)
ov <- findOverlaps(gr1, gr2)
ov
## Hits of length 5
## queryLength: 3
## subjectLength: 3
## queryHits subjectHits
## <integer> <integer>
## 1 1 1
## 2 1 2
## 3 2 2
## 4 2 3
## 5 3 2
Convenience functions countQueryHits' and
countSubjectHits’
summarize the number of times each record in the query
or
subject
was hit.
countSubjectHits(ov)
## [1] 1 3 1
The regions that are not shared between the ranges will be counted only once (self-hit).
keep_idx <- which(countSubjectHits(ov) == 1L)
Accessors queryHits
and subjectHits
extract the corresponding
columns from the Hits
object:
subjectHits(ov)
## [1] 1 2 2 3 2
The extracted subjectHits
can be used in a matching operation
to subset the Hits
object containing just the “unshared” regions:
ov <- ov[subjectHits(ov) %in% keep_idx]
ov
## Hits of length 2
## queryLength: 3
## subjectLength: 3
## queryHits subjectHits
## <integer> <integer>
## 1 1 1
## 2 2 3
Again subjectHits
are extracted but this time from the subset
Hits
object. We use this index to subset the disjoined original
ranges leaving only the “unshared” regions:
ans_regions <- gr2[subjectHits(ov)]
ans_regions
## GRanges with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] A [ 1, 4] *
## [2] A [11, 15] *
## ---
## seqlengths:
## A
## NA
To finish this off we make a GRangesList
of the unshared regions.
The list is length 3 corresponding to the 3 ranges in `gr1’:
ans_eltlens <- countQueryHits(ov)
ans_skeleton <- PartitioningByEnd(cumsum(ans_eltlens))
relist(ans_regions, ans_skeleton)
## GRangesList of length 3:
## [[1]]
## GRanges with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] A [1, 4] *
##
## [[2]]
## GRanges with 1 range and 0 metadata columns:
## seqnames ranges strand
## [1] A [11, 15] *
##
## [[3]]
## GRanges with 0 ranges and 0 metadata columns:
## seqnames ranges strand
##
## ---
## seqlengths:
## A
## NA
Thanks to Michael and Hervé for the work on the Hits
class.
Self-matching
Recently I had a question about splitting on groups/factors. I thought the solution Hervé came up with was quite elegant and worth sharing.
The problem boils down to converting the ‘group1’ vector to ‘group2’:
group1 <- c(3, 1, 2, 2)
group2 <- c(1, 2, 3, 3)
The motivating use case was to split a GRanges based on the
groups in group1
but not the implied order.
When we split this GRanges
x <- GRanges(c(rep("chr2", 2), rep("chr19", 2)), IRanges(1:4, width = 1))
x
## GRanges with 4 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2 [1, 1] *
## [2] chr2 [2, 2] *
## [3] chr19 [3, 3] *
## [4] chr19 [4, 4] *
## ---
## seqlengths:
## chr19 chr2
## NA NA
by group1
the object is re-ordered to match the priority
in group1
.
split(x, group1)
## GRangesList of length 3:
## $1
## GRanges with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2 [2, 2] *
##
## $2
## GRanges with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## [1] chr19 [3, 3] *
## [2] chr19 [4, 4] *
##
## $3
## GRanges with 1 range and 0 metadata columns:
## seqnames ranges strand
## [1] chr2 [1, 1] *
##
## ---
## seqlengths:
## chr19 chr2
## NA NA
This is expected behavior but wasn’t exactly the result I was
after. I wanted to respect the grouping in group1
but not the order.
Essentially I wanted to turn group1
in to group2
so
I would get this result:
split(x, group2)
## GRangesList of length 3:
## $1
## GRanges with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2 [1, 1] *
##
## $2
## GRanges with 1 range and 0 metadata columns:
## seqnames ranges strand
## [1] chr2 [2, 2] *
##
## $3
## GRanges with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## [1] chr19 [3, 3] *
## [2] chr19 [4, 4] *
##
## ---
## seqlengths:
## chr19 chr2
## NA NA
Solution:
One straight-forward solution is to re-order the GRangesList object once created.
grl <- split(x, group1)
grl[as.character(unique(group1))]
However the data were large so I was hoping to solve the ordering
problem before creating the list. The second solution was to
transform group1
with a “self-match”.
match(group1, group1)
## [1] 1 2 3 3
A “self match” on any vector-like object ‘x’ is an easy and very efficient way to assign unique ids to unique values in ‘x’. Also the mapping from unique values in ‘x’ and ids is guaranteed to be strictly increasing, which is what I needed for my use case.
xx <- c("Z", "B", "C", "C", "A", "B")
ids <- match(xx, xx)
ids
## [1] 1 2 3 3 5 2
ids[!duplicated(xx)] # always strictly increasing
## [1] 1 2 3 5
Hervé’s 2 line solution to my splitting problem:
grl2 <- split(x, match(group1, group1))
names(grl2) <- unique(group1)
More detail on the Hits
class can be found on the man page:
?Hits
The “HOWTOs” vignette
in GenomicRanges
demonstrates other
practical workflows that make use of the extensive GenomicRanges
infrastructure.
library(GenomicRanges)
browseVignettes("GenomicRanges")
Courses / Conferences
In January Martin, Sean and Hervé headed south to Ribeirão Preto, Brazil to teach at the Summer Bioinformatics course hosted by the Ribeirão Preto Medical School. The course was organized by Benilton Carvalho and Houtan Noushmehr and focused on the analysis and comprehension of high throughput sequence data.
In Seattle we offer several short courses throughout the year. The last was in February given by Martin and Sonali. This was a 2 day introductory course covering the basics of working with large data, class structure, integrated workflows and the analysis of RNAseq differential expresssion data.
Upcoming events and past course materials can be found on the web site.
Looking ahead
The spring release is just weeks away. We plan to branch on April 11th and release on the 14th. The full schedule is posted online.
BioC 2014 is in Boston this year (July 30 - August 1). Watch the web site for updates.
The July newsletter will (probably) include copy number packages on the
rise, the GenomicFileViews
package, and a summary of tally/coverage
functions. If there are topics you’d like to see covered let us know!