Abstract
The L1000 data set from the NIH LINCS program holds the promise to deconvolute a wide range of biological questions in transcriptional space. However, using this large and decentralized data set presents its own challenge. Here we demonstrate how to use the slinky package to simplify filtering and accessing this data in an efficient analysis pipeline.The The Library of Integrated Network-Based Cellular Signatures (LINCS) program has created over 1 million transcriptional profiles (Subramanian et al. 2017, Keenan et al. (2018)) that detail the effects of thousands of drugs and genetic perturbagens in gene expression space (known as the “L1000” dataset as it is based on the measured expression of approximately 1000 highly variable genes).
This creates a Rosetta stone of sorts, allowing us to translate between disease states, drug treatments, and genetics. This resource holds the promise of unlocking new therapies for diseases, and deeper understanding of how cells respond to drugs and gene perturbation.
The LINCS L1000 data set has three components: the gene expression data itself (available as a “GCTX” file which is in HDF5 format), a metadata file (as an “INFO” file, in tab delimited format), and additional metadata that is made available through a web-based API (http://clue.io). Analysis of this dataset thus requires familiarity with the contents and formats of these three resources, as well as the R tools required to interact with them. The goal of the slinky package is to relieve the user of details of file access and web calls and provide a simplified interface to these various resources.
For example, consider the high level method function loadL1K
. If provided a single argument (the name of a perturbagen of interest, for example “amoxicillin”), the function does the following:
For users more familiear with the underlying data resources, much more granular control can be achieved through optional arguments to loadL1K
or to lower level functions including clue
, and readGCTX
. Examples of these and other functions provided by this package are described in the balance of this vignette.
The gene expression data derived from a single cell culture well treated with a single perturbagen has historically been referred to as an “instance” in the CMAP/LINCS program. I will use “instance” in this way. For any given perturbagen, there will be many instances encompassing different doses, durations, and cell lines as well as technical replicates.
Slinky uses the rhdf5 package to access slices of the L1000 data from disk. As a result, you only need enough memory to hold the data slices you are working with, and possibly the metadata as well. (The entire metadata data frame for all 1.3 million instances requires a modest 220MB of RAM). The development machine on which the slinky package was primarily developed has 16GB of RAM and can load and manipulate tens of thousands of instances from the L1000 dataset without difficulty. Of course, YMMV depending on the specifications of your computer.
The examples in this vignette can be completed with the demonstration gctx
and info
files that are installed along with the package. Note: the expression data in the demo gctx file has been truncated from double precsion floating point to integer to keep the package size under 4MB as required by the Bioconductor project.
To move on and conduct your own analysis, you will need access to a LINCS L1000 data file, such as the Phase I Level 3 data file.
This file may already be available from your local bioinformatics core. If you need or want to download it yourself, note that these datafiles are quite large (up to 40GB). A robust multithreaded download client makes this much faster and less prone to failures due to connectivity hiccups. For example, you might try:
Alternatively, you can use the slinky package itself to fetch this file. Again, this will take a while and may fail if your connection is not rock solid.
You will also need the metadata describing the instances in that file. You could download it yourself, but note that the slinky package will automatically fetch the default phase I info file automatically if and when it needs it, and even try to place it in the package installation directory so it is there the next time. (If it cannot be moved to that directory due to permission issues, it will just be left in your current working directory. The slinky
package also checks in the current working directory for the file before downloading it again.) If using phase 2 data, you must download the corresponding INFO file yourself. Automated support for the phase 2 data is planned for our next update.
Details on the LINCS data files are available here.
Access to the LINCS api and clue.io requires a user key, which can be obtained from the LINCS by registering at https://clue.io (free for non-commercial use). You can provide your key directly in the call to Slinky$new("your_key_here")
, or define the environment variable CLUE_API_KEY
and set it to your key. Alternatively, you can store your key as a single line in a text file so it does not get included in your project files (which may, among other thing, be stored in public repositories, etc.).
library(slinky)
# update following lines with your details:
key <- "YOUR_API_KEY"
gctx <- "/path/to/GSE92742_Broad_LINCS_Level3_INF_mlr12k_n1319138x12328.gctx"
info <- "/path/to/GSE92742_Broad_LINCS_inst_info.txt.gz"
sl <- Slinky(key, gctx, info)
For the purposes of this vignette, we will use a small subset of the L1000 data distributed with this package, and the demo API key from clue.io.
Please do not use the demo key for actual analysis. Note: the data in this demonstration dataset has been rounded to integer to keep the package size under 4MB as required by the BioConductor project.
There are several ways to identify a useful slice of the L1000 data set to load. The first way is to figure out the column and row indices you are interested in and specify these directly. For example, you might do this by filtering the info file that accompanies the L1000 data from GEO. This approach does not even require an API key since it does not call the API.
(It may be useful to note that the “landmark” genes are the first 978 rows of the dataset. The remaining rows contain the inferred expression for the rest of the transcriptome.)
The info file specified at object instantiation is loaded when the object is created and its data is made available through the metadata
accessor slot. We can use this data to identify which columns of L1000 expression data we wish to load. Note that for performance, it is much faster to subset the slinky object rather than subset the data. The latter will load the entire dataset which may be very large. An example is given below
col.ix <- which(metadata(sl)$pert_iname == "amoxicillin" & metadata(sl)$cell_id ==
"MCF7")
data <- readGCTX(sl[, col.ix])
## This would be slower: data <- readGCTX(sl)[,col.ix]
Note that loading contiguous data out of the gctx file is much faster than data spread throughout the file. For example, loading 1000 consecutive instances (over the network) takes 3.4 seconds in our environment, while loading 1000 random instances takes almost 10 times that long. When the info file is loaded, it is reordered to match the order of columns in the gctx file.
(Interestingly, the info file provided by GEO is not in the same order as the gctx file). Depending on your use case, you may be able to optimize loading by accessing more nearly contiguous data slices as needed.
While not quite as fast, querying the clue.io API is simpler, ensures you have the most uptodate metadata, and also allows you to query on metadata not present in the info file provided via GEO. For example, we could identify all the instances that are not only perturbed by amoxicillin in the A375 cell line, but are also designated as “gold” by the LINCS program (by virtue of having low variance across replicates):
amox_gold <- clueInstances(sl, where_clause = list(pert_type = "trt_cp",
pert_iname = "amoxicillin", cell_id = "MCF7", is_gold = TRUE),
poscon = "omit")
Note the poscon = "omit"
argument. There are a small number of instances that have a pert_type
of trt_poscon
in clue.io’s profiles
endpoint but are recoded as trt_cp
in the sigs
endpoint. Since the is_gold
metadata must be obtained from the sigs
endpoint, this can cause confusion/errors in downstream analysis. Therefore, these samples can be omitted from query results. In fact, this is the default behavior of the clueInstances
method (the argument was simply specified above for emphasis).
You could then load these instances by “brute force” as we did above.
But it is usually more useful to keep this data annotated with its corresponding metadata in the form of an SummarizedExperiment
. This can be easily done with the toSummarizedExperiment
method.
In fact, you could have skipped the trouble of looking up the ids in the first place and simply passed the where_clause
to loadL1K
directly.
amox_gold_sumex <- loadL1K(sl, where_clause = list(pert_type = "trt_cp",
pert_iname = "amoxicillin", cell_id = "MCF7", is_gold = TRUE))
Note that by default, toSummarizedExperiment
loads all the genes in the dataset. If only the L1000 landmark genes are desired, you can either pass the inferred = FALSE
argument, or explicitly request just the first 978 rows. Both methosd are shown below.
amox_gold_sumex <- loadL1K(sl, where_clause = list(pert_type = "trt_cp",
pert_iname = "amoxicillin", cell_id = "MCF7", is_gold = TRUE),
inferred = FALSE)
# equivalent to
amox_gold_sumex <- loadL1K(sl[1:978, ], where_clause = list(pert_type = "trt_cp",
pert_iname = "amoxicillin", cell_id = "MCF7", is_gold = TRUE),
inferred = FALSE)
# equivalent to
amox_gold_sumex <- loadL1K(sl[1:978, ], where_clause = list(pert_type = "trt_cp",
pert_iname = "amoxicillin", cell_id = "MCF7", is_gold = TRUE),
inferred = FALSE)
amox_gold_sumex <- amox_gold_sumex[1:978, ]
If you want some other subset of genes, simply subset the Slinky
object accordingly. The entrez gene ids of the genes in the object can be retrieved by the rownames
method (and likewise, the distil_ids
can be retrieved using the colnames
methodD).
## [1] "5720" "466" "6009" "2309" "387"
## [1] "CPC020_MCF7_24H_X1_F1B4_DUO52HI53LO:P17"
## [2] "CPC020_MCF7_24H_X2_F1B4_DUO52HI53LO:P17"
## [3] "CPC020_MCF7_24H_X3_F1B4_DUO52HI53LO:P17"
## [4] "CPC020_MCF7_24H_X4_F2B4_DUO52HI53LO:P17"
## [5] "CPC020_MCF7_24H_X5_B5_DUO52HI53LO:P17"
# note subsetting first will be faster as it avoids loading
# in the entire set of names from the gctx file:
rownames(sl[1:5, ])
## [1] "5720" "466" "6009" "2309" "387"
## [1] "CPC020_MCF7_24H_X1_F1B4_DUO52HI53LO:P17"
## [2] "CPC020_MCF7_24H_X2_F1B4_DUO52HI53LO:P17"
## [3] "CPC020_MCF7_24H_X3_F1B4_DUO52HI53LO:P17"
## [4] "CPC020_MCF7_24H_X4_F2B4_DUO52HI53LO:P17"
## [5] "CPC020_MCF7_24H_X5_B5_DUO52HI53LO:P17"
## [1] TRUE
Here is a more complex example. Let us identify those instances treated with FDA approved compounds that are part of the “gold” subset of highly consistent instances (as defined by LINCS). Definitively identifying FDA approved drugs is a little tricky due to subtle (and not so subtle) differences in vocabularies, but we exploit the “repositioning” data available at the rep_drugs
endpoint to achieve our ends. Since we want to access the clue API directly, we will use the lower level clue
method here rather than loadL1K
.
fda <- clue(sl, "rep_drugs", where_clause = list(status_source = list(like = "FDA Orange"),
final_status = "Launched", animal_only = "0", in_cmap = TRUE),
verbose = FALSE)
fda_pert <- clueInstances(sl, poscon = "omit", where_clause = list(pert_type = "trt_cp",
is_gold = TRUE, pert_iname = list(inq = fda$pert_iname)),
verbose = FALSE)
Note the inq
syntax used above to provide an array of possible values to match. This is a vernacular of the loopback
framework on which the clue.io service is built. Further details can be obtained from the loopback website, or the numerous examples at the clue.io API.
We do not have access to most of these instances in the demo GCTX file, but with the full GCTX file we could then load this data by passing the list of ids as an argument to the loadL1K
method as follows:
Reading all 30320 instances from the full gctx file from GEO takes about 5 minutes in our environment (in part because these instances are scattered all accross the file). If you plan on reusing larger slices of the data like this, it would be a good idea to save them as .rds
files for rapid loading in the future.
It is often desirable to identify suitable controls for the perturbed samples in your desired dataset. A simplistic approach would be just to identify all the instances that were treated with the appropriate control. For instances of type trt_cp
(compound perturbed instances), the vehicle is usually, though not always, DMSO. For instances of type trt_sh
and trt_oe
, the control would be empty vector.
The slinky package can query clue.io to identify the appropriate controls for a list of ids.
We could then identify the corresponding samples treated with that vehicle in our data set using any of the approaches outlined above. For example
ix <- which(metadata(sl)$pert_iname %in% veh$pert_vehicle)
amox_and_control <- loadL1K(sl, ids = c(amox_gold, metadata(sl[,
ix])$inst_id))
However, there are 54,707 DMSO treated instances in LINCS. You may not want all of them every time you need controls for compound perturbed samples. A better strategy might be to restrict the controls to those on the same plate as your perturbed samples. The slinky package can retrieve the appropriate same plate controls for trt_cp
, trt_sh
, and trt_oe
instance types, including a mixture of those types.
ids.ctrl <- controls(sl, ids = amox_gold)$distil_id
amox_and_control <- loadL1K(sl, ids = c(amox_gold, ids.ctrl))
In fact, you can skip the step of identifying the same-plate controls altogether and simply specify controls=TRUE
when calling loadL1K
:
A common goal in analysis of the LINCS L1000 data is to identify gene signatures that describe the transcriptional footprint of a compound or gene. This might be done with a goal of matching drugs to diseases, identifying novel targets of drugs, or deconvoluting the key pathways involved in a disease.
The slinky package facilitates this analysis by identifying and loading suitable data subsets necessary to either define or apply gene signatures. The package takes this one step further by formatting the data for characteristic direction analysis (Clark et al. 2014) as implemented in the GeoDE
package.
This functionality is wrapped in the diffexp
function.
Our goal is to incorporate other popular methods of differential gene expression analysis into this pipeline (analagous to the uniform interface the caret package provides for machine learning).
You can provide the diffexp
function two expression sets (treated and control) to analyze. Alternatively, you can simply provide the treated samples as an expression set and diffexp
will automatically identify and load the corresponding same-plate control samples in the same manner as discussed above.
Alternatively, you can simply specify a perturbagen of interest as the “treat” argument. (You can also provide additional parameters to the “where_clause” argument to further narrow your query). Then diffexp
will load the corresponding data, and identify and load the corresponding same-plate controls.
cd_vector <- diffexp(sl, treat = "amoxicillin", split_by_plate = FALSE,
verbose = FALSE)
head(cd_vector)
## 1
## 5720 -0.008842742
## 466 0.002015884
## 6009 -0.002921965
## 2309 0.003507942
## 387 0.028411214
## 3553 -0.006835874
Note that not all the matching samples were found in our small demonstration gctx file. The function ran using the available data, but presents a message that some data was missing.
If split_by_plate = FALSE
, the function returns a vector of scores for each gene in the dataset representing the magnitude and sign of that gene’s component vector (i.e. its contribution to the characteristic direction of your dataset). In an (arguably) ideal situation, you would only compare instances to controls from the same plate. But to perform the characteristic direction analysis (or any other imagineable statistical analysis) requires at least two samples in the treated group and two in the control group. We do not always have this luxury. So in the case of the amoxicillin, we have to combine our treated and control samples across plates into a single analysis.
However, even in this scenario, the control samples identified by diffexp
still come from the same plates as the treated samples (even if there is only one treated or one control on a given plate).
In the next example, we are able to split the analysis into plates because the perturbagen is present in replicates on each plate.
cd_vecs <- diffexp(sl, treat = "E2F3", where_clause = list(pert_type = "trt_sh",
cell_id = "MCF7"), split_by_plate = TRUE, verbose = FALSE)
cd_vecs[1:5, 1:3]
## KDA001_MCF7_144H_X2 KDA001_MCF7_144H_X3 KDA001_MCF7_96H_X1
## 5720 -0.011311670 -1.042297e-02 -0.009905573
## 466 -0.006983589 -7.000988e-03 0.008619106
## 6009 0.000515203 -2.359553e-02 -0.020492520
## 2309 -0.005588067 8.400016e-03 0.008891955
## 387 -0.004151143 8.411729e-16 0.006958661
Here, diffexp
returns a matrix of scores, one row per gene and one column per plate. The plates could then be summarized in some fashion. One nice approach is to take the rank product (Breitling et al. 2004) of the scores which identifies those genes that are most consistently up regulated (the smaller the rank product, the more likely the gene is to be differentially expressed.) This approach can also identify down regulated genes, of course, provided the values are sorted in the opposite direction.
# negate the values so 1 = the most up regulated gene (rank
# sorts in ASCENDING order, which is not what we want)
ranks <- apply(-cd_vecs, 2, rank)
n <- nrow(ranks)
rp <- apply(ranks, 1, function(x) {
(prod(x/n))
})
If the user has the org.Hs.eg.db
package installed, then the gene symbols corresponding to the most upregulated genes in this case could be identified.
suppressMessages(library(org.Hs.eg.db))
entrez_ids <- names(sort(rp, decreasing = FALSE))[1:5]
entrez_ids <- entrez_ids[which(entrez_ids %in% ls(org.Hs.egSYMBOL))]
as.vector(unlist(mget(entrez_ids, org.Hs.egSYMBOL)))
This list of genes then could be used as a signature for E2F3 knockdown (though ideally some permutation based significance analysis would be performed first to further filter the gene list).
Having ready access to both the expression data and metadata facilitates visualization of the data, including some basic quality control measures. For example, we could explore to what extent batch (plate) effects are driving differences between samples.
suppressMessages(library(ggplot2))
suppressMessages(library(Rtsne))
sumex <- loadL1K(sl[seq_len(978), seq_len(131)])
set.seed(100)
ts <- Rtsne(t(SummarizedExperiment::assays(sumex)[[1]]), perplexity = 10)
tsne_plot <- data.frame(x = ts$Y[, 1], y = ts$Y[, 2], treatment = sumex$pert_iname,
plate = sumex$rna_plate)
ggplot(tsne_plot) + geom_point(aes(x = x, y = y, color = plate)) +
labs(x = "TSNE X", y = "TSNE Y") + theme(axis.title = element_text(face = "bold",
color = "gray"))
Ideally, the points would cluster by treatment and not by plate. Alas, as can be seen in the resulting figure above. Compare that to the same plot colored by treatment.
ggplot(tsne_plot) + geom_point(aes(x = x, y = y, color = treatment)) +
labs(x = "TSNE X", y = "TSNE Y") + theme(axis.title = element_text(face = "bold",
color = "gray"))
This underscores the importance of constructing your analysis of this data set in a way that controls for batch effects.
Note that presently, the Slinky object does not support loading multiple dataset (e.g. the phase 2 and phase 1 data together). However, you can create a separate Slinky object for each dataset, extract data of interest into objects with the method and then simply merge them with cbind. Note that you will need to first verify that the two objects contain the same genes. For example:
For the foreseeable future, our main emphasis will be to expand the methods that slinky supports for identifying differentially expressed genes and/or calculating enrichment of gene signatures in samples. High priority will of course be given to any bugs identiftied in the software, as well as other features that seem to be of high value to any other users of the software. Users are encouraged to either email the package maintainer or raise issues on the package’s github repository with either bug reports or feature requests.
Breitling, R., P. Armengaud, A. Amtmann, and P. Herzyk. 2004. “Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments.” FEBS Lett. 573 (1-3):83–92.
Clark, N. R., K. S. Hu, A. S. Feldmann, Y. Kou, E. Y. Chen, Q. Duan, and A. Ma’ayan. 2014. “The characteristic direction: a geometrical approach to identify differentially expressed genes.” BMC Bioinformatics 15 (March):79.
Keenan, A. B., S. L. Jenkins, K. M. Jagodnik, S. Koplev, E. He, D. Torre, Z. Wang, et al. 2018. “The Library of Integrated Network-Based Cellular Signatures NIH Program: System-Level Cataloging of Human Cells Response to Perturbations.” Cell Syst 6 (1):13–24.
Subramanian, A., R. Narayan, S. M. Corsello, D. D. Peck, T. E. Natoli, X. Lu, J. Gould, et al. 2017. “A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.” Cell 171 (6):1437–52.