4 Main functions
4.1 Query
The queryEncode function allows the user to find the subset of files corresponding to
a precise query defined according to the following criteria:
Parameter | Description |
---|---|
set_accession | The accession for the containing experiment or dataset |
dataset_accession | The accession of a dataset. This differs subtly from set_accession, because a file can be part of an experiment, a dataset, or both. set_accession returns all the files directly associated with the given accession (experiment and/or dataset), whereas dataset_accession returns the files directly associated with the requested dataset AND those that are part of an experiment indirectly linked to that dataset (reported as related files in the dataset and related_dataset in the experiment). |
file_accession | The accession for one specific file |
biosample_name | The biosample name (“GM12878”, “kidney”) |
biosample_type | The biosample type (“tissue”, “cell line”) |
assay | The assay type (“ChIP-seq”, “polyA RNA-seq”) |
file_format | The file format. Some currently available formats include bam, bed, fastq, bigBed, bigWig, CEL, csfasta, csqual, fasta, gff, gtf, idat, rcc, sam, tagAlign, tar, tsv, vcf, wig. |
lab | The laboratory |
organism | The donor organism (“Homo sapiens”, “Mus musculus”) |
target | The gene, protein or histone mark targeted by the assay (immunoprecipitated protein in ChIP-seq, knocked-down gene in CRISPR RNA-seq assays, etc.) |
treatment | The treatment related to the biosample |
project | The project name/id |
By default, the query function uses exact string matching to select the
relevant entries. This behavior can be changed through the fixed or fuzzy
parameters. Setting fixed to FALSE performs case-insensitive regular
expression matching. Setting fuzzy to TRUE retrieves search results where
the query string is only a partial match.
The result set is a subset of the encode_df_lite table.
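The three matching behaviors can be illustrated on a toy table with base R only. This is a minimal sketch, not the package's implementation: the toy data frame is invented for illustration, and using agrepl for the fuzzy case is an assumption about what "partial match" could mean, not a statement of how queryEncode works internally.

```r
# Toy metadata table; the rows are illustrative, not real ENCODE records.
toy <- data.frame(
  biosample_name = c("MCF-7", "HeLa-S3", "GM12878"),
  assay = c("ChIP-seq", "polyA RNA-seq", "small RNA-seq"),
  stringsAsFactors = FALSE
)

# Exact string matching (comparable to fixed = TRUE, fuzzy = FALSE):
# "mcf7" is not identical to "MCF-7", so nothing is selected.
exact <- toy[toy$biosample_name == "mcf7", ]

# Case-insensitive regular expression (comparable to fixed = FALSE).
regex <- toy[grepl(".*RNA-seq", toy$assay, ignore.case = TRUE), ]

# Approximate matching (ASSUMPTION: one possible reading of fuzzy = TRUE).
# agrepl tolerates a small edit distance, so "mcf7" matches "MCF-7".
approx <- toy[agrepl("mcf7", toy$biosample_name, ignore.case = TRUE), ]

print(nrow(exact))
print(regex$assay)
print(approx$biosample_name)
```

The sketch mirrors the examples that follow: an approximate spelling fails under exact matching but succeeds once some tolerance is allowed.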
For example, to select all the fastq files originating from assays on the MCF-7 (human breast cancer) cell line:
query_results <- queryEncode(organism = "Homo sapiens",
biosample_name = "MCF-7", file_format = "fastq",
fixed = TRUE)
## Results : 813 files, 234 datasets
The same request with an approximate spelling of the biosample name and the fuzzy
option set to FALSE gives no results:
query_results <- queryEncode(organism = "Homo sapiens", biosample_name = "mcf7",
file_format = "fastq", fixed = TRUE,
fuzzy = FALSE)
## No result found in encode_df. You can try the <searchEncode> function or set the fuzzy option to TRUE.
If you follow the warning guidance and set the fuzzy option to TRUE:
query_results <- queryEncode(organism = "Homo sapiens",
biosample_name = "mcf7", file_format = "fastq", fixed = TRUE,
fuzzy = TRUE)
## Results : 813 files, 234 datasets
You can also perform matching through regular expressions by setting fixed to FALSE:
query_results <- queryEncode(assay = ".*RNA-seq",
biosample_name = "HeLa-S3", fixed = FALSE)
## Results : 318 files, 11 datasets
table(query_results$assay)
##
## polyA RNA-seq polyA depleted RNA-seq small RNA-seq
## 150 90 78
Finally, the queryEncodeGeneric function can be used to perform searches on
columns which are not part of the queryEncode interface but are present within
the encode_df_lite data.table:
query_results <- queryEncodeGeneric(biosample_name="HeLa-S3",
assay="RNA-seq", submitted_by="Diane Trout",
fuzzy=TRUE)
## Results : 54 files, 2 datasets
table(query_results$submitted_by)
##
## Diane Trout
## 54
The query criteria listed above correspond to the filters available on the ENCODE portal.
4.2 fuzzySearch
This function is a more user-friendly version of queryEncode that also
searches the encode_df_lite object. The character vector or the list of
characters specified by the user is searched for in every column of the
database. The user can also constrain the query to specific columns
by using the filterVector parameter.
The following request produces a data.table with every file containing the term brca.
fuzzy_results <- fuzzySearch(searchTerm = c("brca"))
## Results: 236 files, 7 datasets
Multiple terms can be searched simultaneously. This example extracts all files containing brca or ZNF24 within the target column.
fuzzy_results <- fuzzySearch(searchTerm = c("brca", "ZNF24"),
filterVector = c("target"),
multipleTerm = TRUE)
## Results: 710 files, 17 datasets
When searching for multiple terms, three types of input can be passed to the
searchTerm parameter:
- A single character where the various terms are separated by commas
- A character vector
- A list of characters
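To make the three accepted forms concrete, the following base R sketch flattens each of them into the same character vector of terms. The helper normalize_terms is hypothetical and written only for illustration; it is not part of ENCODExplorer and does not claim to reproduce how fuzzySearch parses its input.

```r
# Hypothetical helper (not an ENCODExplorer function): turn any of the three
# input forms into one character vector of trimmed search terms.
normalize_terms <- function(searchTerm) {
  # A list of characters becomes a plain character vector.
  terms <- as.character(unlist(searchTerm, use.names = FALSE))
  # A single comma-separated string is split into its individual terms.
  terms <- unlist(strsplit(terms, ",", fixed = TRUE))
  # Remove the whitespace left around each term after splitting.
  trimws(terms)
}

from_string <- normalize_terms("brca, ZNF24")
from_vector <- normalize_terms(c("brca", "ZNF24"))
from_list   <- normalize_terms(list("brca", "ZNF24"))

print(from_string)
print(identical(from_string, from_vector) && identical(from_vector, from_list))
```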
4.3 Search
This function simulates a keyword search performed through the ENCODE web portal.
The searchEncode function returns a data frame corresponding to the result
page provided by the ENCODE portal. If a specific file or dataset isn’t
available with fuzzySearch or queryEncode (i.e. within get_encode_df()),
the user can access the latest data from the ENCODE database through the
searchEncode function.
The searchToquery function converts the result of a search to a data.table
with the same design as get_encode_df(). This format contains more metadata
and allows the user to extract all files within the dataset. It also
allows the user to create a design using the createDesign function.
Here is an example with the following search, which can also be entered on the ENCODE portal: “a549 chip-seq homo sapiens”.
With our function:
search_results <- searchEncode(searchTerm = "a549 chip-seq homo sapiens",
limit = "all")
## results : 414
4.4 createDesign
This function organizes the data.table created by fuzzySearch, queryEncode
or searchToquery. It extracts the replicate and control files within a dataset.
When the format parameter is set to long, it creates a data.table
with the file accessions, the dataset accessions and
numeric values associated with the nature of the file (1:replicate / 2:control).
When the format parameter is set to wide, each dataset has its own column.
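The difference between the two layouts can be sketched with base R on a toy table. Everything here is an assumption made for illustration: the column names File, Experiment and Value, the placeholder accessions, and the use of NA for files that do not belong to a dataset are not taken from the createDesign output.

```r
# Toy "long" design: one row per file, with the dataset it belongs to and
# a numeric role (1 = replicate, 2 = control). Accessions are placeholders.
long <- data.frame(
  File = c("FILE_A1", "FILE_A2", "FILE_A3", "FILE_B1", "FILE_B2"),
  Experiment = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  Value = c(1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# "Wide" layout: one row per file and one column per dataset.
wide <- reshape(long, idvar = "File", timevar = "Experiment",
                direction = "wide")
# reshape() prefixes the new columns with "Value."; drop that prefix.
names(wide) <- sub("^Value\\.", "", names(wide))

print(wide)
```

In this sketch a file that is not part of a dataset gets NA in that dataset's column, which keeps the role codes 1 and 2 unambiguous.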
4.5 downloadEncode
downloadEncode allows a user to download a file or an entire dataset. Downloading
files can be done by providing a vector of file accessions or dataset accessions
(represented by the accession column in get_encode_df()) to the file_acc parameter.
This parameter can also be the data.table created by queryEncode, fuzzySearch,
searchToquery or createDesign.
If the accession doesn’t exist within the passed-in get_encode_df() database,
downloadEncode will search for the accession directly within the ENCODE database.
The path to the download directory can be specified (default: /tmp).
To ensure the integrity of each file, the md5 sum of each downloaded file is compared to the reported md5 sum in ENCODE.
Moreover, if the accession is a dataset accession, the function will download each file in this dataset. The format option, which is set by default to all, enables the downloading of a specific format.
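The integrity check described above amounts to comparing a locally computed md5 sum with the one reported for the file. A minimal base R sketch of that comparison follows; the helper verify_md5 is hypothetical and is not the code used by downloadEncode.

```r
# Hypothetical helper (not part of ENCODExplorer): TRUE when the md5 sum of
# the file at `path` equals the expected checksum.
verify_md5 <- function(path, expected_md5) {
  # tools::md5sum returns a named character vector (NA if unreadable).
  observed <- unname(tools::md5sum(path))
  !is.na(observed) && identical(observed, tolower(expected_md5))
}

# Demonstration on a temporary file containing the five bytes "hello".
tmp <- tempfile()
writeBin(charToRaw("hello"), tmp)

ok  <- verify_md5(tmp, "5d41402abc4b2a76b9719d911017c592")
bad <- verify_md5(tmp, "00000000000000000000000000000000")
print(c(ok = ok, bad = bad))
```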
Here is a small example query:
query_results <- queryEncode(assay = "switchgear", target ="elavl1", fixed = FALSE)
## Results : 2 files, 1 datasets
And its equivalent search:
search_results <- searchEncode(searchTerm = "switchgear elavl1", limit = "all")
## results : 1
To select a particular file format you can:
- add filters to your query and then run the downloadEncode function:
query_results <- queryEncode(assay = "switchgear", target ="elavl1",
file_format = "bed" , fixed = FALSE)
downloadEncode(query_results)
- specify the format to the downloadEncode function:
downloadEncode(search_results, format = "bed")
4.6 Conversion
The function searchToquery enables the conversion of the results of
searchEncode to a queryEncode output based on the accession numbers.
The user can then benefit from all the collected metadata and the createDesign function.
The structure of the result set is similar to the get_encode_df() structure.
Let’s try it with the previous example:
- search
search_results <- searchEncode(searchTerm = "switchgear elavl1", limit = "all")
## results : 1
- convert
convert_results <- searchToquery(searchResults = search_results)
4.7 shinyEncode
This function launches the shinyApp of ENCODExplorer that implements the
fuzzySearch and queryEncode search functions. It also allows the creation
of a design to organize and download specific files with the downloadEncode
function. The Search tab of shinyEncode uses the fuzzySearch function for a
low specificity request while the Advanced Search tab uses the queryEncode function.