getDataset

This vignette gives detailed examples of every feature of the getDataset function.

Connections
List the datasets
Download a dataset
Filters
Views
Caching
sessionInfo

Connections

As explained in the user guide vignette, datasets must be downloaded from ImmuneSpaceConnection objects. We must first instantiate a connection to the study or studies of interest. Throughout this vignette, we will use two connections: one to a single study, and one to all available data.

library(ImmuneSpaceR)
sdy269 <- CreateConnection("SDY269")
all <- CreateConnection("")

List the datasets

Now that the connections have been instantiated, we can start downloading from them. First, though, we need to find out which datasets are available within our chosen studies.

Printing a connection will, among other information, list the available datasets. The listDatasets method displays only the information we are looking for.

sdy269$listDatasets()

## datasets
##  cohort_membership
##  pcr
##  elisa
##  fcs_analyzed_result
##  fcs_sample_files
##  hai
##  elispot
##  demographics
##  gene_expression_files
## Expression Matrices
##  TIV_2008
##  LAIV_2008

all$listDatasets()

## datasets
##  fcs_analyzed_result
##  elisa
##  mbaa
##  pcr
##  hai
##  hla_typing
##  neut_ab_titer
##  elispot
##  fcs_control_files
##  fcs_sample_files
##  gene_expression_files
##  demographics
##  cohort_membership
## Expression Matrices
##  TIV_older
##  SDY296_AIRFV_2011
##  LAIV_2008
##  TIV_2008
##  TIV_2007
##  pH1N1_2009
##  TIV_young
##  SDY63_older_PBMC_year1
##  SDY63_young_PBMC_year1
##  SDY404_older_PBMC_year2
##  SDY404_young_PBMC_year2
##  Cohort1_young
##  Cohort2_older
##  TIV_2011
##  Pneumovax23_group1
##  Fluzone_group1
##  Fluzone_group2
##  Pneumovax23_group2
##  Saline_group1
##  Saline_group2
##  VLminus
##  VLplus
##  TIV_2010
##  SDY301_AIRFV_2012

Naturally, all contains every dataset available on ImmuneSpace as it combines all available studies.

Additionally, when a connection object is created with verbose = TRUE, a call to the getDataset function with an invalid dataset name will return the list of valid datasets.

Download a dataset

Calling getDataset returns a selected dataset as it is displayed on ImmuneSpace.

hai_269 <- sdy269$getDataset("hai")
hai_all <- all$getDataset("hai")

print(head(hai_269))

##    participant_id age_reported gender                      race
## 1:  SUB112829.269           26   Male                     White
## 2:  SUB112870.269           33   Male                     White
## 3:  SUB112873.269           25   Male                     White
## 4:  SUB112832.269           26   Male                     White
## 5:  SUB112856.269           46 Female Black or African American
## 6:  SUB112857.269           33   Male                     White
##             cohort study_time_collected study_time_collected_unit
## 1: LAIV group 2008                    0                      Days
## 2: LAIV group 2008                    0                      Days
## 3: LAIV group 2008                    0                      Days
## 4:  TIV Group 2008                   28                      Days
## 5:  TIV Group 2008                   28                      Days
## 6:  TIV Group 2008                   28                      Days
##                 virus value_reported
## 1:   B/Florida/4/2006             20
## 2:   B/Florida/4/2006              5
## 3:   B/Florida/4/2006              5
## 4: A/Brisbane/59/2007             20
## 5: A/Brisbane/59/2007            160
## 6: A/Brisbane/59/2007            160

Because some datasets such as flow cytometry results can contain a large number of rows, the function returns data.table objects to improve performance. This is especially important with multi-study connections.
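Because the result is a data.table, grouped summaries can be computed directly with data.table syntax. The sketch below is only an illustration: `hai_toy` is a hypothetical, hand-made table that mimics a few columns of the HAI output above, and its values are made up rather than taken from ImmuneSpace.

```r
library(data.table)

# Toy stand-in for a table returned by getDataset("hai").
# Hypothetical name, made-up values, for illustration only.
hai_toy <- data.table(
  participant_id       = c("P1", "P2", "P3", "P4", "P5", "P6"),
  cohort               = c("LAIV group 2008", "LAIV group 2008", "LAIV group 2008",
                           "TIV Group 2008", "TIV Group 2008", "TIV Group 2008"),
  study_time_collected = c(0, 0, 28, 0, 28, 28),
  value_reported       = c(20, 5, 40, 10, 160, 80)
)

# Maximum titer and number of rows per cohort and time point
titer_summary <- hai_toy[, .(max_titer = max(value_reported), n = .N),
                         by = .(cohort, study_time_collected)]
print(titer_summary)
```

The same expression should work unchanged on a real table such as hai_269, provided it has the columns used here.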

Filters

The datasets can be filtered before download. Filters should be created using Rlabkey’s makeFilter function.

Each filter is composed of three parts:
- A column name or column label
- An operator
- A value, or an array of values separated by semicolons

library(Rlabkey)
# Get participants under age of 30
young_filter <- makeFilter(c("age_reported", "LESS_THAN", 30))
# Get a specific list of two participants
pid_filter <- makeFilter(c("participantid", "IN", "SUB112841.269;SUB112834.269"))

For a list of available operators, see ?makeFilter.

# HAI data for participants of study SDY269 under age of 30
hai_young <- sdy269$getDataset("hai", colFilter = young_filter)
# List of participants under age 30
demo_young <- all$getDataset("demographics", colFilter = young_filter)
# ELISPOT assay results for two participants
elispot_pid2 <- all$getDataset("elispot", colFilter = pid_filter)

Note that filtering is done before download. When performance is a concern, it is faster to filter via the colFilter argument than to filter the returned table.
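For comparison, the same two selections can be written as local subsets of a table that has already been downloaded. This is a minimal sketch on a hypothetical, hand-made table (`demo_toy`, with made-up ages); it shows which rows the two filters above select, not how the server evaluates them.

```r
library(data.table)

# Hypothetical stand-in for a downloaded demographics table (ages are made up)
demo_toy <- data.table(
  participant_id = c("SUB112841.269", "SUB112834.269",
                     "SUB112829.269", "SUB112856.269"),
  age_reported   = c(24, 31, 26, 46)
)

# Local equivalent of the "LESS_THAN" 30 filter
young_local <- demo_toy[age_reported < 30]

# Local equivalent of the "IN" filter: the semicolon-separated value
# is split into a character vector first
pids <- strsplit("SUB112841.269;SUB112834.269", ";", fixed = TRUE)[[1]]
pid_local <- demo_toy[participant_id %in% pids]

print(young_local)
print(pid_local)
```

The local version has to transfer every row before discarding most of them, which is why filtering on the server is preferable for large datasets.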

Views

Any dataset grid on ImmuneSpace offers the possibility to switch between the ‘Default’ and ‘Full’ views. The ‘Default’ view contains information that is directly relevant to the user: sample descriptions and results are joined with basic demographics. However, this is not the way the data is organized in the database. The ‘Full’ view is a representation of the data as it is stored on ImmPort. Its accession columns are used under the hood for join operations. They will be useful to developers and to users writing reports to be displayed in ImmuneSpace studies.

Screen capture of the button bar of a dataset grid on ImmuneSpace

The original_view argument decides which view is downloaded. If set to TRUE, the ‘Full’ view is returned.

full_hai <- sdy269$getDataset("hai", original_view = TRUE)
print(colnames(full_hai))

##  [1] "participant_id"            "arm_accession"            
##  [3] "biosample_accession"       "expsample_accession"      
##  [5] "experiment_accession"      "study_accession"          
##  [7] "study_time_collected"      "study_time_collected_unit"
##  [9] "virus"                     "value_reported"           
## [11] "value_preferred"           "unit_reported"            
## [13] "unit_preferred"

For additional information, refer to the ‘Working with tabular data’ video tutorial available in the menu bar on any page of the portal.
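To make the relation between the two views more concrete, the sketch below joins a toy ‘Full’-style table with a toy demographics table and drops the accession column. It is a conceptual illustration only: the table names and values are hypothetical, and this is not how ImmuneSpace builds its views.

```r
library(data.table)

# Toy 'Full'-style assay rows (hypothetical names, made-up values)
assay_toy <- data.table(
  participant_id      = c("P1", "P1", "P2"),
  biosample_accession = c("BS1", "BS2", "BS3"),
  value_reported      = c(20, 160, 5)
)

# Toy demographics, one row per participant
demo_toy <- data.table(
  participant_id = c("P1", "P2"),
  age_reported   = c(26, 33),
  gender         = c("Male", "Female")
)

# Attach demographics to every assay row, then drop the accession column
default_like <- merge(assay_toy, demo_toy, by = "participant_id")
default_like[, biosample_accession := NULL]
print(default_like)
```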

Caching

As explained in the user guide, the ImmuneSpaceConnection class is a reference class. This means that the fields of its objects are accessed by reference, so they can be modified without copying the entire object. ImmuneSpaceR uses this feature to store downloaded datasets and expression matrices. Subsequent calls to getDataset with the same input will be faster.

See ?setRefClass for more information about reference classes.
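The caching idea can be sketched with a small reference class. Everything below is hypothetical: `ToyConnection` is not the ImmuneSpaceConnection implementation, the "download" is a stand-in that builds a tiny data.frame, and the names data_cache, reload and clear_cache simply mirror those used in this vignette.

```r
# Hypothetical toy class illustrating caching in a reference class
ToyConnection <- setRefClass(
  "ToyConnection",
  fields = list(data_cache = "list", downloads = "numeric"),
  methods = list(
    getDataset = function(name, reload = FALSE) {
      if (reload || is.null(data_cache[[name]])) {
        # Stand-in for a slow download; count how often it happens
        downloads <<- downloads + 1
        data_cache[[name]] <<- data.frame(dataset = name, row = 1:3)
      }
      data_cache[[name]]
    },
    clear_cache = function() {
      data_cache <<- list()
      invisible(.self)
    }
  )
)

con <- ToyConnection$new(downloads = 0)
first  <- con$getDataset("hai")                 # triggers a "download"
second <- con$getDataset("hai")                 # served from the cache
third  <- con$getDataset("hai", reload = TRUE)  # forces a new "download"
print(con$downloads)
```

Because fields are modified in place with `<<-`, the cache filled by one method call is still there on the next call, without the object ever being reassigned.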

We can see the data currently cached using the data_cache field. This field is not intended to be used for data manipulation and is only displayed here to explain what gets cached.

pcr <- sdy269$getDataset("pcr")
names(sdy269$data_cache)

## [1] "GE_matrices"

Different views are saved separately.

pcr_ori <- sdy269$getDataset("pcr", original_view = TRUE)
names(sdy269$data_cache)

## [1] "GE_matrices"

Because of the infinite number of filters and combinations of filters, we do not cache filtered datasets.

If, for any reason, a specific dataset needs to be redownloaded, the reload argument will clear the cache for that specific getDataset call and download the table again.

hai_269 <- sdy269$getDataset("hai", reload = TRUE)

Finally, it is possible to clear every cached dataset (and expression matrix).

sdy269$clear_cache()
names(sdy269$data_cache)

## [1] "GE_matrices"

Again, the data_cache field should never be modified manually. When in doubt, simply reload the dataset.

sessionInfo()

sessionInfo()

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Rlabkey_2.1.134    rjson_0.2.15       RCurl_1.95-4.8    
## [4] bitops_1.0-6       ImmuneSpaceR_1.4.0 rmarkdown_1.4     
## [7] knitr_1.15.1      
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.36.0      viridis_0.4.0       httr_1.2.1         
##  [4] tidyr_0.6.1         jsonlite_1.4        viridisLite_0.2.0  
##  [7] foreach_1.4.3       gtools_3.5.0        assertthat_0.2.0   
## [10] stats4_3.4.0        yaml_2.1.14         robustbase_0.92-7  
## [13] backports_1.0.5     lattice_0.20-35     digest_0.6.12      
## [16] RColorBrewer_1.1-2  colorspace_1.3-2    htmltools_0.3.5    
## [19] plyr_1.8.4          pheatmap_1.0.8      purrr_0.2.2        
## [22] mvtnorm_1.0-6       scales_0.4.1        gdata_2.17.0       
## [25] whisker_0.3-2       tibble_1.3.0        ggplot2_2.2.1      
## [28] nnet_7.3-12         BiocGenerics_0.22.0 lazyeval_0.2.0     
## [31] magrittr_1.5        mclust_5.2.3        heatmaply_0.9.1    
## [34] evaluate_0.10       MASS_7.3-47         gplots_3.0.1       
## [37] class_7.3-14        tools_3.4.0         registry_0.3       
## [40] data.table_1.10.4   trimcluster_0.1-2   stringr_1.2.0      
## [43] plotly_4.5.6        kernlab_0.9-25      munsell_0.4.3      
## [46] cluster_2.0.6       fpc_2.1-10          compiler_3.4.0     
## [49] caTools_1.17.1      grid_3.4.0          iterators_1.0.8    
## [52] htmlwidgets_0.8     labeling_0.3        base64enc_0.1-3    
## [55] gtable_0.2.0        codetools_0.2-15    flexmix_2.3-13     
## [58] DBI_0.6-1           reshape2_1.4.2      TSP_1.1-5          
## [61] R6_2.2.0            seriation_1.2-1     gridExtra_2.2.1    
## [64] prabclus_2.2-6      dplyr_0.5.0         rprojroot_1.2      
## [67] KernSmooth_2.23-15  dendextend_1.5.2    modeltools_0.2-21  
## [70] stringi_1.1.5       parallel_3.4.0      Rcpp_0.12.10       
## [73] gclus_1.3.1         DEoptimR_1.0-8      diptest_0.75-7