--- title: "Introduction_1_loadmetadata" author: "Hangjia Zhao" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{Introduction_1_loadmetadata} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` [Progenetix](https://progenetix.org/) is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette provides a comprehensive guide on accessing and utilizing metadata for samples or their corresponding individuals within the Progenetix database. If your focus lies in cancer cell lines, you can access data from [*cancercelllines.org*](https://cancercelllines.org/) by specifying the `dataset` parameter as "cancercelllines". This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations. # Load library ```{r setup} library(pgxRpi) ``` ## `pgxLoader` function This function loads various data from `Progenetix` database. The parameters of this function used in this tutorial: * `type` A string specifying output data type. Available options are "biosample", "individual", "variant" or "frequency". * `filters` Identifiers for cancer type, literature, cohorts, and age such as c("NCIT:C7376", "pgx:icdom-98353", "PMID:22824167", "pgx:cohort-TCGAcancers", "age:>=P50Y"). For more information about filters, see the [documentation](https://docs.progenetix.org/common/beacon-api/#filters-filters-filtering-terms). * `filterLogic` A string specifying logic for combining multiple filters when query metadata. Available options are "AND" and "OR". Default is "AND". An exception is filters associated with age that always use AND logic when combined with any other filter, even if filterLogic = "OR", which affects other filters. * `individual_id` Identifiers used in Progenetix database for identifying individuals. * `biosample_id` Identifiers used in Progenetix database for identifying biosamples. * `codematches` A logical value determining whether to exclude samples from child concepts of specified filters that belong to cancer type/tissue encoding system (NCIt, icdom/t, Uberon). If TRUE, retrieved samples only keep samples exactly encoded by specified filters. Do not use this parameter when `filters` include ontology-irrelevant filters such as PMID and cohort identifiers. Default is FALSE. * `limit` Integer to specify the number of returned samples/individuals/coverage profiles for each filter. Default is 0 (return all). * `skip` Integer to specify the number of skipped samples/individuals/coverage profiles for each filter. E.g. if skip = 2, limit=500, the first 2*500 =1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip). * `dataset` A string specifying the dataset to query. Default is "progenetix". Other available options are "cancercelllines". # Retrieve meatdata of samples ## Relevant parameters type, filters, filterLogic, individual_id, biosample_id, codematches, limit, skip, dataset ## Search by filters Filters are a significant enhancement to the [Beacon](https://www.ga4gh.org/product/beacon-api/) query API, providing a mechanism for specifying rules to select records based on their field values. To learn more about how to utilize filters in Progenetix, please refer to the [documentation](https://docs.progenetix.org/beaconplus/#filters-filters-filtering-terms). The `pgxFilter` function helps access available filters used in Progenetix. Here is the example use: ```{r} # access all filters all_filters <- pgxFilter() # get all prefix all_prefix <- pgxFilter(return_all_prefix = TRUE) # access specific filters based on prefix ncit_filters <- pgxFilter(prefix="NCIT") head(ncit_filters) ``` The following query is designed to retrieve metadata in Progenetix related to all samples of lung adenocarcinoma, utilizing a specific type of filter based on an [NCIt code](https://ncit.nci.nih.gov) as an ontology identifier. ```{r} biosamples <- pgxLoader(type="biosample", filters = "NCIT:C3512") # data looks like this biosamples[c(1700:1705),] ``` The data contains many columns representing different aspects of sample information. ## Search by biosample id and individual id In Progenetix, biosample id and individual id serve as unique identifiers for biosamples and the corresponding individuals. You can obtain these IDs through metadata search with filters as described above, or through [website](https://progenetix.org/search) interface query. ```{r} biosamples_2 <- pgxLoader(type="biosample", biosample_id = "pgxbs-kftvgioe",individual_id = "pgxind-kftx28q5") metainfo <- c("biosample_id","individual_id","pubmed_id","histological_diagnosis_id","geoprov_city") biosamples_2[metainfo] ``` It's also possible to query by a combination of filters, biosample id, and individual id. ## Access a subset of samples By default, it returns all related samples (limit=0). You can access a subset of them via the parameter `limit` and `skip`. For example, if you want to access the first 1000 samples , you can set `limit` = 1000, `skip` = 0. ```{r} biosamples_3 <- pgxLoader(type="biosample", filters = "NCIT:C3512",skip=0, limit = 1000) # Dimension: Number of samples * features print(dim(biosamples)) print(dim(biosamples_3)) ``` ## Query the number of samples in Progenetix The number of samples in specific group can be queried by `pgxCount` function. ```{r} pgxCount(filters = "NCIT:C3512") ``` ## Parameter `codematches` use The NCIt code of retrieved samples doesn't only contain specified filters but contains child terms. ```{r} unique(biosamples$histological_diagnosis_id) ``` Setting `codematches` as TRUE allows this function to only return biosamples with exact match to the filter. ```{r} biosamples_4 <- pgxLoader(type="biosample", filters = "NCIT:C3512",codematches = TRUE) unique(biosamples_4$histological_diagnosis_id) ``` ## Parameter `filterLogic` use This function supports querying samples that belong to multiple filters. For example, If you want to retrieve information about lung adenocarcinoma samples from the literature PMID:24174329, you can specify multiple matching filters and set `filterLogic` to "AND". ```{r} biosamples_5 <- pgxLoader(type="biosample", filters = c("NCIT:C3512","PMID:24174329"), filterLogic = "AND") ``` # Retrieve meatdata of individuals If you want to query metadata (e.g. survival data) of individuals where the samples of interest come from, you can follow the tutorial below. ## Relevant parameters type, filters, filterLogic, individual_id, biosample_id, codematches, limit, skip, dataset ## Search by filters ```{r} individuals <- pgxLoader(type="individual",filters="NCIT:C3270") # Dimension: Number of individuals * features print(dim(individuals)) # data looks like this individuals[c(36:40),] ``` ## Search by biosample id and individual id You can get the id from the query of samples ```{r} individual <- pgxLoader(type="individual",individual_id = "pgxind-kftx26ml", biosample_id="pgxbs-kftvh94d") individual ``` # Visualization of survival data ## `pgxMetaplot` function This function generates a survival plot using metadata of individuals obtained by the `pgxLoader` function. The parameters of this function: * `data`: The meatdata of individuals returned by `pgxLoader` function. * `group_id`: A string specifying which column is used for grouping in the Kaplan-Meier plot. * `condition`: Condition for splitting individuals into younger and older groups. Only used if `group_id` is age related. * `return_data`: A logical value determining whether to return the metadata used for plotting. Default is FALSE. * `...`: Other parameters relevant to KM plot. These include `pval`, `pval.coord`, `pval.method`, `conf.int`, `linetype`, and `palette` (see ggsurvplot function from survminer package) Suppose you want to investigate whether there are survival differences between younger and older patients with a particular disease, you can query and visualize the relevant information as follows: ```{r} # query metadata of individuals with lung adenocarcinoma luad_inds <- pgxLoader(type="individual",filters="NCIT:C3512") # use 65 years old as the splitting condition pgxMetaplot(data=luad_inds, group_id="age_iso", condition="P65Y", pval=TRUE) ``` It's noted that not all individuals have available survival data. If you set `return_data` to `TRUE`, the function will return the metadata of individuals used for the plot. # Session Info ```{r echo = FALSE} sessionInfo() ```