---
title: "OmicsMLRepoR - Quickstart"
author: "Sehyun Oh, Kaelyn Long"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteIndexEntry{Quickstart}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: yes
    toc_depth: 4
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(comment = "#>", 
                      collapse = TRUE, 
                      message = FALSE, 
                      warning = FALSE)
```

# Introduction

The OmicsMLRepo project aims to improve the AI/ML-readiness of omics datasets 
available through Bioconductor. One of the main activities under this project 
is metadata harmonization (e.g., removing redundant information) and 
standardization (i.e., incorporating ontologies).

So far, we have released harmonized metadata for two Bioconductor data 
packages: `r BiocStyle::Biocpkg("curatedMetagenomicData")`, which contains 
human microbiome data, and `r BiocStyle::Biocpkg("cBioPortalData")`, which 
provides cancer genomics data. OmicsMLRepoR is a software package that gives 
users easy access to the harmonized metadata and lets them leverage 
ontologies in metadata searches.

The OmicsMLRepoR package provides three major functions:

1. Download the harmonized metadata
2. Browse the harmonized metadata using ontology
3. Manipulate the 'shape' of the harmonized metadata


## Load package
```{r}
suppressPackageStartupMessages({
    library(OmicsMLRepoR)
    library(dplyr)
    library(curatedMetagenomicData)
    library(cBioPortalData)
})
```

## Load the metadata

You can download the harmonized metadata using the `getMetadata` function. 
Currently, two options are available: `cMD` and `cBioPortal`.

```{r}
cmd <- getMetadata("cMD")
cmd
```

```{r}
cbio <- getMetadata("cBioPortal")
cbio
```

# Access metadata
Harmonized metadata can be searched easily with `dplyr` functions. To fully 
leverage the ontologies incorporated in the harmonized metadata and to provide 
a more robust data browsing experience, the package also provides the 
`tree_filter` function. Note that `tree_filter` can be used only on the 
attributes mapped to ontology terms:

```{r echo=FALSE}
colnames(cmd)[grep("_ontology_term_id", colnames(cmd))] %>% 
    gsub("_ontology_term_id", "", .)
```


## Robust search using ontology
Compared with typical searches of the original metadata from 
`curatedMetagenomicData`, OmicsMLRepoR enables more robust data browsing: 
its searches are case-insensitive and cover both synonyms and ontology 
descendants.

Searching for the same information in the original, unharmonized metadata
(`sampleMetadata`) from the `curatedMetagenomicData` package is much less 
robust:

```{r}
## Information spread out in two different columns
nrow(sampleMetadata |> filter(study_condition == "CRC"))
nrow(sampleMetadata |> filter(disease == "CRC"))

## Case sensitive
nrow(sampleMetadata |> filter(study_condition == "CRC"))
nrow(sampleMetadata |> filter(study_condition == "crc"))

## Synonyms not covered
nrow(sampleMetadata |> filter(study_condition == "Colorectal Carcinoma"))
nrow(sampleMetadata |> filter(study_condition == "Colorectal Cancer"))
```

### Not case-sensitive
`tree_filter` is not case-sensitive.
```{r}
nrow(cmd |> tree_filter(disease, "Colorectal Carcinoma"))
nrow(cmd |> tree_filter(disease, "colorectal carcinoma"))
```

### Include synonyms
`tree_filter` includes the synonyms of the queried terms in its search.
```{r}
syn_res1 <- cmd |> tree_filter(disease, "CRC")
syn_res2 <- cmd |> tree_filter(disease, "Colorectal Cancer")
syn_res3 <- cmd |> tree_filter(disease, "Colorectal Carcinoma")

nrow(syn_res1)
nrow(syn_res2)
nrow(syn_res3)
```

Check that the returned results are identical.
```{r}
unique(syn_res1$disease)
unique(syn_res2$disease)
unique(syn_res3$disease)
```
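
To make the idea concrete, here is a minimal base-R sketch of a
case-insensitive, synonym-aware lookup. It is not the `tree_filter`
implementation: the `synonyms` table, the `toy_meta` data frame, and the
`toy_synonym_filter` helper are hypothetical names made up for illustration,
and the block is shown without being evaluated.

```r
## Hypothetical synonym table: every known label maps to one canonical term
synonyms <- c(
    "crc" = "colorectal carcinoma",
    "colorectal cancer" = "colorectal carcinoma",
    "colorectal carcinoma" = "colorectal carcinoma"
)

## Toy metadata (made-up values, not taken from cMD)
toy_meta <- data.frame(
    sample_id = c("s1", "s2", "s3", "s4"),
    disease = c("Colorectal Carcinoma", "Healthy",
                "colorectal carcinoma", "Migraine"),
    stringsAsFactors = FALSE
)

## Lower-case the query, map it to its canonical term when one is known,
## and compare against the lower-cased column
toy_synonym_filter <- function(df, query) {
    key <- tolower(trimws(query))
    canonical <- if (key %in% names(synonyms)) synonyms[[key]] else key
    df[tolower(df$disease) == canonical, , drop = FALSE]
}

## All three queries resolve to the same canonical term
nrow(toy_synonym_filter(toy_meta, "CRC"))
nrow(toy_synonym_filter(toy_meta, "Colorectal Cancer"))
nrow(toy_synonym_filter(toy_meta, "colorectal carcinoma"))
```

Because every query is normalized before the comparison, the three calls
select the same rows regardless of spelling or letter case.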

### Search descendants in ontology tree
`tree_filter` includes all the descendants of the queried term in its search.
```{r}
onto_res <- cmd |> tree_filter(disease, "Intestinal Disorder")
unique(onto_res$disease)
```
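
Descendant search can be pictured as walking down an ontology tree from the
queried term. The sketch below uses a tiny, hypothetical set of parent-child
edges (`edges`) and a made-up helper (`toy_descendants`); it only illustrates
the general technique under those assumptions and says nothing about how
`tree_filter` is implemented internally.

```r
## Hypothetical parent -> child edges of a tiny disease ontology
edges <- data.frame(
    parent = c("intestinal disorder", "intestinal disorder",
               "colorectal neoplasm", "inflammatory bowel disease"),
    child = c("colorectal neoplasm", "inflammatory bowel disease",
              "colorectal carcinoma", "crohn disease"),
    stringsAsFactors = FALSE
)

## Collect a term and all of its descendants with a breadth-first walk
toy_descendants <- function(term, edges) {
    found <- character(0)
    queue <- tolower(term)
    while (length(queue) > 0) {
        current <- queue[1]
        queue <- queue[-1]
        if (current %in% found) next
        found <- c(found, current)
        queue <- c(queue, edges$child[edges$parent == current])
    }
    found
}

terms <- toy_descendants("Intestinal Disorder", edges)
terms

## Keep only the values that fall under the queried term
toy_disease <- c("Crohn disease", "Migraine", "Colorectal Carcinoma")
toy_disease[tolower(toy_disease) %in% terms]
```

In this toy tree, "Crohn disease" and "Colorectal Carcinoma" are kept because
both sit below "intestinal disorder", while "Migraine" is dropped.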

## Multiple search terms
`tree_filter` accepts more than one search term. For example, you can search 
for any row including a disease related to either "migraine" or "diabetes."

```{r}
res_or <- cmd %>% tree_filter(disease, c("migraine", "diabetes"), "OR")
```

You can also change the "OR" argument (the default) to either "AND" or "NOT" 
to change the filtering action. "AND" returns any rows including a disease
value that is related to both "migraine" and "diabetes," and "NOT" returns
any rows including a disease value that is related to neither "migraine" nor
"diabetes."

```{r}
res_and <- cmd %>% tree_filter(disease, c("migraine", "diabetes"), "AND")
res_not <- cmd %>% tree_filter(disease, c("migraine", "diabetes"), "NOT")
```
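
The three modes can be read as set logic over per-term matches. The following
base-R sketch shows that logic on made-up data; plain substring matching
stands in for the ontology-aware matching, and `toy_disease`, `hits`, and the
`match_*` vectors are hypothetical names used only here.

```r
## Toy disease column; one row can carry several values (made-up data)
toy_disease <- c("Migraine", "Type 2 diabetes",
                 "Migraine;Type 2 diabetes", "Healthy")
terms <- c("migraine", "diabetes")

## One logical column per term: does the row mention that term?
hits <- sapply(terms, function(term) {
    grepl(term, toy_disease, ignore.case = TRUE)
})

match_or <- rowSums(hits) > 0                # at least one term matches
match_and <- rowSums(hits) == length(terms)  # every term matches
match_not <- rowSums(hits) == 0              # no term matches

toy_disease[match_or]
toy_disease[match_and]
toy_disease[match_not]
```

Here "OR" keeps the first three rows, "AND" keeps only the row that mentions
both terms, and "NOT" keeps only "Healthy".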

You can combine `tree_filter` and `dplyr` functions. For example, to get all 
rows with a disease value related to either "migraine" or "diabetes" and with 
an `age_years` value under 30:

```{r}
res_or_below30 <- cmd %>% 
    filter(age_years < 30) %>%
    tree_filter(disease, c("migraine", "diabetes"))
```


## Collapse/Expand metadata

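Some metadata columns (e.g., `biomarker`) contain multiple, similar
attributes separated by a specific delimiter (i.e., `<;>`). Our
harmonization uses this structure because these attributes are related 
information that is often looked up together. The functions below expand 
such columns into wide (`getWideMetaTb`) or long (`getLongMetaTb`) tables.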

```{r}
cmd_biomarker <- cmd %>% 
    filter(!is.na(biomarker)) %>% 
    select(curation_id, biomarker)
wtb <- getWideMetaTb(cmd_biomarker, "biomarker")
head(wtb)
```

```{r}
ltb <- getLongMetaTb(cmd, targetCol = "target_condition")
dim(cmd)
dim(ltb)
```
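
For readers who want to see what this kind of reshaping involves, here is a
minimal base-R sketch that splits a `<;>`-delimited column into one row per
value and then collapses it back. The `toy_meta` data frame and the
intermediate objects are hypothetical; this only illustrates the general
technique and is not the code behind `getLongMetaTb` or `getShortMetaTb`.

```r
## Toy metadata with a multi-valued column (made-up values)
toy_meta <- data.frame(
    curation_id = c("a", "b", "c"),
    target_condition = c("Obesity<;>Type 2 diabetes", "Migraine",
                         "Asthma<;>Eczema"),
    stringsAsFactors = FALSE
)

## Split each cell on the literal delimiter
parts <- strsplit(toy_meta$target_condition, "<;>", fixed = TRUE)

## Long format: repeat each id once per value it carries
toy_long <- data.frame(
    curation_id = rep(toy_meta$curation_id, lengths(parts)),
    target_condition = unlist(parts),
    stringsAsFactors = FALSE
)
toy_long

## Short format: paste the values of each id back together
collapsed <- vapply(
    split(toy_long$target_condition, toy_long$curation_id),
    paste, character(1), collapse = "<;>"
)
collapsed
```

The long table has one row per value, which makes plain `dplyr::filter` calls
on individual values straightforward; collapsing restores one row per
`curation_id`.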


# Download omics data
## curatedMetagenomicData

```{r debug_needed, echo=FALSE}
cmd_sub <- tree_filter(cmd, target_condition, "Alzheimer's disease")
```

```{r}
cmd_dat <- cmd %>%
    tree_filter(col = "disease", "Type 2 Diabetes Mellitus") %>%
    filter(sex == "Female") %>%
    filter(age_group == "Elderly") %>%
    returnSamples("relative_abundance", rownames = "short")
```

## cBioPortalData
```{r}
cbio_sub <- cbio %>%
    getLongMetaTb("treatment_name", "<;>") %>%
    filter(treatment_name == "Fluorouracil") %>%
    filter(age_at_diagnosis > 50) %>%
    filter(sex == "Female") %>%
    getShortMetaTb(idCols = "curation_id", targetCol = "treatment_name")

dim(cbio_sub)
studies <- unique(cbio_sub$studyId)
studies
```

A simple `for` loop can collect samples from multiple studies. For example:

```{r eval=FALSE}
cbio_api <- cBioPortal()
resAll <- vector("list", length(studies))

for (i in seq_along(studies)) {
    study <- studies[i]
    samples <- cbio_sub %>%
        filter(studyId == study) %>%
        pull(sampleId)

    res <- cBioPortalData(
        api = cbio_api,
        by = "hugoGeneSymbol",
        studyId = study,
        sampleIds = samples,
        genePanelId = "IMPACT341"
    )
    
    resAll[[i]] <- res
}
```


# Session Info
<details>
```{r}
sessionInfo()
```
</details>