---
title: "Introduction_4_process_pgxseg"
author: "Hangjia Zhao"
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
  toc: true
vignette: >
  %\VignetteIndexEntry{Introduction_4_process_pgxseg}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

[Progenetix](https://progenetix.org/) is an open data resource that provides curated individual cancer copy number aberrations (CNA) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. Progenetix uses the ".pgxseg" data format to store variant data, which encompasses CNV (Copy Number Variation) and SNV (Single Nucleotide Variant), as well as the metadata of associated samples. This vignette describes how to work with local ".pgxseg" files using this package. For more details about the ".pgxseg" file format, please refer to the the [documentation](https://docs.progenetix.org/file-formats/?h=.pgxseg#pgxseg-sample-variant-files).

# Load library 

```{r setup}
library(pgxRpi)
library(GenomicRanges)
```

## `pgxSegprocess` function 

This function extracts segments, CNV frequency, and metadata from local "pgxseg" files and supports survival data visualization  

* `file` A string specifying the path and name of the "pgxseg" file where the data is to be read. 
* `group_id` A string specifying which id is used for grouping in KM plot or CNV frequency calculation. Default is "group_id".
* `show_KM_plot` A logical value determining whether to return the Kaplan-Meier plot based on metadata. Default is FALSE.
* `return_metadata` A logical value determining whether to return metadata. Default is FALSE.
* `return_seg` A logical value determining whether to return segment data. Default is FALSE.
* `return_frequency` A logical value determining whether to return CNV frequency data. The frequency calculation is based on segments in segment data and specified group id in metadata. Default is FALSE.
* `assembly` A string specifying which genome assembly version should be applied to CNV frequency calculation and plotting. Allowed options are "hg19", "hg38". Default is "hg38".
* `...` Other parameters relevant to KM plot. These include `pval`, `pval.coord`, `pval.method`, `conf.int`, `linetype`, and `palette` (see ggsurvplot function from survminer package)

# Extract segment data 

```{r}
# specify the location of file
file_name <- system.file("extdata", "example.pgxseg",package = 'pgxRpi')

# extract segment data
seg <- pgxSegprocess(file=file_name,return_seg = TRUE)
```

The segment data looks like this

```{r}
head(seg)
```

# Extract metadata 

```{r}
meta <- pgxSegprocess(file=file_name,return_metadata = TRUE)
```

The metadata looks like this

```{r}
head(meta)
```

# Visualize survival data in metadata 

The KM plot is plotted from samples with available followup state and followup time. The default grouping is "group_id" column in metadata.

```{r, fig.width=7, fig.height=5}
pgxSegprocess(file=file_name,show_KM_plot = TRUE)
```

You can try different grouping by `group_id` parameter

```{r, fig.width=7, fig.height=5}
pgxSegprocess(file=file_name,show_KM_plot = TRUE,group_id = 'histological_diagnosis_id')
```

You can specify more parameters to modify this plot (see parameter `...` in documentation)

```{r, fig.width=7, fig.height=5}
pgxSegprocess(file=file_name,show_KM_plot = TRUE,pval=TRUE,palette='npg')
```

# Calculate CNV frequency 

The CNV frequency is calculated from segments of samples with the same group_id. The group_id
is specified in `group_id` parameter. 

```{r}
# Default is "group_id" in metadata
frequency <- pgxSegprocess(file=file_name,return_frequency = TRUE) 
# Use different ids for grouping
frequency_2 <- pgxSegprocess(file=file_name,return_frequency = TRUE, 
                             group_id ='icdo_morphology_id')
frequency
```

The returned object is same as the CNV frequency object with "pgxfreq" format returned by `pgxLoader` function (format details see the vignette Introduction_3_loadfrequency). The CNV frequency is calculated from groups which exist in both metadata and segment data. It is noted that not all groups in metadata must exist in segment data (e.g. some samples don't have CNV calls).

```{r}
head(frequency[["pgx:icdot-C16.9"]])
```

The associated metadata in CNV frequency objects looks like this 

```{r}
mcols(frequency)
```

```{r}
mcols(frequency_2)
```

## Visualize CNV frequency 

You can visualize the CNV frequency of the interesting group. 

```{r, fig.width=7, fig.height=5}
pgxFreqplot(frequency, filters="pgx:icdot-C16.9")
```

```{r, fig.width=7, fig.height=5}
pgxFreqplot(frequency, filters="pgx:icdot-C16.9",chrom = c(1,8,14), layout = c(3,1))
```

Circos plot supports multiple group visualization

```{r,fig.width=6, fig.height=6}
pgxFreqplot(frequency, filters=c("pgx:icdot-C16.9","pgx:icdot-C73.9"),circos = TRUE)
```

The details of `pgxFreqplot` function see the vignette Introduction_3_loadfrequency.

# Extract all data 

If you want different types of data such as segment data and metadata, and visualize 
the survival data at the same time, you can just set the corresponding parameters as TRUE. 
The returned data is an object including all specified data. It is noted that in this case
the CNV frequency and KM plot use the same group_id.

```{r, fig.width=7, fig.height=5}
info <- pgxSegprocess(file=file_name,show_KM_plot = TRUE, return_seg = TRUE, 
                      return_metadata = TRUE, return_frequency = TRUE)
```

```{r}
names(info)
```

# Session Info

```{r echo = FALSE}
sessionInfo()
```