---
title: "PMScanR: An R Package for the large-scale identification, analysis, and visualization of protein motifs"
author:
  - Jan Pawel Jastrzebski
  - Damian Czopek
  - Monika Gawronska
  - Wiktor Babis
  - Miriana Quaranta
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{PMScanR: Protein Motif Scanning and Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=TRUE}
# Standard setup chunk
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
# Load libraries required for the vignette to build
library(PMScanR)
library(ggseqlogo)
library(seqinr)
library(plotly)
```

# 1 Introduction

The `PMScanR` package supports large-scale identification, analysis, and visualization of protein motifs. It includes functions for running PS-Scan, the scanning tool for the PROSITE database, and for converting the results to GFF, which simplifies downstream steps such as graphical representation and database integration. The package offers several visualization tools, including heatmaps, sequence logos, and pie charts, that help reveal how motifs are distributed and conserved across sequences. Through its integration with PROSITE, `PMScanR` provides access to up-to-date motif data.

Proteins play a crucial role in biological processes, and their functions are closely tied to their structure. Protein function is often associated with specific motifs: short, sometimes repetitive amino acid sequences that are essential for distinctive molecular interactions or modifications. Most existing bioinformatics tools focus on identifying known motifs and often provide no interactive analysis or visualization during motif extraction. Moreover, they do not take into account the effect of single amino acid variations on an entire domain or protein motif, and they lack the capability to compare multiple motifs across multiple sequences. These limitations highlight the need for a tool that can automate and scale the analysis. To address this, we developed `PMScanR`, an R package designed to facilitate and automate the prediction and evaluation of the effect of single amino acid substitutions on the occurrence of protein motifs on a large scale, and to fill the gap in comparative motif analysis.

## 1.1 Installation and loading
To install this package, start R (version "4.4" or higher) and enter:

```{r installation-loading, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("PMScanR")

library(PMScanR)
```

# Quick start

Here we show the most basic pipeline for protein motif analysis: scanning a sequence file, processing the results, and generating an occurrence plot. The chunk below uses the example FASTA file shipped with the package; replace `fasta_file` with the path to your own FASTA file as needed.

```{r quickStart, eval=FALSE}
fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")

runPsScan(in_file = fasta_file, out_file = "results.gff", out_format = "gff")

gff_data <- as.data.frame(rtracklayer::import.gff("results.gff"))
motif_matrix <- gff2matrix(gff_data)

matrix2OP(motif_matrix)
```


# 2 Data manipulation and overall usage

## 2.1 GUI

If the user prefers to perform the analysis using a graphical user interface (GUI), they can simply run the function `runPMScanRShiny()`. This will launch a Shiny app that opens an interactive window. The window can be used both within R and in a web browser, providing a clickable, user-friendly interface that allows the entire analysis, including visualizations, to be carried out without needing to write code.

```{r run-shiny-app, eval=FALSE}
# This command launches the interactive Shiny app
runPMScanRShiny()
```

## 2.2 Command Line

Alternatively, users who prefer to work directly with code can call the package functions from an R script to run and customize the full analysis, including protein motif identification and visualization. Each function is described below, along with an explanation of its purpose and functionality.

### 2.2.1 Loading example data

The first step of the analysis is to set up the working environment.

```{r set-working-directory}
# Setting working directory is user-specific, e.g.:
# setwd("/path/to/your/working/directory")
```

For the purpose of this vignette, we will use sample files included with the `PMScanR` package. These files represent the input (FASTA) and various outputs (GFF, PSA, TXT) generated by the PROSITE analysis.

```{r load-example-files}
# 1. Load example FASTA file (Input for runPsScan)
fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")

# 2. Load example GFF output
gff_file <- system.file("extdata", "out_Hb_gff.txt", package = "PMScanR")

# 3. Load example PSA output
psa_file <- system.file("extdata", "out_Hb_psa.txt", package = "PMScanR")

# 4. Load example PROSITE text output (Scan format)
prosite_txt_file <- system.file("extdata", "out_Hb_PROSITE.txt", package = "PMScanR")
```
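
`system.file()` returns an empty string when a file cannot be found, so it can be worth checking paths before passing them on. The sketch below shows one way to do this; `checkInputFile()` is a hypothetical helper written for this example and is not part of `PMScanR`.

```{r check-input-file-sketch}
# Hypothetical helper (not part of PMScanR): stop early if a path is empty
# or does not point to an existing file.
checkInputFile <- function(path) {
    if (!nzchar(path) || !file.exists(path)) {
        stop("File not found: '", path, "'", call. = FALSE)
    }
    invisible(path)
}

# Demonstration on a temporary file so that the chunk is self-contained
tmp_fasta <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "MVLSPADKTNVKAAW"), tmp_fasta)
checkInputFile(tmp_fasta)
```

The same check could be applied to `fasta_file`, `gff_file`, `psa_file`, and `prosite_txt_file` defined above.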

Once the data is accessible, you can move on to the analysis.

#### `runPsScan()`

This function runs the `ps_scan` tool to identify protein motifs. It automatically detects the operating system and handles the downloading and caching of required PROSITE databases (using `BiocFileCache`) upon the first run.

The `runPsScan` function allows you to specify the output format via the `out_format` argument. Available formats include: `scan` (default), `fasta`, `psa`, `msa`, `gff`, `pff`, `epff`, `sequence`, `matchlist`, `ipro`. 

For the `gff`, `psa`, and `scan` output formats, `PMScanR` provides a way to import the results into R and perform the complete analysis.

```{r runPsScan, eval=FALSE}
# This command is not evaluated in the vignette as it requires an external
# dependency (Perl) and can be time-consuming during the first run.

# Example: Generate GFF output
runPsScan(in_file = fasta_file, out_format = 'gff', out_file = "results.gff")
```
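
Because `ps_scan` depends on Perl, a quick check that a Perl executable is visible to R can save time before the first run. The sketch below uses base R only; it assumes that Perl is located through the system `PATH`, which may not match how `runPsScan()` finds it internally.

```{r check-perl-sketch}
# Sys.which() returns the full path of an executable, or "" if it is not found
perl_path <- Sys.which("perl")

if (nzchar(perl_path)) {
    message("Perl found at: ", perl_path)
} else {
    message("Perl was not found on the PATH; runPsScan() may fail.")
}
```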

### 2.2.2 Parsing Results

The `PMScanR` package is designed to work flexibly with different PROSITE output formats. Whether you have a GFF, PSA, or a standard text file, you can load it into a standardized data frame in R.

**Option A: Loading GFF files**
If you generated a GFF file, use `rtracklayer::import.gff` and convert it to a data frame.
```{r read-gff}
gff_data <- as.data.frame(rtracklayer::import.gff(gff_file))
# The data frame now contains all necessary columns (including Sequence)
head(gff_data)
```

**Option B: Loading PSA files**
If you generated a PSA file, use the `readPsa()` function.
```{r read-psa}
psa_data <- readPsa(psa_file)
head(psa_data)
```

**Option C: Loading standard PROSITE text files**
If you have a standard output file (format 'scan'), use the `readProsite()` function.
```{r read-prosite}
prosite_data <- readProsite(prosite_txt_file)
head(prosite_data)
```

### 2.2.3 Visualization and Analysis

Once the data is loaded into a data frame (using any of the options above), the downstream analysis is identical. You can generate heatmaps, pie charts, and sequence logos from the same object.

#### `gff2matrix()`

This function converts the data frame into a binary motif-occurrence matrix. This matrix is the required input for the heatmap visualization functions. Each row represents a unique motif, each column represents a sequence, and a `1` indicates the presence of that motif.

**Note:** This function works with data loaded from GFF, PSA, or TXT files, as long as they are parsed into the standard format shown above.

```{r convert-gff-to-matrix}
# We can use the data loaded from Option A, B or C. 
# Here we use 'gff_data' as an example.
motif_matrix <- gff2matrix(gff_data)

# Display the first few rows of the resulting matrix
head(motif_matrix)
```
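
To make the structure of such a matrix concrete, the sketch below builds a small binary occurrence matrix from a toy table of hits using base R. The column names `sequence` and `motif` and the sequence labels are assumptions made for this illustration; they do not reflect the internal implementation of `gff2matrix()`.

```{r binary-matrix-sketch}
# Toy table of motif hits; names and values are illustrative only
hits <- data.frame(
    sequence = c("HBA", "HBA", "HBB", "HBG", "HBG"),
    motif    = c("PS01033", "PS00005", "PS01033", "PS01033", "PS01033"),
    stringsAsFactors = FALSE
)

# Count hits per motif (rows) and sequence (columns)
counts <- table(hits$motif, hits$sequence)

# Collapse counts to presence (1) or absence (0)
toy_matrix <- (unclass(counts) > 0) * 1L
toy_matrix
```

Note that repeated hits of the same motif in one sequence collapse to a single `1` in this toy version.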

Once the matrix has been created, an occurrence plot can be generated from it:

#### `matrix2OP()` and `matrix2SquareOP()`

These functions generate interactive occurrence plots from the motif-occurrence matrix. `matrix2OP` creates a standard rectangular occurrence plot, while `matrix2SquareOP` creates one with a square aspect ratio.

```{r generate-occurrence_Plot, fig.show='hold'}
# Generate a standard occurrence plot from the motif_matrix
occurrencePlot <- matrix2OP(input = motif_matrix)
occurrencePlot
```

```{r generate-square-Occurrence-Plot, fig.show='hold'}
# Generate a square occurrence plot from the motif_matrix
squareOccurrencePlot <- matrix2SquareOP(input = motif_matrix)
squareOccurrencePlot
```

#### `freqPie()`

This function generates a pie chart to visualize the frequency distribution of each motif type found in the analysis.

```{r generate-pie-chart, eval=TRUE}
pie_chart <- freqPie(gff_data)
print(pie_chart)
```
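
For readers who want the numbers behind such a chart, the sketch below computes a frequency distribution from a toy vector of motif IDs with base R. The IDs are made up for the example, and how `freqPie()` counts motifs internally is not shown here.

```{r motif-frequency-sketch}
# Toy vector with one PROSITE ID per motif hit (illustrative values)
motif_hits <- c("PS01033", "PS00005", "PS01033", "PS00006", "PS01033")

# Count hits per motif, most frequent first
motif_counts <- sort(table(motif_hits), decreasing = TRUE)

# Convert counts to percentages of all hits
motif_share <- round(100 * prop.table(motif_counts), 1)
motif_share
```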

### 2.2.4 Sequence Analysis (SeqLogo)

#### `extractProteinMotifs()`

To generate sequence logos, you can use the `extractProteinMotifs()` function. This function parses the output file generated by `runPsScan()` (supporting GFF, PSA, and TXT formats) and extracts all instances of each identified motif into a named list in which each element is named by its PROSITE ID.

```{r extract-motifs-from-psa, fig.show='hold'}
# This reads the PROSITE analysis output file from disk and extracts motifs.
# The format is detected automatically, but can also be specified explicitly 
# (e.g., format = "gff").
protein_motifs <- extractProteinMotifs(psa_file)

# Check the PROSITE IDs (keys) found in the file
head(names(protein_motifs))

# Generate sequence logo for the first motif found in the list
ggseqlogo::ggseqlogo(protein_motifs[[1]], seq_type = "aa")
```
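
Sequence logos are built column by column, so the input sequences generally need to share one length. If the instances of a motif differ in length, one simple option is to keep only those with the most common length. The sketch below illustrates this on a toy character vector; it assumes each list element is a character vector of motif instances, which should be verified on your own data.

```{r equal-length-motifs-sketch}
# Toy motif instances; one of them is longer than the others
toy_motifs <- c("NGSA", "NKTL", "NVSAA", "NLTV")

# Find the most common length among the instances
motif_lengths <- nchar(toy_motifs)
modal_length <- as.integer(names(which.max(table(motif_lengths))))

# Keep only the instances of that length
aligned_motifs <- toy_motifs[motif_lengths == modal_length]
aligned_motifs
```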

#### `extractSegments()`

If you want to analyze raw sequences instead of identified motifs (e.g., to look at a specific region across all proteins), use `extractSegments()`.

```{r extract-segments-from-fasta, fig.show='hold'}
# Read the FASTA file into a list of sequences
sequences <- seqinr::read.fasta(file = fasta_file, seqtype = "AA")

# Extract segments from position 10 to 20 from all sequences
segments <- extractSegments(sequences, from = 10, to = 20)

# Generate the sequence logo from the extracted segments
ggseqlogo::ggseqlogo(unlist(segments), seq_type = "aa")
```
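
Conceptually, extracting a fixed window amounts to taking the same substring from every sequence. The base R sketch below shows this on two toy sequences; the sequences and their names are illustrative, and the way `extractSegments()` treats sequences shorter than `to` is not covered here.

```{r segment-extraction-sketch}
# Toy protein sequences (illustrative values)
toy_sequences <- c(
    HBA = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERM",
    HBB = "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLL"
)

# Take positions 10 to 20 (inclusive) from every sequence
toy_segments <- substr(toy_sequences, start = 10, stop = 20)
toy_segments
```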

# References

Sigrist C.J.A., de Castro E., Cerutti L., Cuche B.A., Hulo N., Bridge A., Bougueleret L., Xenarios I. (2012). New and continuing developments at PROSITE. *Nucleic Acids Res.*

# Session Information

```{r session-info}
sessionInfo()
```