---
title: "sfi workflow"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{sfi workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(sfi)
```

## Introduction

The `sfi` package provides feature extraction and alignment tools for gas/liquid chromatography-mass spectrometry (GC/LC-MS) data acquired in single file injection (SFI) mode. The SFI technique enables the analysis of numerous samples (e.g., up to 1000 per day) by acquiring data from multiple injections within a single analytical run, significantly reducing the analysis time compared to traditional serial LC-MS injections.

The input of the `sfi` package can be a single mzML file that contains the interleaved data from all injected samples. A feature table extracted from the mzML file by other feature extraction tools such as `xcms` can also be used as input. The output is a feature table compatible with regular metabolomics/non-target analysis, which can be used for downstream analysis such as statistical analysis and machine learning. The core function of the `sfi` package is designed to handle an SFI feature table and recover peaks back to individual samples. Currently, no Bioconductor package is available to analyze such high-throughput data.

We also provide a simple local maximum peak picking algorithm for SFI raw data. Regular feature extraction tools such as `xcms` were designed for single-injection files and rely on feature alignment across multiple files, which is not suitable for SFI data, where all samples are injected into a single file. `xcms` can still serve as a feature extraction tool for SFI data, but the `sfi` package is still needed to handle the resulting feature table and recover peaks back to individual samples. We depend on the `mzR` package to read the single mzML file and on the `data.table` package to handle large data.

## Installation

```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("sfi")
```

## Experimental Design

The SFI method (see demo figure) operates by:

- Injecting a pooled sample multiple times initially to establish reference chromatographic profiles.

- Subsequently injecting individual samples repeatedly at fixed, short time intervals under a continuous isocratic elution. Each sample undergoes a fixed chromatographic separation.

- This process generates a single, continuous data file containing the interleaved data from all injected samples.

```{r fig.cap="Schematic representation of the Single File Injection (SFI) workflow. (Top) A pooled Quality Control (QC) sample is injected multiple times at the beginning of the run to establish a robust reference for peak alignment. Individual study samples are then injected sequentially at short, fixed time intervals (e.g., every 1 min) into a continuous isocratic flow. This results in a single, long mzML file containing interleaved chromatograms from all samples, which `sfi` separates and aligns. (Bottom) Demonstration of peak recovery algorithms for SFI data using fixed time intervals", out.width="80%"}
knitr::include_graphics("https://github.com/yufree/presentation/blob/gh-pages/figure/SFI.png?raw=true")
```
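
The timing implied by the steps above can be sketched with a few lines of base R. This is only an illustration: the interval and window values are hypothetical, seconds are assumed as the time unit, and the initial pooled injections are left out. It shows when each sample would be injected, when its separation would end, and how many samples can elute at the same time.

```{r}
# Hypothetical timing values (assumed to be in seconds), for illustration only
demo_idelta <- 92   # time between two consecutive injections
demo_window <- 632  # time needed for one full separation
demo_n <- 5         # number of samples in this toy schedule

# Injection time of each sample, counted from the first sample injection
inj_start <- (seq_len(demo_n) - 1) * demo_idelta
demo_schedule <- data.frame(
  sample = seq_len(demo_n),
  inject_at = inj_start,
  elution_ends = inj_start + demo_window
)
demo_schedule

# Upper bound on the number of samples eluting at the same time
max_overlap <- ceiling(demo_window / demo_idelta)
max_overlap
```

Because the separation window is longer than the injection interval, signals from several samples overlap in time, which is why peaks have to be assigned back to individual samples afterwards.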

## Basic Usage

### Loading raw data

You need to convert the raw data into an mzML file. You might try [ThermoFlask](https://github.com/yufree/thermoflask) for Thermo data or [ProteoWizard](https://proteowizard.sourceforge.io/download.html) for other vendor formats.

```{r eval=FALSE}
path <- 'sfi.mzML'
peak <- getmzml(path)
```

### Feature extraction

The `find_2d_peaks()` function performs peak picking on the data read from the mzML file. The input is the m/z, retention time, and intensity of the raw data points. The output is a data frame (of class `sfi_peaks`) containing the peak information.

```{r}
# load demo data
data(sfi)
# perform peak picking
peaklist <- find_2d_peaks(mz = sfi$mz, rt = sfi$rt, intensity = sfi$intensity)
```
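
To give an intuition for local maximum peak picking, here is a minimal one-dimensional sketch in base R. It is not the implementation used by `find_2d_peaks()`; the helper `find_local_maxima()` and the toy intensity trace are hypothetical and only illustrate the general idea of keeping points that are higher than their neighbours and above an intensity threshold.

```{r}
# Hypothetical helper: indices of local maxima in an intensity trace
find_local_maxima <- function(x, min_intensity = 0) {
  n <- length(x)
  if (n < 3) return(integer(0))
  idx <- 2:(n - 1)
  # a point is kept if it rises above its left neighbour, is not lower
  # than its right neighbour, and reaches the intensity threshold
  is_peak <- x[idx] > x[idx - 1] & x[idx] >= x[idx + 1] & x[idx] >= min_intensity
  idx[is_peak]
}

# Toy intensity trace along retention time (made-up values)
toy_int <- c(1, 3, 8, 4, 2, 5, 12, 6, 2, 1)
toy_peaks <- find_local_maxima(toy_int, min_intensity = 5)
toy_peaks
```

A real peak picker works on both the m/z and the retention time dimension and has to deal with noise, so this sketch should only be read as a conceptual aid.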

### Feature alignment

The `getsfm()` function extracts the features of individual samples from a single injection file.

```{r}
# injection interval
idelta <- 92
# time window for a full separation
windows <- 632
# number of samples in the file
n <- 100
# retention time shift in seconds
deltart <- 10
# minimum peak number in pooled QC samples
minn <- 6
# align peaks from sfi. Note: peaklist is an sfi_peaks object, so we can pass it directly.
se <- getsfm(peaklist, idelta = idelta, windows = windows, n = n, deltart = deltart, minn = minn)
# The output is a SummarizedExperiment object
se
```
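
The idea of recovering peaks with a fixed injection interval can be sketched on toy data. This is a conceptual illustration under stated assumptions, not the algorithm inside `getsfm()`: the reference retention time, the observed peaks, and all `toy_` names are hypothetical. For one feature, the expected retention time in each sample is the reference retention time shifted by a multiple of the injection interval, and an observed peak is assigned to a sample when it falls within the tolerated retention time shift.

```{r}
toy_idelta <- 92    # injection interval (hypothetical, seconds)
toy_deltart <- 10   # tolerated retention time shift (hypothetical, seconds)
toy_n <- 4          # number of samples
toy_ref_rt <- 150   # hypothetical retention time of the feature in the first sample

# Observed peaks for one m/z in the continuous run (made-up values)
toy_obs <- data.frame(rt = c(152, 240, 431), intensity = c(1000, 1500, 800))

# Expected retention time of the feature in each sample
toy_expected_rt <- toy_ref_rt + (seq_len(toy_n) - 1) * toy_idelta

# Assign the most intense matching peak to each sample, NA if none matches
toy_recovered <- vapply(toy_expected_rt, function(rt0) {
  hit <- abs(toy_obs$rt - rt0) <= toy_deltart
  if (any(hit)) max(toy_obs$intensity[hit]) else NA_real_
}, numeric(1))

data.frame(sample = seq_len(toy_n),
           expected_rt = toy_expected_rt,
           intensity = toy_recovered)
```

In this toy case the third sample has no peak within the tolerance, so its intensity stays missing.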

### Save feature list

Save the feature list as a CSV file.

```{r eval=FALSE}
# extract feature matrix with row metadata
library(SummarizedExperiment)
feature_table <- cbind(as.data.frame(rowData(se)), as.data.frame(assay(se)))
# save feature as csv file
library(data.table)
fwrite(feature_table, 'featurelist.csv')
```
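
If `data.table` is not available, base R can write the same kind of table. The sketch below uses a small made-up feature table (the column names and values are hypothetical) and a temporary file, and reads the file back to check that the values survive the round trip.

```{r}
# Made-up feature table: two features, two samples
toy_features <- data.frame(
  mz = c(180.0634, 255.2324),
  rt = c(120.5, 340.2),
  S1 = c(1200, 530),
  S2 = c(1100, 610)
)

# Write to a temporary CSV file and read it back
toy_file <- tempfile(fileext = ".csv")
write.csv(toy_features, toy_file, row.names = FALSE)
toy_back <- read.csv(toy_file)
unlink(toy_file)

all.equal(toy_features$mz, toy_back$mz)
```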

## Using the Results

The primary output of the `sfi` workflow is a feature table (matrix) where rows represent mass spectral features (unique m/z and retention time pairs) and columns represent individual samples. The values in the cells are the integrated intensities of these features.

This feature table allows you to answer biological or chemical questions about your samples:

- **Metabolomics/Lipidomics Profiling:** Identify which compounds change in abundance between different experimental groups (e.g., disease vs. control).
- **Quality Control:** Assess the stability of the instrument over the run by checking the intensity of internal standards or features in QC samples.
- **Sample Classification:** Use the feature profiles to cluster samples or build predictive models.

The "Downstream Analysis" section below demonstrates how to format this data for use with other R/Bioconductor tools that specialize in these statistical tasks.
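
As a small illustration of the quality control use case, the relative standard deviation (RSD) of each feature across QC injections can be computed with base R. The matrix below is made up, and the 30% cutoff is an arbitrary value chosen for this example rather than a recommendation of the package.

```{r}
# Made-up intensities of two features in four QC injections
toy_qc <- matrix(
  c(100, 105, 98, 102,
    50, 90, 20, 70),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("feature_stable", "feature_noisy"), paste0("QC", 1:4))
)

# Relative standard deviation in percent for each feature
toy_rsd <- apply(toy_qc, 1, function(x) sd(x) / mean(x) * 100)
round(toy_rsd, 1)

# Flag features below the example cutoff
toy_rsd < 30
```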

## Downstream Analysis

For integration with other Bioconductor packages, the `sfi` workflow already provides results as a `SummarizedExperiment` object. This class is the standard container for rectangular data in Bioconductor and facilitates interoperability with tools like `DESeq2`, `limma`, or `SummarizedExperiment`-based metabolomics workflows.

### Interoperability and Visualization

Once the data is in a `SummarizedExperiment` object, we can leverage the rich Bioconductor ecosystem for visualization and quality control.

```{r downstream-viz, message=FALSE}
if (requireNamespace("SummarizedExperiment", quietly = TRUE) && 
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(SummarizedExperiment)
  library(ggplot2)
  
  # Basic data access
  intensities <- assay(se, "intensities")
  
  # Visualization: Distribution of intensities across samples
  # Reshape for ggplot2
  plot_df <- data.frame(
    Intensity = as.vector(intensities),
    Sample = rep(colnames(intensities), each = nrow(intensities))
  )
  # Filter out zeros for log transformation
  plot_df <- plot_df[plot_df$Intensity > 0, ]
  
  ggplot(plot_df, aes(x = Sample, y = Intensity, fill = Sample)) +
    geom_boxplot() +
    scale_y_log10() +
    theme_minimal() +
    theme(axis.text.x = element_blank()) +
    labs(title = "Intensity Distribution Across Samples",
         y = "Log10 Intensity",
         x = "Sample Index") +
    guides(fill = "none")
}
```

### Normalization and Transformation

Typical metabolomics workflows require data normalization and transformation. We can use `MsCoreUtils` for these tasks while maintaining the metadata within our object.

```{r downstream-norm, message=FALSE}
if (requireNamespace("SummarizedExperiment", quietly = TRUE) && 
    requireNamespace("MsCoreUtils", quietly = TRUE)) {
  library(SummarizedExperiment)
  library(MsCoreUtils)
  
  # Log2 transformation
  # We can create a new assay in the SummarizedExperiment
  assay(se, "log2") <- log2(assay(se, "intensities") + 1)
  
  # Quantile normalization (requires the preprocessCore package)
  # assay(se, "norm") <- MsCoreUtils::normalize_matrix(assay(se, "log2"), method = "quantiles")
  
  # Example: Summarize feature metadata
  cat("Total number of features:", nrow(se), "\n")
  cat("M/Z range:", paste(range(rowData(se)$mz), collapse = " - "), "\n")
  
  # Accessing sample metadata for group-wise analysis
  colData(se)$Group <- rep(c("Control", "Treatment"), length.out = ncol(se))
  
  # You can now pass this object to other tools like limma or DESeq2
  # print(se)
}
```
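
For readers who want to see what a simple normalization does without extra packages, the following base R sketch applies a log2 transformation followed by per-sample median centering to a made-up matrix. Median centering is used here only as an easy-to-follow stand-in and is not a method prescribed by `sfi`.

```{r}
# Made-up intensity matrix: three features (rows) in three samples (columns)
toy_mat <- matrix(
  c(100, 200, 400,
    10, 20, 40,
    1000, 2000, 4000),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("f", 1:3), paste0("s", 1:3))
)

# Log2 transformation with a pseudo-count of 1
toy_log <- log2(toy_mat + 1)

# Subtract the median of each sample (column) from its values
toy_centered <- sweep(toy_log, 2, apply(toy_log, 2, median), "-")

# After centering, every sample has a median of zero
apply(toy_centered, 2, median)
```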

## Advanced Usage

In real data, the nominal separation window and injection interval are often inaccurate because of lag in the sample injection process. We suggest keeping only high-intensity peaks and using `get_sfi_params()` to estimate the actual instrumental separation window and injection interval.

```{r}
# keep only high-intensity peaks
sfmsub <- peaklist[peaklist$intensity > 1e4, ]
class(sfmsub) <- c("sfi_peaks", "data.frame") # Preserve class for method dispatch
# estimate the separation window and injection interval
sfi_params <- get_sfi_params(sfmsub, n=158, deltart = 5)
```

Then you can use those parameters for all data.

```{r}
windows <- sfi_params['window']
idelta <- sfi_params['idelta']
se <- getsfm(peaklist,
              idelta = idelta, windows = windows,
              n = 158, deltart = 10, minn = 6)
```
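
To build intuition for why intense peaks help with parameter estimation, consider one strong feature that shows up once per injection. The spacing between its consecutive retention times approximates the injection interval. The sketch below uses made-up retention times and a plain median of differences; it is an assumption-based illustration and does not describe how `get_sfi_params()` works internally.

```{r}
# Made-up retention times (seconds) of one intense feature
# observed in six consecutive injections
toy_peak_rt <- c(150.4, 242.1, 334.9, 426.2, 518.8, 610.3)

# The median spacing is a robust estimate of the injection interval
toy_est_idelta <- median(diff(toy_peak_rt))
toy_est_idelta
```

Using the median instead of the mean keeps a single delayed injection from distorting the estimate.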

## Conclusion

The `sfi` package streamlines the complex task of processing Single File Injection data, converting a continuous stream of interleaved mass spectral data into a structured format ready for analysis. By correctly identifying and aligning features from rapid, sequential injections, it unlocks the throughput potential of SFI-MS for large-scale metabolomics and screening studies.

```{r}
sessionInfo()
```