---
title: "sfi workflow"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{sfi workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(sfi)
```

## Introduction

The `sfi` package provides feature extraction and alignment tools for gas/liquid chromatography-mass spectrometry (GC/LC-MS) data acquired in single file injection (SFI) mode. The SFI technique enables the analysis of numerous samples (e.g., up to 1000 per day) by acquiring data from multiple injections within a single analytical run, significantly reducing the time compared to traditional serial LC-MS injections.

The input of the `sfi` package can be a single mzML file that contains the interleaved data from all injected samples. A feature table produced from the mzML file by another feature extraction tool, such as `xcms`, can also be used as input. The output is a feature table compatible with regular metabolomics/non-target analysis, which can be used for downstream tasks such as statistical analysis and machine learning.

The core function of the `sfi` package handles an SFI feature table and recovers peaks back to individual samples. Currently, no Bioconductor package is available to analyze such high-throughput data. We also provide a simple local maximum peak picking algorithm for SFI raw data, because regular feature extraction tools such as `xcms` were designed for single-injection files and rely on feature alignment across multiple files. That approach is not suitable for SFI data, where all samples are injected into a single file. `xcms` can still serve as a feature extraction tool for SFI data, but the `sfi` package is still needed to handle the resulting feature table and recover peaks back to individual samples. We depend on the `mzR` package to read the single mzML file and on the `data.table` package to handle large data.

## Installation

```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("sfi")
```

## Experimental Design

The SFI method (see demo figure) operates by:

- Injecting a pooled sample multiple times initially to establish reference chromatographic profiles.
- Subsequently injecting individual samples repeatedly at fixed, short time intervals under continuous isocratic elution. Each sample undergoes a fixed chromatographic separation.
- Generating a single, continuous data file containing the interleaved data from all injected samples.

```{r fig.cap="Schematic representation of the Single File Injection (SFI) workflow. (Top) A pooled Quality Control (QC) sample is injected multiple times at the beginning of the run to establish a robust reference for peak alignment. Individual study samples are then injected sequentially at short, fixed time intervals (e.g., every 1 min) into a continuous isocratic flow. This results in a single, long mzML file containing interleaved chromatograms from all samples, which `sfi` separates and aligns. (Bottom) Demonstration of peak recovery algorithms for SFI data using fixed time intervals", out.width="80%"}
knitr::include_graphics("https://github.com/yufree/presentation/blob/gh-pages/figure/SFI.png?raw=true")
```

## Basic Usage

### Loading raw data

You need to convert the raw data into an mzML file. You might try [ThermoFlask](https://github.com/yufree/thermoflask) for Thermo data or [ProteoWizard](https://proteowizard.sourceforge.io/download.html) for other vendor files.

```{r eval=FALSE}
path <- 'sfi.mzML'
peak <- getmzml(path)
```

### Feature extraction

The `find_2d_peaks()` function performs peak picking on the data read from the mzML file. The inputs are the `mz`, `rt`, and `intensity` values. The output is a data frame containing the peak information.

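As a rough illustration of what local-maximum peak picking means in general, the following base-R sketch finds apexes in a one-dimensional toy intensity trace. It is not the algorithm used by `find_2d_peaks()`, whose internals are not described in this vignette; the helper `local_maxima()`, the threshold, and all data values are hypothetical.

```r
# Hypothetical helper (not part of sfi): indices of strict local maxima
# whose intensity is at or above a noise threshold.
local_maxima <- function(intensity, threshold = 0) {
  n <- length(intensity)
  if (n < 3) return(integer(0))
  mid <- 2:(n - 1)
  is_peak <- intensity[mid] > intensity[mid - 1] &
    intensity[mid] > intensity[mid + 1] &
    intensity[mid] >= threshold
  mid[is_peak]
}

# Toy trace with two apexes (values are made up for illustration)
rt <- seq(0, 9)
intensity <- c(0, 2, 10, 3, 1, 1, 4, 25, 6, 0)

apex <- local_maxima(intensity, threshold = 5)
data.frame(rt = rt[apex], intensity = intensity[apex])
```

A real picker for LC-MS data would additionally have to group points along the m/z dimension and handle noise and plateaus, which this sketch deliberately ignores.
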
```{r}
# load demo data
data(sfi)
# perform peak picking
peaklist <- find_2d_peaks(mz = sfi$mz, rt = sfi$rt, intensity = sfi$intensity)
```

### Feature alignment

The `getsfm()` function extracts the features of each sample from a single injection file.

```{r}
# injection interval
idelta <- 92
# time window for a full separation
windows <- 632
# number of samples in the file
n <- 100
# retention time shift in seconds
deltart <- 10
# minimum peak number in pooled QC samples
minn <- 6
# align peaks from sfi. Note: peaklist is an sfi_peaks object, so we can pass it directly.
se <- getsfm(peaklist, idelta = idelta, windows = windows, n = n, deltart = deltart, minn = minn)
# The output is a SummarizedExperiment object
se
```

### Save feature list

Save the feature list as a csv file.

```{r eval=FALSE}
# extract feature matrix with row metadata
library(SummarizedExperiment)
feature_table <- cbind(as.data.frame(rowData(se)), as.data.frame(assay(se)))
# save features as a csv file
library(data.table)
fwrite(feature_table, 'featurelist.csv')
```

## Using the Results

The primary output of the `sfi` workflow is a feature table (matrix) where rows represent mass spectral features (unique m/z and retention time pairs) and columns represent individual samples. The values in the cells are the integrated intensities of these features.

This feature table allows you to answer biological or chemical questions about your samples:

- **Metabolomics/Lipidomics Profiling:** Identify which compounds change in abundance between different experimental groups (e.g., disease vs. control).
- **Quality Control:** Assess the stability of the instrument over the run by checking the intensity of internal standards or features in QC samples.
- **Sample Classification:** Use the feature profiles to cluster samples or build predictive models.

The "Downstream Analysis" section below demonstrates how to format this data for use with other R/Bioconductor tools that specialize in these statistical tasks.

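To make the quality control use case concrete, here is a minimal base-R sketch that computes the relative standard deviation (RSD) of each feature across pooled QC injections in a toy feature table. The matrix, its `QC`/`S` column names, and the 30% cutoff are assumptions made for illustration; the column layout of the object returned by `getsfm()` may differ.

```r
# Toy feature table: 4 features x 6 injections (all values made up).
# Columns QC1-QC3 stand in for pooled QC injections, S1-S3 for study samples.
feature_table <- matrix(
  c( 100,  102,   98,  150,   90,  120,
     500,  510,  495,  300,  800,  450,
      20,   60,   35,   25,   30,   10,
    1000,  990, 1010,  700,  650,  900),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("F", 1:4),
                  c(paste0("QC", 1:3), paste0("S", 1:3)))
)

# Select the QC columns by name
qc <- feature_table[, grepl("^QC", colnames(feature_table)), drop = FALSE]

# Relative standard deviation (%) of each feature across the QC injections
rsd <- apply(qc, 1, sd) / rowMeans(qc) * 100

# Keep features whose QC RSD is below an illustrative 30% cutoff
stable <- feature_table[rsd < 30, , drop = FALSE]

round(rsd, 1)
rownames(stable)
```

Features with a high QC RSD vary strongly between repeated injections of the same pooled sample, so they are often excluded before statistical analysis. The appropriate cutoff depends on the study.
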
## Downstream Analysis

For integration with other Bioconductor packages, the `sfi` workflow already provides results as a `SummarizedExperiment` object. This class is the standard container for rectangular data in Bioconductor and facilitates interoperability with tools like `DESeq2`, `limma`, or `SummarizedExperiment`-based metabolomics workflows.

### Interoperability and Visualization

Once the data is in a `SummarizedExperiment` object, we can leverage the rich Bioconductor ecosystem for visualization and quality control.

```{r downstream-viz, message=FALSE}
if (requireNamespace("SummarizedExperiment", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(SummarizedExperiment)
  library(ggplot2)

  # Basic data access
  intensities <- assay(se, "intensities")

  # Visualization: Distribution of intensities across samples
  # Reshape for ggplot2
  plot_df <- data.frame(
    Intensity = as.vector(intensities),
    Sample = rep(colnames(intensities), each = nrow(intensities))
  )

  # Filter out zeros for log transformation
  plot_df <- plot_df[plot_df$Intensity > 0, ]

  ggplot(plot_df, aes(x = Sample, y = Intensity, fill = Sample)) +
    geom_boxplot() +
    scale_y_log10() +
    theme_minimal() +
    theme(axis.text.x = element_blank()) +
    labs(title = "Intensity Distribution Across Samples",
         y = "Intensity (log10 scale)",
         x = "Sample Index") +
    guides(fill = "none")
}
```

### Normalization and Transformation

Typical metabolomics workflows require data normalization and transformation. We can use `MsCoreUtils` for these tasks while maintaining the metadata within our object.

```{r downstream-norm, message=FALSE}
if (requireNamespace("SummarizedExperiment", quietly = TRUE) &&
    requireNamespace("MsCoreUtils", quietly = TRUE)) {
  library(SummarizedExperiment)
  library(MsCoreUtils)

  # Log2 transformation
  # We can create a new assay in the SummarizedExperiment
  assay(se, "log2") <- log2(assay(se, "intensities") + 1)

  # Quantile normalization (optional; needs the preprocessCore package)
  # assay(se, "norm") <- MsCoreUtils::normalize_matrix(assay(se, "log2"), method = "quantiles")

  # Example: Summarize feature metadata
  cat("Total number of features:", nrow(se), "\n")
  cat("m/z range:", paste(range(rowData(se)$mz), collapse = " - "), "\n")

  # Accessing sample metadata for group-wise analysis
  colData(se)$Group <- rep(c("Control", "Treatment"), length.out = ncol(se))

  # You can now pass this object to other tools like limma or DESeq2
  # print(se)
}
```

## Advanced Usage

In real data, the nominal separation window and injection interval may be inaccurate because of lag in the sample injection process. We suggest filtering for high-intensity peaks and using `get_sfi_params()` to estimate the actual instrumental separation window and injection interval.

```{r}
# keep high-intensity peaks only
sfmsub <- peaklist[peaklist$intensity > 1e4, ]
class(sfmsub) <- c("sfi_peaks", "data.frame") # Preserve class for method dispatch
# get windows and delta time
sfi_params <- get_sfi_params(sfmsub, n = 158, deltart = 5)
```

Then you can use those parameters for all data.

```{r}
windows <- sfi_params['window']
idelta <- sfi_params['idelta']
se <- getsfm(peaklist, idelta = idelta, windows = windows, n = 158, deltart = 10, minn = 6)
```

## Conclusion

The `sfi` package streamlines the complex task of processing Single File Injection data, converting a continuous stream of interleaved mass spectral data into a structured format ready for analysis.
By correctly identifying and aligning features from rapid, sequential injections, it unlocks the throughput potential of SFI-MS for large-scale metabolomics and screening studies.

```{r}
sessionInfo()
```