TraianProt 0.99.14
TraianProt is an interactive R/Shiny package designed to simplify the downstream analysis of wide-format proteomics data.
It provides a modular workflow for processing both label-free and labeled quantitative data obtained from Data-Dependent Acquisition (DDA) or Data-Independent Acquisition (DIA) mass spectrometry experiments.
Supported input formats include output from tools such as MaxQuant, MSFragger, DIA-NN, ProteoScape, and Proteome Discoverer.
💡 This vignette provides a step-by-step walkthrough of TraianProt — from data import to biological interpretation.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("TraianProt")
Once installed, you can load the package into your R session:
library(TraianProt)
library(dplyr)
library(knitr)
The primary and most user-friendly way to interact with TraianProt is through its interactive Shiny application. This GUI guides users from data upload to final biological interpretation without requiring R programming skills.
To start the graphical interface, simply run the following command in your R console:
# runTraianProt()
Before starting your analysis in the Shiny app, ensure you have the following files ready:
Protein Groups / Raw Data Table: This is the direct output from your search engine (e.g., proteinGroups.txt from MaxQuant or report.tsv from DIA-NN). Each column should correspond to a sample, and rows should represent proteins.
Metadata Table: A tab-separated values (.tsv) or comma-separated values (.csv) file that defines your experimental design. This file is globally shared among all search engines supported by the app. It must contain the following information:
intensity_sample_name: The exact column names as they appear in the raw data table.
group: The experimental condition (e.g., “WT”, “Treatment”).
sample_name: A simplified, user-friendly name for the sample.
Note: A detailed metadata structure guide and examples can be found in the tutorial available in the Shiny application’s “Help” section.
Before starting, ensure your data file is formatted appropriately (e.g., protein groups table and metadata table).
In protein groups table file, ach column should correspond to a sample , and rows should represent proteins.
The following code chuncks include the example format for MaxQuant output data and the metadata format which is globally shared among all search engines supported along with several parameters required for the analysis.
# 1. Define file paths for example data included in the package
file_path_raw <- system.file("extdata", "proteinGroups.txt", package = "TraianProt")
file_path_metadata <- system.file("extdata", "metadata_MaxQuant.tsv", package = "TraianProt")
# file_path_report #In case we are using DIANN data for reporting where the report.tsv file is located.
output_directory <- tempdir() # Use a temporary directory for outputs in this vignette
# 2. Define Experimental Parameters
platform_id <- 1 # 1=MaxQuant, 2=MSFragger, 3=DIA-NN, 4=PD, 5=ProteoScape
organism_id <- 1 # 1= Candida albicans model organism, 2= Other organism (e.g Homo sapiens/Mus musculus etc)
selected_conditions <- c("WT_H2O2", "WT") # Condition id defined in group column from metadata file.
conditions <- "WT_H2O2 vs WT" # Title for plots.
id_target <- "ENSG" # Gene ID type
organismo <- "calbicans" # Organism for annotation, more at https://biit.cs.ut.ee/gprofiler/page/organism-list
threshold <- 0.05 # Significance threshold for enrichment analyis.
font_size <- 12 # Term size for plots in the enrichment analysis.
terms_number <- 12 # Number of terms for enrichment plots in the enrichment analysis.
psms <- TRUE # Whether PSMs correction is applied in differential analysis.
The metadata structure is defined in the tutorial.pdf file which can be found at https://github.com/SamueldelaCamaraFuentes/TraianProt. In the following chunck some modifications are aggreagated depending on the platform (MaxQuant, FragPipe, DIANN…) in order to facilitate unique peptide filtering and ongoing processes.
metadata <- read.delim(file_path_metadata, header = TRUE, row.names = NULL, stringsAsFactors = FALSE)
if (platform_id == 1 | platform_id == 2 | platform_id == 5) {
metadata <- metadata %>%
mutate(
raw_name = intensity_sample_name,
intensity_sample_name = raw_name,
log2_col = sub("Intensity", "LOG2", raw_name)
) %>%
mutate(
raw_name = intensity_sample_name,
unique_peptides_col = sub("Intensity", "Unique.peptides", raw_name)
) %>%
select(intensity_sample_name, group, sample_name, log2_col, unique_peptides_col)
} else if (platform_id == 3) {
metadata <- metadata %>%
mutate(
raw_name = basename(intensity_sample_name),
intensity_sample_name = raw_name,
log2_col = sub("\\.(d|raw)$", ".LOG2", raw_name, ignore.case = TRUE),
unique_peptides_col = paste0("Unique peptides ", sub("\\.(d|raw)$", "", raw_name, ignore.case = TRUE))
) %>%
select(intensity_sample_name, group, sample_name, log2_col, unique_peptides_col)
} else if (platform_id == 4) {
metadata <- metadata %>%
mutate(
raw_name = intensity_sample_name,
intensity_sample_name = raw_name,
log2_col = sub("Abundance:", "LOG2", raw_name)
) %>%
mutate(
raw_name = intensity_sample_name,
unique_peptides_col = sub("Abundance:", "Unique.peptides", raw_name)
) %>%
select(intensity_sample_name, group, sample_name, log2_col, unique_peptides_col)
}
filtered_metadata <- metadata[metadata$group %in% selected_conditions, ]
In the following chunk, we load raw proteomics data, taking into account various considerations depending on the platform used (e.g. MaxQuant, Fragpipe or DIANN). Once the data has been loaded, the quick_filtering function helps to remove contaminants and decoys (if using MaxQuant) and converts all the intensity values from the samples in the experiment into numeric values. Furthermore, for DIANN data it extracts peptide information from the report.tsv file whose directory has been supplied in the “file_path_report” parameter.
if (platform_id == 1 | platform_id == 2 | platform_id == 5) {
raw <- read.delim(file_path_raw, sep = "\t", stringsAsFactors = FALSE, colClasses = "character")
df <- TraianProt::quick_filtering(raw, platform_id, organism_id, filtered_metadata, selected_conditions)
} else if (platform_id == 3) {
raw <- read.delim(file_path_raw, sep = "\t", stringsAsFactors = FALSE, colClasses = "character", check.names = FALSE)
df <- quick_filtering(raw, platform_id, organism_id, filtered_metadata, selected_conditions, file_path_report)
df <- as.data.frame(df)
} else if (platform_id == 4) {
raw <- as.data.frame(readxl::read_xlsx(file_path_raw))
df <- quick_filtering(raw, platform_id, organism_id, filtered_metadata, selected_conditions)
}
# LOG 2 Intensity
LOG2.names <- obtain_LOG.names(df)
TraianProt provides a comprehensive suite of analytical modules:
The first step is to identify the unique proteins. These are defined as proteins that have only been identified and quantified in one of the conditions under study. These proteins are important and will be considered significant in later steps.
# Unique Proteins
unique_proteins <- TraianProt::obtain_unique_proteins(df, filtered_metadata, selected_conditions)
unique_proteins_control <- as.data.frame(unique_proteins[1], check.names = FALSE)
unique_proteins_treatment <- as.data.frame(unique_proteins[2], check.names = FALSE)
The next step is to filter the data frame by unique peptides. The user will select the proportion of samples required for a protein to be included in the analysis (the ‘min_fraction’ parameter). For example, proteins with one unique peptide in at least 50% of the samples in at least one of the two compared conditions will be kept for further analysis. The user can then filter for missing values; during log transformation, these will be assigned as ‘NA’ and considered invalid. The normalisation step can be assessed using two functions: the ‘median_centering’ function normalises according to the median of the samples, while the ‘normalization_func’ function normalises using different methods, such as mean, median centring, trimMean and VSC.
Finally, the choice of imputation can be assessed. Users can choose to filter using the impute_data function, which performs imputation based on a normal distribution, or the “K Nearest Neighbors” function, which uses the KNN algorithm for imputation. Alternatively, they can choose not to impute at all.
# 5. Preprocessing
# ------------------------------------------------------------------------------
# 5.1) UNIQUE PEPTIDES FILTERING
df <- unique_peptides_filter(df, filtered_metadata, number = 1, min_fraction = 0.5)
# 5.2) Quantification filtering
df.F <- TraianProt::filter_valids(df, filtered_metadata, unique_proteins, min_prop = 0.5, at_least_one <- FALSE, labeltype = 1)
# 5.3) Normalization
df.F <- median_centering(df.F, LOG2.names)
df.F <- normalization_func(df.F, LOG2.names, method = "mean")
# 5.4) Imputation
# df.FNI <- impute_KNN_data(as.data.frame(df.F), LOG2.names, k = 5)
df.FNI <- impute_data(as.data.frame(df.F), LOG2.names)
df.FNI <- df.F # No imputation preferred action by developers.
total <- bind_rows(df.FNI, as.data.frame(unique_proteins[1], check.names = FALSE))
total_dataset <- bind_rows(total, as.data.frame(unique_proteins[2], check.names = FALSE))
A Venn diagram showing where users can inspect the proteins that are common to each condition, as well as those that are unique to each condition.
library(VennDiagram)
venn_diagram_plot <- grid.draw(venn_diagram(df.F, unique_proteins, color1 = "blue", color2 = "maroon"))
plot(venn_diagram_plot)
Users can see how many proteins have been identified and quantified in their experiment.
identify_proteins(df, filtered_metadata, platform_id, selected_conditions)
This module covers a group of plots that describe the nature of our data (distribution, dispersion, missing values proportion in our data) Inside this section we can highlight the following setions: * Distribution plots: include a boxplot and dispersion plot. * Imputation plots: include a representation of the amount of missing values in the data before imputation and overly of both imputed and non-imputed distribution in data. * Normality plots: covers a set of plots whose purpose is to the representation of data´s distribution, including histogram of proteins abundances and a Q-Q plot. * Dimension reduction plots: plot with Principal Component Analysis or t-SNE in case t-SNE is displayed a perplexity parameter can be applied, taking as the maximum value N/2, being N the number of samples per condition. * Correlation plots: include a Scater plot and correlation plot.
boxplot <- boxplot_function(df.FNI, filtered_metadata, selected_conditions)
print(boxplot)
preimputation_state(df.F, filtered_metadata$log2_col)
corrplot_function(df.FNI[filtered_metadata$log2_col], filtered_metadata, display = "shade")
pca(df.FNI, filtered_metadata, selected_conditions)
tsne(df.FNI, filtered_metadata, perplexity_num = 2, selected_conditions)
In the differential analysis module, users can perform statistical tests. Firstly, the statistical test required to analyse the data must be chosen. The ‘Simple t-test approach’ option (test = 1) performs two-sample t-tests on protein intensity data using the t-test function from base R, while the ‘Limma approach’ option (test = 2) uses the limma (v 3.64.3) R package to calculate significant differences between groups. Lastly, the ‘Wilcoxon test’ option (test = 3) is a non-parametric alternative that can be used to compare two independent groups of samples when the data is not normally distributed. If the ‘Limma approach’ is selected, a variance correction depending on the number of PSMs identified can be applied (PSMs = TRUE, previously defined), using the DEqMS (v1.26.0) R package.
Furthermore, users can choose whether or not to include the unique proteins in the final dataset as significant proteins (way parameter). When the “way” parameter is set to “2”, unique proteins are included in the final dataset. This results in the maximum log2FC value for proteins exclusively identified in the “Treatment” condition and the minimum log2FC value for proteins exclusively identified in the “Control” condition. It also results in the smallest p-value and adjusted p-value for proteins exclusively identified in the “Treatment” condition and the smallest p-value and adjusted p-value for proteins exclusively identified in the “Control” condition. This is done to keep track of unique proteins. Furthermore, the two-sample t-test can be specified as dependent or independent, and the paired parameter can be selected. Finally, the user can select the statistical cut-off value, which is established in the sig parameter, and the method for committing the p-value adjustment is specified in the adjval parameter. Furthermore, a Log2FC threshold can be assigned.
limma_1 <- TraianProt::statistical_analysis(df.FNI, test = 2, paired = FALSE, filtered_metadata, logfc = 1, sig = 0.05, adjval = "fdr", statval = 1, unique_proteins, way = 2, psms = TRUE, platform_id, selected_conditions, diann_dir = if (platform_id == 3) file_path_report else NULL)
limma <- limma_1[-6]
limma <- merge(total_dataset, limma, by = "Protein", check.names = FALSE)
if (psms == TRUE) {
limma <- limma %>%
dplyr::select("Protein", "Protein_description", "logFC", "sca.P.Value", "sca.adj.pval", "expression", everything())
} else if (psms == FALSE) {
limma <- limma %>%
select("Protein", "Protein_description", "logFC", "p.value", "adj.P.Val", "expression", everything())
}
limma <- limma[order(limma$expression), ]
row.names(limma) <- limma$Protein
# write.xlsx(limma, file = "AD_.xlsx", rowNames = FALSE)
Columns from differential analysis file:
In case it is DIA data coming from DIANN: * Sample name ending in .d: Raw intensity values of the protein in each of the samples handled in the experiment. In case it is DDA data coming from FragPipe: * Sample name ending in .Intensity: Raw intensity values of the protein in each of the samples handled in the experiment. In case it is DIA data coming from MaxQuant: * Intensity. followed by sample name: Raw intensity values of the protein in each of the samples handled in the experiment.
In this module 2 different types of plots are included; the first one is the volcano plot which enables the visualization of data within an experimental setup and how each comparison differs to another, users can specify the points size and the title. In addition, in this module two heatmaps are displayed, the first one displays the identified proteins and the second one those proteins that exhibited differential relative abundances, users can write the title for the plot.
volcano <- volcano_plot_tiff(limma, title = "Treatment vs Control", label = 3, statval = 2, psms)
print(volcano)
heatmap <- my_heatmap_differential(limma, df.FNI, filtered_metadata$log2_col, title = "Treatment vs Control")
The functional analysis module is focused on performing gene set enrichment analysis taking into consideration those proteins that exhibited significant changes in their relative abundance and it is performed using the gprofiler2 (v0.2.3) R package. For that purpose, users have to introduce the organism id, for example in case we are analyzing C. albicans derived mass spectrometry data, “calbicans” should be introduced (obtained from htps://biit.cs.ut.ee/gprofiler/page/organism-list) the type of Protein ID which serves as the nomenclature destination change for the protein ids (it is always ENSG as default), the enrichment threshold for the retrieval of functional terms and whether the whole list of identified proteins is used as the background (this option is recommended to set as “YES” as it is going to allow to obtain unbiased results).
Go_terms <- Goterms_finder(limma, df, target = "ENSG", numeric_ns = "", mthreshold = Inf, filter_na = TRUE, organismo = "calbicans", custombg = FALSE, platform_id, user_threshold = 0.05, multi_query = FALSE, evcodes = TRUE, sources = c("GO", "KEGG", "WP", "REAC"))
# write.xlsx(Go_terms, file = "AF_.xlsx", rowNames = FALSE)
Columns from functional enrichment file: * Cluster: Category to which the term belongs (down-regulated or up-regulated). Same information as the “Conditions” column. * Category: database from which the term originates. * ID: identifier of the term. * Description: description of the term. * p.value: p-value after correction for multiple testing. * adj.P.Val: adjusted p-value after correction for multiple testing. * Query_size: number of protein identifiers included in the search. * Count: number of protein identifiers in the search that are annotated with the corresponding term. * Term_size: total number of protein identifiers annotated with the term. * Effective_domain_size: list of protein identifiers that map to the term. * geneID: total number of protein identifiers used for the hypergeometric test. * Gene Ratio: number of protein identifiers in the query that are annotated with the corresponding term divided by the number of protein identifiers included in the query (Count/query_size). * Bg ratio: total number of protein identifiers annotated in the term divided by the total number of protein identifiers used for the hypergeometric test (term_size/geneID).
In the plot section, the user has the ability of customizing both a Dotplot and a Barplot, by adding the title name, the number of terms to display and the corresponding size of the font used.
In the interaction analysis module, the user is allowed to perform a protein interaction analysis with the proteins that exhibited significant changes in their relative abundance using the STRINGdb (v2.20) R package. For that purpose, the user is required to introduce the STRING ID for the species along with a score threshold.
#> Warning: we couldn't map to STRING 3% of your identifiersWarning: we couldn't map to STRING 6% of your identifiers
#> Warning: we couldn't map to STRING 6% of your identifiers
For reproducibility, include session information:
sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] knitr_1.51 dplyr_1.2.1 TraianProt_0.99.14 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] STRINGdb_2.23.0 fs_2.1.0
#> [3] matrixStats_1.5.0 bitops_1.0-9
#> [5] enrichplot_1.31.5 httr_1.4.8
#> [7] RColorBrewer_1.1-3 tools_4.6.0
#> [9] mlr3learners_0.14.0 backports_1.5.1
#> [11] R6_2.6.1 DT_0.34.0
#> [13] lazyeval_0.2.3 withr_3.0.2
#> [15] sp_2.2-1 VennDiagram_1.8.2
#> [17] cli_3.6.6 Biobase_2.71.0
#> [19] formatR_1.14 DEqMS_1.29.0
#> [21] scatterpie_0.2.6 labeling_0.4.3
#> [23] sass_0.4.10 S7_0.2.2
#> [25] robustbase_0.99-7 proxy_0.4-29
#> [27] mlr3tuning_1.6.0 systemfonts_1.3.2
#> [29] yulab.utils_0.2.4 gson_0.1.0
#> [31] DOSE_4.5.1 paradox_1.0.1
#> [33] dichromat_2.0-0.1 parallelly_1.47.0
#> [35] plotrix_3.8-14 limma_3.67.3
#> [37] RSQLite_2.4.6 generics_0.1.4
#> [39] gridGraphics_0.5-1 gtools_3.9.5
#> [41] car_3.1-5 GO.db_3.23.1
#> [43] Matrix_1.7-5 futile.logger_1.4.9
#> [45] S4Vectors_0.49.3 abind_1.4-8
#> [47] lifecycle_1.0.5 yaml_2.3.12
#> [49] carData_3.0-6 SummarizedExperiment_1.41.1
#> [51] gplots_3.3.0 qvalue_2.43.0
#> [53] SparseArray_1.11.13 Rtsne_0.17
#> [55] grid_4.6.0 blob_1.3.0
#> [57] promises_1.5.0 crayon_1.5.3
#> [59] shinydashboard_0.7.3 miniUI_0.1.2
#> [61] ggtangle_0.1.2 lattice_0.22-9
#> [63] KEGGREST_1.51.1 magick_2.9.1
#> [65] pillar_1.11.1 wrProteo_2.0.0.2
#> [67] GenomicRanges_1.63.2 boot_1.3-32
#> [69] codetools_0.2-20 glue_1.8.1
#> [71] ggiraph_0.9.6 ggfun_0.2.0
#> [73] fontLiberation_0.1.0 data.table_1.18.2.1
#> [75] vcd_1.4-13 vctrs_0.7.3
#> [77] png_0.1-9 treeio_1.35.0
#> [79] gtable_0.3.6 gsubfn_0.7
#> [81] cachem_1.1.0 xfun_0.57
#> [83] S4Arrays_1.11.1 mime_0.13
#> [85] Seqinfo_1.1.0 aisdk_1.1.0
#> [87] pheatmap_1.0.13 tinytex_0.59
#> [89] statmod_1.5.1 nlme_3.1-169
#> [91] ggtree_4.1.2 bit64_4.8.0
#> [93] fontquiver_0.2.1 bbotk_1.10.0
#> [95] mlr3pipelines_0.11.0 bslib_0.10.0
#> [97] mlr3_1.6.0 KernSmooth_2.23-26
#> [99] otel_0.2.0 colorspace_2.1-2
#> [101] BiocGenerics_0.57.1 DBI_1.3.0
#> [103] nnet_7.3-20 mlr3misc_0.21.0
#> [105] tidyselect_1.2.1 processx_3.9.0
#> [107] curl_7.1.0 bit_4.6.0
#> [109] compiler_4.6.0 chron_2.3-62
#> [111] httr2_1.2.2 lgr_0.5.2
#> [113] fontBitstreamVera_0.1.1 DelayedArray_0.37.1
#> [115] plotly_4.12.0 bookdown_0.46
#> [117] checkmate_2.3.4 scales_1.4.0
#> [119] caTools_1.18.3 DEoptimR_1.1-4
#> [121] lmtest_0.9-40 callr_3.7.6
#> [123] rappdirs_0.3.4 palmerpenguins_0.1.1
#> [125] stringr_1.6.0 digest_0.6.39
#> [127] rmarkdown_2.31 XVector_0.51.0
#> [129] htmltools_0.5.9 pkgconfig_2.0.3
#> [131] MatrixGenerics_1.23.0 fastmap_1.2.0
#> [133] rlang_1.2.0 htmlwidgets_1.6.4
#> [135] shiny_1.13.0 farver_2.1.2
#> [137] jquerylib_0.1.4 zoo_1.8-15
#> [139] jsonlite_2.0.0 GOSemSim_2.37.2
#> [141] RCurl_1.98-1.18 magrittr_2.0.5
#> [143] Formula_1.2-5 ggplotify_0.1.3
#> [145] patchwork_1.3.2 Rcpp_1.1.1-1.1
#> [147] ape_5.8-1 ggnewscale_0.5.2
#> [149] proto_1.0.0 gdtools_0.5.0
#> [151] sqldf_0.4-12 stringi_1.8.7
#> [153] MASS_7.3-65 plyr_1.8.9
#> [155] parallel_4.6.0 listenv_0.10.1
#> [157] ggrepel_0.9.8 wrMisc_2.0.0
#> [159] Biostrings_2.79.5 splines_4.6.0
#> [161] hash_2.2.6.4 ps_1.9.3
#> [163] igraph_2.3.0 uuid_1.2-2
#> [165] ranger_0.18.0 enrichit_0.1.4
#> [167] reshape2_1.4.5 stats4_4.6.0
#> [169] futile.options_1.0.1 gprofiler2_0.2.4
#> [171] evaluate_1.0.5 lambda.r_1.2.4
#> [173] BiocManager_1.30.27 laeken_0.5.3
#> [175] tweenr_2.0.3 httpuv_1.6.17
#> [177] VIM_7.0.0 tidyr_1.3.2
#> [179] purrr_1.2.2 polyclip_1.10-7
#> [181] future_1.70.0 ggplot2_4.0.3
#> [183] ggforce_0.5.0 ggExtra_0.11.0
#> [185] xtable_1.8-8 e1071_1.7-17
#> [187] tidytree_0.4.7 tidydr_0.0.6
#> [189] later_1.4.8 viridisLite_0.4.3
#> [191] class_7.3-23 tibble_3.3.1
#> [193] clusterProfiler_4.19.8 aplot_0.2.9
#> [195] memoise_2.0.1 AnnotationDbi_1.73.1
#> [197] IRanges_2.45.0 cluster_2.1.8.2
#> [199] shinyWidgets_0.9.1 globals_0.19.1
TraianProt streamlines the analysis of proteomics datasets within an intuitive graphical interface.
From normalization to enrichment analysis, it supports a complete and reproducible workflow.
🎯 Whether you are a beginner or experienced researcher, TraianProt provides a fast and transparent way to extract biological insights from quantitative proteomics data.