---
title: "BiocBuildReporter Data Use Cases"
author: "BiocBuildReporter Team"
format: html
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{BiocBuildReporter Data Use Cases}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.width = 8,
    fig.height = 6,
    warning = FALSE,
    message = FALSE
)
```

# Background

The BiocBuildReporter package provides access to years of Bioconductor build
system data, representing a comprehensive record of package builds across:

- **Thousands of packages** in the Bioconductor ecosystem
- **Multiple R versions** spanning several years of development
- **Multiple platforms** including Linux, macOS, and Windows
- **Different build stages** (install, build, check)

The Bioconductor build system runs regularly, testing all packages to ensure
they meet quality standards and work correctly across different platforms. This
dataset captures the results of these builds, including:

- Build status (OK, WARNING, ERROR, TIMEOUT)
- Package version information
- Git commit information
- Maintainer details
- Propagation status to the community

This vignette demonstrates how to use BiocBuildReporter functions to explore and
analyze this rich dataset to understand:

- Package build history and stability
- Platform-specific issues
- Package growth over time
- Common failure patterns

The Bioconductor Build Report Logs are being processed by
[BiocBuildDB](https://github.com/seandavi/BiocBuildDB). The BiocBuildDB creates
parquet files read by this package for analysis.

# Installation and Loading

`BiocBuildReporter` is a _Bioconductor_ package and can be installed through
`BiocManager::install()`.
```{r, install, eval = FALSE}
if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager")
BiocManager::install("BiocBuildReporter", dependencies=TRUE)
```

After the package is installed, it can be loaded into the _R_ workspace with:

```{r, library, results='hide', warning=FALSE, message=FALSE}
library(BiocBuildReporter)
```

We will also load the libraries needed to run this vignette:

```{r, setup}
library(BiocBuildReporter)
library(dplyr)
library(ggplot2)
library(tidyr)
```

# Accessing Bioconductor Build Report Data

## Getting All Available Tables

The simplest way to start is to download all available data tables. The
`get_all_bbs_tables()` function retrieves all three parquet files containing
Bioconductor build data:

```{r, get_all_tables}
# Download all available tables
# This will cache the tables for quick subsequent access
get_all_bbs_tables()
```

The function downloads three tables:

1. **build_summary**: Results of each build stage for every package
2. **info**: Package metadata including version, maintainer, and git information
3. **propagation_status**: Information about package propagation to the community

## Getting Individual Tables

You can also retrieve individual tables using `get_bbs_table()`:

```{r, get_individual_table}
# Get the build summary table
build_summary <- get_bbs_table("build_summary")

# Get the info table
info <- get_bbs_table("info")

# Get the propagation status table
propagation_status <- get_bbs_table("propagation_status")
```

Once downloaded, subsequent calls to these functions will use cached data,
making analysis much faster.

## Remote Read vs Local Download

The package can either download the files into a local BiocFileCache and read
them from there, or read the files remotely. The default is to download locally
and use the cached version of the files for analysis. To read remotely, call
`get_bbs_table` or `get_all_bbs_tables` with the argument `useLocal=FALSE`.
```{r, read_remote}
info <- get_bbs_table("info", useLocal=FALSE)
```

Bioconductor devel reports are generated daily, so cached data can get out of
date quickly. If using locally cached data, you can update/re-download the files
using the argument `updateLocal=TRUE`. The default is not to update.

```{r, updateTables}
info <- get_bbs_table("info", useLocal=TRUE, updateLocal=TRUE)
```

# Package-Specific Queries

This section shows how to use the helper functions provided by
BiocBuildReporter for package-specific analysis.

## Package Release Information

The `get_package_release_info()` function retrieves version and git information
for a package across all Bioconductor releases:

```{r, package_release_info}
# Get release information for BiocFileCache
bfc_releases <- get_package_release_info("BiocFileCache")
bfc_releases
```

This shows:

- Package versions across different Bioconductor releases
- Git branches (devel, RELEASE_3_22, etc.)
- Git commit hashes
- Last commit dates

This is useful for tracking when a package was updated in different Bioconductor
releases. It is also useful for mapping which git commit corresponds to the
currently available version in a given Bioconductor release.

## Package Build Results

The `get_package_build_results()` function retrieves the latest build status
information for a package for all available builders on a given Bioconductor
branch:

```{r, get_package_build_results}
# Get build results for BiocFileCache on branch RELEASE_3_22
get_package_build_results("BiocFileCache", branch="RELEASE_3_22")
```

This shows:

- Node (builder machine name)
- Build stage (install, build, check)
- Package version
- Build status (OK, WARNING, ERROR, TIMEOUT)
- Date of completion
- Git branch
- Git commit hashes
- Last commit dates

This is useful for showing the last known status of a package on a given release
for all active builders.
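
A table of this shape can be condensed into one row per builder to see at a
glance which machines report a problem. The chunk below is only a minimal
sketch: it uses a small made-up data frame instead of real build data, and the
column names (`node`, `stage`, `version`, `status`) as well as the builder names
are assumptions modelled on the `build_summary` table, not the documented output
of `get_package_build_results()`.

```{r, build_results_sketch}
library(dplyr)

# Toy stand-in for a build results table. The rows are invented and the
# column names are assumptions, not the package's documented schema.
toy_results <- tibble(
    node    = c("builderA", "builderA", "builderA",
                "builderB", "builderB", "builderB"),
    stage   = c("install", "build", "check",
                "install", "build", "check"),
    version = "1.0.0",
    status  = c("OK", "OK", "OK",
                "OK", "OK", "ERROR")
)

# One row per builder: number of stages, whether every stage reported OK,
# and which stages (if any) reported something other than OK
builder_overview <- toy_results |>
    group_by(node) |>
    summarise(
        n_stages       = n(),
        all_ok         = all(status == "OK"),
        problem_stages = paste(stage[status != "OK"], collapse = ", "),
        .groups = "drop"
    )

builder_overview
```

If the real output uses different column names, only the names inside
`group_by()` and `summarise()` need to change.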

## Package Error Counts

The `package_error_count()` function provides statistics on how often a package
has failed during builds:

```{r, package_error_count}
# Get error counts for BiocFileCache
bfc_errors <- package_error_count("BiocFileCache")
bfc_errors

# Filter to a specific branch
bfc_errors_release <- package_error_count("BiocFileCache",
                                          branch = "RELEASE_3_22")
bfc_errors_release

# Filter to a specific builder
bfc_errors_builder <- package_error_count("BiocFileCache",
                                          builder = "nebbiolo2",
                                          branch = "RELEASE_3_22")
bfc_errors_builder
```

This returns:

- Node (builder machine name)
- Package version
- Build stage (install, build, check)
- Total number of runs
- Total number of errors
- Git branch

For the devel branch, you can filter to the most recent version:

```{r, filter_devel_errors}
# Get devel errors
dev_errors <- package_error_count("BiocFileCache", branch = "devel")

# Filter to current devel version
dev_errors |>
    filter(version == max(version))
```

This is useful for comparing how often a package failed during a given stage
with how often it attempted that stage. It gives an overview of the frequency
of failures.

## Package Failures Over Time

The `package_failures_over_time()` function gives an overview of how long a
package has been failing on a given builder:

```{r, package_failure_over_time}
# Get failure events for BiocFileCache on nebbiolo1 and
# group events in a 24 hour period
package_failures_over_time("BiocFileCache", "nebbiolo1", 24)
```

This shows:

- Package version
- Sequential count of failure episodes
- Time of first event failure
- Time of last event failure
- Number of failures during that episode
- Stages of failures
- Status of stages

This is useful for tracking the length of a failure event to determine whether
the failure is intermittent or consistent. The grouping window is given as an
argument; in this example we used 24 hours.
This accounts for branches having different build cadences and allows
sequential builds to be grouped together.

# Exploratory Data Analysis

This section provides examples of using the report tables for broader analysis
and queries.

## Package Growth Over Time

Let's explore how the number of Bioconductor packages has grown over time:

```{r, package_growth}
# Get info table
info <- get_bbs_table("info")

# Count unique packages by branch
package_counts <- info |>
    group_by(git_branch) |>
    summarise(
        n_packages = n_distinct(Package),
        .groups = "drop"
    ) |>
    arrange(desc(n_packages))

# Display the counts
package_counts

# Visualize package counts by branch
ggplot(package_counts,
       aes(x = reorder(git_branch, n_packages), y = n_packages)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(
        title = "Number of Packages by Bioconductor Branch",
        x = "Branch",
        y = "Number of Packages"
    ) +
    theme_minimal()
```

## Build Status Distribution

Understanding the distribution of build statuses helps identify overall system
health:

```{r, build_status}
# Get build summary table
build_summary <- get_bbs_table("build_summary")

# Count build statuses
status_counts <- build_summary |>
    count(status) |>
    arrange(desc(n))

status_counts

# Visualize status distribution
ggplot(status_counts, aes(x = reorder(status, n), y = n)) +
    geom_col(aes(fill = status)) +
    scale_fill_manual(values = c(
        "OK" = "green3",
        "WARNING" = "orange",
        "ERROR" = "red",
        "TIMEOUT" = "darkred"
    )) +
    coord_flip() +
    labs(
        title = "Distribution of Build Statuses",
        x = "Status",
        y = "Count"
    ) +
    theme_minimal() +
    theme(legend.position = "none")
```

## Platform-Specific Analysis

Different platforms may have different build characteristics:

```{r, platform_analysis}
# Analyze build status by platform (node)
platform_status <- build_summary |>
    group_by(node, status) |>
    summarise(count = n(), .groups = "drop") |>
    group_by(node) |>
    mutate(
        total = sum(count),
        percentage = count / total * 100
    ) |>
    ungroup()

# Show error rates by platform
error_rates <- platform_status |>
    filter(status %in% c("ERROR", "TIMEOUT")) |>
    group_by(node) |>
    summarise(
        error_count = sum(count),
        total = first(total),
        error_rate = sum(percentage),
        .groups = "drop"
    ) |>
    arrange(desc(error_rate))

head(error_rates, 10)
```

## Build Stage Analysis

Understanding which build stage fails most often:

```{r, stage_analysis}
# Analyze failures by stage
stage_failures <- build_summary |>
    filter(status %in% c("ERROR", "TIMEOUT")) |>
    count(stage, status) |>
    arrange(desc(n))

stage_failures

# Visualize
ggplot(stage_failures, aes(x = stage, y = n, fill = status)) +
    geom_col() +
    scale_fill_manual(values = c("ERROR" = "red", "TIMEOUT" = "darkred")) +
    labs(
        title = "Build Failures by Stage",
        x = "Build Stage",
        y = "Number of Failures",
        fill = "Status"
    ) +
    theme_minimal()
```

## Most Problematic Packages

Identify the packages with the most build errors:

```{r, problematic_packages}
# Find packages with most errors
package_errors <- build_summary |>
    filter(status %in% c("ERROR", "TIMEOUT")) |>
    count(package, status) |>
    group_by(package) |>
    summarise(
        total_errors = sum(n),
        .groups = "drop"
    ) |>
    arrange(desc(total_errors))

# Top 10 packages with most errors
head(package_errors, 10)
```

## Maintainer Analysis

Analyze package maintenance patterns:

```{r, maintainer_analysis}
# Get unique packages per maintainer
maintainer_packages <- info |>
    group_by(Maintainer) |>
    summarise(
        n_packages = n_distinct(Package),
        packages = paste(unique(Package), collapse = ", "),
        .groups = "drop"
    ) |>
    arrange(desc(n_packages))

# Top maintainers by number of packages
head(maintainer_packages, 10)

# Distribution of packages per maintainer
ggplot(maintainer_packages, aes(x = n_packages)) +
    geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
    labs(
        title = "Distribution of Packages per Maintainer",
        x = "Number of Packages",
        y = "Number of Maintainers"
    ) +
    theme_minimal()
```

## Temporal Analysis

Analyze build patterns over time:

```{r, temporal_analysis}
# Analyze build patterns over time
build_summary <- build_summary |>
    mutate(
        date = as.Date(startedat),
        month = format(startedat, "%Y-%m")
    )

# Build activity by month
monthly_builds <- build_summary |>
    count(month) |>
    mutate(month_date = as.Date(paste0(month, "-01")))

ggplot(monthly_builds, aes(x = month_date, y = n)) +
    geom_line(color = "steelblue", linewidth = 1) +
    geom_point(color = "steelblue") +
    labs(
        title = "Build Activity Over Time",
        x = "Month",
        y = "Number of Builds"
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Error rate over time
monthly_errors <- build_summary |>
    group_by(month) |>
    summarise(
        total = n(),
        errors = sum(status %in% c("ERROR", "TIMEOUT")),
        error_rate = errors / total * 100,
        .groups = "drop"
    ) |>
    mutate(month_date = as.Date(paste0(month, "-01")))

ggplot(monthly_errors, aes(x = month_date, y = error_rate)) +
    geom_line(color = "red", linewidth = 1) +
    geom_point(color = "red") +
    labs(
        title = "Build Error Rate Over Time",
        x = "Month",
        y = "Error Rate (%)"
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

# Bioconductor Report Overview

This section demonstrates how to use BiocBuildReporter helper functions for
broader analysis.

## Get Bioconductor Build Report

The `get_build_report()` function retrieves the Bioconductor Build Report for
any given day. Optionally, you may specify a Bioconductor git branch or a build
machine name to filter further.
```{r, get_build_report}
# Retrieves the build report for all packages on December 29, 2025
# Filtering also for RELEASE_3_22 branch and linux "nebbiolo2" build machine
get_build_report("2025-12-29", branch="RELEASE_3_22", builder="nebbiolo2")
```

This shows:

- Package name
- Node (builder machine name)
- Build stage (install, build, check)
- Package version
- Status of stage (OK, WARNING, ERROR, TIMEOUT)
- Time started
- Time completed
- Command used to initiate the run
- Report md5 sum
- Git branch
- Git commit hashes
- Last commit dates

This function is meant to programmatically reproduce the daily report for any
given day, builder, and branch.

## Get List of Failing Packages

The `get_failing_packages()` function returns a list of all the currently
failing packages for a given branch and build machine name:

```{r, get_failing_packages}
# returns all failing packages for RELEASE_3_22 branch
# for build machine nebbiolo2
get_failing_packages("RELEASE_3_22", "nebbiolo2")
```

This shows:

- Git branch
- Package name
- Package version
- Node (builder machine name)
- Build stages (install, build, check)
- Build statuses (OK, WARNING, ERROR, TIMEOUT)

This gives a quick list of all failing packages. When querying the currently
active Bioconductor branches (see `get_latest_branches()`), maintainers of these
packages should be contacted to fix their packages to avoid deprecation and
removal from Bioconductor.

# Conclusion

The BiocBuildReporter package provides powerful tools for analyzing Bioconductor
build system data. This vignette demonstrated:

1. **Data Access**: Using `get_bbs_table()` and `get_all_bbs_tables()` to
   retrieve build data
2. **Package-Specific Queries**: Using `get_package_release_info()`,
   `get_package_build_results()`, `package_error_count()` and
   `package_failures_over_time()` to analyze individual packages
3. **Exploratory Analysis**: Examining package growth, build statuses, platform
   differences, and temporal patterns
4. **Bioconductor Report Overview**: Using `get_build_report()` and
   `get_failing_packages()`

This dataset can help package developers, maintainers, and the Bioconductor
community to:

- Monitor package build health
- Identify platform-specific issues
- Track package evolution over time
- Understand common failure patterns
- Improve package quality and reliability

For more information about specific functions, see their documentation with
`?function_name`.

```{r, sessionInfo}
sessionInfo()
```