OpenStats is a freely available R package that presents statistical methods and detailed analyses to promote the hard process of identification of abnormal phenotypes. The package incorporates several checks and cleaning on the input data prior to the statistical analysis. For continuous data, Linear Mixed Model with an optional model selection routine is implemented, whilst for categorical data, Fisher’s Exact Test is implemented. For cases where the linear mixed model fails, Reference Range Plus method has been employed for a quick, simple analysis of the continuous data. User can perform inspections and diagnostics of the final fitted model by the visualisation tools that come with the software. Furthermore, the user can export/report the outputs in the form of either standard R list or JavaScript Object Notation (JSON). OpenStats has been tested and demonstrated with an application of \(2.5M+\) analyses from the Internationa Mouse Phenotyping Consortium (IMPC).

The User’s Guide with more details about the statistical analysis is available as part of the online documentation from https://rpubs.com/hamedhm/openstats. Project Github repository including dev version of the package is available on https://git.io/JeOVN.

OpenStats can be installed using the standard R package installation routin:

R code here

0.1 Building block of the software

OpenStats consists of one input layer and three operational layers:

0.2 Data preprocessing

OpenStatsList function performs data processing and creates an OpenStatsList object. As input, OpenStatsList function requires dataset of phenotypic data that can be presented as data frame. For instance, it can be dataset stored in csv, tsv or txt file. Data is organised with rows and columns for samples and features respectively. Following shows an example of the input data where rows and columns represent mice and features (mouse id, treatment group, gender, age of animal in days):

library(OpenStats)
## Loading required package: nlme
## 
##  >===============================================================================<
##  OpenStats is developed by International Mouse Phenotyping Consortium (IMPC) 
##  More details           : https://www.mousephenotype.org/                         
##  Source code and issues : https://git.io/Jv5w0                                    
##  Contact us             : hamedhm@ebi.ac.uk                                       
##  >===============================================================================<
###################
# Data preparation
###################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
read.csv(fileCon, as.is = TRUE)[60:75, c(
  "external_sample_id",
  "biological_sample_group",
  "sex",
  "age_in_days"
)]
##    external_sample_id biological_sample_group    sex age_in_days
## 60             C10058                 control female          53
## 61             C10059                 control female          53
## 62             C10060                 control female          53
## 63             C10061                 control female          53
## 64             C10062                 control female          53
## 65             C10063                 control   male          53
## 66             C10064                 control   male          53
## 67             C10065                 control   male          53
## 68             C10066                 control   male          53
## 69             C10067                 control   male          53
## 70             C10192            experimental   male          46
## 71             C10193            experimental   male          46
## 72             C10194            experimental   male          46
## 73             C10195            experimental   male          46
## 74             C10197            experimental   male          46
## 75             C10199            experimental   male          46

The main preprocessing tasks performed by the OpenStatsList function are:

We define “terminology unification” as the terminology used to describe data (variables) that are essential for the analysis. OpenStats package uses the following nomenclature for the names of columns: “Genotype”, the only mandatory variable, “Sex”, “Batch” “LifeStage” and “Weight”. In addition, expected (default) Sex, LifeStage values are “Male/Female” and “Early/Late” respectively. However, the user can define the custom levels by setting dataset.values.male, dataset.values.female, dataset.values.early and dataset.values.late in the OpenStatsList function. Missing value is specified by dataset.values.missingValue argument and set to NA.

The statistical analysis requires exactly two “Genotype” groups for comparison (e.g. wild-type versus knockout). Thus the function OpenStatsList requires users to define the reference genotype (mandatory argument refGenotype with default value “control”) and test genotype (mandatory argument testGenotype), defaulted to “experimental”. If the OpenStatsList function argument dataset.clean is set to TRUE then all records with genotype values others than reference or test genotype are filtered out.

All tasks in OpenStats are accompanied by step-by-step reports, error messages, warnings and/or other useful information about the progress of the function. If messages are not desirable, OpenStatsList function’s argument debug can be set to FALSE meaning there will be no messages.

The chunk of code below demonstrates an example of using OpenStatsList when the user sets out-messages to TRUE/FALSE:

#######################################
# Default behaviour with messages
#######################################
library(OpenStats)
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex"
)
## 2023-10-24 18:14:48.098825. Input data of the dimensions, rows = 410, columns = 75
## 2023-10-24 18:14:48.100416. Checking the input data in progress ...
## 2023-10-24 18:14:48.102344. Checking the specified missing values [x2] (` `, ``) ...
## 2023-10-24 18:14:48.10307.   1/2. Checking (` `) ...
## 2023-10-24 18:14:48.113008.  2/2. Checking (``) ...
## 2023-10-24 18:14:48.124811. Checking whether variable `biological_sample_group` exists in the data ... 
## 2023-10-24 18:14:48.124811.  Result = TRUE
## 2023-10-24 18:14:48.127629.   Levels (Total levels = 2, missings = 0%): 
## 2023-10-24 18:14:48.127629.    1. control
## 2023-10-24 18:14:48.127629.    2. experimental
## 2023-10-24 18:14:48.128577. Checking whether variable `sex` exists in the data ... 
## 2023-10-24 18:14:48.128577.  Result = TRUE
## 2023-10-24 18:14:48.130103.   Levels (Total levels = 2, missings = 0%): 
## 2023-10-24 18:14:48.130103.    1. female
## 2023-10-24 18:14:48.130103.    2. male
## 2023-10-24 18:14:48.130733. Checking whether variable `date_of_experiment` exists in the data ... 
## 2023-10-24 18:14:48.130733.  Result = TRUE
## 2023-10-24 18:14:48.132505.   Levels (Total levels = 43, missings = 0%): 
## 2023-10-24 18:14:48.132505.    1. 2012-07-23T00:00:00Z
## 2023-10-24 18:14:48.132505.    2. 2012-07-30T00:00:00Z
## 2023-10-24 18:14:48.132505.    3. 2012-08-06T00:00:00Z
## 2023-10-24 18:14:48.132505.    4. 2012-08-13T00:00:00Z
## 2023-10-24 18:14:48.132505.    5. 2012-08-20T00:00:00Z
## 2023-10-24 18:14:48.132505.    6. 2012-11-26T00:00:00Z
## 2023-10-24 18:14:48.132505.    7. 2012-12-24T00:00:00Z
## 2023-10-24 18:14:48.132505.    8. 2013-01-02T00:00:00Z
## 2023-10-24 18:14:48.132505.    9. 2013-01-15T00:00:00Z
## 2023-10-24 18:14:48.132505.    10. 2013-01-21T00:00:00Z
## 2023-10-24 18:14:48.132505.    ...
## 2023-10-24 18:14:48.133149. Checking whether variable `weight` exists in the data ... 
## 2023-10-24 18:14:48.133149.  Result = TRUE
## 2023-10-24 18:14:48.134735.   Summary:
## 2023-10-24 18:14:48.134735.    mean      = 20.0036585365854
## 2023-10-24 18:14:48.134735.    sd        = 2.63972182732584
## 2023-10-24 18:14:48.134735.    Missings  = 0%
## 2023-10-24 18:14:48.135936. Variable `biological_sample_group` renamed to `Genotype`
## 2023-10-24 18:14:48.13665. Variable `sex` renamed to `Sex`
## 2023-10-24 18:14:48.137269. Variable `date_of_experiment` renamed to `Batch`
## 2023-10-24 18:14:48.137873. Variable `weight` renamed to `Weight`
## 2023-10-24 18:14:48.154445. Total samples in Genotype:Sex interactions: 
## 2023-10-24 18:14:48.154445.   Level(frequency): 
## 2023-10-24 18:14:48.154445.    1. control.Female(196)
## 2023-10-24 18:14:48.154445.    2. experimental.Female(5)
## 2023-10-24 18:14:48.154445.    3. control.Male(201)
## 2023-10-24 18:14:48.154445.    4. experimental.Male(8)
## 2023-10-24 18:14:48.161093. Total `Weight` data points for Genotype:Sex interactions: 
## 2023-10-24 18:14:48.161093.   Level(frequency): 
## 2023-10-24 18:14:48.161093.    1. control.Female(196),
## 2023-10-24 18:14:48.161093.    2. experimental.Female(5),
## 2023-10-24 18:14:48.161093.    3. control.Male(201),
## 2023-10-24 18:14:48.161093.    4. experimental.Male(8)
## 2023-10-24 18:14:48.162813. Successfully performed checks in 0.06 second(s).
#######################################
# OpenStatsLis behaviour without messages
#######################################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
# No output printed

0.2.1 OpenStatsList Object

The output of the OpenStatsList function is the OpenStatsList object that contains a cleaned dataset as well as a copy of the original dataset. OpenStats allows plot and summary/print of the OpenStatList object. Below is an example of the OpenStatsList function accompanied by the plot and summary:

library(OpenStats)
df <- read.csv(system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
))
OpenStatsList <- OpenStatsList(
  dataset = df,
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.sex = "sex",
  dataset.colname.weight = "weight",
  debug = FALSE
)
p <- plot(OpenStatsList, vars = c("Sex", "Genotype", "data_point"), ask = TRUE)
## 2023-10-24 18:14:48.405864. Working on the plot ...
# Plot categorical variables
p$Categorical

# plot continuous variable
p$Continuous

summary(OpenStatsList,
  style = "grid",
  varnumbers = FALSE, # See more options ?summarytools::dfSummary
  graph.col = FALSE, # Do not show the graph column
  valid.col = FALSE,
  vars = c("Sex", "Genotype", "data_point")
)
## Loading required namespace: summarytools
## 2023-10-24 18:14:50.506418. Working on the summary table ...
## Data Frame Summary  
## Dimensions: 410 x 3  
## Duplicates: 0  
## 
## +------------+----------------------+---------------------+---------+
## | Variable   | Stats / Values       | Freqs (% of Valid)  | Missing |
## +============+======================+=====================+=========+
## | Sex        | 1. Female            | 201 (49.0%)         | 0       |
## | [factor]   | 2. Male              | 209 (51.0%)         | (0.0%)  |
## +------------+----------------------+---------------------+---------+
## | Genotype   | 1. control           | 397 (96.8%)         | 0       |
## | [factor]   | 2. experimental      |  13 ( 3.2%)         | (0.0%)  |
## +------------+----------------------+---------------------+---------+
## | data_point | Mean (sd) : 11 (1.7) | 408 distinct values | 0       |
## | [numeric]  | min < med < max:     |                     | (0.0%)  |
## |            | 7 < 10.9 < 16.7      |                     |         |
## |            | IQR (CV) : 2.6 (0.2) |                     |         |
## +------------+----------------------+---------------------+---------+

OpenStatsList object stores many characteristics of the data, for instance, reference genotype, test genotype, original column names, factor levels etc.

0.3 Data Analysis

OpenStats package contains three statistical frameworks for the phenodeviants identification:

OpenStats’s function OpenStatsAnalysis works as a hub for the different statistical analysis methods. It checks the dependent variable, the data, missings, not proper terms in the model (such as terms that do not exist in the input data) and runs the selected statistical analysis framework and returns modelling testing results. All analysis frameworks output a statistical significance measure, effect size measure, model diagnostics, and graphical visualisations.

Here we explain the main bits of the OpenStatsAnalysis function:

The possible values for the method arguments are “MM” which stands for mixed model framework, “FE” to perform Fisher’s exact test model and “RR” for Reference Range Plus framework. The semantic naming in the input arguments of the OpenStatsAnalysis function allows natural distinction of the input arguments For example, \(MM\_\), \(RR\_\) and \(FE\_\) prefixes represent the arguments that can be set in the corresponging frameworks. Having said that,

The OpenStatsAnalysis function performs basic checks to ensure that the data and model match, the model is feasible for the type of the data and reports step-by-step progress of the function. Some of the checks and operations are listed below:

All frameworks are equipped with the step-by-step report of the progress of the function. Warnings, errors and messages are reported to the user. In the situation where the function encounters a critical failure, then the output object contains a slot called \(messages\) that reports back the cause of the failure.

0.3.1 OpenStatsAnalysis output object

OpenStatsAnalysis output consists of three elements namely, input, output and extra. The input object encapsulate the input parameters to the function, output hold the analysis results and the extra keeps some extra processes on the data/model. Below is an example output from the Reference Rage plus framework:

library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# Reference range framework
#################
RR_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "RR",
  RR_formula = data_point ~ Genotype + Sex,
  debug = FALSE
)
lapply(RR_result, names)
## $output
## [1] "SplitModels"
## 
## $input
##  [1] "OpenStatsList"     "data"              "depVariable"      
##  [4] "rep"               "method"            "formula"          
##  [7] "prop"              "ci_level"          "refLevel"         
## [10] "full_comparisions"
## 
## $extra
## [1] "Cleanedformula"    "TransformedRRprop"
# lapply(RR_result$output,names)

0.4 Examples

In this section, we show some examples of the functionalities in OpenStats for the continuous and categorical data. Each section contains the code and different possible scenarios.

0.4.1 Linear mixed model framework

The linear mixed model framework applies to continuous data. In this example, data is extracted from the sample data that accompany the software. Here, “Genotype” is the effect of interest. The response is stored in the variable “data_point” and genotype (Genotype) and body weight (Weight) are covariates. The model selection is left to the default, stepwise, and between-group covariance structure are assumes proportional to the genotype levels (different variation for controls than mutants):

library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# LinearMixed model (MM) framework
#################
MM_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "MM",
  MM_fixed = data_point ~ Genotype + Weight
)
## 2023-10-24 18:14:51.957518. OpenStats loaded.
## 2023-10-24 18:14:51.960451. Checking the input model for including functions ...
## 2023-10-24 18:14:51.961292. Linear Mixed Model (MM framework) in progress ...
## 2023-10-24 18:14:51.966617. Removing possible leading/trailing whitespace from the variables in the formula ...
## 2023-10-24 18:14:51.96984. Checking duplications in the data model:
## 2023-10-24 18:14:51.96984.    Genotype, data_point, Weight
## 2023-10-24 18:14:51.973286.  No duplicate found.
## 2023-10-24 18:14:51.976488. Checking for the feasibility of terms and interactions ...
## 2023-10-24 18:14:51.976488.   Formula: data_point ~ Genotype + Weight
## 2023-10-24 18:14:51.978122.  1 of 1. Checking for the feasibility of terms and interactions ...
## 2023-10-24 18:14:51.979033.       Checking Genotype ...
## 2023-10-24 18:14:51.982595.  Checked model: data_point ~ Genotype + Weight
## 2023-10-24 18:14:51.983237. Check missings in progress ...
## 2023-10-24 18:14:51.984065.   Missings in variable `data_point`: 0%
## 2023-10-24 18:14:51.984807.   Missings in variable `Genotype`: 0%
## 2023-10-24 18:14:51.985456.   Missings in variable `Weight`: 0%
## 2023-10-24 18:14:51.987065. Checking the random effect term ...
## 2023-10-24 18:14:51.987065.  Formula: ~1 | Batch
## 2023-10-24 18:14:51.987757. Lme: Fitting the full model ...
## 2023-10-24 18:14:51.988643.  Applied model: lme
## 2023-10-24 18:14:52.096572.  The full model successfully applied.
## 2023-10-24 18:14:52.098452.  Computing the confidence intervals at the level of 0.95 ...
## 2023-10-24 18:14:52.106749.   CI for `all` term(s) successfully estimated
## 2023-10-24 18:14:52.108151. The specified "lower" model: 
## 2023-10-24 18:14:52.108151.  ~Genotype + 1
## 2023-10-24 18:14:52.109949. The model optimisation is in progress ...
## 2023-10-24 18:14:52.110578.  The direction of the optimisation (backward, forward, both): both
## 2023-10-24 18:14:52.111109.  Optimising the model ...
## 2023-10-24 18:14:52.502702.  Optimised model: data_point ~ Genotype + Weight
## 2023-10-24 18:14:52.580123.  Computing the confidence intervals at the level of 0.95 ...
## 2023-10-24 18:14:52.586769.   CI for `all` term(s) successfully estimated
## 2023-10-24 18:14:52.587892. Testing varHom ...
## 2023-10-24 18:14:52.626836.  Computing the confidence intervals at the level of 0.95 ...
## 2023-10-24 18:14:52.632398.   CI for `all` term(s) successfully estimated
## 2023-10-24 18:14:52.770735.  VarHom checked out ...
## 2023-10-24 18:14:52.772067. Testing Batch ...
## 2023-10-24 18:14:52.782584.  Computing the confidence intervals at the level of 0.95 ...
## 2023-10-24 18:14:52.786616.   CI for `all` term(s) successfully estimated
## 2023-10-24 18:14:52.879178. Estimating effect sizes ...
## 2023-10-24 18:14:53.160796.  Total effect sizes estimated: 2
## 2023-10-24 18:14:53.162056. Quality tests in progress ...
## 2023-10-24 18:14:53.169775. MM framework executed in 1.2 second(s).

0.4.1.1 Sub-model estimation

OpenStats allows fitting submodels from an input model. This is called Split model effects in the outputs and it is mainly useful for reporting sex/age-specific etc. effects. This is performed by creating submodels of a full model. For instance, for the input fixed effect, MM_fixed, model \(Response\sim Genotype+Sex+Weight\) a possible submodel is \(Response \sim Sex+Sex:Genotype + Weight\) that can be used to estimate sex-specific effects for genotype. This model is then estimated under the configuration of the optimal model. One can turn off Split model effects by setting the fourth element of “MM_optimise” to FALSE.

An alternative to the analytically estimating the sub-models is to break the input data into splits and run the model on the subset of the data. This can be performed by passing the output of the OpenStatsAnalysis function, OpenStatsMM, to the function, OpenStatsComplementarySplit. This function allows the OpenStatsMM object as input and a set of variable names that split the data. The output is stored in an OpenStatsComplementarySplit object. The example below shows a split on “Sex”:

library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# LinearMixed model (MM) framework
#################
MM_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "MM",
  MM_fixed = data_point ~ Genotype + Weight,
  debug = FALSE
)
# SplitEffect estimation with respect to the Sex levels
Spliteffect <- OpenStatsComplementarySplit(
  object = MM_result,
  variables = "Sex"
)
## 2023-10-24 18:14:54.355132. Split effects in progress ...
## 2023-10-24 18:14:54.356454. Variable(s) to split:
## 2023-10-24 18:14:54.356454.  Sex
## 2023-10-24 18:14:54.357448.  Splitting in progress ...
## 2023-10-24 18:14:54.358546.  Spliting on Sex ...
## 2023-10-24 18:14:54.365095. Processing the levels: Female
## 2023-10-24 18:14:55.251284. [Successful]
## 2023-10-24 18:14:55.252255. Processing the levels: Male
## 2023-10-24 18:14:55.948624. [Successful]
class(Spliteffect)
## [1] "OpenStatsComplementarySplit"

0.4.2 Reference range plus framework

Reference range plus framework applies to continuous data. In this example, data is extracted from the sample data that accompany the software. Here, “Genotype” is the effect of interest. The response is stored in the variable “data_point” and genotype (Genotype) and sex (Sex) are covariates.

library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# Reference range framework
#################
RR_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "RR",
  RR_formula = data_point ~ Genotype + Sex
)
## 2023-10-24 18:14:56.073221. OpenStats loaded.
## 2023-10-24 18:14:56.07547. Checking the input model for including functions ...
## 2023-10-24 18:14:56.076148. Reference Range Plus (RR framework) in progress ...
## 2023-10-24 18:14:56.076693. Optimisation level:
## 2023-10-24 18:14:56.077214.  Estimation of all factor combination effects = TRUE
## 2023-10-24 18:14:56.077729.  Estimation of inter level factors for the response = FALSE
## 2023-10-24 18:14:56.07895. The input formula: data_point ~ Genotype + Sex
## 2023-10-24 18:14:56.079685. The reformatted formula for the algorithm: ~data_point + Genotype + Sex
## 2023-10-24 18:14:56.080468. The probability of the middle area in the distribution: 0.95
## 2023-10-24 18:14:56.081102.   Tails probability: 0.025
## 2023-10-24 18:14:56.081102.   Formula to calculate the tail probabilities: 1-(1-x)/2, (1-x)/2 where x = 0.95
## 2023-10-24 18:14:56.081764. Discritizing the continuous data into discrete levels. The quantile = 0.975
## 2023-10-24 18:14:56.082257. Stp 1. Low versus Normal/High
## 2023-10-24 18:14:56.083044. Removing possible leading/trailing whitespace from the variables in the formula ...
## 2023-10-24 18:14:56.086004. Preparing the reference ranges ...
## 2023-10-24 18:14:56.086514. Preparing the data for the variable: Genotype
## 2023-10-24 18:14:56.087167. Reference level is set to `control`
## 2023-10-24 18:14:56.093232.  Initial quantiles for cutting the data 
## 2023-10-24 18:14:56.093232.           Probs: 0, 0.025, 1
## 2023-10-24 18:14:56.093232.           N.reference: 397
## 2023-10-24 18:14:56.093232.           Quantiles: 6.04, 8.08, 17.73
## 2023-10-24 18:14:56.094621.   Detected percentiles in the data (8 decimals): Low = 0.02518892, NormalHigh = 0.97481108
## 2023-10-24 18:14:56.100899.  Spliting on Sex ...
## 2023-10-24 18:14:56.10251. Preparing the data for the combined effect: Female
## 2023-10-24 18:14:56.103271. Reference level is set to `control`
## 2023-10-24 18:14:56.105651.  Initial quantiles for cutting the data 
## 2023-10-24 18:14:56.105651.           Probs: 0, 0.025, 1
## 2023-10-24 18:14:56.105651.           N.reference: 196
## 2023-10-24 18:14:56.105651.           Quantiles: 7.579, 9.267, 17.73
## 2023-10-24 18:14:56.106885.   Detected percentiles in the data (8 decimals): Low = 0.0255102, NormalHigh = 0.9744898
## 2023-10-24 18:14:56.109588. Preparing the data for the combined effect: Male
## 2023-10-24 18:14:56.110445. Reference level is set to `control`
## 2023-10-24 18:14:56.113084.  Initial quantiles for cutting the data 
## 2023-10-24 18:14:56.113084.           Probs: 0, 0.025, 1
## 2023-10-24 18:14:56.113084.           N.reference: 201
## 2023-10-24 18:14:56.113084.           Quantiles: 6.04, 7.673, 15.645
## 2023-10-24 18:14:56.114396.   Detected percentiles in the data (8 decimals): Low = 0.02985075, NormalHigh = 0.97014925
## 2023-10-24 18:14:56.11703. Stp 2. Low/Normal versus High
## 2023-10-24 18:14:56.117902. Removing possible leading/trailing whitespace from the variables in the formula ...
## 2023-10-24 18:14:56.120965. Preparing the reference ranges ...
## 2023-10-24 18:14:56.121514. Preparing the data for the variable: Genotype
## 2023-10-24 18:14:56.12222. Reference level is set to `control`
## 2023-10-24 18:14:56.128418.  Initial quantiles for cutting the data 
## 2023-10-24 18:14:56.128418.           Probs: 0, 0.975, 1
## 2023-10-24 18:14:56.128418.           N.reference: 397
## 2023-10-24 18:14:56.128418.           Quantiles: 6.04, 14.5, 17.73
## 2023-10-24 18:14:56.129959.   Detected percentiles in the data (8 decimals): LowNormal = 0.97481108, High = 0.02518892
## 2023-10-24 18:14:56.136281.  Spliting on Sex ...
## 2023-10-24 18:14:56.138213. Preparing the data for the combined effect: Female
## 2023-10-24 18:14:56.139291. Reference level is set to `control`
## 2023-10-24 18:14:56.142023.  Initial quantiles for cutting the data 
## 2023-10-24 18:14:56.142023.           Probs: 0, 0.975, 1
## 2023-10-24 18:14:56.142023.           N.reference: 196
## 2023-10-24 18:14:56.142023.           Quantiles: 7.579, 14.744, 17.73
## 2023-10-24 18:14:56.143299.   Detected percentiles in the data (8 decimals): LowNormal = 0.9744898, High = 0.0255102
## 2023-10-24 18:14:56.145651. Preparing the data for the combined effect: Male
## 2023-10-24 18:14:56.146352. Reference level is set to `control`
## 2023-10-24 18:14:56.148688.  Initial quantiles for cutting the data 
## 2023-10-24 18:14:56.148688.           Probs: 0, 0.975, 1
## 2023-10-24 18:14:56.148688.           N.reference: 201
## 2023-10-24 18:14:56.148688.           Quantiles: 6.04, 13.527, 15.645
## 2023-10-24 18:14:56.149927.   Detected percentiles in the data (8 decimals): LowNormal = 0.97014925, High = 0.02985075
## 2023-10-24 18:14:56.152386. Fisher exact test with 1500 iteration(s) in progress ...
## 2023-10-24 18:14:56.152928. Analysing Low vs NormalHigh ...
## 2023-10-24 18:14:56.300921. Analysing LowNormal vs High ...
## 2023-10-24 18:14:56.421095. RR framework executed in 0.34 second(s).

0.4.3 Fisher’s exact test framework

Fisher’s Exact test framework applies to categorical data. In this example, data is extracted from the sample data that accompany the software. Here, Genotype is the effect of interest. The response is stored in the variable category and Genotype and Sex are the covariates.

library(OpenStats)
#################
# Categorical data - Creating OpenStatsList object
#################
fileCat <- system.file("extdata", "test_categorical.csv",
  package = "OpenStats"
)
test_Cat <- OpenStatsList(
  dataset = read.csv(fileCat, na.strings = "-"),
  testGenotype = "Aff3/Aff3",
  refGenotype = "+/+",
  dataset.colname.genotype = "Genotype",
  dataset.colname.batch = "Assay.Date",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "Weight",
  dataset.colname.sex = "Sex",
  debug = FALSE
)
#################
# Fisher's exact test framework
#################
FE_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cat,
  method = "FE",
  FE_formula = Thoracic.Processes ~ Genotype + Sex
)
## 2023-10-24 18:14:56.544126. OpenStats loaded.
## 2023-10-24 18:14:56.545573. Checking the input model for including functions ...
## 2023-10-24 18:14:56.545985. Fisher Exact Test (FE framework) in progress ...
## 2023-10-24 18:14:56.546284. Optimisation level:
## 2023-10-24 18:14:56.54657.   Estimation of all factor combination effects = TRUE
## 2023-10-24 18:14:56.546849.  Estimation of inter level factors for the response = FALSE
## 2023-10-24 18:14:56.547566. The input formula: Thoracic.Processes ~ Genotype + Sex
## 2023-10-24 18:14:56.547988. The reformatted formula for the algorithm: ~Thoracic.Processes + Genotype + Sex
## 2023-10-24 18:14:56.548317.  Top framework: FE
## 2023-10-24 18:14:56.548638. Fisher exact test with 1500 iteration(s) in progress ...
## 2023-10-24 18:14:56.548979. Check missings in progress ...
## 2023-10-24 18:14:56.549397.   Missings in variable `Thoracic.Processes`: 0.32%
## 2023-10-24 18:14:56.549779.   Missings in variable `Genotype`: 0%
## 2023-10-24 18:14:56.550147.   Missings in variable `Sex`: 0%
## 2023-10-24 18:14:56.550504. Removing possible leading/trailing whitespace from the variables in the formula ...
## 2023-10-24 18:14:56.552436. Step 1. Testing "Thoracic.Processes"
## 2023-10-24 18:14:56.553178. The data (variable(s) = Thoracic.Processes) contain 2 missing(s) ...
## 2023-10-24 18:14:56.553178.  Missing data removed
## 2023-10-24 18:14:56.554032.  The response variable `Thoracic.Processes` is not categorical.
## 2023-10-24 18:14:56.554354. Step 2. Testing "Genotype"
## 2023-10-24 18:14:56.555463.  Testing for the main effect: Sex
## 2023-10-24 18:14:56.558934. Total tested categories = 1: Genotype
## 2023-10-24 18:14:56.55929.   Total tests =  1
## 2023-10-24 18:14:56.559677. FE framework  executed in 0.01 second(s).

0.5 Summary and export

OpenStats package stores the input data in OpenStatsList and the results of statistical analyses in the OpenStatsMM/RR/FE or OpenStatsComplementarySplit object. The standard summary/print function applies to print off a summary table. The summary table encompasses:

The function OpenStatsReport can be used to create a table of detailed summary from OpenStatsMM/RR/FE object in the form of either list or JSON. The following is an example of the summary output of the liner mixed model framework.

library(OpenStats)
#################
# Data preparation
#################
#################
# Continuous data - Creating OpenStatsList object
#################
fileCon <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
test_Cont <- OpenStatsList(
  dataset = read.csv(fileCon),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.lifestage = NULL,
  dataset.colname.weight = "weight",
  dataset.colname.sex = "sex",
  debug = FALSE
)
#################
# LinearMixed model (MM) framework
#################
MM_result <- OpenStatsAnalysis(
  OpenStatsList = test_Cont,
  method = "MM",
  MM_fixed = data_point ~ Genotype + Weight,
  debug = FALSE
)
summary(MM_result)
## 2023-10-24 18:14:57.039101. Working out the summary table ...
## 
## 
## =============================  =============================================================================================
## Statistic                      Value                                                                                        
## =============================  =============================================================================================
## Applied framework              Linear Mixed Model framework, LME, including Weight                                          
## Final model                    data_point ~ Genotype + Weight                                                               
## ............................   ............................                                                                 
## Tested Gene                    experimental                                                                                 
## Reference Gene                 control                                                                                      
## ............................   ............................                                                                 
## Sexual dimorphism detected?    FALSE, Genotype-Sex interaction is not part of the input (it is not part of the final) model.
## ............................   ............................                                                                 
## Genotype contribution overall  0.343064968220182                                                                            
## Genotype contribution Females  -                                                                                            
## Genotype contribution Males    -                                                                                            
## ............................   ............................                                                                 
## LifeStage contribution         -                                                                                            
## Genotype contribution Early    -                                                                                            
## Genotype contribution Late     -                                                                                            
## ............................   ............................                                                                 
## Sex contribution               -                                                                                            
## Body weight contribution       0                                                                                            
## =============================  =============================================================================================

OpenStatsReport function was developed for large scale application where automatic implementation is require. Following is the JSON output of the function from an OpenStatsMM object (cut to the first 1500 charachters):

strtrim(
  OpenStatsReport(
    object = MM_result,
    JSON = TRUE,
    RemoveNullKeys = TRUE,
    pretty = TRUE
  ),
  1500
)
## {
##   "Applied method": "Linear Mixed Model framework, LME, including Weight",
##   "Dependent variable": "data_point",
##   "Batch included": true,
##   "Residual variances homogeneity": false,
##   "Genotype contribution": {
##     "Overall": 0.343064968220182,
##     "Sexual dimorphism detected": {
##       "Criteria": false,
##       "Note": "Genotype-Sex interaction is not part of the input (it is not part of the final) model. "
##     }
##   },
##   "Genotype estimate": {
##     "Value": -0.502512881011737,
##     "Confidence": {
##       "Genotypeexperimental lower": -1.54340604650595,
##       "Genotypeexperimental upper": 0.538380284482477
##     },
##     "Level": 0.95
##   },
##   "Genotype standard error": 0.529316714411108,
##   "Genotype p-value": 0.343064968220182,
##   "Genotype percentage change": {
##     "control Genotype": -4.66480561595459,
##     "experimental Genotype": 8.36889856564581
##   },
##   "Genotype effect size": {
##     "Value": -0.572742110247569,
##     "Variable": "Genotype",
##     "Model": "data_point ~ Genotype",
##     "Type": "Mean differences. Formula = (mu_treatment - mu_control)/sd(residuals)",
##     "Percentage change": {
##       "control Genotype": -4.66480561595459,
##       "experimental Genotype": 8.36889856564581
##     }
##   },
##   "Weight estimate": {
##     "Value": -0.390270185750165,
##     "Confidence": {
##       "Weight lower": -0.432146253284604,
##       "Weight upper": -0.348394118215727
##     },
##     "Level": 0.95
##   },
##   "Weight standard error": 0.0212948871359552,
##   "Weight p-value": 0,
##   "Weight effect size": {
##     "Value": -0.598034137811873,
##     "Variable": "Weight",
##     "

0.6 Graphics

Graphics in OpenStats are as easy as calling the plot() function on a OpenStatsList or the OpenStatsMM/FE/RR object. Calling the plot function on the OpenStatsList object is shown below:

library(OpenStats)
###################
file <- system.file("extdata", "test_continuous.csv",
  package = "OpenStats"
)
###################
# OpenStatsList object
###################
OpenStatsList <- OpenStatsList(
  dataset = read.csv(file),
  testGenotype = "experimental",
  refGenotype = "control",
  dataset.colname.batch = "date_of_experiment",
  dataset.colname.genotype = "biological_sample_group",
  dataset.colname.sex = "sex",
  dataset.colname.weight = "weight",
  debug = FALSE
)
plot(OpenStatsList)
## 2023-10-24 18:14:57.28305. Working on the plot ...

summary(
  OpenStatsList,
  style     = "grid",
  varnumbers = FALSE, # See more options ?summarytools::dfSummary
  graph.col = FALSE, # Do not show the graph column
  valid.col = FALSE
)
## 2023-10-24 18:14:57.429632. Working on the summary table ...
## Data Frame Summary  
## Dimensions: 410 x 6  
## Duplicates: 320  
## 
## +--------------------+------------------------------+--------------------+---------+
## | Variable           | Stats / Values               | Freqs (% of Valid) | Missing |
## +====================+==============================+====================+=========+
## | Genotype           | 1. control                   | 397 (96.8%)        | 0       |
## | [factor]           | 2. experimental              |  13 ( 3.2%)        | (0.0%)  |
## +--------------------+------------------------------+--------------------+---------+
## | Sex                | 1. Female                    | 201 (49.0%)        | 0       |
## | [factor]           | 2. Male                      | 209 (51.0%)        | (0.0%)  |
## +--------------------+------------------------------+--------------------+---------+
## | Batch              | 1. 2012-07-23T00:00:00Z      |  10 ( 2.4%)        | 0       |
## | [factor]           | 2. 2012-07-30T00:00:00Z      |  10 ( 2.4%)        | (0.0%)  |
## |                    | 3. 2012-08-06T00:00:00Z      |  10 ( 2.4%)        |         |
## |                    | 4. 2012-08-13T00:00:00Z      |  10 ( 2.4%)        |         |
## |                    | 5. 2012-08-20T00:00:00Z      |   9 ( 2.2%)        |         |
## |                    | 6. 2012-11-26T00:00:00Z      |  10 ( 2.4%)        |         |
## |                    | 7. 2012-12-24T00:00:00Z      |  10 ( 2.4%)        |         |
## |                    | 8. 2013-01-02T00:00:00Z      |  10 ( 2.4%)        |         |
## |                    | 9. 2013-01-15T00:00:00Z      |   9 ( 2.2%)        |         |
## |                    | 10. 2013-01-21T00:00:00Z     |   4 ( 1.0%)        |         |
## |                    | [ 33 others ]                | 318 (77.6%)        |         |
## +--------------------+------------------------------+--------------------+---------+
## | age_in_weeks       | Mean (sd) : 7.5 (0.6)        | 6 :   8 ( 2.0%)    | 0       |
## | [integer]          | min < med < max:             | 7 : 216 (52.7%)    | (0.0%)  |
## |                    | 6 < 7 < 9                    | 8 : 174 (42.4%)    |         |
## |                    | IQR (CV) : 1 (0.1)           | 9 :  12 ( 2.9%)    |         |
## +--------------------+------------------------------+--------------------+---------+
## | phenotyping_center | 1. JAX                       | 410 (100.0%)       | 0       |
## | [character]        |                              |                    | (0.0%)  |
## +--------------------+------------------------------+--------------------+---------+
## | metadata_group     | 1. 1d91d45f80eed65ea2122ebeb | 410 (100.0%)       | 0       |
## | [character]        |                              |                    | (0.0%)  |
## +--------------------+------------------------------+--------------------+---------+

There are also graphics for the OpenStatsMM/FE/RR. Here is the list of plots for each framework:

Linear mixed model framework:

Reference Range plus frameworks:

Fisher’s exact test framework:

Below shows an example for the OpenStatsMM output:

plot(MM_result, col = 2)
## 2023-10-24 18:14:57.639641. Working on the plot ...
## 2023-10-24 18:14:57.642729. The normality test result/p-value should be considered carefully.

0.6.1 Session information

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] OpenStats_1.14.0 nlme_3.1-163    
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4       xfun_0.40          bslib_0.5.1        ggplot2_3.4.4     
##  [5] htmlwidgets_1.6.2  rlist_0.4.6.2      lattice_0.22-5     summarytools_1.0.1
##  [9] vctrs_0.6.4        tools_4.3.1        generics_0.1.3     stats4_4.3.1      
## [13] parallel_4.3.1     tibble_3.2.1       fansi_1.0.5        cluster_2.1.4     
## [17] pkgconfig_2.0.3    Matrix_1.6-1.1     data.table_1.14.8  checkmate_2.2.0   
## [21] pryr_0.1.6         lifecycle_1.0.3    compiler_4.3.1     farver_2.1.1      
## [25] stringr_1.5.0      rapportools_1.1    munsell_0.5.0      codetools_0.2-19  
## [29] carData_3.0-5      htmltools_0.5.6.1  AICcmodavg_2.3-2   sass_0.4.7        
## [33] yaml_2.3.7         htmlTable_2.4.1    Formula_1.2-5      tidyr_1.3.0       
## [37] pillar_1.9.0       car_3.1-2          jquerylib_0.1.4    MASS_7.3-60       
## [41] cachem_1.0.8       magick_2.8.1       Hmisc_5.1-1        rpart_4.1.21      
## [45] abind_1.4-5        tidyselect_1.2.0   digest_0.6.33      stringi_1.7.12    
## [49] purrr_1.0.2        reshape2_1.4.4     pander_0.6.5       dplyr_1.1.3       
## [53] labeling_0.4.3     splines_4.3.1      fastmap_1.1.1      grid_4.3.1        
## [57] colorspace_2.1-0   cli_3.6.1          magrittr_2.0.3     base64enc_0.1-3   
## [61] survival_3.5-7     utf8_1.2.4         foreign_0.8-85     withr_2.5.1       
## [65] scales_1.2.1       backports_1.4.1    unmarked_1.3.2     timechange_0.2.0  
## [69] lubridate_1.9.3    rmarkdown_2.25     matrixStats_1.0.0  nnet_7.3-19       
## [73] gridExtra_2.3      VGAM_1.1-9         pbapply_1.7-2      evaluate_0.22     
## [77] knitr_1.44         tcltk_4.3.1        rlang_1.1.1        Rcpp_1.0.11       
## [81] xtable_1.8-4       glue_1.6.2         rstudioapi_0.15.0  jsonlite_1.8.7    
## [85] plyr_1.8.9         R6_2.5.1