SI AUGUSTUS CERNATUR, CERNANTUR QOUQUE AMICI
18 May 2026
There is a simple intuition of what it means to be a friend. A friend of Augustus cares about Augustus more than about other people. And, if we see Augustus, then we infer to see friends(s) of Augustus also. Please refer to our Arxive publication for more formal introduction.
Let’s picture a set of expression pattern, each represents a biological process. Techically, each pattern is a vector of expression loads of the involved genes.
Sometimes, the expression of a single gene in a sample indicates the activity of the entire pattern there. The simplest case: the gene has a nonzero load in only one pattern.
Actually, the gene may have several nonzero loads, but all of them but the one of interest are relatively small. Then the gene (AKA Augustus) is the marker gene for the pattern, and the marked pattern is the a friend of this gene.
There could be more then one friend for a marker. In this case, the marker is more important for the friends than for the non-friends.
We have two sets:T (rows) and C (columns) and A real matrix A(t,c) that describes the strength of relation between each t and each c; t is an element of T and c is an element of C.
The model applies to many problems in bioinformatics and statistics.
| Example | \(\left\{T\right\}\) | \(\left\{C\right\}\) | \(A\left(t,c\right)\) |
|---|---|---|---|
| gene regulation by TFs | gene | genes under the TF regulation | strength of regulation |
| fuzzy clustering | object | cluster | object weight in cluster |
| transcription decomposition | transcript | transcription pattern | transcript’s load in pattern |
| weighted graph | vertex | another vertex | weight of edge between column and row |
For each row, we want to identify whather there are columns(s) that particularly prefer(s) the row, and list the columns. Then such a column is a friend (or the best friend if there is only one) for the row. We intend to express the meaning of the word “particularly” in this context by a statistical test, that is the friends.test provided by this package.
Let’s define the relevance of a row to a column as the rank of the \(i\)-th rows’s relation with \(j\)-the column inside the column, or, in our notation, the rank of \(A_{ij}\) in \(A_{*j}\), and denote it as \(r(t_i,c_j)\).
The main idea of the friends.test test is that if a row \(t_i\) is not preferred by any column(s), the distribution of the ranks of the row in all the columns \(r(t_i,c_j)\) behaves like \(j\) i.i.d integers that are uniformely distributed in \(1..n\).
Let’s illustrate the method with something supersimple.
``` r
regulation <-
structure(
c(
0.2, 0.25, 0.1, 0.23, 0.3, 0.12, 0.14,
4, 3, 6, 1, 9, 7, 10,
3, 1, 5, 6, 3, 4, 11
),
dim = c(7, 3),
dimnames =
list(
c(
"Gene1", "Gene2", "Gene3",
"Gene4", "Gene5", "Gene6",
"Gene7"
),
c("TF1", "TF2", "TF3")
)
)
noquote(regulation)
```
```
## TF1 TF2 TF3
## Gene1 0.20 4 3
## Gene2 0.25 3 1
## Gene3 0.10 6 5
## Gene4 0.23 1 6
## Gene5 0.30 9 3
## Gene6 0.12 7 4
## Gene7 0.14 10 11
```
What we want to identify is Gene2-TF1 pair. Gene2 is not the strongest downstream gene in the TF1 column (column), it is only the second, but, on the other hand its rank in other TF-based columns is much lower.
First, let’s apply the Bayesian version of the method, best.friends.bic.
``` r
friends.test.bic(regulation, prior.to.have.friends = .5, max.friends.n = 1)
```
```
## $Gene1
## $Gene1$TF1
## marker friend rank
## 1 1 1
##
##
## $Gene2
## $Gene2$TF1
## marker friend rank
## 2 1 1
```
We see the expected Gene2-TF1 pair as well as the Gene1-TF1 pair, that is similar to the expected, Gene1 has rank of 3 in the TF1 column.
The lower is the prior to have friends for a row (it is an obligatory parameter for friends.test.bic), the less marker-friend pairs we get.
``` r
friends.test.bic(regulation, prior.to.have.friends = .33, max.friends.n = 1)
```
```
## $Gene2
## $Gene2$TF1
## marker friend rank
## 2 1 1
```
``` r
friends.test.bic(regulation, prior.to.have.friends = .25, max.friends.n = 1)
```
```
## named list()
```
In all the three examples above, the max.friends.n parameter was set to 1. It means that the function filters out all the rows that have more that one friendly column, or in other words, we look only for the best friends. If we relax the restriction, we get more results, and the rank that shows the rank of the friend for the marker, can differ from 1:
``` r
friends.test.bic(regulation, prior.to.have.friends = .25)
```
```
## $Gene7
## $Gene7$TF2
## marker friend rank
## 7 2 1
##
## $Gene7$TF3
## marker friend rank
## 7 3 2
```
Note that max.friends.n = "all" is the default value for both friends.test (see below) and friends.test.bic.
Let’s run the p-value-based option of the method on the same data. It is less sensitive on small dataset, so we use unpractical threshold for the p-value to see the result.
``` r
friends.test(regulation, threshold = .5, p.adjust.method = "no",
max.friends.n = 1)
```
```
## $Gene2
## $Gene2$TF1
## marker friend rank
## 2 1 1
```
And, as usual, if we relax the max.friends.no restriction, we will see more pairs, but now the friends are not required to be unique for a marker.
``` r
friends.test(regulation, threshold = .5, p.adjust.method = "no")
```
```
## $Gene2
## $Gene2$TF1
## marker friend rank
## 2 1 1
##
##
## $Gene4
## $Gene4$TF3
## marker friend rank
## 4 3 1
##
## $Gene4$TF1
## marker friend rank
## 4 1 2
##
##
## $Gene7
## $Gene7$TF2
## marker friend rank
## 7 2 1
##
## $Gene7$TF3
## marker friend rank
## 7 3 2
```
Let’s look for anomalies in the traditional iris dataset:
``` r
data("iris")
print(head(iris, n = 5))
```
```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
```
We use only numeric columns:
``` r
friends.test.bic(iris[1:4], prior.to.have.friends = .001)
```
```
## $`14`
## $`14`$Sepal.Width
## marker friend rank
## 14 2 1
```
So, we see that the flower #14 is important for (we also can think about it as “popular in”) the Sepal.Width nomination. Let’s understand it a simple request: what is the quantile of the flower #14 in the four columns.
``` r
apply(iris[, 1:4], 2, function(col) ecdf(col)(col[14]))
```
```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.006666667 0.553333333 0.013333333 0.033333333
```
So, flower #14 is in top 50% in Sepal.Width and in lowest 5% for all other sizes. It is exacly what we describe as “Sepal.Width is friend for flower #14 and the flower #14 is a marker for Sepal.Width”.
Let’s note that the four sizes have different distributions; still, the friends.test naturally works with this type of data.
NATURAM EXPELLAS FURCA, TAMEN USQUE RECURRET
CoGAPS decomposes an expression dataset into pattetns of expression, ideally, eah of them represent an underlying biological process. We use the PDAC dataset from the Johnson et al, 2023 as an example.
``` r
data(friends.test.cogaps.example)
```
Feature loadings in CoGAPS are normalized expression values across patterns. Each column is a pattern, each row is a gene.
``` r
featureLoadings <- friends.test.cogaps.example$loadings
head(featureLoadings)
```
```
## Pattern_1 Pattern_2 Pattern_3 Pattern_4 Pattern_5 Pattern_6 Pattern_7
## SAMD11 0.002373853 0.002656909 0.006061856 0.001842319 0.0007711356 0.0005375241 0.01290628
## NOC2L 0.333290756 0.215134636 0.680900455 0.083199978 0.1951790750 0.1346961260 0.18365334
## KLHL17 0.055154882 0.005238459 0.047037147 0.013550087 0.0049433918 0.0671307519 0.01745793
## PLEKHN1 0.070879407 0.160753518 0.002134236 0.002048471 0.0000000000 0.4884256721 0.10620824
## HES4 0.345203698 0.139899164 0.149048895 0.026502691 1.6740946770 1.3930050135 0.18173441
## ISG15 0.012890226 1.408829808 0.849508107 0.025504649 0.1476552784 1.0514144897 1.27441263
## Pattern_8
## SAMD11 0.001200248
## NOC2L 0.018304676
## KLHL17 0.016975854
## PLEKHN1 0.075609177
## HES4 0.267374903
## ISG15 0.445318580
```
Let’s look for marker genes for the patterns using friends.test. In this setup, genes (possibly, markers) are rows, patterns (possibly, friends for the markers) are columns.
We use here the default version of friends.test that selects markers by the KS test. As far ad there is ~15000 genes, we use Bonferroni p=value adjustment rather than the default “BH”.
``` r
friends_of_genes <- friends.test(
featureLoadings,
p.adjust.method = "bonferroni"
)
```
The return is a list of lists, the outer list is by gene that is a marker, the inner is by pattern that is the marker’s friend. The output format allows a marker to have a lot of friends. The leave is a 3-integer vector, are the marker index and the friend index in the matrix we started with, and the rank of the friend for the marker. Let’s look at the first element, it’s rank is 1 as far as it is the first.
``` r
friends_of_genes[[1]][[1]]
```
```
## marker friend rank
## 232 5 1
```
The names of the list elements are in accordance. The first marker name:
``` r
rownames(featureLoadings)[friends_of_genes[[1]][[1]]["marker"]]
```
```
## [1] "RPL11"
```
``` r
names(friends_of_genes)[1]
```
```
## [1] "RPL11"
```
The name of the first friend of the first marker:
``` r
colnames(featureLoadings)[friends_of_genes[[1]][[1]]["friend"]]
```
```
## [1] "Pattern_5"
```
``` r
names(friends_of_genes[[1]])[1]
```
```
## [1] "Pattern_5"
```
The format is flexible enough, we exemplify posiible transformations later. The format is common for the friends.test and friends.test.bic main functions.
Let’s look what happens in the friends.test. First, we rank all the genes inside the loadings matrix.
``` r
ranks <- row.int.ranks(featureLoadings)
head(ranks)
```
```
## Pattern_1 Pattern_2 Pattern_3 Pattern_4 Pattern_5 Pattern_6 Pattern_7 Pattern_8
## SAMD11 12846 11721 10876 11106 11863 12096 10495 12220
## NOC2L 2093 4244 1081 3785 4573 5654 2866 9268
## KLHL17 8678 11147 8186 8078 10935 7095 9920 9361
## PLEKHN1 8048 5169 11974 11012 15128 2203 4717 6885
## HES4 1980 5613 5304 6617 346 486 2905 3742
## ISG15 11049 390 751 6698 5515 725 196 2293
```
The ranks of the first marker in all he patterns:
``` r
marker_1 <- names(friends_of_genes)[1]
ranks[marker_1, ]
```
```
## Pattern_1 Pattern_2 Pattern_3 Pattern_4 Pattern_5 Pattern_6 Pattern_7 Pattern_8
## 7599 80 149 31 12 48 86 92
```
The ranks of the first in all the friend patterns:
``` r
ranks[marker_1, names(friends_of_genes[[1]])]
```
```
## Pattern_5 Pattern_4 Pattern_6 Pattern_2 Pattern_7 Pattern_8 Pattern_3
## 12 31 48 80 86 92 149
```
The model splitted the patterns into two sets: Pattern_1 where the marker has a very low rank, and the reaminder. It is what we supposed to be done.
To look for genes that have only one friend (let’s call them exclusive markers), so we use max.friends.n = 1. The parameter actually is involved after the main part of the calculation, it filters out the genes that has more than max.friends.n friends, these genes are not stored in the output.
``` r
best_friends_of_genes <- friends.test(
featureLoadings,
p.adjust.method = "bonferroni",
max.friends.n = 1
)
names(best_friends_of_genes)
```
```
## [1] "CA4"
```
In this super-restrictive mode, only one (exclusive) marker, CA4, is found. Its friend is the Pattern_5. It is to the only by the task.
``` r
the_marker <- names(best_friends_of_genes)[1]
ranks[the_marker, ]
```
```
## Pattern_1 Pattern_2 Pattern_3 Pattern_4 Pattern_5 Pattern_6 Pattern_7 Pattern_8
## 15098 14889 14789 14885 632 14794 15081 14654
```
That is what the friends.test is a unique test to detect: the CA4 has a moderate rank in Pattern_5, but it is the best of its ranks in all other patterns and it is significantlu higher than all of them.
To perform an example of gene set enrichment analysis, we use the Baesian version of the test, friends.test.bic. We restrict the number of friends for a marker by 4, a half of number of the patterns, to keep them all in the similar “direction”. The reqired parameter prior.to.have.friends is set to approximately 1/(number od genes). We run 4 parallel processses, see Parallel calculation.
``` r
mirai::daemons(4)
good_friends_of_genes <- friends.test.bic(
featureLoadings,
prior.to.have.friends = 1E-4,
max.friends.n = 4
)
mirai::daemons(0)
```
For the gene overrrepresentation analysis, we need to gen a list of markers (genes) for each pttern. So, we swap the levels of the list, and we do not need leaves. Here, we can reverse the list via a data frame. It works if the number of elements do not reach hundred of thousands.
``` r
good_friends_df <- good_friends_of_genes |>
purrr::flatten() |>
purrr::reduce(.f = rbind) |>
data.frame() |>
tibble::remove_rownames() |>
dplyr::mutate(
gene = rownames(featureLoadings)[marker],
pattern = colnames(featureLoadings)[friend]
)
head(good_friends_df)
```
```
## marker friend rank gene pattern
## 1 62 5 1 AJAP1 Pattern_5
## 2 62 1 2 AJAP1 Pattern_1
## 3 76 6 1 TAS1R1 Pattern_6
## 4 92 6 1 CA6 Pattern_6
## 5 112 5 1 C1orf127 Pattern_5
## 6 117 5 1 ANGPTL7 Pattern_5
```
Now, let’s prepare a list of markers by pattern.
``` r
good_markers_list <- good_friends_df |>
split(good_friends_df$pattern) |>
purrr::map(~ .x |>
dplyr::pull(gene) |>
unique() |>
sort()
)
```
As far as we have gene lists, let’s examine them for overreprentation in the gene collections from MSigDB. We use fgsea and msigdbr for that.
First, prepare the collections sets for H (Hallmarks) and for for C4:3CA (Curated Cancer Cell Atlas).
``` r
msigdbr_H_df <- msigdbr::msigdbr(species = "human", collection = "H")
collections_H <- split(
msigdbr_H_df$gene_symbol,
msigdbr_H_df$gs_name
)
collections_H <- split(
msigdbr_H_df$gene_symbol,
msigdbr_H_df$gs_name
)
msigdbr_C4_3CA_df <-
msigdbr::msigdbr(
species = "human",
collection = "C4",
subcollection = "3CA"
)
collections_C4_3CA <- split(
msigdbr_C4_3CA_df$gene_symbol,
msigdbr_C4_3CA_df$gs_name
)
```
Now, let’s test the gene enrichment of the markers for each pattern in the Curated Cancer Cell Atlas and in Hallmarks MSigDB collections.
``` r
C4_3CA_by_markers <- good_markers_list |>
purrr::map(\(markers) {
fgsea::fora(
pathways = collections_C4_3CA,
markers,
rownames(featureLoadings)
) |>
dplyr::filter(padj < 0.05) |>
dplyr::select(pathway, padj)
})
C4_3CA_by_markers
```
```
## $Pattern_1
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_2
## pathway padj
## <char> <num>
## 1: GAVISH_3CA_METAPROGRAM_EPITHELIAL_COLON_RELATED 0.01053558
## 2: GAVISH_3CA_METAPROGRAM_EPITHELIAL_PDAC_RELATED_2 0.01053558
##
## $Pattern_3
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_4
## pathway padj
## <char> <num>
## 1: GAVISH_3CA_METAPROGRAM_EPITHELIAL_RESPIRATION 0.03434735
## 2: GAVISH_3CA_METAPROGRAM_EPITHELIAL_PDAC_RELATED_2 0.03434735
##
## $Pattern_5
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_6
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_7
## pathway padj
## <char> <num>
## 1: GAVISH_3CA_MALIGNANT_METAPROGRAM_23_SECRETED_2 0.0003684984
## 2: GAVISH_3CA_METAPROGRAM_EPITHELIAL_SECRETED 0.0003684984
## 3: GAVISH_3CA_MALIGNANT_METAPROGRAM_19_EPITHELIAL_SENESCENCE 0.0072089630
## 4: GAVISH_3CA_METAPROGRAM_EPITHELIAL_EPISEN 0.0454684340
## 5: GAVISH_3CA_METAPROGRAM_EPITHELIAL_ALVEOLAR 0.0454684340
## 6: GAVISH_3CA_METAPROGRAM_EPITHELIAL_PDAC_RELATED_2 0.0454684340
## 7: GAVISH_3CA_METAPROGRAM_FIBROBLASTS_METAL_RESPONSE 0.0454684340
##
## $Pattern_8
## pathway padj
## <char> <num>
## 1: GAVISH_3CA_MALIGNANT_METAPROGRAM_19_EPITHELIAL_SENESCENCE 0.01266589
## 2: GAVISH_3CA_METAPROGRAM_EPITHELIAL_EPISEN 0.03366002
## 3: GAVISH_3CA_METAPROGRAM_EPITHELIAL_SECRETED 0.03366002
## 4: GAVISH_3CA_METAPROGRAM_EPITHELIAL_PDAC_RELATED_2 0.03366002
```
``` r
hallmarks_by_markers <- good_markers_list |>
purrr::map(\(markers) {
fgsea::fora(
pathways = collections_H,
markers,
rownames(featureLoadings)
) |>
dplyr::filter(padj < 0.05) |>
dplyr::select(pathway, padj)
})
hallmarks_by_markers
```
```
## $Pattern_1
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_2
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_3
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_4
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_5
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_6
## pathway padj
## <char> <num>
## 1: HALLMARK_KRAS_SIGNALING_DN 0.02078762
##
## $Pattern_7
## Empty data.table (0 rows and 2 cols): pathway,padj
##
## $Pattern_8
## Empty data.table (0 rows and 2 cols): pathway,padj
```
The result of both friends.test and friends.test.bic functions depends on random state generator. The dependency is because the test starts with ranking of the elemants of the input matrix in the columns. The test requires the rank values to be integer, so the ties are resolved randomly. It you want a reproduceablle result, keep the initial random generator state. Another option is to rank the matrix bfore the test call and then to pass the rank matrix as the input. It does not consume a lot of additional time, and the renking of already-ranked does not depend on random seeds. The special case when this approach is the only is if some additional information is used in the ranking.
Both main functions, friends.test and friends.test.bic, carries purrr::in_parallel code, so if you want each of them to run in parallel mode, say mirai::deamons(n) (n is number of processes) in your R code before the finction call, and mirai::daemons(0) after. See mirai - Quick Reference for more information.
``` r
sessionInfo()
```
```
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] friends.test_0.99.18 markdown_2.0 knitr_1.51
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 fgsea_1.38.0 lattice_0.22-9
## [5] digest_0.6.39 magrittr_2.0.5 evaluate_1.0.5 grid_4.6.0
## [9] RColorBrewer_1.1-3 fastmap_1.2.0 Matrix_1.7-5 jsonlite_2.0.0
## [13] purrr_1.2.2 scales_1.4.0 codetools_0.2-20 jquerylib_0.1.4
## [17] cli_3.6.6 rlang_1.2.0 rtlr_0.1.0 cowplot_1.2.0
## [21] withr_3.0.2 cachem_1.1.0 yaml_2.3.12 otel_0.2.0
## [25] tools_4.6.0 parallel_4.6.0 BiocParallel_1.46.0 dplyr_1.2.1
## [29] nanonext_1.9.0 ggplot2_4.0.3 carrier_0.3.0.4 fastmatch_1.1-8
## [33] curl_7.1.0 assertthat_0.2.1 babelgene_22.9 vctrs_0.7.3
## [37] R6_2.6.1 lifecycle_1.0.5 pkgconfig_2.0.3 pillar_1.11.1
## [41] bslib_0.11.0 gtable_0.3.6 data.table_1.18.4 glue_1.8.1
## [45] Rcpp_1.1.1-1.1 xfun_0.57 tibble_3.3.1 tidyselect_1.2.1
## [49] dichromat_2.0-0.1 farver_2.1.2 msigdbr_26.1.0 htmltools_0.5.9
## [53] mirai_2.7.0 rmarkdown_2.31 compiler_4.6.0 S7_0.2.2
```