---
title: "findIPs: Detect Influential Points for Feature Rankings"
author:
- name: Shuo Wang
  affiliation: 
  - Medical Faculty Heidelberg, Heidelberg University
  - Institute of Medical Biometry and Statistics, University of Freiburg
  email: wangsures@foxmail.com
- name: Junyan Lu
  affiliation: Medical Faculty Heidelberg, Heidelberg University
date: "`r format(Sys.time(), '%B %d, %Y')`"

output:
   BiocStyle::html_document: default
   BiocStyle::pdf_document: default
link-citations: yes
header-includes:
    - \usepackage{setspace}
    - \doublespacing
    
vignette: >
  %\VignetteIndexEntry{Introduction to package findIPs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Feature rankings are important in analyzing high-throughput data, particularly 
for bioinformatics studies. These rankings are typically obtained by calculating
the marginal correlations with the outcomes. The ordered feature list reflects 
the importance of features, which plays a vital role in guiding subsequent 
research. For instance, researchers usually focus on a small subset of important
features that are associated with research objectives. However, the feature 
ranking can be distorted by a single case. The case exerts abnormal influence 
on the feature ranking is termed as influential points (IPs). The presence of 
IPs renders the feature ranking unreliable, consequently affecting the 
subsequent analysis based on feature rankings.

The *findIPs* R package is specifically designed to detect IPs for feature 
rankings. The method utilized in this package is based on the leave-one-out 
strategy, which involves comparing the rank difference between the original 
ranking and a new ranking that is obtained by removing a single observation (1). 
The  new rankings are leave-one-out rankings. The rank difference obtained 
through this comparison helps to determine the influence of the deleted 
observation.

The whole process can be divided into three steps,  

Step 1, generate the original and leave-one-out rankings using a feature 
ranking method, such as t-test. A dataset with n cases will result in one 
original ranking and n leave-one-out feature rankings.

Step 2, calculate rank changes. It is advisable to use top-prioritized 
weights when comparing ranks.

Step 3, calculate the cumulative rank changes for each observation. A diagnostic
check is also required to identify any potential influential points. 


# Installation

The *findIPs* package can be installed from  Bioconductor using the 
following commands:

```{r, eval = FALSE}

if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
BiocManager::install("findIPs")

```

# Dataset

The *findIPs* package includes the miller05 microarray data (2). The data 
contains 236 samples and 1,000 genes. It has two types of responses: binary 
and survival. The binary response is classified based on the p53 mutation: 58 
cases with a p53 mutant and 193 cases with the wild-type p53 mutation. The 
survival response is associated with a total of 55 recorded events. 

```{r}
library(findIPs)
data(miller05)
X <- miller05$X
y <- miller05$y
surv <- miller05$surv
```

# Detect IPs using *getdrop1ranks()* and *sumRanks()*  

We use a simple example where features are ranked based on t.test to demonstrate 
the use of *findIPs* package for IPs detection. We use *getdrop1ranks()* to 
derive the original ranking and leave-one-out rankings. Features are simply 
ranked according to the p.values of t.test. Of note, the rank criteria is the 
p.value if *fun = "t.test"*. P.values are ranked in ascending order by 
specifying *decreasing = FALSE*. We select the top 100 important features in 
the original ranking. The function returns an object containing a vector of 
original ranking (origRank) and a matrix of leave-one-out rankings (drop1Rank).

```{r}

obj <- getdrop1ranks(X, y,
                     fun = "t.test",
                     decreasing = FALSE,
                     topN = 100)
str(obj)

```

After obtaining the original ranking and leave-one-out rankings using the 
*getdrop1ranks()* function, we use the *sumRanks()* function to compute the 
distance between them. This function provides three methods for comparing ranks: 
unweighted, weighted Spearman, and method with adaptive weights. The unweighted 
method directly compares the ranks and assumes that all ranks have equal 
importance. However, this is not always the case as the top-ranked methods are 
usually more important. The weighted Spearman and adaptive weights methods 
address this issue by emphasizing the importance of the top-ranked methods (3). 
The adaptive weights method can further adjust the weights based on the 
distribution of rank changes in the data.


```{r}

results <- sumRanks(origRank = obj$origRank,
                    drop1Rank = obj$drop1Rank,
                    topN = 100,
                    method = "adaptive")

str(results)

```

The outputs of *sumRanks()* are diverse across the selected methods. For 
*method = "adaptive"*, *sumRanks()* returns a list with the following elements:

1, kappa, the shape parameter of the adaptive weights method;  
2, score, the accumulated weighted rank changes, reflecting the influence of 
each sample;  
3, origRank, the original ranking;  
4, drop1Rank, the leave-one-out rankings;   
5, origRankWeighted, weighted original ranking;  
6, drop1RankWeighted, weighted leave-one-out rankings.  

However, if the method is "weightedSpearman" or "unweighted", the function will 
only return three elements: "score", "origRank", and "drop1RankWeighted". The 
elements "kappa", "origRankWeighted", and "drop1RankWeighted" will not be 
returned.

# Use *findIPs()* to detect IPs in one-step

*findIPs()* combines *getdrop1ranks()* and *sumRanks()* into one step. The 
output is identical to that using the two-step process.

```{r}
results.ipsr <- findIPs(X, y, 
                        fun = "t.test",
                        decreasing = FALSE,
                        method = "adaptive")

identical(results, results.ipsr)
```
# Results visualization 

*findIPs* package offers three visualization functions: *plotIPs()*, 
*plotRankScatters()*, and *plotAdaptiveWeights()*. *plotIPs()* can directly 
utilize the output of *findIPs()* or *sumRanks()* to create a lollipop plot that 
displays the influence of each case. In Figure 1, we can see that the 
observation 68 (obs68) seems to be more influential on the final results. 
However, the difference between obs68 and the other observations is not that 
distinct, indicating a lower possibility of the presence of an influential 
observation.

```{r, fig.width = 7, fig.height = 4, fig.cap = "Figure 1, the influence scores for each observation"}
par(mar = c(4, 4, 2, 2))
plotIPs(results.ipsr, topn = 5, ylim = c(0, 8))

```

In addition to the lollipop, *findIPs* also provides a simple visualization 
function *plotRankScatters()* that exhibits the distribution of rank changes 
using a scatter plot (Figure 2). Alike *plotIPs()*, *plotRankScatters()* simply 
uses the output of *findIPs()* or *sumRanks()*. According to Figure 2, we can 
observe more rank changes in the tail side, but less changes in the head. The 
black points denote the rank changes caused by the most influential case.  

```{r, fig.width = 7, fig.height = 4, fig.cap = "Figure 2, the distribution of rank changes"}
par(mar = c(4, 4, 2, 2))
plotRankScatters(results.ipsr)

```

The *plotAdaptiveWeights()* function aims to visualize weight function if 
adaptive weights are used for rank comparison, that is *method = "adaptive"* 
for *findIPs()* or *sumRanks()*. The argument *kappa* refers to the shape 
parameter of the weight function. Here, the optimized kappa is 0.023 (Figure 3). 
n is the length of the feature list. We select the top 100 features, hence, 
*n = 100*. We can observe that more weights are allocated to the top-ranked 
features.  

```{r, fig.width = 7, fig.height = 4, fig.cap = "Figure 3, the weight function of the adaptive weights"}
par(mar = c(4, 4, 2, 2))
plotAdaptiveWeights(results.ipsr$kappa, n = 100, type = "line")

```

# Use findIPs in survival data

For survival analysis, we offer the option to rank features using univariate 
Cox model by setting *fun = "cox"*. The features are ranked in ascending order 
based on their P-values.

```{r, fig.width = 7, fig.height = 4, fig.cap = "Figure 4, IPs detection for survival data"}
par(mar = c(4, 4, 2, 2))
results.cox <- findIPs(X, surv, 
                       fun = "cox",
                       decreasing = FALSE,
                       method = "adaptive")

plotIPs(results.cox)
```


# Customize the rank criteria

In addition to the provided ranking criteria in *findIPs()* or *sumRanks()*, 
which includes "t.test", "cox", "log2fc", and "kruskal.test". We can also rank 
features based on a specified rank criterion. To this end, we can pass a function 
to the *fun* argument.  The function should take *x* and *y* as inputs and 
output the rank criterion, such as p-values.

As an example, we can rank features based on the p-values obtained from the 
kruskal.test. We can either specify *fun = "kruskal.test"*, as this test has 
been implemented in the package, or define our own function passing to 
*getdrop1ranks*. Both methods produce the same results.

```{r}
fun <- function(x, y){
  kruskal.test(x, y)$p.value
}

kruskal.test1 <- getdrop1ranks(X, y, 
                               fun = fun, 
                               decreasing = FALSE)

kruskal.test2 <- getdrop1ranks(X, y, 
                               fun = "kruskal.test", 
                               decreasing = FALSE)

identical(kruskal.test1, kruskal.test2)

```

# The choice of rank comparison methods

*findIPs* provides three rank comparison methods: unweighted, weighted Spearman, 
and adaptive weights. We recommend using the adaptive weights. Here, we compare 
the three methods. Weighted Spearman and the adaptive weights method 
demonstrates similar results. But both are different from the unweighted method. 

```{r, fig.width = 9, fig.height = 3, fig.cap = "Figure 5, IPs detection using three rank comparison methods"}
re.unweighted <- findIPs(X, y, 
                         fun = "t.test",
                         decreasing = FALSE,
                         method = "unweighted")
re.weighted <- findIPs(X, y, 
                       fun = "t.test",
                       decreasing = FALSE,
                       method = "weightedSpearman")
re.adaptive <- findIPs(X, y, 
                       fun = "t.test",
                       decreasing = FALSE,
                       method = "adaptive")

par(mfrow = c(1, 3), mar = c(4, 4, 2, 2))
plotIPs(re.unweighted)
mtext("A, unweighted", side = 3, line = 0, adj = 0)
plotIPs(re.weighted)
mtext("B, weighted Spearman", side = 3, line = 0, adj = 0)
plotIPs(re.adaptive)
mtext("C, adaptive weights", side = 3, line = 0, adj = 0)

```

# References {.unnumbered}

1, Wang, Shuo, and Junyan Lu. "Detect influential points of feature rankings." 
arXiv preprint arXiv:2303.10516 (2023).  
2, Miller, Lance D., et al. "An expression signature for p53 status in human 
breast cancer predicts mutation status, transcriptional effects, and patient 
survival." Proceedings of the National Academy of Sciences 102.38 (2005): 
13550-13555.doi:10.1073pnas.0506230102  
3, Da Pinto Costa, Joaquim; Soares, Carlos (2005): A weighted rank measure of 
correlation. In Australian & New Zealand Journal of Statistics 47 (4), pp. 
515–529.  


# Session info {.unnumbered}

```{r, echo = F}
sessionInfo()
```