# iteremoval The package provides a flexible algorithm to iteratively screen features in consideration of overfitting and overall performance. Two distinct groups of observations are required. It is originally tailored for methylation locus screening of NGS data, and it can also be used as a generic method for feature selection. Each step of the algorithm provides a default function for simple implemention, and it is able to be replaced by a user defined method. ## Install To install this package, start R and enter: ```{r eval=FALSE} # try http:// if https:// URLs are not supported source("https://bioconductor.org/biocLite.R") biocLite("iteremoval") ``` ## An example We identified 299 genomic regions related to the methylation status of a disease. Then, we sequenced the cfDNA of 44 subjects. 22 were malignant, while others were normal. We built a statistical model to compute the probability of the disease for each region and subject. Thus, we use the probability data to demonstrate how the package works. The input files are in either format: 1. Two datasets, `SWRG1`(malignant individuals) and `SWRG0`(normal individuals). Each row of the two datasets indicates a genomic region, and the columns are the probabilities. 2. A `SummarizedExperiment::`SummarizedExperiment object (`SummarizedData`) and a logical vector (`SummarizedData$Group==0`) to distinguish normal samples from all samples. Note: Name 'Group' is not required. We expect that higher the value means higher the probability of having a disease. However, we discover that not all genes are related to the disease. Therefore, we use the package `iteremoval` to select the gene locus with high probability in malignant samples and low probability in normal samples. `iteremoval` is oriented to find a dataset to classify two different groups, and the feature that removed in each iteration is comprehensively considered according to all observations. If the observations have subtypes, the process might generate a feature set favoring all subtypes with the default settings. You can also define the function that each step uses to meet your requirement. ### Removing features ```{r} library(iteremoval) # input two datasets removal.stat <- feature_removal(SWRG1, SWRG0, cutoff1=0.95, cutoff0=0.95, offset=c(0.25, 0.5, 2, 4)) ``` ```{r eval=TRUE} # input SummarizedExperiment object removal.stat <- feature_removal(SummarizedData, SummarizedData$Group==0, cutoff1=0.95, cutoff0=0.95, offset=c(0.25, 0.5, 2, 4)) ``` It is the core function of the package. The first four parameters are required. A vector of `offset` means doing the whole computational process with different offsets *respectively*, and you can also define only one numeric number to `offset`. To get a overall information of features, the iteration will not stop until there is no feature to remove. - You can also lower `cutoff0` and higher `cutoff1` to reduce overfitting. Why? You can type `?feature_removal` to see the whole algorithm. ### Ploting the iteration trace of removed features' scores It is useful to visulize how each feature is removed in the iterative process. The package provides a way to plot the scores of features being removed in each iteration. ```{r echo=TRUE} ggiteration_trace(removal.stat) + theme_bw() ``` `ggiteration_trace` returns a ggplot2 object, so `+` is used in the same way as `ggplot2` package. X-axis is the iteration index, and Y-axis is the score of the feature being removed, and the legend is the offset you passed to `feature_removal`. Normally, you can stop removing features from the index of which the scores fluctuate drastically. ### Generating the feature list Once you confirm the cutoff of iteration index, you can generate the feature list. Since using a vector of `offset` and the features are removed with different `cutoff` seperately, the remaining feature lists are not unique for multiple cutoffs. Thus, we compute the feature prevalence for the feature lists, using: ```{r echo=TRUE} features <- feature_prevalence(removal.stat, index=255, hist.plot=TRUE) features ``` ### Screening the features with prevalence At last, you can see the prevalence histogram of the features because of multiple offsets. You can specify a cutoff for the prevalence, and output the final feature list. ```{r echo=TRUE} feature_screen(features, prevalence=4) ```