Introduction
Unsupervised feature screening identifies informative features without relying on explicit target labels.
Unsupervised methods operate only on the inherent structure of the data, so the selection is not driven by the outcome of interest. By retaining only the features that meet a chosen criterion, unsupervised screening reduces the dimensionality of a dataset and allows more focused analyses.
featscreen provides methods for a range of scenarios, from handling missing values to screening on frequency ratios, median values, and variability.
In this article we explore the different unsupervised feature screening methods supported by featscreen.
Setup
First, we load featscreen and the other packages we need:
library(featscreen)
#>
#> Attaching package: 'featscreen'
#> The following objects are masked from 'package:stats':
#>
#> mad, sd
#> The following object is masked from 'package:graphics':
#>
#> screen
Seed
Next, we set a seed for random number generation (RNG). By default,
each R session creates its seed from the current time and the process
ID, so simulation results differ between sessions. Fixing the seed
ensures that the results of this vignette can be reproduced. We can
specify a seed by calling ?set.seed.
#Set a seed for RNG
set.seed(
#A seed
seed = 5381L, #a randomly chosen integer value
#The kind of RNG to use
kind = "Mersenne-Twister", #we make explicit the current R default value
#The kind of Normal generation
normal.kind = "Inversion" #we make explicit the current R default value
)
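To see the effect of fixing the seed, the short base R sketch below draws the same sample twice, resetting the seed before each draw. The object names first.draw and second.draw are illustrative and are not used elsewhere in this article. The final call restores the seed so that the chunks that follow start from the same RNG state.

```r
# Draw five integers right after fixing the seed
set.seed(seed = 5381L, kind = "Mersenne-Twister", normal.kind = "Inversion")
first.draw = sample.int(n = 100, size = 5)

# Reset the same seed and draw again
set.seed(seed = 5381L, kind = "Mersenne-Twister", normal.kind = "Inversion")
second.draw = sample.int(n = 100, size = 5)

# Both draws are identical because the RNG state was restored
identical(first.draw, second.draw)
#> [1] TRUE

# Restore the seed once more so later chunks are unaffected by this check
set.seed(seed = 5381L, kind = "Mersenne-Twister", normal.kind = "Inversion")
```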
Screening Methods
To view the list of currently supported unsupervised screening methods,
use the ?listAvailableScreeningMethods function with the x parameter
set to 'unsupervised':
#list screening methods
screening.methods = listAvailableScreeningMethods(x = 'unsupervised')
#print in table
knitr::kable(x = screening.methods, align = 'rc')
id | name |
---|---|
missing.value | missing value ratio |
above.median | above-median frequency ratio |
above.minimum | above-minimum frequency ratio |
median | median value |
variability | variability |
Missing-Value Ratio
The missing-value-ratio method targets features based on the proportion of missing values they contain. This statistic allows users to set a threshold, filtering out features that surpass a specified percentage of missing data. By doing so, analysts can identify features that might lack sufficient information for meaningful analysis.
The missing-value ratio is computed in featscreen by calling the
function ?rowMissingValueRatio.
#Define row/col size
nr = 5
nc = 10
#Data
x = matrix(
data = sample(x = c(1,2), size = nr*nc, replace = TRUE),
nrow = nr,
ncol = nc,
dimnames = list(
paste0("f",seq(nr)),
paste0("S",seq(nc))
)
)
#Force 1st feature to have 40% of missing values
x[1,seq(nc*0.4)] = NA
#Compute
rowMissingValueRatio(x = x)
#> f1 f2 f3 f4 f5
#> 0.4 0.0 0.0 0.0 0.0
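For intuition, the same statistic can be sketched in base R as the row-wise mean of the missing-value indicator. This is a minimal illustration on a hand-made matrix called toy (a hypothetical name), not the implementation used by featscreen.

```r
# Toy matrix: 2 features (rows) x 5 samples (columns); made-up data
toy = matrix(
  data = c(NA, 1, NA, 2, 1,
            2, 2,  1, 1, 2),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("f1", "f2"), paste0("S", 1:5))
)

# is.na() flags missing entries; the row mean of the flags is the
# proportion of samples with a missing value for each feature
na.ratio = rowMeans(is.na(toy))
na.ratio
#> f1 f2
#> 0.4 0.0
```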
We might also be interested in the missing-value ratio within each group of samples.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowMissingValueRatio(x = x, g = g)
#> a b
#> f1 0.8 0
#> f2 0.0 0
#> f3 0.0 0
#> f4 0.0 0
#> f5 0.0 0
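A grouped ratio of this kind can be sketched in base R by splitting the column indices by group and computing the ratio within each block of columns. The objects toy, grp, and idx are hypothetical names on made-up data, and the approach is an illustration rather than the package's implementation.

```r
# Toy matrix: 2 features x 8 samples; made-up data
toy = matrix(
  data = c(NA, NA, 1, 2, 1, 2, 2, 1,
            1,  2, 2, 1, NA, 1, 2, 2),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("f1", "f2"), paste0("S", 1:8))
)
# First four samples belong to group "a", last four to group "b"
grp = c(rep("a", 4), rep("b", 4))

# Column indices of the samples in each group
idx = split(x = seq_len(ncol(toy)), f = grp)

# One column of ratios per group, one row per feature
grouped.ratio = sapply(
  X = idx,
  FUN = function(j) rowMeans(is.na(toy[, j, drop = FALSE]))
)
# f1: a = 0.5, b = 0; f2: a = 0, b = 0.25
grouped.ratio
```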
?rowFilterByMissingValueRatio is a ready-to-use filter function.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowFilterByMissingValueRatio(
x = x,
g = g,
# Maximum proportion of samples with missing values
max.prop = 0.5
)
#> f1 f2 f3 f4 f5
#> FALSE TRUE TRUE TRUE TRUE
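The result above is consistent with a rule that keeps a feature only when its missing-value ratio stays within max.prop in every group: f1 is dropped because 80% of its values are missing in group a. The base R sketch below applies that rule to the grouped ratios printed earlier. The rule itself, including the use of <= rather than <, is an assumption inferred from the output and is not taken from the package source.

```r
# Grouped missing-value ratios, copied from the grouped output shown earlier
ratios = matrix(
  data = c(0.8, 0,
           0.0, 0,
           0.0, 0,
           0.0, 0,
           0.0, 0),
  ncol = 2,
  byrow = TRUE,
  dimnames = list(paste0("f", 1:5), c("a", "b"))
)
# Maximum tolerated proportion of missing values
max.prop = 0.5

# Assumed rule: keep a feature only if every group is within the limit
keep = apply(X = ratios <= max.prop, MARGIN = 1, FUN = all)
keep
#> f1 f2 f3 f4 f5
#> FALSE TRUE TRUE TRUE TRUE
```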
Threshold-Based Selection
Common unsupervised screening strategies are based on the feature values. These statistics allow users to set a threshold, filtering out features with values below the defined cutoff.
Typically applied to gene expression data, this rule is motivated by the notion that a gene must exhibit a minimal level of expression to be biologically relevant. A commonly used cut-off is the median expression of genes in a sample. Further refinement includes retaining features only if they surpass the threshold in a specified number of samples.
Above-Median Frequency Ratio
The above-median frequency ratio is the proportion of samples in which the value of a feature lies above the median of the sample.
Features can then be retained when this proportion reaches a chosen minimum, which isolates features that are consistently expressed above the samples’ central tendency.
The above-median frequency ratio is computed in featscreen by calling
the function ?rowAboveMedianFreqRatio.
#Define row/col size
nr = 5
nc = 10
#Data
x = matrix(
data = sample.int(n = 100, size = nr*nc, replace = TRUE),
nrow = nr,
ncol = nc,
dimnames = list(
paste0("f",seq(nr)),
paste0("S",seq(nc))
)
)
#Compute
rowAboveMedianFreqRatio(x = x)
#> f1 f2 f3 f4 f5
#> 0.6 0.7 0.7 0.5 0.5
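One way to sketch this kind of statistic in base R is to compare every value with the median of its own sample (column) and then average the comparisons across samples. The sketch assumes that the cutoff is the per-sample median and that a value equal to the median counts as above it. Both are assumptions about the definition, and toy is made-up data.

```r
# Toy matrix: 3 features x 3 samples; made-up data
toy = matrix(
  data = c(5, 3, 9,
           3, 2, 4,
           1, 1, 7),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(paste0("f", 1:3), paste0("S", 1:3))
)

# Median of each sample, taken across features (S1 = 3, S2 = 2, S3 = 7)
col.med = apply(X = toy, MARGIN = 2, FUN = median)

# TRUE where a value is at least the median of its own sample
above = sweep(x = toy, MARGIN = 2, STATS = col.med, FUN = ">=")

# Proportion of samples in which each feature is at or above the median
# f1 = 1, f2 = 2/3, f3 = 1/3
above.median.ratio = rowMeans(above)
above.median.ratio
```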
We might also be interested in the above-median frequency ratio within each group of samples.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowAboveMedianFreqRatio(x = x, g = g)
#> a b
#> f1 0.6 0.6
#> f2 0.6 0.8
#> f3 0.8 0.6
#> f4 0.6 0.4
#> f5 0.4 0.6
?rowFilterByAboveMedianRatio is a ready-to-use filter function.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowFilterByAboveMedianRatio(
x = x,
g = g,
# Minimum proportion of samples where the feature expression is above the median
min.prop = 0.5
)
#> f1 f2 f3 f4 f5
#> TRUE TRUE TRUE TRUE TRUE
Above-Minimum Frequency Ratio
The above-minimum frequency ratio is the proportion of samples in which the value of a feature exceeds a user-defined minimum. Features are retained when this proportion reaches a chosen cutoff.
The above-minimum frequency ratio is computed in featscreen by calling
the function ?rowAboveMinFreqRatio.
#Define row/col size
nr = 5
nc = 10
#Data
x = matrix(
data = sample.int(n = 100, size = nr*nc, replace = TRUE),
nrow = nr,
ncol = nc,
dimnames = list(
paste0("f",seq(nr)),
paste0("S",seq(nc))
)
)
#Compute
rowAboveMinFreqRatio(x = x, min.expr = 20)
#> f1 f2 f3 f4 f5
#> 0.9 0.7 1.0 0.8 0.8
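Because the ratio depends directly on the chosen minimum, it can help to look at several candidate thresholds side by side. The base R sketch below does this for the two values of min.expr used in this section (20 and 60) on a made-up matrix. It assumes a strict > comparison, which may differ from how featscreen treats values equal to the threshold, so the toy data avoid such ties.

```r
# Toy matrix: 2 features x 4 samples; made-up data
toy = matrix(
  data = c(10, 25, 70, 90,
            5, 15, 19, 65),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("f1", "f2"), paste0("S", 1:4))
)

# Candidate minimum expression values
thresholds = c(20, 60)

# One column per threshold: proportion of samples above that threshold
ratio.by.threshold = sapply(
  X = thresholds,
  FUN = function(thr) rowMeans(toy > thr)
)
colnames(ratio.by.threshold) = paste0("min.expr=", thresholds)

# f1: 0.75 at 20 and 0.50 at 60; f2: 0.25 at both thresholds
ratio.by.threshold
```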
We might also be interested in the above-minimum frequency ratio within each group of samples.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowAboveMinFreqRatio(x = x, min.expr = 20, g = g)
#> a b
#> f1 1.0 0.8
#> f2 0.8 0.6
#> f3 1.0 1.0
#> f4 1.0 0.6
#> f5 0.6 1.0
?rowFilterByAboveMinRatio is a ready-to-use filter function.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowFilterByAboveMinRatio(
x = x,
g = g,
# Minimum expression required
min.expr = 60,
# Minimum proportion of samples where the feature expression is above the min
min.prop = 0.5
)
#> f1 f2 f3 f4 f5
#> TRUE FALSE TRUE FALSE TRUE
Median Values
In gene expression data, the median value of a feature is commonly used as an indication of its level of expression. As with the threshold-based rules above, the motivation is that a gene must exhibit a minimal level of expression to be biologically relevant.
The median values are computed in featscreen by calling the function
?rowMedians.
#Define row/col size
nr = 5
nc = 10
#Data
x = matrix(
data = sample.int(n = 100, size = nr*nc, replace = TRUE),
nrow = nr,
ncol = nc,
dimnames = list(
paste0("f",seq(nr)),
paste0("S",seq(nc))
)
)
#Compute
rowMedians(x = x)
#> f1 f2 f3 f4 f5
#> 53.0 84.5 52.0 55.0 75.0
We might also be interested in the median value within each group of samples.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowMedians(x = x, g = g)
#> a b
#> f1 55 51
#> f2 59 86
#> f3 42 53
#> f4 19 57
#> f5 84 59
?rowFilterByMedianAboveMinExpr is a ready-to-use filter function.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowFilterByMedianAboveMinExpr(
x = x,
g = g,
# Median minimum expression required
min.expr = 60
)
#> f1 f2 f3 f4 f5
#> FALSE TRUE FALSE FALSE TRUE
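The filter result can be checked by hand against the grouped medians printed earlier: a feature is marked TRUE exactly when at least one of its group medians reaches min.expr. The base R sketch below encodes that reading. The "any group" rule and the >= comparison are inferred from the outputs in this section, not taken from the package documentation.

```r
# Grouped medians, copied from the rowMedians(x = x, g = g) output above
grouped.medians = matrix(
  data = c(55, 51,
           59, 86,
           42, 53,
           19, 57,
           84, 59),
  ncol = 2,
  byrow = TRUE,
  dimnames = list(paste0("f", 1:5), c("a", "b"))
)
# Minimum median expression required
min.expr = 60

# Assumed rule: keep a feature if the median of any group reaches min.expr
keep = apply(X = grouped.medians >= min.expr, MARGIN = 1, FUN = any)
keep
#> f1 f2 f3 f4 f5
#> FALSE TRUE FALSE FALSE TRUE
```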
Variability-Based Selection
Another commonly used unsupervised screening strategy is based on the feature variability, i.e., features showing higher variability across samples are prioritized. The rationale is that genes with substantial variability may capture interesting variations linked to experimental conditions (e.g., drug administration). Variability can be measured using standard deviation, interquartile range, or median absolute deviation.
The variability is computed in featscreen by calling the function
?rowVariability.
Six measures of variability are available:
- sd: The standard deviation.
- iqr: The interquartile range.
- mad: The median absolute deviation.
- rsd: The relative standard deviation (i.e., coefficient of variation).
- efficiency: The coefficient of variation squared.
- vmr: The variance-to-mean ratio.
We can provide the type of measure we want to use as the method
argument in ?rowVariability.
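As a rough guide to what each measure computes, the base R sketch below evaluates all of them on a single numeric vector. The formulas follow the textual definitions in the list above. Details such as the sample (n - 1) variance and the default scaling constant of stats::mad are assumptions and may differ from featscreen. The calls are prefixed with stats:: because featscreen masks sd and mad when it is attached.

```r
# Made-up values of one feature across eight samples
v = c(2, 4, 4, 4, 5, 5, 7, 9)

# Building blocks from the stats package
v.mean = mean(v)
v.sd = stats::sd(v)    # sample standard deviation (n - 1 denominator)
v.var = stats::var(v)  # sample variance

measures = c(
  sd = v.sd,
  iqr = stats::IQR(v),        # 75th minus 25th percentile
  mad = stats::mad(v),        # scaled by 1.4826 by default
  rsd = v.sd / v.mean,        # coefficient of variation
  efficiency = (v.sd / v.mean)^2,
  vmr = v.var / v.mean        # variance-to-mean ratio
)
measures
```

Note that rsd, efficiency, and vmr divide by the mean, so they are only meaningful for features whose mean is not close to zero.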
#Define row/col size
nr = 5
nc = 10
#Data
x = matrix(
data = sample.int(n = 100, size = nr*nc, replace = TRUE),
nrow = nr,
ncol = nc,
dimnames = list(
paste0("f",seq(nr)),
paste0("S",seq(nc))
)
)
#Compute
rowVariability(x = x, method = 'sd')
#> f1 f2 f3 f4 f5
#> 36.74250 24.09265 34.13844 30.32857 28.79062
We might also be interested in the variability within each group of samples.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowVariability(x = x, method = 'sd', g = g)
#> a b
#> f1 34.91132 39.41446
#> f2 24.97399 22.39866
#> f3 28.90848 41.49096
#> f4 30.08654 33.94554
#> f5 13.22876 40.88765
?rowFilterByLowVariability is a ready-to-use filter function.
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Compute
rowFilterByLowVariability(
x = x,
g = g,
method = 'sd',
# Percentage of features with highest variability to keep
percentile = 0.25
)
#> f1 f2 f3 f4 f5
#> TRUE FALSE FALSE FALSE FALSE
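The idea of keeping a top fraction of features can be sketched in base R with a quantile cutoff. The example uses made-up variability scores for eight hypothetical features and keeps those above the 75th percentile, i.e. the top 25%. How featscreen handles ties and combines groups is not covered here, so the sketch is not meant to reproduce the call above.

```r
# Made-up variability scores for eight hypothetical features
scores = c(f1 = 12, f2 = 30, f3 = 18, f4 = 45,
           f5 = 22, f6 = 9, f7 = 27, f8 = 36)

# Fraction of most variable features to keep
top.fraction = 0.25

# Cutoff at the (1 - top.fraction) quantile of the scores
cutoff = stats::quantile(x = scores, probs = 1 - top.fraction, names = FALSE)

# Keep the features whose score lies strictly above the cutoff
keep = scores > cutoff
names(keep)[keep]
#> [1] "f4" "f8"
```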