Introduction
featscreen is a versatile and user-friendly package designed to simplify the process of supervised and unsupervised feature screening in R.
The cornerstone of featscreen is the ?screen function, which can be used to perform the three screening steps:
- Computation of the Statistic: A statistic is computed for each feature with one of the supported supervised or unsupervised screening methods.
- Aggregation of Results for Multi-Response Data: In scenarios where your data involves multiple responses, the results from step 1 are aggregated before selection.
- Selection of Features to Keep: Features are retained based on the screening results. The package offers a range of common options for feature selection.
Ready to dive in? Let’s walk through the steps of using featscreen to perform your first feature screening analysis.
Setup
First, we load featscreen:
library(featscreen)
#>
#> Attaching package: 'featscreen'
#> The following objects are masked from 'package:stats':
#>
#> mad, sd
#> The following object is masked from 'package:graphics':
#>
#> screen
Seed
Now we set a seed for the random number generation (RNG). By default, each R session creates its seed from the current time and the process ID, so simulation results differ between sessions. Fixing a seed ensures that the results of this vignette can be reproduced. We can specify a seed by calling ?set.seed.
#Set a seed for RNG
set.seed(
#A seed
seed = 5381L, #a randomly chosen integer value
#The kind of RNG to use
kind = "Mersenne-Twister", #we make explicit the current R default value
#The kind of Normal generation
normal.kind = "Inversion" #we make explicit the current R default value
)
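As a quick check, here is a minimal sketch in base R: resetting the same seed before two draws should yield identical random numbers. The variable names a and b are only for illustration.

```r
#Draw three values after fixing the seed
set.seed(seed = 5381L, kind = "Mersenne-Twister", normal.kind = "Inversion")
a = rnorm(3)
#Reset the same seed and draw again
set.seed(seed = 5381L, kind = "Mersenne-Twister", normal.kind = "Inversion")
b = rnorm(3)
#The two draws are identical
identical(a, b)
#> [1] TRUE
```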
1. Computation of the Statistic
The initial step in the feature screening process involves the computation of a statistic that will be utilized for subsequent feature selection.
featscreen offers a diverse range of both supervised and unsupervised screening techniques.
You can explore the currently supported screening methods by calling the ?listAvailableScreeningMethods function. This function returns a table with at least two columns:
- id: The unique identifier for each screening method, used in function calls.
- name: The name of the screening method.
Screening Methods
To view the list of currently supported supervised screening methods, use the ?listAvailableScreeningMethods function with the x parameter set to 'supervised':
#list screening methods
screening.methods = listAvailableScreeningMethods(x = 'supervised')
#print in table
knitr::kable(x = screening.methods, align = 'rc')
id | name |
---|---|
pearson | Pearson’s product moment correlation coefficient t-test |
spearman | Spearman’s rank correlation coefficient t-test |
kendall | Kendall’s rank correlation coefficient t-test |
t.test.equal | two-sample Student’s pooled t-test |
t.test.unequal | two-sample t-test with the Welch modification to the degrees of freedom |
t.test.paired | paired two-sample Student’s t-test |
w.test.ranksum | two-sample Mann-Whitney U-test |
w.test.paired | paired two-sample Wilcoxon signed-rank test |
anova.equal | one-way analysis of variance F-test |
anova.unequal | one-way analysis of variance F-test with Welch correction |
kruskal.wallis | Kruskal-Wallis H-test |
chisq.test | Pearson’s χ²-test |
coxph | Cox PH regression coefficient z-test |
moderated.t | empirical Bayes moderated t-test |
moderated.F | empirical Bayes moderated F-test |
sam.test | significant analysis of microarrays permutation test |
Similarly, we can set the x parameter to 'unsupervised' to show the unsupervised methods:
#list screening methods
screening.methods = listAvailableScreeningMethods(x = 'unsupervised')
#print in table
knitr::kable(x = screening.methods, align = 'rc')
id | name |
---|---|
missing.value | missing value ratio |
above.median | above-median frequency ratio |
above.minimum | above-minimum frequency ratio |
median | median value |
variability | variability |
Screening Functions
The names of the screening functions can be retrieved by calling ?listAvailableScreeningFunctions.
Again, you can use the x parameter to show only the supervised methods:
#list screening functions
screening.functions = listAvailableScreeningFunctions(x = 'supervised')
#print in table
knitr::kable(x = screening.functions, align = 'rc')
id | name |
---|---|
pearson | rowPearsonCor |
spearman | rowSpearmanCor |
kendall | rowKendallCor |
t.test.equal | rowEqualVarT |
t.test.unequal | rowUnequalVarT |
t.test.paired | rowPairedT |
w.test.ranksum | rowWilcoxonT |
w.test.paired | rowPairedWilcoxonT |
anova.equal | rowEqualVarOneWayAnova |
anova.unequal | rowUnequalVarOneWayAnova |
kruskal.wallis | rowKruskalWallis |
chisq.test | rowPearsonChiSq |
coxph | rowCoxPH |
moderated.t | rowModeratedT |
moderated.F | rowModeratedOneWayAnova |
sam.test | rowSamStatistics |
Similarly, you can use the x parameter to show only the unsupervised methods:
#list screening functions
screening.functions = listAvailableScreeningFunctions(x = 'unsupervised')
#print in table
knitr::kable(x = screening.functions, align = 'rc')
id | name |
---|---|
missing.value | rowMissingValueRatio |
above.median | rowAboveMedianFreqRatio |
above.minimum | rowAboveMinFreqRatio |
median | rowMedians |
variability | rowVariability |
Each function is documented. To learn more about a specific method, use the ? operator. For example, let’s check the function ?rowPearsonCor.
#See documentation
?rowPearsonCor
From the documentation, we can see that this is a wrapper function that internally calls ?row_cor_pearson from the matrixTests package.
It accepts four arguments in input:
- x: A matrix or a data frame.
- y: A numerical vector.
- alternative: The alternative hypothesis for each row of x.
- conf.level: The confidence levels of the intervals.
Let’s compute Pearson’s correlation for each row of a matrix by using this function:
#Data
x = matrix(rnorm(10 * 20), 10, 20)
y = rnorm(20)
#Compute
rowPearsonCor(x = x, y = y)
#> $statistic
#> [1] 1.55424671 -3.21135119 -0.48824958 -0.62940963 0.01709722 -1.30914381
#> [7] -1.07986577 1.11138904 -0.42259796 0.39229005
#>
#> $significance
#> [1] 0.137531731 0.004840161 0.631267786 0.536988472 0.986547148 0.206952421
#> [7] 0.294459519 0.281026013 0.677593887 0.699450166
The function returns a list with two elements:
- statistic: A vector containing the values of the correlation coefficient t-test statistic.
- significance: A vector containing the p-values of the correlation coefficient t-test.
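To make the two elements more concrete, here is a minimal base R sketch that computes the same kind of result row by row with stats::cor.test. It assumes that the wrapper reports the t statistic and the two-sided p-value of the Pearson correlation test. The helper name rowPearsonCorSketch is hypothetical and is not part of featscreen.

```r
#Hypothetical helper (not part of featscreen): row-wise Pearson correlation t-test
rowPearsonCorSketch = function(x, y){
  #Test each row of x against y
  tests = apply(X = x, MARGIN = 1, FUN = function(r){
    out = stats::cor.test(x = r, y = y, method = "pearson")
    c(statistic = unname(out$statistic), significance = out$p.value)
  })
  #Return a list with the same two element names
  list(
    statistic    = tests["statistic", ],
    significance = tests["significance", ]
  )
}

#Data
x = matrix(rnorm(10 * 20), 10, 20)
y = rnorm(20)
#Compute
res = rowPearsonCorSketch(x = x, y = y)
#One t statistic and one p-value per row of x
lengths(res)
```

For a single row, the t statistic equals r * sqrt(n - 2) / sqrt(1 - r^2), where r is the sample correlation and n is the number of observations.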
2. Aggregation of Results
featscreen provides a straightforward function named ?multiresponse for the aggregation of results from multi-response data.
Let’s explore this function:
#See documentation
?multiresponse
From the documentation, we learn that the function accepts three arguments:
- x: A matrix containing the values to combine.
- multi: The aggregation strategy to use. There are five available options:
  - max: Keep the maximum value across responses.
  - min: Keep the minimum value across responses.
  - avg: Average the values.
  - sum: Sum up the values.
  - idx: Return the column indicated by the argument idx.
- idx: An optional integer value or string indicating the column of x to keep.
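The five strategies can be illustrated with base R on a small matrix of per-response values. This is a sketch of what each option describes, not the package's implementation; the matrix m and its values are made up for illustration.

```r
#Toy matrix: 2 features (rows) x 2 responses (columns)
m = matrix(
  data = c(0.1, 0.4, 0.8, 0.2),
  nrow = 2,
  dimnames = list(c("f1", "f2"), c("a", "b"))
)
#max: keep the maximum value across responses (f1 = 0.8, f2 = 0.4)
agg.max = apply(X = m, MARGIN = 1, FUN = max)
#min: keep the minimum value across responses (f1 = 0.1, f2 = 0.2)
agg.min = apply(X = m, MARGIN = 1, FUN = min)
#avg: average the values (f1 = 0.45, f2 = 0.3)
agg.avg = rowMeans(m)
#sum: sum up the values (f1 = 0.9, f2 = 0.6)
agg.sum = rowSums(m)
#idx: return one column, here the one named "b" (f1 = 0.8, f2 = 0.2)
agg.idx = m[, "b"]
```

In every case a matrix with one column per response is reduced to a single value per feature, which is what the selection step expects.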
3. Selection of Features
The final step in the feature screening process involves the selection of features to retain based on the computed statistics.
Selection Methods
The names of the selection methods can be retrieved by calling ?listAvailableSelectionMethods:
#list selection methods
selection.methods = listAvailableSelectionMethods()
#print in table
knitr::kable(x = selection.methods, align = 'rc')
id | name |
---|---|
cutoff | selection by cutoff |
rank | selection by ranking |
percentile | selection by top percentile |
fpr | selection by false positive rate |
fdr | selection by false discovery rate |
featscreen provides five selection strategies:
- cutoff: Features are selected by a cutoff on their values.
- rank: Features are selected based on their ranking.
- percentile: Features are selected by a percentile of the highest values.
- fpr: Features are selected by a cutoff on the false positive rates.
- fdr: Features are selected by a cutoff on the false discovery rates.
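The difference between these strategies can be sketched in base R. The snippet below does not call featscreen and makes several assumptions: higher scores are treated as better for rank and percentile, fpr is taken as a cutoff on unadjusted p-values, and fdr as a cutoff on Benjamini-Hochberg adjusted p-values. The package may implement these rules differently, and all values are made up.

```r
#Made-up scores and p-values for 6 features
score = c(5.2, 3.1, 2.4, 2.1, 1.0, 0.2)
p     = c(0.001, 0.012, 0.030, 0.045, 0.200, 0.800)
#rank: keep the 3 features with the highest scores
keep.rank = rank(-score) <= 3
#percentile: keep the features at or above the 50th percentile of the scores
keep.percentile = score >= stats::quantile(x = score, probs = 0.5)
#fpr: keep features with an unadjusted p-value below 0.05
keep.fpr = p < 0.05
#fdr: keep features with a Benjamini-Hochberg adjusted p-value below 0.05
keep.fdr = stats::p.adjust(p = p, method = "BH") < 0.05
#Number of features kept by each rule
c(rank = sum(keep.rank), percentile = sum(keep.percentile),
  fpr = sum(keep.fpr), fdr = sum(keep.fdr))
```

In this toy example the fdr rule keeps 2 features while the fpr rule keeps 4, because the adjustment for multiple testing raises the third and fourth p-values above 0.05.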
Selection Functions
The names of the selection functions can be retrieved by calling ?listAvailableSelectionFunctions.
#list selection functions
selection.functions = listAvailableSelectionFunctions()
#print in table
knitr::kable(x = selection.functions, align = 'rc')
id | name |
---|---|
cutoff | selectByCutoff |
rank | selectByRanking |
percentile | selectByPercentile |
fpr | selectByFpr |
fdr | selectByFdr |
Each function is documented. To learn more about a specific method, use the ? operator. For example, let’s check the function ?selectByCutoff.
#See documentation
?selectByCutoff
From the documentation, we can see that the function accepts three arguments in input:
- x: A numerical vector.
- cutoff: The cutoff to use in the selection.
- operator: A string indicating the relational operator to use.
Let’s select elements of a vector by using this function.
#Select by Cutoff
selectByCutoff(
x = seq_len(10),
cutoff = 3,
operator = "<"
)
#> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
As reported in the documentation, the function returns a logical vector of the same length as the input vector, indicating which elements to retain.
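Because the result is a logical vector, it can be used directly to subset the input. A minimal base R sketch, where the logical vector is built by hand to mirror the output above rather than by calling the package:

```r
#Input vector
x = seq_len(10)
#Logical vector equivalent to the output above (x < 3)
keep = x < 3
#Retain only the selected elements
x[keep]
#> [1] 1 2
```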
screen Function
Rather than using individual screening functions, you can streamline the process by using the ?screen function. This function provides a unified interface for various screening methods, incorporating all three steps of the screening process.
Parameters
The input parameters are:
- x: Either a matrix or a data frame where rows represent features, and columns represent observations.
- y: Numeric vector of data values.
- g: (Optional) Vector specifying the group for corresponding elements of x.
- method: One of the supported screening techniques. See ?listAvailableScreeningMethods.
- ...: Additional arguments to the screening function.
- multi: Strategy to adopt in case of multi-response data.
- idx: (Optional) Integer value or character string indicating the column of x to keep.
- select.by: The selection method to use.
- select.args: A named list containing the arguments to be passed to the selecting function.
Output
The function returns an object of class ?featscreen, representing a set of screened features.
featscreen S3 Class
The featscreen class represents a set of screened features.
A featscreen object is a list consisting of eight elements:
- method: The id of the used screening method.
- multi: The id of the used multi-response aggregation method.
- selection: The id of the used selection method.
- summary: A textual summary of the screening.
- n: The dimension of the feature space.
- features: The feature names.
- keep: The features to keep.
- ranks: The feature ranks.
Functions to facilitate access to the data stored in a featscreen object are available:
- ?getScreeningMethodId: Returns the screening method id.
- ?getMultiresponseAggregationMethodId: Returns the multi-response aggregation method id.
- ?getSelectionMethodId: Returns the selection method id.
- ?getSummary: Returns the textual summary of the screening.
- ?getFeatureDimensionality: Returns the dimension of the feature space.
- ?getFeatureNames: Returns the feature names.
- ?getScreenedFeatures: Returns the features to keep.
- ?getFeatureRanks: Returns the feature ranks.
Another useful function is print:
- ?print.featscreen: Print a summary of the featscreen object.
Feature Screening
Now let’s use the ?screen function to subset a feature space by using an unsupervised screening method.
First, let’s create the data.
#Define row/col size
nr = 100
nc = 20
#Data
x = matrix(
data = sample(x = c(1,2), size = nr*nc, replace = TRUE),
nrow = nr,
ncol = nc,
dimnames = list(
paste0("feature",seq(nr)),
paste0("S",seq(nc))
)
)
#Define grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Force 1st feature to have 40% of missing values (first 8 of 20 samples, all in group "a")
x[1,seq(nc*0.4)] = NA
A common unsupervised screening step is to remove a feature if its proportion of missing values exceeds a defined cutoff.
1. Select Screening Function
If we look at the table returned by ?listAvailableScreeningMethods, we can see that the id for feature selection by missing value ratio is missing.value.
Looking at the table returned by ?listAvailableScreeningFunctions, we can see that the function corresponding to missing.value is ?rowMissingValueRatio.
From the documentation, we can see that the function accepts two arguments in input:
- x: matrix or data frame, where rows are features and columns are observations
- g: optional grouping vector or factor
If g is provided, this function returns a matrix with missing-value ratios for each group as column vectors.
Let’s start building our call.
We can use missing.value as the method argument in ?screen.
To provide an example of multi-response aggregation, we will also provide the g argument through the ellipsis (i.e., the three dots).
2. Choose Aggregation Strategy
As the aggregation strategy, we keep the maximum value across responses. We can do this by providing max as the multi argument in ?screen.
3. Select Selection Function
Now that we have defined the screening function and the aggregation strategy, we need to identify the selection method.
Here we again use the function ?selectByCutoff.
We want to filter out features whose missing value ratio is 50% or more, and we can do this by providing a list containing cutoff and operator as the select.args argument in ?screen.
4. Run!
Now let’s select the features:
#Keep a feature only if its missing value ratio is below 50%
obj = screen(
x = x,
g = g,
method = "missing.value",
multi = "max",
select.by = "cutoff",
select.args = list(cutoff = 0.5, operator = "<")
)
We can check if the returned object is of class featscreen:
#Is obj of class `featscreen`?
is.featscreen(obj)
#> [1] TRUE
Now we can print a summary:
#Print
print(x = obj, show.top = FALSE)
#>
#> 99 out of 100 features selected by a cutoff (< 0.5) on the missing
#> value ratio.
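To see why exactly one feature is dropped, the three steps can be retraced in base R. This is a sketch under the assumption that the missing value ratio is the fraction of NA entries within each group. It does not call featscreen, and it fills the matrix with a constant because only the position of the missing values matters here.

```r
#Same layout as above: 100 features x 20 samples, two groups of 10
nr = 100
nc = 20
x = matrix(data = 1, nrow = nr, ncol = nc)
g = c(rep("a", nc/2), rep("b", nc/2))
#Force 1st feature to have 40% of missing values
x[1, seq(nc*0.4)] = NA
#Step 1: missing value ratio per group (one column per group)
ratio = sapply(
  X = unique(g),
  FUN = function(k) rowMeans(is.na(x[, g == k, drop = FALSE]))
)
#Step 2: keep the maximum value across groups
agg = apply(X = ratio, MARGIN = 1, FUN = max)
#Step 3: keep features with an aggregated ratio below the cutoff
keep = agg < 0.5
#Group-wise ratios of the 1st feature: 0.8 in group a, 0 in group b
ratio[1, ]
#Number of features kept
sum(keep)
#> [1] 99
```

The first feature has 8 of its 10 values in group a missing, so its aggregated ratio is 0.8 and it fails the < 0.5 condition, even though only 40% of its values are missing overall.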
Next Steps
Congratulations! You’ve successfully completed your first feature screening analysis.
As you continue to explore the package, refer to the documentation for detailed information on each screening method and additional customization options.
Start your featscreen journey today and unlock the full potential of efficient and effective feature screening in R!