An introduction to `featscreen`

Introduction

featscreen is a versatile and user-friendly package designed to simplify the process of supervised and unsupervised feature screening in R.

The cornerstone of featscreen is the ?screen function, which performs the three screening steps:

1. Computation of the Statistic: A screening statistic (for example, a test statistic or a missing-value ratio) is computed for each feature.
2. Aggregation of Results for Multi-Response Data: When the data involve multiple responses, the results from step 1 are aggregated before selection.
3. Selection of Features to Keep: Features are retained based on the screening results. The package offers a range of common selection options.

Ready to dive in? Let’s walk through the steps of using featscreen to perform your first feature screening analysis.

Setup

First, we load featscreen (knitr is only used through the knitr:: prefix, so it does not need to be attached):

library(featscreen)
#> 
#> Attaching package: 'featscreen'
#> The following objects are masked from 'package:stats':
#> 
#>     mad, sd
#> The following object is masked from 'package:graphics':
#> 
#>     screen
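
The messages above mean that, once featscreen is attached, unqualified calls to sd, mad, and screen resolve to the featscreen versions. If the original functions from stats or graphics are needed, they can still be reached with the :: operator. The following is a minimal base R sketch (it does not use featscreen, and the vector v is made up for illustration):

```r
#Namespace-qualified calls always reach the original functions,
#regardless of which packages are attached
v = c(2, 4, 4, 4, 5, 5, 7, 9)

#Sample standard deviation from the stats package
stats::sd(v)

#Median absolute deviation from the stats package
stats::mad(v)
```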

Seed

Next, we set a seed for random number generation (RNG). By default, each R session creates its seed from the current time and the process ID, so simulation results differ between sessions. Fixing a seed makes the results of this vignette reproducible. We can specify a seed by calling ?set.seed.

#Set a seed for RNG
set.seed(
  #A seed
  seed = 5381L,                   #a randomly chosen integer value
  #The kind of RNG to use
  kind = "Mersenne-Twister",      #we make explicit the current R default value
  #The kind of Normal generation
  normal.kind = "Inversion"       #we make explicit the current R default value
)
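
As a quick check of what fixing the seed provides, the sketch below resets the same seed before two separate draws and confirms that the draws are identical. It uses only base R; the variable names are illustrative.

```r
#Draw three values after setting the seed
set.seed(seed = 5381L)
first.draw = rnorm(3)

#Reset the same seed and draw again
set.seed(seed = 5381L)
second.draw = rnorm(3)

#The two draws are identical
identical(first.draw, second.draw)
#> [1] TRUE
```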

1. Computation of the Statistic

The first step in the feature screening process is the computation of the statistic used for the subsequent feature selection.

featscreen offers a diverse range of both supervised and unsupervised screening techniques.

You can explore the currently supported screening methods by calling the ?listAvailableScreeningMethods function. This function returns a table with at least two columns:

id: The unique identifier for each screening method, used in function calls.
name: The name of the screening method.

Screening Methods

To view the list of currently supported supervised screening methods, use the ?listAvailableScreeningMethods function with the x parameter set to supervised:

#list screening methods
screening.methods = listAvailableScreeningMethods(x = 'supervised')

#print in table
knitr::kable(x = screening.methods, align = 'rc')

id	name
pearson	Pearson’s product moment correlation coefficient t-test
spearman	Spearman’s rank correlation coefficient t-test
kendall	Kendall’s rank correlation coefficient t-test
t.test.equal	two-sample Student’s pooled t-test
t.test.unequal	two-sample t-test with the Welch modification to the degrees of freedom
t.test.paired	paired two-sample Student’s t-test
w.test.ranksum	two-sample Mann-Whitney U-test
w.test.paired	paired two-sample Wilcoxon signed-rank test
anova.equal	one-way analysis of variance F-test
anova.unequal	one-way analysis of variance F-test with Welch correction
kruskal.wallis	Kruskal-Wallis H-test
chisq.test	Pearson’s χ²-test
coxph	Cox PH regression coefficient z-test
moderated.t	empirical Bayes moderated t-test
moderated.F	empirical Bayes moderated F-test
sam.test	significant analysis of microarrays permutation test

Similarly, we can use the x parameter to show the unsupervised methods:

#list screening methods
screening.methods = listAvailableScreeningMethods(x = 'unsupervised')

#print in table
knitr::kable(x = screening.methods, align = 'rc')

id	name
missing.value	missing value ratio
above.median	above-median frequency ratio
above.minimum	above-minimum frequency ratio
median	median value
variability	variability

Screening Functions

The names of the screening functions can be retrieved by calling ?listAvailableScreeningFunctions.

Again, you can use the x parameter to show only the supervised methods:

#list screening functions
screening.functions = listAvailableScreeningFunctions(x = 'supervised')

#print in table
knitr::kable(x = screening.functions, align = 'rc')

id	name
pearson	rowPearsonCor
spearman	rowSpearmanCor
kendall	rowKendallCor
t.test.equal	rowEqualVarT
t.test.unequal	rowUnequalVarT
t.test.paired	rowPairedT
w.test.ranksum	rowWilcoxonT
w.test.paired	rowPairedWilcoxonT
anova.equal	rowEqualVarOneWayAnova
anova.unequal	rowUnequalVarOneWayAnova
kruskal.wallis	rowKruskalWallis
chisq.test	rowPearsonChiSq
coxph	rowCoxPH
moderated.t	rowModeratedT
moderated.F	rowModeratedOneWayAnova
sam.test	rowSamStatistics

Similarly, you can use the x parameter to show only the unsupervised methods:

#list screening functions
screening.functions = listAvailableScreeningFunctions(x = 'unsupervised')

#print in table
knitr::kable(x = screening.functions, align = 'rc')

id	name
missing.value	rowMissingValueRatio
above.median	rowAboveMedianFreqRatio
above.minimum	rowAboveMinFreqRatio
median	rowMedians
variability	rowVariability

Each function is documented. To learn more about a specific function, use the ? operator. For example, let’s check the function ?rowPearsonCor.

#See documentation
?rowPearsonCor

From the documentation, we can see that this is a wrapper function that internally calls ?row_cor_pearson from the matrixTests package.

It accepts four arguments:

x: A matrix or a data frame.
y: A numerical vector.
alternative: The alternative hypothesis for each row of x.
conf.level: The confidence levels of the intervals.

Let’s compute Pearson’s correlation test for each row of a matrix by using this function:

#Data
x = matrix(rnorm(10 * 20), 10, 20)
y = rnorm(20)

#Compute
rowPearsonCor(x = x, y = y)
#> $statistic
#>  [1]  1.55424671 -3.21135119 -0.48824958 -0.62940963  0.01709722 -1.30914381
#>  [7] -1.07986577  1.11138904 -0.42259796  0.39229005
#> 
#> $significance
#>  [1] 0.137531731 0.004840161 0.631267786 0.536988472 0.986547148 0.206952421
#>  [7] 0.294459519 0.281026013 0.677593887 0.699450166

The function returns a list with two elements:

statistic: A vector containing the values of the correlation coefficient t-test statistic.
significance: A vector containing the p-values of the correlation coefficient t-test.
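
Note that statistic holds the t statistic of the correlation test, not the correlation coefficient itself. The relationship between the two can be sketched in base R: for a correlation r computed on n observations, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The example below is an illustration on made-up data, assuming a two-sided alternative; it is not the package's implementation, and the result is cross-checked against stats::cor.test.

```r
#Toy data: 3 features (rows) measured on 20 observations (columns)
set.seed(1)
x.toy = matrix(rnorm(3 * 20), nrow = 3, ncol = 20)
y.toy = rnorm(20)
n = ncol(x.toy)

#Pearson's correlation of each row with y
r = apply(X = x.toy, MARGIN = 1, FUN = cor, y = y.toy)

#t statistic and two-sided p-value with n - 2 degrees of freedom
t.stat = r * sqrt((n - 2) / (1 - r^2))
p.val  = 2 * pt(q = -abs(t.stat), df = n - 2)

#Cross-check the first row against stats::cor.test
ct = cor.test(x = x.toy[1, ], y = y.toy, method = "pearson")
all.equal(unname(ct$statistic), t.stat[1])
all.equal(ct$p.value, p.val[1])
```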

2. Aggregation of Results

featscreen provides a straightforward function named ?multiresponse for the aggregation of results from multi-response data.

Let’s explore this function:

#See documentation
?multiresponse

From the documentation, we learn that the function accepts three arguments:

x: A matrix containing the values to combine.
multi: The aggregation strategy to use. There are five available options:
- max: Keep the maximum value across responses.
- min: Keep the minimum value across responses.
- avg: Average the values.
- sum: Sum up the values.
- idx: Return the column indicated by the argument idx.
idx: An optional integer value or string indicating the column of x to keep.
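
To make the five strategies concrete, here is a base R sketch of what each one computes on a small matrix. It assumes that rows are features and columns are responses, so that each strategy yields one value per feature; the matrix m and the list agg are made up for illustration, and this is not the code used by ?multiresponse.

```r
#Toy matrix: 3 features (rows), 2 responses (columns)
m = matrix(
  data = c(0.1, 0.8,
           0.4, 0.2,
           0.3, 0.3),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(paste0("feature", 1:3), c("a", "b"))
)

#Base R counterparts of the five strategies (one value per feature)
agg = list(
  max = apply(X = m, MARGIN = 1, FUN = max),
  min = apply(X = m, MARGIN = 1, FUN = min),
  avg = rowMeans(m),
  sum = rowSums(m),
  idx = m[, "b"]      #keep the column named 'b'
)

#Maximum across responses for each feature
agg$max
```

For feature1, for example, the maximum across the two responses is 0.8 and the average is 0.45.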

3. Selection of Features

The final step in the feature screening process involves the selection of features to retain based on the computed statistics.

Selection Methods

The names of the selection methods can be retrieved by calling ?listAvailableSelectionMethods:

#list selection methods
selection.methods = listAvailableSelectionMethods()

#print in table
knitr::kable(x = selection.methods, align = 'rc')

id	name
cutoff	selection by cutoff
rank	selection by ranking
percentile	selection by top percentile
fpr	selection by false positive rate
fdr	selection by false discovery rate

featscreen provides five selection strategies:

cutoff: Features are selected by a cutoff on their values.
rank: Features are selected based on their ranking.
percentile: Features are selected by a percentile of the highest values.
fpr: Features are selected by a cutoff on the false positive rates.
fdr: Features are selected by a cutoff on the false discovery rates.
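
The five strategies can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy vectors score and p, the thresholds, the direction of each comparison, and the use of the Benjamini-Hochberg adjustment for the false discovery rate are choices made here, and the featscreen functions may define these details differently.

```r
#Toy data for 8 features: a score (higher is better) and p-values
score = c(5.1, 4.2, 2.9, 2.6, 1.3, 0.8, 0.4, 0.1)
p     = c(0.001, 0.004, 0.020, 0.030, 0.200, 0.450, 0.700, 0.950)
alpha = 0.05

#cutoff: keep scores above a threshold
keep.cutoff = score > 2

#rank: keep the 3 highest scores
keep.rank = rank(-score, ties.method = "min") <= 3

#percentile: keep scores at or above the 75th percentile
keep.percentile = score >= quantile(x = score, probs = 0.75)

#fpr: keep features with unadjusted p-value below alpha
keep.fpr = p < alpha

#fdr: keep features with Benjamini-Hochberg adjusted p-value below alpha
keep.fdr = p.adjust(p = p, method = "BH") < alpha

#Number of features kept by each rule
sapply(
  X = list(cutoff = keep.cutoff, rank = keep.rank,
           percentile = keep.percentile, fpr = keep.fpr, fdr = keep.fdr),
  FUN = sum
)
```

On these toy values the unadjusted rule keeps four features while the adjusted rule keeps two, which shows how a false discovery rate criterion is stricter than a false positive rate criterion at the same threshold.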

Selection Functions

The names of the selection functions can be retrieved by calling ?listAvailableSelectionFunctions.

#list selection functions
selection.functions = listAvailableSelectionFunctions()

#print in table
knitr::kable(x = selection.functions, align = 'rc')

id	name
cutoff	selectByCutoff
rank	selectByRanking
percentile	selectByPercentile
fpr	selectByFpr
fdr	selectByFdr

Each function is documented. To learn more about a specific function, use the ? operator. For example, let’s check the function ?selectByCutoff.

#See documentation
?selectByCutoff

From the documentation, we can see that the function accepts three arguments:

x: A numerical vector.
cutoff: The cutoff to use in the selection.
operator: A string indicating the relational operator to use.

Let’s select elements of a vector by using this function.

#Select by Cutoff
selectByCutoff(
  x = seq_len(10),
  cutoff = 3,
  operator = "<"
)
#>  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

As reported in the documentation, the function returns a logical vector of the same length as the input vector, indicating which elements to retain.
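
One way a relational operator passed as a string can be turned into a comparison is through match.fun, which looks up the function with that name. The helper toyCutoff below is hypothetical and written only for illustration; it is not the featscreen implementation of ?selectByCutoff.

```r
#Hypothetical helper (not part of featscreen): apply a relational
#operator given as a string
toyCutoff = function(x, cutoff, operator = "<"){
  #Restrict the input to relational operators
  operator = match.arg(
    arg = operator,
    choices = c("<", "<=", ">", ">=", "==", "!=")
  )
  #Retrieve the operator as a function and apply it element-wise
  match.fun(operator)(x, cutoff)
}

toyCutoff(x = seq_len(10), cutoff = 3, operator = "<")
#>  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```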

`screen` Function

Rather than using individual screening functions, you can streamline the process by using the ?screen function. This function provides a unified interface for various screening methods, incorporating all three steps of the screening process.

Parameters

The input parameters are:

x: Either a matrix or a data frame where rows represent features, and columns represent observations.
y: Numeric vector of data values.
g: (Optional) Vector specifying the group for corresponding elements of x.
method: One of the supported screening techniques. See ?listAvailableScreeningMethods.
...: Additional arguments to the screening function.
multi: Strategy to adopt in case of multi-response data.
idx: (Optional) Integer value or character string indicating the column of the multi-response results to keep; used when multi is idx.
select.by: The selection method to use.
select.args: A named list containing the arguments to be passed to the selecting function.

Output

The function returns an object of class ?featscreen, representing a set of screened features.

`featscreen` S3 Class

The featscreen class represents a set of screened features.

A featscreen object is a list consisting of eight elements:

method: The id of the used screening method.
multi: The id of the used multi-response aggregation method.
selection: The id of the used selection method.
summary: A textual summary of the screening.
n: The dimension of the feature space.
features: The feature names.
keep: The features to keep.
ranks: The feature ranks.

Functions to facilitate access to the data stored in a featscreen object are available:

?getScreeningMethodId: Returns the screening method id.
?getMultiresponseAggregationMethodId: Returns the multi-response aggregation method id.
?getSelectionMethodId: Returns the selection method id.
?getSummary: Returns the textual summary of the screening.
?getFeatureDimensionality: Returns the dimension of the feature space.
?getFeatureNames: Returns the feature names.
?getScreenedFeatures: Returns the features to keep.
?getFeatureRanks: Returns the feature ranks.

Another useful function is print:

?print.featscreen: Print a summary of the featscreen object.
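
For readers unfamiliar with S3, the pattern described above (a list with a class attribute, accessor functions, and a print method) can be sketched with a toy class. Everything in this example (the class toyscreen, the constructor newToyScreen, and the accessor getKept) is hypothetical and is not the featscreen implementation.

```r
#Hypothetical toy class (not the featscreen implementation)
newToyScreen = function(features, keep){
  structure(
    .Data = list(n = length(features), features = features, keep = keep),
    class = "toyscreen"
  )
}

#Accessor: names of the retained features
getKept = function(object){
  object$features[object$keep]
}

#S3 print method, dispatched by print() on objects of class 'toyscreen'
print.toyscreen = function(x, ...){
  cat(sum(x$keep), "out of", x$n, "features selected.\n")
  invisible(x)
}

#Create an object, print it, and access the retained features
toy = newToyScreen(
  features = c("f1", "f2", "f3"),
  keep     = c(TRUE, FALSE, TRUE)
)

print(toy)
#> 2 out of 3 features selected.

getKept(toy)
#> [1] "f1" "f3"
```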

Feature Screening

Now let’s use the ?screen function to subset a feature space by using an unsupervised screening method.

First, let’s create the data.

#Define row/col size
nr = 100
nc = 20

#Data
x = matrix(
  data = sample(x = c(1,2), size = nr*nc, replace = TRUE),
  nrow = nr,
  ncol = nc,
  dimnames = list(
    paste0("feature",seq(nr)),
    paste0("S",seq(nc))
  )
)

#Define grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))

#Force 1st feature to have 40% of missing values
x[1,seq(nc*0.4)] = NA

A common unsupervised screening step is to remove a feature if its proportion of missing values exceeds a defined cutoff.

1. Select Screening Function

If we look at the table returned by ?listAvailableScreeningMethods, we can see that the id for feature selection by missing value ratio is missing.value.

Looking at the table returned by ?listAvailableScreeningFunctions, we can see that the function corresponding to missing.value is ?rowMissingValueRatio.

From the documentation, we can see that the function accepts two arguments:

x: matrix or data frame, where rows are features and columns are observations
g: optional grouping vector or factor

If g is provided, this function returns a matrix with missing-value ratios for each group as column vectors.
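
The difference between the overall and the per-group ratio can be sketched in base R. The example assumes that the ratio is the proportion of NA values per row, computed separately within each level of g; the toy objects x.toy and g.toy are made up, and the code is an illustration rather than the implementation of ?rowMissingValueRatio.

```r
#Toy data: 2 features (rows), 4 observations (columns)
x.toy = matrix(
  data = c(NA, NA, 1,  2,
            1,  2, 1, NA),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("feature1", "feature2"), paste0("S", 1:4))
)
g.toy = c("a", "a", "b", "b")

#Overall missing-value ratio per feature
rowMeans(is.na(x.toy))

#Missing-value ratio per feature within each group (one column per group)
ratio.by.group = sapply(
  X = split(x = seq_along(g.toy), f = g.toy),
  FUN = function(i) rowMeans(is.na(x.toy[, i, drop = FALSE]))
)
ratio.by.group
```

Here feature1 has half of its values missing overall, but all of them fall in group a, so its ratio is 1 in group a and 0 in group b.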

Let’s start building our call.

We can use missing.value as the method argument in ?screen.

To provide an example of multi-response aggregation, we will also provide the grouping vector through the g argument of ?screen.

2. Choose aggregation strategy

As aggregation strategy we choose to keep the maximum value across responses. We can do this by providing max as the multi argument in ?screen.

3. Select Selection Function

Now that we have defined the screening function and the aggregation strategy, we need to identify the selection method.

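Here we again use ?selectByCutoff, which corresponds to the cutoff selection method.

We want to remove features with 50% or more missing values in at least one group. Because the cutoff defines the features to keep, we provide a list containing cutoff (0.5) and operator (<) as the select.args argument in ?screen, so that only features whose aggregated ratio is below 0.5 are retained.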

4. Run!

Now let’s select the features:

#Remove a feature if 50% or more of its values are missing in any group
obj = screen(
  x           = x,
  g           = g,
  method      = "missing.value",
  multi       = "max",
  select.by   = "cutoff",
  select.args = list(cutoff = 0.5, operator = "<")
)

We can check if the returned object is of class featscreen:

#Is obj of class `featscreen`?
is.featscreen(obj)
#> [1] TRUE

Now we can print a summary:

#Print
print(x = obj, show.top = FALSE)
#> 
#> 99 out of 100 features selected by a cutoff (< 0.5) on the missing
#> value ratio.
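
It may seem surprising that one feature is removed, since the first feature has only 40% of missing values overall. The reason is that the ratio is computed within each group and then aggregated with max: all 8 missing values fall in the 10 samples of group a, giving a ratio of 0.8 there. The base R sketch below reproduces this arithmetic. It assumes the per-group ratio is the proportion of NA values per row, and the names x.chk, g.chk, ratio, and ratio.max are illustrative.

```r
#Rebuild the NA pattern of the example (only the NA positions matter)
nr = 100
nc = 20
x.chk = matrix(data = 1, nrow = nr, ncol = nc)
x.chk[1, seq(nc * 0.4)] = NA
g.chk = c(rep("a", nc/2), rep("b", nc/2))

#Missing-value ratio per group, then maximum across groups
ratio = sapply(
  X = split(x = seq_len(nc), f = g.chk),
  FUN = function(i) rowMeans(is.na(x.chk[, i, drop = FALSE]))
)
ratio.max = apply(X = ratio, MARGIN = 1, FUN = max)

#Feature 1: 40% missing overall, but 80% within group 'a'
mean(is.na(x.chk[1, ]))
#> [1] 0.4
ratio[1, ]

#Number of features passing the cutoff
sum(ratio.max < 0.5)
#> [1] 99
```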

Next Steps

Congratulations! You’ve successfully completed your first feature screening analysis.

As you continue to explore the package, refer to the documentation for detailed information on each screening method and additional customization options.

Start your featscreen journey today and unlock the full potential of efficient and effective feature screening in R!