Feature Screening

This function computes a statistical measure for each feature in the input. For multi-response data, the per-response screening statistics are then combined as specified by multi. Finally, the features to keep are chosen with the selection method indicated by select.by.

See the Details section below for further information.

Usage

screen(
  x,
  y = NULL,
  g = NULL,
  method = c("cor.test", "pearson", "spearman", "kendall", "t.test", "t.test.equal",
    "t.test.unequal", "t.test.paired", "w.test", "w.test.ranksum", "w.test.paired",
    "anova", "anova.equal", "anova.unequal", "kruskal.wallis", "chisq.test", "coxph",
    "moderated.t", "moderated.F", "sam.test", "missing.value", "above.median",
    "above.minimum", "median", "variability"),
  ...,
  multi = c("max", "min", "avg", "sum", "idx"),
  idx = NULL,
  select.by = c("cutoff", "rank", "percentile", "fpr", "fdr"),
  select.args = NULL
)

Arguments

x

matrix or data.frame, where rows are features and columns are observations.

y

numeric vector of data values with the same length as ncol(x), or a data.frame with two columns, time and status.

g

(optional) vector or factor object giving the group of each observation (column) of x.

method

character string, one of the supported screening techniques.

...

further arguments passed to the screening function.

multi

character string indicating how the screening statistics are combined for multi-response data. Available options are:

"max": the maximum value across responses is kept
"min": the minimum value across responses is kept
"avg": values are averaged
"sum": values are summed up
"idx": return the column indicated by idx

idx

(optional) integer value or character string indicating the column to keep when multi = "idx".

select.by

character string indicating the selection method. Available options are:

"cutoff": selection by cutoff
"rank": selection by ranking
"percentile": selection by top percentile
"fpr": selection by false positive rate
"fdr": selection by false discovery rate

select.args

(optional) named list of arguments to be passed to the selection function.

Value

An object of class featscreen.

Details

This function uses the selected screening technique to compute a statistical measure for each feature.

See the following functions for each specific implementation:

"cor.test": rowCor
"pearson": rowPearsonCor
"spearman": rowSpearmanCor
"kendall": rowKendallCor
"t.test": rowTwoSampleT
"t.test.equal": rowEqualVarT
"t.test.unequal": rowUnequalVarT
"t.test.paired": rowPairedT
"w.test": rowTwoSampleWilcoxonT
"w.test.ranksum": rowWilcoxonT
"w.test.paired": rowPairedWilcoxonT
"anova": rowOneWayAnova
"anova.equal": rowEqualVarOneWayAnova
"anova.unequal": rowUnequalVarOneWayAnova
"kruskal.wallis": rowKruskalWallis
"chisq.test": rowPearsonChiSq
"coxph": rowCoxPH
"moderated.t": rowModeratedT
"moderated.F": rowModeratedOneWayAnova
"sam.test": rowSamStatistics
"missing.value": rowMissingValueRatio
"above.median": rowAboveMedianFreqRatio
"above.minimum": rowAboveMinFreqRatio
"median": rowMedians
"variability": rowVariability

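To give a sense of what a per-feature screening statistic looks like, the sketch below computes a missing value ratio by row in base R. It assumes the ratio is simply the fraction of NA entries per feature; it does not call rowMissingValueRatio, and the matrix m is made up for this illustration.

```r
# Hypothetical data: 3 features (rows) x 5 observations (columns)
m <- matrix(
  c(NA, NA,  1,  2,  1,
     1,  2,  2,  1,  1,
    NA,  1, NA, NA,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("f", 1:3), paste0("S", 1:5))
)

# Fraction of missing values per feature (assumed definition of the ratio)
na_ratio <- rowMeans(is.na(m))

# Features whose ratio is strictly below a cutoff of 0.5
kept <- names(na_ratio)[na_ratio < 0.5]
```

Here f1 and f2 fall below the cutoff, while f3 (3 of 5 values missing) does not.
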
For multi-response data, the screening statistics are then combined using the multiresponse function.
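
As a rough illustration of this combining step, the sketch below applies each multi option to a hypothetical matrix of statistics with one row per feature and one column per response. It uses base R only and is not the multiresponse implementation; the matrix stat and the helper combine_stat are assumptions made for this example, and missing values are not handled.

```r
# Hypothetical screening statistics: 3 features (rows) x 2 responses (columns)
stat <- matrix(
  c(0.2, 0.9, 0.5,   # resp1
    0.4, 0.1, 0.7),  # resp2
  nrow = 3,
  dimnames = list(paste0("f", 1:3), c("resp1", "resp2"))
)

# Illustrative helper (not part of the package)
combine_stat <- function(stat,
                         multi = c("max", "min", "avg", "sum", "idx"),
                         idx = NULL) {
  multi <- match.arg(multi)
  switch(
    multi,
    max = apply(stat, 1, max),  # maximum across responses
    min = apply(stat, 1, min),  # minimum across responses
    avg = rowMeans(stat),       # average across responses
    sum = rowSums(stat),        # sum across responses
    idx = stat[, idx]           # single column selected by idx
  )
}

combined_max <- combine_stat(stat, "max")
combined_idx <- combine_stat(stat, "idx", idx = "resp2")
```

In every case the result is a single value per feature, which is what the selection step then operates on.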

Finally, the features to keep are chosen with the selection method indicated by select.by.

See the following functions for each specific implementation:

"cutoff": selectByCutoff
"rank": selectByRanking
"percentile": selectByPercentile
"fpr": selectByFpr
"fdr": selectByFdr

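The selection options listed above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code of the selectBy* functions: the vectors stat and pval are made up, smaller values are assumed to be better, and the "fdr" case is assumed to use a Benjamini-Hochberg adjustment, which the package may or may not do.

```r
# Hypothetical per-feature statistics and p-values
stat <- c(f1 = 0.40, f2 = 0.05, f3 = 0.75, f4 = 0.20)
pval <- c(f1 = 0.030, f2 = 0.001, f3 = 0.400, f4 = 0.020)

# Cutoff: keep features whose statistic is below the threshold
keep_cutoff <- names(stat)[stat < 0.5]

# Rank: keep the 2 features with the smallest statistic
keep_rank <- names(sort(stat))[seq_len(2)]

# Percentile: keep features at or below the 25th percentile
keep_pct <- names(stat)[stat <= quantile(stat, probs = 0.25)]

# FPR: keep features with a raw p-value below alpha
keep_fpr <- names(pval)[pval < 0.05]

# FDR: keep features with an adjusted p-value below alpha
# (Benjamini-Hochberg is an assumption for this sketch)
keep_fdr <- names(pval)[p.adjust(pval, method = "BH") < 0.05]
```

The direction of each comparison (below or above the threshold) depends on the statistic being screened, so it should be checked against the documentation of the corresponding selection function.
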
Author

Alessandro Barberis

Examples

#Seed
set.seed(1010)

#Define row/col size
nr = 5
nc = 10

# Unsupervised Screening

#Data
x = matrix(
 data = sample(x = c(1,2), size = nr*nc, replace = TRUE),
 nrow = nr,
 ncol = nc,
 dimnames = list(
   paste0("f",seq(nr)),
   paste0("S",seq(nc))
 )
)

#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))

#Force the 1st feature to have 40% missing values
x[1,seq(nc*0.4)] = NA

#Remove a feature if 50% or more of its values are missing
screen(
 x = x,
 method = "missing.value",
 select.args = list(cutoff = 0.5)
)
#> 
#> 5 out of 5 features selected by a cutoff (< 0.5) on the missing value
#> ratio.
#> 
#> Top 5 ranked features: f2, f3, f4, f5, f1
#> 

# Supervised Screening

#Filter by two-sample t-test (cutoff on the t statistic)
screen(
 x = x,
 g = g,
 method = "t.test",
 var = "equal",
 select.args = list(cutoff = 0.5)
)
#> 
#> 4 out of 5 features selected by a cutoff (< 0.5) on the two-sample
#> Student's pooled t-test.
#> 
#> Top 5 ranked features: f2, f5, f4, f1, f3
#>