This function filters the input matrix x depending on the presence of missing values.
A feature is removed if its ratio of missing elements is greater than a given proportion defined by max.prop.
See the Details section below for further information.
Arguments
- x
matrix or data.frame, where rows are features and columns are observations.
- g
(optional) vector or factor object giving the group of each observation (column) of x.
- max.prop
numerical value in the range \([0, 1]\). Maximum proportion of samples with missing values. Defaults to 0.5.
Details
If g = NULL, for each feature a missing value ratio (MVR) is computed as:
$$\text{MVR} = \frac{\text{Number of missing values}}{\text{Total number of observations}}$$
Then, the i-th feature is kept if \(\text{MVR}_{i} < \text{max.prop}\).
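The ungrouped rule can be sketched in base R as follows. This is only an illustration of the formula above, not the package's implementation; the helper name mvr_keep and the toy matrix m are hypothetical.

```r
# Hypothetical helper (not part of the package):
# ungrouped missing value ratio filter
mvr_keep = function(x, max.prop = 0.5) {
  # Proportion of missing values in each row (feature)
  mvr = rowMeans(is.na(x))
  # Keep features whose ratio is strictly below the threshold
  mvr < max.prop
}

# Toy data: 3 features x 4 observations
m = matrix(
  data = c(NA,  1, 1, 1,
           NA, NA, 1, 1,
            1,  1, 1, 1),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(paste0("f", 1:3), paste0("S", 1:4))
)

# MVR is 0.25, 0.50 and 0.00 for f1, f2 and f3
mvr_keep(m, max.prop = 0.5)
#>    f1    f2    f3
#>  TRUE FALSE  TRUE
```

In this sketch f2 is dropped because its ratio equals the threshold and the comparison is strict, as stated in the rule above.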
If g is provided, the missing value ratios \(\text{MVR}_{ij}\) are computed for each group \(j\).
Then, the i-th feature is kept if \(\text{MVR}_{ij} < \text{max.prop}\) for each group \(j\).
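The group-wise rule can be sketched in the same way. Again, this is a minimal base R illustration under the assumption that g has one entry per column of x; the helper name mvr_keep_by_group and the toy data are hypothetical and do not come from the package.

```r
# Hypothetical helper (not part of the package):
# group-wise missing value ratio filter
mvr_keep_by_group = function(x, g, max.prop = 0.5) {
  g = as.factor(g)
  # Start by keeping every feature
  keep = rep(TRUE, nrow(x))
  for (j in levels(g)) {
    # Missing value ratio of each feature within group j
    mvr_j = rowMeans(is.na(x[, g == j, drop = FALSE]))
    # A feature survives only if it passes in every group
    keep = keep & (mvr_j < max.prop)
  }
  names(keep) = rownames(x)
  keep
}

# Toy data: 2 features x 4 observations, 2 groups of 2 observations
m = matrix(
  data = c(NA, 1, 1, 1,
            1, 1, 1, 1),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(paste0("f", 1:2), paste0("S", 1:4))
)
grp = c("a", "a", "b", "b")

# f1 has an overall ratio of 0.25, but 0.50 within group "a"
mvr_keep_by_group(m, g = grp, max.prop = 0.5)
#>    f1    f2
#> FALSE  TRUE
```

The toy feature f1 would pass an ungrouped check at max.prop = 0.5, but it fails here because its missing values are concentrated in a single group.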
Examples
#Seed
set.seed(1010)
#Define row/col size
nr = 5
nc = 10
#Data
x = matrix(
  data = sample(x = c(1,2), size = nr*nc, replace = TRUE),
  nrow = nr,
  ncol = nc,
  dimnames = list(
    paste0("f",seq(nr)),
    paste0("S",seq(nc))
  )
)
#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))
#Force 1st feature to have 40% of missing values
x[1,seq(nc*0.4)] = NA
#Filter out a feature if it has more than 50% of missing values
rowFilterByMissingValueRatio(x = x, max.prop = 0.5)
#> f1 f2 f3 f4 f5
#> TRUE TRUE TRUE TRUE TRUE
#Filter out a feature if it has more than 30% of missing values
rowFilterByMissingValueRatio(x = x, max.prop = 0.3)
#> f1 f2 f3 f4 f5
#> FALSE TRUE TRUE TRUE TRUE
#Set 3rd feature to have 4 missing values in each group (80% per group)
x[3,seq(nc*0.4)] = NA
x[3,(seq(nc*0.4)+nc/2)] = NA
#Filter out a feature if it has more than 50% of missing values
rowFilterByMissingValueRatio(x = x, max.prop = 0.5)
#> f1 f2 f3 f4 f5
#> TRUE TRUE FALSE TRUE TRUE
#Filter out a feature if it has more than 50% of missing values in any group
rowFilterByMissingValueRatio(x = x, max.prop = 0.5, g = g)
#> f1 f2 f3 f4 f5
#> FALSE TRUE FALSE TRUE TRUE