
This function filters the rows (features) of the input matrix x according to their proportion of missing values. A feature is kept only if its ratio of missing values is below the threshold defined by max.prop; otherwise it is removed.

See the Details section below for further information.

Usage

rowFilterByMissingValueRatio(x, g = NULL, max.prop = 0.5)

Arguments

x

matrix or data.frame, where rows are features and columns are observations.

g

(optional) vector or factor giving the group of each observation (column) of x.

max.prop

numeric value in the range \([0, 1]\). Maximum proportion of samples with missing values. Defaults to 0.5.

Value

A logical vector of length nrow(x) indicating which rows of x passed the filter.
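
The returned logical vector can be used directly to subset the rows of x. The snippet below is a minimal sketch of that step only: the vector `keep` is written by hand as a stand-in for the function's output, and the toy matrix is an assumption made for illustration.

```r
# Toy matrix: 3 features (rows) x 2 observations (columns)
x <- matrix(
  c(1, NA,
    3, 4,
    NA, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3"), c("S1", "S2"))
)

# Hand-written stand-in for the logical vector returned by the filter
keep <- c(f1 = TRUE, f2 = TRUE, f3 = FALSE)

# Retain only the rows that passed; drop = FALSE keeps the matrix
# structure even if a single row survives
x_filtered <- x[keep, , drop = FALSE]
rownames(x_filtered)
```

Using `drop = FALSE` avoids the result silently collapsing to a plain vector when only one feature passes the filter.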

Details

If g = NULL, for each feature a missing value ratio (MVR) is computed as:

$$MVR_{i} = \frac{\text{Number of missing values in the } i\text{-th feature}}{\text{Total number of observations}}$$

Then, the i-th feature is kept if \(MVR_{i} < max.prop\).
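
The ungrouped criterion above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper name `mvr_keep` and the toy matrix are hypothetical, and the strict `<` comparison is taken from the rule stated above.

```r
# Hypothetical helper (not part of the package): ungrouped MVR filter
mvr_keep <- function(x, max.prop = 0.5) {
  # Proportion of NA entries in each row (feature)
  mvr <- rowMeans(is.na(x))
  # Keep a feature only if its ratio is strictly below the threshold
  mvr < max.prop
}

# Toy matrix: 3 features (rows) x 4 observations (columns)
x <- matrix(
  c(1,  2,  3,  4,
    NA, NA, 3,  4,
    NA, NA, NA, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("f", 1:3), paste0("S", 1:4))
)

mvr <- rowMeans(is.na(x))            # f1 = 0, f2 = 0.5, f3 = 0.75
keep <- mvr_keep(x, max.prop = 0.6)  # f1 TRUE, f2 TRUE, f3 FALSE
```

With max.prop = 0.5 the second feature would also be dropped under this sketch, because its ratio of 0.5 is not strictly below the threshold.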

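If g is provided, a missing value ratio \(MVR_{ij}\) is computed for each feature \(i\) within each group \(j\), i.e. the number of missing values of the feature in group \(j\) divided by the number of observations in group \(j\).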

Then, the i-th feature is kept if \(MVR_{ij} < max.prop\) for every group \(j\); a single group failing the criterion is enough to remove the feature.
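
The grouped criterion can be sketched in base R in the same spirit. Again this is only an illustration: the helper name `mvr_keep_by_group` and the toy data are hypothetical, and the actual function may handle edge cases (for example unused factor levels) differently.

```r
# Hypothetical helper (not part of the package): per-group MVR filter
mvr_keep_by_group <- function(x, g, max.prop = 0.5) {
  g <- as.factor(g)
  stopifnot(length(g) == ncol(x))
  # MVR of every feature within each group (one column per group)
  mvr <- sapply(levels(g), function(lv) {
    rowMeans(is.na(x[, g == lv, drop = FALSE]))
  })
  # Force a features-by-groups matrix, also when x has a single row
  mvr <- matrix(mvr, nrow = nrow(x),
                dimnames = list(rownames(x), levels(g)))
  # Keep a feature only if it passes the threshold in every group
  apply(mvr < max.prop, 1, all)
}

# Toy matrix: 2 features (rows) x 4 observations (columns)
x <- matrix(
  c(NA, NA, 3, 4,
    NA, 2,  3, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("f1", "f2"), paste0("S", 1:4))
)
g <- c("a", "a", "b", "b")

# Ungrouped: f1 has MVR 0.5 and f2 has MVR 0.25, both below 0.6
keep_all <- rowMeans(is.na(x)) < 0.6

# Grouped: f1 has MVR 1 in group "a", so it fails; f2 has 0.5 and 0
keep_grp <- mvr_keep_by_group(x, g, max.prop = 0.6)
```

The toy data shows why grouping matters: a feature whose missing values are concentrated in one group can pass the overall filter and still fail the per-group one.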

Author

Alessandro Barberis

Examples

#Seed
set.seed(1010)

#Define row/col size
nr = 5
nc = 10

#Data
x = matrix(
 data = sample(x = c(1,2), size = nr*nc, replace = TRUE),
 nrow = nr,
 ncol = nc,
 dimnames = list(
   paste0("f",seq(nr)),
   paste0("S",seq(nc))
 )
)

#Grouping variable
g = c(rep("a", nc/2), rep("b", nc/2))

#Force 1st feature to have 40% of missing values
x[1,seq(nc*0.4)] = NA

#Keep features with less than 50% of missing values
rowFilterByMissingValueRatio(x = x, max.prop = 0.5)
#>   f1   f2   f3   f4   f5 
#> TRUE TRUE TRUE TRUE TRUE 

#Keep features with less than 30% of missing values
rowFilterByMissingValueRatio(x = x, max.prop = 0.3)
#>    f1    f2    f3    f4    f5 
#> FALSE  TRUE  TRUE  TRUE  TRUE 

#Set 3rd feature to have 4 missing values in each class (80% per class and overall)
x[3,seq(nc*0.4)] = NA
x[3,(seq(nc*0.4)+nc/2)] = NA

#Keep features with less than 50% of missing values
rowFilterByMissingValueRatio(x = x, max.prop = 0.5)
#>    f1    f2    f3    f4    f5 
#>  TRUE  TRUE FALSE  TRUE  TRUE 

#Keep features with less than 50% of missing values in every group
#(the 4 missing values of f1 all fall in group "a", i.e. 80% of that group)
rowFilterByMissingValueRatio(x = x, max.prop = 0.5, g = g)
#>    f1    f2    f3    f4    f5 
#> FALSE  TRUE FALSE  TRUE  TRUE