Takes a balanced sample without replacement from the population. See the Details section below for further information.

Usage

balancedSampleWithoutReplacement(strata, n, prob = NULL)

Arguments

strata

vector of stratum labels, one per element of the population. The population size is length(strata)

n

positive integer value, the sample size

prob

(optional) vector of positive numeric values, the probability weights for drawing each element of strata. If provided, it must have the same length as strata

Value

A vector of length n containing the indices of the randomly sampled observations.

Details

When the number of elements per stratum (the sample size n divided by the number of groups in strata) is at most the number of elements in the minority group in strata, this function performs "random undersampling": elements are removed from the majority stratum so that every group contributes the same number of elements to the sample, regardless of its proportion in the population.
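The undersampling described above can be sketched in base R. This is only an illustration, not the package's implementation: the helper name undersampleSketch is hypothetical, and the sketch assumes that n is a multiple of the number of strata and that prob, when given, is simply subset per stratum.

```r
# Hypothetical helper, NOT the package's implementation:
# draws n / k indices from each of the k strata, without replacement.
undersampleSketch <- function(strata, n, prob = NULL) {
  groups <- unique(strata)
  # Assumption: n is a multiple of the number of strata
  perStratum <- n %/% length(groups)
  # Assumed behaviour: fail when the minority stratum is too small
  if (perStratum > min(table(strata))) {
    stop("not enough elements in the minority stratum")
  }
  idx <- unlist(lapply(groups, function(g) {
    members <- which(strata == g)
    w <- if (is.null(prob)) NULL else prob[members]
    # Sample positions within the stratum, then map back to population indices
    members[sample.int(length(members), size = perStratum, prob = w)]
  }))
  # Shuffle so that the strata are interleaved in the result
  idx[sample.int(length(idx))]
}

set.seed(1)
strata <- c(rep("a", 3), rep("b", 6))
i <- undersampleSketch(strata, n = 6)
# Each stratum contributes the same number of elements
table(strata[i])
```

Sampling positions with sample.int and mapping them back through which() avoids the pitfall of sample() expanding a single-element vector into a range.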

When the number of elements per stratum is greater than the number of elements in the minority group in strata, the function raises an error.
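The error condition above implies an upper bound on n: the number of groups times the size of the minority group. A minimal base R sketch for checking this before sampling follows; maxBalancedN is a hypothetical helper, not part of the package, and it assumes the boundary case is allowed (as in the Examples, where n = 6 is accepted with 3 elements in the minority group).

```r
# Hypothetical helper, not part of the package:
# largest n for which every stratum can contribute n / k elements.
maxBalancedN <- function(strata) {
  counts <- table(strata)
  length(counts) * min(counts)
}

strata <- c(rep("a", 3), rep("b", 6))
maxBalancedN(strata)
#> [1] 6
```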

References

He and Garcia, Learning from Imbalanced Data, IEEE Transactions on Knowledge and Data Engineering (2009)

Author

Alessandro Barberis

Examples

#Set seed for reproducibility
set.seed(seed = 5381L)

#Define strata
strata = c(rep("a", 3),rep("b", 6))

#Check ratio
table(strata)/length(strata)
#> strata
#>         a         b 
#> 0.3333333 0.6666667 

#Balanced random sample
i = balancedSampleWithoutReplacement(
  strata = strata,
  n = 6
)
#Check indices
i
#> [1] 5 1 7 2 3 6
#Check ratio in the sample
table(strata[i])/length(strata[i])
#> 
#>   a   b 
#> 0.5 0.5