
Assigns the population to k balanced folds. See the Details section below for further information.

Usage

balancedKFolds(strata = NULL, k, undersample = FALSE, prob = NULL)

Arguments

strata

vector giving the stratum of each element in the population. The population size is length(strata).

k

number of folds

undersample

logical, whether to remove elements from the majority strata so that balanced folds can be obtained. Defaults to FALSE.

prob

(optional) vector of positive numeric values giving the probability weights used when sampling the elements of strata. If provided, it must have the same length as strata.

Value

A vector of length length(strata) containing the fold ids. As the examples below show, elements removed by undersampling have NA as their fold id.
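The returned fold ids are typically turned into train/test splits. The following is a minimal sketch of one way to do that, not part of the package: the fold ids are copied from the undersampling example on this page, and the names kept and splits are hypothetical. It assumes that NA fold ids mark elements to be left out of every split.

```r
# Strata and fold ids copied from the undersampling example on this page;
# NA marks an element that was dropped by undersampling
strata <- c(rep(1, 6), rep(2, 12))
folds  <- c(2, 3, 3, 2, 1, 1, NA, NA, 3, NA, NA, 1, 2, NA, 1, 3, NA, 2)

# Indices of the elements that were kept (hypothetical helper name)
kept <- which(!is.na(folds))

# For each fold id, hold that fold out as the test set and train on the rest
splits <- lapply(sort(unique(folds[kept])), function(f) {
  list(
    test  = kept[folds[kept] == f],
    train = kept[folds[kept] != f]
  )
})

# Size of each test set, and the stratum composition of the first one
sapply(splits, function(s) length(s$test))
table(strata[splits[[1]]$test])
```

Filtering on !is.na(folds) first avoids the NA values that a plain folds == f comparison would otherwise propagate into the index vectors.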

Details

If the population is balanced, each element is assigned to one of the k folds so that the proportion of each stratum is preserved in every fold. If the population is unbalanced and undersample = TRUE, random undersampling is applied: elements are removed from the majority strata until the strata are balanced, and the remaining elements are then assigned to folds. If the population is unbalanced and undersample = FALSE, an error is raised.
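The procedure described above can be illustrated with a short base R sketch. This is not the implementation used by balancedKFolds: the function name sketchBalancedFolds is hypothetical, "balanced" is assumed to mean that all strata have the same size, the prob argument is ignored, and the random draws will not reproduce the fold ids shown in the examples.

```r
# Hypothetical helper, NOT the package's implementation.
# Assumption: "balanced" means every stratum has the same number of elements.
sketchBalancedFolds <- function(strata, k, undersample = FALSE) {
  # Positions of the population elements, grouped by stratum
  idx_by_stratum <- split(seq_along(strata), strata)
  sizes <- lengths(idx_by_stratum)

  if (length(unique(sizes)) > 1L) {
    if (!undersample) {
      stop("Data is not balanced. Change 'undersample' to TRUE.")
    }
    # Random undersampling: keep as many elements per stratum
    # as the smallest stratum contains
    n_keep <- min(sizes)
    idx_by_stratum <- lapply(idx_by_stratum, function(idx) {
      idx[sample.int(length(idx), n_keep)]
    })
  }

  # Dropped elements keep NA as their fold id
  folds <- rep(NA_integer_, length(strata))
  for (idx in idx_by_stratum) {
    # Recycle the ids 1..k over the stratum, then shuffle them
    ids <- rep_len(seq_len(k), length(idx))
    folds[idx] <- ids[sample.int(length(idx))]
  }
  folds
}

set.seed(1)
strata <- c(rep(1, 6), rep(2, 12))
folds  <- sketchBalancedFolds(strata, k = 3, undersample = TRUE)

# Cross-tabulate strata against fold ids, including the dropped elements
table(strata, folds, useNA = "ifany")
```

In this sketch, with 6 elements kept per stratum and k = 3, every fold receives exactly 2 elements from each stratum, and 6 elements of the majority stratum are left with NA. Assigning folds within each stratum separately is what keeps the stratum proportions the same across folds.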

Author

Alessandro Barberis

Examples

#Set seed for reproducibility
set.seed(seed = 5381L)

#Define balanced strata
strata = c(rep(1,6),rep(2,6))

#Check ratio
table(strata)/length(strata)
#> strata
#>   1   2 
#> 0.5 0.5 

#Assign data to 3 folds
i = balancedKFolds(
 strata = strata,
 k = 3
)
#Check folds
i
#>  [1] 2 1 1 3 2 3 2 3 1 2 3 1
#Check ratio in the folds
table(strata[i==1])/length(strata[i==1])
#> 
#>   1   2 
#> 0.5 0.5 
table(strata[i==2])/length(strata[i==2])
#> 
#>   1   2 
#> 0.5 0.5 
table(strata[i==3])/length(strata[i==3])
#> 
#>   1   2 
#> 0.5 0.5 

#Define unbalanced strata
strata = c(rep(1,6),rep(2,12))

#Check ratio
table(strata)/length(strata)
#> strata
#>         1         2 
#> 0.3333333 0.6666667 

#Assign data to 3 folds
i = balancedKFolds(
 strata = strata,
 k = 3,
 undersample = TRUE
)
#Check folds
i
#>  [1]  2  3  3  2  1  1 NA NA  3 NA NA  1  2 NA  1  3 NA  2
#Check ratio in the folds
table(strata[!is.na(i) & i==1])/length(strata[!is.na(i) & i==1])
#> 
#>   1   2 
#> 0.5 0.5 
table(strata[!is.na(i) & i==2])/length(strata[!is.na(i) & i==2])
#> 
#>   1   2 
#> 0.5 0.5 
table(strata[!is.na(i) & i==3])/length(strata[!is.na(i) & i==3])
#> 
#>   1   2 
#> 0.5 0.5 


#Raise an error: the strata are unbalanced and undersample = FALSE
try(balancedKFolds(
 strata = strata,
 k = 3,
 undersample = FALSE
))
#> Error in balancedKFolds(strata = strata, k = 3, undersample = FALSE) : 
#>   Data is not balanced. Change 'undersample' to TRUE.