Assigns the population to k balanced folds. See the Details section below for further information.
Arguments
- strata
vector of stratification variables. The population size is length(strata)
- k
number of folds
- undersample
logical, whether to remove elements from the population to obtain balanced folds
- prob
(optional) vector of positive numeric values, the probability weights for obtaining the strata elements. If provided, it must be the same length as strata
Details
If the population is balanced, each element in the population is
assigned to one of the k folds so that the percentage of each stratum is
preserved in each fold.
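The balanced case can be illustrated with a minimal base R sketch. The helper name assignStratifiedFolds is hypothetical and is not part of the package; it only shows one way to spread each stratum evenly across k folds, and the actual implementation of balancedKFolds may differ.

```r
# Hypothetical helper (an assumption for illustration, not the package code):
# assign fold labels within each stratum so that every fold receives
# the same share of that stratum.
assignStratifiedFolds <- function(strata, k) {
  folds <- integer(length(strata))
  for (s in unique(strata)) {
    idx <- which(strata == s)
    # Cycle the labels 1..k over the stratum
    labels <- rep_len(seq_len(k), length(idx))
    # Shuffle the labels; sample.int avoids the length-1 pitfall of sample()
    folds[idx] <- labels[sample.int(length(idx))]
  }
  folds
}

set.seed(1)
strata <- c(rep(1, 6), rep(2, 6))
folds <- assignStratifiedFolds(strata, k = 3)

# Cross-tabulate folds against strata to inspect the split
table(folds, strata)
```

With 6 elements per stratum and k = 3, every fold receives 2 elements from each stratum, so the 50/50 ratio of the population is preserved in each fold.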
If the population is unbalanced and undersample = TRUE, the so-called
"random undersampling" is adopted, i.e. elements are removed at random from
the majority groups until every stratum has the same number of elements.
An error is raised if the population is unbalanced and undersample = FALSE.
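Random undersampling can be sketched in base R as follows. This is an illustration only: the helper name undersampleStrata is hypothetical, and it assumes that "balanced" means all strata have the same number of elements, so every stratum is reduced to the size of the smallest one. The package may select the elements to remove differently.

```r
# Hypothetical helper (an assumption for illustration, not the package code):
# return a logical vector flagging the elements kept after undersampling.
undersampleStrata <- function(strata) {
  # Size of the smallest stratum: the target size for every stratum
  nMin <- min(table(strata))
  keep <- logical(length(strata))
  for (s in unique(strata)) {
    idx <- which(strata == s)
    # Keep nMin randomly chosen elements of this stratum
    keep[idx[sample.int(length(idx), size = nMin)]] <- TRUE
  }
  keep
}

set.seed(1)
strata <- c(rep(1, 6), rep(2, 12))
keep <- undersampleStrata(strata)

# Both strata are reduced to the same size
table(strata[keep])
```

Elements with keep equal to FALSE are the ones dropped from the majority group; in the output of balancedKFolds shown in the Examples, dropped elements appear as NA.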
Examples
#Set seed for reproducibility
set.seed(seed = 5381L)
#Define balanced strata
strata = c(rep(1,6),rep(2,6))
#Check ratio
table(strata)/length(strata)
#> strata
#> 1 2
#> 0.5 0.5
#Assign data to 3 folds
i = balancedKFolds(
  strata = strata,
  k = 3
)
#Check folds
i
#> [1] 2 1 1 3 2 3 2 3 1 2 3 1
#Check ratio in the folds
table(strata[i==1])/length(strata[i==1])
#>
#> 1 2
#> 0.5 0.5
table(strata[i==2])/length(strata[i==2])
#>
#> 1 2
#> 0.5 0.5
table(strata[i==3])/length(strata[i==3])
#>
#> 1 2
#> 0.5 0.5
#Define unbalanced strata
strata = c(rep(1,6),rep(2,12))
#Check ratio
table(strata)/length(strata)
#> strata
#> 1 2
#> 0.3333333 0.6666667
#Assign data to 3 folds
i = balancedKFolds(
  strata = strata,
  k = 3,
  undersample = TRUE
)
#Check folds
i
#> [1] 2 3 3 2 1 1 NA NA 3 NA NA 1 2 NA 1 3 NA 2
#Check ratio in the folds
table(strata[!is.na(i) & i==1])/length(strata[!is.na(i) & i==1])
#>
#> 1 2
#> 0.5 0.5
table(strata[!is.na(i) & i==2])/length(strata[!is.na(i) & i==2])
#>
#> 1 2
#> 0.5 0.5
table(strata[!is.na(i) & i==3])/length(strata[!is.na(i) & i==3])
#>
#> 1 2
#> 0.5 0.5
#Raise an error
try(balancedKFolds(
  strata = strata,
  k = 3,
  undersample = FALSE
))
#> Error in balancedKFolds(strata = strata, k = 3, undersample = FALSE) :
#> Data is not balanced. Change 'undersample' to TRUE.
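The per-fold ratio checks above can also be condensed into a single call with base R's table and prop.table. The sketch below is not part of the original examples; it hard-codes the fold vector printed for the unbalanced case so that it runs without the package.

```r
# Strata and fold assignment copied from the unbalanced example above
strata <- c(rep(1, 6), rep(2, 12))
i <- c(2, 3, 3, 2, 1, 1, NA, NA, 3, NA, NA, 1, 2, NA, 1, 3, NA, 2)

# table() drops NA by default, so removed elements are ignored;
# margin = 1 normalises each row (fold) to sum to 1
ratios <- prop.table(table(fold = i, strata = strata), margin = 1)
ratios
```

Each row of ratios corresponds to one fold and each column to one stratum, so a balanced assignment shows 0.5 in every cell.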