Introduction
Sampling without replacement implies that elements of a population can be chosen only once in one sample. There are different techniques of sampling without replacement, including:
- simple random sampling
- random sampling with unequal probabilities
- stratified sampling
- balanced sampling (a special case of stratified sampling)
- permutation sampling
- k-fold sampling
- leave-p-out sampling
In this article, we show how to draw repeated samples without replacement from a population by using the functions implemented in resampling.
For further information on the sampling techniques, see Sampling without replacement.
Setup
Loading
First, we need to load resampling and the other required R packages:
#resampling
library(resampling)
#Packages for visualisation
require(ComplexHeatmap, quietly = TRUE)
require(grid, quietly = TRUE)
require(RColorBrewer, quietly = TRUE)
Seed
Then, we set a seed for the random number generation (RNG). By default, each R session creates its seed from the current time and the process ID, so different sessions produce different simulation results. By fixing a seed, we ensure that the results of this vignette can be reproduced. We can specify a seed by calling ?set.seed.
#Set a seed for RNG
set.seed(
#A seed
seed = 5381L, #a randomly chosen integer value
#The kind of RNG to use
kind = "Mersenne-Twister", #we make explicit the current R default value
#The kind of Normal generation
normal.kind = "Inversion" #we make explicit the current R default value
)
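As a quick check of what fixing the seed buys us, the following minimal sketch uses only base R (it does not rely on resampling). It resets the same seed before two draws and confirms that they coincide. The variable names are illustrative.

```r
# Draw a sample of 8 indices out of 10, resetting the seed before each draw
set.seed(5381L)
first_draw <- sample.int(10, size = 8)

set.seed(5381L)
second_draw <- sample.int(10, size = 8)

# The two draws coincide because the RNG state was reset to the same seed
identical(first_draw, second_draw)
#> [1] TRUE
```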
Resampling Without Replacement
The available methods for taking repeated samples without replacement can be listed through the ?listAvailableSamplingMethods function call, setting the input argument to 'rswor'.
?listAvailableSamplingMethods returns a table with two columns:
- id: the id of the sampling method, to be used in the function calls
- name: the name of the sampling method
#list sampling methods
sampling.methods = listAvailableSamplingMethods(x = 'rswor')
#print in table
knitr::kable(x = sampling.methods)
id | name |
---|---|
rswor | random sampling without replacement |
srswor | simple random sampling without replacement |
stratified_rswor | stratified random sampling without replacement |
balanced_rswor | balanced random sampling without replacement |
permutation | permutation sampling |
kfolds | random k-fold sampling |
stratified_kfolds | stratified k-fold sampling |
balanced_kfolds | balanced k-fold sampling |
leave_p_out | leave-p-out sampling |
leave_one_out | leave-one-out sampling |
The names of the resampling functions can be retrieved by calling ?listResamplingFunctionNames.
#list resampling function names
resampling.function.names = listResamplingFunctionNames(x = 'rswor')
#print in table
knitr::kable(x = resampling.function.names)
id | name |
---|---|
rswor | repeatedSampleWithoutReplacement |
srswor | repeatedSimpleRandomSampleWithoutReplacement |
stratified_rswor | repeatedStratifiedSampleWithoutReplacement |
balanced_rswor | repeatedBalancedSampleWithoutReplacement |
permutation | repeatedPermutationSample |
kfolds | repeatedRandomKm1Folds |
stratified_kfolds | repeatedStratifiedKm1Folds |
balanced_kfolds | repeatedBalancedKm1Folds |
leave_p_out | repeatedLeavePOutSample |
leave_one_out | repeatedLeaveOneOutSample |
Each function is documented. To learn more about a specific method, we can use the ? operator. For example, let’s check the function ?repeatedSimpleRandomSampleWithoutReplacement.
#See documentation
?repeatedSimpleRandomSampleWithoutReplacement
From the documentation, we can see that the function accepts 3 arguments in input:
- k: the number of repeated samples to generate
- N: the population size
- n: the sample size
Simple Random Sampling
In resampling via simple random sampling (SRS), simple random samples without replacement are repeatedly taken from the population.
The function implementing this sampling scheme is ?repeatedSimpleRandomSampleWithoutReplacement, which accepts 3 arguments:
- k: the number of repeated samples to generate
- N: the population size
- n: the sample size
#Simple random sampling without replacement
repeatedSimpleRandomSampleWithoutReplacement(
k = 2,
N = 10,
n = 8
)
#> [[1]]
#> [1] 1 9 7 2 4 3 6 10
#>
#> [[2]]
#> [1] 9 8 1 7 4 5 6 3
Instead of using ?repeatedSimpleRandomSampleWithoutReplacement, we can take repeated samples by using the ?resample function.
#Simple random sampling without replacement
obj = resample(
x = 10,
n = 8,
k = 2,
method = 'srswor'
)
#Print
print(obj)
#>
#> 2 samples taken from a population of 10 elements by using simple random
#> sampling without replacement.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 2, 3, 9, 5, ... 8 2
#> 2 2 2, 10, 8, , ... 8 2
#Samples
getSamples(obj)
#> [[1]]
#> [1] 2 3 9 5 7 4 6 1
#>
#> [[2]]
#> [1] 2 10 8 1 6 4 3 7
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 8 10
#>
#> [[2]]
#> [1] 5 9
#Plot
plot(x = obj)
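To make the scheme concrete, here is a minimal base R sketch of repeated simple random sampling without replacement. It is not the package's implementation: srswor_sketch is a hypothetical helper written for illustration, and the holdout data are assumed to be the population indices left out of each sample, as in the output above.

```r
# Hypothetical helper (not part of resampling): k simple random samples
# of size n drawn without replacement from the indices 1..N
srswor_sketch <- function(k, N, n) {
  replicate(k, sample.int(N, size = n, replace = FALSE), simplify = FALSE)
}

set.seed(5381L)
samples <- srswor_sketch(k = 2, N = 10, n = 8)

# Holdout data: the indices that were not selected in each sample
holdout <- lapply(samples, function(s) setdiff(seq_len(10), s))

lengths(samples)
#> [1] 8 8
lengths(holdout)
#> [1] 2 2
```

Each sample is drawn independently of the others, so an element can appear in several samples but never twice within the same one.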
Random Sampling With Unequal Probability
We can use ?repeatedSampleWithoutReplacement to draw repeated samples without replacement with unequal probability. From the documentation, we can see that the function accepts 4 arguments in input:
- k: the number of repeated samples to generate
- N: the population size
- n: the sample size
- prob: an optional vector of probabilities for obtaining the population elements
For example, let’s assume our population of interest has 10 elements, and that the first 3 elements have a higher chance of being selected.
#Random sampling without replacement
repeatedSampleWithoutReplacement(
k = 2,
N = 10,
n = 5,
prob = c(rep(3,3), rep(1,7))
)
#> [[1]]
#> [1] 3 7 10 2 1
#>
#> [[2]]
#> [1] 7 2 1 9 3
We can take repeated samples by using the ?resample function and setting method = 'rswor'.
#Random sampling without replacement
obj = resample(
x = 10,
n = 5,
k = 2,
method = 'rswor',
prob = c(rep(3,3),rep(1,7))
)
#Print
print(obj)
#>
#> 2 samples taken from a population of 10 elements by using random
#> sampling without replacement.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 7, 3, 6, 2, ... 5 5
#> 2 2 6, 1, 2, 3, ... 5 5
#Samples
getSamples(obj)
#> [[1]]
#> [1] 7 3 6 2 1
#>
#> [[2]]
#> [1] 6 1 2 3 8
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 4 5 8 9 10
#>
#> [[2]]
#> [1] 4 5 7 9 10
#Plot
plot(x = obj)
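The effect of the prob weights can be checked empirically. The sketch below uses base R's sample.int rather than the package functions, so it only illustrates the idea. It draws many samples of size 5 and compares how often the heavily weighted elements are selected. The number of replicates (2000) is an arbitrary choice.

```r
# Weights: elements 1 to 3 are three times as likely as elements 4 to 10
# to be picked on any single draw (weights need not sum to one)
prob <- c(rep(3, 3), rep(1, 7))

set.seed(5381L)
# Each column of 'draws' is one sample of size 5 taken without replacement
draws <- replicate(2000, sample.int(10, size = 5, replace = FALSE, prob = prob))

# Share of the 2000 samples in which each element was selected
freq <- tabulate(as.vector(draws), nbins = 10) / 2000

# The first three elements are selected more often than the others
mean(freq[1:3]) > mean(freq[4:10])
#> [1] TRUE
```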
Stratified Random Sampling
Repeated stratified samples without replacement can be taken by using ?repeatedStratifiedSampleWithoutReplacement, which accepts a strata argument in input.
#Define strata
strata = c(rep("a", 3),rep("b", 6))
#Stratified sampling without replacement
repeatedStratifiedSampleWithoutReplacement(
k = 2,
strata = strata,
n = 6
)
#> [[1]]
#> [1] 2 1 4 9 6 5
#>
#> [[2]]
#> [1] 3 5 4 6 2 9
We can take repeated samples by using the ?resample function and setting method = 'stratified_rswor'.
#Stratified sampling without replacement
obj = resample(
x = strata,
n = 6,
k = 2,
method = 'stratified_rswor'
)
#Print
print(obj)
#>
#> 2 samples taken from a population of 9 elements by using stratified
#> random sampling without replacement.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 1, 7, 5, 8, ... 6 3
#> 2 2 4, 8, 9, 2, ... 6 3
#Samples
getSamples(obj)
#> [[1]]
#> [1] 1 7 5 8 4 2
#>
#> [[2]]
#> [1] 4 8 9 2 5 1
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 3 6 9
#>
#> [[2]]
#> [1] 3 6 7
#Plot
plot(x = obj, strata = strata)
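One way to picture stratified sampling is proportional allocation: each stratum contributes to the sample in proportion to its share of the population. The base R sketch below assumes this allocation rule and is not taken from the package. stratified_sketch is a hypothetical helper, and the simple rounding it uses is only guaranteed to add up to n for convenient inputs such as this one.

```r
# Hypothetical helper (not part of resampling)
stratified_sketch <- function(strata, n) {
  # Population indices grouped by stratum
  groups <- split(seq_along(strata), strata)
  # Proportional allocation; rounding may not sum to n for other inputs
  sizes <- round(n * lengths(groups) / length(strata))
  # Draw within each stratum, then pool the selected indices
  picked <- Map(
    function(idx, size) idx[sample.int(length(idx), size)],
    groups, sizes
  )
  unlist(picked, use.names = FALSE)
}

strata <- c(rep("a", 3), rep("b", 6))

set.seed(5381L)
s <- stratified_sketch(strata, n = 6)

# The sample keeps the population ratio: 2 elements from "a", 4 from "b"
table(strata[s])
```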
Balanced Random Sampling
Balanced sampling is a special case of stratified sampling used to ensure that subgroups of the population are equally represented in the taken sample.
Repeated balanced samples without replacement can be taken by using ?repeatedBalancedSampleWithoutReplacement:
#Define strata
strata = c(rep("a", 3),rep("b", 6))
#Check ratio
table(strata)/length(strata)
#> strata
#> a b
#> 0.3333333 0.6666667
#Balanced sampling without replacement
s = repeatedBalancedSampleWithoutReplacement(
k = 2,
strata = strata,
n = 6
)
#Check ratio in the samples
table(strata[s[[1]]])/length(strata[s[[1]]])
#>
#> a b
#> 0.5 0.5
table(strata[s[[2]]])/length(strata[s[[2]]])
#>
#> a b
#> 0.5 0.5
We can take repeated samples by using the ?resample function and setting method = 'balanced_rswor'.
#Balanced sampling without replacement
obj = resample(
x = strata,
n = 6,
k = 2,
method = 'balanced_rswor'
)
#Print
print(obj)
#>
#> 2 samples taken from a population of 9 elements by using balanced
#> random sampling without replacement.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 8, 9, 3, 7, ... 6 3
#> 2 2 1, 5, 2, 3, ... 6 3
#Samples
getSamples(obj)
#> [[1]]
#> [1] 8 9 3 7 1 2
#>
#> [[2]]
#> [1] 1 5 2 3 7 4
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 4 5 6
#>
#> [[2]]
#> [1] 6 8 9
#Plot
plot(x = obj, strata = strata)
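Balanced sampling can be sketched in base R as equal allocation: the same number of elements is drawn from every stratum, regardless of the stratum size. The helper below, balanced_sketch, is hypothetical and only illustrates this idea. It assumes that n is a multiple of the number of strata and that every stratum is large enough.

```r
# Hypothetical helper (not part of resampling)
balanced_sketch <- function(strata, n) {
  groups <- split(seq_along(strata), strata)
  # Equal allocation: the same number of elements from every stratum
  per_stratum <- n %/% length(groups)
  stopifnot(per_stratum <= min(lengths(groups)))
  picked <- lapply(groups, function(idx) {
    idx[sample.int(length(idx), per_stratum)]
  })
  unlist(picked, use.names = FALSE)
}

strata <- c(rep("a", 3), rep("b", 6))

set.seed(5381L)
s <- balanced_sketch(strata, n = 6)

# Both strata contribute 3 elements, so each accounts for half the sample
table(strata[s]) / length(s)
```

Because stratum "a" has only 3 elements, all of them are selected here, while 3 of the 6 elements of "b" are drawn at random.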
Permutation Sampling
We can take repeated permutation samples by using ?repeatedPermutationSample:
#Permutation sampling
repeatedPermutationSample(
k = 2,
N = 10
)
#> [[1]]
#> [1] 3 6 8 10 2 1 5 9 7 4
#>
#> [[2]]
#> [1] 9 7 2 1 5 8 4 6 3 10
We can also use the ?resample function by setting method = 'permutation'.
#Permutation sampling
obj = resample(
x = 10,
k = 2,
method = 'permutation'
)
#Print
print(obj)
#>
#> 2 samples taken from a population of 10 elements by using permutation
#> sampling.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 10, 3, 5, , ... 10 0
#> 2 2 3, 9, 5, 4, ... 10 0
#Samples
getSamples(obj)
#> [[1]]
#> [1] 10 3 5 8 7 6 4 2 9 1
#>
#> [[2]]
#> [1] 3 9 5 4 8 10 1 6 2 7
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> integer(0)
#>
#> [[2]]
#> integer(0)
#Plot
plot(x = obj)
K-fold Sampling
The idea behind k-fold resampling is taken from k-fold cross-validation: k samples are taken from the population, so that the i-th sample is generated by removing the i-th fold from the population and merging the remaining k - 1 folds together.
We can take k samples via k-fold sampling by calling ?repeatedRandomKm1Folds. The function accepts two arguments in input:
- k: the number of folds
- N: the population size
?repeatedRandomKm1Folds returns a list of length k where each element is a sample obtained by merging k-1 folds together.
#K-1 folds sampling
repeatedRandomKm1Folds(
N = 10,
k = 3
)
#> [[1]]
#> [1] 1 2 3 4 7 8
#>
#> [[2]]
#> [1] 3 4 5 6 8 9 10
#>
#> [[3]]
#> [1] 1 2 5 6 7 9 10
We can take repeated samples by using the ?resample function and setting method = 'kfolds'.
#K-folds sampling
obj = resample(
x = 10,
k = 3,
method = "kfolds"
)
#Print
print(obj)
#>
#> 3 samples taken from a population of 10 elements by using random k-fold
#> sampling.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 1, 3, 4, 7, ... 6 4
#> 2 2 2, 3, 4, 5, ... 7 3
#> 3 3 1, 2, 5, 6, ... 7 3
#Samples
getSamples(obj)
#> [[1]]
#> [1] 1 3 4 7 8 10
#>
#> [[2]]
#> [1] 2 3 4 5 6 8 9
#>
#> [[3]]
#> [1] 1 2 5 6 7 9 10
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 2 5 6 9
#>
#> [[2]]
#> [1] 1 7 10
#>
#> [[3]]
#> [1] 3 4 8
#Plot
plot(obj)
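The construction of the k samples from k folds can be sketched in base R as follows. This is an illustration under simple assumptions, not the package's code: kfold_sketch is a hypothetical helper that assigns fold labels at random, keeping the fold sizes as equal as possible.

```r
# Hypothetical helper (not part of resampling)
kfold_sketch <- function(N, k) {
  # Randomly assign each element to one of k folds of near-equal size
  fold <- sample(rep_len(seq_len(k), N))
  # The i-th sample keeps every element outside fold i
  lapply(seq_len(k), function(i) which(fold != i))
}

set.seed(5381L)
samples <- kfold_sketch(N = 10, k = 3)

# The i-th holdout set is the i-th fold
holdout <- lapply(samples, function(s) setdiff(seq_len(10), s))

# With N = 10 and k = 3 the folds have sizes 4, 3 and 3
lengths(samples)
#> [1] 6 7 7

# The folds partition the population: each element is held out exactly once
sort(unlist(holdout))
#>  [1]  1  2  3  4  5  6  7  8  9 10
```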
Stratified K-fold Sampling
We can take repeated stratified samples via stratified k-fold sampling by calling ?repeatedStratifiedKm1Folds.
?repeatedStratifiedKm1Folds assigns the population to k stratified folds and returns k samples made of elements from k-1 folds:
#Define strata
strata = c(rep("a", 3),rep("b", 6))
#Check ratio
table(strata)/length(strata)
#> strata
#> a b
#> 0.3333333 0.6666667
#Assign data to 3 folds
i = repeatedStratifiedKm1Folds(
strata = strata,
k = 3
)
#Check folds
i
#> [[1]]
#> [1] 1 2 5 7 8 9
#>
#> [[2]]
#> [1] 2 3 4 5 6 8
#>
#> [[3]]
#> [1] 1 3 4 6 7 9
#Check ratio in the folds
table(strata[i[[1]]])/length(strata[i[[1]]])
#>
#> a b
#> 0.3333333 0.6666667
table(strata[i[[2]]])/length(strata[i[[2]]])
#>
#> a b
#> 0.3333333 0.6666667
table(strata[i[[3]]])/length(strata[i[[3]]])
#>
#> a b
#> 0.3333333 0.6666667
We can take repeated samples by using the ?resample function and setting method = 'stratified_kfolds'.
#Define strata
strata = c(rep("a", 3),rep("b", 6))
#Stratified k-folds sampling
obj = resample(
x = strata,
k = 3,
method = "stratified_kfolds"
)
#Print
print(obj)
#>
#> 3 samples taken from a population of 9 elements by using stratified
#> k-fold sampling.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 1, 3, 4, 5, ... 6 3
#> 2 2 1, 2, 5, 6, ... 6 3
#> 3 3 2, 3, 4, 7, ... 6 3
#Samples
getSamples(obj)
#> [[1]]
#> [1] 1 3 4 5 6 7
#>
#> [[2]]
#> [1] 1 2 5 6 8 9
#>
#> [[3]]
#> [1] 2 3 4 7 8 9
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 2 8 9
#>
#> [[2]]
#> [1] 3 4 7
#>
#> [[3]]
#> [1] 1 5 6
#Plot
plot(x = obj, strata = strata)
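A stratified fold assignment can be sketched in base R by dealing the members of each stratum across the folds separately. The helper stratified_kfold_sketch below is hypothetical and only illustrates the principle. The package may assign folds differently.

```r
# Hypothetical helper (not part of resampling)
stratified_kfold_sketch <- function(strata, k) {
  fold <- integer(length(strata))
  for (idx in split(seq_along(strata), strata)) {
    # Deal the members of this stratum across the k folds in random order
    fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  # The i-th sample keeps every element outside fold i
  lapply(seq_len(k), function(i) which(fold != i))
}

strata <- c(rep("a", 3), rep("b", 6))

set.seed(5381L)
samples <- stratified_kfold_sketch(strata, k = 3)

# Every fold holds 1 "a" and 2 "b", so every sample holds 2 "a" and 4 "b"
lapply(samples, function(s) table(strata[s]))
```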
Balanced K-fold Sampling
We can take repeated balanced samples via balanced k-fold sampling by calling ?repeatedBalancedKm1Folds, which returns k samples made of elements from k-1 folds.
Internally, it uses ?balancedKFolds to assign the population to k balanced folds:
- If the population is balanced, each element in the population is assigned to one of the k folds so that the percentage of each stratum is preserved in each fold
- If the population is unbalanced and undersample = TRUE, random undersampling is adopted, i.e. elements are removed at random from the majority groups so that all strata end up with the same size
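Random undersampling on its own can be sketched in a few lines of base R. The helper undersample_sketch is hypothetical and is not the package's implementation. It assumes that every stratum is reduced to the size of the smallest one.

```r
# Hypothetical helper (not part of resampling)
undersample_sketch <- function(strata) {
  groups <- split(seq_along(strata), strata)
  # Every stratum is cut down to the size of the smallest one
  target <- min(lengths(groups))
  kept <- lapply(groups, function(idx) idx[sample.int(length(idx), target)])
  sort(unlist(kept, use.names = FALSE))
}

# Unbalanced population: 6 elements in stratum 1, 12 in stratum 2
strata <- c(rep(1, 6), rep(2, 12))

set.seed(5381L)
kept <- undersample_sketch(strata)
removed <- setdiff(seq_along(strata), kept)

# After undersampling, both strata have 6 elements
table(strata[kept])

# Six elements, all from the majority stratum, were removed
length(removed)
#> [1] 6
```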
#Define balanced strata
strata = c(rep(1,6),rep(2,6))
#Check ratio
table(strata)/length(strata)
#> strata
#> 1 2
#> 0.5 0.5
#Assign data to 3 folds and take a sample
i = repeatedBalancedKm1Folds(
strata = strata,
k = 3
)
#Check sample
i
#> [[1]]
#> [1] 1 2 4 5 7 8 9 10
#>
#> [[2]]
#> [1] 2 3 4 6 8 9 11 12
#>
#> [[3]]
#> [1] 1 3 5 6 7 10 11 12
#Check ratio in the sample
table(strata[i[[1]]])/length(strata[i[[1]]])
#>
#> 1 2
#> 0.5 0.5
table(strata[i[[2]]])/length(strata[i[[2]]])
#>
#> 1 2
#> 0.5 0.5
table(strata[i[[3]]])/length(strata[i[[3]]])
#>
#> 1 2
#> 0.5 0.5
#Define unbalanced strata
strata = c(rep(1,6),rep(2,12))
#Check ratio
table(strata)/length(strata)
#> strata
#> 1 2
#> 0.3333333 0.6666667
#Assign data to 3 folds and take a sample
i = repeatedBalancedKm1Folds(
strata = strata,
k = 3,
undersample = TRUE
)
#Check folds
i
#> [[1]]
#> [1] 3 4 5 6 8 13 17 18
#>
#> [[2]]
#> [1] 1 2 4 6 8 12 14 18
#>
#> [[3]]
#> [1] 1 2 3 5 12 13 14 17
#>
#> attr(,"removed.data")
#> [1] 7 9 10 11 15 16
#Check ratio in the folds
table(strata[i[[1]]])/length(strata[i[[1]]])
#>
#> 1 2
#> 0.5 0.5
table(strata[i[[2]]])/length(strata[i[[2]]])
#>
#> 1 2
#> 0.5 0.5
table(strata[i[[3]]])/length(strata[i[[3]]])
#>
#> 1 2
#> 0.5 0.5
We can take repeated samples by using the ?resample function.
#Define strata
strata = c(rep("a", 6),rep("b", 6))
#Balanced k-folds sampling (balanced population)
obj = resample(
x = strata,
k = 3,
method = "balanced_kfolds"
)
#Print
print(obj)
#>
#> 3 samples taken from a population of 12 elements by using balanced
#> k-fold sampling.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 1, 3, 4, 6, ... 8 4
#> 2 2 1, 2, 3, 5, ... 8 4
#> 3 3 2, 4, 5, 6, ... 8 4
#Samples
getSamples(obj)
#> [[1]]
#> [1] 1 3 4 6 7 9 10 12
#>
#> [[2]]
#> [1] 1 2 3 5 7 8 11 12
#>
#> [[3]]
#> [1] 2 4 5 6 8 9 10 11
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 2 5 8 11
#>
#> [[2]]
#> [1] 4 6 9 10
#>
#> [[3]]
#> [1] 1 3 7 12
#Plot
plot(x = obj, strata = strata)
If the population is unbalanced, we can use the undersample argument.
#Define strata
strata = c(rep("a",6),rep("b",8))
#Balanced k-folds sampling (unbalanced population)
obj = resample(
x = strata,
k = 3,
method = "balanced_kfolds",
undersample = TRUE
)
#Print
print(obj)
#>
#> 3 samples taken from a population of 14 elements by using balanced
#> k-fold sampling.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 1, 2, 3, 4, ... 8 4
#> 2 2 3, 4, 5, 6, ... 8 4
#> 3 3 1, 2, 5, 6, ... 8 4
#Samples
getSamples(obj)
#> [[1]]
#> [1] 1 2 3 4 7 10 12 13
#>
#> [[2]]
#> [1] 3 4 5 6 7 8 10 14
#>
#> [[3]]
#> [1] 1 2 5 6 8 12 13 14
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 5 6 8 14
#>
#> [[2]]
#> [1] 1 2 12 13
#>
#> [[3]]
#> [1] 3 4 7 10
#Plot
plot(x = obj, strata = strata)
Leave-p-out Sampling
Leave-p-out resampling is an exhaustive resampling technique, in which samples of size p are repeatedly taken from the population until all the possible combinations of p elements are considered. These samples are then used as holdout data.
?repeatedLeavePOut returns a list of length \(\binom{N}{p}\) where each element is a sample obtained by removing the holdout data:
#Repeatedly sample leaving out p elements each time
repeatedLeavePOut(N = 5, p = 2)
#> [[1]]
#> [1] 3 4 5
#>
#> [[2]]
#> [1] 2 4 5
#>
#> [[3]]
#> [1] 2 3 5
#>
#> [[4]]
#> [1] 2 3 4
#>
#> [[5]]
#> [1] 1 4 5
#>
#> [[6]]
#> [1] 1 3 5
#>
#> [[7]]
#> [1] 1 3 4
#>
#> [[8]]
#> [1] 1 2 5
#>
#> [[9]]
#> [1] 1 2 4
#>
#> [[10]]
#> [1] 1 2 3
In order to use the ?resample function, we need to set method = 'leave_p_out'.
#Leave-p-out sampling
obj = resample(
x = 5,
n = 2,
method = "leave_p_out"
)
#Print
print(obj)
#>
#> 10 samples taken from a population of 5 elements by using leave-p-out
#> sampling.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 3, 4, 5 3 2
#> 2 2 2, 4, 5 3 2
#> 3 3 2, 3, 5 3 2
#> 4 4 2, 3, 4 3 2
#> 5 5 1, 4, 5 3 2
#> ...
#Samples
getSamples(obj)
#> [[1]]
#> [1] 3 4 5
#>
#> [[2]]
#> [1] 2 4 5
#>
#> [[3]]
#> [1] 2 3 5
#>
#> [[4]]
#> [1] 2 3 4
#>
#> [[5]]
#> [1] 1 4 5
#>
#> [[6]]
#> [1] 1 3 5
#>
#> [[7]]
#> [1] 1 3 4
#>
#> [[8]]
#> [1] 1 2 5
#>
#> [[9]]
#> [1] 1 2 4
#>
#> [[10]]
#> [1] 1 2 3
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 1 2
#>
#> [[2]]
#> [1] 1 3
#>
#> [[3]]
#> [1] 1 4
#>
#> [[4]]
#> [1] 1 5
#>
#> [[5]]
#> [1] 2 3
#>
#> [[6]]
#> [1] 2 4
#>
#> [[7]]
#> [1] 2 5
#>
#> [[8]]
#> [1] 3 4
#>
#> [[9]]
#> [1] 3 5
#>
#> [[10]]
#> [1] 4 5
#Plot
plot(obj)
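The enumeration behind leave-p-out can be reproduced with utils::combn from base R, which lists all combinations of p elements out of N. The sketch below is an illustration rather than the package's code, and leave_p_out_sketch is a hypothetical helper.

```r
# Hypothetical helper (not part of resampling)
leave_p_out_sketch <- function(N, p) {
  # Every combination of p indices out of 1..N is used once as holdout data
  holdout <- utils::combn(N, p, simplify = FALSE)
  # Each sample is the population minus the corresponding holdout set
  samples <- lapply(holdout, function(h) setdiff(seq_len(N), h))
  list(samples = samples, holdout = holdout)
}

res <- leave_p_out_sketch(N = 5, p = 2)

# The number of samples equals choose(5, 2)
length(res$samples)
#> [1] 10

# First holdout set and the matching sample
res$holdout[[1]]
#> [1] 1 2
res$samples[[1]]
#> [1] 3 4 5
```

Since the number of samples grows as \(\binom{N}{p}\), this exhaustive scheme is practical only for small N or small p.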
Leave-one-out Sampling
Leave-one-out is a particular case of leave-p-out, where p = 1. Similarly to leave-p-out, leave-one-out resampling is an exhaustive resampling technique, in which samples of size 1 are repeatedly taken from the population until each element in the population is considered as the holdout data.
?repeatedLeaveOneOut returns a list of length N where each element is a sample obtained by removing the holdout data:
#Repeatedly sample leaving out 1 element each time
repeatedLeaveOneOut(N = 5)
#> [[1]]
#> [1] 1 3 4 5
#>
#> [[2]]
#> [1] 1 2 3 4
#>
#> [[3]]
#> [1] 2 3 4 5
#>
#> [[4]]
#> [1] 1 2 3 5
#>
#> [[5]]
#> [1] 1 2 4 5
In order to use the ?resample function, we need to set method = 'leave_one_out'.
#Leave-one-out sampling
obj = resample(
x = 5,
method = "leave_one_out"
)
#Print
print(obj)
#>
#> 5 samples taken from a population of 5 elements by using leave-one-out
#> sampling.
#>
#> sampleNumber sample sampleSize holdoutSize
#> 1 1 1, 2, 4, 5 4 1
#> 2 2 1, 2, 3, 5 4 1
#> 3 3 1, 2, 3, 4 4 1
#> 4 4 1, 3, 4, 5 4 1
#> 5 5 2, 3, 4, 5 4 1
#Samples
getSamples(obj)
#> [[1]]
#> [1] 1 2 4 5
#>
#> [[2]]
#> [1] 1 2 3 5
#>
#> [[3]]
#> [1] 1 2 3 4
#>
#> [[4]]
#> [1] 1 3 4 5
#>
#> [[5]]
#> [1] 2 3 4 5
#Holdout Data
getHoldOutSamples(obj)
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#> [1] 5
#>
#> [[4]]
#> [1] 2
#>
#> [[5]]
#> [1] 1
#Plot
plot(obj)
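For comparison, leave-one-out can be written in base R with negative indexing. This is only a sketch of the idea, with illustrative variable names. It enumerates the holdout elements in increasing order, whereas the package output above lists them in a different order.

```r
N <- 5

# Each element of the population is held out exactly once
holdout <- as.list(seq_len(N))

# The i-th sample is the population without its i-th element
samples <- lapply(holdout, function(i) seq_len(N)[-i])

length(samples)
#> [1] 5

# Sample obtained by holding out element 3
samples[[3]]
#> [1] 1 2 4 5
```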