Introduction
In statistics, sampling is the selection of a subset of elements from a population, here defined as a complete set of subjects of interest.
Because it is often too expensive or logistically impossible to collect data for every case in a population, sampling is used instead as a cheaper and faster way to estimate the population's characteristics.
Different sampling schemes exist, but they can be grouped into 2 main categories, i.e. sampling with or without replacement:
- sampling with replacement implies each element in the population may appear multiple times in one sample
- in sampling without replacement, each member of the population can be chosen only once in one sample
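The difference between the two schemes can be illustrated with the base R function sample, which is not part of resampling. This is a minimal sketch: the population and the seed below are hypothetical and chosen only for illustration.

```r
# Hypothetical population of 5 labelled elements
population <- c("a", "b", "c", "d", "e")

# Arbitrary seed, used only to make the sketch reproducible
set.seed(1)

# With replacement: the same element may be drawn more than once
with_replacement <- sample(population, size = 5, replace = TRUE)

# Without replacement: every drawn element is distinct
without_replacement <- sample(population, size = 5, replace = FALSE)

# Duplicates are impossible in the second sample by construction
anyDuplicated(without_replacement) == 0
#> [1] TRUE
```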
In this article, we show how to draw samples without replacement from a population by using the functions implemented in resampling.
For further information on how to draw repeated samples without replacement, see Resampling without replacement.
Setup
Seed
We set a seed for random number generation (RNG). By default, different R sessions use different seeds, created from the current time and the process ID, and therefore produce different simulation results. By fixing a seed, we ensure that the results of this vignette can be reproduced. We can specify a seed by calling ?set.seed.
#Set a seed for RNG
set.seed(
#A seed
seed = 5381L, #a randomly chosen integer value
#The kind of RNG to use
kind = "Mersenne-Twister", #we make explicit the current R default value
#The kind of Normal generation
normal.kind = "Inversion" #we make explicit the current R default value
)
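The effect of fixing a seed can be checked directly. The sketch below uses only base R (sample.int) and resets the seed before each of two draws; the variable names are illustrative.

```r
# Draw a sample, resetting the seed first
set.seed(5381L)
first_draw <- sample.int(n = 10, size = 5)

# Reset the seed to the same value and draw again
set.seed(5381L)
second_draw <- sample.int(n = 10, size = 5)

# Resetting the seed restores the RNG state, so both draws match
identical(first_draw, second_draw)
#> [1] TRUE
```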
Sampling Without Replacement
As stated above, sampling without replacement implies that elements of a population can be chosen only once in one sample. There are different techniques of sampling without replacement, including:
- simple random sampling
- random sampling with unequal probabilities
- stratified sampling
- balanced sampling (a special case of stratified sampling)
- permutation sampling
- k-fold sampling
- leave-p-out sampling
The available methods can be listed through the ?listAvailableSamplingMethods function call, setting the input argument to 'rswor'. ?listAvailableSamplingMethods returns a table with two columns:
- id: the id of the sampling method, to be used in the function calls
- name: the name of the sampling method
#list sampling methods
sampling.methods = listAvailableSamplingMethods(x = 'rswor')
#print in table
knitr::kable(x = sampling.methods)
id | name |
---|---|
rswor | random sampling without replacement |
srswor | simple random sampling without replacement |
stratified_rswor | stratified random sampling without replacement |
balanced_rswor | balanced random sampling without replacement |
permutation | permutation sampling |
kfolds | random k-fold sampling |
stratified_kfolds | stratified k-fold sampling |
balanced_kfolds | balanced k-fold sampling |
leave_p_out | leave-p-out sampling |
leave_one_out | leave-one-out sampling |
The names of the sampling functions can be retrieved by calling ?listSamplingFunctionNames.
#list sampling function names
sampling.function.names = listSamplingFunctionNames(x = 'rswor')
#print in table
knitr::kable(x = sampling.function.names)
id | name |
---|---|
rswor | sampleWithoutReplacement |
srswor | simpleRandomSampleWithoutReplacement |
stratified_rswor | stratifiedSampleWithoutReplacement |
balanced_rswor | balancedSampleWithoutReplacement |
permutation | permutationSample |
kfolds | randomKm1Folds |
stratified_kfolds | stratifiedKm1Folds |
balanced_kfolds | balancedKm1Folds |
leave_p_out | leavePOutSample |
leave_one_out | leaveOneOutSample |
Each function is documented. To learn more about a specific method, use the ? operator. For example, let’s check the function ?simpleRandomSampleWithoutReplacement.
#See documentation
?simpleRandomSampleWithoutReplacement
From the documentation, we can see that the function accepts 2 input arguments:
- N: the population size
- n: the sample size
Simple Random Sampling
Simple random sampling (SRS) is the simplest form of sampling without replacement. In SRS without replacement, each element of the population has the same probability of being selected for the sample.
#Simple random sampling without replacement
simpleRandomSampleWithoutReplacement(
N = 10,
n = 8
)
#> [1] 1 9 7 2 4 3 6 10
Random Sampling With Unequal Probability
The concept of random sampling without replacement with unequal probability was developed by Narain (Narain 1951) and by Horvitz and Thompson (Horvitz and Thompson 1952). Under this sampling design, elements of the population have different probabilities of being selected. We can use ?sampleWithoutReplacement to draw our sample. For example, let’s assume our population of interest has 10 elements, and that the first 3 elements have a higher chance of being selected.
#Random sampling without replacement
sampleWithoutReplacement(
N = 10,
n = 5,
prob = c(rep(3,3), rep(1,7))
)
#> [1] 3 5 1 8 7
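The effect of unequal weights can be made visible by repeating the draw many times and counting how often each element is selected. The sketch below uses the base R function sample.int rather than sampleWithoutReplacement, so it illustrates the design and is not a description of the package internals. The number of repetitions is an arbitrary choice.

```r
set.seed(5381L)

# Same weights as above: the first 3 elements are 3 times heavier
weights <- c(rep(3, 3), rep(1, 7))

# One weighted sample without replacement
s <- sample.int(n = 10, size = 5, prob = weights)

# Repeat the draw and estimate how often each element is included
repetitions <- 2000
draws <- replicate(repetitions, sample.int(n = 10, size = 5, prob = weights))
inclusion <- tabulate(as.vector(draws), nbins = 10) / repetitions

# Elements 1 to 3 are included more often than elements 4 to 10
round(inclusion, 2)
```

In base R the weights are applied sequentially: after each draw, the next element is chosen with probability proportional to the weights of the elements that remain. The resulting inclusion frequencies are therefore higher for heavier elements, but not exactly proportional to the weights.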
Stratified Random Sampling
When a population can be partitioned into groups (i.e. strata or subpopulations) having certain properties in common, a stratified sampling approach can be used. This sampling design is adopted to ensure that subgroups of the population are represented in the taken sample.
A stratified sample without replacement can be taken by using ?stratifiedSampleWithoutReplacement, which accepts a strata argument in input.
#Define strata
strata = c(rep("a", 3),rep("b", 6))
#Stratified sampling without replacement
stratifiedSampleWithoutReplacement(
strata = strata,
n = 9
)
#> [1] 5 6 8 4 2 7 3 9 1
?stratifiedSampleWithoutReplacement implements the so-called “proportionate allocation”, in which the proportion of the strata in the population is maintained in the samples.
#Define strata
strata = c(rep("a", 3),rep("b", 6))
#Check ratio
table(strata)/length(strata)
#> strata
#> a b
#> 0.3333333 0.6666667
#Stratified sampling without replacement
s = stratifiedSampleWithoutReplacement(
strata = strata,
n = 6
)
#Check ratio in the sample
table(strata[s])/length(strata[s])
#>
#> a b
#> 0.3333333 0.6666667
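Proportionate allocation can be sketched in a few lines of base R. This is an illustration under simplifying assumptions, not the implementation of stratifiedSampleWithoutReplacement: the helper names are hypothetical, and the stratum sizes are obtained by plain rounding, which works here because the shares divide evenly.

```r
set.seed(5381L)

# Same strata as above: 3 elements in "a", 6 in "b"
strata <- c(rep("a", 3), rep("b", 6))
n <- 6

# Indices of the population elements, grouped by stratum
index_by_stratum <- split(seq_along(strata), strata)

# Proportionate allocation: stratum sample size = n * stratum share
allocation <- round(n * lengths(index_by_stratum) / length(strata))

# Draw without replacement within each stratum
s <- unlist(
  Map(
    function(idx, size) idx[sample.int(length(idx), size)],
    index_by_stratum,
    allocation
  ),
  use.names = FALSE
)

# 2 elements from "a" and 4 from "b": the 1/3 vs 2/3 ratio is preserved
table(strata[s])
```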
Balanced Random Sampling
Balanced sampling is a special case of stratified sampling used to ensure that subgroups of the population are equally represented in the taken sample.
#Define strata
strata = c(rep("a", 3),rep("b", 6))
#Check ratio
table(strata)/length(strata)
#> strata
#> a b
#> 0.3333333 0.6666667
#Balanced sampling without replacement
s = balancedSampleWithoutReplacement(
strata = strata,
n = 6
)
#Check ratio in the sample
table(strata[s])/length(strata[s])
#>
#> a b
#> 0.5 0.5
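For comparison, equal allocation can be sketched in base R as follows. This is an assumption about how a balanced sample may be built, not the code of balancedSampleWithoutReplacement; it requires every stratum to contain at least n divided by the number of strata elements.

```r
set.seed(5381L)

# Same strata as above: 3 elements in "a", 6 in "b"
strata <- c(rep("a", 3), rep("b", 6))
n <- 6

# Indices of the population elements, grouped by stratum
index_by_stratum <- split(seq_along(strata), strata)

# Equal allocation: the same number of elements from every stratum
per_stratum <- n %/% length(index_by_stratum)
stopifnot(all(lengths(index_by_stratum) >= per_stratum))

# Draw without replacement within each stratum
s <- unlist(
  lapply(
    index_by_stratum,
    function(idx) idx[sample.int(length(idx), per_stratum)]
  ),
  use.names = FALSE
)

# 3 elements from each stratum
table(strata[s])
```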
Permutation Sampling
A permutation sample is a sample of the same size as the population, where the elements are simply rearranged in a random order.
A permutation sample can be taken by using ?permutationSample:
#Permutation sampling
permutationSample(
N = 10
)
#> [1] 7 9 2 10 1 8 4 6 3 5
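The defining property of a permutation sample, that it contains every element exactly once, can be verified with base R. The sketch uses sample.int as a stand-in for permutationSample.

```r
set.seed(5381L)
N <- 10

# A random permutation of 1..N (size defaults to N)
s <- sample.int(n = N)

# Sorting the permutation recovers the original population
identical(sort(s), seq_len(N))
#> [1] TRUE
```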
K-fold Sampling
In k-fold sampling, each element of the population is randomly assigned to one of k folds. resampling provides a function to easily select a random sample made of elements from k-1 folds:
#K-1 folds sampling
randomKm1Folds(
N = 10,
k = 3
)
#> [1] 3 5 6 7 8 9 10
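The two steps, fold assignment and selection of k-1 folds, can be sketched in base R. The way folds are assigned and the way the left-out fold is chosen are assumptions made for illustration; randomKm1Folds may proceed differently.

```r
set.seed(5381L)
N <- 10
k <- 3

# Randomly assign each element to one of k folds of near-equal size
fold_id <- sample(rep_len(seq_len(k), N))

# Choose the fold to leave out at random
held_out <- sample.int(k, 1)

# The sample is made of the elements in the remaining k-1 folds
s <- which(fold_id != held_out)
s
```

With N = 10 and k = 3 the folds contain 4, 3, and 3 elements, so the sample has 6 or 7 elements depending on which fold is left out.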
Stratified K-fold Sampling
In stratified k-fold sampling, each element in the population is assigned to one of the k folds so that the percentage of each stratum in the population is preserved in each fold.
?stratifiedKm1Folds assigns the population to k stratified folds and returns a sample made of elements from k-1 folds:
#Define strata
strata = c(rep("a", 3),rep("b", 6))
#Check ratio
table(strata)/length(strata)
#> strata
#> a b
#> 0.3333333 0.6666667
#Assign data to 3 folds and take a sample
i = stratifiedKm1Folds(
strata = strata,
k = 3
)
#Check sample
i
#> [1] 1 3 4 6 8 9
#Check ratio in the sample
table(strata[i])/length(strata[i])
#>
#> a b
#> 0.3333333 0.6666667
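A stratified fold assignment can be sketched in base R by assigning folds separately within each stratum. This is an illustration of the idea and not the implementation of stratifiedKm1Folds; leaving out fold 1 is an arbitrary choice.

```r
set.seed(5381L)

# Same strata as above: 3 elements in "a", 6 in "b"
strata <- c(rep("a", 3), rep("b", 6))
k <- 3

# Assign folds within each stratum, so every fold gets its share
fold_id <- integer(length(strata))
for (idx in split(seq_along(strata), strata)) {
  fold_id[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Each fold holds 1 element from "a" and 2 from "b"
table(strata, fold_id)

# Sample made of the elements in folds 2 and 3
s <- which(fold_id != 1L)
table(strata[s]) / length(s)
```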
Balanced K-fold Sampling
Balanced k-fold sampling is a special case of stratified k-fold sampling in which the population is assigned to k balanced folds.
?balancedKm1Folds assigns the population to k balanced folds and returns a sample made of elements from k-1 folds. Internally, it uses ?balancedKFolds to assign the population to k balanced folds:
- If the population is balanced, each element in the population is assigned to one of the k folds so that the percentage of each stratum is preserved in each fold
- If the population is unbalanced and undersample = TRUE, the so-called random undersampling is adopted, i.e. elements are randomly removed from the majority groups so that all strata have the same size
#Define balanced strata
strata = c(rep(1,6),rep(2,6))
#Check ratio
table(strata)/length(strata)
#> strata
#> 1 2
#> 0.5 0.5
#Assign data to 3 folds and take a sample
i = balancedKm1Folds(
strata = strata,
k = 3
)
#Check sample
i
#> [1] 1 3 5 6 7 8 9 10
#Check ratio in the sample
table(strata[i])/length(strata[i])
#>
#> 1 2
#> 0.5 0.5
#Define unbalanced strata
strata = c(rep(1,6),rep(2,12))
#Check ratio
table(strata)/length(strata)
#> strata
#> 1 2
#> 0.3333333 0.6666667
#Assign data to 3 folds and take a sample
i = balancedKm1Folds(
strata = strata,
k = 3,
undersample = TRUE
)
#Check sample
i
#> [1] 1 3 4 5 7 10 11 17
#Check ratio in the sample
table(strata[i])/length(strata[i])
#>
#> 1 2
#> 0.5 0.5
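The random undersampling step can be sketched in base R. This is a hedged illustration of the idea, not the code used by balancedKFolds: every stratum is reduced to the size of the smallest one by dropping randomly chosen elements.

```r
set.seed(5381L)

# Same unbalanced strata as above: 6 elements in 1, 12 in 2
strata <- c(rep(1, 6), rep(2, 12))

# Indices of the population elements, grouped by stratum
index_by_stratum <- split(seq_along(strata), strata)

# Size of the smallest stratum
minority_size <- min(lengths(index_by_stratum))

# Keep a random subset of that size from every stratum
kept <- sort(unlist(
  lapply(
    index_by_stratum,
    function(idx) idx[sample.int(length(idx), minority_size)]
  ),
  use.names = FALSE
))

# 6 elements per stratum remain
table(strata[kept])
```

The 12 retained elements can then be assigned to k balanced folds; with k = 3, a sample made of 2 folds has 8 elements, which matches the size of the sample shown above.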
Leave-p-out Sampling
In leave-p-out sampling, a random sample of size p is taken from the population and used as holdout data.
?leavePOutSample returns a random sample obtained by removing the holdout sample from the population:
#Take one sample leaving out p elements
leavePOutSample(N = 5, p = 2)
#> [1] 1 2 3
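The relation between the holdout data and the returned sample can be sketched in base R. The code below is an illustration and not the implementation of leavePOutSample.

```r
set.seed(5381L)
N <- 5
p <- 2

# Holdout: p elements drawn at random without replacement
holdout <- sample.int(n = N, size = p)

# Sample: the N - p elements that are not in the holdout
s <- setdiff(seq_len(N), holdout)
length(s)
#> [1] 3

# Number of distinct leave-p-out samples for this population
choose(N, p)
#> [1] 10
```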
Leave-one-out Sampling
Leave-one-out sampling is a particular case of leave-p-out sampling, where p = 1. That is, a random sample of size 1 is taken from the population and used as holdout data.
For example, let’s assume our population of interest has 5 elements. In this case, a sample obtained with this sampling technique will contain 4 elements:
#Take one sample leaving out 1 element
leaveOneOutSample(N = 5)
#> [1] 1 2 3 5