Sampling Without Replacement

Introduction

In statistics, sampling is the selection of a subset of elements from a population, here defined as a complete set of subjects of interest.

Because it is often too expensive or logistically impossible to collect data for every case in a population, sampling is used instead as a cheaper and faster way to estimate the population's characteristics.

Different sampling schemes exist, but they can be grouped into 2 main categories, i.e. sampling with or without replacement:

sampling with replacement implies each element in the population may appear multiple times in one sample
in sampling without replacement, each member of the population can be chosen only once in one sample
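The distinction can be illustrated with base R's sample() function, independently of the resampling package. This is only a minimal sketch: the population vector and the variable names are made up for illustration.

```r
# Hypothetical population of 5 labelled elements (illustration only)
population <- c("a", "b", "c", "d", "e")

# With replacement: the same element may be drawn more than once
with_repl <- sample(population, size = 5, replace = TRUE)

# Without replacement: every element is drawn at most once
without_repl <- sample(population, size = 5, replace = FALSE)

# A sample without replacement never contains duplicates
anyDuplicated(without_repl) == 0
```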

In this article, we show how to draw samples without replacement from a population by using the functions implemented in resampling.

For further information on how to draw repeated samples without replacement, see Resampling without replacement.

Setup

Loading

Firstly, we need to load the resampling R package:

library(resampling)

Seed

Then, we set a seed for the random number generator (RNG). By default, each R session creates its seed from the current time and the process ID, so different sessions produce different simulation results. Fixing a seed ensures that the results of this vignette can be reproduced. We can specify a seed by calling ?set.seed.

#Set a seed for RNG
set.seed(
  #A seed
  seed = 5381L,                   #a randomly chosen integer value
  #The kind of RNG to use
  kind = "Mersenne-Twister",      #we make explicit the current R default value
  #The kind of Normal generation
  normal.kind = "Inversion"       #we make explicit the current R default value
)
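To see what fixing the seed buys us, the sketch below draws the same sample twice with base R's sample.int(), resetting the seed before each draw. The variable names are illustrative. Note that every extra random draw advances the RNG stream, so the seed is set once more at the end to leave the generator in the state the rest of this vignette assumes.

```r
# Draw a sample, resetting the seed immediately before the draw
set.seed(5381L)
first_draw <- sample.int(n = 10, size = 5)

# Reset the seed and repeat exactly the same call
set.seed(5381L)
second_draw <- sample.int(n = 10, size = 5)

# Both draws are identical because the RNG state was the same
identical(first_draw, second_draw)

# Restore the RNG state for the code that follows
set.seed(5381L)
```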

As stated in the introduction, sampling without replacement implies that each element of a population can be chosen only once in one sample. There are different techniques of sampling without replacement, including:

simple random sampling
random sampling with unequal probabilities
stratified sampling
balanced sampling (a special case of stratified sampling)
permutation sampling
k-fold sampling
leave-p-out sampling

The available methods can be listed through the ?listAvailableSamplingMethods function call, setting the input argument to 'rswor'. ?listAvailableSamplingMethods returns a table with two columns:

id: the id of the sampling method, to be used in the function calls
name: the name of the sampling method

#list sampling methods
sampling.methods = listAvailableSamplingMethods(x = 'rswor')

#print in table
knitr::kable(x = sampling.methods)

id	name
rswor	random sampling without replacement
srswor	simple random sampling without replacement
stratified_rswor	stratified random sampling without replacement
balanced_rswor	balanced random sampling without replacement
permutation	permutation sampling
kfolds	random k-fold sampling
stratified_kfolds	stratified k-fold sampling
balanced_kfolds	balanced k-fold sampling
leave_p_out	leave-p-out sampling
leave_one_out	leave-one-out sampling

The names of the sampling functions can be retrieved by calling ?listSamplingFunctionNames.

#list sampling function names
sampling.function.names = listSamplingFunctionNames(x = 'rswor')

#print in table
knitr::kable(x = sampling.function.names)

id	name
rswor	sampleWithoutReplacement
srswor	simpleRandomSampleWithoutReplacement
stratified_rswor	stratifiedSampleWithoutReplacement
balanced_rswor	balancedSampleWithoutReplacement
permutation	permutationSample
kfolds	randomKm1Folds
stratified_kfolds	stratifiedKm1Folds
balanced_kfolds	balancedKm1Folds
leave_p_out	leavePOutSample
leave_one_out	leaveOneOutSample

Each function is documented. To learn more about a specific method it is possible to use the ? operator. For example, let’s check the function ?simpleRandomSampleWithoutReplacement.

#See documentation
?simpleRandomSampleWithoutReplacement

From the documentation, we can see that the function takes two input arguments:

N: the population size
n: the sample size

Simple Random Sampling

Simple random sampling (SRS) is the simplest form of sampling without replacement. In SRS without replacement, each element of the population has the same probability of being selected for the sample.

#Simple random sampling without replacement
simpleRandomSampleWithoutReplacement(
  N = 10,
  n = 8
)
#> [1]  1  9  7  2  4  3  6 10
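For comparison, the same design can be sketched in base R with sample.int(). This is an analogue for illustration, not necessarily how simpleRandomSampleWithoutReplacement is implemented. Under this design, each element is included in the sample with probability n / N.

```r
N <- 10  # population size
n <- 8   # sample size

# Draw n distinct indices from 1..N, all with equal probability
srs <- sample.int(n = N, size = n, replace = FALSE)

# Inclusion probability of each element under SRS without replacement
inclusion_prob <- n / N
```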

Random Sampling With Unequal Probability

The concept of random sampling without replacement with unequal probabilities was developed by Narain (Narain 1951) and by Horvitz and Thompson (Horvitz and Thompson 1952). Under this sampling design, elements of the population have different probabilities of being selected. We can use ?sampleWithoutReplacement to draw our sample. For example, let’s assume our population of interest has 10 elements, and that the first 3 elements have a higher chance of being selected.

#Random sampling without replacement
sampleWithoutReplacement(
  N = 10,
  n = 5,
  prob = c(rep(3,3), rep(1,7))
)
#> [1] 3 5 1 8 7
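The effect of the weights can be checked empirically. The sketch below uses base R's sample.int() rather than sampleWithoutReplacement, so it only approximates the behaviour of the package function, which may use a different algorithm. Base R draws the elements sequentially, so the resulting inclusion probabilities are not exactly proportional to the weights, but heavily weighted elements are still selected more often.

```r
N <- 10
n <- 5
w <- c(rep(3, 3), rep(1, 7))  # weights; sample.int() normalises them

# Repeat the draw many times and record which elements were selected
n_rep <- 2000
hits <- replicate(n_rep, seq_len(N) %in% sample.int(N, size = n, prob = w))

# Empirical inclusion frequency of each element (rows = elements)
incl_freq <- rowMeans(hits)

# Average frequency of the heavily weighted elements vs. the rest
mean(incl_freq[1:3])
mean(incl_freq[4:10])
```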

Stratified Random Sampling

When a population can be partitioned into groups (i.e. strata or subpopulations) having certain properties in common, a stratified sampling approach can be used. This sampling design is adopted to ensure that subgroups of the population are represented in the taken sample.

A stratified sample without replacement can be taken by using ?stratifiedSampleWithoutReplacement, which accepts a strata argument.

#Define strata
strata = c(rep("a", 3),rep("b", 6))

#Stratified sampling without replacement
stratifiedSampleWithoutReplacement(
  strata = strata,
  n = 9
)
#> [1] 5 6 8 4 2 7 3 9 1

?stratifiedSampleWithoutReplacement implements the so-called “proportionate allocation”, in which the proportion of the strata in the population is maintained in the samples.

#Define strata
strata = c(rep("a", 3),rep("b", 6))

#Check ratio
table(strata)/length(strata)
#> strata
#>         a         b 
#> 0.3333333 0.6666667

#Stratified sampling without replacement
s = stratifiedSampleWithoutReplacement(
  strata = strata,
  n = 6
)

#Check ratio in the sample
table(strata[s])/length(strata[s])
#> 
#>         a         b 
#> 0.3333333 0.6666667
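Proportionate allocation can be sketched in a few lines of base R: each stratum of size N_h contributes n_h = n * N_h / N elements, drawn by simple random sampling within the stratum. This is a simplified illustration with made-up variable names, not the package's implementation. In particular, it does not handle the case where rounding makes the stratum sizes sum to something other than n.

```r
strata <- c(rep("a", 3), rep("b", 6))
n <- 6

# Indices of the population elements, grouped by stratum
idx_by_stratum <- split(seq_along(strata), strata)

# Proportionate allocation: n_h = n * N_h / N
n_h <- round(n * lengths(idx_by_stratum) / length(strata))

# Simple random sample without replacement inside each stratum
strat_sample <- unlist(
  Map(function(idx, size) idx[sample.int(length(idx), size)],
      idx_by_stratum, n_h),
  use.names = FALSE
)

# Number of sampled elements per stratum
table(strata[strat_sample])
```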

Balanced Random Sampling

Balanced sampling is a special case of stratified sampling used to ensure that the subgroups of the population are equally represented in the sample.

#Define strata
strata = c(rep("a", 3),rep("b", 6))

#Check ratio
table(strata)/length(strata)
#> strata
#>         a         b 
#> 0.3333333 0.6666667

#Balanced sampling without replacement
s = balancedSampleWithoutReplacement(
  strata = strata,
  n = 6
)

#Check ratio in the sample
table(strata[s])/length(strata[s])
#> 
#>   a   b 
#> 0.5 0.5
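The equal allocation behind balanced sampling can also be sketched in base R. The code below is an illustration under the assumption that n is split evenly across the strata. The names are hypothetical, and balancedSampleWithoutReplacement may handle edge cases differently.

```r
strata <- c(rep("a", 3), rep("b", 6))
n <- 6

# Indices of the population elements, grouped by stratum
idx_by_stratum <- split(seq_along(strata), strata)

# Equal allocation: the same number of elements from every stratum
n_per_stratum <- n %/% length(idx_by_stratum)

# Without replacement, no stratum can give more elements than it holds
stopifnot(n_per_stratum <= min(lengths(idx_by_stratum)))

# Simple random sample of equal size inside each stratum
bal_sample <- unlist(lapply(idx_by_stratum, function(idx) {
  idx[sample.int(length(idx), n_per_stratum)]
}), use.names = FALSE)

# Number of sampled elements per stratum
table(strata[bal_sample])
```

The check on the smallest stratum matters here: stratum "a" has only 3 elements, so 6 is the largest balanced sample that can be drawn without replacement from this population.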

Permutation Sampling

A permutation sample is a sample of the same size as the population, in which the elements are simply rearranged in a random order.

A permutation sample can be taken by using ?permutationSample:

#Permutation sampling
permutationSample(
  N = 10
)
#>  [1]  7  9  2 10  1  8  4  6  3  5
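In base R, a random permutation of the indices 1..N can be obtained by calling sample.int() with only the population size. The sketch below, with illustrative variable names, also verifies the defining property of a permutation sample: every index appears exactly once.

```r
N <- 10

# Random rearrangement of the indices 1..N
perm <- sample.int(N)

# Sorting the permutation gives back the original indices
identical(sort(perm), seq_len(N))

# Number of distinct permutation samples of a population of size N
factorial(N)
```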

K-fold Sampling

In k-fold sampling, each element of the population is randomly assigned to 1 of k folds. resampling provides a function to easily select a random sample made of elements from k-1 folds:

#K-1 folds sampling
randomKm1Folds(
  N = 10,
  k = 3
)
#> [1]  3  5  6  7  8  9 10
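The two steps, assigning folds and then keeping k-1 of them, can be made explicit with a base R sketch. The fold assignment scheme and the choice of which fold to hold out are assumptions of this example, and randomKm1Folds may proceed differently.

```r
N <- 10
k <- 3

# Randomly assign each element to one of k folds of (almost) equal size
fold_id <- sample(rep_len(seq_len(k), N))

# Hold out one fold (here the last one, an arbitrary choice)
holdout_fold <- k

# The sample contains the elements of the remaining k - 1 folds
km1_sample <- which(fold_id != holdout_fold)

# Fold sizes
table(fold_id)
```

With N = 10 and k = 3, the folds cannot all have the same size: one fold has 4 elements and the other two have 3, so the size of the sample depends on which fold is held out.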

Stratified K-fold Sampling

In stratified k-fold sampling, each element in the population is assigned to one of the k folds so that the percentage of each stratum in the population is preserved in each fold.

?stratifiedKm1Folds assigns the population to k stratified folds and returns a sample made of elements from k-1 folds:

#Define strata
strata = c(rep("a", 3),rep("b", 6))

#Check ratio
table(strata)/length(strata)
#> strata
#>         a         b 
#> 0.3333333 0.6666667

#Assign data to 3 folds and take a sample
i = stratifiedKm1Folds(
  strata = strata,
  k = 3
)

#Check sample
i
#> [1] 1 3 4 6 8 9

#Check ratio in the sample
table(strata[i])/length(strata[i])
#> 
#>         a         b 
#> 0.3333333 0.6666667
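One way to build stratified folds in base R is to assign fold labels separately within each stratum, for example with ave(). The following is a sketch of that idea with hypothetical variable names. It does not claim to reproduce the internals of stratifiedKm1Folds.

```r
strata <- c(rep("a", 3), rep("b", 6))
k <- 3

# Assign folds separately inside each stratum, so that every fold keeps
# the population proportions; ave() applies FUN group by group
strat_fold_id <- ave(
  seq_along(strata), strata,
  FUN = function(idx) sample(rep_len(seq_len(k), length(idx)))
)

# Cross-tabulate strata and folds: one "a" and two "b" per fold here
table(strata, strat_fold_id)

# Sample made of all folds except the (arbitrarily chosen) first one
strat_km1 <- which(strat_fold_id != 1)
```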

Balanced K-fold Sampling

Balanced k-fold sampling is a special case of stratified k-fold sampling in which the population is assigned to k balanced folds.

?balancedKm1Folds assigns the population to k balanced folds and returns a sample made of elements from k-1 folds. Internally, it uses ?balancedKFolds to assign the population to k balanced folds:

If the population is balanced, each element in the population is assigned to one of the k folds so that the percentage of each stratum is preserved in each fold
If the population is unbalanced and undersample = TRUE, the so-called random undersampling is adopted, i.e. the proportion of the strata in the population is adjusted by removing elements from the majority groups, so that each stratum is balanced

#Define balanced strata
strata = c(rep(1,6),rep(2,6))

#Check ratio
table(strata)/length(strata)
#> strata
#>   1   2 
#> 0.5 0.5

#Assign data to 3 folds and take a sample
i = balancedKm1Folds(
  strata = strata,
  k = 3
)

#Check sample
i
#> [1]  1  3  5  6  7  8  9 10

#Check ratio in the sample
table(strata[i])/length(strata[i])
#> 
#>   1   2 
#> 0.5 0.5

#Define unbalanced strata
strata = c(rep(1,6),rep(2,12))

#Check ratio
table(strata)/length(strata)
#> strata
#>         1         2 
#> 0.3333333 0.6666667

#Assign data to 3 folds and take a sample
i = balancedKm1Folds(
  strata = strata,
  k = 3,
  undersample = TRUE
)

#Check sample
i
#> [1]  1  3  4  5  7 10 11 17

#Check ratio in the sample
table(strata[i])/length(strata[i])
#> 
#>   1   2 
#> 0.5 0.5
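The random undersampling step described above can be sketched in base R as follows. Every stratum is reduced to the size of the smallest one by discarding randomly chosen elements. This is an illustration with made-up names and only covers the undersampling, not the subsequent fold assignment performed by the package.

```r
strata <- c(rep(1, 6), rep(2, 12))

# Indices of the population elements, grouped by stratum
idx_by_stratum <- split(seq_along(strata), strata)

# Size of the smallest stratum: the target size for every stratum
n_min <- min(lengths(idx_by_stratum))

# Random undersampling: keep n_min randomly chosen elements per stratum
kept <- sort(unlist(lapply(idx_by_stratum, function(idx) {
  idx[sample.int(length(idx), n_min)]
}), use.names = FALSE))

# The retained elements are now balanced across the strata
table(strata[kept])
```

After this step, 6 of the 12 elements of the majority stratum have been discarded, and the retained 12 elements can be assigned to folds as in the balanced case.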

Leave-p-out Sampling

In leave-p-out sampling, a random sample of size p is taken from the population and used as holdout data.

?leavePOutSample returns a random sample obtained by removing the holdout sample from the population:

#Take one sample leaving out p elements
leavePOutSample(N = 5, p = 2)
#> [1] 1 2 3
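The logic can be written out in base R as a set difference between the population and a randomly chosen holdout sample. The sketch below uses illustrative names and is not meant to describe how leavePOutSample is implemented.

```r
N <- 5
p <- 2

# Randomly choose p elements to hold out
holdout <- sample.int(N, size = p)

# The sample is the population without the holdout elements
lpo_sample <- setdiff(seq_len(N), holdout)

# Number of distinct leave-p-out samples for this population
choose(N, p)
```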

Leave-one-out Sampling

Leave-one-out sampling is a special case of leave-p-out sampling where p = 1: a single randomly chosen element is taken from the population and used as holdout data.

For example, let’s assume our population of interest has 5 elements. In this case, a sample obtained with this sampling technique contains 4 elements:

#Take one sample leaving out 1 element
leaveOneOutSample(N = 5)
#> [1] 1 2 3 5
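Since only one element is left out, a population of size N admits exactly N distinct leave-one-out samples. They can be enumerated in base R with negative indexing, as in the following sketch (the variable names are illustrative):

```r
N <- 5

# One sample per element: drop element i, keep the other N - 1
loo_samples <- lapply(seq_len(N), function(i) seq_len(N)[-i])

# Number of distinct leave-one-out samples
length(loo_samples)

# The sample that leaves out element 4
loo_samples[[4]]
```

The last sample, which leaves out element 4, contains the same elements as the one drawn above.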

References

Horvitz, D. G., and D. J. Thompson. 1952. “A Generalization of Sampling Without Replacement From a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85. https://doi.org/10.2307/2280784.

Narain, R. D. 1951. “On Sampling Without Replacement with Varying Probabilities.” Journal of the Indian Society of Agricultural Statistics 3: 169–75.