set.seed(
seed = 5381L, # a fixed integer value
kind = "Mersenne-Twister", # default RNG algorithm in R
normal.kind = "Inversion" # default method for Normal variates
)Supervised Learning: Classification
1 Objective
In this tutorial, we explore and compare several supervised learning models for binary classification. We use gene expression data from The Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma (BRCA) project to predict oestrogen receptor (ER) status — a clinically important molecular feature that distinguishes two major breast cancer subtypes and directly guides treatment decisions.
In the unsupervised learning lab, you applied hierarchical clustering and PCA to the same TCGA BRCA dataset. You may have noticed that two large groups of samples separated clearly along the first principal component — those groups correspond closely to ER+ and ER− tumours. In this lab, we move from discovering structure to predicting a known label.
We will cover the following classifiers:
- Logistic Regression (GLM)
- K-Nearest Neighbours (KNN)
- Support Vector Machines (SVM)
- Decision Trees
- Random Forests
- Gradient Boosting Machines (GBM)
For each method, we will:
- introduce the theory and key hyperparameters
- train the model on a training set
- evaluate performance on a held-out test set (accuracy and AUC-ROC)
- visualise and interpret the results
2 Setup
2.1 Seed
Before running the tutorial, we fix a seed for the random number generator (RNG). Different R sessions use different seeds by default (derived from the current time and process ID), so results would vary between runs. By setting a fixed seed we ensure full reproducibility.
2.2 Libraries
# Data access
library(here)
# Data manipulation and visualisation
library(dplyr)
library(tidyr)
library(ggplot2)
library(kableExtra)
# Classifiers
library(stats) # glm (logistic regression)
library(glmnet) # penalised logistic regression (LASSO, ridge, elastic net)
library(class) # knn
library(e1071) # svm
library(rpart) # decision tree
library(rpart.plot) # decision tree visualisation
library(randomForest)
library(gbm)
# Performance evaluation
library(pROC)
library(PRROC) # precision-recall curves and average precisionIf any package is not installed, you can install it with:
install.packages(c("here", "dplyr", "tidyr", "ggplot2", "kableExtra",
"glmnet", "class", "e1071", "rpart", "rpart.plot",
"randomForest", "gbm", "pROC", "PRROC"))3 Data
3.1 TCGA BRCA Dataset
We use gene expression data from the Breast Invasive Carcinoma (BRCA) project of The Cancer Genome Atlas (TCGA). The data were downloaded with the TCGAbiolinks R package. Expression values are in log₂(TPM + 1) units, where TPM stands for Transcripts Per Million — a normalisation that accounts for differences in sequencing depth and gene length across samples.
The dataset has been pre-processed: the 4,000 most variable genes (by standard deviation across samples) have been retained, and Ensembl gene IDs have been translated to HGNC gene symbols.
After loading, you should see two objects in your Environment:
-
tcga_brca_log2_tpm— a matrix of gene expression values (samples × genes), with expression in log₂(TPM + 1) units -
tcga_brca_sample_metadata— a data frame with clinical annotations for each sample, including ER status, PAM50 subtype, and other covariates
3.2 Sample alignment
Before extracting the outcome variable, we verify that the rows of the expression matrix and the rows of the clinical metadata refer to the same samples in the same order. This is a critical data integrity check: if the two objects were inadvertently re-ordered (e.g. during a join or merge step), predictions would be assigned to the wrong patients.
# Check that sample IDs match between the expression matrix and metadata
all_match <- all(rownames(tcga_brca_log2_tpm) == tcga_brca_sample_metadata$patient_id)
if (all_match) {
cat("✔ Sample IDs are aligned between expression matrix and metadata.\n")
} else {
stop("✘ Sample ID mismatch! Check the data before proceeding.")
}✔ Sample IDs are aligned between expression matrix and metadata.
In multi-omics datasets, different data sources (expression, mutation, clinical) are often assembled independently and may not share the same sample order. Always check that IDs match before linking clinical variables to molecular data — a mismatch here is one of the most common and most dangerous sources of error in computational biology.
We should start by having a look at the data.
Expression matrix — rows (samples): 996 | columns (genes): 4000
Value range: 0 18.06
# Inspect the top-left corner of the matrix
tcga_brca_log2_tpm[1:4, 1:5] SCGB2A2 SCGB1D2 TFF1 PIP CPB1
TCGA-BH-A18H 6.34592 3.18637 10.68947 5.71509 0.51702
TCGA-E2-A14P 3.61394 2.12542 2.41346 7.83945 0.04963
TCGA-AN-A04A 9.27772 9.63175 10.12391 5.66310 10.15165
TCGA-5L-AAT1 9.22320 7.95963 8.67059 10.34318 1.32562
Each row is a patient sample and each column is a gene. Values are in log2(TPM + 1): a value of 0 means the gene was not detected; larger values indicate higher expression. The log transformation compresses the dynamic range and makes the data more amenable to distance-based methods.
The metadata is stored in the tcga_brca_sample_metadata data frame.
# A tibble: 6 × 6
patient_id subtype ER_status PR_status stage age_years
<chr> <chr> <fct> <fct> <chr> <dbl>
1 TCGA-BH-A18H LumA ER+ PR+ Stage IA 63.1
2 TCGA-E2-A14P Her2 ER- PR- Stage IIIC 79.2
3 TCGA-AN-A04A LumA ER+ PR+ Stage IIIA 36.8
4 TCGA-5L-AAT1 LumA ER+ PR+ Stage IV 63.6
5 TCGA-AQ-A0Y5 LumA ER+ PR+ Stage IIIA 70.6
6 TCGA-A8-A07I Her2 ER+ PR- Stage IIIA 69.2
The key variables we will use in this tutorial are:
| Column | Description |
|---|---|
patient_id |
Short TCGA patient barcode |
ER_status |
Oestrogen receptor status (ER+ / ER-) |
3.3 Defining the Outcome
Our classification target is ER status (oestrogen receptor status), a binary variable indicating whether a tumour expresses the oestrogen receptor protein. ER+ tumours are typically treated with hormone therapy (e.g. tamoxifen), while ER− tumours are not, making this distinction clinically critical.
# Extract ER status from clinical metadata
# Keep only samples with known ER status (drop NAs)
er_status <- tcga_brca_sample_metadata$ER_status
valid_idx <- !is.na(er_status)
x <- tcga_brca_log2_tpm[valid_idx, , drop = FALSE]
y <- factor(er_status[valid_idx]) # factor with levels "ER-" / "ER+"
cat("ER status levels:", levels(y), "\n")ER status levels: ER- ER+
Number of samples with known ER status: 996
3.4 Data Exploration
Before modelling, it is always good practice to explore the data.
3.4.1 Sample size and class balance
# Class distribution
class_counts <- table(y)
class_pct <- round(prop.table(class_counts) * 100, 1)
data.frame(
Status = names(class_counts),
N = as.integer(class_counts),
Percentage = paste0(as.numeric(class_pct), "%")
) |>
kable(caption = "ER status distribution") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Status | N | Percentage |
|---|---|---|
| ER- | 228 | 22.9% |
| ER+ | 768 | 77.1% |
The dataset is imbalanced: ER+ tumours represent approximately 77.1% of the samples. This means that a naive classifier that always predicts “ER+” would still achieve high accuracy, making accuracy alone a misleading metric.
We therefore report both AUC-ROC and AUC-PR. AUC-ROC evaluates how well the model ranks ER+ and ER− samples across classification thresholds, whereas AUC-PR focuses on the trade-off between precision and recall and is often more informative when one class is less frequent.
3.4.2 Dimensionality
Samples: 996
Genes (features): 4000
We have 996 samples but 4000 features, so the number of predictors is larger than the number of samples. In this p > n regime, many classifiers can overfit easily, especially when many genes are correlated or only weakly informative. This is why we evaluate performance on a held-out test set, and why regularisation, feature selection, careful tuning, and appropriate model choice are important in gene-expression-based classification.
3.4.3 Feature distributions
Before applying any classifier, it is instructive to inspect whether individual genes already show differential expression between ER+ and ER− tumours. If the two classes separate well on even a handful of genes, we can expect classifiers to perform well. We visualise five well-characterised genes:
- ESR1 (oestrogen receptor 1) — encodes the ER protein directly; we expect dramatically higher expression in ER+ tumours.
- GATA3 and FOXA1 — transcription factors co-expressed with ESR1 in the luminal lineage; expected to be higher in ER+ tumours.
- MKI67 — a proliferation marker; often enriched in more aggressive, proliferative tumours, including many ER− and basal-like cancers.
- ERBB2 — encodes the HER2 receptor, overexpressed in the HER2 subtype (which is largely ER−); direction of difference may be reversed or mixed.
# Genes known to correlate with ER status
er_genes <- c("ESR1", "GATA3", "FOXA1", "MKI67", "ERBB2")
er_genes <- er_genes[er_genes %in% colnames(x)] # keep only those present
data.frame(
er_status = y,
x[, er_genes, drop = FALSE]
) |>
pivot_longer(-er_status, names_to = "gene", values_to = "expression") |>
mutate(gene = factor(gene, levels = er_genes)) |>
ggplot(aes(x = er_status, y = expression, fill = er_status)) +
geom_violin(alpha = 0.6, trim = FALSE) +
geom_boxplot(width = 0.15, outlier.size = 0.5) +
facet_wrap(~gene, scales = "free_y", nrow = 1) +
scale_fill_manual(values = c("ER-" = "#E64B35", "ER+" = "#4DBBD5")) +
labs(
title = "Expression of ER-associated genes by ER status",
x = "ER status", y = "log₂(TPM + 1)", fill = "ER status"
) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom")The plots confirm our biological expectations: ESR1, GATA3, and FOXA1 are all markedly higher in ER+ tumours, providing strong individual discriminatory signal. MKI67 shows the opposite trend — higher in ER− tumours, consistent with the more proliferative, aggressive phenotype of ER− breast cancer. ERBB2 shows a more mixed picture, as expected given that HER2 amplification occurs across ER subtypes. The clear separation for ESR1 in particular tells us that even a single-gene logistic regression model should perform reasonably well — and we will verify this shortly.
4 Supervised Learning: Background
The goal of supervised learning is to find a model \(f\) such that \[\hat{y} = f(\mathbf{x})\] where \(\mathbf{x}\) is the vector of predictors (gene expression values) and \(\hat{y}\) is the predicted response (ER status).
It is called supervised because during training the true label \(y\) is known and guides the learning process.
Depending on the nature of \(y\), supervised learning problems split into:
- Regression — \(y\) is quantitative (e.g. tumour size in mm)
- Classification — \(y\) is qualitative (e.g. ER+/ER−, as in this lab)
5 Train / Test Split
To assess generalisation performance (i.e. how well a model predicts new data it has never seen), we split the data into:
- Training set (70%) — used to fit the model
- Test set (30%) — used only for final evaluation
Because ER status is imbalanced, the split should be stratified by ER status, so that the proportions of ER+ and ER− tumours are similar in the training and test sets.
A single train/test split is the simplest evaluation strategy, but it is sensitive to the particular random split. In practice, k-fold cross-validation (or repeated k-fold CV) should be used to obtain more stable estimates. We demonstrate a brief CV example at the end. Here we keep a single split for clarity and speed.
set.seed(seed = 5381L)
n_obs <- nrow(x)
full_index <- seq_len(n_obs)
train_index <- sample(full_index, size = floor(n_obs * 0.7), replace = FALSE)
# Separate predictor matrix and label vector
train_x <- x[train_index, , drop = FALSE]
test_x <- x[-train_index, , drop = FALSE]
train_y <- y[train_index]
test_y <- y[-train_index]
# Combined data frames (needed by formula-based functions)
train_df <- data.frame(train_x, er_status = train_y, stringsAsFactors = FALSE)
test_df <- data.frame(test_x, er_status = test_y, stringsAsFactors = FALSE)
# Scaled versions (needed by distance-based methods such as KNN and SVM)
train_scaled_x <- scale(train_x, center = TRUE, scale = TRUE)
test_scaled_x <- scale(
test_x,
center = attr(train_scaled_x, "scaled:center"),
scale = attr(train_scaled_x, "scaled:scale")
)
cat("Training samples:", nrow(train_x), "\n")Training samples: 697
Test samples: 299
When scaling the test set, always use the mean and standard deviation estimated on the training set (not on the test set itself). Scaling the test set with its own statistics would constitute data leakage — the test set would implicitly influence the preprocessing, giving an over-optimistic performance estimate.
In the code above, we pass center and scale attributes from the scaled training matrix train_scaled_x to scale() when processing the test matrix test_x.
6 Performance Metrics
We evaluate each classifier using three complementary metrics:
| Metric | Definition | Why it matters here |
|---|---|---|
| Accuracy | Proportion of correctly classified samples | Easy to interpret, but misleading with class imbalance |
| AUC-ROC | Area under the Receiver Operating Characteristic curve | Threshold-independent measure of class separation |
| AUC-PR | Area under the Precision-Recall curve (Average Precision) | Most informative for imbalanced datasets |
Accuracy is computed using a single, fixed decision threshold — typically 0.5, meaning that a sample is classified as ER+ if the predicted probability exceeds 0.5. Changing the threshold changes the accuracy. This is why accuracy alone can be misleading: with ~80% ER+ samples in the dataset, a trivial classifier that always predicts ER+ would achieve ~80% accuracy with no useful discriminatory power.
AUC-ROC evaluates the classifier at all possible thresholds simultaneously. The Receiver Operating Characteristic (ROC) curve plots sensitivity (the true positive rate) on the y-axis against 1 − specificity (the false positive rate) on the x-axis. If ER+ is treated as the positive class, sensitivity is the proportion of true ER+ tumours correctly identified, specificity is the proportion of true ER− tumours correctly identified, and 1 − specificity is the proportion of actual ER− cases that are incorrectly called ER+.
Specificity decreases from left (1.0) to right (0.0) along the x-axis, which is why ROC curves are sometimes displayed with the specificity axis reversed.
Each point on the ROC curve corresponds to one threshold value. The Area Under the Curve (AUC) summarises the curve in a single number: AUC = 0.5 corresponds to a random classifier (the diagonal), whereas AUC = 1.0 corresponds to a perfect classifier.
AUC-PR (Average Precision) complements AUC-ROC by focusing on the chosen positive class only. If ER+ is treated as the positive class, the Precision-Recall (PR) curve plots:
- Precision = \(\frac{TP}{TP + FP}\) — of all samples predicted as ER+, what fraction truly are?
- Recall (= Sensitivity) = \(\frac{TP}{TP + FN}\) — of all true ER+ samples, what fraction did we correctly identify?
At each possible threshold, we get one (Precision, Recall) pair, and the PR curve traces these pairs. The area under this curve — the Average Precision (AP) — summarises performance in a single number between 0 and 1.
Unlike ROC-AUC, the baseline value of AUC-PR depends on the prevalence of the positive class. Therefore, AUC-PR must always be interpreted with respect to which class has been defined as “positive”.
AUC-ROC and AUC-PR summarise different aspects of classifier performance. AUC-ROC measures how well the model ranks samples from the two classes across thresholds. AUC-PR focuses on the precision-recall trade-off for the class defined as positive.
The baseline for AUC-ROC is 0.5 for a non-informative classifier. In contrast, the baseline for AUC-PR is the prevalence of the positive class. If ER+ is the positive class, the no-skill AUC-PR baseline is approximately ~80%; if ER− is the positive class, the baseline is approximately ~20%.
AUC-ROC accounts for true negatives (ER− samples correctly classified as ER−) in its computation. With many true negatives, a classifier can achieve a high AUC-ROC even if it performs poorly on the positive class. The PR curve ignores true negatives entirely, focusing exclusively on how well the model identifies the positive class. As a result, AUC-PR is less easily inflated by class imbalance.
Both metrics are informative, but in different ways.
Throughout this tutorial, we mark the Youden threshold on each ROC curve. This is the probability threshold that maximises the Youden index: \[J = \text{sensitivity} + \text{specificity} - 1\] It represents the operating point on the ROC curve that is furthest from the diagonal random-classifier line. This threshold gives equal weight to sensitivity and specificity, making it a useful default summary value.
It is commonly used as a principled starting point for choosing a decision threshold in clinical applications, though the final choice may be adjusted based on the relative costs of false positives and false negatives.
The threshold value, sensitivity, and specificity at the Youden point are printed directly on the plots below (in the format: threshold (sensitivity, specificity)).
We will store all results in a common data frame for final comparison.
results <- data.frame(
Model = character(),
Accuracy = numeric(),
AUC_ROC = numeric(),
AUC_PR = numeric(),
stringsAsFactors = FALSE
)A helper function to compute all metrics:
evaluate_model <- function(model_name, true_labels, pred_labels, pred_probs) {
# Confusion matrix and accuracy
cm <- table(actual = true_labels, predicted = pred_labels)
acc <- sum(diag(cm)) / sum(cm)
# ROC curve and AUC-ROC
roc_obj <- pROC::roc(
response = true_labels,
predictor = pred_probs,
levels = c("ER-", "ER+"),
direction = "<",
quiet = TRUE
)
auc_roc <- as.numeric(pROC::auc(roc_obj))
# Precision-Recall curve and Average Precision (AUC-PR)
# PRROC::pr.curve expects numeric scores and binary labels (1 = positive)
pr_obj <- PRROC::pr.curve(
scores.class0 = pred_probs[true_labels == "ER+"], # scores for positives
scores.class1 = pred_probs[true_labels == "ER-"], # scores for negatives
curve = TRUE
)
auc_pr <- pr_obj$auc.integral
list(cm = cm, acc = acc, auc = auc_roc, auc_pr = auc_pr,
roc = roc_obj, pr = pr_obj)
}7 1. Logistic Regression
7.1 Theory and intuition
Logistic regression is the canonical linear model for binary classification. It models the log-odds of belonging to the positive class as a linear function of the predictors: \[\log \frac{P(Y=1 \mid \mathbf{x})}{1 - P(Y=1 \mid \mathbf{x})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\]
Applying the inverse logit (sigmoid) function to both sides gives the probability of belonging to class 1: \[P(Y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}\]
This probability is bounded between 0 and 1 and traces the characteristic S-shaped (sigmoid) curve. A sample is classified as ER+ if the predicted probability exceeds a chosen threshold (typically 0.5).
The coefficients \(\beta_j\) are estimated by maximising the log-likelihood over the training data. A positive \(\beta_j\) means that, holding the other predictors fixed, higher values of predictor \(x_j\) increase the probability of being in the positive class; a negative \(\beta_j\) means the opposite.
Strengths: interpretable coefficients (the log-odds scale has direct biological meaning), natural probabilistic output, fast to fit. Limitations: assumes a linear decision boundary; in this p > n setting (with 4,000 genes and ~700 training samples) the model may overfit, and coefficients can become unstable without regularisation.
7.2 Key hyperparameters
glm
The standard stats::glm with family = binomial performs maximum likelihood estimation without regularisation. When p is large relative to n, consider using penalised logistic regression (ridge, LASSO, or elastic net) via the glmnet package, which adds a regularisation term to shrink coefficients toward zero. LASSO (α = 1) additionally performs automatic feature selection by setting many coefficients exactly to zero. We demonstrate this in the Exercise at the end of this section.
7.3 Training
7.3.1 Simple logistic regression (one predictor)
We start with a simple logistic regression using a single, biologically motivated predictor: ESR1 expression (the gene encoding the oestrogen receptor protein itself). Based on the exploratory violin plot above, we expect ESR1 expression to be a strong predictor of ER status — tumours with high ESR1 expression should have a high probability of being ER+, while tumours with low ESR1 expression should be predominantly ER−.
We fit the model and inspect its summary:
Call:
stats::glm(formula = er_status ~ ESR1, family = binomial, data = train_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4097 0.3250 -10.49 <2e-16 ***
ESR1 1.0498 0.0765 13.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 765.24 on 696 degrees of freedom
Residual deviance: 267.89 on 695 degrees of freedom
AIC: 271.89
Number of Fisher Scoring iterations: 6
The output confirms our expectation. The coefficient for ESR1 is positive (Estimate ≈ +1.05) and highly significant (p < 2×10⁻¹⁶), meaning that each unit increase in log₂(TPM+1) expression of ESR1 multiplies the odds of being ER+ by approximately \(e^{1.05} \approx 2.9\).
The model also fits much better than the intercept-only model: the residual deviance (268) is far lower than the null deviance (765), indicating that ESR1 alone explains a substantial fraction of the variance in ER status. In short, ESR1 is exactly the kind of single strong biomarker we would expect from basic biology.
7.3.2 Multiple logistic regression (all genes)
We now fit a multiple logistic regression using all 4,000 genes as predictors. In principle, using more genes should provide more discriminatory power. In practice, however, with far more predictors than samples (p > n), maximum likelihood estimation without regularisation becomes ill-conditioned: coefficients can grow arbitrarily large and the model will memorise the training data rather than generalising.
With 4000 predictors and 697 training samples, glm will issue convergence warnings and may produce extreme coefficient estimates — a clear sign of overfitting. For production use, regularisation is essential. Here we proceed deliberately to illustrate what happens when we ignore this constraint.
7.4 Prediction and evaluation
Having trained both models, we now evaluate them on the held-out test set — data the model has never seen during training. We use predict() with type = "response" to obtain predicted probabilities, then convert to class labels using a threshold of 0.5.
7.4.1 Simple logistic regression (ESR1 only)
res_simple_glm <- evaluate_model("Simple GLM (ESR1)", test_y,
simple_glm_classes, simple_glm_probs)
# Confusion matrix
res_simple_glm$cm |>
as.data.frame() |>
kable(caption = "Confusion matrix — Simple Logistic Regression (ESR1 only)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| actual | predicted | Freq |
|---|---|---|
| ER- | ER- | 54 |
| ER+ | ER- | 9 |
| ER- | ER+ | 8 |
| ER+ | ER+ | 228 |
sprintf("Accuracy: %.2f%% | AUC-ROC: %.4f | AUC-PR: %.4f",
res_simple_glm$acc * 100, res_simple_glm$auc, res_simple_glm$auc_pr)[1] "Accuracy: 94.31% | AUC-ROC: 0.9621 | AUC-PR: 0.9893"
The output confirms that the simple ESR1-only model performs very well on the held-out test set. A single gene — ESR1 — achieves 94.31% accuracy, AUC-ROC = 0.9621, and AUC-PR = 0.9893. The AUC-PR of 0.989 is far above the random baseline of ~0.79 (the prevalence of ER+ in the test set), confirming that the model’s positive predictions are highly precise: when the model calls a sample ER+, it is almost always correct. The AUC-ROC of 0.962 confirms excellent discrimination across all thresholds. The model misclassifies some samples in both directions — some ER+ tumours with relatively low ESR1 expression are predicted ER−, and a small number of ER− tumours with unexpectedly high ESR1 expression are predicted ER+. These represent biologically interesting edge cases, such as tumours with discordant IHC and transcriptomic ER status.
7.4.2 Multiple logistic regression (all genes)
res_glm <- evaluate_model("Logistic Regression", test_y, glm_classes, glm_probs)
# Confusion matrix
res_glm$cm |>
as.data.frame() |>
kable(caption = "Confusion matrix — Multiple Logistic Regression (all genes)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| actual | predicted | Freq |
|---|---|---|
| ER- | ER- | 32 |
| ER+ | ER- | 119 |
| ER- | ER+ | 30 |
| ER+ | ER+ | 118 |
sprintf("Accuracy: %.2f%% | AUC-ROC: %.4f | AUC-PR: %.4f",
res_glm$acc * 100, res_glm$auc, res_glm$auc_pr)[1] "Accuracy: 50.17% | AUC-ROC: 0.4951 | AUC-PR: 0.7932"
The multiple logistic regression achieves 50.17% accuracy, AUC-ROC = 0.4951, and AUC-PR = 0.7932 — essentially no better than random guessing. Note that the AUC-PR of 0.793 is approximately equal to the ER+ prevalence in the test set (~79%), which is exactly what a random classifier achieves: it assigns arbitrary probabilities, and by chance it will correctly rank ER+ above ER− about as often as the base rate predicts.
This is a striking and important result: using more data (more genes) produced a worse model than using just one gene. Why?
This is a textbook example of overfitting in high dimensions. With 4,000 predictors and only ~700 training samples, glm cannot estimate all coefficients reliably. The fitted decision boundary becomes unstable and captures noise rather than signal. When applied to the test set, where that noise is different, the predictions are essentially random.
The solution is regularisation — adding a penalty term that constrains the coefficients. This is what LASSO and ridge regression do, and what we explore in the exercise below.
7.5 ROC and Precision-Recall curves
The ROC curve shows the trade-off between sensitivity (true positive rate for ER+) and specificity (1 − false positive rate) as the decision threshold varies. We plot both models together on each curve type for a direct comparison.
# Plot simple GLM first
plot(
res_simple_glm$roc,
main = "ROC Curves — Logistic Regression",
col = "#4DBBD5", lwd = 2,
print.thres = "best", print.thres.best.method = "youden"
)
# Add full GLM
plot(res_glm$roc, col = "#E64B35", lwd = 2, add = TRUE)
legend(
"bottomright",
legend = c(
sprintf("Simple GLM / ESR1 only (AUC-ROC = %.3f)", res_simple_glm$auc),
sprintf("Full GLM / all genes (AUC-ROC = %.3f)", res_glm$auc)
),
col = c("#4DBBD5", "#E64B35"), lwd = 2, bty = "n", cex = 0.85
)The contrast between the two ROC curves is dramatic. The single-gene model (blue) traces a curve well above the diagonal, indicating genuine discriminatory power. The full model (red) lies close to the diagonal, consistent with no useful discriminatory signal on the test set. The Youden threshold shown for the simple model marks the operating point that maximises the Youden index, balancing sensitivity and specificity under equal weight.
plot(
res_simple_glm$pr,
col = "#4DBBD5", lwd = 2,
main = "Precision-Recall Curves — Logistic Regression",
auc.main = FALSE,
xlab = "Recall", ylab = "Precision"
)
plot(res_glm$pr, col = "#E64B35", lwd = 2, add = TRUE)
abline(h = mean(test_y == "ER+"), lty = 2, col = "grey50")
legend(
"bottomleft",
legend = c(
sprintf("Simple GLM / ESR1 only (AUC-PR = %.3f)", res_simple_glm$auc_pr),
sprintf("Full GLM / all genes (AUC-PR = %.3f)", res_glm$auc_pr)
),
col = c("#4DBBD5", "#E64B35"), lwd = 2, bty = "n", cex = 0.85
)The PR curves reinforce the ROC findings. The single-gene model (blue) maintains very high precision (~0.99) across most of the recall range, confirming that positive predictions are almost always correct. The unregularised full model (red) hovers near the random baseline (dashed line, AUC-PR ≈ 0.793 ≈ prevalence of ER+ in the test set), confirming its predictions are no better than chance.
7.6 Visualisation: ESR1 sigmoid curve
The sigmoid curve is the hallmark of logistic regression. It maps a continuous predictor (here ESR1 expression) to a probability between 0 and 1. We plot the estimated curve on the training data.
We expect to see: (a) ER− samples (red) concentrated at low ESR1 expression, (b) ER+ samples (blue) concentrated at high ESR1 expression, and (c) an S-shaped curve rising steeply in the range where the two groups overlap.
train_df_plot <- train_df |>
mutate(
prob = predict(simple_glm, newdata = train_df, type = "response"),
er_numeric = ifelse(er_status == "ER+", 1, 0)
)
ggplot(train_df_plot, aes(x = ESR1, y = er_numeric)) +
geom_point(aes(colour = er_status), alpha = 0.3, size = 1) +
geom_smooth(
method = "glm",
method.args = list(family = "binomial"),
se = TRUE, colour = "#E64B35", fill = "#E64B3533"
) +
scale_colour_manual(values = c("ER-" = "#E64B35", "ER+" = "#4DBBD5")) +
labs(
title = "Logistic regression: ESR1 → P(ER+)",
x = "ESR1 expression [log2(TPM+1)]",
y = "P(ER+)",
colour = "ER status"
) +
theme_bw(base_size = 11)The plot matches our expectation closely. At low ESR1 expression (left), nearly all samples are ER− (red), and the model correctly assigns near-zero probability of being ER+. At high expression (right), virtually all samples are ER+ (blue), and the model assigns near-1 probability. The transition zone (mid-range ESR1 expression, approximately 2.5–5 on the log2 scale) is where uncertainty is highest and where misclassifications occur. The shaded band shows the 95% confidence interval of the fit — note how uncertainty is largest in this transition region, as expected.
7.7 Store results
results <- rbind(results, data.frame(
Model = "Logistic Regression",
Accuracy = res_glm$acc,
AUC_ROC = res_glm$auc,
AUC_PR = res_glm$auc_pr
))7.8 Exercise
8 2. K-Nearest Neighbours (KNN)
8.1 Theory and intuition
KNN is one of the simplest non-parametric classifiers. To predict the class of a new point \(\mathbf{x}^*\), it:
- Finds the \(k\) training samples closest to \(\mathbf{x}^*\) (by Euclidean distance in feature space)
- Returns the majority class among those \(k\) neighbours
KNN makes no assumptions about the shape of the decision boundary, which makes it flexible. However, it stores the entire training set and is sensitive to the scale of features.
Strengths: non-parametric, no training phase, simple to understand.
Limitations: slow at prediction time (O(n) per query), sensitive to feature scaling, and sensitive to irrelevant/noisy features (the “curse of dimensionality”).
8.2 Key hyperparameters
k (number of neighbours)
- Small k (e.g. k = 1): very flexible, low bias, high variance → tends to overfit
- Large k (e.g. k = n): very smooth boundary, high bias, low variance → tends to underfit
- In practice, k is chosen by cross-validation. Odd values of k are preferred in binary classification to avoid ties (when exactly k/2 neighbours belong to each class, the outcome is ambiguous). In this lab we use k = 10 as a round number for illustration;
class::knnbreaks ties randomly when they occur. We explore the sensitivity to k in the exercise below.
8.3 Training and prediction
KNN does not have a separate training phase — all the computation happens at prediction time. We use the class::knn function and provide standardised data (z-scored gene expression). Standardisation is critical for distance-based methods: without it, genes with large expression ranges would dominate the distance calculation regardless of their actual discriminatory power.
We use k = 10 as a starting point. With ~700 training samples and a binary outcome, k = 10 is a reasonable default — small enough to capture local structure, large enough to smooth out individual noise. We explore the effect of other values of k in the exercise below.
We expect KNN to perform well here, because the ER+/ER− groups are well-separated in the full gene expression space (as seen in the PCA from the clustering lab). The main question is whether the high dimensionality (4,000 genes) causes the curse of dimensionality to degrade performance — we will see how well this works in practice.
set.seed(seed = 5381L)
knn_pred <- class::knn(
train = train_scaled_x, # standardised training features
test = test_scaled_x, # standardised test features
cl = train_y, # training labels
k = 10, # number of neighbours
prob = TRUE # return vote proportions
)
# The `prob` attribute stores the proportion of votes for the *winning* class.
# We need to convert this to P(ER+) for ROC analysis.
knn_vote_prop <- attributes(knn_pred)$prob
knn_probs <- ifelse(knn_pred == "ER+", knn_vote_prop, 1 - knn_vote_prop)8.4 Evaluation
We evaluate the KNN classifier on the held-out test set. We compute the confusion matrix — which shows the number of correct and incorrect predictions for each class — as well as accuracy and AUC-ROC.
res_knn <- evaluate_model("KNN (k=10)", test_y, knn_pred, knn_probs)
res_knn$cm |>
as.data.frame() |>
kable(caption = "Confusion matrix — KNN (k=10)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| actual | predicted | Freq |
|---|---|---|
| ER- | ER- | 49 |
| ER+ | ER- | 5 |
| ER- | ER+ | 13 |
| ER+ | ER+ | 232 |
sprintf("Accuracy: %.2f%% | AUC-ROC: %.4f | AUC-PR: %.4f",
res_knn$acc * 100, res_knn$auc, res_knn$auc_pr)[1] "Accuracy: 93.98% | AUC-ROC: 0.9356 | AUC-PR: 0.9713"
KNN achieves 93.98% accuracy, AUC-ROC = 0.9356, and AUC-PR = 0.9713. Compared to the unregularised multiple logistic regression (AUC-ROC = 0.495, AUC-PR = 0.793), this is a dramatic improvement. Compared to the simple ESR1 logistic regression (AUC-ROC = 0.962, AUC-PR = 0.989), KNN is slightly lower on both metrics — suggesting that a single well-chosen linear predictor can outperform a non-parametric distance-based method when one gene strongly dominates the signal.
The AUC-PR of 0.971 is well above the random baseline of ~0.79, confirming that KNN’s positive-class predictions are highly precise. The confusion matrix reveals that KNN errs slightly more on the ER− class (the minority group): with more ER+ training examples in the neighbourhood of any given test point, the majority-vote mechanism is biased toward the more frequent class.
8.5 ROC and Precision-Recall curves
The ROC curve shows how the classifier’s sensitivity and specificity trade off as we vary the probability threshold. For KNN, the “probability” is the proportion of votes for the winning class among the k neighbours, so the curve has fewer distinct points than for continuous probability models.
plot(
res_knn$roc,
main = "ROC Curve — KNN (k=10)",
col = "#4DBBD5", lwd = 2,
print.thres = "best", print.thres.best.method = "youden"
)The Youden threshold indicates the optimal operating point. The AUC-ROC of 0.936 confirms that KNN can distinguish ER+/ER− tumours with high reliability, despite operating in 4,000 dimensions and using only local neighbourhood information with no explicit feature selection.
plot(
res_knn$pr,
col = "#4DBBD5", lwd = 2,
main = "Precision-Recall Curve — KNN (k=10)",
auc.main = FALSE,
xlab = "Recall", ylab = "Precision"
)
abline(h = mean(test_y == "ER+"), lty = 2, col = "grey50")
legend("bottomleft",
legend = sprintf("KNN k=10 (AUC-PR = %.3f)", res_knn$auc_pr),
col = "#4DBBD5", lwd = 2, bty = "n")The PR curve maintains high precision across most of the recall range, confirming that KNN rarely produces false positive ER+ calls. The small drop in precision at very high recall (near 1.0) reflects the few remaining hard-to-classify samples that require a very low threshold to capture — the same samples that tend to have expression patterns more similar to ER+ tumours than most ER− tumours.
8.6 Store results
results <- rbind(results, data.frame(
Model = "KNN (k=10)",
Accuracy = res_knn$acc,
AUC_ROC = res_knn$auc,
AUC_PR = res_knn$auc_pr
))8.7 Exercise
9 3. Support Vector Machines (SVM)
9.1 Theory and intuition
SVMs find the optimal separating hyperplane — the one that maximises the margin between the closest training points of opposite classes (the support vectors). When data are not linearly separable, SVMs use the kernel trick to implicitly map data into a higher-dimensional space where a linear separator may exist.
Common kernels:
| Kernel | Description | When to use |
|---|---|---|
| Linear | Dot product | Linearly separable or p >> n |
| Radial (RBF) | Gaussian function of distance | General non-linear boundaries |
| Polynomial | Polynomial of dot product | Specific polynomial structure |
Strengths: effective in high-dimensional spaces, flexible via kernels, and able to trade margin width against misclassification through the soft-margin formulation.
Limitations: SVMs do not natively produce probability estimates — the decision function outputs a signed distance from the hyperplane, not a probability. The sign of the SVM decision value determines the predicted class under the zero threshold, and the absolute value indicates how far the sample lies from the decision boundary. To obtain probabilities for use in ROC analysis, a post-hoc calibration step is needed. Calibration is the process of mapping the raw SVM decision values to well-behaved probabilities in [0, 1]. The most common approach is Platt scaling, which fits a logistic regression on the SVM outputs on a held-out subset of the training data; another option is isotonic regression. Without calibration, the SVM can still be used for hard classification (above/below a threshold), but the resulting “probabilities” are not directly interpretable. SVMs are also slow to train on large datasets and sensitive to feature scaling.
9.2 Key hyperparameters
cost and gamma
-
cost(C): the penalty for misclassifying training points. High C → narrow margin, tries to classify all training points correctly (risk of overfitting). Low C → wide margin, allows some misclassifications (more regularised). -
gamma(RBF kernel only): controls the reach of each training point. High γ → each point influences only nearby points (complex boundary). Low γ → smoother boundary.
Both should be tuned jointly, typically via cross-validation on a grid of (C, γ) values using tune.svm.
9.3 Training
We fit an SVM with an RBF kernel using the e1071::svm function. We specify kernel = "radial" for the RBF kernel and set scale = TRUE so that the function internally standardises all features before fitting — analogous to what we did manually for KNN. We also set probability = TRUE, which instructs the function to apply Platt scaling after training (as described in the limitations above), enabling the model to output calibrated class probabilities rather than just hard class decisions. Finally, cost = 10 sets the penalty parameter C, which moderately penalises misclassifications.
Call:
svm(formula = er_status ~ ., data = train_df, kernel = "radial",
cost = 10, probability = TRUE, scale = TRUE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 10
Number of Support Vectors: 393
The summary reports the number of support vectors — the training points that lie on or inside the margin. A large number of support vectors relative to the training set size can indicate that the decision boundary is complex or that the data are noisy. In high-dimensional settings like ours, the number of support vectors is often substantial.
9.4 Evaluation
To predict on the test set, we pass probability = TRUE to predict(), which returns both the predicted class and the calibrated probability for each class. We extract the probability column for ER+ to compute the ROC curve.
res_svm <- evaluate_model("SVM (RBF)", test_y, svm_classes, svm_probs)
res_svm$cm |>
as.data.frame() |>
kable(caption = "Confusion matrix — SVM (RBF kernel)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| actual | predicted | Freq |
|---|---|---|
| ER- | ER- | 53 |
| ER+ | ER- | 9 |
| ER- | ER+ | 9 |
| ER+ | ER+ | 228 |
sprintf("Accuracy: %.2f%% | AUC-ROC: %.4f | AUC-PR: %.4f",
res_svm$acc * 100, res_svm$auc, res_svm$auc_pr)[1] "Accuracy: 93.98% | AUC-ROC: 0.9471 | AUC-PR: 0.9806"
The RBF SVM achieves 93.98% accuracy, AUC-ROC = 0.9471, and AUC-PR = 0.980. Looking at the confusion matrix, misclassifications are split roughly evenly: 9 ER− samples are incorrectly predicted as ER+, and 9 ER+ samples are incorrectly predicted as ER−. The AUC-PR of 0.980 is notably higher than KNN’s 0.971, suggesting that the SVM’s calibrated probability scores (via Platt scaling) are more precise for the positive class than the raw vote proportions from KNN, even at identical accuracy and similar overall confusion patterns.
9.5 ROC and Precision-Recall curves
The ROC curve summarises the SVM’s discriminatory performance across all possible probability thresholds. Recall that the SVM probabilities are obtained via Platt scaling — a logistic calibration fitted on top of the raw SVM decision values — which is useful for interpretability, while the ROC curve itself primarily reflects ranking/discrimination.
plot(
res_svm$roc,
main = "ROC Curve — SVM (RBF)",
col = "#00A087", lwd = 2,
print.thres = "best", print.thres.best.method = "youden"
)The curve rises steeply from the bottom-left corner, which indicates that the model achieves high sensitivity while maintaining high specificity across a broad range of thresholds. In other words, only a small change in threshold is needed to move from a very strict classifier to one that captures most ER+ tumours. The AUC of about 0.947 places the model in a very strong discrimination range.
The Youden threshold (marked on the plot) indicates the operating point that maximises the sum of sensitivity and specificity, and is often used as a simple default threshold for binary classification; moving the threshold lower would increase sensitivity at the cost of more false positives, while moving it higher would increase specificity at the cost of missing more ER+ tumours.
plot(
res_svm$pr,
col = "#00A087", lwd = 2,
main = "Precision-Recall Curve — SVM (RBF)",
auc.main = FALSE,
xlab = "Recall", ylab = "Precision"
)
abline(h = mean(test_y == "ER+"), lty = 2, col = "grey50")
legend("bottomleft",
legend = sprintf("SVM RBF (AUC-PR = %.3f)", res_svm$auc_pr),
col = "#00A087", lwd = 2, bty = "n")The SVM PR curve maintains near-perfect precision (~0.98) across most recall levels, consistent with the high AUC-PR of 0.980. The underlying score separation remains strong even when the threshold is lowered to recover more ER+ tumours, which is a useful practical property in clinical settings where confident positive calls matter.
9.6 Store results
results <- rbind(results, data.frame(
Model = "SVM (RBF)",
Accuracy = res_svm$acc,
AUC_ROC = res_svm$auc,
AUC_PR = res_svm$auc_pr
))9.7 Exercise
10 4. Decision Tree
10.1 Theory and intuition
Decision trees recursively partition the feature space into rectangular regions. At each internal node, the algorithm selects the feature and threshold that best separate the classes (maximising a purity criterion such as the Gini impurity or information gain). The resulting model is a binary tree of if-then-else rules.
Strengths: highly interpretable, handles mixed feature types, no scaling required, captures non-linear interactions.
Limitations: high variance — small changes in training data can yield very different trees. Prone to overfitting without pruning.
10.2 Key hyperparameters
cp, maxdepth, minsplit
rpart uses cost-complexity pruning, controlled by:
-
cp(complexity parameter): a split is only attempted if it decreases the overall lack of fit by at least this amount. Smallercp→ deeper, more complex tree; largercp→ simpler tree. Userpart::printcpandrpart::plotcpto select the optimalcpvia cross-validation error. -
maxdepth: maximum depth of the tree. -
minsplit: minimum number of observations in a node for a split to be attempted.
10.3 Training
We fit a decision tree using the rpart::rpart function. We specify method = "class" to indicate a classification problem (as opposed to "anova" for regression). The control argument accepts an rpart.control object that fine-tunes the tree-growing algorithm:
-
cp = 0.001(complexity parameter): a split is only made if it reduces the overall lack-of-fit by at least this fraction. Setting a very small value (0.001) allows the tree to grow deep initially; we will prune it back later. -
minsplit(default 20): the minimum number of samples in a node required before attempting a split. -
maxdepth(default 30): the maximum depth of the tree.
We intentionally let the tree grow large first, then prune it back to the optimal size — this is generally preferable to constraining depth upfront, because it gives us a data-driven way to choose the right complexity.
set.seed(seed = 5381L)
tree_fit <- rpart::rpart(
formula = er_status ~ .,
data = train_df,
method = "class",
control = rpart::rpart.control(cp = 0.001) # allow growth, prune after
)We then use rpart::plotcp to visualise how the cross-validation error (the xerror column in the internal cptable) changes as a function of the complexity parameter cp. Each point on the plot corresponds to a subtree of a different size: moving left (larger cp) gives a simpler tree with fewer splits; moving right (smaller cp) gives larger, more complex trees.
A common strategy is the 1-SE rule: choose the simplest tree whose cross-validation error is within one standard error of the minimum — this corresponds to the leftmost point below the dashed horizontal line on the plot. In the code below, for simplicity we instead select the cp value that directly minimises the cross-validation error (which.min(xerror)), which may yield a slightly larger tree but is easy to understand and still guards against overfitting.
rpart::plotcp(tree_fit)The dashed horizontal line marks the 1-SE threshold above the minimum cross-validation error. Notice how the cross-validation error first decreases as the tree grows larger (capturing genuine signal) and then plateaus or increases slightly (beginning to fit noise): this U-shaped pattern is the empirical signature of the bias-variance trade-off. The selected cp (minimum CV error) corresponds to the tree size at the bottom of this curve.
In this plot, the minimum appears at cp ≈ 0.14, corresponding to a tree with 2 terminal nodes. That means the pruned tree is very small: the data support only one split under cross-validation, which is a useful reminder that classification trees can be highly interpretable but often need to remain shallow to generalise well.
10.4 Visualisation
One of the key strengths of a decision tree is that it can be visualised directly as a diagram of if-then-else rules. We use rpart.plot::rpart.plot to render the pruned tree. Each node is colour-coded by predicted class (red for ER−, blue for ER+), and displays three pieces of information:
- Predicted class — the majority class among training samples in that node
- Class probability — the proportion of ER− samples (left) and ER+ samples (right) reaching that node
- Percentage of data — the fraction of all training samples reaching that node
rpart.plot::rpart.plot(
tree_pruned,
type = 4,
extra = 104,
box.palette = list("ER-" = "#E64B35", "ER+" = "#4DBBD5"),
main = "Decision Tree — ER status (pruned)"
)In the tree shown here, the pruned model is a single split on ESR1 at 3.9, so this is effectively a decision stump. Samples with ESR1 expression below 3.9 are sent to the left terminal node, and samples with ESR1 expression at or above 3.9 are sent to the right terminal node.
The left leaf reads ER− | 0.87 0.13 | 26%, which means:
- The predicted class is ER−
- 87% of samples in this node are ER− (the model is confident)
- 13% are actually ER+ (the model’s error rate for this leaf)
- 26% of all training samples end up in this leaf
The right leaf reads ER+ | .02 .98 | 74%, which means:
- The predicted class is ER+
- 98% of the samples in this node are ER+
- 2% are ER−
- 74% of the training samples end up in this leaf
The root node reads ER+ | .24 .76 | 100%, which reflects the class composition of the full training set before splitting.
This is a particularly nice example because a single biologically meaningful threshold on ESR1 already captures most of the ER-status structure in the data. For classification trees, rpart.plot can display the per-class probabilities at each node, and predict.rpart(..., type = "prob") returns class probabilities for new samples.
The tree structure you see depends on the data and the pruning criterion. In this fit, the pruned tree has only one split on ESR1. That is not a failure of the method: it shows that a simple, biologically motivated rule captures most of the structure. It means that, under cross-validation, the data only support one genuinely useful partition of the feature space. This is exactly what we would expect if ESR1 is the dominant discriminator, as suggested by the logistic regression section.
10.5 Evaluation
We now apply the pruned tree to the held-out test set and compute the confusion matrix, accuracy, and AUC. We use predict() with type = "class" for the class labels and type = "prob" for the class probabilities used in the ROC computation.
res_tree <- evaluate_model("Decision Tree", test_y, tree_classes, tree_probs)
res_tree$cm |>
as.data.frame() |>
kable(caption = "Confusion matrix — Decision Tree") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| actual | predicted | Freq |
|---|---|---|
| ER- | ER- | 56 |
| ER+ | ER- | 12 |
| ER- | ER+ | 6 |
| ER+ | ER+ | 225 |
sprintf("Accuracy: %.2f%% | AUC-ROC: %.4f | AUC-PR: %.4f",
res_tree$acc * 100, res_tree$auc, res_tree$auc_pr)[1] "Accuracy: 93.98% | AUC-ROC: 0.9263 | AUC-PR: 0.9690"
The pruned decision tree achieves 93.98% accuracy, AUC-ROC = 0.9263, and AUC-PR = 0.9690. These are the lowest AUC values among all methods except the unregularised logistic regression, yet still well above random. The AUC-PR of 0.969 is above the the approximate ER+ prevalence in the test set, confirming that the tree’s positive predictions are still meaningfully precise.
The main trade-off is that the tree gives up some predictive performance in exchange for complete interpretability: every prediction is a human-readable threshold rule on ESR1. The downside is that tree probabilities are coarse, because they are tied to terminal-node class proportions rather than to a smooth score function.
10.6 ROC and Precision-Recall curves
The ROC curve summarises the decision tree’s discriminatory performance across all possible probability thresholds. For a decision tree, the predicted probability is the class proportion within a terminal node, so all samples that fall in the same leaf receive exactly the same score. In this pruned tree, there is only one split on ESR1, so there are only two distinct probability values for ER+; this is why the ROC curve has a very coarse, staircase-like shape.
The Youden threshold shown on the plot is the probability cutoff that maximises the sum of sensitivity and specificity, and it provides a simple default operating point for classification.
plot(
res_tree$roc,
main = "ROC Curve — Decision Tree",
col = "#3C5488", lwd = 2,
print.thres = "best", print.thres.best.method = "youden"
)For models that output a continuous probability (e.g. logistic regression, SVM after Platt scaling), each sample has a unique probability value, so the curves can step at many distinct thresholds — producing a smooth line. For a decision tree, however, all samples reaching the same leaf receive exactly the same probability, because the probability is simply the class proportion in that terminal node. With only a few leaves, there are only a few distinct probability values, and therefore only a few threshold levels at which the curves can step — giving them a characteristic staircase appearance. Each abrupt step corresponds to one leaf’s probability crossing the threshold.
This is not a defect: it is the natural consequence of a piecewise-constant score function. Ensemble methods such as Random Forests or GBMs typically produce more finely graded score distributions because they average predictions over many trees, which usually makes the curves look smoother.
The Youden-optimal threshold is 0.556, corresponding to a sensitivity of 0.903 and specificity of 0.949. Despite the staircase shape — which limits the smoothness of the curve — the AUC-ROC of 0.926 shows that the tree still discriminates well between ER+ and ER− tumours. In other words, the tree is simple, but the signal it is using — the ESR1 split — is genuinely informative.
plot(
res_tree$pr,
col = "#3C5488", lwd = 2,
main = "Precision-Recall Curve — Decision Tree",
auc.main = FALSE,
xlab = "Recall", ylab = "Precision"
)
abline(h = mean(test_y == "ER+"), lty = 2, col = "grey50")
legend("bottomleft",
legend = sprintf("Decision Tree (AUC-PR = %.3f)", res_tree$auc_pr),
col = "#3C5488", lwd = 2, bty = "n")The PR curve is even more visibly stepwise than the ROC curve, for the same reason: the tree produces only a small number of distinct probability values.
Each abrupt change in the curve corresponds to the threshold crossing from one leaf probability to the other. Despite this coarseness, the AUC-PR of 0.969 is well above the no-skill baseline, confirming that the tree’s positive predictions are still highly precise across the operating range.
10.7 Store results
results <- rbind(results, data.frame(
Model = "Decision Tree",
Accuracy = res_tree$acc,
AUC_ROC = res_tree$auc,
AUC_PR = res_tree$auc_pr
))10.8 Exercise
11 5. Random Forests
11.1 Theory and intuition
Random Forests are an ensemble method that addresses the high variance of individual decision trees by building many trees on bootstrap samples of the data, each using a random subset of features at every split. Predictions are obtained by majority vote (classification) or averaging (regression).
The two sources of randomness — bootstrap sampling and feature subsampling — ensure that the trees are de-correlated, which is what drives the variance reduction compared to bagging alone.
Strengths: robust, low variance, can model nonlinear interactions, provides built-in internal feature ranking, and often works well in high-dimensional tabular data. Limitations: less interpretable than a single tree; training and prediction are slower; can still overfit noisy data with very deep trees.
11.2 Key hyperparameters
ntree, mtry
-
ntree: number of trees. More trees → more stable predictions but no improvement in bias. Typically 500–1000 is sufficient; monitor the OOB error curve to see when it stabilises. -
mtry: number of features randomly sampled at each split. The default for classification is \(\sqrt{p}\). Smallermtry→ more de-correlated trees but higher per-tree variance. Tunable via cross-validation.
11.3 Training
We fit a Random Forest using the randomForest::randomForest function. We supply the predictor matrix train_x and the label vector train_y directly (rather than using a formula), which is more efficient when p is large. We set ntree = 500 and importance = TRUE to compute variable importance scores after fitting.
set.seed(seed = 5381L)
rf_fit <- randomForest::randomForest(
x = train_x,
y = train_y,
ntree = 500,
importance = TRUE, # compute variable importance,
norm.votes = TRUE # result of votes as fractions
)
print(rf_fit)
Call:
randomForest(x = train_x, y = train_y, ntree = 500, importance = TRUE, norm.votes = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 63
OOB estimate of error rate: 6.31%
Confusion matrix:
ER- ER+ class.error
ER- 143 23 0.13855422
ER+ 21 510 0.03954802
The printed summary tells us several important things:
-
No. of variables tried at each split: 63 — this is
mtry, automatically set to \(\lfloor\sqrt{4000}\rfloor = 63\). At each node, only 63 randomly selected genes are considered for splitting, which decorrelates the trees and reduces variance. - OOB estimate of error rate: 6.31% — only 6.31% of training samples are misclassified by their out-of-bag predictions. This is a free, nearly unbiased estimate of generalisation error computed without ever touching the test set.
-
Confusion matrix:
- ER− samples: 143 correctly classified, 23 misclassified → class error 13.9%
- ER+ samples: 510 correctly classified, 21 misclassified → class error 4.0%
The asymmetry in per-class error rates directly reflects the class imbalance: ER− is the minority class, so there are fewer ER− examples in bootstrap samples and the trees have less opportunity to learn ER−-specific structure. As a result, the forest tends to classify ER+ slightly better than ER− under the default voting rule.
11.4 Learning curve
As we add more trees to the ensemble, the training error and the out-of-bag (OOB) error evolve differently. Plotting this learning curve helps us verify that 500 trees is sufficient and understand the error structure by class.
oob_df <- data.frame(
trees = seq_len(nrow(rf_fit$err.rate)),
OOB = rf_fit$err.rate[, "OOB"],
`ER-` = rf_fit$err.rate[, "ER-"],
`ER+` = rf_fit$err.rate[, "ER+"],
check.names = FALSE
) |>
pivot_longer(-trees, names_to = "class", values_to = "error")
ggplot(oob_df, aes(x = trees, y = error, colour = class)) +
geom_line(linewidth = 0.7) +
scale_colour_manual(
values = c("OOB" = "black", "ER-" = "#E64B35", "ER+" = "#4DBBD5")
) +
labs(
title = "Random Forest: OOB error vs number of trees",
x = "Number of trees",
y = "Error rate",
colour = "Class"
) +
theme_bw(base_size = 11)The OOB error rate stabilises after approximately 361 trees, suggesting that 500 trees is more than sufficient for this dataset. Additional trees beyond that point mainly stabilise the ensemble rather than improving performance substantially.
Notice that:
- The black line (overall OOB error) drops rapidly in the first ~50 trees and then flattens — adding more trees beyond ~100 yields negligible improvement.
- The red line (ER− class error) is persistently higher than the blue line (ER+ class error), again reflecting the class imbalance: the minority ER− class is harder to classify correctly.
- The blue line (ER+ class error) stabilises at a lower plateau, consistent with the fact that ER+ samples are more numerous and therefore better represented in each bootstrap sample.
11.5 Variable importance
Random Forests provide a natural way to rank genes by their contribution to classification, without any external statistical test. We retrieve the variable importance scores using randomForest::importance(rf_fit, type = 2), which returns the Mean Decrease in Gini impurity for each gene.
The Gini impurity of a node \(t\) is defined as: \[G(t) = 1 - \sum_{k} p_k^2\] where \(p_k\) is the proportion of class \(k\) samples in node \(t\). A perfectly pure node (all one class) has \(G = 0\); a maximally impure node has \(G = 0.5\) for binary classification. At each split, the algorithm chooses the gene and threshold that maximally reduces Gini impurity. The Mean Decrease in Gini for gene \(j\) is then: \[\text{MDG}_j = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in T_b,\, j \text{ used}} \Delta G_{b,t}(j)\] where the outer sum is over all \(B = 500\) trees, the inner sum is over all nodes in tree \(b\) where gene \(j\) was used for splitting, and \(\Delta G\) is the weighted decrease in impurity. Genes that are used frequently at early (high- impact) nodes accumulate large MDG scores.
We extract the scores, sort genes by MDG in descending order, and plot the top 20:
imp_df <- as.data.frame(randomForest::importance(rf_fit, type = 2)) # MeanDecreaseGini
imp_df$gene <- rownames(imp_df)
colnames(imp_df)[1] <- "MeanDecreaseGini"
imp_df |>
arrange(desc(MeanDecreaseGini)) |>
head(20) |>
ggplot(aes(x = reorder(gene, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_col(fill = "#4DBBD5", alpha = 0.85) +
coord_flip() +
labs(
title = "Top 20 genes by Mean Decrease in Gini impurity",
x = NULL,
y = "Mean Decrease Gini"
) +
theme_bw(base_size = 11)The ranking provides a biologically interesting picture, though it also contains some surprises. ESR1 leads with the highest MDG score (5.40) — it is the most informative single gene for ER status, consistent with the single-gene logistic regression and the violin plots from the data exploration section. The second and third positions are occupied by C5AR2 (MDG 5.29) and AGR3 (MDG 4.62), both of which are known luminal breast cancer markers, though less recognised than the canonical ER pathway genes. THSD4, DNALI1, FSIP1, and CCDC170 also rank highly — these are less well-characterised in the context of ER status, but their presence in the ranking suggests that the Random Forest is capturing coordinated expression patterns associated with the luminal subtype beyond the most obvious pathway genes. GATA3 (rank 12, MDG 2.43) and FOXA1 (rank 14, MDG 2.37) — the two canonical ER co-regulators — do appear in the top 20, but lower than one might expect given their biological importance. This illustrates a known property of tree-based importance scores: highly correlated genes (like ESR1, GATA3, and FOXA1, which are co-expressed in the luminal lineage) share credit, so the importance of each individual gene may be underestimated relative to its biological role. CA12 (rank 9) is another established luminal marker.
The fact that known ER-associated genes appear near the top is reassuring, but variable importance should still be interpreted as a model-based ranking rather than a causal or definitive biomarker ordering.
11.6 Evaluation
We predict class labels and class probabilities on the test set using predict. Note that for Random Forests we supply the test feature matrix test_x directly (not the data frame test_df), consistent with how we supplied the training data as a matrix above.
res_rf <- evaluate_model("Random Forest", test_y, rf_classes, rf_probs)
res_rf$cm |>
as.data.frame() |>
kable(caption = "Confusion matrix — Random Forest") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| actual | predicted | Freq |
|---|---|---|
| ER- | ER- | 54 |
| ER+ | ER- | 8 |
| ER- | ER+ | 8 |
| ER+ | ER+ | 229 |
sprintf("Accuracy: %.2f%% | AUC-ROC: %.4f | AUC-PR: %.4f",
res_rf$acc * 100, res_rf$auc, res_rf$auc_pr)[1] "Accuracy: 94.65% | AUC-ROC: 0.9444 | AUC-PR: 0.9795"
The Random Forest achieves 94.65% accuracy, AUC-ROC = 0.9444, and AUC-PR = 0.9795 on the test set — consistent with the OOB error estimate of 6.31%. The OOB estimate is an internal estimate computed from out-of-bag samples, so it is useful for model selection and early diagnostics, but it is still not a substitute for the held-out test set.
In this tutorial, the RF AUC-PR of 0.9795 is the third-highest among all models (after SVM at 0.980 and GBM at 0.978). Looking at the confusion matrix, misclassifications are split symmetrically: 8 ER− samples are incorrectly predicted as ER+, and 8 ER+ samples are incorrectly predicted as ER−, suggesting no strong directional bias at the default 0.5 threshold.
11.7 ROC and Precision-Recall curves
The ROC curve summarises the random forest’s discriminatory performance across all possible probability thresholds. Before interpreting the curves, note that a Random Forest produces a score by aggregating votes across many trees. In the randomForest implementation, classification votes are returned as fractions when norm.votes=TRUE, and these fractions are what we use as the score for ROC and PR analysis. Because the score comes from an ensemble of trees rather than a single split rule, the curves are usually much smoother than those of a single decision tree, while still retaining the stepwise character of finite score levels.
plot(
res_rf$roc,
main = "ROC Curve — Random Forest",
col = "#F39B7F", lwd = 2,
print.thres = "best", print.thres.best.method = "youden"
)The Youden-optimal threshold is 0.640, which indicates that the score values produced by the forest are best separated at a cutoff above 0.5 for this dataset; this is the threshold that maximises the balance between sensitivity and specificity.
plot(
res_rf$pr,
col = "#F39B7F", lwd = 2,
main = "Precision-Recall Curve — Random Forest",
auc.main = FALSE,
xlab = "Recall", ylab = "Precision"
)
abline(h = mean(test_y == "ER+"), lty = 2, col = "grey50")
legend("bottomleft",
legend = sprintf("Random Forest (AUC-PR = %.3f)", res_rf$auc_pr),
col = "#F39B7F", lwd = 2, bty = "n")The RF PR curve is noticeably smoother than the single-tree PR curve because the ensemble aggregates votes across many trees, yielding a finer-grained score distribution. It maintains high precision across most of the recall range before a modest drop near recall = 1.0. The AUC-PR of 0.979 confirms that the ensemble’s positive-class scores remain highly informative across operating thresholds.
11.8 Store results
results <- rbind(results, data.frame(
Model = "Random Forest",
Accuracy = res_rf$acc,
AUC_ROC = res_rf$auc,
AUC_PR = res_rf$auc_pr
))11.9 Exercise
12 6. Gradient Boosting Machines (GBM)
12.1 Theory and intuition
GBMs build an ensemble of trees sequentially, where each new tree is fitted to the residuals of the current ensemble. To be more precise: rather than fitting to the raw residuals (as in ordinary least squares regression), GBM fits each new tree to the negative gradient of the loss function with respect to the current predictions. For a Bernoulli (logistic) loss — which is what we use for binary classification — this negative gradient is closely related to the difference between the observed class label (0 or 1) and the current predicted probability. In other words, each tree focuses on the mistakes of the current ensemble: samples where the predicted probability is far from the true label contribute the most to the gradient, and the new tree is trained to correct those errors. This is why boosting is sometimes described as “learning from mistakes”: at each step, the algorithm identifies where the current model is wrong and adds a tree that moves the predictions in the right direction — but only by a small step controlled by the learning rate.
This is in contrast to Random Forests, where trees are built independently in parallel.
Strengths: typically very high predictive performance, naturally handles mixed feature types and missing data (in some implementations), built-in feature importance.
Limitations: more hyperparameters to tune than Random Forests, prone to overfitting if not carefully regularised, slower to train (sequential by nature).
12.2 Key hyperparameters
n.trees, shrinkage, interaction.depth
-
n.trees: total number of boosting iterations (trees). More iterations → lower bias, but risk of overfitting. Use cross-validation to find the optimal number. -
shrinkage(learning rate, η): scales the contribution of each tree. Smaller η → more trees needed, but usually better generalisation. Typical values: 0.001–0.1. -
interaction.depth: maximum depth of each tree. Depth 1 = additive model (stumps); depth 2–4 allows pairwise interactions.
The general advice is: set shrinkage small (≤ 0.01) and n.trees large, then use cross-validation to determine the optimal number of trees.
12.3 Training
We fit a GBM using the gbm::gbm function. Before training, note the encoding step: gbm with distribution = "bernoulli" requires the outcome to be numeric (0 or 1) rather than a factor. We therefore recode ER− → 0 and ER+ → 1 before passing the data to the function.
The gbm package with distribution = "bernoulli" expects a numeric outcome in {0, 1}. We recode ER− → 0 and ER+ → 1 before fitting.
We set n.trees = 500 (the upper limit of our grid), interaction.depth = 3 (each tree can model three-way interactions), and shrinkage = 0.01. Crucially, we pass cv.folds = 5 to perform 5-fold cross-validation internally during training. This means that, while building each successive tree, gbm maintains five parallel sub-models — each trained on 80% of the data and validated on the remaining 20%. This allows gbm to estimate the out-of-sample deviance at each boosting iteration without using the test set. The cross-validated deviance curve is then used to select the number of trees that gives the best generalisation performance.
set.seed(seed = 5381L)
train_y_bin <- ifelse(train_y == "ER+", 1, 0)
test_y_bin <- ifelse(test_y == "ER+", 1, 0)
gbm_fit <- gbm::gbm(
formula = er_status ~ .,
distribution = "bernoulli",
data = data.frame(train_x, er_status = train_y_bin),
n.trees = 500,
interaction.depth = 3,
shrinkage = 0.01,
cv.folds = 5, # 5-fold CV to find optimal number of trees
verbose = FALSE
)
# Find the optimal number of trees via cross-validation
best_iter <- gbm::gbm.perf(gbm_fit, method = "cv", plot.it = TRUE)cat("Optimal number of trees:", best_iter, "\n")Optimal number of trees: 314
The plot from gbm.perf shows the training deviance and the cross-validated deviance as a function of the number of trees:
- Black line (training deviance): decreases monotonically — adding more trees always reduces training error, but the model eventually overfits.
- Green line (CV deviance): decreases initially, then levels off or increases slightly once the model begins to overfit the training data. The vertical dashed blue line marks the optimal number of trees (the minimum of the green curve).
In our case, the optimal number of trees is 314, well below the maximum of 500 — confirming that stopping early prevents overfitting. Using 500 trees instead of 314 would reduce training deviance further but increase test error. This is exactly the scenario that the cv.folds argument is designed to prevent: the ensemble keeps fitting the training data better as trees are added, but the cross-validated error shows when additional complexity stops paying off.
12.4 Variable importance
In GBMs, variable importance is measured by the relative influence of each predictor: the total reduction in the loss function (here Bernoulli deviance) attributable to splits on that predictor, averaged over all trees and normalised to sum to 100. We retrieve these scores using summary(gbm_fit) and plot the top 20 genes.
gbm_imp <- summary(gbm_fit, n.trees = best_iter, plotit = FALSE)
gbm_imp |>
head(20) |>
ggplot(aes(x = reorder(var, rel.inf), y = rel.inf)) +
geom_col(fill = "#7E6148", alpha = 0.85) +
coord_flip() +
labs(
title = "GBM: Top 20 genes by relative influence",
x = NULL,
y = "Relative influence (%)"
) +
theme_bw(base_size = 11)The relative influence plot reveals an extremely concentrated importance structure: ESR1 alone accounts for more than 63.9% of the total relative influence, dwarfing all other genes. This does not mean that the remaining genes are unimportant; rather, it means that the boosting process repeatedly finds ESR1 to be the strongest splitter, so it accumulates most of the explained loss reduction. Because boosting is sequential, a dominant predictor can absorb much of the available signal, and correlated genes may receive less credit than their biological relevance would suggest.
Several luminal-associated genes, including AGR3 and CA12, also appear in the top ranks, which is reassuring. However, relative influence is a model-based ranking rather than a causal or definitive biomarker ordering, so the lower positions should be interpreted cautiously.
The consistent appearance of known ER-associated genes in the top 20 is encouraging and aligns with the signal seen in the logistic regression and Random Forest analyses. It suggests that the GBM is learning biologically meaningful structure, while also reinforcing the fact that ESR1 is by far the dominant predictor in this dataset.
12.5 Evaluation
We now predict on the test set using predict() with type = "response", which returns probabilities on the [0, 1] scale (the inverse logit of the raw GBM score). We then threshold at 0.5 to obtain predicted class labels.
res_gbm <- evaluate_model("GBM", test_y, gbm_classes, gbm_probs)
res_gbm$cm |>
as.data.frame() |>
kable(caption = "Confusion matrix — GBM") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| actual | predicted | Freq |
|---|---|---|
| ER- | ER- | 55 |
| ER+ | ER- | 9 |
| ER- | ER+ | 7 |
| ER+ | ER+ | 228 |
sprintf("Accuracy: %.2f%% | AUC-ROC: %.4f | AUC-PR: %.4f",
res_gbm$acc * 100, res_gbm$auc, res_gbm$auc_pr)[1] "Accuracy: 94.65% | AUC-ROC: 0.9418 | AUC-PR: 0.9776"
The GBM achieves 94.65% accuracy, AUC-ROC = 0.9418, and AUC-PR = 0.9776. Looking at the confusion matrix: 55 ER− correctly classified, 7 misclassified as ER+ (false positives); 228 ER+ correctly classified, 9 misclassified as ER− (false negatives). The total of 16 misclassifications is slightly fewer than KNN and SVM (18 each).
The AUC-PR of 0.9776 indicates excellent positive-class ranking and precision-recall trade-off across thresholds. This reflects the model’s strong discriminatory signal and its smooth probability estimates.
In clinical terms, false negatives (ER+ patients classified as ER−) are the more costly error type: these patients would be denied hormone therapy they could benefit from. The ROC and PR curves allow us to explore threshold choices that reduce false negatives at the cost of more false positives — a trade-off that should be guided by clinical priorities rather than fixed at 0.5 by default.
12.6 ROC and Precision-Recall curves
Before interpreting the curves, note that the GBM returns fitted probability estimates on the [0, 1] scale when type = "response" is used. These scores come from the Bernoulli boosting model at the selected number of trees and are used here as ranking scores for ROC and Precision-Recall analysis. Because the ensemble is built sequentially and aggregates many small trees, it typically produces a more finely graded score distribution than a single decision tree.
plot(
res_gbm$roc,
main = "ROC Curve — GBM",
col = "#8491B4", lwd = 2,
print.thres = "best", print.thres.best.method = "youden"
)The ROC curve rises steeply, showing that the GBM separates ER+ and ER− samples well across a wide range of thresholds. The Youden-optimal threshold is 0.553, with sensitivity 0.903 and specificity 0.962. Notably, the GBM achieves slightly higher specificity (0.962) at the same sensitivity level (0.903) compared to both the decision tree (0.949) and the Random Forest (0.954) — meaning fewer ER− patients are incorrectly assigned ER+ status at this operating point.
plot(
res_gbm$pr,
col = "#8491B4", lwd = 2,
main = "Precision-Recall Curve — GBM",
auc.main = FALSE,
xlab = "Recall", ylab = "Precision"
)
abline(h = mean(test_y == "ER+"), lty = 2, col = "grey50")
legend("bottomleft",
legend = sprintf("GBM (AUC-PR = %.3f)", res_gbm$auc_pr),
col = "#8491B4", lwd = 2, bty = "n")The PR curve stays close to the top of the plot for most of the recall range, which means that positive predictions remain highly reliable even when the threshold is lowered to recover more ER+ cases. The AUC-PR of 0.978 is very high, indicating strong positive-class ranking performance across thresholds. It is slightly below the SVM and slightly above the decision tree and KNN, which is consistent with the GBM’s strong but not perfect separation of the two classes.
12.7 Store results
results <- rbind(results, data.frame(
Model = "GBM",
Accuracy = res_gbm$acc,
AUC_ROC = res_gbm$auc,
AUC_PR = res_gbm$auc_pr
))12.8 Exercise
13 Performance Comparison
We have now trained and evaluated six classifiers on the same TCGA BRCA test set. This final section brings all the results together for a direct comparison and draws broader conclusions about model selection in high-dimensional genomic classification.
13.1 Summary table
The table below collects the test-set accuracy and AUC for each model. The row with the best AUC is highlighted.
results |>
mutate(
Accuracy = paste0(round(Accuracy * 100, 2), "%"),
AUC_ROC = round(AUC_ROC, 4),
AUC_PR = round(AUC_PR, 4)
) |>
kable(caption = "Classification performance on the TCGA BRCA test set",
col.names = c("Model", "Accuracy", "AUC-ROC", "AUC-PR")) |>
kable_styling(bootstrap_options = c("striped", "hover", "bordered"),
full_width = FALSE) |>
column_spec(3, bold = TRUE) |>
column_spec(4, bold = TRUE) |>
row_spec(which.max(results$AUC_ROC), background = "#d9f0e8", bold = TRUE)| Model | Accuracy | AUC-ROC | AUC-PR |
|---|---|---|---|
| Logistic Regression | 50.17% | 0.4951 | 0.7932 |
| KNN (k=10) | 93.98% | 0.9356 | 0.9713 |
| SVM (RBF) | 93.98% | 0.9471 | 0.9806 |
| Decision Tree | 93.98% | 0.9263 | 0.9690 |
| Random Forest | 94.65% | 0.9444 | 0.9795 |
| GBM | 94.65% | 0.9418 | 0.9776 |
The most striking feature of the table is the gap between the unregularised logistic regression and all other methods: using all 4,000 genes without regularisation yields essentially random performance (~50% accuracy, AUC-ROC ≈ 0.50, AUC-PR near the prevalence baseline), while every other approach achieves 93–95% accuracy and AUC-ROC > 0.92. This gap is not a property of logistic regression as a method, but of the failure to regularise in a high-dimensional setting.
Among the well-performing models, the AUC-ROC differences are modest (~0.02 spread). The SVM is slightly ahead on both AUC-ROC and AUC-PR in this split, with Random Forest and GBM very close behind. The ROC and PR curves show that once the model is properly regularised or ensemblized, performance is consistently strong across a range of methods.
13.2 Overlaid ROC curves
Plotting all ROC curves together on the same axes is the best way to compare classifiers holistically — it shows performance simultaneously across all decision thresholds rather than at a single operating point.
roc_list <- list(
"Logistic Regression" = res_glm$roc,
"KNN (k=10)" = res_knn$roc,
"SVM (RBF)" = res_svm$roc,
"Decision Tree" = res_tree$roc,
"Random Forest" = res_rf$roc,
"GBM" = res_gbm$roc
)
roc_colours <- c("#E64B35", "#4DBBD5", "#00A087", "#3C5488", "#F39B7F", "#8491B4")
plot(
roc_list[[1]],
col = roc_colours[1],
lwd = 2,
main = "ROC Curves — All Classifiers",
xlab = "False Positive Rate (%)",
ylab = "True Positive Rate (%)"
)
for (i in seq_along(roc_list)[-1]) {
plot(roc_list[[i]], col = roc_colours[i], lwd = 2, add = TRUE)
}
legend(
"bottomright",
legend = paste0(names(roc_list), " (AUC-ROC=",
round(sapply(roc_list, function(r) as.numeric(pROC::auc(r))), 3), ")"),
col = roc_colours,
lwd = 2,
cex = 0.75,
bty = "n"
)The unregularised logistic regression (red) sits along the diagonal — random performance — and is clearly separated from the cluster of well-performing models. Among the remaining classifiers, the curves are tightly grouped in the upper-left region of the plot, all achieving strong discrimination.
A few observations worth noting:
- KNN, SVM, Random Forest, and GBM all produce curves that are much closer to the top-left corner than the decision tree, reflecting strong class separation.
- The decision tree (dark blue) shows the most “staircase” shape — a consequence of having only a few distinct probability levels.
- Random Forest and GBM (orange and purple) have slightly smoother curves than the single tree because ensemble averaging yields finer-grained score values.
- SVM (green) is the strongest overall performer in this split, with the highest AUC-ROC and AUC-PR.
13.3 Overlaid Precision-Recall curves
The PR curves complement the ROC analysis by focusing exclusively on the positive class (ER+). They are particularly informative in imbalanced settings because the no-skill baseline equals the positive-class prevalence rather than 0.50 (recall that ~80% of samples are ER+, so the random baseline for AUC-PR is approximately 0.80).
pr_list <- list(
"Logistic Regression" = res_glm$pr,
"KNN (k=10)" = res_knn$pr,
"SVM (RBF)" = res_svm$pr,
"Decision Tree" = res_tree$pr,
"Random Forest" = res_rf$pr,
"GBM" = res_gbm$pr
)
# Plot the first PR curve, then overlay the rest
plot(
pr_list[[1]],
col = roc_colours[1],
lwd = 2,
main = "Precision-Recall Curves — All Classifiers",
auc.main = FALSE,
xlab = "Recall",
ylab = "Precision"
)
for (i in seq_along(pr_list)[-1]) {
plot(pr_list[[i]], col = roc_colours[i], lwd = 2, add = TRUE)
}
# Add the random baseline (prevalence of ER+)
er_prevalence <- mean(test_y == "ER+")
abline(h = er_prevalence, lty = 2, col = "grey50")
text(0.05, er_prevalence + 0.01, sprintf("Random baseline (%.0f%%)", er_prevalence * 100),
cex = 0.75, col = "grey40", adj = 0)
legend(
"bottomleft",
legend = paste0(names(pr_list), " (AUC-PR=",
round(results$AUC_PR, 3), ")"),
col = roc_colours,
lwd = 2,
cex = 0.75,
bty = "n"
)The dashed horizontal line marks the random classifier baseline (AUC-PR ≈ 0.79, equal to ER+ prevalence). The unregularised logistic regression (red, AUC-PR = 0.793) hugs this baseline throughout — its positive predictions are no better than chance. All other classifiers cluster tightly near the top of the plot, all achieving AUC-PR in the range 0.969–0.981.
Looking at the actual AUC-PR values:
| Model | AUC-PR |
|---|---|
| SVM (RBF) | 0.981 |
| Random Forest | 0.979 |
| GBM | 0.978 |
| KNN (k=10) | 0.971 |
| Decision Tree | 0.969 |
| Logistic Regression (all genes) | 0.793 |
The SVM achieves the highest AUC-PR (0.981), followed closely by Random Forest (0.979) and GBM (0.978). The decision tree is slightly lower, while KNN sits in between. The most important takeaway is that the well-performing models maintain very high precision (~0.97–0.99) across most of the recall range before a characteristic drop at Recall → 1.0, where the hardest-to-classify samples are forced into predictions.
13.4 Accuracy bar chart
results |>
arrange(desc(Accuracy)) |>
mutate(Model = factor(Model, levels = Model)) |>
ggplot(aes(x = Model, y = Accuracy * 100, fill = Model)) +
geom_col(alpha = 0.85, show.legend = FALSE) +
geom_text(aes(label = paste0(round(Accuracy * 100, 1), "%")),
vjust = -0.4, size = 3.5) +
scale_fill_manual(values = setNames(roc_colours, results$Model)) +
scale_y_continuous(limits = c(0, 105)) +
labs(
title = "Test set accuracy by classifier",
x = NULL,
y = "Accuracy (%)"
) +
theme_bw(base_size = 11) +
theme(axis.text.x = element_text(angle = 20, hjust = 1))The bar chart reinforces the picture from the ROC and PR comparisons. With the exception of the unregularised logistic regression, all models cluster tightly between 93% and 95% accuracy. The differences are small enough that other considerations — interpretability, speed, ease of tuning, robustness to hyperparameter choice — should inform the final model selection alongside raw performance metrics.
- Regularisation is non-negotiable in high dimensions. Unregularised logistic regression fails catastrophically with 4,000 features and ~700 samples. LASSO, ensemble methods, and distance-based methods all handle this correctly.
- Performance differences between well-regularised models are small. Once you use an appropriate method, the choice between KNN, SVM, Random Forest, and GBM matters less than the choice between regularised and unregularised approaches.
- Single-train/test split results should be interpreted cautiously. The small performance differences between models (< 1% accuracy) are likely within the variance of the split. Cross-validation gives more reliable rankings — as we demonstrate in the next section.
- Use multiple metrics in imbalanced settings. Accuracy alone is misleading with ~80% ER+ samples. AUC-ROC provides a threshold-independent view of discrimination, and AUC-PR is the strictest test: a model that maintains high precision across all recall levels is genuinely useful for clinical positive-class prediction, not just good at ranking.
- Clinical context matters. In an oncology setting, minimising false negatives (ER+ patients missed and denied hormone therapy) may justify operating at a lower threshold than the Youden optimum, accepting lower specificity in exchange for higher sensitivity. The ROC and PR curves together provide the tools to reason about this trade-off explicitly.
14 Bonus: A Brief Cross-Validation Example
A single 70/30 split gives a point estimate of performance, but the variance is high — a different random split can shift accuracy by several percentage points.
K-fold cross-validation (CV) partitions the entire dataset into k folds of equal size. In each of the k iterations, one fold is held out as the test set and the model is trained on the remaining k − 1 folds. The k performance estimates are then averaged. Because every sample appears in the test set exactly once, k-fold CV uses the data much more efficiently than a single split.
Here we demonstrate 5-fold CV using the Random Forest classifier. We manually create the fold assignments to keep the code transparent, though in practice packages such as caret or tidymodels automate this process.
set.seed(5381L)
k_folds <- 5
fold_ids <- sample(rep(seq_len(k_folds), length.out = nrow(x)))
cv_results <- lapply(seq_len(k_folds), function(fold) {
idx_train <- which(fold_ids != fold)
idx_test <- which(fold_ids == fold)
cv_train_x <- x[idx_train, , drop = FALSE]
cv_test_x <- x[idx_test, , drop = FALSE]
cv_train_y <- y[idx_train]
cv_test_y <- y[idx_test]
# Random Forest on this fold
rf_cv <- randomForest::randomForest(x = cv_train_x, y = cv_train_y,
ntree = 200)
rf_cv_prob <- predict(rf_cv, newdata = cv_test_x, type = "prob")[, "ER+"]
rf_cv_cls <- predict(rf_cv, newdata = cv_test_x, type = "class")
acc <- mean(rf_cv_cls == cv_test_y)
roc_obj <- pROC::roc(cv_test_y, rf_cv_prob,
levels = c("ER-", "ER+"),
direction = "<", quiet = TRUE)
auc_roc <- as.numeric(pROC::auc(roc_obj))
pr_obj <- PRROC::pr.curve(
scores.class0 = rf_cv_prob[cv_test_y == "ER+"],
scores.class1 = rf_cv_prob[cv_test_y == "ER-"],
curve = FALSE
)
auc_pr <- pr_obj$auc.integral
data.frame(fold = fold, Accuracy = acc, AUC_ROC = auc_roc, AUC_PR = auc_pr, check.names = FALSE)
})
cv_df <- do.call(rbind, cv_results)
cv_df |>
kable(
caption = paste0(
k_folds, "-fold CV results — Random Forest | ",
"mean Accuracy: ", round(mean(cv_df$Accuracy) * 100, 2), "% | ",
"mean AUC-ROC: ", round(mean(cv_df$AUC_ROC), 4), " | ",
"mean AUC-PR: ", round(mean(cv_df$AUC_PR), 4)
),
col.names = c("Fold", "Accuracy", "AUC-ROC", "AUC-PR"),
digits = 4
) |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Fold | Accuracy | AUC-ROC | AUC-PR |
|---|---|---|---|
| 1 | 0.9400 | 0.9642 | 0.9803 |
| 2 | 0.9447 | 0.9348 | 0.9585 |
| 3 | 0.9095 | 0.9448 | 0.9841 |
| 4 | 0.9246 | 0.9623 | 0.9904 |
| 5 | 0.9548 | 0.9634 | 0.9860 |
The per-fold results show some variability, which is normal in cross-validation. Fold 3 yields the lowest accuracy, while Fold 5 yields the highest. This spread reflects sampling variability and fold composition: some folds contain a slightly harder mix of samples than others. Even so, all five folds perform well, with accuracy above 90% and AUC-ROC above 0.93, indicating that the Random Forest is stable across different partitions of the TCGA BRCA data.
The summary statistics provide a more stable estimate than a single 70/30 split, but they should still be interpreted as an internal validation result rather than a definitive external benchmark.
Accuracy: mean=93.47% sd=1.78%
AUC-ROC: mean=0.9539 sd=0.0134
AUC-PR: mean=0.9798 sd=0.0125
The cross-validation summary is more stable than the single train/test split:
Mean accuracy: 93.47% (SD = 1.78%). The single-split accuracy of 94.65% falls comfortably within this range, suggesting that the earlier result was not unusually lucky or unlucky. However, the spread across folds shows that performance still depends on the particular partition.
Mean AUC-ROC: 0.9539 (SD = 0.0134). Slightly higher than the single-split value of 0.9444, with a low standard deviation indicating that the Random Forest’s ranking performance is fairly consistent across folds.
Mean AUC-PR: 0.9798 (SD = 0.0125). This is also very consistent across folds, but note that the PR baseline depends on class prevalence, so AUC-PR should ideally be computed on pooled out-of-fold predictions or on stratified folds.
Taken together, these results support the broader conclusion of the tutorial: the Random Forest is a robust classifier for ER status prediction in TCGA BRCA, and the single-split results are broadly in line with its cross-validated performance.
15 Session Info
The section below records the exact versions of R and all loaded packages used to generate this document. Including it at the end of every analysis script is good practice for reproducibility.
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] PRROC_1.4 rlang_1.2.0 pROC_1.19.0.1
[4] gbm_2.2.3 randomForest_4.7-1.2 rpart.plot_3.1.4
[7] rpart_4.1.27 e1071_1.7-17 class_7.3-23
[10] glmnet_5.0 Matrix_1.7-4 kableExtra_1.4.0
[13] ggplot2_4.0.3 tidyr_1.3.2 dplyr_1.2.1
[16] here_1.0.2
loaded via a namespace (and not attached):
[1] utf8_1.2.6 generics_0.1.4 xml2_1.5.2 shape_1.4.6.1
[5] stringi_1.8.7 lattice_0.22-7 digest_0.6.39 magrittr_2.0.5
[9] evaluate_1.0.5 grid_4.5.2 RColorBrewer_1.1-3 iterators_1.0.14
[13] fastmap_1.2.0 foreach_1.5.2 rprojroot_2.1.1 jsonlite_2.0.0
[17] survival_3.8-3 mgcv_1.9-3 purrr_1.2.2 viridisLite_0.4.3
[21] scales_1.4.0 codetools_0.2-20 textshaping_1.0.5 cli_3.6.6
[25] splines_4.5.2 withr_3.0.2 yaml_2.3.12 otel_0.2.0
[29] parallel_4.5.2 tools_4.5.2 vctrs_0.7.3 R6_2.6.1
[33] proxy_0.4-29 lifecycle_1.0.5 stringr_1.6.0 htmlwidgets_1.6.4
[37] pkgconfig_2.0.3 pillar_1.11.1 gtable_0.3.6 glue_1.8.1
[41] Rcpp_1.1.1-1.1 systemfonts_1.3.2 xfun_0.57 tibble_3.3.1
[45] tidyselect_1.2.1 rstudioapi_0.18.0 knitr_1.51 farver_2.1.2
[49] nlme_3.1-168 htmltools_0.5.9 labeling_0.4.3 rmarkdown_2.31
[53] svglite_2.2.2 compiler_4.5.2 S7_0.2.2