This document is written in Quarto, the next-generation version of R Markdown. Quarto documents mix prose, code, and output in a single file, and can be rendered to HTML, PDF, or Word. You are reading the HTML output.
Each grey box below is an R code chunk. When you render the document, R executes each chunk in order and embeds the output directly beneath it. You can also run chunks interactively in RStudio by clicking the green arrow on the top-right of each chunk.
# This is a code chunk. Run it to see the output below.
summary(cars)
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00
Tip: How to use this document
Sections inside collapsible boxes marked "Going deeper" are additional material for you to explore at your own pace after the session.
2 Data types
Every value in R has a type. Understanding types is essential because many common errors in R arise from a mismatch between the type you expect and the type you actually have (for instance, a number stored as text that refuses to be plotted on a numeric axis).
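As a minimal sketch of such a mismatch, the example below stores a number as text and shows that arithmetic fails until the value is converted. The variable names weight_txt and weight_num are illustrative only.

```r
# A number stored as text (illustrative value)
weight_txt <- "72.5"
class(weight_txt)                     # "character"

# Arithmetic on text raises an error; try() lets the script continue
res <- try(weight_txt + 1, silent = TRUE)
inherits(res, "try-error")            # TRUE

# Convert explicitly, then the value behaves as a number
weight_num <- as.numeric(weight_txt)
weight_num + 1                        # 73.5
```

Checking the type first, and converting explicitly, is usually quicker than debugging the error further downstream.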
The most common types you will encounter in biomedical data analysis are logical, integer, double, and character. We cover these first, then provide an overview of the less common types for reference.
You can always inspect any object with three complementary functions:
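A minimal sketch, assuming the three functions are class(), typeof(), and str(); the vector x is illustrative:

```r
x <- c(1.5, 2, 3.25)

class(x)    # "numeric": how R treats the object at a high level
typeof(x)   # "double": how the values are stored internally
str(x)      # compact one-line summary of type, length, and first values
```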
Integers are whole numbers, stored more efficiently than decimals. In R they are written with a trailing L. You will encounter them most often as counts or indices.
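A short sketch of the trailing L in practice (n_patients is an illustrative name):

```r
n_patients <- 12L        # the trailing L makes this an integer
typeof(n_patients)       # "integer"

typeof(12)               # "double": without the L, numbers default to double
is.integer(1:3)          # TRUE: the colon operator produces integers
```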
Doubles (also called numeric in R) are real numbers with decimal precision. Most continuous measurements — age, weight, expression values, survival time — are doubles.
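A brief sketch of how doubles behave (age is an illustrative name). Because doubles have finite precision, exact equality tests on decimals can fail, so a tolerance-based comparison is usually safer:

```r
age <- 34.5
typeof(age)                          # "double"
is.numeric(age)                      # TRUE

0.1 + 0.2 == 0.3                     # FALSE: binary rounding error
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE: comparison within a tolerance
```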
Character values (strings) hold text. They are used for labels, patient IDs, gene names, and any categorical variable before it is converted to a factor.
# Create a character vector
c.v <- c("a", "b", "c")
c.v
These two types rarely appear in biomedical data analysis, but are included here for completeness.
Complex numbers have a real and an imaginary part (written a + bi). They arise in signal processing and certain mathematical transformations (e.g. Fourier analysis).
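For reference, a minimal sketch of a complex value and the base R functions that take it apart (z is an illustrative name):

```r
z <- 3 + 4i
typeof(z)   # "complex"
Re(z)       # 3: real part
Im(z)       # 4: imaginary part
Mod(z)      # 5: modulus, sqrt(3^2 + 4^2)
```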
A data structure is a way of organising multiple values. The four structures you will use most in data analysis are vectors, matrices, data frames, and lists. The table below gives a quick orientation:
| Structure | Dimensions | Homogeneous? | Typical use |
|---|---|---|---|
| Vector | 1D | Yes | Single variable |
| Matrix | 2D | Yes | Numeric data (expression matrix) |
| List | 1D | No | Mixed-type collections |
| Data frame | 2D | No (per column) | Clinical / omics datasets |
3.1 Vectors
A vector is a one-dimensional, ordered collection of values all of the same type. It is the fundamental building block in R — a single number is actually a vector of length 1.
# Regular sequence with seq()
v <- seq(from = 1, to = 10, by = 2)
v
[1] 1 3 5 7 9
# Repeated pattern with rep()
v <- rep(x = 1:3, times = 2)
v
[1] 1 2 3 1 2 3
# Random sample (set.seed() makes the result reproducible)
set.seed(42)
v <- sample(x = 1:10, size = 5, replace = FALSE)
v
[1] 1 5 10 8 2
Tip: Reproducibility tip
Use set.seed() before any function that involves randomness. This ensures that anyone running your code will get the same result — essential for reproducible research.
3.2 Matrices
A matrix is a two-dimensional, homogeneous structure: essentially a vector with rows and columns. Matrices are common in bioinformatics, where gene expression data is often stored as a matrix with genes on rows and samples on columns.
3.2.1 Create
m <- matrix(
  data = 1:9,
  nrow = 3,
  ncol = 3,
  byrow = FALSE  # fill column by column (default)
)
m
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
A matrix can also be created by assigning dimensions to a vector with dim():
v2 <- 1:9
dim(v2) <- c(3, 3)
identical(m, v2)  # TRUE: the two objects are the same
[1] TRUE
3.2.2 Inspect
Elements are indexed by [row, col]. Leaving one dimension blank selects everything in that dimension.
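A minimal sketch of the [row, col] rule, using a separate illustrative matrix m_demo so that the running example m is not modified:

```r
m_demo <- matrix(1:9, nrow = 3, ncol = 3)  # filled column by column

m_demo[2, 3]               # single element in row 2, column 3: 8
m_demo[1, ]                # entire first row: 1 4 7
m_demo[, 2]                # entire second column: 4 5 6
dim(m_demo)                # 3 3

# By default a single row or column is simplified to a plain vector;
# drop = FALSE keeps the result as a matrix (here 3 rows, 1 column)
m_demo[, 2, drop = FALSE]
```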
m <- rbind(m, c(10, 11, 12))   # add a row
m <- m[-4, ]                   # remove 4th row
m <- cbind(m, c(13, 14, 15))   # add a column
m <- m[, -4]                   # remove 4th column
m[1, 1] <- 10                  # modify one element
m
d e f
a 10 4 7
b 2 5 8
c 3 6 9
3.3 Lists
A list is a one-dimensional, heterogeneous structure: each element can be of any type, including another list. Lists are how R represents complex objects such as the output of a statistical model.
Most functions expect the element itself, not a list containing it — so [[ and $ are more commonly useful when working with list contents.
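The difference between the three accessors can be sketched as follows. The list patient and its fields are illustrative, not taken from a real dataset.

```r
# A list mixing an integer, a string, and a numeric vector
patient <- list(id = 1L, name = "A01", scores = c(2.5, 3.0, 4.5))

class(patient["scores"])      # "list": [ returns a smaller list
class(patient[["scores"]])    # "numeric": [[ returns the element itself
class(patient$scores)         # "numeric": $ is shorthand for [[ by name

# mean() needs the numeric vector, so use [[ or $
mean(patient[["scores"]])     # 3.333333
```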
3.4 Data frames
A data frame is the central data structure for tabular data in R. It is a two-dimensional, heterogeneous structure: a collection of vectors of equal length, each potentially of a different type. Each row represents an observation (e.g. a patient) and each column a variable (e.g. age, sex, diagnosis).
3.4.1 Create
df <- data.frame(
  patientid = 1:3,
  sex = c("M", "F", "M"),
  age = c(34, 41, 57)
)
df
   patientid       sex                 age
 Min.   :1.0   Length:3           Min.   :34.0
 1st Qu.:1.5   Class :character   1st Qu.:37.5
 Median :2.0   Mode  :character   Median :41.0
 Mean   :2.0                      Mean   :44.0
 3rd Qu.:2.5                      3rd Qu.:49.0
 Max.   :3.0                      Max.   :57.0
patientid sex age height
1 1 M 34 1.75
2 2 F 41 1.68
3 3 M 57 1.80
# Remove a column
df$height <- NULL
df
patientid sex age
1 1 M 34
2 2 F 41
3 3 M 57
# Add a row: use data.frame() to preserve types
df <- rbind(df, data.frame(patientid = 4, sex = "F", age = 51))
df
patientid sex age
1 1 M 34
2 2 F 41
3 3 M 57
4 4 F 51
Warning: Type coercion trap
Avoid using c() to create new rows for a data frame. Because c() coerces everything to the most general type in the vector, numeric columns will be silently converted to character:
# This coerces patientid and age to character, which is probably not what you want
df_bad <- rbind(df, c(5, "M", 28))
str(df_bad)
Factors represent categorical variables with a fixed set of possible values (called levels). They are important for statistical modelling — if you include sex or tumour stage in a regression, R needs to know they are categories, not arbitrary numbers or strings.
Calling as.numeric() on a factor returns the underlying integer codes (1, 2, 3, ...), not the level labels. This is a common source of errors. To get numeric values from a factor whose labels are numbers, use as.numeric(as.character(f)).
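A minimal sketch of both points, using illustrative vectors stage and dose: levels can be given in an explicit order, and converting a factor with numeric-looking labels needs the extra as.character() step.

```r
# Tumour stage as a factor with an explicit level order
stage <- factor(c("II", "I", "III", "I"), levels = c("I", "II", "III"))
levels(stage)                     # "I" "II" "III"
as.integer(stage)                 # 2 1 3 1: position of each value in levels

# A factor whose labels look like numbers
dose <- factor(c("10", "50", "10", "100"))
levels(dose)                      # "10" "100" "50": sorted as text
as.numeric(dose)                  # 1 3 1 2: integer codes, not the doses
as.numeric(as.character(dose))    # 10 50 10 100: the actual values
```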
A function is a reusable block of code that takes inputs (arguments) and returns an output. R comes with thousands of built-in functions, and writing your own is essential for keeping your analysis organised and avoiding repetition.
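As a minimal sketch, the function below computes body mass index. The name bmi and its arguments weight_kg and height_m are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: body mass index from weight (kg) and height (m)
bmi <- function(weight_kg, height_m) {
  # The value of the last expression is returned
  weight_kg / height_m^2
}

# Call with named arguments for readability
bmi(weight_kg = 70, height_m = 1.75)      # 22.85714

# Arithmetic in R is vectorised, so the same function handles several patients
bmi(c(70, 60), c(1.75, 1.68))
```

Wrapping the calculation in a function means a later correction (for example, to units) is made in one place rather than in every copy of the formula.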
Both packages need to be installed once with install.packages() before use.
6 Session Info
The section below records the exact versions of R and all loaded packages used to generate this document. Including it at the end of every analysis script is good practice for reproducibility.
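A minimal sketch of how such a record can be produced with base R; the exact output depends on your installation, so none is shown here.

```r
# Capture the R version, platform, and attached packages
si <- sessionInfo()
si

# The R version alone, as a single string
R.version.string
```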