Introduction to R

Author

Alessandro Barberis

Published

May 16, 2026

1 R Markdown and Quarto

This document is written in Quarto, the next-generation version of R Markdown. Quarto documents mix prose, code, and output in a single file, and can be rendered to HTML, PDF, or Word. You are reading the HTML output.

Each grey box below is an R code chunk. When you render the document, R executes each chunk in order and embeds the output directly beneath it. You can also run chunks interactively in RStudio by clicking the green arrow on the top-right of each chunk.

# This is a code chunk. Run it to see the output below.
summary(cars)
     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  
Tip: How to use this document

Sections inside collapsible boxes marked Going deeper are additional material for you to explore at your own pace after the session.


2 Data types

Every value in R has a type. Understanding types is essential because many common errors in R arise from a mismatch between the type you expect and the type you actually have (for instance, a number stored as text that refuses to be plotted on a numeric axis).

The most common types you will encounter in biomedical data analysis are logical, integer, double, and character. We cover these first, then provide an overview of the less common types for reference.

You can always inspect any object with three complementary functions:

Function   What it tells you
typeof()   Internal storage type
mode()     Storage mode (similar to typeof, slightly higher level)
class()    How R treats the object in function dispatch
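
The three functions can disagree on the same object. As a small sketch, a date is stored as a number but treated as a date; the variable name `visit` is made up for illustration.

```r
# A Date is stored as a count of days, but R treats it as a date
visit <- as.Date("2024-01-15")

typeof(visit)   # internal storage: "double"
mode(visit)     # higher-level grouping: "numeric"
class(visit)    # what functions dispatch on: "Date"
```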

2.1 Logical

Logical values are TRUE or FALSE. They arise naturally from comparisons and conditions, and are the backbone of filtering and subsetting data.

# Create a logical vector
l <- c(TRUE, FALSE, TRUE)
l
[1]  TRUE FALSE  TRUE
typeof(l)
[1] "logical"
mode(l)
[1] "logical"
class(l)
[1] "logical"
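
To illustrate how logical values arise from comparisons and drive filtering, here is a minimal sketch. The `age` values are invented for the example.

```r
# Hypothetical ages for three patients
age <- c(34, 41, 57)

over_40 <- age > 40   # a comparison returns a logical vector
over_40               # FALSE TRUE TRUE

age[over_40]          # keep elements where the condition is TRUE: 41 57
sum(over_40)          # TRUE counts as 1, so this counts the matches: 2
```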

2.2 Integer

Integers are whole numbers, stored more efficiently than decimals. In R they are written with a trailing L. You will encounter them most often as counts or indices.

# Create an integer vector
i <- c(1L, 2L, 3L)
i
[1] 1 2 3
typeof(i)
[1] "integer"
mode(i)
[1] "numeric"
class(i)
[1] "integer"
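
The trailing L matters because a number typed without it is stored as a double. A short sketch of where integers do and do not appear:

```r
typeof(1)      # "double": a plain number is a double by default
typeof(1L)     # "integer": the L suffix requests an integer
typeof(1:3)    # "integer": the colon operator produces integers

# Counts returned by base functions are typically integers
is.integer(length(c("a", "b")))   # TRUE
```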

2.3 Double

Doubles are double-precision floating-point numbers, the type R uses by default for any number typed without a trailing L; class() reports them as "numeric". Most continuous measurements (age, weight, expression values, survival time) are doubles.

# Create a double vector
d <- c(1.0, 2.5, 3.7)
d
[1] 1.0 2.5 3.7
typeof(d)
[1] "double"
mode(d)
[1] "numeric"
class(d)
[1] "numeric"
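
Because doubles are stored in binary floating point, some decimal values are only approximated. The sketch below shows the classic consequence and one common way to compare doubles with a tolerance, using base R's all.equal().

```r
# Exact comparison can fail because 0.1 and 0.2 are not stored exactly
0.1 + 0.2 == 0.3                     # FALSE

# all.equal() compares within a small numerical tolerance
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE
```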

2.4 Character

Character values (strings) hold text. They are used for labels, patient IDs, gene names, and any categorical variable before it is converted to a factor.

# Create a character vector
c.v <- c("a", "b", "c")
c.v
[1] "a" "b" "c"
typeof(c.v)
[1] "character"
mode(c.v)
[1] "character"
class(c.v)
[1] "character"
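
The type mismatch mentioned at the start of this section, a number stored as text, can be sketched as follows. The `weights_txt` values are invented for illustration.

```r
# Numbers that arrived as text, e.g. from a badly formatted file
weights_txt <- c("70.5", "82.1", "64.0")
typeof(weights_txt)                # "character"

# Convert explicitly before doing arithmetic
weights <- as.numeric(weights_txt)
typeof(weights)                    # "double"
mean(weights)                      # 72.2

# Text that is not a number becomes NA (R also emits a warning,
# suppressed here to keep the example quiet)
suppressWarnings(as.numeric(c("70.5", "n/a")))   # 70.5 NA
```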

Less common types

These two types rarely appear in biomedical data analysis, but are included here for completeness.

Complex numbers have a real and an imaginary part (written a + bi). They arise in signal processing and certain mathematical transformations (e.g. Fourier analysis).

cpx <- c(1+2i, 3+4i, 5+6i)
cpx
[1] 1+2i 3+4i 5+6i
typeof(cpx)
[1] "complex"

Raw values store binary data as bytes. You might encounter them when reading binary file formats or network data.

r <- as.raw(c(1, 2, 3))
r
[1] 01 02 03
typeof(r)
[1] "raw"
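
As a brief sketch of what these two types look like in use, the example below pulls a complex number apart and converts a single character to its byte and back. The names `z` and `bytes` are made up for the example.

```r
# Complex: extract the real part, imaginary part, and modulus
z <- 3+4i
Re(z)     # 3
Im(z)     # 4
Mod(z)    # 5

# Raw: the byte behind the character "R" (printed in hexadecimal)
bytes <- charToRaw("R")
bytes               # 52
as.integer(bytes)   # 82
rawToChar(bytes)    # "R"
```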

3 Data structures

A data structure is a way of organising multiple values. The four structures you will use most in data analysis are vectors, matrices, data frames, and lists. The table below gives a quick orientation:

Structure    Dimensions   Homogeneous?      Typical use
Vector       1D           Yes               Single variable
Matrix       2D           Yes               Numeric data (expression matrix)
List         1D           No                Mixed-type collections
Data frame   2D           No (per column)   Clinical / omics datasets
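
The dimensions in the table can be checked directly. This is a minimal sketch with made-up objects (`vec`, `mat`, `lst`, `dat`), not part of the examples that follow.

```r
vec <- c(1.5, 2.5, 3.5)
mat <- matrix(1:6, nrow = 2)
lst <- list(id = 1L, label = "a")
dat <- data.frame(id = 1:2, label = c("a", "b"))

dim(vec)       # NULL: a plain vector has no dim attribute
dim(mat)       # 2 3
dim(dat)       # 2 2

is.list(lst)   # TRUE
is.list(dat)   # TRUE: a data frame is a list of equal-length columns
```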

3.1 Vectors

A vector is a one-dimensional, ordered collection of values all of the same type. It is the fundamental building block in R — a single number is actually a vector of length 1.
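
Both claims can be seen in a short sketch: a single number has length 1, and mixing types in c() forces everything to one common type.

```r
length(5)                  # 1: a single number is a vector of length 1

# Mixing types coerces every element to the most general type
mixed <- c(1, "a", TRUE)
mixed                      # "1" "a" "TRUE"
typeof(mixed)              # "character"

typeof(c(1L, 2.5))         # "double": integer + double gives double
typeof(c(TRUE, 2L))        # "integer": logical + integer gives integer
```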

3.1.1 Create

# Combine values with c()
v <- c(1, 2, 3, 4, 5)
v
[1] 1 2 3 4 5
# Regular sequence with seq()
v <- seq(from = 1, to = 10, by = 2)
v
[1] 1 3 5 7 9
# Repeated pattern with rep()
v <- rep(x = 1:3, times = 2)
v
[1] 1 2 3 1 2 3
# Random sample (set.seed() makes the result reproducible)
set.seed(42)
v <- sample(x = 1:10, size = 5, replace = FALSE)
v
[1]  1  5 10  8  2
Tip: Reproducibility tip

Use set.seed() before any function that involves randomness. Anyone who runs your code with the same seed (and the same R version and random number generator settings) will then get the same result, which is essential for reproducible research.
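
A quick way to convince yourself is to draw the same sample twice after resetting the seed. The seed value 123 is arbitrary.

```r
set.seed(123)
first <- sample(x = 1:10, size = 5)

set.seed(123)                        # reset to the same seed
second <- sample(x = 1:10, size = 5)

identical(first, second)             # TRUE: same seed, same draw
```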

3.1.2 Inspect

typeof(v)   # internal storage type
[1] "integer"
class(v)
[1] "integer"
length(v)   # number of elements
[1] 5

Elements are accessed with square bracket indexing []. R uses 1-based indexing (the first element is [1], not [0] as in Python).

v[1]        # first element
[1] 1
v[1:3]      # elements 1 to 3
[1]  1  5 10

Vectors can also have names, allowing access by label:

names(v) <- c("a", "b", "c", "d", "e")
v["a"]
a 
1 
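
Square brackets accept more than a single position. The sketch below uses a separate vector `x` (a made-up name, so `v` is left untouched) to show the other common forms of indexing.

```r
x <- c(10, 20, 30, 40)

x[c(1, 3)]      # several positions at once: 10 30
x[-1]           # negative index drops an element: 20 30 40
x[x > 25]       # logical condition keeps matching elements: 30 40
x[length(x)]    # last element: 40
```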

3.1.3 Modify

v <- c(v, 6)         # append an element
v
 a  b  c  d  e    
 1  5 10  8  2  6 
v <- v[-1]           # remove first element
v
 b  c  d  e    
 5 10  8  2  6 
v[1] <- 10           # overwrite first element
v
 b  c  d  e    
10 10  8  2  6 

3.2 Matrices

A matrix is a two-dimensional, homogeneous structure — essentially a vector with rows and columns. Matrices are common in bioinformatics: gene expression data is often stored as a matrix with genes on rows and samples on columns.

3.2.1 Create

m <- matrix(
  data  = 1:9,
  nrow  = 3,
  ncol  = 3,
  byrow = FALSE    # fill column by column (default)
)
m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

A matrix can also be created by assigning dimensions to a vector with dim():

v2 <- 1:9
dim(v2) <- c(3, 3)
identical(m, v2)   # TRUE: the two objects are the same
[1] TRUE

3.2.2 Inspect

Elements are indexed by [row, col]. Leaving one dimension blank selects everything in that dimension.

m[2, 1]          # element at row 2, column 1
[1] 2
m[1:2, ]         # first two rows, all columns
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
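
One detail worth knowing: selecting a single row or column returns a plain vector unless you ask R to keep the matrix shape. A minimal sketch with a separate matrix `counts` (a made-up name):

```r
counts <- matrix(1:6, nrow = 2)

row1 <- counts[1, ]
row1                  # 1 3 5
is.matrix(row1)       # FALSE: the result was simplified to a vector

# drop = FALSE keeps the result as a 1 x 3 matrix
row1_m <- counts[1, , drop = FALSE]
dim(row1_m)           # 1 3
```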

Named indexing also works for matrices:

rownames(m) <- c("a", "b", "c")
colnames(m) <- c("d", "e", "f")
m["a", "d"]
[1] 1
dim(m)     # returns c(nrows, ncols)
[1] 3 3

3.2.3 Modify

m <- rbind(m, c(10, 11, 12))   # add a row
m <- m[-4, ]                   # remove 4th row
m <- cbind(m, c(13, 14, 15))   # add a column
m <- m[, -4]                   # remove 4th column
m[1, 1] <- 10                  # modify one element
m
   d e f
a 10 4 7
b  2 5 8
c  3 6 9
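
For the genes-by-samples layout described at the start of this section, row and column summaries are a typical first step. The sketch below uses an invented 2 x 3 matrix `expr`; the gene and sample names are hypothetical.

```r
# Toy expression matrix: genes on rows, samples on columns
expr <- matrix(
  data     = c(5, 3, 8, 6, 2, 9),
  nrow     = 2,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)

rowMeans(expr)   # mean per gene:   geneA 5, geneB 6
colMeans(expr)   # mean per sample: s1 4, s2 7, s3 5.5
dim(t(expr))     # transposing swaps rows and columns: 3 2
```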

3.3 Lists

A list is a one-dimensional, heterogeneous structure: each element can be of any type, including another list. Lists are how R represents complex objects such as the output of a statistical model.
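
As a sketch of a model output being a list, the example below fits a linear model to the built-in cars dataset and looks inside the result. The name `fit` is arbitrary, and the $ operator used here is covered in 3.3.2.

```r
# Fit a simple linear model on the built-in cars dataset
fit <- lm(dist ~ speed, data = cars)

is.list(fit)        # TRUE: the model object is a list underneath
class(fit)          # "lm"
names(fit)          # the components stored in the list
fit$coefficients    # extract one component by name
```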

3.3.1 Create

l <- list(
  name   = "John",
  age    = 30,
  height = 1.75
)
l
$name
[1] "John"

$age
[1] 30

$height
[1] 1.75

3.3.2 Inspect

Lists support three types of subsetting, with subtly different results:

l[1]       # returns a list of length 1
$name
[1] "John"
l[[1]]     # returns the element itself
[1] "John"
l$name     # same as l[["name"]], using the element name
[1] "John"
length(l)  # number of elements
[1] 3

3.3.3 Modify

l <- c(l, list(weight = 70))   # add an element
l$height <- NULL               # remove an element
l$name <- "Jane"               # modify an element
l
$name
[1] "Jane"

$age
[1] 30

$weight
[1] 70

[ always returns a list (the container), while [[ returns the element inside. This matters when passing list elements to functions:

# These two calls behave differently
class(l[1])    # "list"
[1] "list"
class(l[[1]])  # "character"
[1] "character"

Most functions expect the element itself, not a list containing it — so [[ and $ are more commonly useful when working with list contents.
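
A small sketch of what goes wrong with the container: mean() works on the element but not on a list holding it. The list `person` is made up so that `l` is left untouched.

```r
person <- list(name = "John", age = 30)

mean(person[["age"]])                   # 30: the numeric element itself

# Passing the one-element list instead gives NA (and a warning,
# suppressed here to keep the example quiet)
suppressWarnings(mean(person["age"]))   # NA
```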

3.4 Data frames

A data frame is the central data structure for tabular data in R. It is a two-dimensional, heterogeneous structure: a collection of vectors of equal length, each potentially of a different type. Each row represents an observation (e.g. a patient) and each column a variable (e.g. age, sex, diagnosis).

3.4.1 Create

df <- data.frame(
  patientid = 1:3,
  sex       = c("M", "F", "M"),
  age       = c(34, 41, 57)
)
df
  patientid sex age
1         1   M  34
2         2   F  41
3         3   M  57

3.4.2 Inspect

Three functions cover most inspection needs:

head(df)       # first 6 rows
  patientid sex age
1         1   M  34
2         2   F  41
3         3   M  57
str(df)        # structure: column types and a preview of values
'data.frame':   3 obs. of  3 variables:
 $ patientid: int  1 2 3
 $ sex      : chr  "M" "F" "M"
 $ age      : num  34 41 57
summary(df)    # statistical summary per column
   patientid       sex                 age      
 Min.   :1.0   Length:3           Min.   :34.0  
 1st Qu.:1.5   Class :character   1st Qu.:37.5  
 Median :2.0   Mode  :character   Median :41.0  
 Mean   :2.0                      Mean   :44.0  
 3rd Qu.:2.5                      3rd Qu.:49.0  
 Max.   :3.0                      Max.   :57.0  
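
Beyond whole-table summaries, you will often pull out a single column or filter rows by a condition. This sketch rebuilds the same three-patient table under a different name, `patients`, so the example is self-contained.

```r
patients <- data.frame(
  patientid = 1:3,
  sex       = c("M", "F", "M"),
  age       = c(34, 41, 57)
)

patients$age                           # one column as a vector: 34 41 57
patients[patients$age > 40, ]          # rows where age > 40 (patients 2 and 3)
patients[patients$sex == "M", "age"]   # ages of male patients: 34 57

nrow(patients)                         # 3
ncol(patients)                         # 3
```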

3.4.3 Modify

# Add a column
df$height <- c(1.75, 1.68, 1.80)
df
  patientid sex age height
1         1   M  34   1.75
2         2   F  41   1.68
3         3   M  57   1.80
# Remove a column
df$height <- NULL
df
  patientid sex age
1         1   M  34
2         2   F  41
3         3   M  57
# Add a row — use data.frame() to preserve types
df <- rbind(df, data.frame(patientid = 4, sex = "F", age = 51))
df
  patientid sex age
1         1   M  34
2         2   F  41
3         3   M  57
4         4   F  51
Warning: Type coercion trap

Avoid using c() to create new rows for a data frame. Because c() coerces everything to the most general type in the vector, numeric columns will be silently converted to character:

# This coerces patientid and age to character — probably not what you want
df_bad <- rbind(df, c(5, "M", 28))
str(df_bad)
'data.frame':   5 obs. of  3 variables:
 $ patientid: chr  "1" "2" "3" "4" ...
 $ sex      : chr  "M" "F" "M" "F" ...
 $ age      : chr  "34" "41" "57" "51" ...

3.5 Factors

Factors represent categorical variables with a fixed set of possible values (called levels). They are important for statistical modelling — if you include sex or tumour stage in a regression, R needs to know they are categories, not arbitrary numbers or strings.

3.5.1 Create

f <- factor(
  c("M", "F", "M"),
  levels = c("M", "F")
)
f
[1] M F M
Levels: M F
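
Setting levels explicitly matters because R otherwise sorts them alphabetically, which is rarely the natural order for graded categories. A sketch with an invented severity variable, using the ordered argument of factor():

```r
grade_chr <- c("low", "high", "medium", "low")

# Default: levels are sorted alphabetically
levels(factor(grade_chr))    # "high" "low" "medium"

# Explicit levels, declared as ordered
grade <- factor(
  grade_chr,
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)
levels(grade)                # "low" "medium" "high"
grade >= "medium"            # FALSE TRUE TRUE FALSE
```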

3.5.2 Inspect

levels(f)       # the possible categories
[1] "M" "F"
table(f)        # frequency of each level
f
M F 
2 1 
summary(f)      # similar to table() for factors
M F 
2 1 
str(f)          # internal representation (integers + levels)
 Factor w/ 2 levels "M","F": 1 2 1
Warning: as.numeric() on factors

as.numeric() returns the underlying integer codes (1, 2, 3…), not the level labels. This is a common source of errors. To get numeric values from a factor whose labels are numbers, use as.numeric(as.character(f)).

f2 <- factor(c("10", "20", "30"))
as.numeric(f2)              # returns 1 2 3 — WRONG
[1] 1 2 3
as.numeric(as.character(f2)) # returns 10 20 30 — correct
[1] 10 20 30

3.5.3 Modify

levels(f) <- c(levels(f), "U")   # add a level
f <- droplevels(f)               # remove unused levels
f
[1] M F M
Levels: M F
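
Two other modifications come up often: renaming levels and changing which level comes first (the reference level in a regression). A minimal sketch on a separate factor `sex`, using base R's relevel():

```r
sex <- factor(c("M", "F", "M"), levels = c("M", "F"))

# Rename levels: replacements are matched by position
levels(sex) <- c("Male", "Female")
as.character(sex)      # "Male" "Female" "Male"

# Make "Female" the first (reference) level
sex_ref <- relevel(sex, ref = "Female")
levels(sex_ref)        # "Female" "Male"
```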

4 Functions

A function is a reusable block of code that takes inputs (arguments) and returns an output. R comes with thousands of built-in functions, and writing your own is essential for keeping your analysis organised and avoiding repetition.

4.1 Create

myfun <- function(x, y) {
  result <- x + y
  return(result)
}

4.2 Inspect

formals(myfun)      # the argument list
$x


$y
body(myfun)         # the code inside
{
    result <- x + y
    return(result)
}

4.3 Call

Arguments can be passed by position or by name. Using names makes code more readable and prevents errors when a function has many arguments.

myfun(4, 6)             # by position
[1] 10
myfun(x = 4, y = 6)    # by name (preferred)
[1] 10
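
Functions can also give arguments default values and check their inputs. The sketch below is a hypothetical helper, `bmi()`, with a made-up default height; it is only meant to illustrate the mechanics.

```r
# Body mass index; height_m has a default used when it is not supplied
bmi <- function(weight_kg, height_m = 1.75) {
  # Stop early with an error if the inputs are not usable
  stopifnot(is.numeric(weight_kg), is.numeric(height_m), all(height_m > 0))
  weight_kg / height_m^2
}

bmi(70)                             # uses the default height
bmi(weight_kg = 70, height_m = 2)   # 17.5
```

Calling bmi("70") stops with an error instead of silently returning a wrong value, which is usually what you want in an analysis script.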

5 Input / Output

In practice you will almost always read data from a file and save results to disk.

5.1 Save and load R objects

save() and load() preserve R objects in their native binary format (.RData). This is useful for caching intermediate results in a long analysis.

save(df, file = "data.RData")
load(file = "data.RData")
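
Note that load() restores objects under their original names. Base R also offers saveRDS() and readRDS(), which store a single object and let you choose the name on reading. The sketch below writes to temporary files so nothing is left in your working directory; the object `scores` is made up.

```r
scores <- c(a = 1.5, b = 2.5)

# save() / load(): the object comes back under its original name
rdata_file <- tempfile(fileext = ".RData")
save(scores, file = rdata_file)
rm(scores)
restored <- load(rdata_file)
restored                          # "scores": names of the restored objects

# saveRDS() / readRDS(): one object, assigned to any name you like
rds_file <- tempfile(fileext = ".rds")
saveRDS(scores, file = rds_file)
scores_copy <- readRDS(rds_file)
identical(scores, scores_copy)    # TRUE
```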

5.2 Read and write CSV

CSV is the most portable format for sharing data between R and other tools.

# Write
write.csv(df, file = "data.csv", row.names = FALSE)

# Read
df <- read.csv(file = "data.csv")
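
CSV files do not store column types, so read.csv() guesses them again on reading. The sketch below shows one consequence, identifiers losing their leading zeros, and one way to prevent it with the colClasses argument. The data frame `ids` and its values are invented, and a temporary file is used.

```r
ids <- data.frame(id = c("001", "002"), age = c(34, 41))

csv_file <- tempfile(fileext = ".csv")
write.csv(ids, file = csv_file, row.names = FALSE)

# Default: the id column is guessed to be a number
guessed <- read.csv(file = csv_file)
guessed$id          # 1 2

# Declare the id column as text to keep the leading zeros
kept <- read.csv(file = csv_file, colClasses = c(id = "character"))
kept$id             # "001" "002"
```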

Base R’s read.csv() is fine for small files. For larger datasets (hundreds of thousands of rows), consider:

  • readr::read_csv() from the tidyverse: faster than base R, better type guessing, and returns a tibble instead of a plain data frame.
  • data.table::fread(): extremely fast for very large files.

Both packages need to be installed once with install.packages() before use.


6 Session Info

The section below records the exact versions of R and all loaded packages used to generate this document. Including it at the end of every analysis script is good practice for reproducibility.
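
The output that follows has the shape printed by sessionInfo() from base R. Assuming that is the call behind it, a minimal sketch for your own scripts is:

```r
# Collect version information for R and the loaded packages
info <- sessionInfo()

info$R.version$version.string   # the R version as a single string
info                            # printing the object gives the full report
```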

R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.2    fastmap_1.2.0     cli_3.6.6        
 [5] tools_4.5.2       htmltools_0.5.9   otel_0.2.0        yaml_2.3.12      
 [9] rmarkdown_2.31    knitr_1.51        jsonlite_2.0.0    xfun_0.57        
[13] digest_0.6.39     rlang_1.2.0       evaluate_1.0.5