
Introduction

This article provides a use case for renoir.

Data

We considered the data accompanying the paper by Sakellaropoulos et al., made available at this link:

  • batch-normalized GDSC/bortezomib dataset and clinical response data

Sample information for the clinical data was provided as supplementary material.

Setup

Firstly, we load renoir and other needed packages:

library(renoir)
library(plotly)
library(htmltools)

Unsupervised Screening

Given the high-dimensional data, we want to apply an initial pre-processing step that subsets the feature space. A list of supported methods is available through the ?list_supported_unsupervised_screening_methods function call.

#list methods
u.screening.methods = list_supported_unsupervised_screening_methods()

#print in table
knitr::kable(x = u.screening.methods)
id           description
na           filter by NA presence in the data (remove if NAs > threshold)
intensity    filter by features intensity (keep if value > median across observations)
variability  filter by features variability (select top variable features)

We decided to reduce the dimensionality by selecting the top variable features.

#method id
u.screening.method.id = "variability"

Learning Methods

Now we retrieve the ids of all the supported learners. A list of supported methods is available through the ?list_supported_learning_methods function call.

#list methods
learning.methods = list_supported_learning_methods()

#print in table
knitr::kable(x = learning.methods)
id                  method                                                                                        default_hyperparameters
lasso               generalized linear model via L1 penalized maximum likelihood (lasso penalty)                  lambda
ridge               generalized linear model via L2 penalized maximum likelihood (ridge penalty)                  lambda
elasticnet          generalized linear model via L1/L2 penalized maximum likelihood (elasticnet penalty)          lambda, ….
relaxed_lasso       generalized linear model via L1 penalized maximum likelihood (relaxed lasso penalty)          lambda, ….
relaxed_ridge       generalized linear model via L2 penalized maximum likelihood (relaxed ridge penalty)          lambda, ….
relaxed_elasticnet  generalized linear model via L1/L2 penalized maximum likelihood (relaxed elasticnet penalty)  lambda, ….
randomForest        random forest                                                                                 ntree
gbm                 generalized boosted model                                                                     eta, ntree
linear_SVM          linear support vector machine                                                                 cost
polynomial_SVM      polynomial support vector machine                                                             cost, ga….
radial_SVM          radial support vector machine                                                                 cost, gamma
sigmoid_SVM         sigmoid support vector machine                                                                cost, gamma
linear_NuSVM        linear nu-type support vector machine                                                         nu
polynomial_NuSVM    polynomial nu-type support vector machine                                                     nu, gamm….
radial_NuSVM        radial nu-type support vector machine                                                         nu, gamma
sigmoid_NuSVM       sigmoid nu-type support vector machine                                                        nu, gamma
gknn                generalized k-nearest neighbours model                                                        k
nsc                 nearest shrunken centroid model                                                               threshold

We can extract the ids by selecting the id column. For this use case we decided to focus on a subset of methods: lasso, ridge, elasticnet, and randomForest.

#learning method ids
learning.methods.ids = learning.methods$id[c(1,2,3,7)]

From the table, we can also see the default hyperparameters of the methods. Here we create our default values and define a function that dispatches them depending on the input id:

#get hyperparameters
get_hp = function(id, y){

  #Generalised Linear Models with Penalisation
  lambda = 10^seq(3, -2, length=100)
  # alpha = seq(0.1, 0.9, length = 9)
  # alpha = seq(0.1, 0.9, length = 5)
  alpha = seq(0.05, 0.95, by = 0.05)^2
  gamma = c(0, 0.25, 0.5, 0.75, 1)
  
  #Random Forest
  ntree = c(100, 500, 1000)
  
  #Generalised Boosted Regression Modelling
  eta = c(0.3, 0.1, 0.01, 0.001)
  
  #Support Vector Machines
  cost      = 2^seq(from = -5, to = 15, length.out = 5)
  svm.gamma = 2^seq(from = -15, to = 3, length.out = 4)
  degree    = seq(from = 1, to = 3, length.out = 3)
  #Note that for classification nu must be
  #nu * length(y)/2 <= min(table(y)). So instead of
  #fixing it as
  # nu        = seq(from = 0.1, to = 0.6, length.out = 5)
  #we do
  nu.to = floor((min(table(y)) * 2/length(y)) * 10) / 10
  nu = seq(from = 0.1, to = nu.to, length.out = 5)
  
  #kNN
  k = seq(from = 1, to = 9, length.out = 5)

  #Nearest Shrunken Centroid
  threshold = seq(0, 2, length.out = 30)

  #hyperparameters
  out = switch(
    id,
    'lasso'              = list(lambda = lambda),
    'ridge'              = list(lambda = lambda),
    'elasticnet'         = list(lambda = lambda, alpha = alpha),
    'relaxed_lasso'      = list(lambda = lambda, gamma = gamma),
    'relaxed_ridge'      = list(lambda = lambda, gamma = gamma),
    'relaxed_elasticnet' = list(lambda = lambda, gamma = gamma, alpha = alpha),
    'randomForest'       = list(ntree = ntree),
    'gbm'                = list(eta = eta, ntree = ntree),
    'linear_SVM'         = list(cost = cost),
    'polynomial_SVM'     = list(cost = cost, gamma = svm.gamma, degree = degree),
    'radial_SVM'         = list(cost = cost, gamma = svm.gamma),
    'sigmoid_SVM'        = list(cost = cost, gamma = svm.gamma),
    'linear_NuSVM'       = list(nu = nu),
    'polynomial_NuSVM'   = list(nu = nu, gamma = svm.gamma, degree = degree),
    'radial_NuSVM'       = list(nu = nu, gamma = svm.gamma),
    'sigmoid_NuSVM'      = list(nu = nu, gamma = svm.gamma),
    'gknn'               = list(k = k),
    'nsc'                = list(threshold = threshold)
  )

  return(out)
}
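For example, once the response vector y has been loaded (this happens in the Analysis section below), the dispatcher can be used as in the following quick sketch; the id "elasticnet" is chosen only for illustration:

#example call (assumes y is available, see the Analysis section below)
hp = get_hp(id = "elasticnet", y = y)

#hp is a named list of hyperparameter grids to explore,
#here hp$lambda (100 values) and hp$alpha (19 values)
str(hp)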

Performance metrics

A list of supported scoring metrics is available through the ?list_supported_performance_metrics function call.

#list metrics
performance.metrics = list_supported_performance_metrics(resp.type = "gaussian")

#print in table
knitr::kable(x = performance.metrics)
id     name                             problem
mae    Mean Absolute Error              regression
mape   Mean Absolute Percentage Error   regression
mse    Mean-squared Error               regression
rmse   Root-mean-square Error           regression
msle   Mean-squared Logarithmic Error   regression
r2     R2                               regression

For this benchmark we select the mean-squared error (mse) for tuning, and both the mean absolute error (mae) and the mean-squared error for evaluation.

#metric for tuning
performance.metric.id.tuning = "mse"

#metrics for evaluation
performance.metric.ids.evaluation = c("mae", "mse")

Sampling Methods

A list of supported sampling methods is available through the ?list_supported_sampling_methods function call.

#list methods
sampling.methods = list_supported_sampling_methods()

#print in table
knitr::kable(x = sampling.methods)
id         name                                  supported
random     random sampling without replacement   stratification, balance
bootstrap  random sampling with replacement      stratification, balance
cv         cross-validation                      stratification

We decided to select the common scenario of a stratified k-fold cross-validation for the tuning of hyperparameters, and repeated random sampling for the evaluation of the methodology.

#sampling for tuning
sampling.method.id.tuning = "cv"

#sampling for evaluation
sampling.method.id.evaluation = "random"

We can see that we are in a high-dimensional setting, as the number of variables is much larger than the number of observations. To reduce the noise associated with this setting (and the computational time) we can apply an initial unsupervised screening step. Moreover, we can also take advantage of a supervised screening step during training.

Analysis

We can now run our analyses.

Bortezomib

Load data

#path
fp = file.path("..", "..", "data-raw", "use_case", "cell_lines", "data")

#load data
x = readRDS(file = file.path(fp, "bortezomib_cells.rds"))

#process
y = stats::setNames(object = x$Bortezomib, nm = rownames(x))
x = as.matrix(x[,-1])

#remove data without response
y = y[!is.na(y)]
x = x[names(y),]

#inspect predictors matrix
cat(paste("Number of features:", ncol(x)), sep = "\n")
cat(paste("Number of observations:", nrow(x)), sep = "\n")

We need to set the response type.

#set response type
resp.type = "gaussian"

Setup

Unsupervised Screener

Firstly, we want to set up a filter to reduce the dimensionality of our problem.


top.genes = c(0.2, 0.8, 0.9, 1)[2]

#filter
uscreener = Filter(
  #id of the method
  id = u.screening.method.id,
  #fix some arguments
  parameters = list(  
    threshold = top.genes, #keep the top variable features (here top.genes = 0.8, i.e. the top 80%)
    method    = "mad"      #use the median absolute deviation (MAD) to measure variability
  ),
  #logger
  logger   = Logger(verbose = T, level = "INFO")
)

We could provide the Filter object directly to the ?renoir function call, or perform the unsupervised screening separately by using the ?filter function.

#unsupervised screening
filtered = renoir::filter(
  filter = uscreener,
  x = x
)

#get x
x.sub = filtered@filtered
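
As a quick check that the feature space was actually reduced, we can compare the dimensions of the original and filtered predictor matrices:

#compare dimensions before and after unsupervised screening
dim(x)      #original predictors matrix
dim(x.sub)  #filtered predictors matrix (top variable features only)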
Supervised Screener

We set up a supervised screener to use during tuning.

#screener
screener = Screener(
  #id of the method
  id = "ebayes",
  #fix some arguments
  parameters = list(  
    p.value = 0.05,    #select features with p < 0.05
    logged2 = F,    #whether the data is already log2-transformed
    assay.type = 'array' #applies limma-trend method
  ),
  #logger
  logger   = Logger(verbose = T, level = "INFO")
)
Tuner

Now we can create a tuner performing a grid search (method id "grid.search") via 5-fold cross-validation.

#tuner
tuner = Tuner(
  id = "grid.search",
  #Sampling strategy
  sampler = Sampler(
    method = sampling.method.id.tuning,
    k = 5L,
    n = integer()
    # ,strata = y
  ),
  #
  screener = ScreenerList(screener),
  #Use parallel
  looper   = Looper(cores = 1L),
  #Logger
  logger   = Logger(verbose = T, level = "INFO")
  # logger   = Logger(verbose = T, level = "DEBUG")
)
Learner

We can now create the related ?Learner objects.

#container
learners = list()

#loop
for(learning.method.id in learning.methods.ids){
  #manual setup
  learners[[learning.method.id]] = Learner(
    tuner      = tuner,
    trainer    = Trainer(id = learning.method.id),
    forecaster = Forecaster(id = learning.method.id),
    scorer     = ScorerList(Scorer(id = performance.metric.id.tuning)),
    selector   = Selector(id = learning.method.id),
    recorder   = Recorder(id = learning.method.id, logger = Logger(level = "ALL", verbose = T)),
    marker     = Marker(id = learning.method.id, logger = Logger(level = "ALL", verbose = T)),
    logger     = Logger(level = "ALL")
  )
}
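
As an optional sanity check, we can confirm that one Learner was created per selected method:

#one Learner per selected learning method
names(learners)  #expected: "lasso", "ridge", "elasticnet", "randomForest"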
Evaluator

Finally, we need to set up the ?Evaluator.

#Evaluator
evaluator = Evaluator(
  #Sampling strategy: repeated random sampling without replacement
  sampler = Sampler(               
    method = "random",             
    k = 10L,                       
    # strata = y,                    
    N = as.integer(length(y))      
  ),

  #Performance metric
  scorer  = ScorerList(
    Scorer(id = performance.metric.ids.evaluation[1]),
    Scorer(id = performance.metric.ids.evaluation[2])
    # Scorer(id = performance.metric.ids.evaluation[3])
  )
)

Analysis with Pre-processing

Let’s create a directory to store the results.

#define path
outdir = file.path("E:", "Personal", "My Software", "R packages", "renoir", "data-raw", "use_case", "cell_lines", "analysis", "bortezomib", paste0("top",top.genes*100))
outdir = file.path("..", "..", "data-raw", "use_case", "cell_lines", "analysis", "bortezomib", paste0("top",top.genes*100))

#create if not existing
if(!dir.exists(outdir)){dir.create(path = outdir, showWarnings = F, recursive = T)}

Now we can run the analysis.

Before running the analysis, we want to set a seed for the random number generation (RNG). By default, each R session derives its seed from the current time and process ID, so different sessions produce different simulation results. By fixing a seed we ensure the results can be reproduced. We can specify a seed by calling ?set.seed.
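
As a minimal illustration, drawing random numbers after resetting the same seed always yields identical values:

#same seed, same draws
set.seed(seed = 5381L)
runif(n = 3)

set.seed(seed = 5381L)
runif(n = 3)  #identical to the previous draw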

In the code below, we set a seed before running the analysis for each considered learning method.

#container list
resl = list()

#loop
# learning.method.id = learning.methods.ids[2]
for(learning.method.id in learning.methods.ids){
  
  #Each analysis can take hours, so we save data 
  #for future faster load
  
  #path to file
  fp.obj = file.path(outdir, paste0(learning.method.id,".rds"))
  fp.sum = file.path(outdir, paste0("st_",learning.method.id,".rds"))
  
  #check if exists
  if(file.exists(fp.sum)){
    #load
    cat(paste("Reading", learning.method.id, "..."), sep = "")
    resl[[learning.method.id]] = readRDS(file = fp.sum)
    cat("DONE", sep = "\n")
  } else {
  
    cat(paste("Learning method:", learning.method.id), sep = "\n")
    
    #Set a seed for RNG
    set.seed(
      #A seed
      seed = 5381L,                   #a randomly chosen integer value
      #The kind of RNG to use
      kind = "Mersenne-Twister",      #we make explicit the current R default value
      #The kind of Normal generation
      normal.kind = "Inversion"       #we make explicit the current R default value
    )
    
    resl[[learning.method.id]] = renoir(
      #Unsupervised screening
      # filter = filter,
  
      #Training set size
      npoints = 3,
      # ngrid,
      nmin = round(nrow(x)/2),
  
      #Loop
      looper = Looper(),
  
      #Store
      filename = "renoir",
      outdir   = file.path(outdir, learning.method.id),
      restore  = TRUE,
  
      #Learn
      learner   = learners[[learning.method.id]],
  
      #Evaluate
      evaluator = evaluator,
  
      #Log
      logger    = Logger(
        path = file.path(outdir, learning.method.id, "log.txt"),
        level = "ALL", 
        verbose = T),
  
      #Data for training
      hyperparameters = get_hp(id = learning.method.id, y = y),
      # x         = x,
      x         = x.sub,
      y         = y,
      weights   = NULL,
      offset    = NULL,
      resp.type = resp.type,
  
      #space
      rm.call = T,
      rm.fit  = T,
  
      #Group results
      grouping = TRUE,
  
      #No screening
      # screening = c(500, 1000, 2358),
      screening = NULL,
      
      #Remove call from trainer to reduce space
      keep.call = F
    )
    
    #save
    saveRDS(object = resl[[learning.method.id]], file = fp.obj)
    
    #create summary table
    resl[[learning.method.id]] = renoir:::summary_table.RenoirList(resl[[learning.method.id]], key = c("id", "config"))
    
    #save summary table
    saveRDS(object = resl[[learning.method.id]], file = fp.sum)
    
    cat("\n\n", sep = "\n")
  }
}

#create summary table
resl = do.call(what = rbind, args = c(resl, make.row.names = F, stringsAsFactors = F))
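
Before plotting, we can take a quick look at the combined summary table. The exact columns depend on the renoir version, but it contains at least the learning method and configuration identifiers used for the plots below:

#peek at the combined summary table
dim(resl)
head(resl[, c("learning", "config")])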
Performance

Let’s now plot the performance metrics for the opt and 1se configurations, considering the test set.

Mean-squared error

We consider the mean-squared error (mse).

Test

This is the mse score for the opt configuration when considering the test set.

#plot
renoir:::plot.RenoirSummaryTable(
  x = resl[resl$config == "opt",,drop=F],
  measure     = "mse", 
  set         = "test", 
  interactive = T, 
  add.boxplot = F,
  add.scores  = F,
  add.best    = F,
  key         = c("id", "config")
  )

This is the mse score for the 1se configuration when considering the test set.

#plot
renoir:::plot.RenoirSummaryTable(
  x = resl[resl$config == "1se",,drop=F], #select 1se config 
  measure     = "mse", 
  set         = "test", 
  interactive = T, 
  add.boxplot = F,
  add.scores  = F,
  add.best    = F,
  key         = c("id", "config")
  )
Plot for paper
#set colors
col.lasso = "rgba(255, 127,  14, 1)"
col.ridge = "rgba(140,  86,  75, 1)"
col.elnet = "rgba( 23, 190, 207, 1)"
col.rndfr = "rgba(214,  39,  40, 1)"

col.list = list(lasso = col.lasso, ridge = col.ridge, elasticnet = col.elnet, randomForest = col.rndfr)

#list to store data
outl = list()

#lasso
learning.method.id = 'lasso'
for(set in c("train", "test", "full")){
  title = switch(
    set,
    "train" = "Train set", 
    "test" = "Test set", 
    "full" = "Full set"
  )
  #create and store
  outl[[set]][[learning.method.id]] = renoir:::plot.RenoirSummaryTable(
      x = resl[resl$config == "opt" & resl$learning == learning.method.id,,drop=F],
      measure = "mse",
      name.ms = paste0("mean score\n", learning.method.id),
      set = "train",
      interactive = T,
      add.boxplot = T,
      add.scores = T,
      add.best = T,
      key = c("id", "config"),
      colour = col.list[[learning.method.id]]
    ) %>% layout(yaxis = list(title = title))
}


for(set in c("train", "test", "full")){
  title = switch(
    set,
    "train" = "Train set", 
    "test" = "Test set", 
    "full" = "Full set"
  )
  for(learning.method.id in learning.methods.ids[2:4]){
    #create
    #store
    outl[[set]][[learning.method.id]] = renoir:::plot.RenoirSummaryTable(
      x = resl[resl$config == "opt" & resl$learning == learning.method.id,,drop=F],
      measure = "mse",
      name.ms = paste0("mean score\n", learning.method.id),
      set = set,
      interactive = T,
      add.boxplot = T,
      add.scores = T,
      add.best = T,
      key = c("id", "config"),
      colour = col.list[[learning.method.id]]
    ) # %>% plotly::layout(yaxis = list(title = paste0('Mean-squared Error\n', title)))
  }
}

fig <- plotly::subplot(
  unlist(outl, recursive = FALSE), 
  nrows = 3, 
  titleY = TRUE,
  titleX = FALSE,
  shareY = TRUE,
  shareX = TRUE
)

#layout
finalfig = fig %>% plotly::layout(margin = list(l = 75, r = 25, b = 50, t = 50))

finalfig = finalfig %>% plotly::layout(
  annotations = list(
    list(
      x = 0, 
      y = 0.5, 
      text = "Mean-squared Error",
      xshift = -75,
      font = list(color = "black",size = 14),
      textangle = 270,
      showarrow = F, 
      xref='paper', 
      yref='paper', 
      size=48),
    list(
      x = 0.5, 
      y = 0, 
      text = "Trainig-set size",
      yshift = -50,
      font = list(color = "black",size = 14),
      showarrow = F, 
      xref='paper', 
      yref='paper', 
      size=48)
  )
)
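If needed, the final interactive figure can be exported to a standalone HTML file, for instance with htmlwidgets::saveWidget (this assumes the htmlwidgets package is installed; the file name below is arbitrary):

#export the interactive figure as a standalone HTML file
htmlwidgets::saveWidget(
  widget        = finalfig,
  file          = "bortezomib_mse.html",
  selfcontained = TRUE
)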
Mean-absolute error

We consider the mean absolute error (mae).

Test

This is the mae score for the opt configuration when considering the test set.

#plot
renoir:::plot.RenoirSummaryTable(
  x = resl[resl$config == "opt",,drop=F],
  measure     = "mae", 
  set         = "test", 
  interactive = T, 
  add.boxplot = F,
  add.scores  = F,
  add.best    = F,
  key         = c("id", "config")
  )

This is the mae score for the 1se configuration when considering the test set.

#plot
renoir:::plot.RenoirSummaryTable(
  x = resl[resl$config == "1se",,drop=F], #select 1se config 
  measure     = "mae", 
  set         = "test", 
  interactive = T, 
  add.boxplot = F,
  add.scores  = F,
  add.best    = F,
  key         = c("id", "config")
  )