Significance Analysis of Microarrays — rowSamStatistics • featscreen

This function implements a common workflow that uses repeated permutations of the input data to determine whether any features are significantly related to the response.

The following steps are executed:

correlate each feature with outcome variable: samr
compute thresholds, cutpoints, and false discovery rates for SAM analysis: samr.compute.delta.table
compute SAM statistics and significance: samr.compute.siggenes.table
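The permutation idea behind these steps can be illustrated in base R. The sketch below is a simplified stand-in, not the samr implementation: it uses a plain two-sample t-statistic and per-feature permutation p-values instead of SAM's moderated statistic and q-values, and the helper name `perm_pvalues` is hypothetical.

```r
# Simplified per-feature permutation test (illustration only, not samr itself).
# perm_pvalues is a hypothetical helper name.
perm_pvalues <- function(x, y, nperms = 100) {
  # Row-wise two-sample t-statistic for a given labelling of the samples
  row_t <- function(labels) {
    apply(x, 1, function(v) {
      unname(stats::t.test(v[labels == 2], v[labels == 1])$statistic)
    })
  }
  observed <- row_t(y)
  # Statistics under random relabelling: one column per permutation
  permuted <- replicate(nperms, row_t(sample(y)))
  # Fraction of permuted statistics at least as extreme as the observed one
  pvals <- rowMeans(abs(permuted) >= abs(observed))
  list(statistic = observed, p.value = pvals)
}

set.seed(1)
x <- matrix(stats::rnorm(10 * 20), nrow = 10)
x[1, 11:20] <- x[1, 11:20] + 3   # shift feature 1 in the second class
y <- rep(c(1, 2), each = 10)
res <- perm_pvalues(x, y, nperms = 200)
```

Here `res$statistic` and `res$p.value` each hold one value per feature (row of `x`); the shifted feature should stand out with a large statistic and a small permutation p-value.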

Usage

rowSamStatistics(
  x,
  y = NULL,
  observations = NULL,
  technology = c("array", "seq"),
  geneid = NULL,
  genenames = NULL,
  censoring.status = NULL,
  logged2 = FALSE,
  eigengene.number = 1,
  resp.type = c("Quantitative", "Two class unpaired", "Survival", "Multiclass",
    "One class", "Two class paired", "Two class unpaired timecourse",
    "One class timecourse", "Two class paired timecourse", "Pattern discovery"),
  s0 = NULL,
  s0.perc = NULL,
  nperms = 100,
  center.arrays = FALSE,
  testStatistic = c("standard", "wilcoxon"),
  time.summary.type = c("slope", "signed.area"),
  regression.method = c("standard", "ranks"),
  knn.neighbors = 10,
  random.seed = NULL,
  nresamp = 20,
  nresamp.perm = NULL,
  dels = NULL,
  nvals = 50,
  fdr.output = 0.2,
  logger = NULL
)

Arguments

x

Feature matrix: p (number of features) by n (number of samples), one observation per column (missing values allowed)

y

n-vector of outcome measurements

observations

(optional) integer vector, the indices of observations to keep.
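As a minimal sketch of what keeping a subset of observations means for a features-by-samples matrix, assuming the indices select columns of x and the matching entries of y (the internal handling in featscreen is not shown on this page):

```r
# Assumption: `observations` indexes the columns of x and the entries of y.
x <- matrix(seq_len(12), nrow = 3,
            dimnames = list(paste0("f", 1:3), paste0("S", 1:4)))
y <- c(S1 = 1, S2 = 1, S3 = 2, S4 = 2)

observations <- c(1, 3, 4)
x_sub <- x[, observations, drop = FALSE]  # keep all features, selected samples
y_sub <- y[observations]                  # keep the matching outcomes
```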

technology

character string, the technology used to generate the data. Available options are:

array: data generated with microarray technology
seq: data generated with RNA-seq technology

geneid

Optional character vector of geneids for output.

genenames

Optional character vector of genenames for output.

censoring.status

n-vector of censoring status (1 = died or event occurred, 0 = survived or event was censored), needed for a censored survival outcome

logged2

Has the data been transformed by log (base 2)? This information is used only for computing fold changes
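The reason the log scale matters can be shown with a small two-class example. The sketch below assumes one common convention (ratio of group means on the raw scale, difference of group means back-transformed on the log2 scale); the exact definition used by samr may differ, and `fold_change` is a hypothetical helper.

```r
# Two-class fold change under an assumed convention (illustration only).
fold_change <- function(x, y, logged2 = FALSE) {
  m1 <- rowMeans(x[, y == 1, drop = FALSE], na.rm = TRUE)
  m2 <- rowMeans(x[, y == 2, drop = FALSE], na.rm = TRUE)
  # On the log2 scale a difference of means corresponds to a ratio
  if (logged2) 2^(m2 - m1) else m2 / m1
}

x_raw <- rbind(c(1, 2, 4, 8),
               c(3, 3, 3, 3))
y <- c(1, 1, 2, 2)

fc_raw <- fold_change(x_raw, y)                        # raw-scale data
fc_log <- fold_change(log2(x_raw), y, logged2 = TRUE)  # log2-scale data
```

Passing log2 data with `logged2 = FALSE` would instead divide two log-scale means, which is not a fold change.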

eigengene.number

Eigengene to be used (just for resp.type="Pattern discovery")

resp.type

Problem type. Available options are:

"Quantitative": continuous outcome (array and sequencing data)
"Two class unpaired": two classes with unpaired observations (array and sequencing data)
"Survival": censored survival outcome (array and sequencing data)
"Multiclass": more than two groups (array and sequencing data)
"One class": a single group (array data only)
"Two class paired": two classes with paired observations (array and sequencing data)
"Two class unpaired timecourse": array data only
"One class timecourse": array data only
"Two class paired timecourse": array data only
"Pattern discovery": array data only

s0

Exchangeability factor for denominator of test statistic; Default is automatic choice. Only used for array data.

s0.perc

Percentile of standard deviation values to use for s0; default is automatic choice. A value of -1 means s0=0, which differs from s0.perc=0, where s0 is the zeroth percentile (the minimum) of the standard deviation values. Only used for array data.
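The role of s0 can be seen in a small base R sketch of a moderated two-class statistic d = r / (s + s0). This is an illustration under assumptions, not samr's code: the automatic choice of s0 is not reproduced, s0.perc is assumed to be on a 0 to 100 scale, and `sam_like_d` is a hypothetical helper.

```r
# Moderated two-class statistic d = r / (s + s0) (illustration only).
sam_like_d <- function(x, y, s0.perc = 50) {
  g1 <- x[, y == 1, drop = FALSE]
  g2 <- x[, y == 2, drop = FALSE]
  n1 <- ncol(g1)
  n2 <- ncol(g2)
  r <- rowMeans(g2) - rowMeans(g1)                     # numerator
  pooled_var <- ((n1 - 1) * apply(g1, 1, stats::var) +
                 (n2 - 1) * apply(g2, 1, stats::var)) / (n1 + n2 - 2)
  s <- sqrt(pooled_var * (1 / n1 + 1 / n2))            # per-feature standard error
  # s0.perc = -1 gives s0 = 0; otherwise s0 is a percentile of the s values
  s0 <- if (s0.perc < 0) 0 else unname(stats::quantile(s, s0.perc / 100))
  list(d = r / (s + s0), s = s, s0 = s0)
}

set.seed(2)
x <- matrix(stats::rnorm(10 * 20), nrow = 10)
y <- rep(c(1, 2), each = 10)

plain     <- sam_like_d(x, y, s0.perc = -1)  # s0 = 0: ordinary pooled t-statistic
min_sd    <- sam_like_d(x, y, s0.perc = 0)   # s0 = min(s)
moderated <- sam_like_d(x, y, s0.perc = 50)  # s0 = median(s)
```

A positive s0 shrinks every statistic toward zero, and shrinks features with a small standard error the most, which is why s0.perc=-1 and s0.perc=0 give different results.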

nperms

Number of permutations used to estimate false discovery rates

center.arrays

Should the data for each sample (array) be median centered at the outset? Default is FALSE. Only used for array data.
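Median centering of samples can be sketched in base R as follows. This is a minimal illustration of the operation on a features-by-samples matrix, not the code samr runs, and `center_columns` is a hypothetical helper.

```r
# Subtract each sample's (column's) median, ignoring missing values.
center_columns <- function(x) {
  col_medians <- apply(x, 2, stats::median, na.rm = TRUE)
  sweep(x, 2, col_medians, FUN = "-")
}

# Two samples (columns) on different scales, one missing value
x <- cbind(S1 = c(1, 2, 3), S2 = c(10, NA, 30))
xc <- center_columns(x)
```

After centering, every column of `xc` has median zero and missing values stay missing.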

testStatistic

Test statistic to use in the two class unpaired case. Either "standard" (t-statistic) or "wilcoxon" (two-sample Wilcoxon or Mann-Whitney test). Only used for array data.
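To make the two choices concrete, the sketch below computes both kinds of statistic per feature with base R tests. It is only an approximation of the idea: samr's "standard" statistic additionally uses the s0 term in the denominator, and its Wilcoxon statistic may be scaled differently. The helper `row_two_class` is hypothetical.

```r
# Row-wise two-class statistics using base R tests (illustration only).
row_two_class <- function(x, y, testStatistic = c("standard", "wilcoxon")) {
  testStatistic <- match.arg(testStatistic)
  apply(x, 1, function(v) {
    a <- v[y == 2]
    b <- v[y == 1]
    if (testStatistic == "standard") {
      # Pooled-variance two-sample t-statistic
      unname(stats::t.test(a, b, var.equal = TRUE)$statistic)
    } else {
      # Rank-based Wilcoxon / Mann-Whitney statistic W
      unname(stats::wilcox.test(a, b, exact = FALSE)$statistic)
    }
  })
}

set.seed(3)
x <- matrix(stats::rnorm(5 * 12), nrow = 5)
y <- rep(c(1, 2), each = 6)
t_stats <- row_two_class(x, y, "standard")
w_stats <- row_two_class(x, y, "wilcoxon")
```

The rank-based statistic depends only on the ordering of the values, so it is less sensitive to outliers than the t-statistic.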

time.summary.type

Summary measure for each time course: "slope" or "signed.area". Only used for array data.

regression.method

Regression method for the quantitative case: "standard" (linear least squares) or "ranks" (linear least squares on ranked data). Only used for array data.
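The difference between the two methods can be sketched with ordinary least squares in base R. The example assumes that "ranks" means ranking both the feature values and the outcome before fitting; this is an illustration, not samr's code, and `row_slopes` is a hypothetical helper.

```r
# Per-feature regression slope of the feature on the outcome (illustration only).
row_slopes <- function(x, y, regression.method = c("standard", "ranks")) {
  regression.method <- match.arg(regression.method)
  if (regression.method == "ranks") {
    x <- t(apply(x, 1, rank))  # rank each feature across samples
    y <- rank(y)
  }
  apply(x, 1, function(v) unname(stats::coef(stats::lm(v ~ y))[2]))
}

y <- 1:8
x <- rbind(exp(y), -y^3)  # two monotone but strongly non-linear features

slopes_std  <- row_slopes(x, y, "standard")
slopes_rank <- row_slopes(x, y, "ranks")
```

On ranked data the two monotone features give slopes of exactly 1 and -1, whereas the raw-scale slopes are dominated by the largest values.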

knn.neighbors

Number of nearest neighbors to use for imputation of missing feature values. Only used for array data.

random.seed

Optional initial seed for random number generator (integer)

nresamp

For technology="seq", number of resamples used to construct the test statistic. Default is 20. Only used for sequencing data.

nresamp.perm

For technology="seq", number of resamples used to construct the test statistic for permutations. Defaults to nresamp and must be at most nresamp. Only used for sequencing data.

dels

Vector of delta values used. Delta is the vertical distance from the 45 degree line to the upper and lower parallel lines that define the SAM threshold rule. By default, for array data, 50 values are chosen in the relevant operating range for delta. For sequencing data, the maximum number of effective delta values is chosen automatically according to the data.
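The threshold rule can be sketched in base R as follows. This is a simplified reading of the rule under stated assumptions (expected order statistics taken as the average of the sorted permuted statistics, search started from the sign change of the expected values), not samr's internal code; `call_by_delta` is a hypothetical helper.

```r
# Simplified SAM-style delta rule (illustration only).
# d: observed statistics; d_perm: permuted statistics (features x permutations)
call_by_delta <- function(d, d_perm, delta) {
  p <- length(d)
  ord <- order(d)
  d_sorted <- d[ord]
  # Expected order statistics: mean of the sorted permuted statistics
  d_expected <- rowMeans(apply(d_perm, 2, sort))
  gap <- d_sorted - d_expected          # vertical distance from the 45 degree line
  called_sorted <- logical(p)
  # Upper side: first point above the upper line, and everything to its right
  i_up <- which(d_expected >= 0 & gap > delta)
  if (length(i_up) > 0) called_sorted[min(i_up):p] <- TRUE
  # Lower side: last point below the lower line, and everything to its left
  i_lo <- which(d_expected < 0 & -gap > delta)
  if (length(i_lo) > 0) called_sorted[1:max(i_lo)] <- TRUE
  called <- logical(p)
  called[ord] <- called_sorted          # map back to the original feature order
  called
}

q <- stats::qnorm(stats::ppoints(10))   # idealised null statistics
d_perm <- cbind(q, rev(q))              # every permutation has the same sorted values
d <- q
d[1]  <- d[1] - 3                       # one strongly negative feature
d[10] <- d[10] + 3                      # one strongly positive feature
called <- call_by_delta(d, d_perm, delta = 1)
```

With these toy values only the first and last features are called; a larger delta gives a stricter rule and fewer calls.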

nvals

Number of delta values used. For array data, the default value is 50. For sequencing data, the value will be chosen automatically.

fdr.output

(Approximate) False Discovery Rate cutoff for output in significant genes table

logger

a Logger object.

Value

A list containing two elements:

statistic: A numeric vector, the values of the test statistic
significance: A numeric vector, the q-values of the selected test
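A typical use of the returned list is to keep features whose q-value falls below a cutoff. The sketch below reuses the numbers from the multiclass example output on this page rather than calling the function; the feature names and the 0.2 cutoff are assumptions for illustration, and the q-values are assumed to be on a 0 to 1 scale as that output suggests.

```r
# Values copied from the multiclass example output on this page
res <- list(
  statistic    = c(0.500, 0.219, 0.048, 0.345, 0.194, 0.778, 0.089, 0.286, 0.272, 0.271),
  significance = c(0.6, 0.6, 0.6, 0.6, 0.6, 0.0, 0.6, 0.6, 0.6, 0.6)
)
# Feature names added for illustration (the printed output is unnamed)
feature_names <- paste0("f", 1:10)
names(res$statistic) <- feature_names
names(res$significance) <- feature_names

# Keep features with q-value at or below an assumed cutoff of 0.2
selected <- names(res$significance)[res$significance <= 0.2]

# Rank all features by the size of the test statistic
ranked <- names(sort(res$statistic, decreasing = TRUE))
```

Here `selected` contains only "f6", which is also the top-ranked feature by test statistic.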

Author

Alessandro Barberis

Examples

#Seed
set.seed(1010)

#Define row/col size
nr = 10
nc = 20

#Data
x = matrix(
 data = stats::rnorm(n = nr*nc),
 nrow = nr,
 ncol = nc,
 dimnames = list(
   paste0("f",seq(nr)),
   paste0("S",seq(nc))
 )
)

#Categorical output vector (binomial)
y = c(rep(1,nc/2), rep(2,nc/2))
names(y) = paste0("S",seq(nc))
rowSamStatistics(x=x,y=y)
#> $statistic
#>  [1] -0.181  0.234 -0.021  0.747  0.415  1.120 -0.020  0.482  0.070  0.600
#> 
#> $significance
#>  [1] 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
#> 

#Categorical output vector (multinomial)
y = c(rep(1,nc/4), rep(2,nc/4), rep(3,nc/2))
names(y) = paste0("S",seq(nc))
rowSamStatistics(x=x, y=y, resp.type = "Multiclass")
#> $statistic
#>  [1] 0.500 0.219 0.048 0.345 0.194 0.778 0.089 0.286 0.272 0.271
#> 
#> $significance
#>  [1] 0.6 0.6 0.6 0.6 0.6 0.0 0.6 0.6 0.6 0.6
#>