intro

Permutation-based All-Resolutions Inference

Overview of the method

pARI is an R package developed to a perform permutation-based closed testing method. It computes a simultaneous lower bound for the true discovery proportions of all possible subsets of a hypothesis testing problem.

pARI find the percentage of true discoveries for each set of statistical tests while controlling the familywise error rate for multiple testing and taking into account that the set was chosen in a data-driven way.

Permutation theory adapts to the correlation structure, as a simultaneous method, it allows the decision of which hypotheses sets to analyze to be entirely flexible and post-hoc, that is, the user can choose it after seeing the data and revise the choice as often as he/she wants.

pARI is entirely mild, flexible, and post-hoc. The required input is the permutation p-values matrix, i.e., null p-values distribution, that describes the p-values associated with each feature’s statistical tests and permutation. pARI is valid if the exchangeability assumption under the null hypothesis is satisfied for the permutation procedure’s validity. If the permutation matrix is not available, the user can directly insert the data specifying the type of test to perform for each feature.

pARI for each set of features, i.e., clusters, returns the simultaneous lower confidence bound to the actual proportion of significant features. The analysis can be carried out as many times as the researcher wants; also, he/she can drill down into the cluster as often as the user wants without making any selection error and ensuring the family-wise error rate (FWER).

Usage

The pARI package can be installed by

#devtools::install_github("angeella/pARI")
#install.packages("pARI")
library(pARI)

There are two main functions in the pARI package.

The function pARIbrain was developed for the fMRI cluster analysis framework, while the function pARI was developed for every multiple-testing framework.

Simulations

We perform a simple simulation using the function simulateData. 1000 features are generated 30 times as normally distributed with mean 0 under the null hypothesis and mean under the alternative computed considering the difference in means having the power of the one-sample t-test equals 0.8. The proportion of true null hypothesis equals π = 0.8.

datas <- simulateData(pi0 = 0.8, m = 1000, n = 30, power = 0.9, rho = 0.5,seed = 123)

pARI then computes the lower bound for the number of true discoveries inside the set containing the first 200 features. The user must specify the cluster in the ix set, i.e., from 1 to 200 in this case. We apply the one-sample t-test for each feature.

out <- pARI(X = datas, ix = c(1:200), test.type = "one_sample", seed = 123)
out$TDP
#>      [,1]
#> [1,] 0.68

Therefore, we can say that at least 68 of features are truly significant inside the ix cluster.

However, the pARI function can analyzed directly the matrix of permuted p-values. We can compute it by signTest function:

out <- signTest(X = datas, B = 1000, rand = F)
P <- cbind(out$pv, out$pv_H0)
pARI(pvalues = P, ix = c(1:200),test.type = "one_sample")$TDP
#>      [,1]
#> [1,] 0.72

The set of features can also be expressed as a vector with length equals the number of features where different values indicate the different sets. For example, we can construct four random clusters as

ix <- sample(c(1:4), size = 1000, replace = T)
out <- pARI(pvalues = P, ix = ix,test.type = "one_sample", clusters = TRUE)$TDP
out
#> [1] 0.1298701 0.1150794 0.1198502 0.1000000

specifying in the pARI function cluster = TRUE. Then, pARI returns the lower bound for the true discovery proportion for each set of features. We can say that we have at least 12.987013 of truly active features in the first cluster.

Gene cluster analysis

Let consider a simple example using Montgomery et al. (2010) and Pickrell et. al (2010) data, i.e., an expression set that combines two studies transcriptome genetics using second generation sequencing HapMap. The data comprises 52580 genes and 129 samples (i.e., 60 unrelated Caucasian individuals of European descent and 69 unrelated Nigerian individuals). After pre-processing steps, we perform a two-sample t-test for each gene, and we define the sets of interest the ones computed by the hclust function.

if (!requireNamespace("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")
}
    

if (!requireNamespace("dynamicTreeCut", quietly = TRUE)){
  install.packages("dynamicTreeCut")
}
    

#BiocManager::install(c("Biobase","genefilter", "EnrichmentBrowser"))

library(Biobase)
library(genefilter)
library(dynamicTreeCut)
load(file=url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData"))

pdata<- pData(montpick.eset)
edata <- as.matrix(exprs(montpick.eset))
fdata <- fData(montpick.eset)

edata <- log2(as.matrix(edata) + 1)
edata <- edata[rowMeans(edata) > 10, ]

my.dist <- dist(edata)
my.tree <- hclust(my.dist, method="ward.D2")

my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist),minClusterSize=10))

Having the data edata with labels referring to the type of population, we use pARI considering as ix the hierarchical cluster analysis output.

out <-pARI(X = edata,alpha = 0.05, test.type = "two_samples",
           label = as.factor(pdata$population),
           ix = my.clusters, family = "simes", clusters = TRUE)
out$TDP

For each cluster computed by hclust, pARI returns the lower bound for the true discoveries proportion.

Alternatively, you can use as clusters the pathways of genes from the getGenesets function:

pathways <- EnrichmentBrowser::getGenesets(org = "hsa", db = "kegg", gene.id.type = "ENSEMBL")

out <- c()
for(i in seq(pathways)){
  
  ix <- which(rownames(edata) %in% pathways[[i]])
  if(length(ix)!=0){
  out[i] <-pARI(X = edata,alpha = 0.05, test.type = "two_samples",
           label = as.factor(pdata$population),
           ix = ix, family = "simes", clusters = TRUE)$TDP
  }else{
  out[i] <- NA
  }
}

db <- data.frame(TDP = out, size = sapply(seq(pathways), function(x) length(which(rownames(edata) %in% pathways[[x]]))), name = names(pathways))

So, we represent the lower confidence bounds for the proportion of differential expressed genes (TDP) inside each pathway in the following plot:

library(tidyverse)
db <- db %>% dplyr::filter(!is.na(TDP) | TDP!=0) %>% dplyr::filter(size >10) 

db$name = factor(db$name, levels=db[order(db$TDP), "name"])

db %>% dplyr::filter(!is.na(TDP) | TDP!=0) %>% dplyr::filter(size >10) %>% arrange(TDP)%>% ggplot(aes(x = TDP, y = name, size = size)) + geom_point() + 
  xlab(expression(bar(pi)(S[m]))) + ylab("Pathways") + labs(size = expression(paste("|", S, "|"))) + theme_classic()+
  scale_size_continuous(breaks = c(15, 25, 35, 45))

In this case, we filter out the pathways having less than 10 genes to simply the plot.

fMRI cluster analysis

pARI is particularly useful in functional Magnetic Resonance Imaging cluster analysis, where it is of interest to select a cluster of voxels and to provide a confidence statement on the percentage of truly activated voxels within that cluster, avoiding the well-known spatial specificity paradox.

We analyzed the Auditory data collected by Pernet et al. (2015), i.e., people listening vocal and non-vocal sounds.

First, let download the data from the fMRIdata package:

if (!requireNamespace("fMRIdata", quietly = TRUE)){
  remotes::install_github("angeella/fMRIdata")
}
library(fMRIdata)
data(Auditory_clusterTH3_2)
data(Auditory_copes)
data(Auditory_mask)

We have three ingredients:

  1. The set of copes Auditory_copes as a list of niftiImage objects, one for each subject. The copes represent the 3 dimensional (91 × 109 × 91) contrast map. Each element of the array describes the estimated parameter used in the hypotheses. In this case, the copes represent the statistics maps regarding the contrast that describes the difference of neural activation during vocal and non-vocal stimuli for each participant, computed by FSL. The one-sample t-test is computed for each voxel to analyze the hypothesis of zero mean across the subjects, i.e.,

H0 : μi = 0

where $\mu_i = \sum_{j = 1}^{J} copes_{ji}/J$, where J is the total number of subjects.

  1. The cluster map Auditory_clusterTH3_2 is used as a set of features in pARIbrain. While our method allows any method for forming clusters, we started from a map computed using Random Field Theory (RFT) with a cluster-forming-threshold equalling 3.2.

  2. The brain mask Auditory_mask. In this case, we extract it from the group-level analysis by FSL.

Finally, pARIbrain can be used.

auditory_out <- pARIbrain(copes = Auditory_copes, clusters = Auditory_clusterTH3_2, mask = Auditory_mask, alpha = 0.05, silent = TRUE)
auditory_out$out

For each cluster, pARIbrain returns the lower bounds of the proportion of active voxels, the cluster’s size, the coordinates, and the maximum statistical test value inside the cluster.

Finally, you can also produce the True Discovey Proportion brain map using the map_TDP function.

Citing pARI

If you use the pARI package, please cite the following paper:

Andreella, A., Hemerik, J., Finos, L., Weeda, W., & Goeman, J. (2023). Permutation-based true discovery proportions for functional magnetic resonance imaging cluster analysis. Statistics in Medicine, 42(14), 2311-2340.