Permutation Tests

Author

Dylan Spicker

Permutation Tests

Suppose that we have two samples of data, \(X_1,\dots,X_n\) and \(Y_1,\dots,Y_m\). Under our assumptions we have \(X \sim F\) and \(Y \sim G\). We often wish to understand how the distributions \(F\) and \(G\) compare or are related, without making strong assumptions regarding their parametric structure. That is, we may want to test whether \(F = G\) without assuming that \(F\) and \(G\) are normally distributed. In this sense we are looking for a nonparametric testing procedure.

Suppose that we want to test the null hypothesis \(H_0: F = G\) versus the alternative \(H_A: F \neq G\). A key realization is that, under the null hypothesis, \(\{X_1, \dots, X_n, Y_1, \dots, Y_m\}\) is a sample of size \(N=n+m\) from \(F\). Denote the elements of this combined sample as \(Z_i\), with \(i\) ranging from \(1\) to \(N\). We take \(Z_i = X_i\) for \(i=1,\dots,n\), and then \(Z_i = Y_{i-n}\) for \(i = n+1,\dots,N\). If the null hypothesis is true and \(F = G\), then the behaviour of this partitioned sample should not change if we reshuffle the observations. That is, suppose that we take \(Y^* = \{Z_1,\dots,Z_m\}\) and \(X^* = \{Z_{m+1},\dots,Z_{N}\}\). In this case, since \(F=G\), \(Y^*\) is a sample of size \(m\) from \(G\) and \(X^*\) is a sample of size \(n\) from \(F\), exactly as in the original sample. As a result, for any statistic \(T(X, Y)\) computed on the partitioned data, reordering the data should produce a statistic \(T(X^*,Y^*)\) with the same distribution. That is to say, if \(F = G\) then \(T(X,Y) \sim T(X^*, Y^*)\). If the distributions differ, \(F \neq G\), then with a well chosen statistic \(T(X, Y) \not\sim T(X^*, Y^*)\). This is, informally, the motivation for the permutation test.

The practical idea, which we will spend some time formalizing, is to find a statistic \(T\) that is computed on two samples of data. The statistic can then be computed on the various different permutations of the data, giving \(T(X_1^*, Y_1^*), T(X_2^*, Y_2^*), \dots, T(X_M^*, Y_M^*)\), where \(M = \dbinom{N}{n}\). If \(H_0: F = G\) is true, then each of these \(M\) statistics has been drawn from the same underlying distribution. We can then compare the value calculated on the true sample, \(T(X, Y)\), to the others, and ask how extreme it is relative to the other plausible values that have been computed. If it is extreme in the set of all possible permutations then either we got very unlucky with the specific sample we took, or else \(F \neq G\). We will formalize this idea with some more rigour as we proceed.
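To get a sense of the combinatorics, note that \(M\) grows very quickly with the sample sizes. The following minimal sketch (with arbitrarily chosen sample sizes) computes \(M\) directly.

# The number of distinct ways to label n of the N = n + m points as X
M <- function(n, m) choose(n + m, n)

M(6, 4)   # 210; small enough to enumerate exactly
M(15, 15) # 155,117,520; exact enumeration is impractical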

The Permutation Lemma

Suppose that we take two samples of data, all observations independent of all others, given by \(X_1,\dots,X_n \stackrel{iid}{\sim} F\) and \(Y_1,\dots,Y_m \stackrel{iid}{\sim} G\). Next, suppose that we combine the samples into one, ordered by magnitude. That is, take \(Z = \{X_1, \dots, X_n, Y_1, \dots, Y_m\}\) and then order this set as \(\dot{Z} = \{Z_{(1)} \leq Z_{(2)} \leq \cdots \leq Z_{(N)}\}\), where \(N = n + m\). Next, define \(\nu\) to be a length \(N\) sequence containing the label \(X\) or \(Y\) in each position, based on the elements of \(\dot{Z}\). That is, \(\nu_i = X\) if \(\dot{Z}_i = Z_{(i)}\) came from the sample of \(X\)s, and \(\nu_i = Y\) otherwise. Taken together, \(\dot{Z}\) and \(\nu\) give precisely the same information as \(X\) and \(Y\).
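To make this construction concrete, the following is a minimal sketch (with arbitrary sample sizes) of forming \(\dot{Z}\) and \(\nu\) in R.

set.seed(1)

X <- rnorm(3)
Y <- rnorm(2)

Z <- c(X, Y)
labels <- c(rep("X", 3), rep("Y", 2))

ord <- order(Z)
Zdot <- Z[ord]    # The combined order statistics
nu <- labels[ord] # The label attached to each order statistic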

It is worth interrogating the distribution of the quantity \(\nu\), given the breakdown of labels, \(n\) and \(m\). To do so, let us first consider a more straightforward problem which consists of a single iid sample. If \(W_1, \dots, W_\ell\) are iid from some unknown distribution, and we consider the order statistics \(W_{(1)} \leq W_{(2)} \leq \cdots \leq W_{(\ell)}\), we can define \(\eta\) to be the order of indices from this sample. That is, if \(W_{(i)} = W_j\) then \(\eta_i = j\). There are \(\ell!\) total permutations for \(\eta\), and we can ask what the likelihood of observing any one of those permutations is. Since each element of the set is independent and identically distributed, no element can be any more likely than any other to occupy any spot; if one were, that would violate the assumption that they are iid. Hence, each of the \(\ell!\) permutations is equally likely. We can extend this same reasoning to our two sample problem if we assume that the null hypothesis, \(H_0: F=G\), holds.

In this case the combined sample can be seen as an iid sample from \(F\) (or \(G\)), and the indices must be equally likely to take on any permutation. A permutation of the indices directly translates to a permutation of the class labels, where each \(i \leq n\) receives an \(X\) and each \(i > n\) receives a \(Y\). Thus, the probability of observing any particular label pattern, \(\pi([N])\), is the sum of the probabilities of all orderings \(\eta\) that are compatible with \(\nu = \pi([N])\). Each label pattern corresponds to \(n!m!\) permutations of \(\eta\) (the first \(X\) position can hold any of the \(n\) \(X\) indices, the second any of the remaining \(n-1\), and so forth, and likewise for the \(Y\)s), and as a result \[P(\nu = \pi[N]) = \frac{n!m!}{N!} = \frac{1}{\dbinom{N}{n}}.\]

That is to say, under the null hypothesis \(H_0: F = G\), every permutation for \(\nu\) is equally likely. This result has been referred to as the permutation lemma. The content of the permutation lemma is that, if the null hypothesis really is accurate, then every possible assignment of the labels to \(\dot{Z}\) is equally likely.
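A quick Monte Carlo check of the permutation lemma, in a deliberately tiny configuration (\(n = 2\), \(m = 1\), so that there are \(\binom{3}{2} = 3\) possible label patterns), might look as follows.

set.seed(31415)

labels <- c("X", "X", "Y")

patterns <- replicate(10000, {
  Z <- rnorm(3) # A single combined iid sample, so F = G holds by construction
  paste(labels[order(Z)], collapse = "")
})

# Each of XXY, XYX, and YXX should occur with frequency near 1/3
table(patterns) / 10000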

Using the Permutation Lemma for Permutation Tests

With the knowledge that, under the null hypothesis of \(F = G\), any permutation is equally likely, we can construct a test for this hypothesis. Specifically, suppose that we have any statistic, \(T(X,Y)\), which measures the similarity of the distributions of \(X\) and \(Y\). For instance, we may take \[T = \left|\frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{m}\sum_{i=1}^m Y_i\right|,\] or similar. We will assume (without loss of generality) that higher values for \(T\) represent stronger evidence against the null hypothesis. Because \((\dot{Z},\nu)\) contain the same information as \((X,Y)\), it must be the case that we can compute \(T\) through \(T(\dot{Z},\nu)\). We can then ask: conditional on \(\dot{Z} = Z^*\), with \(\nu\) left random, what is the probability, were the null hypothesis true, of observing a value greater than some threshold? That is, \[\begin{align*} P(T(Z^*,\nu) \geq t^* | \dot{Z} = Z^*) &= \sum_{i=1}^M P(T(Z^*,\pi_i([N])) \geq t^*, \nu=\pi_i([N]) | \dot{Z} = Z^*) \\ &= \sum_{i=1}^M I(T(Z^*,\pi_i([N])) \geq t^*)P(\nu=\pi_i([N]) | \dot{Z} = Z^*) \\ &= \sum_{i=1}^M I(T(Z^*,\pi_i([N])) \geq t^*)\frac{1}{M} \\ &= \frac{\#\{i \mid T_i \geq t^*\}}{M}, \end{align*}\] where \(T_i = T(Z^*, \pi_i([N]))\). This probability is interpreted as the probability of observing a value for our statistic at least as extreme as \(t^*\), assuming that the null hypothesis is true, given the set of values that we have observed.

As a result, if we take \(t^* = T(Z^*, \nu^*)\), where \(\nu^*\) is the observed value for \(\nu\), then taking \[p = \frac{\#\{i\mid T_i \geq t^*\}}{M},\] we can interpret \(p\) as a conditional p-value for the null hypothesis \(H_0: F = G\). If \(p\) is sufficiently small then, given the observed values for \(\dot{Z} = Z^*\), it is very unlikely that we would have observed a statistic as extreme as the one we did. Conversely, if the data are actually drawn from the same distribution we should expect \(p\) to be larger.

set.seed(31415)

# Define a statistic that measures distance
t_stat <- function(data, perm) {
  vals <- unique(perm)
  abs(mean(data[perm == vals[1]]) - mean(data[perm == vals[2]]))
}

# Generate two different datasets
n <- 6
m <- 4 

X <- rnorm(n)
Y1 <- rnorm(m)
Y2 <- rexp(m, rate = 1/4)

# Generate the various permutations of the labels.
# Note: this enumerates all (n+m)! orderings and then de-duplicates,
# leaving the choose(n+m, n) = 210 distinct label arrangements.
perms <- unique(gtools::permutations(n+m, 
                                     n+m,
                                     v=c(rep("X",n), rep("Y",m)), 
                                     set=FALSE, 
                                     repeats.allowed = FALSE))

# H0: F_X = F_Y1
Zstar1 <- sort(c(X,Y1))

# H0: F_X = F_Y2
Zstar2 <- sort(c(X, Y2))

stats <- matrix(nrow = nrow(perms), ncol = 2)

# Loop through the various permutations
for(ii in 1:nrow(stats)) {
  current_perm <- as.character(perms[ii,])
  stats[ii,] <- c(t_stat(Zstar1, current_perm), t_stat(Zstar2, current_perm))
}

# Plot out the data
matplot(y = stats, 
        x = matrix(c(rep(1, nrow(stats)),
                     rep(2, nrow(stats))),
                   nrow = nrow(stats)),
        pch = 19,
        ylab = "Estimated Value",
        xaxt = 'n', 
        xlim = c(0, 3),
        xlab = 'Test (#1 and #2)')

t1 <- t_stat(c(X,Y1), c(rep("X", n), rep("Y", m)))
t2 <- t_stat(c(X,Y2), c(rep("X", n), rep("Y", m)))

abline(h = t1, lty = 3, col = 'black')
abline(h = t2, lty = 3, col = 'red')

The empirical p-values are 0.4952381 and 0.0190476, respectively. This suggests (correctly) that \(X \sim Y_1\) and \(X \not\sim Y_2\).
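These values can be recovered from the stored permutation statistics; for instance, the following (a small addition to the script above) computes both.

# Exact permutation p-values: the proportion of label arrangements whose
# statistic is at least as extreme as the observed one (which is itself
# included among the 210 arrangements)
p1 <- mean(stats[, 1] >= t1)
p2 <- mean(stats[, 2] >= t2)
c(p1, p2)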

Permutation Tests in Practice

While the theory of permutation tests depends on being able to work out the exact probability under the permutation (conditional null) distribution, this becomes infeasible in practice at even moderate sample sizes. Note that if we had two samples, each of size \(15\), there would be a total of \(M = 155,117,520\) possible permutations. While this is not quite computationally infeasible yet, it rapidly becomes so. Instead of exact permutation tests we tend to use approximate permutation tests, where we sample a set number of random permutations. This is analogous to bootstrap resampling; importantly, however, this resampling occurs without replacement. We arbitrarily sample \(n\) points to be labelled as \(X\), and the other \(m\) are assigned \(Y\). We then repeat this for some large number of resamples, \(R\), to approximate the procedure outlined above. When we do this, we should always make sure to include the statistic computed on our actual sample at least once, to avoid a p-value of \(0\). That is, if \(T_i\) are the computed test statistics for \(i = 1,\dots,R\), then \[\widehat{p} = \frac{1 + \#\{i \mid T_i \geq T^*\}}{1 + R}.\] Larger values of \(R\) are typically preferable (though \(R\) never needs to exceed \(M\)), and some sources suggest taking \(R\) between \(99\) and \(999\). As with most things in statistics, this is a rule of thumb rather than an exact recommendation.

set.seed(31415)

# Define a statistic that measures distance
t_stat <- function(X, Y) {
  abs(mean(X) - mean(Y))
}

# Generate two different data sets
R <- 1000
N <- 1000
n <- 600
m <- N-n

X <- rnorm(n)
Y1 <- rnorm(m)
Y2 <- rexp(m, rate = 6)

# H0: F_X = F_Y1
Zstar1 <- sort(c(X,Y1))

# H0: F_X = F_Y2
Zstar2 <- sort(c(X, Y2))

stats <- matrix(nrow = R+1, ncol = 2)

# Loop through the various permutations
for(ii in 1:R) {
  X_idx <- sample(1:N, n, FALSE)
  stats[ii,] <- c(t_stat(Zstar1[X_idx], Zstar1[-X_idx]), 
                  t_stat(Zstar2[X_idx], Zstar2[-X_idx]))
}


t1 <- t_stat(X, Y1)
t2 <- t_stat(X, Y2)

stats[R+1,] <- c(t1, t2)

# Plot out the data
matplot(y = stats, 
        x = matrix(c(rep(1, nrow(stats)),
                     rep(2, nrow(stats))),
                   nrow = nrow(stats)),
        pch = 19,
        ylab = "Estimated Value",
        xaxt = 'n', 
        xlim = c(0, 3),
        xlab = 'Test (#1 and #2)')


abline(h = t1, lty = 3, col = 'black')
abline(h = t2, lty = 3, col = 'red')

The empirical p-values are 0.9110889 and 0.002997, respectively. This suggests (correctly) that \(X \sim Y_1\) and \(X \not\sim Y_2\).
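These values can be computed directly from the matrix of statistics; since the observed statistic occupies row \(R+1\), the \(+1\) in the numerator and denominator is handled automatically.

# Approximate permutation p-values, with the observed statistic included once
p1 <- mean(stats[, 1] >= t1)
p2 <- mean(stats[, 2] >= t2)
c(p1, p2)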

Choosing a Statistic for the Permutation Test

The statistic, \(T\), that is selected will change the behaviour and the results of the permutation test when it is run. When using the absolute mean difference, as we have been, it is worth noting that if the means of the distributions are equal we should expect comparatively small values of the statistic, and hence little power, even if the distributions themselves are not equal. This problem would be amplified if the variances of the distributions were also equal to one another. Consider the following examples.

set.seed(31415)

# Define a statistic that measures distance
t_stat <- function(X, Y) {
  abs(mean(X) - mean(Y))
}

# Generate three different datasets
# X ~ N(1, 1)
# Y ~ Exp(1)
# W ~ N(1, 25)
# Each will have 500 data points generated.

R <- 5000
N <- 1000
n1 <- 500
n2 <- 500
n3 <- 500

X <- rnorm(n1, 1, 1)
Y <- rexp(n2, 1)
W <- rnorm(n3, 1, 5)

# Generate a matrix for each pairwise comparison
stats <- matrix(nrow = R+1, ncol = 3)

# Generate the sorted lists
Zstar1 <- sort(c(X,Y))
Zstar2 <- sort(c(X,W))
Zstar3 <- sort(c(W,Y))

# Loop through the various permutations
for(ii in 1:R) {
  X_idx <- sample(1:N, n1, FALSE)
  stats[ii,] <- c(t_stat(Zstar1[X_idx], Zstar1[-X_idx]), 
                  t_stat(Zstar2[X_idx], Zstar2[-X_idx]),
                  t_stat(Zstar3[X_idx], Zstar3[-X_idx]))
}


t1 <- t_stat(X, Y)
t2 <- t_stat(X, W)
t3 <- t_stat(W, Y)

stats[R+1,] <- c(t1, t2, t3)

# Plot out the data
matplot(y = stats, 
        x = matrix(c(rep(1, nrow(stats)),
                     rep(2, nrow(stats)),
                     rep(3, nrow(stats))),
                   nrow = nrow(stats)),
        pch = 19,
        ylab = "Estimated Value",
        xaxt = 'n', 
        xlim = c(0, 4),
        xlab = 'Test (#1, #2, and #3)')


abline(h = t1, lty = 3, col = 'black')
abline(h = t2, lty = 3, col = 'red')
abline(h = t3, lty = 3, col = 'green')

The p-values are, respectively, 0.8554956, 0.790242, and 0.9300807. Despite the fact that these three distributions differ, our statistic provides no evidence of it. That is because the mean difference does not differentiate well between large samples with equal means, even when the variances differ substantially between the samples. While the permutation test will correctly control the Type I error probability (\(\alpha\)), it will not provide a particularly powerful test if \(T\) is not chosen well.1 Instead, if our hope is to determine whether there are distributional differences, even in these cases, we should select a statistic which is better able to distinguish between the distributional behaviour of the samples.

Knowing that the ECDF is a good estimator for the CDF of a distribution, and knowing that CDFs uniquely define distributions, one thought may be to compare the CDFs of the distributions instead. Since they are functions, there is no single way to compare them; however, the maximum distance between the two, or the average distance between the two, are each reasonable-sounding selections. That is, we can take \[T(\dot{Z}, \nu) = \sup_{x\in\mathbb{R}}\left|\frac{1}{n}\sum_{i : \nu_i = X} I(\dot{Z}_i \leq x) - \frac{1}{m}\sum_{i : \nu_i = Y} I(\dot{Z}_i \leq x)\right|.\] Doing this renders the permutation test described above the two sample Kolmogorov-Smirnov (KS) test. We can implement this statistic in the following function.

ks.statistic <- function(labels) {
  # NOTE: labels are assumed to be in the order of the sorted, combined sample
  vals <- unique(labels)
  v1 <- vals[1]
  v2 <- vals[2]
  
  n <- sum(labels == v1)
  m <- length(labels) - n
  
  # Initialize Values 
  CF1 <- 0
  CF2 <- 0
  cur_max <- 0
  
  # Walk through possible jumps
  for (l in labels) {
    if(l == v1) {
      CF1 <- CF1 + 1/n
    } else {
      CF2 <- CF2 + 1/m
    }
    
    if(abs(CF1 - CF2) > cur_max) {
      cur_max <- abs(CF1 - CF2)
    }
  }
  
  cur_max
}
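As a quick sanity check (a small sketch, assuming continuous data with no ties), this walk-through should reproduce the D statistic reported by the built-in ks.test.

set.seed(1)

A <- rnorm(8)
B <- rnorm(5)

# Labels arranged in the order of the sorted, combined sample
lab <- c(rep("A", 8), rep("B", 5))[order(c(A, B))]

ks.statistic(lab)       # Hand-rolled version
ks.test(A, B)$statistic # Built-in two-sample D statistic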
set.seed(31415)

# Generate three different datasets
# X ~ N(1, 1)
# Y ~ Exp(1)
# W ~ N(1, 25)
# Each will have 500 data points generated.

R <- 5000
N <- 1000
n1 <- 500
n2 <- 500
n3 <- 500

X <- rnorm(n1, 1, 1)
Y <- rexp(n2, 1)
W <- rnorm(n3, 1, 5)

names(X) <- rep("X", n1)
names(Y) <- rep("Y", n2)
names(W) <- rep("W", n3)

# Generate a single null distribution, shared across all three comparisons;
# the KS statistic depends only on the label pattern (not the data values),
# so under H0 its permutation distribution is the same for each pair.
null_dist <- vector(mode="numeric", length = R)

# Loop through the various permutations
for(ii in 1:R) {
  X_idx <- sample(1:N, n1, FALSE)
  
  full_names <- rep("Y", N)
  full_names[X_idx] <- "X"
  
  null_dist[ii] <- ks.statistic(full_names)
}

# Generate the sorted lists
Zstar1 <- sort(c(X,Y))
Zstar2 <- sort(c(X,W))
Zstar3 <- sort(c(W,Y))

t1 <- ks.statistic(names(Zstar1))
t2 <- ks.statistic(names(Zstar2))
t3 <- ks.statistic(names(Zstar3))

# Plot out the data
plot(y = null_dist, 
     x = rep(1, R),
     xaxt = 'n',
     ylab = "Estimated KS Statistic",
     xlab = "",
     pch = 19,
     ylim = c(0, max(c(t1, t2, t3, null_dist))))

abline(h = t1, lty = 3, col = 'black')
abline(h = t2, lty = 3, col = 'red')
abline(h = t3, lty = 3, col = 'green')

In this case, all three distributions are shown to be substantially different from one another, and the p-value in each case will be \(\dfrac{1}{5001}\), since no values generated under the null hypothesis exceeded the computed statistics. In order to move off of \(p = \dfrac{1}{R+1}\) we would need a substantially larger \(R\). For instance, we can regenerate the null list with \(R = 500,000\).

set.seed(31415)
R <- 500000

# Regenerate the (shared) null distribution with a larger R; as before, it
# depends only on the label pattern, and so applies to all three comparisons.
null_dist <- vector(mode="numeric", length = R)

# Loop through the various permutations
for(ii in 1:R) {
  X_idx <- sample(1:N, n1, FALSE)
  
  full_names <- rep("Y", N)
  full_names[X_idx] <- "X"
  
  null_dist[ii] <- ks.statistic(full_names)
}
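The number of null draws exceeding each observed statistic can then be tallied directly (a small addition to the script above).

# Count the permuted statistics strictly greater than each observed statistic
c(sum(null_dist > t1), sum(null_dist > t2), sum(null_dist > t3))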

Using this we would have 16, 0, and 0 observations in the null (permutation) distribution which are greater than the observed statistics for \(X\) versus \(Y\), \(X\) versus \(W\), and \(W\) versus \(Y\), respectively. Note that in R the KS test is built in as ks.test. We can run the three tests as:

ks.test(X, Y)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  X and Y
D = 0.148, p-value = 3.505e-05
alternative hypothesis: two-sided
ks.test(X, W)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  X and W
D = 0.318, p-value < 2.2e-16
alternative hypothesis: two-sided
ks.test(W, Y)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  W and Y
D = 0.42, p-value < 2.2e-16
alternative hypothesis: two-sided

Permutation Tests for Independence

Suppose that we have bivariate samples of data, \((X_i, Y_i) \stackrel{iid}{\sim} F_{X,Y}\) for \(i=1,\dots,n\). It is often of interest for us to test whether \(X\perp Y\). Note that independence is defined through a factorization of the distribution functions, which is to say if \(X\perp Y\) then \(F_{X,Y} = F_XF_Y\). As a result, we can test the hypothesis \(H_0: F_{X,Y} = F_XF_Y\) versus the alternative \(H_A: F_{X,Y} \neq F_XF_Y\).

If \(X\) and \(Y\) are independent, then reshuffling the \(Y_i\)s, so that we form pairs \((X_i, Y_{\pi(i)})\) for some permutation \(\pi\), leaves the joint distribution of the sample identical to that of the original. This motivates a permutation test in bivariate samples for the independence of two or more factors. Namely, suppose that we have a statistic, \(T(X,Y)\), which is some measure of dependence between \(X\) and \(Y\), such that smaller values are expected when \(X \perp Y\). Then, we can compute \(t^* = T(X, Y)\) based on the true observations. Then, for some large number of replicates2, \(R\), we can compute \(T_i = T(X, Y(\pi_i))\), where \(Y(\pi_i)\) refers to a random permutation of the data for \(Y\). If we then calculate \[\widehat{p} = \frac{1+\#\{i \mid T_i \geq t^*\}}{R + 1},\] this will give an approximate p-value for testing \(H_0\).

Choosing a statistic is an important consideration for implementing a usable version of the test. One thought might be to try a similar statistic as the Kolmogorov-Smirnov test, namely taking \[T(X,Y) = \sup_{(x,y)} \left|\widehat{F}_{X,Y}(x,y) - \widehat{F}_X(x)\widehat{F}_Y(y)\right|.\] This is computationally more challenging than in the one dimensional case. The package robusTest implements the bivariate empirical CDF.

mv.ks.statistic <- function(X, Y) {
  n <- length(X)
  
  # The matrix under the assumption of independence is simply the 
  # product of [i/n][j/n] for the (i,j)-th entry of the matrix
  indep_mat <- as.matrix((1:n)/n) %*% t(1:n/n)
  
  # Return the maximum difference
  max(abs(robusTest::ecdf2D(X, Y)$ecdf - indep_mat))
}

set.seed(31415) 

# Generate X and Y dependent; X and W independent
R <- 10000
n <- 100
X <- rnorm(n)
Y <- 4*X + rnorm(n, 0, 3)
W <- rexp(n)

# Compute the statistics
t1 <- mv.ks.statistic(X, Y)
t2 <- mv.ks.statistic(X, W)

results_mat <- matrix(nrow = R, ncol = 2)

for(ii in 1:R) {
  Xs <- X[sample(1:n, n)]
  results_mat[ii, ] <- c(mv.ks.statistic(Xs, Y),
                         mv.ks.statistic(Xs, W))
}

# Plot the results
matplot(y = results_mat, 
        x = matrix(c(rep(1, nrow(results_mat)),
                     rep(2, nrow(results_mat))),
                   nrow = nrow(results_mat)),
        pch = 19,
        ylab = "Estimated Value",
        xaxt = 'n', 
        xlim = c(0, 3),
        ylim = c(0, max(c(results_mat, t1, t2))),
        xlab = 'Test (#1 and #2)')


abline(h = t1, lty = 3, col = 'black')
abline(h = t2, lty = 3, col = 'red')

For \(X\) versus \(Y\) and \(X\) versus \(W\), respectively, we observe 0 and 1126 observations more extreme than those computed on the sample. This corresponds to estimated p-values of \(\dfrac{1}{R+1} = 0.00009999\) and \(\dfrac{1127}{R+1} = 0.11268873112\), respectively. Note, the robusTest package implements this test as the function indeptest. Calling this gives roughly the same results.

robusTest::indeptest(X,Y)

Robust independence test for two continuous variables

t = 1.85, p-value <1e-4
robusTest::indeptest(X,W)

Robust independence test for two continuous variables

t = 0.72, p-value = 0.1156

Permutation Tests for Multiple Samples and for other Hypotheses

The general logic of the permutation test holds under any null hypothesis that encodes the exchangeability of the data. We say that data are exchangeable if the joint distribution of the random variables does not change when you reorder them. That is, \(X_1,X_2,X_3\) must share a distribution with \(X_3, X_1, X_2\).3 Exchangeability is closely tied to the concept of iid random variables: every sequence of iid random variables is exchangeable (though exchangeable random variables need not be iid).

This is an important consideration if, for instance, instead of testing \(H_0: F = G\) you wish to test \(H_0: \mu_X = \mu_Y\) using permutation tests. In this case, under the null hypothesis alone, the permutations are not all equally likely; in terms of the preceding discussion, the permutation lemma no longer applies. To understand this concretely, suppose that we have \(X \sim N(0, 1)\) and \(Y \sim N(0, 100)\), with \(n = m = 5\). Next suppose that we observe order statistics of \[-9.10, -1.42, -1.14, -1.09, -0.33, -0.19, 1.24, 9.01, 12.66, 16.29.\] Do you think that every possible labeling of these data is equally likely?

While the values between \(-1.42\) and \(1.24\) could reasonably be from either \(X\) or \(Y\), the values \(\{-9.10, 9.01, 12.66, 16.29\}\) are far more likely to be from \(Y\) than from \(X\). As a result, it is quite evident that not every labeling is going to be equally likely for these data, and the permutation lemma will not apply. In this sense, \(H_0: \mu_X = \mu_Y\) cannot be tested without (at least) further assuming that \(\sigma_X^2 = \sigma_Y^2\). This means that, for testing location, permutation tests require the same types of assumptions as parametric tests do.
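A small simulation sketch illustrates this failure; the distributions and sample sizes here are arbitrary choices, made to exaggerate the effect. With equal means but very different variances and unbalanced samples, the permutation test for the mean difference rejects far more often than the nominal \(5\%\).

set.seed(31415)

n <- 5   # Small sample with a large variance
m <- 50  # Large sample with a small variance
R <- 199
n_sim <- 1000
rejections <- 0

for (s in 1:n_sim) {
  X <- rnorm(n, 0, 10)
  Y <- rnorm(m, 0, 1) # Equal means, so H0: mu_X = mu_Y is true
  Z <- c(X, Y)
  
  t_obs <- abs(mean(X) - mean(Y))
  t_perm <- replicate(R, {
    idx <- sample(n + m, n)
    abs(mean(Z[idx]) - mean(Z[-idx]))
  })
  
  p_hat <- (1 + sum(t_perm >= t_obs)) / (R + 1)
  if (p_hat <= 0.05) rejections <- rejections + 1
}

# Estimated Type I error; well above 0.05 in this configuration
rejections / n_sim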

Footnotes

  1. Recall that the power of a test is the probability that the null hypothesis is rejected. We want this to be equal to exactly \(\alpha\) when the null hypothesis is true, however, we want the power to be as large as possible otherwise. If we think of a test that simply generates \(Y \sim \text{Unif}(0,1)\) and rejects if \(Y \leq 0.05\) and fails to reject otherwise, then if the null is true we will reject with probability \(0.05\). However, if the null is untrue, we will also reject with probability \(0.05\). This is a test without any power. Choosing a bad statistic for detecting mean differences is analogous to this test. If you imagine a statistic that is computed that gives nearly no information about the distribution, the permutation test will be analogous to the naive test described. While the absolute mean difference is better than an irrelevant statistic, there are plenty of distributions which are not at all similar, but which share a mean. We could improve the absolute difference by considering something that, for instance, also accounts for the variance, but then we would fail to detect differences between distributions which have equal means and variances but otherwise differ. It is preferable to think of a statistic which may summarize the complete distributional behaviour.↩︎

  2. In the event that \(n\) is small, it may be possible to compute the exact p-value by considering all possible permutations. In this case there will be \(n!\) total permutations, and so computational feasibility tops out fairly quickly.↩︎

  3. Any sequence of non-identically distributed random variables will be, in general, not exchangeable. Consider, for instance, \(X_1 = 1\) with probability \(1\) and \(X_2 = 2\) with probability \(1\). Then \(P_{X_1,X_2}(1,2) = 1\) while \(P_{X_2,X_1}(1,2) = 0\). Thus, you cannot exchange the order here.↩︎