The Page test is a non-parametric test for monotonically ordered differences in ranks. It can be used to assess the statistical evidence for an increase in the ordinal ranks between \(k\) ‘treatments’ (conditions or generations), based on \(N\) independent replications for each treatment. The predicted ordering of the treatments \(1 \ldots k\) has to be specified a priori.

Let \(m_i\) be the mean ordinal rank of the measure of interest obtained for treatment \(i\). The null hypothesis of the test (shared with many other tests, e.g. Friedman’s) is then \[ m_1 = m_2 = \ldots = m_k \] i.e. there is no difference between the expected ranks for the \(k\) conditions. For some reason the original formulation of the *alternative* hypothesis being tested given in Page (1963) is \[m_1 > m_2 > \ldots > m_k.\]

Later papers and textbook entries correctly point out that the alternative hypothesis is actually \[m_1 \le m_2 \le \ldots \le m_k\] where *at least one* of the inequalities has to be a strict inequality (Siegel and Castellan 1988; Hollander and Wolfe 1999, 284; Van De Wiel and Di Bucchianico 2001, 143). What this means is that strong evidence for nothing more than a single step-wise change in the mean rank, e.g. \[m_1 < m_2 = \ldots = m_k\] can be sufficient for the test to *reject* the null hypothesis.
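
To see why a single step can be enough, it helps to look at how the test statistic is built. The following is a minimal sketch in base R, assuming the usual definition of Page's statistic, \(L = \sum_j j R_j\) with \(R_j\) the sum of ranks for treatment \(j\), and its large-sample normal approximation. `page_L` and `page_z` are hypothetical helper names for this illustration, not part of `cultevo`, and the small rank matrix is made up.

```r
# Page's L for an N x k matrix of within-row ranks (rows = replications,
# columns = treatments in their predicted order).
page_L <- function(ranks) sum(seq_len(ncol(ranks)) * colSums(ranks))

# Large-sample z score of L under the null hypothesis (no ties assumed).
page_z <- function(ranks) {
  N <- nrow(ranks)
  k <- ncol(ranks)
  mu <- N * k * (k + 1)^2 / 4
  sigma2 <- N * k^2 * (k + 1) * (k^2 - 1) / 144
  (page_L(ranks) - mu) / sqrt(sigma2)
}

# Single step: treatment 1 always ranks lowest, while treatments 2..4 have
# identical mean ranks (every ordering of ranks 2:4 occurs exactly once).
step <- rbind(c(1, 2, 3, 4), c(1, 2, 4, 3), c(1, 3, 2, 4),
              c(1, 3, 4, 2), c(1, 4, 2, 3), c(1, 4, 3, 2))
colMeans(step)  # 1 3 3 3
page_L(step)    # 168, above the null expectation of 150
page_z(step)
```

Even though the last three treatments do not differ at all, the low ranks in the first column alone push \(L\) above its expectation under the null hypothesis.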

As the alternative hypothesis shows, the Page test is not a ‘trend test’ in any meaningful way, since it does not test for successive or cumulative changes in ranks. (Note how the original paper speaks of \(k\) *treatments* rather than generations, i.e. the Page test was not intended for dependent measures.)

It also cannot show whether *differences* between the conditions/generations are significant, since it is a non-parametric test that only considers *ranks*, not absolute changes in the underlying measure. These points will be demonstrated using some semi-randomly generated data sets.
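
Because only the within-replication ranks enter the test, any strictly increasing transformation of the underlying measure leaves the result unchanged. A small base R illustration, using made-up values for a single replication:

```r
# Made-up measurements for one replication across 5 conditions.
x <- c(0.2, 0.21, 0.22, 5, 500)

# Ranks are identical before and after strictly increasing transformations,
# so the size of the jumps (0.01 vs. 495) is invisible to a rank-based test.
rank(x)
identical(rank(x), rank(log(x)))
identical(rank(x), rank(x^3))
```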

To test the sensitivity of the test to a single step-wise difference across conditions we can take a typical sample set of \(N=4\) replications with \(k=10\) levels each and fix the very first position to always be ranked first, with all successive ranks being randomly shuffled. This is equivalent to the first generation doing badly at a task, with all successive generations outperforming the first one, but no cumulative improvement between them.

```
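# make results reproducible
set.seed(1000)

# shuffle the ranks within each argument, then concatenate the groups
# (the length check avoids sample(n) drawing from 1:n for a single number)
pseudorandomranks <- function(...)
  unlist(lapply(list(...), function(p) if (length(p) > 1) sample(p) else p))

# rank 1 always comes first, ranks 2-10 follow in random order
lowestthenrandom <- function()
  pseudorandomranks(1, 2:10)

# example ordering
t(replicate(4, lowestthenrandom()))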
```

```
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    4    8    2    6   10    9    7    3     5
## [2,]    1    4   10    7    3    6    5    2    9     8
## [3,]    1    2    6    3    9    8   10    7    4     5
## [4,]    1    9    2   10    5    8    4    7    3     6
```
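
Averaging the ranks per position over many such replications makes the structure of the generated data explicit: the first position always has rank 1, while every other position has the same expected rank of 6. The sketch below repeats the generator so that it runs on its own; the replication count of 5000 is an arbitrary choice for the illustration.

```r
set.seed(1000)
# same generator as above, repeated so that this block runs on its own
pseudorandomranks <- function(...)
  unlist(lapply(list(...), function(p) if (length(p) > 1) sample(p) else p))
lowestthenrandom <- function() pseudorandomranks(1, 2:10)

# mean rank per position over many simulated replications
ranks <- t(replicate(5000, lowestthenrandom()))
meanranks <- colMeans(ranks)
round(meanranks, 2)
```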

Given this semi-random data generation function, we can now create a large number of data sets and compute their expected distribution of significance levels according to the Page test, and see how it varies based on the number of replications \(N\).

```
library(cultevo)

# run the Page test on nrepetitions generated data sets and return the
# proportion of results falling into each significance level
sampleLs <- function(datafun, nrepetitions) {
  ps <- list("0.001" = 0, "0.01" = 0, "0.05" = 0, "NS" = 0)
  for (i in seq(nrepetitions)) {
    p <- page.test(datafun(), verbose=FALSE)$p.value
    if (p <= 0.001) {
      p <- "0.001"
    } else if (p <= 0.01) {
      p <- "0.01"
    } else if (p <= 0.05) {
      p <- "0.05"
    } else {
      p <- "NS"
    }
    ps[[p]] <- ps[[p]] + 1
  }
  unlist(ps) / nrepetitions
}

# build data sets of N replications (rows) from datafun and pass them to testfun
sampleps <- function(testfun, N, datafun, nrepetitions=1000)
  testfun(function() t(replicate(N, datafun())), nrepetitions)
```
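
The chain of `if`/`else` comparisons in `sampleLs` can also be written with base R's `cut()`. This is only a sketch of an equivalent binning step, with `binpvalues` as a hypothetical helper name; it takes a vector of p values as input rather than calling `page.test` itself.

```r
# Bin p values into the same significance levels as sampleLs and return the
# proportion of values falling into each bin.
binpvalues <- function(p) {
  # intervals are [0, 0.001], (0.001, 0.01], (0.01, 0.05], (0.05, 1]
  bins <- cut(p, breaks = c(0, 0.001, 0.01, 0.05, 1),
              labels = c("0.001", "0.01", "0.05", "NS"),
              include.lowest = TRUE)
  # table() keeps empty bins because `bins` is a factor
  table(bins) / length(p)
}

binpvalues(c(0.0004, 0.001, 0.003, 0.05, 0.2, 0.8))
```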

Generating 1000 datasets like the one above – which really only exhibit a single point change in the distribution of mean ranks – we get a significant result about half of the time:

`sampleps(sampleLs, 4, lowestthenrandom)`

```
## 0.001 0.01 0.05 NS
## 0.043 0.173 0.290 0.494
```

Increasing the number of replications to \(N=10\), while still only assuming that the first generation performs differently from all the others, raises the proportion of significant results to almost 90%:

`sampleps(sampleLs, 10, lowestthenrandom)`

```
## 0.001 0.01 0.05 NS
## 0.261 0.341 0.275 0.123
```

The influence is even stronger when the single change point occurs closer to the middle of the sequence of conditions. Generating 1000 datasets where the first two ranks are always shuffled in the first two positions, followed by ranks 3-10 also shuffled randomly, we obtain the following distribution of p values:

`sampleps(sampleLs, 4, function() pseudorandomranks(1:2, 3:10))`

```
## 0.001 0.01 0.05 NS
## 0.432 0.431 0.127 0.010
```

The test is so sensitive to evidence for a change in the *a priori* suspected direction (even if it is just a single point-wise change) that it is largely unaffected by evidence for a consistent trend in the opposite direction, as can be seen in this data set where more than half of the successive differences between ranks point downwards:

```
# upwards jump from the first three observations to the remaining 7, but the
# remaining 7 exhibit a consistent downwards trend
upwardsjumpdownwardstrend <- function() pseudorandomranks(1:3, 10, 9, 8, 7, 6, 5, 4)
sampleps(sampleLs, 10, upwardsjumpdownwardstrend)
```

```
## 0.001 0.01 0.05 NS
## 0 1 0 0
```
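
Counting the signs of the differences between neighbouring positions shows how little of this pattern is actually 'upward'. The sketch below uses one fixed ordering of the kind the generator above produces (with the first three ranks left unshuffled for simplicity) and compares its contribution to Page's \(L\), assuming the usual definition \(L = \sum_j j R_j\), with the per-replication expectation under the null hypothesis, \(k(k+1)^2/4\).

```r
# one possible row: ranks 1-3 first, then a strictly decreasing sequence
x <- c(1, 2, 3, 10, 9, 8, 7, 6, 5, 4)
k <- length(x)

# signs of the 9 differences between neighbouring positions
steps <- diff(x)
sum(steps > 0)  # 3 upward steps
sum(steps < 0)  # 6 downward steps

# contribution of this row to L vs. its expectation under the null
sum(seq_len(k) * x)  # 329
k * (k + 1)^2 / 4    # 302.5
```

The single large jump places the high ranks at high positions, which is all that the weighted sum rewards; the six consecutive downward steps afterwards are not enough to pull the row's contribution below the null expectation.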

```
# start around the middle, rise slightly, then drop to the lowest ranks,
# followed by an extreme upwards jump
updownup <- function() pseudorandomranks(3:4, 5:6, 1:2, 7:8)
sampleps(sampleLs, 10, updownup)
```

```
## 0.001 0.01 0.05 NS
## 0.983 0.017 0.000 0.000
```

It’s not 1963 anymore, so everybody has a computer, and probably some prior expectations about the development of their (presumably continuous) measure of interest. Will its value rise indefinitely across conditions/generations, or is there a ceiling where it will level out? Do you have an idea of the value at which it will level out? Will it rise linearly between conditions until it hits its maximum? Logarithmically? Exponentially? All of these are specific hypotheses corresponding to specific models that can be fit and then compared based on your data (Winter and Wieling 2016).
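
As a rough illustration of what comparing such hypotheses can look like, the sketch below fits a linear and a logarithmic growth curve to simulated data and compares them by AIC. Everything here is an assumption made for the example: the data are simulated, the variable names are made up, and the models are plain `lm` fits without random effects. Real data with repeated measures would call for the kinds of models discussed by Winter and Wieling (2016).

```r
set.seed(1)
# simulated measure that rises logarithmically across 10 generations,
# with 4 replications (chains) per generation
d <- data.frame(generation = rep(1:10, times = 4))
d$y <- 2 * log(d$generation) + rnorm(nrow(d), sd = 0.2)

# two competing hypotheses about the shape of the increase
linear <- lm(y ~ generation, data = d)
logarithmic <- lm(y ~ log(generation), data = d)

# lower AIC indicates the better trade-off between fit and complexity
AIC(linear, logarithmic)
```

Unlike a rank-based test, this kind of comparison makes a statement about the shape of the change, not just about its direction.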

If you are simply looking for other non-parametric tests for sequential (or otherwise temporally dependent) data, the seasonal Kendall test (Hirsch, Slack, and Smith 1982; Gilbert 1987; Gibbons, Bhaumik, and Aryal 2009) takes seasonal effects on environmental measurements into account by computing the Mann-Kendall test on each of \(k\) seasons/months separately, and then combining the individual test results. Since the order of the individual seasons is not actually taken into account (it only is in a later version of the test, Hirsch and Slack (1984)), the test is essentially a within-subject version that combines the results of \(k\) independent Mann-Kendall tests into one to increase the statistical power (Gibbons, Bhaumik, and Aryal 2009, 211). The test has in fact already been used to test for trends in different geographic sample locations rather than seasons (Helsel and Frans 2006). The seasonal Kendall test’s alternative hypothesis is “a monotone trend in one or more seasons” (Hirsch and Slack 1984, 728).
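
The way the seasonal Kendall test combines per-season results can be sketched in a few lines of base R. This is a minimal illustration that assumes no ties, no missing values and independence between seasons, using the standard Mann-Kendall variance \(n(n-1)(2n+5)/18\) and a continuity correction. `mk_S` and `seasonal_kendall` are hypothetical helper names and the data are simulated; a dedicated, tested implementation should be preferred for real analyses.

```r
# Mann-Kendall S statistic: number of increases minus number of decreases
# over all pairs of observations i < j within one season.
mk_S <- function(x) {
  d <- outer(x, x, function(a, b) b - a)  # d[i, j] = x[j] - x[i]
  sum(sign(d[upper.tri(d)]))
}

# x: matrix with one row per year and one column per season
seasonal_kendall <- function(x) {
  n <- nrow(x)
  S <- sum(apply(x, 2, mk_S))                       # sum of per-season S
  varS <- ncol(x) * n * (n - 1) * (2 * n + 5) / 18  # no ties assumed
  z <- (S - sign(S)) / sqrt(varS)                   # continuity correction
  c(S = S, z = z, p.twosided = 2 * pnorm(-abs(z)))
}

# made-up example: 8 years x 4 seasons, trend only in the first season
set.seed(1)
x <- matrix(rnorm(32), nrow = 8)
x[, 1] <- x[, 1] + 1:8
seasonal_kendall(x)
```

Note that each season is only ever compared with itself across years, which is why the order of the seasons plays no role in the combined statistic.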

This tutorial can be cited as:

```
@techreport{Stadler2017,
  author = {Stadler, Kevin},
  title = {{The Page test is not a trend test}},
  url = {https://kevinstadler.github.io/cultevo/articles/page.test.html},
  year = {2017}
}
```

Gibbons, Robert D, Dulal Bhaumik, and Subhash Aryal. 2009. *Statistical Methods for Groundwater Monitoring*. 2nd ed.

Gilbert, Richard. 1987. *Statistical Methods for Environmental Pollution Monitoring*. John Wiley & Sons, Inc.

Helsel, Dennis R, and Lonna M Frans. 2006. “Regional Kendall test for trend.” *Environmental Science and Technology* 40 (13): 4066–73. doi:10.1021/es051650b.

Hirsch, Robert M, and James R Slack. 1984. “A Nonparametric Trend Test for Seasonal Data With Serial Dependence.” *Water Resources Research* 20 (6): 727–32. doi:10.1029/WR020i006p00727.

Hirsch, Robert M, James R Slack, and Richard A Smith. 1982. “Techniques of trend analysis for monthly water quality data.” *Water Resources Research* 18 (1): 107–21. doi:10.1029/WR018i001p00107.

Hollander, Miles, and Douglas A Wolfe. 1999. *Nonparametric Statistical Methods*.

Page, Ellis Batten. 1963. “Ordered hypotheses for multiple treatments: a significance test for linear ranks.” *Journal of the American Statistical Association* 58: 216–30. doi:10.1080/01621459.1963.10500843.

Siegel, Sidney, and N. John Castellan. 1988. *Nonparametric Statistics for the Behavioral Sciences*. McGraw-Hill.

Van De Wiel, Mark A., and A. Di Bucchianico. 2001. “Fast computation of the exact null distribution of Spearman’s rho and Page’s L statistic for samples with and without ties.” *Journal of Statistical Planning and Inference* 92: 133–45. doi:10.1016/S0378-3758(00)00166-X.

Winter, Bodo, and Martijn Wieling. 2016. “How to analyze linguistic change using mixed models, Growth Curve Analysis and Generalized Additive Modeling.” *Journal of Language Evolution* 1 (1): 7–18. doi:10.1093/jole/lzv003.