Bayesian mixed effects (aka multi-level) ordinal regression models with brms

21 Feb 2017

In the past two years I’ve found myself doing lots of statistical analyses on ordinal response data from a (Likert-scale) dialectology questionnaire. I’ve ended up with a good pipeline to run and compare many ordinal regression models with random effects in a Bayesian way using the handy R formula interface in the brms package.

In defense of ordinal regression

In (applied statistical) practice, ordinal data is often simply fit using linear regression (this seems to be particularly true in the contemporary, quantitative grammaticality-judgment-based syntax literature). While treating ordinal responses as continuous measures is in principle always wrong (because the scale is definitely not ratio), in practice it can be ok to apply linear regression, as long as it is reasonable to assume that the scale can be treated as interval data (i.e. the distances between individual response categories are meaningful/comparable). Some situations, however, completely rule out the use of linear methods.

If you are happy with simply predicting something like a mean response which at best doesn’t match the actual response scale (whether the predicted response falls outside the actual response scale or just between categories), and is at worst inaccurate because it does not take into account any uneven distribution of responses across the ordinal categories, that’s ok too. But the most reliable and insightful way of checking the quality/fit of a model is visual inspection of the model predictions, which is why I always want to be able to produce meaningful (ordinal distribution) predictions to match against the original data.

Ordinal regression methods are typically generalisations of methods used for modelling categorical (in the minimal case binary outcome) data. The most frequently used ordinal regression, ordered logistic (or, more accurately, ordered logit) regression, is an extension of logistic/logit regression: where logistic regression models one coefficient that captures the relative likelihood (in log-odds) of one outcome occurring over another (i.e. 2 outcomes captured by 1 coefficient), ordered logit regression models the relative likelihood of k different outcomes based on k-1 cumulative intercepts. These intercepts capture the cumulative likelihood of responses falling into an expanding set of ordered response categories; e.g. in the case of 4 possible outcome categories (o1-o4), the three intercepts of the model, each of which is at least as high as the preceding one, capture the baseline log-odds of observing:

logit P(y ≤ o1) = log( P(o1)               / (P(o2)+P(o3)+P(o4)) )
logit P(y ≤ o2) = log( (P(o1)+P(o2))       / (P(o3)+P(o4)) )
logit P(y ≤ o3) = log( (P(o1)+P(o2)+P(o3)) / P(o4) )

As a consequence of this representation of the underlying probabilities, any log-odds coefficient added to the linear model multiplies all of the modelled odds by the same factor, i.e. the strength of a predictor’s effect on the modelled odds ratios is proportional to the baseline odds. This is the proportional odds assumption of ordered logit regression.
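
As a quick sanity check of this cumulative representation, here is a minimal sketch in plain R (the intercept and coefficient values are made up for illustration): per-category probabilities are just successive differences of the inverse-logit-transformed intercepts, and a predictor coefficient shifts all cumulative thresholds by the same amount.

```r
# hypothetical ordered intercepts for a 4-category response scale
intercepts <- c(-2, -0.5, 1)
# cumulative probabilities P(y <= o1), P(y <= o2), P(y <= o3) at baseline
cumprobs <- plogis(intercepts)
# per-category probabilities are successive differences of the cumulative ones
probs <- diff(c(0, cumprobs, 1))
sum(probs)  # the four category probabilities sum to 1

# a predictor coefficient (in log-odds) shifts *all* cumulative thresholds
# by the same amount -- this is exactly the proportional odds assumption
beta <- 0.8
probs_shifted <- diff(c(0, plogis(intercepts - beta), 1))
```

Note that a single positive beta moves probability mass towards the higher response categories across the whole scale at once, rather than affecting individual categories independently.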

The formulation and modelling in log-odds that results from the logit transformation is specific to ordered logit regression; however, several other methods for modelling binary responses (such as probit models) also have ordinal counterparts (e.g. ordered probit) in which the coefficients and cumulative intercepts are interpreted in different ways.

So while the idea of representing the cumulative likelihood of an increasing pool of ordinal responses is a general one, there are several possible formats in which those cumulative probabilities can be represented and modelled, which is where link functions come into play. The role of a link function is simply to transform the probability or odds coefficients from the probability space [0,1] to the full linear scale (-Inf, Inf) in such a way that the transformed coefficients can be modelled using linear regression in the best possible (read: most accurate/useful) way.
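
In base R, the logit link and its inverse are simply qlogis() and plogis(); a quick round trip shows how the link maps the bounded probability scale onto the unbounded linear scale and back:

```r
p <- c(0.1, 0.5, 0.9)
eta <- qlogis(p)    # link: probabilities [0,1] -> unbounded linear scale
eta                 # note the midpoint 0.5 maps to 0 on the linear scale
# the inverse link recovers the original probabilities
stopifnot(isTRUE(all.equal(plogis(eta), p)))
```

Other links work the same way with different distribution functions, e.g. qnorm()/pnorm() for the probit link.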

Different statistical packages support different link families, for example the ordinal package (which offers ordinal regression with one random effect) supports the cumulative links “logit”, “probit”, “cloglog”, “loglog” and “cauchit”, while brms (full-on Bayesian multi-level modelling) supports “logit”, “probit”, “probit_approx”, “cloglog” and “cauchit”. The choice of link function is typically not critical and most methods assume the “logit” function (the log-odds transformation that forms the basis of ordered logit regression) by default, but a different choice can be informed by your knowledge of the data.

The most appropriate or useful link function given a particular data set and model of it can be determined from the data by building models with several different link functions and selecting the one that yields the best (i.e. most predictive) fit, as measured by the models’ log-likelihood. While this is quite expensive to do with Bayesian models, here is a simple function making use of the clm function from the ordinal package to quickly fit several possible link functions and determine the one producing the best fit for a given model (note that this does not take any random effects into account; results in Bayesian models might vary even more):

cumulativemodelfit <- function(formula, data, links=c("logit", "probit", "cloglog", "cauchit"),
    thresholds=c("flexible", "equidistant"), verbose=TRUE) {
  names(links) <- links
  names(thresholds) <- thresholds
  # matrix of log-likelihoods: one row per link, one column per threshold type
  llks <- outer(links, thresholds,
    Vectorize(function(link, threshold)
      # catch errors, e.g. for responses with only 2 levels
      tryCatch(ordinal::clm(formula, data=data, link=link, threshold=threshold)$logLik,
        error = function(e) NA)))
  print(llks)
  if (verbose) {
    bestfit <- which.max(llks)  # linear (column-major) index into llks
    bestlink <- links[(bestfit - 1) %% length(links) + 1]
    bestthreshold <- thresholds[(bestfit - 1) %/% length(links) + 1]
    cat("\nThe best link function is ", bestlink, " with a ", bestthreshold,
      " threshold (logLik ", llks[bestfit], ")\n", sep="")
  }
  invisible(llks)
}

The threshold argument of clm() (and of brm() below) can be used to apply additional constraints to the spacing of the model’s intercepts, but you’ll typically want to leave it at “flexible”, the most general default.

Before creating a full-on Bayesian model below, we can already create some simple models to inform our choice of link function for later:

cumulativemodelfit(self ~ 1, data=d)
##          flexible equidistant
## logit   -306.3613   -307.6126
## probit  -306.3613   -310.1114
## cloglog -306.3613   -306.9339
## cauchit -306.3632   -311.3471
## 
## The best link function is logit with a flexible threshold (logLik -306.3613)

Running a model in brms

While running Bayesian models using brms can be slightly more time-consuming than other R packages (because the Stan models have to be compiled first), its neat lmer()-like formula interface makes it easy to create a large number of models with different sets of predictors which can then be compared. This maximally transparent way of presenting statistical results is typical of fields such as sociolinguistics, where you don’t just present your final model but also dwell on insights gained along the way, e.g. by discovering that the addition of certain coefficients does not increase the predictiveness of the model.

When no priors over the model’s parameters are supplied explicitly via the prior argument, uninformative priors are assumed.
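If you do want to constrain the coefficients rather than rely on the defaults, priors can be supplied via that prior argument using set_prior(); the normal(0, 5) scale below is just an illustrative choice for a weakly informative prior, not a recommendation:

```r
library(brms)
# a weakly informative prior on all population-level ("b") coefficients;
# the normal(0, 5) scale here is an illustrative choice, not a recommendation
weakprior <- set_prior("normal(0, 5)", class = "b")
# passed to any of the model fits below, e.g.:
# agemdl <- brm(self ~ age + (1|id), data=d, family=cumulative("logit"),
#               prior=weakprior)
```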

library(brms)
# call this once to distribute chains across cpu cores:
#options(mc.cores=parallel::detectCores())

# this will take some time
nullmdl <- brm(self ~ (1|id), data=d, family=cumulative("logit"), threshold="flexible")
gendermdl <- brm(self ~ gender + (1|id), data=d, family=cumulative("logit"), threshold="flexible")
agemdl <- brm(self ~ age + (1|id), data=d, family=cumulative("logit"), threshold="flexible")

The link function fit above suggested that the logit link function with flexible thresholds is a good choice – both of these are the defaults for the cumulative() family and are only passed explicitly for the sake of clarity. After fitting, each model can be inspected in the usual ways using summary(), etc.

##  Family: cumulative (logit) 
## Formula: self ~ (1 | id) 
##          disc = 1
##    Data: d (Number of observations: 292) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Group-Level Effects: 
## ~id (Number of levels: 77) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     0.82      0.23     0.35     1.27       1087 1.01
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept[1]    -4.19      0.45    -5.15    -3.36       2965    1
## Intercept[2]    -3.21      0.32    -3.87    -2.61       2692    1
## Intercept[3]    -2.07      0.23    -2.56    -1.64       2947    1
## Intercept[4]    -0.54      0.16    -0.89    -0.23       4000    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

##  Family: cumulative (logit) 
## Formula: self ~ age + (1 | id) 
##          disc = 1
##    Data: d (Number of observations: 292) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Group-Level Effects: 
## ~id (Number of levels: 77) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     0.83      0.25     0.26     1.29        781    1
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept[1]    -4.88      0.60    -6.11    -3.76       2986    1
## Intercept[2]    -3.92      0.52    -4.98    -2.96       4000    1
## Intercept[3]    -2.78      0.47    -3.71    -1.91       4000    1
## Intercept[4]    -1.23      0.43    -2.09    -0.42       4000    1
## age             -0.02      0.01    -0.04     0.00       4000    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

##  Family: cumulative (logit) 
## Formula: self ~ gender + (1 | id) 
##          disc = 1
##    Data: d (Number of observations: 292) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##    WAIC: Not computed
##  
## Group-Level Effects: 
## ~id (Number of levels: 77) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     0.78      0.25     0.22     1.27        861    1
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept[1]    -4.46      0.47    -5.44    -3.60       2273    1
## Intercept[2]    -3.49      0.36    -4.23    -2.82       2137    1
## Intercept[3]    -2.34      0.28    -2.92    -1.83       1926    1
## Intercept[4]    -0.81      0.21    -1.25    -0.41       3078    1
## gendermale      -0.71      0.33    -1.38    -0.07       2317    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The trace of the Bayesian model fit as well as the posterior distribution of the coefficients can be visually inspected by calling plot(agemdl).
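
To get the kind of ordinal distribution predictions mentioned earlier, predict() on a fitted cumulative model returns one posterior probability per response category for every observation, which can then be aggregated and plotted against the observed response distribution:

```r
# matrix of posterior category probabilities: one row per observation in d,
# one column per response category
probs <- predict(agemdl)
head(probs)
# the average predicted distribution across observations can be compared
# against the raw response proportions
colMeans(probs)
```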

Bayesian model comparison and hypothesis testing

brms models support comparison via the Watanabe-Akaike information criterion (WAIC) as well as leave-one-out (LOO) cross-validation:

LOO(nullmdl, agemdl, gendermdl)
##                      LOOIC    SE
## nullmdl             611.89 32.36
## agemdl              610.77 32.45
## gendermdl           609.09 32.59
## nullmdl - agemdl      1.12  3.31
## nullmdl - gendermdl   2.79  3.86
## agemdl - gendermdl    1.67  5.15

The model comparison suggests that neither adding the gender nor the age coefficient is particularly predictive (the standard errors of the differences between models are larger than the differences themselves), but we can still test hypotheses about specific parameters directly, such as whether the age coefficient is positive (where alpha specifies the size of the one-sided credible interval):

hypothesis(agemdl, "age>0", alpha=.05)
## Hypothesis Tests for class b:
##           Estimate Est.Error l-95% CI u-95% CI Evid.Ratio 
## (age) > 0    -0.02      0.01    -0.04      Inf       0.04 
## ---
## '*': The expected value under the hypothesis lies outside the 95% CI.

Visual inspection, posteriors, random effects,…

brms provides many other useful functions, from ranef(agemdl) for estimating the relative size of the random effects per group to launch_shiny(agemdl), which opens an interactive web interface that allows complete exploration of the model results and posterior distributions in your browser.
