We are increasingly working with multi-site studies, which is great, but there is the risk of overlooking systematic differences between geographical sites, that may result in a biased model.

An example could be a different testing environment, and different levels of experience of the staff, depending on the site. If a centre in a certain country comes with decades of experience with a certain technique (such as eye-tracking), we may expect to report slighly more accurate measures compared to a centre where the technique is used for the first time.

Here we present an handy way of assessing bias depending on site, inspired by the methods explained in the HilL&Gelman book “Data Analysis Using Regression and Multilevel/Hierarchical Models” https://g.co/kgs/zkuYrJ

Geographic info

We are going to attach some invented site and outcome info to the lalonde datasent. The example dataframe lalonde contains data of two treatment groups with demographic information. Site info is represented by a factor with 3 levels (centre 1, centre 2…), while outcome is a continous variable with normal distribution.

set.seed(7) #set seed when generating data
library(arm)
data(lalonde)
lalonde$site <- factor(sample(x = c(1:3), 
                              size = nrow(lalonde), 
                              replace=T))
lalonde$outcome <- rnorm(n = nrow(lalonde), 
                         mean = 0, 
                         sd = 100)
head(lalonde)

##   age educ black hisp married nodegr re74 re75     re78 u74 u75
## 1  37   11     1    0       1      1    0    0  9930.05   1   1
## 2  22    9     0    1       0      1    0    0  3595.89   1   1
## 3  30   12     1    0       0      0    0    0 24909.50   1   1
## 4  27   11     1    0       0      1    0    0  7506.15   1   1
## 5  33    8     1    0       0      1    0    0   289.79   1   1
## 6  22    9     1    0       0      1    0    0  4056.49   1   1
##   treat site     outcome
## 1     1    2 -217.316443
## 2     1    3  -64.937705
## 3     1    3   52.526155
## 4     1    3  -27.572731
## 5     1    2    5.470977
## 6     1    3  -38.824961

We fit a mixed model taking site as a covariante, and allowing random intercept and varying slopes per site (i.e, we obtain individual estimates for each site).

# install.packages(lmerTest)
library(lmerTest)
ml.fit <- lmer(outcome ~ treat + site + (0 + treat | site), data=lalonde)
fixef(ml.fit)

## (Intercept)       treat       site2       site3 
##   13.384262   -1.969604  -22.765333  -18.451882

We now make some predictions of the outcome based on our fixed and random effects. This procedure is based on one of the excellent and numerous resources provided by Ben Bolker https://github.com/bbolker

# create a ds with the essential levels of the predictors; leave the outcome blank to fill it with the predictions
newdat <- expand.grid(
    treat=(c(0,1)),
    site=factor(c("1","2","3")),
    outcome = 0
)

# make predictions
newdat$outcome <- predict(ml.fit,newdat,re.form=NA)

# estimate confidence intervals of the predictions 

mm <- model.matrix(terms(ml.fit),newdat)
pvar1 <- diag(mm %*% tcrossprod(vcov(ml.fit),mm))
tvar1 <- pvar1+VarCorr(ml.fit)$site[1] #adding the site specific variance on the slope
cmult <- 2
newdat <- data.frame(
    newdat
    , plo = newdat$outcome-cmult*sqrt(pvar1)
    , phi = newdat$outcome+cmult*sqrt(pvar1)
    , tlo = newdat$outcome-cmult*sqrt(tvar1)
    , thi = newdat$outcome+cmult*sqrt(tvar1)
)

We summarise the raw data to compare predictions with what happens in real life.

library(dplyr)
m_ci <- lalonde %>% group_by(site, treat) %>%
  summarise(N=n(),
            m=mean(outcome),
            s=sd(outcome),
            er=qnorm(0.975)*s/sqrt(N),
            upCI=m+er,
            downCI=m-er) %>%
  select(treat, site, m, upCI, downCI) %>% 
  left_join(newdat, by=c("treat", "site")) # here we attach the predictions
head(m_ci)

## # A tibble: 6 x 10
## # Groups:   site [3]
##   treat site       m  upCI downCI outcome    plo   phi    tlo   thi
##   <dbl> <fct>  <dbl> <dbl>  <dbl>   <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1     0 1      17.7  37.5   -2.03   13.4   -4.09 30.9   -4.09 30.9 
## 2     1 1       4.00 28.1  -20.1    11.4   -8.49 31.3   -8.49 31.3 
## 3     0 2      -7.55 12.5  -27.6    -9.38 -27.0   8.21 -27.0   8.21
## 4     1 2     -13.7  12.1  -39.4   -11.4  -30.0   7.35 -30.0   7.35
## 5     0 3     -12.0   7.41 -31.4    -5.07 -23.2  13.1  -23.2  13.1 
## 6     1 3       2.00 27.0  -23.0    -7.04 -26.4  12.3  -26.4  12.3

We now regress the predictions on the actual outcome (how well the actual data do fit the predictions of our model?). We add site in the random effect because we want to investigate how it influeces the deviance between the two.

reg <- lmer(outcome~m+ (0 + treat | site), data=m_ci)
ci <- confint(reg)[3:4,]

And we plot the observed effect of treatment on the predicted effect, together with the slope of the treatment effect.

library(ggplot2)
g0 <- ggplot(data=subset(m_ci, treat==1), #only take the active treatment subset!
             aes(y=outcome, x=m))+ #and plot actual values on preducted values
  geom_point()
g0 + 
  geom_pointrange(aes(ymin = plo, ymax = phi, col=site)) + # add the prediction interval as an error bar
  geom_abline(aes(intercept=fixef(reg)[1], 
                  slope=fixef(reg)[2])) + # add the slope of the treatment effect
  geom_abline(aes(intercept=ci[1,1], slope=ci[2,1]), 
              linetype="dashed", col="grey") + # add the confidence intervals of the treatment effect
  geom_abline(aes(intercept=ci[1,2], slope=ci[2,2]), 
              linetype="dashed", col="grey") +
  labs(y="Predicted Estimates", x="Observed Estimates")

We can see here that estimates of site 2 and 3 are quite in line with the treatment effect, as estimated by the model. The most reliable estimate is from site 2 (lies within the confidence interval), meaning that this model captures what is happening at site 2 quite well. However, the behaviour with site 3 is also not that bad, but below the actual effect. The estimate of the treatment effect is a bit more overestimated at site 1. This might be a useful information when considering the generalisation to all the sites of the treatment effect.

Modelling multi-site studies with geographical bias

Geographic info

Plot coefficients