Thanks to everyone who's visited this blog and provided encouragement and suggestions. My blog is moving to http://danieljhocking.wordpress.com/quantitative-ecology-blog/. I hope that you continue this dialog about quantitative and statistical methods for ecology with me in this new location. All old posts have also been migrated to the new location.

Thanks!

Dan

## Thursday, July 26, 2012

### Plotting 95% Confidence Bands in R

I am comparing estimates from subject-specific GLMMs and
population-average GEE models as part of a publication I am working on.
As part of this, I want to visualize predictions of each type of model
including 95% confidence bands.

First I had to make a new data set for prediction. I could have compared fitted values with confidence intervals, but I am specifically interested in comparing predictions for particular variables while holding others constant. For example, soil temperature is especially important for salamanders, so I am interested in the predicted effects of soil temperature from the different models. I used the expand.grid and model.matrix functions in R to generate a new data set where soil temperature varied from 0 to 30 C. The other variables were held constant at their mean levels during the study. Because of the way model.matrix builds contrasts for factors, I had to include more than one level of the factor "season". I then removed all seasons except spring. In effect I am asking: what is the effect of soil temperature on salamander activity during the spring when the other conditions are constant (e.g. windspeed = 1.0 m/s, rain in past 24 hours =

This code is based on code from Ben Bolker via http://glmm.wikidot.com

```
# Compare effects of SoilT with 95% CIs
# (the covariates are assumed to be available by name, e.g. from an attached data frame)

formula(glmm1)

newdat.soil <- expand.grid(
  SoilT = seq(0, 30, 1),
  RainAmt24 = mean(RainAmt24),
  RH = mean(RH),
  windspeed = mean(windspeed),
  season = c("spring", "summer", "fall"),
  droughtdays = mean(droughtdays),
  count = 0
)

newdat.soil$SoilT2 <- newdat.soil$SoilT^2

# Spring
newdat.soil.spring <- newdat.soil[newdat.soil$season == 'spring', ]

# Fixed-effects design matrix for the prediction grid
mm = model.matrix(terms(glmm1), newdat.soil)
```
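To see why the grid needs more than one level of season, here is a minimal, self-contained sketch. The variable values (1.2 and 80) and the formula are made up for illustration and are not taken from the study. With all three levels present, model.matrix builds the dummy columns for season; with a single-level factor it cannot build contrasts and stops with an error.

```r
# Toy prediction grid: temperature varies, other covariates are fixed at
# hypothetical values (1.2 and 80 are illustrative, not study means).
grid <- expand.grid(
  SoilT = seq(0, 30, 10),
  windspeed = 1.2,
  RH = 80,
  season = c("spring", "summer", "fall")
)
grid$season <- factor(grid$season, levels = c("spring", "summer", "fall"))

# Design matrix for a hypothetical fixed-effects formula
mm.toy <- model.matrix(~ SoilT + windspeed + RH + season, data = grid)
dim(mm.toy)        # 4 temperatures x 3 seasons = 12 rows, 6 columns
colnames(mm.toy)

# With only one level left, contrasts cannot be built and model.matrix errors
one.level <- droplevels(grid[grid$season == "spring", ])
res <- try(model.matrix(~ SoilT + season, data = one.level), silent = TRUE)
inherits(res, "try-error")

# Keep only the spring rows of the full design matrix
mm.spring <- mm.toy[grid$season == "spring", ]
```

Building the full grid first and subsetting afterwards keeps the column layout of the design matrix identical to the one used when the model was fit.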

Next I calculated the 95% confidence intervals for both the GLMM and GEE models. For the GLMM, plo and phi are the lower and upper confidence limits based on the fixed effects only, assuming zero effect of the random sites. tlo and thi also account for the uncertainty in the random effects.

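```
# GLMM predictions on the link (log) scale
newdat.soil$count = mm %*% fixef(glmm1)

# Variance of the predictions from the fixed effects only
pvar1 <- diag(mm %*% tcrossprod(vcov(glmm1),mm))

# Add the random-effect variance for the grouping factor "plot"
tvar1 <- pvar1+VarCorr(glmm1)$plot[1]

# Approximate 95% limits (+/- 2 SE) on the link scale
newdat.soil <- data.frame(
  newdat.soil
  , plo = newdat.soil$count-2*sqrt(pvar1)
  , phi = newdat.soil$count+2*sqrt(pvar1)
  , tlo = newdat.soil$count-2*sqrt(tvar1)
  , thi = newdat.soil$count+2*sqrt(tvar1)
)

# GEE predictions and variance from the robust covariance matrix
mm.geeEX = model.matrix(terms(geeEX), newdat.soil)
newdat.soil$count.gee = mm.geeEX %*% coef(geeEX)
tvar1.gee <- diag(mm.geeEX %*% tcrossprod(geeEX$geese$vbeta, mm.geeEX))

# GEE limits are centered on the GEE prediction (count.gee)
newdat.soil <- data.frame(
  newdat.soil
  , tlo.gee = newdat.soil$count.gee-2*sqrt(tvar1.gee)
  , thi.gee = newdat.soil$count.gee+2*sqrt(tvar1.gee)
)

# Extract the spring rows again now that the prediction columns exist
newdat.soil.spring <- newdat.soil[newdat.soil$season == 'spring', ]
```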
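The matrix step above can be checked on a model for which R already reports standard errors. The sketch below uses simulated data and an ordinary Poisson GLM, an assumption made only to keep it self-contained; it is not the GLMM or GEE from this post. It shows that the diagonal of X V X' gives the same standard errors on the link scale as predict() with se.fit = TRUE.

```r
set.seed(1)
# Simulated counts with a quadratic temperature effect (illustrative only)
d <- data.frame(SoilT = runif(200, 0, 30))
d$count <- rpois(200, exp(0.5 + 0.15 * d$SoilT - 0.005 * d$SoilT^2))
fit <- glm(count ~ SoilT + I(SoilT^2), family = poisson, data = d)

# Prediction grid and its design matrix
newd <- data.frame(SoilT = seq(0, 30, 1))
X <- model.matrix(~ SoilT + I(SoilT^2), data = newd)

eta <- drop(X %*% coef(fit))                   # linear predictor (log scale)
pvar <- diag(X %*% tcrossprod(vcov(fit), X))   # variance of each prediction

# The same quantities from predict()
p <- predict(fit, newdata = newd, type = "link", se.fit = TRUE)
all.equal(unname(eta), unname(p$fit))
all.equal(unname(sqrt(pvar)), unname(p$se.fit))

# Back-transform the +/- 2 SE limits to the count scale
ci <- data.frame(SoilT = newd$SoilT,
                 fit = exp(eta),
                 lo = exp(eta - 2 * sqrt(pvar)),
                 hi = exp(eta + 2 * sqrt(pvar)))
```

The limits are computed on the log scale and exponentiated afterwards, so they stay positive and are asymmetric around the fitted curve.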

The standard errors of the fixed effects are larger in the GEE model than in the GLMM, but when the variation associated with the random effects is accounted for, the uncertainty (95% CI) around the estimates is greater in the GLMM. This is especially evident when the estimated values are large, since the random effects act exponentially on the original scale. This can be seen in the plots below.
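A small numeric sketch of that last point, using hypothetical predictions and a hypothetical standard error: the interval has the same width on the log scale for every prediction, but after back-transformation its width grows in proportion to the predicted count.

```r
eta <- log(c(1, 5, 20))   # hypothetical predictions on the log scale
se <- 0.4                 # hypothetical SE, the same for every prediction

lo <- exp(eta - 2 * se)
hi <- exp(eta + 2 * se)

# Width on the count scale is proportional to the predicted count
data.frame(pred = exp(eta), lo = lo, hi = hi, width = hi - lo)
```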

Although this plot does the job, it isn't an efficient use of space, nor is it easy to compare exactly where the different lines fall. It would be nice to plot everything on one set of axes. The only trouble is that the lines could be difficult to tell apart using just solid and dashed/dotted lines. To help with this, I combined the plots and added color and shading using the polygon function. The code and plot are below.

```
# GEE prediction sets up the axes
plot(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee),
     xlab = "Soil temperature (C)",
     ylab = 'Predicted salamander observations',
     type = 'l',
     ylim = c(0, 25))

# Shaded 95% confidence envelope for the GEE
polygon(c(newdat.soil.spring$SoilT, rev(newdat.soil.spring$SoilT)),
        c(exp(newdat.soil.spring$thi.gee), rev(exp(newdat.soil.spring$tlo.gee))),
        col = 'grey',
        border = NA)

# Dashed edges of the GEE envelope
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi.gee),
      type = 'l',
      lty = 2)

lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo.gee),
      type = 'l',
      lty = 2)

# GEE prediction (red), redrawn on top of the shading
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee),
      type = 'l',
      lty = 1,
      col = 2)

# GLMM prediction (black) with dashed 95% limits
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count),
      col = 1)

lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi),
      type = 'l',
      lty = 2)

lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo),
      type = 'l',
      lty = 2)
```
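For readers who want to try the shading trick without the salamander models, here is a minimal sketch with a made-up curve and a made-up constant standard error. The key step is that polygon needs the outline of the band as one closed path: the x values forward paired with the upper limit, then the x values reversed paired with the lower limit.

```r
x <- seq(0, 30, 1)
eta <- 0.5 + 0.15 * x - 0.005 * x^2   # made-up curve on the log scale
se <- 0.25                            # made-up constant standard error

fit <- exp(eta)
lo <- exp(eta - 2 * se)
hi <- exp(eta + 2 * se)

# Empty plot first, so the shading does not cover the fitted line
plot(x, fit, type = "n", ylim = c(0, max(hi)),
     xlab = "Soil temperature (C)", ylab = "Predicted count")

# Outline of the band: along the top, then back along the bottom
band.x <- c(x, rev(x))
band.y <- c(hi, rev(lo))
polygon(band.x, band.y, col = "grey", border = NA)

# Fitted line drawn last, on top of the band
lines(x, fit, col = 2)
```

Starting with type = "n" avoids having to redraw the fitted line after the polygon is added.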

Now you can directly compare the results of the GLMM and GEE models. The predicted values (population-averaged) for the GEE are represented by the red line, while the average (random effects = 0, just fixed effects) from the GLMM is represented by the solid black line. The dashed lines represent the 95% confidence intervals for the GLMM and the shaded area is the 95% confidence envelope for the GEE model. As you can see, the GEE has much higher confidence in its prediction of soil temperature effects on salamander surface activity than the GLMM. This would not be apparent without visualizing the predictions with confidence intervals, because the standard errors of the fixed effects were lower in the GLMM than in the GEE. This is because the SEs in the GEE include the site-level (random effect) variation, while the GLMM SEs of the covariates do not include this variation and are interpreted as the effect of a change of 1 X on Y
**at a given site**.

## Monday, March 26, 2012

### Installing and Running JAGS on Mac OS 10.5.8

JAGS is an alternative to BUGS (WinBUGS or OpenBUGS) for conducting a Bayesian analysis. It stands for Just Another Gibbs Sampler, and like WinBUGS, it is essentially an MCMC machine that employs a Gibbs sampler so you don't have to write your own for every analysis. JAGS code is very similar to the more popular BUGS, so it is an easy transition. JAGS has the advantage of running on multiple platforms (Windows, Mac, Linux). It is also open source and written in C++, so it will likely have more continued development than the more well-established BUGS software. Unlike WinBUGS, JAGS has no user interface and you will not see it in your Programs/Applications folder. It has to be run from another program, most commonly R using rjags. R2jags is an R wrapper for JAGS and rjags that provides some additional features.

I have not yet updated my operating system or all of my software, and as such, I've had some difficulty installing and running JAGS/rjags. I finally got it working after two long days and thought I'd post my solution in case anyone finds themselves in the same situation. Hopefully, when I update to Snow Leopard in the next month, I won't have any problems just using the most up-to-date versions. For now, here is a solution using:

Mac OS 10.5.8

R 2.13.2

JAGS 2.2.0

rjags 2.2

1. Go to https://sourceforge.net/projects/mcmc-jags/files/ and download JAGSdist-2.2.0.dmg. Follow the normal install procedures.

2. From the same site download the rjags_2.2.0-1.tar.gz file to your desktop

3. Install the rjags package. I tried install.packages('/Users/Dan/Desktop/rjags_2.2.0-1.tar.gz', repos = NULL, type = "source") and it seemed to work, but when I typed library(rjags) I got the following error message:

Error: package 'rjags' is not installed for 'arch=x86_64'

4. If that happens, I found that installing from the Terminal with an additional argument worked like a charm. If, like me, you are not experienced with the Mac Terminal and command line entry, here are the explicit instructions that I found:

- Open Terminal (/Applications/Utilities/Terminal.app)
- Navigate to where you downloaded the source package. In my case the desktop, so I typed, "cd /Users/Dan/Desktop/" (without the quotes). You should notice that the cursor is now indicating that directory.
- Now that it knows where to find the file, have the Terminal tell R to install the package as a 64-bit version by typing the following into the Terminal: R --arch x86_64 CMD INSTALL rjags_2.2.0-1.tar.gz
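After the Terminal install, a quick check from inside R can confirm which architecture the session is running and whether rjags loads. This is only a sketch of one way to verify the result, using base R functions; it is not part of the original instructions.

```r
# Architecture of the running R session (e.g. "x86_64" or "i386")
arch <- R.version$arch
arch

# Is rjags installed in any library R knows about?
has.rjags <- "rjags" %in% rownames(installed.packages())

if (has.rjags) {
  # Loading fails if rjags was built for a different architecture
  # or if the JAGS library itself cannot be found
  ok <- tryCatch({
    library(rjags)
    TRUE
  }, error = function(e) {
    message(conditionMessage(e))
    FALSE
  })
} else {
  ok <- FALSE
}
ok
```

If ok is FALSE while has.rjags is TRUE, the printed message should say whether the architecture or the JAGS library is the problem.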

## Friday, March 23, 2012

### R script to calculate QIC for Generalized Estimating Equation (GEE) Model Selection

Generalized Estimating Equations (GEE) can be used to analyze longitudinal count data; that is, repeated counts taken from the same subject or site. This is often referred to as repeated measures data, but longitudinal data often has more repeated observations. Longitudinal data arises from studies in virtually all branches of science. In psychology or medicine, repeated measurements are taken on the same patients over time. In sociology, schools or other socially distinct groups are observed over time. In my field, ecology, we frequently record data from the same plants or animals repeatedly over time. Furthermore, the repeated measures don't have to be separated in time. A researcher could take multiple tissue samples from the same subject at a given time. I often repeatedly visit the same field sites (e.g. same patch of forest) over time. If the data are discrete counts of things (e.g. number of red blood cells, number of acorns, number of frogs), they are often modeled with a Poisson distribution.

Longitudinal count data, following a Poisson distribution, can be analyzed with Generalized Linear Mixed Models (GLMM) or with GEE. I won't get into the computational or philosophical differences between conditional, subject-specific estimates associated with GLMM and marginal, population-level estimates obtained by GEE in this post. However, if you decide that GEE is right for you (I have a paper in preparation comparing GLMM and GEE), you may also want to compare multiple GEE models. Unlike GLMM, GEE does not use full likelihood estimates, but rather relies on a quasi-likelihood function. Therefore, the popular AIC approach to model selection doesn't apply to GEE models. Luckily, Pan (2001) developed an analogous quasi-likelihood information criterion, QIC, for model comparison. Like AIC, it balances the model fit with model complexity to pick the most parsimonious model.
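To make the analogy with AIC concrete, here is a small sketch using simulated data and an ordinary Poisson GLM, that is, an independence model fit by full likelihood rather than a GEE. In that special case, -2 times the Poisson quasi-likelihood plus 2 times the number of parameters differs from AIC only by a constant that depends on the data and not on the model, so both criteria would rank candidate models identically.

```r
set.seed(42)
n <- 100
x <- rnorm(n)
y <- rpois(n, exp(0.3 + 0.5 * x))   # simulated counts, illustrative only
fit <- glm(y ~ x, family = poisson)

mu <- fitted(fit)
p <- length(coef(fit))

# Poisson quasi-likelihood (the log-likelihood without the log(y!) term)
quasi <- sum(y * log(mu) - mu)

# QICu-style criterion: -2 * quasi-likelihood + 2 * number of parameters
QICu <- -2 * quasi + 2 * p

# AIC uses the full log-likelihood, which adds 2 * sum(log(y!))
const <- 2 * sum(lfactorial(y))
all.equal(AIC(fit), QICu + const)
```

For a GEE with a non-independence working correlation there is no full likelihood, which is why the quasi-likelihood version is needed there.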

Unfortunately, there is currently no QIC package in R for GEE models. geepack is a popular R package for GEE analysis. So, I wrote the short R script below to calculate Pan's QIC statistic from the output of a GEE model run in geepack using the geese function. It currently employs the Moore-Penrose Generalized Matrix Inverse through the MASS package. I left in my original code using the identity matrix but it is preceded by a pound sign so it doesn't run.
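As a brief illustration of why a generalized inverse can be the safer choice, the sketch below compares ginv from MASS with solve from base R on two small made-up matrices. For an invertible matrix the two agree; for a singular matrix solve stops with an error, while ginv still returns a matrix G that satisfies S G S = S.

```r
library(MASS)

# For a well-conditioned matrix, ginv() and solve() agree
A <- matrix(c(4, 1, 1, 3), nrow = 2)
all.equal(ginv(A), solve(A))

# A singular matrix: the second column is twice the first
S <- matrix(c(1, 2, 2, 4), nrow = 2)
solve.failed <- inherits(try(solve(S), silent = TRUE), "try-error")
solve.failed

# ginv() still returns a generalized inverse
G <- ginv(S)
all.equal(S %*% G %*% S, S)
```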

I hope you find it useful. I'm still fairly new to R and this is one of my first custom functions, so let me know if you have problems using it or if there are places it can be improved. If you decide to use this for analysis in a publication, please let me know just for my own curiosity (and ego boost!).

*[Edit: April 10, 2012] The input for the QIC function needs to come from the geeglm function (as opposed to "geese") within geepack.*

```
#####################################################################################
# QIC for GEE models
# Daniel J. Hocking
# 07 February 2012
# Refs:
# Pan (2001)
# Liang and Zeger (1986)
# Zeger and Liang (1986)
# Hardin and Hilbe (2003)
# Dormann et al. (2007)
# http://www.unc.edu/courses/2010spring/ecol/562/001/docs/lectures/lecture14.htm
#####################################################################################
# Poisson QIC for geeglm{geepack} output
# Ref: Pan (2001)
# model.R:     geeglm fit with the working correlation structure of interest
# model.indep: geeglm fit of the same model with corstr = "independence"
QIC.pois.geeglm <- function(model.R, model.indep) {
  library(MASS)
  # Fitted and observed values for quasi likelihood
  mu.R <- model.R$fitted.values
  # alt: X <- model.matrix(model.R)
  # names(model.R$coefficients) <- NULL
  # beta.R <- model.R$coefficients
  # mu.R <- exp(X %*% beta.R)
  y <- model.R$y
  # Quasi Likelihood for Poisson
  quasi.R <- sum((y*log(mu.R)) - mu.R) # poisson()$dev.resids - scale and weights = 1
  # Trace Term (penalty for model complexity)
  # Omega-hat(I) via Moore-Penrose generalized inverse of a matrix in MASS package
  AIinverse <- ginv(model.indep$geese$vbeta.naiv)
  # Alt: AIinverse <- solve(model.indep$geese$vbeta.naiv) # solve via identity
  Vr <- model.R$geese$vbeta
  trace.R <- sum(diag(AIinverse %*% Vr))
  px <- length(model.R$coefficients) # number of non-redundant columns in design matrix
  # QIC
  QIC <- (-2)*quasi.R + 2*trace.R
  QICu <- (-2)*quasi.R + 2*px # Approximation assuming model structured correctly
  output <- c(QIC, QICu, quasi.R, trace.R, px)
  names(output) <- c('QIC', 'QICu', 'Quasi Lik', 'Trace', 'px')
  output
}
```
