Thursday, July 26, 2012

Blog Moving

Thanks to everyone who's visited this blog and provided encouragement and suggestions. My blog is moving, and I hope that you will continue this dialog about quantitative and statistical methods for ecology with me at the new location. All old posts have also been migrated there.


Plotting 95% Confidence Bands in R

I am comparing estimates from subject-specific GLMMs and population-average GEE models as part of a publication I am working on. As part of this, I want to visualize predictions of each type of model including 95% confidence bands.
First I had to make a new data set for prediction. I could have compared fitted values with confidence intervals, but I am specifically interested in comparing predictions for particular variables while holding the others constant. For example, soil temperature is especially important for salamanders, so I am interested in the predicted effects of soil temperature from the different models. I used the expand.grid and model.matrix functions in R to generate a new data set in which soil temperature varied from 0 to 30 C. The other variables were held constant at their mean levels during the study. Because of the nature of the contrast argument in the model.matrix function, I had to include more than one level of the factor "season". I then removed all seasons except spring. In effect I am asking: what is the effect of soil temperature on salamander activity during the spring when the other conditions are constant (e.g. windspeed = 1.0 m/s, with rain in the past 24 hours also fixed at its mean)? This code is based on code from Ben Bolker.
# Compare Effects of SoilT with 95% CIs
# (the covariates are referenced by name, so the data frame is presumably attached)
newdat.soil <- expand.grid(
  SoilT = seq(0, 30, 1),
  RainAmt24 = mean(RainAmt24),
  RH = mean(RH),
  windspeed = mean(windspeed),
  season = c("spring", "summer", "fall"),
  droughtdays = mean(droughtdays),
  count = 0
)
newdat.soil$SoilT2 <- newdat.soil$SoilT^2
# Spring
newdat.soil.spring <- newdat.soil[newdat.soil$season == 'spring', ]
# Design matrix for every row of the new data
mm <- model.matrix(terms(glmm1), newdat.soil)
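Next I calculated the 95% confidence intervals for both the GLMM and GEE models. For the GLMM, plo and phi are the lower and upper confidence limits for the fixed effects, assuming zero effect of the random sites; tlo and thi also account for the uncertainty in the random effects. The intervals are computed on the link (log) scale as the estimate plus or minus 2 standard errors, a close approximation to the 1.96 multiplier for a 95% interval.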
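# GLMM: predictions on the link (log) scale from the fixed effects only
newdat.soil$count <- as.vector(mm %*% fixef(glmm1))
pvar1 <- diag(mm %*% tcrossprod(vcov(glmm1), mm))  # variance from the fixed effects
tvar1 <- pvar1 + VarCorr(glmm1)$plot[1]            # plus the random-intercept variance for plot
newdat.soil <- data.frame(
  newdat.soil
  , plo = newdat.soil$count - 2*sqrt(pvar1)
  , phi = newdat.soil$count + 2*sqrt(pvar1)
  , tlo = newdat.soil$count - 2*sqrt(tvar1)
  , thi = newdat.soil$count + 2*sqrt(tvar1)
)
# GEE: predictions and robust (sandwich) variance on the link scale
mm.geeEX <- model.matrix(terms(geeEX), newdat.soil)
newdat.soil$count.gee <- as.vector(mm.geeEX %*% coef(geeEX))
tvar1.gee <- diag(mm.geeEX %*% tcrossprod(geeEX$geese$vbeta, mm.geeEX))
newdat.soil <- data.frame(
  newdat.soil
  , tlo.gee = newdat.soil$count.gee - 2*sqrt(tvar1.gee)  # centred on the GEE prediction
  , thi.gee = newdat.soil$count.gee + 2*sqrt(tvar1.gee)
)
# Re-extract the spring rows now that predictions and intervals have been added
newdat.soil.spring <- newdat.soil[newdat.soil$season == 'spring', ]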
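The matrix approach above can be checked on a simpler model. The following is a minimal sketch, assuming a plain Poisson GLM fit to simulated data (the data frame `dat`, the model `fit`, and all coefficient values are made up for illustration and are not the salamander models). It builds the prediction grid with expand.grid, computes the link-scale variance with the same `diag(mm %*% tcrossprod(vcov, mm))` idiom, and compares the result with what predict() reports.

```r
# Simulated data: hypothetical covariates and coefficients, for illustration only
set.seed(42)
dat <- data.frame(SoilT = runif(100, 0, 30), RH = runif(100, 40, 100))
dat$count <- rpois(100, exp(0.5 + 0.15 * dat$SoilT - 0.005 * dat$SoilT^2))
fit <- glm(count ~ SoilT + I(SoilT^2) + RH, family = poisson, data = dat)

# Prediction grid: vary SoilT, hold RH at its mean
newdat <- expand.grid(SoilT = seq(0, 30, 1), RH = mean(dat$RH))

# Design matrix for the new data (response dropped from the terms)
mm <- model.matrix(delete.response(terms(fit)), newdat)

# Link-scale prediction and its variance
eta  <- as.vector(mm %*% coef(fit))
pvar <- diag(mm %*% tcrossprod(vcov(fit), mm))

# Back-transform the interval limits to the count scale
newdat$fit <- exp(eta)
newdat$lo  <- exp(eta - 1.96 * sqrt(pvar))
newdat$hi  <- exp(eta + 1.96 * sqrt(pvar))

# Cross-check against predict(): standard errors on the link scale should agree
chk <- predict(fit, newdat, type = "link", se.fit = TRUE)
all.equal(unname(sqrt(pvar)), unname(chk$se.fit))
```

For a mixed model there is no equally simple built-in check, which is presumably why the manual calculation is used for the GLMM and GEE fits.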
The standard errors of the fixed effects are larger in the GEE model than in the GLMM, but when the variation associated with the random effects is accounted for, the uncertainty (95% CI) around the estimates is greater in the GLMM. This is especially evident when the estimated values are large, since the random effects act exponentially on the original scale. This can be seen in the plots below.

Although this plot does the job, it isn't an efficient use of space, nor is it easy to compare exactly where the different lines fall. It would be nice to plot everything on one set of axes. The only trouble is that all the lines could be difficult to distinguish using just solid and dashed/dotted lines. To help with this, I combined the plots and added color and shading using the polygon function. The code and plot are below.
# Set up the axes with the GEE prediction (back-transformed from the log scale)
plot(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee),
     xlab = "Soil temperature (C)",
     ylab = "Predicted salamander observations",
     type = "l",
     ylim = c(0, 25))
# Shaded 95% confidence envelope for the GEE
polygon(c(newdat.soil.spring$SoilT, rev(newdat.soil.spring$SoilT)),
        c(exp(newdat.soil.spring$thi.gee), rev(exp(newdat.soil.spring$tlo.gee))),
        col = "grey",
        border = NA)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi.gee),
      type = "l",
      lty = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo.gee),
      type = "l",
      lty = 2)
# GEE prediction redrawn in red on top of the shading
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee),
      type = "l",
      lty = 1,
      col = 2)
# GLMM prediction (solid black) and its 95% limits including random effects (dashed)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count),
      col = 1)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi),
      type = "l",
      lty = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo),
      type = "l",
      lty = 2)
GLMM vs GEE plot with 95% confidence intervals
Now you can directly compare the results of the GLMM and GEE models. The predicted values (population-averaged) for the GEE are represented by the red line, while the average predictions from the GLMM (random effects = 0, just fixed effects) are represented by the solid black line. The dashed lines represent the 95% confidence intervals for the GLMM and the shaded area is the 95% confidence envelope for the GEE model. As you can see, the GEE has much higher confidence in its prediction of soil temperature effects on salamander surface activity than the GLMM. This would not be apparent without visualizing the predictions with confidence intervals, because the standard errors of the fixed effects were lower in the GLMM than in the GEE. This is because the SEs in the GEE include the site-level (random effect) variation, while the GLMM SEs of the covariates do not include this variation and are interpreted as the effect of a one-unit change in X on Y at a given site.
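The shading trick depends on how polygon expects its coordinates: the x values run forward along the upper limit and then backward along the lower limit, so the outline closes on itself. Here is a small sketch with made-up numbers (the curve `fit` and the limits `lo` and `hi` are hypothetical, not model output). The `pdf(NULL)` line only opens a throwaway device so the code runs non-interactively; drop it to see the plot on screen.

```r
# Hypothetical prediction curve and confidence limits, for illustration only
x   <- seq(0, 30, 1)
fit <- 10 * exp(-((x - 15) / 8)^2)
lo  <- 0.7 * fit
hi  <- 1.3 * fit

# Polygon outline: forward along the upper limit, back along the lower limit
band.x <- c(x, rev(x))
band.y <- c(hi, rev(lo))

pdf(NULL)  # throwaway graphics device; remove to draw on screen
plot(x, fit, type = "n", ylim = range(lo, hi),
     xlab = "Soil temperature (C)", ylab = "Predicted count")
polygon(band.x, band.y, col = "grey", border = NA)  # shaded band first
lines(x, fit, col = 2)                              # prediction drawn on top
invisible(dev.off())
```

Drawing the polygon before the lines matters, because base graphics paints in order and a polygon added last would cover the curves.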

Monday, March 26, 2012

Installing and Running JAGS on Mac OS 10.5.8

JAGS is an alternative to BUGS (WinBUGS or OpenBUGS) for conducting a Bayesian analysis. It stands for Just Another Gibbs Sampler, and like WinBUGS, it is essentially an MCMC machine that employs a Gibbs sampler so you don't have to write your own for every analysis. JAGS code is very similar to the more popular BUGS, so it is an easy transition. JAGS has the advantage of running on multiple platforms (Windows, Mac, Linux). It is also open source and written in C++, so it will likely see more continued development than the more established BUGS software. Unlike WinBUGS, JAGS has no user interface and you will not see it in your Programs/Applications folder. It has to be run from another program, most commonly R using rjags. R2jags is an R wrapper for JAGS and rjags that provides some additional features.

I have not yet updated my operating system or all of my software, and as such, I've had some difficulty installing and running JAGS/rjags. I finally got it working after two long days and thought I'd post my solution in case anyone finds themselves in the same situation. Hopefully when I do update to Snow Leopard in the next month I don't have any problems just using the most up to date versions. For now, here is a solution using:

Mac OS 10.5.8
R 2.13.2
JAGS 2.2.0
rjags 2.2

1. Download JAGSdist-2.2.0.dmg from the JAGS download page. Follow the normal install procedures.

2. From the same site download the rjags_2.2.0-1.tar.gz file to your desktop

3. Install the rjags package. I tried install.packages('/Users/Dan/Desktop/rjags_2.2.0-1.tar.gz', repos = NULL, type = "source") and it seemed to work, but when I typed library(rjags) I got the following error message:
Error: package 'rjags' is not installed for 'arch=x86_64'

4. If that happens, I found that installing from the Terminal with additional instructions worked like a charm. If like me you are not experienced with the Mac Terminal and command line entry, I will provide explicit instructions that I found:
  • Open Terminal (/Applications/Utilities/)
  • Navigate to where you downloaded the source package. In my case the desktop, so I typed, "cd /Users/Dan/Desktop/" (without the quotes). You should notice that the cursor is now indicating that directory.
  • Now that it knows where to find the file, have the Terminal tell R to install the package as a 64-bit version by typing the following into the Terminal: R --arch x86_64 CMD INSTALL rjags_2.2.0-1.tar.gz
5. Open R64 back up and rjags should be installed. Load the library with "library(rjags)" and it should work. At least it worked for me. Good luck!
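Because the error in step 3 came from an architecture mismatch, it can help to confirm which build of R is running before loading the package. This is a small sketch using base R only; the variable names are just for illustration.

```r
# Report the architecture of the running R session
arch    <- R.version$arch                  # e.g. "x86_64" for a 64-bit Intel build
is64bit <- .Machine$sizeof.pointer == 8    # pointers are 8 bytes in a 64-bit build
cat("R architecture:", arch, "| 64-bit:", is64bit, "\n")

# Check whether rjags can be loaded in this session without attaching it
has.rjags <- requireNamespace("rjags", quietly = TRUE)
cat("rjags available:", has.rjags, "\n")
```

If `has.rjags` is FALSE in the 64-bit session, the package was probably installed for a different architecture, which is the situation step 4 addresses.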

Friday, March 23, 2012

R script to calculate QIC for Generalized Estimating Equation (GEE) Model Selection

Generalized Estimating Equations (GEE) can be used to analyze longitudinal count data; that is, repeated counts taken from the same subject or site. This is often referred to as repeated measures data, but longitudinal data often have more repeated observations. Longitudinal data arise from studies in virtually all branches of science. In psychology or medicine, repeated measurements are taken on the same patients over time. In sociology, schools or other distinct social groups are observed over time. In my field, ecology, we frequently record data from the same plants or animals repeatedly over time. Furthermore, the repeated measures don't have to be separated in time. A researcher could take multiple tissue samples from the same subject at a given time. I often repeatedly visit the same field sites (e.g. same patch of forest) over time. If the data are discrete counts of things (e.g. number of red blood cells, number of acorns, number of frogs), the data will generally follow a Poisson distribution.

Longitudinal count data following a Poisson distribution can be analyzed with Generalized Linear Mixed Models (GLMM) or with GEE. I won't get into the computational or philosophical differences between the conditional, subject-specific estimates associated with GLMM and the marginal, population-level estimates obtained by GEE in this post. However, if you decide that GEE is right for you (I have a paper in preparation comparing GLMM and GEE), you may also want to compare multiple GEE models. Unlike GLMM, GEE does not use full likelihood estimates but relies on a quasi-likelihood function. Therefore, the popular AIC approach to model selection doesn't apply to GEE models. Luckily, Pan (2001) developed an analogous statistic, QIC, for model comparison. Like AIC, it balances model fit against model complexity to pick the most parsimonious model.
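For reference, the quantities involved can be written out as follows. This is a sketch of Pan's (2001) definitions specialised to the Poisson case with dispersion fixed at 1; the symbols are my own labels: $\hat{\mu}_i$ is the fitted mean under working correlation $R$, $\hat{\Omega}_I$ is the inverse of the model-based (naive) covariance matrix from the independence model, $\hat{V}_R$ is the robust covariance matrix from the model with working correlation $R$, and $p$ is the number of regression parameters.

```latex
\begin{aligned}
Q\bigl(\hat{\beta}(R); I\bigr) &= \sum_{i} \bigl( y_i \log \hat{\mu}_i - \hat{\mu}_i \bigr) \\
\mathrm{QIC}(R) &= -2\, Q\bigl(\hat{\beta}(R); I\bigr) + 2\, \operatorname{tr}\bigl( \hat{\Omega}_I \hat{V}_R \bigr) \\
\mathrm{QIC}_u(R) &= -2\, Q\bigl(\hat{\beta}(R); I\bigr) + 2p
\end{aligned}
```

The trace term plays the role that the parameter count plays in AIC, and $\mathrm{QIC}_u$ is the approximation obtained when the trace is replaced by $p$.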

Unfortunately, there is currently no QIC package in R for GEE models. geepack is a popular R package for GEE analysis, so I wrote the short R script below to calculate Pan's QIC statistic from the output of a GEE model run in geepack. It currently employs the Moore-Penrose generalized matrix inverse through the MASS package. I left in my original code using solve, but it is preceded by a pound sign so it doesn't run. [Edit: April 10, 2012] The input for the QIC function needs to come from the geeglm function (as opposed to "geese") within geepack.

I hope you find it useful. I'm still fairly new to R and this is one of my first custom functions, so let me know if you have problems using it or if there are places it can be improved. If you decide to use this for analysis in a publication, please let me know just for my own curiosity (and ego boost!).

# QIC for GEE models
# Daniel J. Hocking
# 07 February 2012
# Refs:
  # Pan (2001)
  # Liang and Zeger (1986)
  # Zeger and Liang (1986)
  # Hardin and Hilbe (2003)
  # Dormann et al. (2007)
# Poisson QIC for geeglm{geepack} output
# Ref: Pan (2001)
library(MASS)  # for ginv()
QIC.pois.geeglm <- function(model.R, model.indep) {
  # model.R: geeglm fit with the working correlation structure of interest
  # model.indep: the same model fit with an independence working correlation
  # Fitted and observed values for quasi likelihood
  mu.R <- model.R$fitted.values
  # alt: X <- model.matrix(model.R)
      #  names(model.R$coefficients) <- NULL
      #  beta.R <- model.R$coefficients
      #  mu.R <- exp(X %*% beta.R)
  y <- model.R$y

  # Quasi Likelihood for Poisson
  quasi.R <- sum((y*log(mu.R)) - mu.R) # poisson()$dev.resids - scale and weights = 1
  # Trace Term (penalty for model complexity)
  # Omega-hat(I) via Moore-Penrose generalized inverse of a matrix in MASS package
  AIinverse <- ginv(model.indep$geese$vbeta.naiv)
  # Alt: AIinverse <- solve(model.indep$geese$vbeta.naiv) # solve via identity
  Vr <- model.R$geese$vbeta
  trace.R <- sum(diag(AIinverse %*% Vr))
  px <- length(coef(model.R)) # number of regression parameters (non-redundant columns in design matrix)

  # QIC
  QIC <- (-2)*quasi.R + 2*trace.R
  QICu <- (-2)*quasi.R + 2*px    # Approximation assuming model structured correctly
  output <- c(QIC, QICu, quasi.R, trace.R, px)
  names(output) <- c('QIC', 'QICu', 'Quasi Lik', 'Trace', 'px')
  return(output)
}
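To see how the pieces of the statistic behave, here is a toy calculation in base R. All numbers are made up for illustration: `y` and `mu` stand in for observed and fitted counts, `V.naive` for the model-based covariance from an independence fit, and `V.robust` for a robust covariance that is assumed to be 50% larger. When the robust and naive matrices coincide, the trace term equals the number of parameters, which is the situation the QICu approximation assumes.

```r
# Hypothetical observed and fitted counts
y  <- c(0, 2, 1, 4, 3)
mu <- c(0.5, 1.5, 1.2, 3.6, 3.1)
quasi <- sum(y * log(mu) - mu)   # Poisson quasi-likelihood

# Hypothetical 2 x 2 covariance matrices for a two-parameter model
V.naive  <- matrix(c(0.04, 0.01, 0.01, 0.09), nrow = 2)
V.robust <- 1.5 * V.naive        # assumed: robust variance 50% larger than naive

# Trace term: inverse of the naive covariance times the robust covariance
trace.same   <- sum(diag(solve(V.naive) %*% V.naive))   # identical matrices
trace.robust <- sum(diag(solve(V.naive) %*% V.robust))

QIC  <- -2 * quasi + 2 * trace.robust
QICu <- -2 * quasi + 2 * ncol(V.naive)
c(trace.same = trace.same, trace.robust = trace.robust, QIC = QIC, QICu = QICu)
```

In this toy case the robust variance is inflated relative to the naive one, so the trace exceeds the parameter count and QIC is larger than QICu.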

Monday, October 31, 2011

Plotting grouped data vs time with error bars in R

This is my first blog since joining R-bloggers. I’m quite excited to be part of this group and apologize if I bore any experienced R users with my basic blogs for learning R or offend programmers with my inefficient, sloppy coding. Hopefully writing for this most excellent community will help improve my R skills while helping other novice R users.

I have a dataset from my dissertation research where I repeatedly counted salamanders from some plots and removed salamanders from other plots. I was interested in plotting the captures over time (sensu Hairston 1987). As with all statistics and graphs in R, there are a variety of ways to create the same or similar output. One challenge I always have is dealing with dates. Another challenge is plotting error bars on plots. Now I had the challenge of plotting the average captures per night or month ± SE for two groups (reference and depletion plots – 5 of each on each night) vs. time. This is the type of thing that could be done in 5 minutes on graph paper by hand but took me a while in R, and I'm still tweaking various plots. Below I explore a variety of ways to handle and plot these data. I hope it's helpful for others. Chime in with comments if you have any suggestions or similar experiences.

> ####################################################################
> # Summary of Count and Removal Data
> # Dissertation Project 2011
> # October 25, 2011
> # Daniel J. Hocking
> ####################################################################
> Data <- read.table('/Users/Dan/…/AllCountsR.txt', header = TRUE, na.strings = "NA")
> str(Data)
'data.frame':            910 obs. of  32 variables:
 $ dates            : Factor w/ 91 levels "04/06/09","04/13/11",..: 12 14 15 16 17 18 19 21 22 23 ...

You'll notice that when importing a text file created in Excel with the default date format, R treats the date variable as a Factor within the data frame. We need to convert it to a date class that R can recognize. Two such built-in classes are created by as.Date and as.POSIXct. The latter is a more common format and the one I chose to use (both are very similar but not fully interchangeable). To get the data into POSIXct format in this case I use the strptime function as seen below; strptime itself returns a POSIXlt object, which data.frame stores as POSIXct. I also create new columns for month and year in case they become useful for aggregating or summarizing the data later.

> dates <-strptime(as.character(Data$dates), "%m/%d/%y")  # Change date from excel storage to internal R format
> dim(Data)
[1] 910  32
> Data = Data[,2:32]               # remove the original factor dates column (column 1)
> Data = data.frame(date = dates, Data)      #this is now the date in useful fashion
> Data$mo <- strftime(Data$date, "%m")
> Data$mon <- strftime(Data$date, "%b")
> Data$yr <- strftime(Data$date, "%Y")
> monyr <- function(x)
+ {
+     x <- as.POSIXlt(x)
+     x$mday <- 1
+     as.POSIXct(x)
+ }
> Data$mo.yr <- monyr(Data$date)
> str(Data)
'data.frame':            910 obs. of  36 variables:
 $ date             : POSIXct, format: "2008-05-17" "2008-05-18" ...
$ mo               : chr  "05" "05" "05" "05" ...
 $ mon              : chr  "May" "May" "May" "May" ...
 $ yr               : chr  "2008" "2008" "2008" "2008" ...
 $ mo.yr            : POSIXct, format: "2008-05-01 00:00:00" "2008-05-01 00:00:00" ...

As you can see the date is now in the internal R date form of POSIXct (YYYY-MM-DD).
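The same conversion can be tried on a few made-up dates. This is a minimal sketch in which the vector `raw` is hypothetical and the time zone is fixed to UTC (an assumption made only so that the result does not depend on the local time zone). The `monyr` helper is the one defined above.

```r
# Hypothetical dates stored as a factor, as they arrive from the text file
raw <- factor(c("05/17/08", "05/18/08", "04/06/09"))

# Factor -> character -> POSIXlt (strptime) -> POSIXct
dates <- as.POSIXct(strptime(as.character(raw), "%m/%d/%y", tz = "UTC"))

# Month and year as character columns
mo <- format(dates, "%m")
yr <- format(dates, "%Y")

# Collapse each date to the first day of its month
monyr <- function(x) {
  x <- as.POSIXlt(x)
  x$mday <- 1
  as.POSIXct(x)
}
mo.yr <- monyr(dates)
format(mo.yr, "%Y-%m-%d")
```

Keeping a first-of-month POSIXct column makes it easy to group by month while still plotting on a true time axis.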

Now I use a custom function to summarize each night of counts and removals. I forgot offhand how to call a custom function stored elsewhere, so I lazily pasted it into my script. I found this nice little function online, but I apologize to the author because I don't remember where I found it.

> library(ggplot2)
> library(doBy)
> ## Summarizes data.
> ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
> ## If there are within-subject variables, calculate adjusted values using method from Morey (2008).
> ##   measurevar: the name of a column that contains the variable to be summarized
> ##   groupvars: a vector containing names of columns that contain grouping variables
> ##   na.rm: a boolean that indicates whether to ignore NA's
> ##   conf.interval: the percent range of the confidence interval (default is 95%)
> summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
+     require(doBy)
+     # New version of length which can handle NA's: if na.rm==T, don't count them
+     length2 <- function (x, na.rm=FALSE) {
+         if (na.rm) sum(!is.na(x))
+         else       length(x)
+     }
+     # Collapse the data
+     formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
+     datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)
+     # Rename columns
+     names(datac)[ names(datac) == paste(measurevar, ".mean",    sep="") ] <- measurevar
+     names(datac)[ names(datac) == paste(measurevar, ".sd",      sep="") ] <- "sd"
+     names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"
+     datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
+     # Confidence interval multiplier for standard error
+     # Calculate t-statistic for confidence interval:
+     # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
+     ciMult <- qt(conf.interval/2 + .5, datac$N-1)
+     datac$ci <- datac$se * ciMult
+     return(datac)
+ }
> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
> gCount <- summarySE(Count, measurevar="count", groupvars=c("date","trt"))
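For readers without the doBy package, the core of the summary can be sketched in base R. This is not the summarySE function above but a simplified stand-in (the name `summarise_se` and the toy data are hypothetical). It handles a single grouping variable and always drops missing values.

```r
# Toy data: two treatments, one missing count
toy <- data.frame(trt   = rep(c("Reference", "Depletion"), each = 4),
                  count = c(4, 6, 5, NA, 1, 3, 2, 2))

# Simplified stand-in for summarySE: N, mean, sd, se and CI half-width per group
summarise_se <- function(data, measurevar, groupvar, conf.interval = 0.95) {
  groups <- split(data[[measurevar]], data[[groupvar]])
  do.call(rbind, lapply(names(groups), function(g) {
    x  <- groups[[g]][!is.na(groups[[g]])]   # drop missing values
    n  <- length(x)
    se <- sd(x) / sqrt(n)                    # standard error of the mean
    data.frame(group = g, N = n, mean = mean(x), sd = sd(x), se = se,
               ci = se * qt(conf.interval / 2 + 0.5, df = n - 1))
  }))
}

res <- summarise_se(toy, "count", "trt")
res
```

The groups come back in alphabetical order (Depletion, then Reference), and the Reference group reports N = 3 because its missing value is excluded.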

Now I’m ready to plot. I’ll start with a line graph of the two treatment plot types.

> ### Line Plot ###
> # Ref: Quick-R
> # convert factor to numeric for convenience
> gCount$trtcode <- as.numeric(gCount$trt)
> ntrt <- max(gCount$trtcode)
> # ranges for x and y axes
> xrange <- range(gCount$date)
> yrange <- range(gCount$count)
> # Set up blank plot
> plot(gCount$date, gCount$count, type = "n",
+      xlab = "Date",
+      ylab = "Mean number of salamanders per night")
> # add lines
> color <- seq(1:ntrt)
> line <- seq(1:ntrt)
> for (i in 1:ntrt){
+   Treatment <- subset(gCount, gCount$trtcode==i)
+   lines(Treatment$date, Treatment$count, col = color[i])#, lty = line[i])
+ }

As you can see, this is not attractive, but it does show that I generally caught more salamanders in the reference (red line) plots. It also makes apparent that a line graph is probably a bit messy for these data and might even be a bit misleading, because data were not collected continuously or at even intervals (collection depended on weather and season). So, I tried a plot of just the points using the scatterplot function from the "car" package.

> ### package car scatterplot by groups ###
> library(car)

> # Plot
> scatterplot(count ~ date + trt, data = gCount,
+             smooth = FALSE, grid = FALSE, reg.line = FALSE,
+    xlab="Date", ylab="Mean number of salamanders per night")

This was nice because it was very simple to code. It includes points from every night but I would still like to summarize it more. Before I get to that, I would like to try having breaks between the years. The lattice package should be useful for this.

> library(lattice)
> # Add year, month, and day to dataframe
> chardate <- as.character(gCount$date)
> splitdate <- strsplit(chardate, split = "-")
> gCount$year <- as.numeric(unlist(lapply(splitdate, "[", 1)))
> gCount$month <- as.numeric(unlist(lapply(splitdate, "[", 2)))
> gCount$day <- as.numeric(unlist(lapply(splitdate, "[", 3)))
> # Plot
> xyplot(count ~ trt + date | year,
+ data = gCount,
+ ylab="Daily salamander captures", xlab="date",
+ pch = seq(1:ntrt),
+ scales=list(x=list(alternating=c(1, 1, 1))),
+ between=list(y=1),
+ par.strip.text=list(cex=0.7),
+ par.settings=list(axis.text=list(cex=0.7)))

Obviously there is a problem with this. I am not getting proper overlaying of the two treatments. I tried adjusting the equation (e.g. count ~ month | year*trt), but nothing was that enticing and I decided to go back to other plotting functions. The lattice package is great for trellis plots and visualizing mixed effects models.

I now decided to summarize the data by month rather than by day and add standard error bars. This goes back to using the base plot function.

> ### Line Plot ###
> # Ref:
> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
> mCount <- summarySE(Count, measurevar="count", groupvars=c("mo.yr","trt"))
> refmCount <- subset(mCount, mCount$trt == "Reference")
> depmCount <- subset(mCount, mCount$trt == "Depletion")
> daterange=c(as.POSIXct(min(mCount$mo.yr)),as.POSIXct(max(mCount$mo.yr)))
> # determine the lowest and highest months
> ylims <- c(0, max(mCount$count + mCount$se))
> r <- as.POSIXct(range(refmCount$mo.yr), "month")
> plot(refmCount$mo.yr, refmCount$count, type = "n", xaxt = "n",
+      xlab = "Date",
+      ylab = "Mean number of salamanders per night",
+      xlim = c(r[1], r[2]),
+      ylim = ylims)
> axis.POSIXct(1, at = seq(r[1], r[2], by = "month"), format = "%b")
> points(refmCount$mo.yr, refmCount$count, type = "p", pch = 19)
> points(depmCount$mo.yr, depmCount$count, type = "p", pch = 24)
> arrows(refmCount$mo.yr, refmCount$count+refmCount$se, refmCount$mo.yr, refmCount$count-refmCount$se, angle=90, code=3, length=0)
> arrows(depmCount$mo.yr, depmCount$count+depmCount$se, depmCount$mo.yr, depmCount$count-depmCount$se, angle=90, code=3, length=0)

Now that's a much better visualization of the data, and that's the whole goal of a figure for publication. The only thing I might change would be to plot by year with labels of Month-Year (format = "%b %Y"). I might add a legend, but with only two treatments I might just include the info in the figure description.
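The error-bar idiom can be isolated in a few lines. This sketch uses made-up monthly means and standard errors (the data frame `m` is hypothetical) and labels the axis with the Month-Year format mentioned above. The `pdf(NULL)` call opens a throwaway device so the code runs non-interactively; remove it to see the figure.

```r
# Hypothetical monthly means and standard errors
m <- data.frame(
  mo.yr = as.POSIXct(c("2008-05-01", "2008-06-01", "2008-07-01"), tz = "UTC"),
  count = c(3.2, 5.1, 2.4),
  se    = c(0.6, 0.9, 0.4)
)

# y axis must reach the top of the tallest error bar
ylims <- c(0, max(m$count + m$se))

pdf(NULL)  # throwaway graphics device; remove to draw on screen
plot(m$mo.yr, m$count, pch = 19, xaxt = "n", ylim = ylims,
     xlab = "Date", ylab = "Mean number of salamanders per night")
axis.POSIXct(1, at = m$mo.yr, format = "%b %Y")   # Month-Year tick labels
# arrows() with angle = 90 and code = 3 draws flat caps at both ends
arrows(m$mo.yr, m$count - m$se, m$mo.yr, m$count + m$se,
       angle = 90, code = 3, length = 0.05)
invisible(dev.off())
```

Setting `length = 0`, as in the code above, drops the caps and leaves plain vertical bars; a small positive value such as 0.05 adds them.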

Although that is probably going to be my final product for my current purposes, I wanted to explore a few other graphing options for visualizing this data. The first is to use box plots. I use the add = TRUE option to add a second group after subsetting the data.

> ### Boxplot ###
> # Ref:
> #    as.POSIXlt(date)$mon     #gives the months in numeric order mod 12 with January = 0 and December = 11
> refboxplot <- boxplot(count ~ date, data = Count, subset = trt == "Reference",
+                       ylab = "Mean number of salamanders per night",
+                       xlab = "Date")   #show the graph and save the data
> depboxplot <- boxplot(count ~ date, data = Count, subset = trt == "Depletion", col = 2, add = TRUE)

Clearly this is a mess and not useful. But you can imagine that with some work and summarizing by month or season it could be a useful and efficient way to present the data. Next I tried the popular package ggplot2.

> ### ggplot ###
> # Refs:
> #
> #
> library(ggplot2)
> ggplot(data = gCount, aes(x = date, y = count, group = trt)) +
+     #geom_point(aes(shape = factor(trt))) +
+     geom_point(aes(colour = factor(trt), shape = factor(trt)), size = 3) +
+     #geom_line() +
+     geom_errorbar(aes(ymin=count-se, ymax=count+se), width=.1) +
+     #geom_line() +
+    # scale_shape_manual(values=c(24,21)) + # explicitly have sham=fillable triangle, ACCX=fillable circle
+     #scale_fill_manual(values=c("white","black")) + # explicitly have sham=white, ACCX=black
+     xlab("Date") +
+     ylab("Mean number of salamander captures per night") +
+     scale_colour_hue(name="Treatment", # Legend label, use darker colors
+                      l=40) +                  # Use darker colors, lightness=40
+     theme_bw() + # make the theme black-and-white rather than grey (do this before font changes, or it overrides them)
+     theme(legend.position=c(.2, .9), # Position legend inside the plot. This must go after theme_bw
+           panel.grid.major = element_blank(), # switch off major gridlines
+           panel.grid.minor = element_blank(), # switch off minor gridlines
+          legend.title = element_blank(), # switch off the legend title
+          legend.key = element_blank()) # switch off the rectangle around symbols in the legend

This plot could work with some fine tuning, especially with the legend(s) but you get the idea. It wasn’t as easy for me as the plot function but ggplot is quite versatile and probably a good package to have in your back pocket for complicated graphing.

Next up was the gplots package for the plotCI function.

> library(gplots)
> plotCI(
+   x = refmCount$mo.yr,
+   y = refmCount$count,
+   uiw = refmCount$se, # error bar length (default is to put this much above and below point)
+   pch = 24, # symbol (plotting character) type: see help(pch); 24 = filled triangle pointing up
+   pt.bg = "white", # fill colour for symbol
+   cex = 1.0, # symbol size multiplier
+   lty = 1, # error bar line type: see help(par) - not sure how to change plot lines
+   type = "p", # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)
+   gap = 0, # distance from symbol to error bar
+   sfrac = 0.005, # width of error bar as proportion of x plotting region (default 0.01)
+   xlab = "Year", # x axis label
+   ylab = "Mean number of salamanders per night",
+   las = 1, # axis labels horizontal (default is 0 for always parallel to axis)
+   font.lab = 1, # 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
+   xaxt = "n" # Don't print x-axis
+ )
>   axis.POSIXct(1, at = seq(r[1], r[2], by = "year"), format = "%b %Y") # label the x axis by month-years
> plotCI(
+   x=depmCount$mo.yr,
+   y = depmCount$count,
+   uiw=depmCount$se, # error bar length (default is to put this much above and below point)
+   pch=21, # symbol (plotting character) type: see help(pch); 21 = circle
+   pt.bg = "grey", # fill colour for symbol
+   cex=1.0, # symbol size multiplier
+   lty=1, # line type: see help(par)
+   type="p", # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)
+   gap=0, # distance from symbol to error bar
+   sfrac=0.005, # width of error bar as proportion of x plotting region (default 0.01)
+   xaxt = "n",
+   add=TRUE # ADD this plot to the previous one
+ )

Now this is a nice figure. I must say that I like this. It is very similar to the standard plot code and graph but it was a little easier to add the error bars.

So that’s it, what a fun weekend I had! Let me know what you think or if you have any suggestions. I’m new to this and love to learn new ways of coding things in R.

Thursday, October 6, 2011

Assumptions of the Linear Model

Linear Assumptions from the Analysis Factor - Assumptions of linear regression (and ANOVA) are about the residuals, not the normality or independence of the response variable (Y). If you don't know what this means be sure to read this brief blog article.

Monday, July 18, 2011

Model Validation: Interpreting Residual Plots

When conducting any statistical analysis it is important to evaluate how well the model fits the data and that the data meet the assumptions of the model. There are numerous ways to do this and a variety of statistical tests to evaluate deviations from model assumptions. However, there is little general acceptance of any of the statistical tests. Generally statisticians (which I am not but I do my best impression) examine various diagnostic plots after running their regression models. There are a number of good sources of information on how to do this. My recommendation is Fox and Weisberg's An R Companion to Applied Regression (Chp 6). You can refer to Fox's book, Applied Regression Analysis and Generalized Linear Models for the theory and details behind these plots but the corresponding R book is more of the "how to" guide. A very brief but good introduction to checking linear model assumptions can be found here.

The point of this post isn't to go over the details or theory but rather discuss one of the challenges that I and others have had with interpreting these diagnostic plots. Without going into the differences between standardized, studentized, Pearson's and other residuals, I will say that most of the model validation centers around the residuals (essentially the distance of the data points from the fitted regression line). Here is an example from Zuur and Colleagues' excellent book, Mixed Effects Models and Extensions in Ecology with R:

So these residuals appear to exhibit homogeneity, normality, and independence. Those are pretty clear, although I'm not sure if the variation in residuals associated with the predictor (independent) variable Month is a problem. This might be a problem with heterogeneity. Most books just show a few examples like this and then residuals with clear patterning, most often increasing residual values with increasing fitted values (i.e. large values of the response/dependent variable result in greater variation, which is often corrected with a log transformation). A good example of this can be seen in (d) below in fitted vs. residuals plots (like the top left plot in the figure above).

These are the type of idealized examples usually shown. I think it's important to show these perfect examples of problems but I wish I could get expert opinions on more subtle, realistic examples. These figures are often challenging to interpret because the density of points also changes along the x-axis. I don't have a good example of this but will add one in when I get one. Instead I will show some diagnostic plots that I've generated as part of a recent attempt to fit a Generalized Linear Mixed Model (GLMM) to problematic count data.

The assumption of normality (upper left) is probably sufficiently met. However, the plot of the fitted vs. residuals (upper right) seems to have more variation at mid-level values compared with the low or high fitted values. Is this pattern enough to be problematic and suggest a poor model fit? Is it driven by greater numbers of points at mid-level fitted values? I'm not sure. The diagonal dense line of points is generated by the large number of zeros in the dataset. My model does seem to have some problem fitting the zeros. I have two random effects in my GLMM. The residuals across plots (5 independent sites/subjects on which the data were repeatedly measured - salamanders were counted on the same 5 plots repeatedly over 4 years) don't show any pattern. However, there is heterogeneity in residuals among years (bottom right). This isn't surprising given that I collected much more data over a greater range of conditions in some years. This is a problem for the model and this variation will need to be modeled better.

So I refit the model and came up with these plots (different plots for further discussion rather than direct comparison):

Here you can see considerable deviation from normality for the overall model (upper left) but okay normality within plots (lower right). The upper right plot is an okay example of what I was talking about with changes in density making interpretation difficult. There are far more points at lower values and a sparsity of points at very high fitted values. The eye is often pulled in the direction of the few points on the right, creating difficulty in interpretation. To help with this I like to add a loess smoother or smoothing spline (solid line) and a horizontal line at zero (broken line). The smoothing line should be approximately straight and horizontal around zero. Basically it should overlay the horizontal zero line. Here's the code to do it in R for a fitted linear mixed model (lme1):
plot(fitted(lme1), residuals(lme1),
  xlab = "Fitted Values", ylab = "Residuals")
  abline(h=0, lty=2)
  lines(smooth.spline(fitted(lme1), residuals(lme1)))

This also helps determine if the points are symmetrical around zero. I often also find it useful to plot the absolute value of the residuals against the fitted values. This helps visualize if there is a trend in direction (bias). It can also help to better see changes in the spread of the residuals, indicating heterogeneity. The bias can be detected with a sloping loess or smoothing spline. In the lower left plot, you can see little evidence of bias but some evidence of heterogeneity (change in spread of points). Again, I am not sure if this is bad enough to invalidate the model, but in combination with the deviation from normality I would reject the fit of this model.
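Both diagnostic plots can be reproduced on simulated data. This is a minimal sketch that assumes an ordinary linear model (`fit`, made up here) rather than the mixed model lme1, and it uses lowess for the smoother. The `pdf(NULL)` call opens a throwaway device so the code runs non-interactively; remove it to view the plots.

```r
# Simulated data with constant variance, so both smoothers should be roughly flat
set.seed(1)
x <- runif(200, 0, 10)
y <- 1 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

res <- residuals(fit)
fv  <- fitted(fit)

pdf(NULL)  # throwaway graphics device; remove to draw on screen
# Residuals vs fitted: the smoother should track the dashed zero line
plot(fv, res, xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(fv, res))

# Absolute residuals vs fitted: a sloping smoother suggests changing spread
plot(fv, abs(res), xlab = "Fitted Values", ylab = "|Residuals|")
lines(lowess(fv, abs(res)))
invisible(dev.off())
```

Changing the simulation so that the noise grows with x (for example `rnorm(200, sd = 0.2 * x)`) gives a useful reference for what a sloping smoother in the second plot looks like.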

In a mixed model it can be important to look at variation across the values of the random effects. In my case here is an example of fitted vs. residuals for each of the plots (random sites/subjects). I used the following code, which takes advantage of the lattice package in R.
# Check for residual pattern within groups and difference between groups
library(lattice)
xyplot(residuals(glmm1) ~ fitted(glmm1) | Count$plot, main = "glmm1 - full model by plot",
  panel=function(x, y){
    panel.xyplot(x, y)
    panel.loess(x, y, span = 0.75)
    panel.lmline(x, y, lty = 2)  # Least squares broken line
  })

And here is another way to visualize a mixed model:

You can see that the variation in the two random effects (Plot and Year) is much better in this model, but there are problems with normality and potentially heterogeneity. Since violations of normality are of less concern than the other assumptions, I wonder if this model is completely invalid or if I could make some inference from it. I don't know and would welcome expert opinion.

Regardless, this model was fit using a poisson GLMM and the deviance divided by the residual degrees of freedom (df) was 5.13, which is much greater than 1, indicating overdispersion. Therefore, I tried to fit the regression using a negative binomial distribution:

# Using glmmPQL via MASS package
library(MASS)

# recommended to run model first as non-mixed to get a starting value for the theta estimate:
glmNB1 <- glm.nb(count ~ cday + cday2 + cSoilT + cSoilT2 + cRainAmt24 + cRainAmt242 + RHpct + soak24 + windspeed, data = Count, na.action = na.omit)

# Now run full GLMM with initial theta starting point from glm
glmmPQLnb1 <- glmmPQL(count ~ cday + cday2 + cSoilT + cSoilT2 + cRainAmt24 + cRainAmt242 + RHpct + soak24 + windspeed, random = list(~1 | plot, ~1 | year), data = Count, family = negative.binomial(theta = 1.480, link = log), na.action = na.exclude)

Unfortunately, I got the following validation plots:
Clearly, this model doesn't work for the data. It is quite surprising given the fit of the Poisson model, and given that the negative binomial is a more general distribution than the Poisson and usually handles overdispersed count data well. I'm not sure what the problem is in this case.
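As an aside, the overdispersion check mentioned earlier (deviance divided by residual degrees of freedom) is easy to try on simulated data. This sketch assumes an ordinary Poisson GLM rather than a GLMM, and the counts are deliberately generated from a negative binomial distribution (all parameter values are made up) so that the ratio comes out well above 1.

```r
# Simulate overdispersed counts: negative binomial with a small size parameter
set.seed(123)
n <- 500
x <- runif(n)
y <- rnbinom(n, mu = exp(1 + x), size = 1)

# Fit a Poisson GLM that ignores the extra variation
fit.pois <- glm(y ~ x, family = poisson)

# Two common dispersion estimates; values near 1 are consistent with Poisson
disp.dev     <- deviance(fit.pois) / df.residual(fit.pois)
disp.pearson <- sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
c(deviance = disp.dev, pearson = disp.pearson)
```

The Pearson version is often preferred when many counts are small, but both point the same way here.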

Next I tried to run the model as if all observations were random:

glmmObs1 <- glmer(count ~ cday + cday2 + cSoilT + cSoilT2 + cRainAmt24 + cRainAmt242 + RHpct + soak24 + windspeed + (1 | plot) + (1 | year) + (1 | obs), data = Count, family = poisson)  # current lme4 fits non-Gaussian families with glmer(); older versions accepted lmer(..., family = poisson)

Again I end up with more problematic validation/diagnostic plots:

So that's about it for now. Hopefully this post helps some people with model validation and interpretation of fitted vs. residual plots. I would love to hear opinions regarding interpretation of residuals and when some pattern is too much and when it is acceptable. Let me know if you have examples of other more subtle residual plots.

Happy coding and may all your analyses run smoothly and provide clear interpretations!