
Monday, October 31, 2011

Plotting grouped data vs time with error bars in R

This is my first blog since joining R-bloggers. I’m quite excited to be part of this group and apologize if I bore any experienced R users with my basic blogs for learning R or offend programmers with my inefficient, sloppy coding. Hopefully writing for this most excellent community will help improve my R skills while helping other novice R users.

I have a dataset from my dissertation research where I repeatedly counted salamanders from some plots and removed salamanders from other plots. I was interested in plotting the captures over time (sensu Hairston 1987). As with all statistics and graphs in R, there are a variety of ways to create the same or similar output. One challenge I always have is dealing with dates; another is plotting error bars. Now I had the challenge of plotting the average captures per night or month ± SE for two groups (reference and depletion plots – 5 of each on each night) vs. time. This is the type of thing that could be done in 5 minutes on graph paper by hand, but it took me a while in R and I'm still tweaking various plots. Below I explore a variety of ways to handle and plot these data. I hope it's helpful for others. Chime in with comments if you have any suggestions or similar experiences.


> ####################################################################
> # Summary of Count and Removal Data
> # Dissertation Project 2011
> # October 25, 2011
> # Daniel J. Hocking
> ####################################################################
>
> Data <- read.table('/Users/Dan/…/AllCountsR.txt', header = TRUE, na.strings = "NA")
>
> str(Data)
'data.frame':            910 obs. of  32 variables:
 $ dates            : Factor w/ 91 levels "04/06/09","04/13/11",..: 12 14 15 16 17 18 19 21 22 23 ...


You'll notice that when importing a text file created in Excel with the default date format, R treats the date variable as a Factor within the data frame. We need to convert it to a date form that R can recognize. Two such built-in functions are as.Date and as.POSIXct. The latter produces the more common format and is the one I chose to use (the two are very similar but not fully interchangeable). To get the dates into POSIXct form in this case I use the strptime function as seen below (strptime actually returns POSIXlt; it becomes POSIXct when added to the data frame). I also create a couple of new columns of day, month, and year in case they become useful for aggregating or summarizing the data later.


>
> dates <-strptime(as.character(Data$dates), "%m/%d/%y")  # Change date from excel storage to internal R format
>
> dim(Data)
[1] 910  32
> Data = Data[,2:32]               # remove the original dates (factor) column
> Data = data.frame(date = dates, Data)      #this is now the date in useful fashion
>
> Data$mo <- strftime(Data$date, "%m")
> Data$mon <- strftime(Data$date, "%b")
> Data$yr <- strftime(Data$date, "%Y")
>
> monyr <- function(x)
+ {
+     x <- as.POSIXlt(x)
+     x$mday <- 1
+     as.POSIXct(x)
+ }
>
> Data$mo.yr <- monyr(Data$date)
>
> str(Data)
'data.frame':            910 obs. of  36 variables:
 $ date             : POSIXct, format: "2008-05-17" "2008-05-18" ...
 $ mo               : chr  "05" "05" "05" "05" ...
 $ mon              : chr  "May" "May" "May" "May" ...
 $ yr               : chr  "2008" "2008" "2008" "2008" ...
 $ mo.yr            : POSIXct, format: "2008-05-01 00:00:00" "2008-05-01 00:00:00" ...
>


As you can see, the date is now in the internal R date format of POSIXct (YYYY-MM-DD).


Now I use a custom function to summarize each night of counts and removals. I forgot offhand how to source a custom function stored elsewhere, so I lazily pasted it into my script. I found this nice little function online, but I apologize to the author because I don't remember where I found it.

> library(ggplot2)
> library(doBy)
>
> ## Summarizes data.
> ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
> ## If there are within-subject variables, calculate adjusted values using method from Morey (2008).
> ##   measurevar: the name of a column that contains the variable to be summarized
> ##   groupvars: a vector containing names of columns that contain grouping variables
> ##   na.rm: a boolean that indicates whether to ignore NA's
> ##   conf.interval: the percent range of the confidence interval (default is 95%)
> summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
+     require(doBy)
+
+     # New version of length which can handle NA's: if na.rm==T, don't count them
+     length2 <- function (x, na.rm=FALSE) {
+         if (na.rm) sum(!is.na(x))
+         else       length(x)
+     }
+
+     # Collapse the data
+     formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
+     datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)
+
+     # Rename columns
+     names(datac)[ names(datac) == paste(measurevar, ".mean",    sep="") ] <- measurevar
+     names(datac)[ names(datac) == paste(measurevar, ".sd",      sep="") ] <- "sd"
+     names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"
+    
+     datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
+    
+     # Confidence interval multiplier for standard error
+     # Calculate t-statistic for confidence interval:
+     # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
+     ciMult <- qt(conf.interval/2 + .5, datac$N-1)
+     datac$ci <- datac$se * ciMult
+    
+     return(datac)
+ }
>
> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
> gCount <- summarySE(Count, measurevar="count", groupvars=c("date","trt"))


Now I’m ready to plot. I’ll start with a line graph of the two treatment plot types.

> ### Line Plot ###
> # Ref: Quick-R
>
> # convert factor to numeric for convenience
> gCount$trtcode <- as.numeric(gCount$trt)
> ntrt <- max(gCount$trtcode)
>
> # ranges for x and y axes
> xrange <- range(gCount$date)
> yrange <- range(gCount$count)
>
> # Set up blank plot
> plot(gCount$date, gCount$count, type = "n",
+      xlab = "Date",
+      ylab = "Mean number of salamanders per night")
>
> # add lines
> color <- 1:ntrt
> line <- 1:ntrt
> for (i in 1:ntrt){
+   Treatment <- subset(gCount, gCount$trtcode==i)
+   lines(Treatment$date, Treatment$count, col = color[i])#, lty = line[i])
+ }





As you can see this is not attractive, but it does show that I generally caught more salamanders in the reference (red line) plots. This makes it apparent that a line graph is probably a bit messy for these data and might even be a bit misleading, because the data were not collected continuously or at even intervals (collection was weather and season dependent). So, I tried a plot of just the points using the scatterplot function from the "car" package.

> ### package car scatterplot by groups ###
> library(car)

>
> # Plot
> scatterplot(count ~ date + trt, data = gCount,
+             smooth = FALSE, grid = FALSE, reg.line = FALSE,
+    xlab="Date", ylab="Mean number of salamanders per night")






This was nice because it was very simple to code. It includes points from every night but I would still like to summarize it more. Before I get to that, I would like to try having breaks between the years. The lattice package should be useful for this.

> library(lattice)
>
> # Add year, month, and day to dataframe
> chardate <- as.character(gCount$date)
> splitdate <- strsplit(chardate, split = "-")
> gCount$year <- as.numeric(unlist(lapply(splitdate, "[", 1)))
> gCount$month <- as.numeric(unlist(lapply(splitdate, "[", 2)))
> gCount$day <- as.numeric(unlist(lapply(splitdate, "[", 3)))
>
>
> # Plot
> xyplot(count ~ trt + date | year,
+ data = gCount,
+ ylab="Daily salamander captures", xlab="date",
+ pch = seq(1:ntrt),
+ scales=list(x=list(alternating=c(1, 1, 1))),
+ between=list(y=1),
+ par.strip.text=list(cex=0.7),
+ par.settings=list(axis.text=list(cex=0.7)))





Obviously there is a problem with this: I am not getting proper overlaying of the two treatments. I tried adjusting the formula (e.g. count ~ month | year*trt), but nothing was that enticing and I decided to go back to other plotting functions. The lattice package is still great for trellis plots and for visualizing mixed effects models.

 I now decided to summarize the data by month rather than by day and add standard error bars. This goes back to using the base plot function.

> ### Line Plot ###
> # Ref: http://personality-project.org/r/r.plottingdates.html
>
> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
> mCount <- summarySE(Count, measurevar="count", groupvars=c("mo.yr","trt"))
> refmCount <- subset(mCount, mCount$trt == "Reference")
> depmCount <- subset(mCount, mCount$trt == "Depletion")
>
> daterange=c(as.POSIXct(min(mCount$mo.yr)),as.POSIXct(max(mCount$mo.yr)))
>
> # determine the lowest and highest months
> ylims <- c(0, max(mCount$count + mCount$se))
> r <- as.POSIXct(range(refmCount$mo.yr), "month")
>
> plot(refmCount$mo.yr, refmCount$count, type = "n", xaxt = "n",
+      xlab = "Date",
+      ylab = "Mean number of salamanders per night",
+      xlim = c(r[1], r[2]),
+      ylim = ylims)
> axis.POSIXct(1, at = seq(r[1], r[2], by = "month"), format = "%b")
> points(refmCount$mo.yr, refmCount$count, type = "p", pch = 19)
> points(depmCount$mo.yr, depmCount$count, type = "p", pch = 24)
> arrows(refmCount$mo.yr, refmCount$count+refmCount$se, refmCount$mo.yr, refmCount$count-refmCount$se, angle=90, code=3, length=0)
> arrows(depmCount$mo.yr, depmCount$count+depmCount$se, depmCount$mo.yr, depmCount$count-depmCount$se, angle=90, code=3, length=0)





Now that's a much better visualization of the data, and that's the whole goal of a figure for publication. The only thing I might change is that I might plot by year with labels of Month-Year (format = "%b %Y"). I might add a legend, but with only two treatments I might just include the info in the figure description.
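For instance, the Month-Year axis labeling might look like the sketch below. The data here are simulated stand-ins; in the post the x and y values would be refmCount$mo.yr and refmCount$count.

```r
# Sketch: label a POSIXct x-axis as "Month Year" (format = "%b %Y"),
# using made-up monthly data in place of the summarized counts.
mo.yr <- seq(as.POSIXct("2008-05-01"), as.POSIXct("2009-10-01"), by = "month")
count <- runif(length(mo.yr), 0, 10)
plot(mo.yr, count, xaxt = "n", pch = 19,
     xlab = "Date", ylab = "Mean number of salamanders per night")
axis.POSIXct(1, at = seq(min(mo.yr), max(mo.yr), by = "3 months"),
             format = "%b %Y")
```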

Although that is probably going to be my final product for my current purposes, I wanted to explore a few other graphing options for visualizing this data. The first is to use box plots. I use the add = TRUE option to add a second group after subsetting the data.

> ### Boxplot ###
> # Ref: http://personality-project.org/r/r.plottingdates.html
>
> #    as.POSIXlt(date)$mon     #gives the months in numeric order mod 12 with January = 0 and December = 11
>
> refboxplot <- boxplot(count ~ date, data = Count, subset = trt == "Reference",
+                       ylab = "Mean number of salamanders per night",
+                       xlab = "Date")   #show the graph and save the data
> depboxplot <- boxplot(count ~ date, data = Count, subset = trt == "Depletion", col = 2, add = TRUE)
>

Clearly this is a mess and not useful. But you can imagine that with some work and summarizing by month or season it could be a useful and efficient way to present the data. Next I tried the popular package ggplot2.
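Before moving on to ggplot2, here is a quick sketch of that by-month boxplot idea, with simulated counts standing in for the real Count data frame (the mon, trt, and count columns below are made up to match the post's naming).

```r
# Sketch with simulated data: boxplots of nightly counts grouped by month,
# one treatment overlaid on the other via add = TRUE (as in the post).
set.seed(1)
Count <- data.frame(
  mon   = factor(rep(month.abb[5:9], each = 40), levels = month.abb[5:9]),
  trt   = rep(c("Reference", "Depletion"), 100),
  count = rpois(200, lambda = 4)
)
boxplot(count ~ mon, data = Count, subset = trt == "Reference",
        xlab = "Month", ylab = "Salamanders per night")
boxplot(count ~ mon, data = Count, subset = trt == "Depletion",
        col = 2, add = TRUE, xaxt = "n")
```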

> ### ggplot ###
> # Refs: http://learnr.wordpress.com/2010/02/25/ggplot2-plotting-dates-hours-and-minutes/
> # http://had.co.nz/ggplot2/
> # http://wiki.stdout.org/rcookbook/Graphs/Plotting%20means%20and%20error%20bars%20%28ggplot2%29
>  
> library(ggplot2)
>
> ggplot(data = gCount, aes(x = date, y = count, group = trt)) +
+     #geom_point(aes(shape = factor(trt))) +
+     geom_point(aes(colour = factor(trt), shape = factor(trt)), size = 3) +
+     #geom_line() +
+     geom_errorbar(aes(ymin=count-se, ymax=count+se), width=.1) +
+     #geom_line() +
+    # scale_shape_manual(values=c(24,21)) + # explicitly have sham=fillable triangle, ACCX=fillable circle
+     #scale_fill_manual(values=c("white","black")) + # explicitly have sham=white, ACCX=black
+     xlab("Date") +
+     ylab("Mean number of salamander captures per night") +
+     scale_colour_hue(name="Treatment", # Legend label, use darker colors
+                      l=40) +                  # Use darker colors, lightness=40
+     theme_bw() + # make the theme black-and-white rather than grey (do this before font changes, or it overrides them)
+
+     opts(legend.position=c(.2, .9), # Position legend inside This must go after theme_bw
+           panel.grid.major = theme_blank(), # switch off major gridlines
+           panel.grid.minor = theme_blank(), # switch off minor gridlines
+          legend.title = theme_blank(), # switch off the legend title
+          legend.key = theme_blank()) # switch off the rectangle around symbols in the legend


This plot could work with some fine tuning, especially with the legend(s) but you get the idea. It wasn’t as easy for me as the plot function but ggplot is quite versatile and probably a good package to have in your back pocket for complicated graphing.

Next up was the gplots package for the plotCI function.

> library(gplots)
>
> plotCI(
+   x = refmCount$mo.yr,
+   y = refmCount$count,
+   uiw = refmCount$se, # error bar length (default is to put this much above and below point)
+   pch = 24, # symbol (plotting character) type: see help(pch); 24 = filled triangle pointing up
+   pt.bg = "white", # fill colour for symbol
+   cex = 1.0, # symbol size multiplier
+   lty = 1, # error bar line type: see help(par) - not sure how to change plot lines
+   type = "p", # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)
+   gap = 0, # distance from symbol to error bar
+   sfrac = 0.005, # width of error bar as proportion of x plotting region (default 0.01)
+   xlab = "Year", # x axis label
+   ylab = "Mean number of salamanders per night",
+   las = 1, # axis labels horizontal (default is 0 for always parallel to axis)
+   font.lab = 1, # 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
+   xaxt = "n") # Don't print x-axis
>   axis.POSIXct(1, at = seq(r[1], r[2], by = "year"), format = "%b %Y") # label the x axis by month-years
>
> plotCI(
+   x=depmCount$mo.yr,
+   y = depmCount$count,
+   uiw=depmCount$se, # error bar length (default is to put this much above and below point)
+   pch=21, # symbol (plotting character) type: see help(pch); 21 = circle
+   pt.bg="grey", # fill colour for symbol
+   cex=1.0, # symbol size multiplier
+   lty=1, # line type: see help(par)
+   type="p", # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)
+   gap=0, # distance from symbol to error bar
+   sfrac=0.005, # width of error bar as proportion of x plotting region (default 0.01)
+   xaxt = "n",
+   add=TRUE # ADD this plot to the previous one
+ )


Now this is a nice figure. I must say that I like this. It is very similar to the standard plot code and graph but it was a little easier to add the error bars.

So that’s it, what a fun weekend I had! Let me know what you think or if you have any suggestions. I’m new to this and love to learn new ways of coding things in R.

Thursday, October 6, 2011

Assumptions of the Linear Model

Linear Assumptions from the Analysis Factor - Assumptions of linear regression (and ANOVA) are about the residuals, not the normality or independence of the response variable (Y). If you don't know what this means, be sure to read this brief blog article.
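As a minimal illustration with a built-in dataset: the checks are done on the residuals of the fitted model, not on Y itself.

```r
# The assumptions concern the residuals, so check them after fitting.
fit <- lm(mpg ~ wt, data = mtcars)
qqnorm(residuals(fit))               # normality of residuals
qqline(residuals(fit))
plot(fitted(fit), residuals(fit))    # look for constant spread, no pattern
abline(h = 0, lty = 2)
```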

Monday, July 18, 2011

Model Validation: Interpreting Residual Plots

When conducting any statistical analysis it is important to evaluate how well the model fits the data and whether the data meet the assumptions of the model. There are numerous ways to do this and a variety of statistical tests to evaluate deviations from model assumptions. However, there is little general acceptance of any of the statistical tests. Generally statisticians (which I am not, but I do my best impression) examine various diagnostic plots after running their regression models. There are a number of good sources of information on how to do this. My recommendation is Fox and Weisberg's An R Companion to Applied Regression (Ch. 6). You can refer to Fox's book, Applied Regression Analysis and Generalized Linear Models, for the theory and details behind these plots, but the corresponding R book is more of a "how to" guide. A very brief but good introduction to checking linear model assumptions can be found here.

The point of this post isn't to go over the details or theory but rather discuss one of the challenges that I and others have had with interpreting these diagnostic plots. Without going into the differences between standardized, studentized, Pearson's and other residuals, I will say that most of the model validation centers around the residuals (essentially the distance of the data points from the fitted regression line). Here is an example from Zuur and Colleagues' excellent book, Mixed Effects Models and Extensions in Ecology with R:


So these residuals appear to exhibit homogeneity, normality, and independence. Those are pretty clear, although I'm not sure if the variation in residuals associated with the predictor (independent) variable Month is a problem; it might indicate heterogeneity. Most books just show a few examples like this and then residuals with clear patterning, most often increasing residual values with increasing fitted values (i.e. large values of the response/dependent variable result in greater variation, which is often corrected with a log transformation). A good example of this can be seen in (d) below in fitted vs. residuals plots (like the top left plot in the figure above).

These are the type of idealized examples usually shown. I think it's important to show these perfect examples of problems but I wish I could get expert opinions on more subtle, realistic examples. These figures are often challenging to interpret because the density of points also changes along the x-axis. I don't have a good example of this but will add one in when I get one. Instead I will show some diagnostic plots that I've generated as part of a recent attempt to fit a Generalized Linear Mixed Model (GLMM) to problematic count data.


The assumption of normality (upper left) is probably sufficiently met. However, the plot of the fitted vs. residuals (upper right) seems to have more variation at mid-level values compared with the low or high fitted values. Is this pattern enough to be problematic and suggest a poor model fit? Is it driven by greater numbers of points at mid-level fitted values? I'm not sure. The diagonal dense line of points is generated by the large number of zeros in the dataset; my model does seem to have some problem fitting the zeros. I have two random effects in my GLMM. The residuals across plots (5 independent sites/subjects on which the data were repeatedly measured - salamanders were counted on the same 5 plots repeatedly over 4 years) don't show any pattern. However, there is heterogeneity in residuals among years (bottom right). This isn't surprising given that I collected much more data over a greater range of conditions in some years. This is a problem for the model, and this variation will need to be modeled better.

So I refit the model and came up with these plots (different plots for further discussion rather than direct comparison):

Here you can see considerable deviation from normality for the overall model (upper left) but okay normality within plots (lower right). The upper right plot is a decent example of what I was talking about with changes in point density making interpretation difficult. There are far more points at lower fitted values and a sparsity of points at very high fitted values. The eye is often pulled in the direction of the few points on the right, creating difficulty in interpretation. To help with this I like to add a loess smoother or smoothing spline (solid line) and a horizontal line at zero (broken line). The smoothing line should be approximately straight and horizontal around zero; basically it should overlay the horizontal zero line. Here's the code to do it in R for a fitted linear mixed model (lme1):
plot(fitted(lme1), residuals(lme1),
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(smooth.spline(fitted(lme1), residuals(lme1)))


This also helps determine if the points are symmetrical around zero. I often also find it useful to plot the absolute value of the residuals against the fitted values. This helps visualize whether there is a trend in direction (bias), and it can also make changes in the spread of the residuals (heterogeneity) easier to see. Bias shows up as a sloping loess or smoothing spline. In the lower left plot, you can see little evidence of bias but some evidence of heterogeneity (change in spread of points). Again, I am not sure if this is bad enough to invalidate the model, but in combination with the deviation from normality I would reject the fit of this model.
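A sketch of that absolute-residual plot, using a simple built-in example in place of the mixed model:

```r
# Plot |residuals| vs fitted values; a sloping smooth line suggests bias,
# an increasing spread suggests heterogeneity. 'fit' stands in for lme1.
fit <- lm(dist ~ speed, data = cars)
plot(fitted(fit), abs(residuals(fit)),
     xlab = "Fitted Values", ylab = "|Residuals|")
lines(smooth.spline(fitted(fit), abs(residuals(fit))))
```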

In a mixed model it can be important to look at variation across the values of the random effects. In my case here is an example of fitted vs. residuals for each of the plots (random sites/subjects). I used the following code, which takes advantage of the lattice package in R.
# Check for residual pattern within groups and difference between groups     
xyplot(residuals(glmm1) ~ fitted(glmm1) | Count$plot, main = "glmm1 - full model by plot",
  panel=function(x, y){
    panel.xyplot(x, y)
    panel.loess(x, y, span = 0.75)
    panel.lmline(x, y, lty = 2)  # Least squares broken line
  }
)



And here is another way to visualize a mixed model:

You can see that the variation in the two random effects (Plot and Year) is much better in this model, but there are problems with normality and potentially heterogeneity. Since violations of normality are of less concern than the other assumptions, I wonder if this model is completely invalid or if I could make some inference from it. I don't know and would welcome expert opinion.
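One more diagnostic worth computing for a poisson model is the dispersion statistic (residual deviance divided by residual degrees of freedom). A sketch with a built-in dataset:

```r
# Overdispersion check: residual deviance / residual df should be near 1
# for a well-fitting poisson model; values well above 1 suggest
# overdispersion.
fit <- glm(count ~ spray, data = InsectSprays, family = poisson)
dispersion <- deviance(fit) / df.residual(fit)
dispersion
```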

Regardless, this model was fit using a poisson GLMM and the deviance divided by the residual degrees of freedom (df) was 5.13, which is much greater than 1, indicating overdispersion. Therefore, I tried to fit the regression using a negative binomial distribution:

# Using glmmPQL via the MASS package
library(MASS)

# Recommended: run the model first as non-mixed (negative binomial glm)
# to get a starting value for the theta estimate:
glmNB1 <- glm.nb(count ~ cday + cday2 + cSoilT + cSoilT2 + cRainAmt24 +
                   cRainAmt242 + RHpct + soak24 + windspeed,
                 data = Count, na.action = na.omit)
summary(glmNB1)
# anova(glmNB1)
# plot(glmNB1)

# Now run the full GLMM with the initial theta starting point from the glm
glmmPQLnb1 <- glmmPQL(count ~ cday + cday2 + cSoilT + cSoilT2 + cRainAmt24 +
                        cRainAmt242 + RHpct + soak24 + windspeed,
                      random = list(~1 | plot, ~1 | year), data = Count,
                      family = negative.binomial(theta = 1.480, link = log),
                      na.action = na.exclude)


Unfortunately, I got the following validation plots:
Clearly, this model doesn't work for the data. That is quite surprising given the fit of the poisson model, and given that the negative binomial is a more general distribution than the poisson and usually handles overdispersed count data well. I'm not sure what the problem is in this case.

Next I tried to run the model as if all observations were random:

glmmObs1 <- lmer(count ~ cday + cday2 + cSoilT + cSoilT2 + cRainAmt24 +
                   cRainAmt242 + RHpct + soak24 + windspeed +
                   (1 | plot) + (1 | year) + (1 | obs),
                 data = Count, family = poisson)

Again I end up with more problematic validation/diagnostic plots:

So that's about it for now. Hopefully this post helps some people with model validation and interpretation of fitted vs. residual plots. I would love to hear opinions regarding interpretation of residuals and when some pattern is too much and when it is acceptable. Let me know if you have examples of other more subtle residual plots.

Happy coding and may all your analyses run smoothly and provide clear interpretations!

Thursday, July 7, 2011

GLMM Hell

I have been starting to analyze some data I have of repeated counts of salamanders from 5 plots over 4 years. I am trying to develop a predictive model of salamander nighttime surface activity as a function of weather variables. The repeated counting leads to the need for Generalized Linear Mixed Models (GLMM). Count data are often best described with a Poisson distribution, hence the "generalized" term. Because the counts were repeated on the same plots, plot needs to be treated as a random effect. If the plot term were not included in this way it would suggest that all the counts were independent, but in reality counts on one plot over time are likely to have some correlation that needs to be accounted for to avoid pseudoreplication. So I am stuck with a GLMM. The problem with GLMMs in a frequentist statistical framework is that they are difficult to analyze. Bolker and colleagues give the best overview of the analysis process and its challenges in: Generalized Linear Mixed Models: A Practical Guide for Ecology and Evolution. They also have an online supplement to that paper that provides a worked example complete with R code using the lme4 package. I HIGHLY recommend everyone read Bolker's paper if considering using GLMMs. Personally, I like the idea of analyzing GLMMs with Bayesian statistics rather than traditional frequentist stats. Below are a few emails that I've recently been exchanging with colleagues regarding GLMMs. Let me know what you think.
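For anyone who hasn't seen the lme4 syntax, here is a minimal sketch of a poisson GLMM with plot as a random intercept. The data and the weather covariate (temp) are simulated and the names are made up; note that current lme4 uses glmer() for GLMMs, while at the time of this post lmer() accepted a family argument.

```r
# Sketch: poisson GLMM with a random intercept for plot, on simulated data.
library(lme4)
set.seed(42)
d <- data.frame(plot = factor(rep(1:5, each = 40)), temp = rnorm(200))
re <- rnorm(5, sd = 0.3)                       # plot-level random intercepts
d$count <- rpois(200, exp(0.5 + 0.3 * d$temp + re[d$plot]))
glmm1 <- glmer(count ~ temp + (1 | plot), data = d, family = poisson)
summary(glmm1)
```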



Question About Selection of Correlated Predictor Variables and Model Selection:
How much correlation among independent variables is too much in a GLMM? If I have correlation in the variables, does it affect the interpretation or model selection?

Answer from a Statistician Friend:
0.8 and above is high; often one variable can be replaced by the other, and both are not necessary in the model.

Below 0.7, typically both variables are needed for a good model fit. I usually use stepAIC (from the MASS package in R) for model selection.

The difficulty comes in interpreting the regression coefficients: with correlation in the predictor variables, the variable that appears first in the model statement usually gets the larger absolute value, whereas the other variable has a smaller (in absolute value) coefficient. Remember the interpretation of regression coefficients: the change in the response per unit increase GIVEN ALL THE OTHER VARIABLES IN THE MODEL.

If you want coefficients that represent "additive" contributions to the variation in the response (regardless of the order in which predictors appear in the model statement), and if you have considerable multicollinearity, you might want to consider doing a principal component regression with all or perhaps with only a subgroup of similar predictor variables.

As with most issues in statistics, there is no clear-cut, hard-fact, simple answer. Life would be simpler if there were....

Question of GLMM Bayesian Approach:
Hey Dan - I'm using GLMM b/c I have a repeated-measures design, count data response (negative binomial distribution), etc. I'm finding admb in R is doing the job - and I read the article you mentioned a few months back, when I started considering GLMMs...

I have never worked with Bayesian stats and wouldn't even know where to begin. Do you have any recommendations for overview reading, and can I analyze a repeated-measures design (i.e., is there a way to cope with random factors)?

My Response:
My data sound very similar to yours. I usually use lmer in the lme4 package. Right now I am essentially copying the code from the online supplements of the Bolker et al. 2009 TREE paper previously mentioned. I had never seen the admb package and will have to check it out. I've tried glmmPQL and glmmML, but there are more examples for lmer and its S-Plus predecessor. I am annoyed that in Zuur et al.'s "Mixed Effects Models and Extensions in Ecology with R" they don't spend much time on model assumptions or model comparison; I feel like they show users how to do the analysis but not how to evaluate it. Pinheiro and Bates do a better job in "Mixed-Effects Models in S and S-Plus," but they focus on linear and non-linear mixed models and less on GLMM. Plus the code is similar to but differs enough from R that it can be challenging to use at times. The "SAS for Mixed Models" book is good, but SAS isn't free and the code isn't as transparent to me. Plus it doesn't have good graphics, so I prefer R.

Anyway, Bayesian stats have their own can of worms but I find it more intuitively appealing and I like the transparency in the code using WinBUGS (no Mac version) called from R. There are two very good, practical books to get started. McCarthy presents a good overview and introduction to bayesian stats in "Bayesian Methods for Ecology" but the examples don't get very advanced. Personally I recommend getting that from the library and reading the first few chapters. I would then buy Marc Kery's excellent book, "Introduction to WinBUGS for Ecologists." It is very well written and has a wider range of examples that typically relate to many animal ecology studies. Clark and Gelfand have a decent modeling book with Bayesian analysis in R examples but it's more ecosystem/environmentally oriented than animal ecology.

Bayesian analysis treats all factors sort of like random variables from population distributions, so there is no need for an explicit random vs. fixed delineation. You get estimates and credibility intervals for all variables. You can essentially write the same GLMM model and then analyze it in a Bayesian framework. The big difference is in the philosophy behind frequentist vs. Bayesian statistics: Bayesians use prior information (even noninformative priors contain information on the underlying distributions). Some scientists are opposed to this, but for various reasons that I won't go into now, I like it. Some people do want a sensitivity analysis to go along with a Bayesian analysis to determine the influence of the priors. I might go as far as to say that for GLMM-type data Bayesian statistics are more sound (robust?) than frequentist methods, but they differ significantly from a philosophical standpoint.

Anyway, I hope that helps.



Tuesday, March 22, 2011

Tuesday, March 8, 2011

New R IDE

I'm always looking for ways to improve my workflow and overall academic efficiency. I've tried a variety of text editors, GUIs, and integrated development environments (IDEs) for R. I have some preferences but I haven't found anything that I'm completely happy with. I just heard about a new one from RStudio and am going to give it a try. I'll let you know what I think. You can find some info from another blog here: http://blog.rstudio.org/2011/02/28/rstudio-new-open-source-ide-for-r/

Monday, February 14, 2011

Reference and PDF management

Organizing and Managing PDFs and References
Organization and administration are an ever-increasing part of any academic's or researcher's life. They begin as a small piece of the job as an undergraduate or beginning master's student but grow almost exponentially over time. Luckily there are numerous programs to help manage the academic's electronic life. Below is a list of programs and some useful links. Here is a link to one rather negative review of a variety of programs. You can find another summary of programs here. Also see these recommendations and this link for a nice comparison of the programs.

Papers - organize your .pdf files like iTunes does for music (mac only). Here is a video showing how to integrate Papers and Endnote.  I love this program but have not figured out a smooth workflow (or combination of existing files) for Papers-to-Endnote.

Mendeley - A great reference manager that I recommend for people who aren't Mac users and therefore can't use Papers. It is better at linking to PDFs than Endnote.

Zotero - another excellent reference manager. I have less experience with this than the others but it seems very nice.

Endnote - citation manager that has been the standard for academics. Excellent for citations and you can link to the PDFs but it's not great for managing your PDFs like Papers or even Mendeley.

Bibdesk (Mac only, oriented for LaTeX)

Refworks - no experience with this one

Labmeeting - I haven't tried this yet since I just came across it but it looks interesting. Drop me a note if you have any experience with this.

*There are so many citation/reference management programs available now that it is difficult to choose one. Here is a Wiki with info on dozens of options. Personally, I am using Papers and Endnote. I love Papers, and I started with Endnote nearly a decade ago and haven't had the time/energy to fully migrate to another program. If I were to start now I'd probably try Mendeley for all of my reference management needs because it's free, open source, available for multiple operating systems, and can sync online so as to avoid being tied to a single computer. There are also good sharing options for lab meetings and collaborative projects. The group at Mendeley also seems to update frequently and really makes an effort to improve the product. The only downside is that I've heard it can be slow and the server is unavailable at times.

Please join the conversation and tell us about your experiences with these or other programs. Let's not all suffer in silence.

Monday, January 31, 2011

Introductory R Books

Here's a link to another blog compiling information and recommendations on introductory books for R (not statistics books that use R).  I thought this might be useful for people.

http://csgillespie.wordpress.com/2011/01/28/r-programming-books-updated/

Tuesday, January 18, 2011

R GUIs, IDEs, and text editors

When deciding what program to write your R code in, it is important to consider a number of factors so you can stay organized and work efficiently.  I found this article useful for getting started.  Text editors include syntax highlighting and other useful features, often including text auto-complete functions.  Integrated Development Environments (IDEs) include text editors but are much more powerful: they include file organization systems, integration among various systems, debugging, and more.  The power does come with the cost of more complexity.  When considering your options, it's important to also consider what other programming languages you will be using and whether you'll be using different operating systems.  I use a Mac and write primarily in R but sometimes use a PC and am learning more HTML and PHP.  I would also consider using LaTeX in the future.  Therefore, I need a flexible program that works on a variety of platforms.

Below is some information on programs that I have tried or am trying now.  These all support the R programming language.

  • JGR - not sure if I like it any better than the native R GUI for Mac
  • Komodo Edit - I've had some trouble with this on campus (socket server problem that I don't understand) but it is a fantastic program when it works
  • Eclipse and r-script plugin - this program didn't seem very user friendly but I am starting to try it and many people swear by it for a variety of programming purposes.
  • Tinn-R (Windows only) 
  • GNU Emacs - GNU Emacs is an extensible, customizable text editor—and more.  Resources can be found here     
  • Aquamacs Emacs - text editor based on GNU Emacs that works well on Mac OS X
  • Vim - text editor that works on most OS and for many languages including R

I hope that helps some people and I'd love to hear about your experiences, preferences, and recommendations.