I have a dataset from my dissertation
research were I repeatedly counted salamanders from some plots and removed
salamanders from other plots. I was interested in plotting the captures over
time (sensu Hairston 1987). As with all statistics and graphs in R, there are a
variety of ways to create the same or similar output. One challenge I always
have is dealing with dates. Another challenge is plotting error bars on plots.
Now I had the challenge of plotting the average captures per night or month ±
SE for two groups (reference and depletion plots – 5 of each on each night) vs.
time. This is the type of thing that could be done in 5 minutes on graph paper
by hand but took me a while in R and I’m still tweaking various plots. Below I
explore a variety of ways to handle and plot this data. I hope it’s helpful for
others. Chime in with comments if you have any suggestions or similar
experiences.
> ####################################################################
> # Summary of Count and Removal
Data
> # Dissertation Project 2011
> # October 25, 2011
> # Daniel J. Hocking
>
####################################################################
>
> Data <- read.table('/Users/Dan/…/AllCountsR.txt',
header = TRUE, na.strings = "NA")
>
> str(Data)
'data.frame': 910
obs. of 32 variables:
$ dates
: Factor w/ 91 levels "04/06/09","04/13/11",..: 12
14 15 16 17 18 19 21 22 23 ...
You’ll notice that when importing a
text file created in excel with the default date format, R treats the date
variable as a Factor within the data frame. We need to convert it to a date
form that R can recognize. Two such built-in functions are as.Date and
as.POSIXct. The latter is a more common format and the one I choose to use
(both are very similar but not fully interchangeable). To get the data in the
POSIXct format in this case I use the strptime function as seen below. I also
create a couple new columns of day, month, and year in case they become useful
for aggregating or summarizing the data later.
>
> dates
<-strptime(as.character(Data$dates), "%m/%d/%y") # Change date from excel storage to
internal R format
>
> dim(Data)
[1] 910 32
> Data = Data[,2:32]
# remove
> Data = data.frame(date = dates,
Data)
#this is now the date in useful fashion
>
> Data$mo <- strftime(Data$date,
"%m")
> Data$mon <-
strftime(Data$date, "%b")
> Data$yr <- strftime(Data$date,
"%Y")
>
> monyr <- function(x)
+ {
+ x <- as.POSIXlt(x)
+ x$mday <- 1
+ as.POSIXct(x)
+ }
>
> Data$mo.yr <- monyr(Data$date)
>
> str(Data)
'data.frame': 910
obs. of 36 variables:
$ date
: POSIXct, format: "2008-05-17" "2008-05-18" ...
$ mo
: chr "05"
"05" "05" "05" ...
$ mon
: chr "May"
"May" "May" "May" ...
$ yr
: chr "2008"
"2008" "2008" "2008" ...
$ mo.yr
: POSIXct, format: "2008-05-01 00:00:00" "2008-05-01
00:00:00" ...
>
As you can see the date is now in the internal
R date form of POSIXct (YYYY-MM-DD).
Now I use a custom function to summarize
each night of counts and removals. I forgot offhand how to call to a custom
function stored elsewhere to I lazily pasted it in my script. I found this nice
little function online but I apologize to the author because I don’t remember
were I found it.
> library(ggplot2)
> library(doBy)
>
> ## Summarizes data.
> ## Gives count, mean, standard
deviation, standard error of the mean, and confidence interval (default 95%).
> ## If there are within-subject
variables, calculate adjusted values using method from Morey (2008).
> ## measurevar: the name of a column that contains the
variable to be summariezed
> ## groupvars: a vector containing names of columns that
contain grouping variables
> ## na.rm: a boolean that indicates whether to ignore NA's
> ## conf.interval: the percent range of the confidence
interval (default is 95%)
> summarySE <- function(data=NULL,
measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
+ require(doBy)
+
+ # New version of length which can handle
NA's: if na.rm==T, don't count them
+ length2 <- function (x, na.rm=FALSE) {
+ if (na.rm)
sum(!is.na(x))
+ else length(x)
+ }
+
+ # Collapse the data
+ formula <- as.formula(paste(measurevar,
paste(groupvars, collapse=" + "), sep=" ~ "))
+ datac <- summaryBy(formula, data=data,
FUN=c(length2,mean,sd), na.rm=na.rm)
+
+ # Rename columns
+ names(datac)[ names(datac) ==
paste(measurevar, ".mean", sep="") ] <- measurevar
+ names(datac)[ names(datac) ==
paste(measurevar, ".sd", sep="") ] <-
"sd"
+ names(datac)[ names(datac) ==
paste(measurevar, ".length2", sep="") ] <- "N"
+
+ datac$se <- datac$sd /
sqrt(datac$N) # Calculate standard
error of the mean
+
+ # Confidence interval multiplier for
standard error
+ # Calculate t-statistic for confidence
interval:
+ # e.g., if conf.interval is .95, use .975
(above/below), and use df=N-1
+ ciMult <- qt(conf.interval/2 + .5,
datac$N-1)
+ datac$ci <- datac$se * ciMult
+
+ return(datac)
+ }
>
> # summarySE provides the standard
deviation, standard error of the mean, and a (default 95%) confidence interval
> gCount <- summarySE(Count,
measurevar="count", groupvars=c("date","trt"))
>
Now I’m ready to plot. I’ll start with a
line graph of the two treatment plot types.
> ### Line Plot ###
> # Ref: Quick-R
>
> # convert factor to numeric for
convenience
> gCount$trtcode <-
as.numeric(gCount$trt)
> ntrt <- max(gCount$trtcode)
>
> # ranges for x and y axes
> xrange <- range(gCount$date)
> yrange <- range(gCount$count)
>
> # Set up blank plot
> plot(gCount$date, gCount$count, type
= "n",
+ xlab = "Date",
+ ylab = "Mean number of salamanders
per night")
>
> # add lines
> color <- seq(1:ntrt)
> line <- seq(1:ntrt)
> for (i in 1:ntrt){
+ Treatment <- subset(gCount, gCount$trtcode==i)
+ lines(Treatment$date, Treatment$count, col =
color[i])#, lty = line[i])
+ }
As you can see this is not attractive but
it does show that I generally caught more salamander in the reference (red
line) plots. This makes it apparent that a line graph is probably a bit messy
for this data and might even be a bit misleading because data was not collected
continuously or at even intervals (weather and season dependent collection).
So, I tried a plot of just the points using the scatterplot function from the
“car” package.
> ### package car scatterplot by
groups ###
> library(car)
>
> # Plot
> scatterplot(count ~ date + trt, data
= gCount,
+
smooth = FALSE, grid = FALSE, reg.line = FALSE,
+ xlab="Date", ylab="Mean number of
salamanders per night")
>
This was nice because it was very simple
to code. It includes points from every night but I would still like to
summarize it more. Before I get to that, I would like to try having breaks
between the years. The lattice package should be useful for this.
> library(lattice)
>
> # Add year, month, and day to
dataframe
> chardate <- as.character(gCount$date)
> splitdate <- strsplit(chardate,
split = "-")
> gCount$year <-
as.numeric(unlist(lapply(splitdate, "[", 1)))
> gCount$month <-
as.numeric(unlist(lapply(splitdate, "[", 2)))
> gCount$day <-
as.numeric(unlist(lapply(splitdate, "[", 3)))
>
>
> # Plot
> xyplot(count ~ trt + date | year,
+ data = gCount,
+ ylab="Daily salamander
captures", xlab="date",
+ pch = seq(1:ntrt),
+ scales=list(x=list(alternating=c(1, 1,
1))),
+ between=list(y=1),
+ par.strip.text=list(cex=0.7),
+ par.settings=list(axis.text=list(cex=0.7)))
>
Obviously there is a problem with this. I
am not getting proper overlaying of the two treatments. I tried adjusting the
equation (e.g. count ~ month | year*trt), but nothing was that enticing and I
decided to go back to other plotting functions. The lattice package is great
for trellis plots and visualizing mixed effects models.
I now decided to summarize the data by month rather than by
day and add standard error bars. This goes back to using the base plot function.
> ### Line Plot ###
> # Ref:
http://personality-project.org/r/r.plottingdates.html
>
> # summarySE provides the standard
deviation, standard error of the mean, and a (default 95%) confidence interval
> mCount <- summarySE(Count,
measurevar="count", groupvars=c("mo.yr","trt"))
> refmCount <- subset(mCount,
mCount$trt == "Reference")
> depmCount <- subset(mCount,
mCount$trt == "Depletion")
>
>
daterange=c(as.POSIXct(min(mCount$mo.yr)),as.POSIXct(max(mCount$mo.yr)))
>
> # determine the lowest and highest
months
> ylims <- c(0, max(mCount$count +
mCount$se))
> r <-
as.POSIXct(range(refmCount$mo.yr), "month")
>
> plot(refmCount$mo.yr,
refmCount$count, type = "n", xaxt = "n",
+ xlab = "Date",
+ ylab = "Mean number of
salamanders per night",
+ xlim = c(r[1], r[2]),
+ ylim = ylims)
> axis.POSIXct(1, at = seq(r[1], r[2],
by = "month"), format = "%b")
> points(refmCount$mo.yr,
refmCount$count, type = "p", pch = 19)
> points(depmCount$mo.yr,
depmCount$count, type = "p", pch = 24)
> arrows(refmCount$mo.yr,
refmCount$count+refmCount$se, refmCount$mo.yr, refmCount$count-refmCount$se,
angle=90, code=3, length=0)
> arrows(depmCount$mo.yr,
depmCount$count+depmCount$se, depmCount$mo.yr, depmCount$count-depmCount$se,
angle=90, code=3, length=0)
>
Now that’s a much better visualization of
the data and that’s the whole goal of a figure for publication. The only thing
I might change would be I might plot by year with the labels of Month-Year
(format = %b $Y). I might add a legend but with only two treatments I might
just include the info in the figure description.
Although that is probably going to be my
final product for my current purposes, I wanted to explore a few other graphing
options for visualizing this data. The first is to use box plots. I use the add
= TRUE option to add a second group after subsetting the data.
> ### Boxplot ###
> # Ref:
http://personality-project.org/r/r.plottingdates.html
>
> # as.POSIXlt(date)$mon #gives the months in numeric order mod 12
with January = 0 and December = 11
>
> refboxplot <- boxplot(count ~
date, data = Count, subset = trt == "Reference",
+
ylab = "Mean number of salamanders per night",
+
xlab = "Date")
#show the graph and save the data
> depboxplot <- boxplot(count ~
date, data = Count, subset = trt == "Depletion", col = 2, add = TRUE)
>
Clearly this is a mess and not useful.
But you can imagine that with some work and summarizing by month or season it
could be a useful and efficient way to present the data. Next I tried the
popular package ggplot2.
> ### ggplot ###
> # Refs:
http://learnr.wordpress.com/2010/02/25/ggplot2-plotting-dates-hours-and-minutes/
> # http://had.co.nz/ggplot2/
> #
http://wiki.stdout.org/rcookbook/Graphs/Plotting%20means%20and%20error%20bars%20%28ggplot2%29
>
> library(ggplot2)
>
> ggplot(data = gCount, aes(x = date,
y = count, group = trt)) +
+ #geom_point(aes(shape = factor(trt))) +
+ geom_point(aes(colour = factor(trt), shape
= factor(trt)), size = 3) +
+ #geom_line() +
+ geom_errorbar(aes(ymin=count-se,
ymax=count+se), width=.1) +
+ #geom_line() +
+ # scale_shape_manual(values=c(24,21)) + #
explicitly have sham=fillable triangle, ACCX=fillable circle
+ #scale_fill_manual(values=c("white","black"))
+ # explicitly have sham=white, ACCX=black
+ xlab("Date") +
+ ylab("Mean number of salamander
captures per night") +
+
scale_colour_hue(name="Treatment", # Legend label, use darker
colors
+
l=40) + # Use darker
colors, lightness=40
+ theme_bw() + # make the theme
black-and-white rather than grey (do this before font changes, or it overrides
them)
+
+ opts(legend.position=c(.2, .9), # Position
legend inside This must go after theme_bw
+
panel.grid.major = theme_blank(), # switch off major gridlines
+
panel.grid.minor = theme_blank(), # switch off minor gridlines
+ legend.title
= theme_blank(), # switch off the legend title
+ legend.key =
theme_blank()) # switch off the rectangle around symbols in the legend
>
This plot could work with some fine
tuning, especially with the legend(s) but you get the idea. It wasn’t as easy
for me as the plot function but ggplot is quite versatile and probably a good
package to have in your back pocket for complicated graphing.
Next up was the gplots package for the
plotCI function.
> library(gplots)
>
> plotCI(
+ x = refmCount$mo.yr,
+ y = refmCount$count,
+ uiw = refmCount$se, # error bar length (default is to
put this much above and below point)
+ pch = 24, # symbol (plotting character) type: see
help(pch); 24 = filled triangle pointing up
+ pt.bg = "white", # fill colour for symbol
+ cex = 1.0, # symbol size multiplier
+ lty = 1, # error bar line type: see help(par) - not
sure how to change plot lines
+ type = "p", # p=points, l=lines, b=both,
o=overplotted points/lines, etc.; see help(plot.default)
+ gap = 0, # distance from symbol to error bar
+ sfrac = 0.005, # width of error bar as proportion of x
plotting region (default 0.01)
+ xlab = "Year", # x axis label
+ ylab = "Mean number of salamanders per
night",
+ las = 1, # axis labels horizontal (default is 0 for
always parallel to axis)
+ font.lab = 1, # 1 plain, 2 bold, 3 italic, 4 bold
italic, 5 symbol
+ xaxt = "n") # Don't print x-axis
> )
> axis.POSIXct(1, at = seq(r[1], r[2], by =
"year"), format = "%b %Y") # label the x axis by
month-years
>
> plotCI(
+ x=depmCount$mo.yr,
+ y = depmCount$count,
+ uiw=depmCount$se, # error bar length (default is to
put this much above and below point)
+ pch=21, # symbol (plotting character) type: see
help(pch); 21 = circle
+ pt.bg="grey", # fill colour for symbol
+ cex=1.0, # symbol size multiplier
+ lty=1, # line type: see help(par)
+ type="p", # p=points, l=lines, b=both,
o=overplotted points/lines, etc.; see help(plot.default)
+ gap=0, # distance from symbol to error bar
+ sfrac=0.005, # width of error bar as proportion of x
plotting region (default 0.01)
+ xaxt = "n",
+ add=TRUE # ADD this plot to the previous one
+ )
>
Now this is a nice figure. I must say
that I like this. It is very similar to the standard plot code and graph but it
was a little easier to add the error bars.
So that’s it, what a fun weekend I had!
Let me know what you think or if you have any suggestions. I’m new to this and
love to learn new ways of coding things in R.







Thanks for posting this. I like your final graph. I have similar kinds of data which I'd like to plot like this, so will find this useful.
ReplyDeleteIn terms of the post, I could offer a couple of suggestions. Firstly it's nicer to have an example that people can try themselves - either provide the dataset to download, or use one already freely available online/ through an R package. Secondly I think it would be better if you just showed how you got the final graph to make the post shorter - although it is quite nice to see how you arrived at the final graph too.
Thanks for the interest and advice Tom. Good luck with your research.
ReplyDeleteGood post. I have been making a lot of graphs like this for the project I am currently working on. I also love summarySE and use it a lot. One option you didn't mention is xYplot in the package Hmisc. It's nice because it has all the functionality of lattice (grouping variables, panel conditioning etc) but lets you add error bars. It's not necessarily better than the solutions you showed though. I have an example here: http://anthony.darrouzet-nardi.net/scienceblog/?p=406
ReplyDeleteThanks Anthony. That's the great thing about R, there are many solutions, some better, some worse, some easier, some harder, and some just different. That's why I like being part of the R-Bloggers community. I am new to all this an it really helps increase my R toolbox. I have heard some praise for the Hmisc package but haven't ever really tried it. I like the sound of xYplot, I'll have to give it a try. I have lots of longitudinal data so it could be really handy.
ReplyDeleteHere's a message that I got via email that I thought was good advice and worth sharing:
ReplyDeleteR is excellent for such an exporative analysis. Good that you had fun doing it. :) Some suggestions:
- if you plot error bars (first) and means (second) independently (using functions segments() and points()) you could avoid overplotting of mean values by error bars of the other group (e.g. June 2011)
- you could also shift depletion data points slightly to left and reference to right, such that error bars do not overlap
But most important:
- think about which conclusion you would like the observer to draw from your data/plot?
- do you need to leave huge gaps without data points? -- imo only if you want to show a (linear) trend over time that would be skewed if you don't
- are you interested in absolute numbers or in the difference towards a reference value? -- if the latter than plot the differences per time point instead of both original values
- are you interested in differences between years? -- hopefully not, as you sampled not within the same time period each year
Basically: Think about what is your hypothesis and which value derived from your data illustrates the answer to this hypothesis best. This is what you should put in the focus of your plot.