Ignore if you don't need this bit of support.
This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd()
command or by glancing at the top of RStudio's Console pane.
Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodicially copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "data aggregation".
Assuming the data can be found in the current working directory, this works:
gDat <- read.delim("gapminderDataFiveYear.txt")
Plan B (I use here, because of where the source of this tutorial lives):
## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
Basic sanity check that the import has gone well:
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
If you feel the urge to store a little snippet of a data.frame:
(snippet <- subset(gDat, country == "Canada"))
## country year pop continent lifeExp gdpPercap
## 241 Canada 1952 14785584 Americas 68.75 11367
## 242 Canada 1957 17010154 Americas 69.96 12490
## 243 Canada 1962 18985849 Americas 71.30 13462
## 244 Canada 1967 20819767 Americas 72.13 16077
## 245 Canada 1972 22284500 Americas 72.88 18971
## 246 Canada 1977 23796400 Americas 74.21 22091
## 247 Canada 1982 25201900 Americas 75.76 22899
## 248 Canada 1987 26549700 Americas 76.86 26627
## 249 Canada 1992 28523502 Americas 77.95 26343
## 250 Canada 1997 30305843 Americas 78.61 28955
## 251 Canada 2002 31902268 Americas 79.77 33329
## 252 Canada 2007 33390141 Americas 80.65 36319
Stop and ask yourself ...
Do I want to create sub-data.frames for each level of some factor (or unique combination of several factors) ... in order to compute or graph something?
If NO, then maybe you really do need to store a copy of a subset of the data.frame. But seriously consider whether you can achieve your goals by simply using the subset =
argument -- or perhaps with()
coupled with subset()
-- to enact a computation on a specific set of rows. If this still does not suit your needs, then maybe you really should use subset()
as shown above and carry on.
If YES, use data aggregation techniques or conditioning in lattice
or facetting in ggplot2
plots -- don’t subset the data.frame. Or, to be totally clear, only subset the data.frame as a temporary measure as you develop your elegant code for computing on or visualizing these sub-data.frames.
There are two main options for data aggregation:
apply
family of functionsplyr
add-on packageI used the built-in functions for many years but am transitioning to plyr
. I recommend simply starting with plyr
if you are new to R. You can see older material about data aggregation with built-in functions here. In this tutorial we will only use plyr
.
You'll notice I did not even mention another option that may occur to some: hand-coding for
loops, perhaps, even (shudder) nested for
loops! Don't do it. By the end of this tutorial you'll see things that are much faster and more fun. Yes, of course, tedious loops are required for data aggregation but when you can, let other developers write them for you, in super-efficient low level code. This is more about saving programmer time than compute time, BTW.
plyr
If you have not already done so, you'll need to install plyr
. Here's one way to do that:
install.packages("plyr", dependencies = TRUE)
You will also need to load the package before you can use the functions in an R session:
library(plyr)
You can make that apply to all your R sessions via your .Rprofile
. Note this is a controversial practice, because it means your code will not necessarily run "as is" on someone else's computer. For that reason, I auto-load packages very sparingly. plyr
is not (yet) on my list.
plyr
Big IdeasThe plyr
functions will not make much sense viewed individually, e.g. simply reading the help for ddply()
is not the fast track to competence. There is a very important over-arching logic for the package and it is well worth reading the article The split-apply-combine strategy for data analysis, Hadley Wickham, Journal of Statistical Software, vol. 40, no. 1, pp. 1–29, 2011. Though it is no substitute for reading the above, here is the most critical information:
plyr
.plyr
is a set a functions with names like this: XYply
where X
specifies what sort of input you're giving and Y
specifies the sort of output you want.
a
= array, where matrices and vectors are important special casesd
= data.framel
= list_
= no output; only valid for Y
, obviously; useful when you're operating on a list purely for the side effects, e.g., making a plot or sending output to screen/file.data
is the first argument = the inputToday we will emphasize ddply()
which accepts a data.frame, splits it into pieces based on one or more factors, computes on the pieces, then returns the results as a data.frame. For the record, the built-in functions most relevant to ddply()
are tapply()
and friends.
ddply()
Let's say we want to get the maximum life expectancy for each continent.
(maxLeByCont <- ddply(gDat, ~ continent, summarize, maxLifeExp = max(lifeExp)))
## continent maxLifeExp
## 1 Africa 76.44
## 2 Americas 80.65
## 3 Asia 82.60
## 4 Europe 81.76
## 5 Oceania 81.23
Let's study the return value.
str(maxLeByCont)
## 'data.frame': 5 obs. of 2 variables:
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
## $ maxLifeExp: num 76.4 80.7 82.6 81.8 81.2
levels(maxLeByCont$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
So we got a data.frame back, with one observation per continent, and two variables: the maximum life expectancies and the continent, as a factor, with the same levels in the same order, as for the input data.frame gDat
. If you have sweated to do such things with built-in functions, this minor miracle might make you cry tears of joy (or anguish over all the hours you have wasted.)
summarize()
or its synonym summarise()
is a function provided by plyr
that creates a new data.frame from an old one. It is related to the built-in function transform()
that transforms variables in a data.frame or adds new ones. Feel free to play with it a bit in some top-level commands; you will use it alot inside plyr
calls.
The two variables in maxLeByCont
come from two sources. The continent
factor is provided by ddply()
and represents the labelling of the life expectancies with their associated continent. This is the book-keeping associated with dividing the input into little bits, computing on them, and gluing the results together again in an orderly, labelled fashion. We can take more credit for the other variable maxLifeExp
, which has a name we chose ("maxLifeExp") and arises from applying a function we specified (max()
) to a variable of our choice (lifeExp
).
You try: compute the minimum GDP per capita by continent. Here's what I get:
## continent minGdpPercap
## 1 Africa 241.2
## 2 Americas 1201.6
## 3 Asia 331.0
## 4 Europe 973.5
## 5 Oceania 10039.6
You might have chosen a different name for the minimum GDP/capita's, but your numerical results should match.
The function you want to apply to the continent-specific data.frames can be built-in, like max()
above, or a custom function you've written. This custom function can be written in advance or specified 'on the fly'. Here's how I would count the number of countries in this dataset for each continent.
ddply(gDat, ~continent, summarize, nUniqCountries = length(unique(country)))
## continent nUniqCountries
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
Here is another way to do the same thing that doesn't use summarize()
at all:
ddply(gDat, ~ continent,
function(x) return(c(nUniqCountries = length(unique(x$country)))))
## continent nUniqCountries
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
In pseudo pseudo-code, here is what's happening in both of the above commands:
returnValue <- an empty receptacle with one "slot" per country
for each possible country i {
x <- subset(gDat, subset = country == i)
returnValue[i] <- length(unique(x$country))
name or label for returnValue[i] is set to country i
}
ddply packages returnValue and associate names/labels as a nice data.frame
You don't have to compute just one thing for each sub-data.frame, nor are you limited to computing on just one variable. Check it out.
ddply(gDat, ~ continent, summarize,
minLifeExp = min(lifeExp), maxLifeExp = max(lifeExp),
medGdpPercap = median(gdpPercap))
## continent minLifeExp maxLifeExp medGdpPercap
## 1 Africa 23.60 76.44 1192
## 2 Americas 37.58 80.65 5466
## 3 Asia 28.80 82.60 2647
## 4 Europe 43.59 81.76 12082
## 5 Oceania 69.12 81.23 17983
ddply()
and polishing the resultsNow I want to do something more complicated. I want to fit a linear regression for each country, modelling life expectancy as a function of the year and then retain the estimated intercepts and slopes. I will walk before I run. Therefore, I will create a tiny sub-data.frame to prototype this, before I fold it into a ddply()
call. If you're a newbie, watch how complicated tasks are slowly constructed.
jCountry <- "France" # pick, but do not hard wire, an example
(jDat <- subset(gDat, country == jCountry)) # temporary measure!
## country year pop continent lifeExp gdpPercap
## 529 France 1952 42459667 Europe 67.41 7030
## 530 France 1957 44310863 Europe 68.93 8663
## 531 France 1962 47124000 Europe 70.51 10560
## 532 France 1967 49569000 Europe 71.55 13000
## 533 France 1972 51732000 Europe 72.38 16107
## 534 France 1977 53165019 Europe 73.83 18293
## 535 France 1982 54433565 Europe 74.89 20294
## 536 France 1987 55630100 Europe 76.34 22066
## 537 France 1992 57374179 Europe 77.46 24704
## 538 France 1997 58623428 Europe 78.64 25890
## 539 France 2002 59925035 Europe 79.59 28926
## 540 France 2007 61083916 Europe 80.66 30470
xyplot(lifeExp ~ year, jDat, type = c("p", "r")) # always plot the data
jFit <- lm(lifeExp ~ year, jDat)
summary(jFit)
##
## Call:
## lm(formula = lifeExp ~ year, data = jDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3801 -0.1389 0.0124 0.1429 0.3349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.98e+02 7.29e+00 -54.6 1.0e-13 ***
## year 2.38e-01 3.68e-03 64.8 1.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.22 on 10 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.997
## F-statistic: 4.2e+03 on 1 and 10 DF, p-value: 1.86e-14
Wow, check out that crazy intercept! Apparently the life expectancy in France around year 0 A.D. was minus 400 years! This a great opportunity for some sanity checking of a model fit and thinking about how to reparametrize the model to make the parameters have natural interpretation. I think it makes more sense for the intercept to correspond to life expectancy in 1952, the earliest date in our dataset. Let's try that again.
(yearMin <- min(gDat$year))
## [1] 1952
jFit <- lm(lifeExp ~ I(year - yearMin), jDat)
summary(jFit)
##
## Call:
## lm(formula = lifeExp ~ I(year - yearMin), data = jDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3801 -0.1389 0.0124 0.1429 0.3349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.79013 0.11949 567.3 < 2e-16 ***
## I(year - yearMin) 0.23850 0.00368 64.8 1.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.22 on 10 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.997
## F-statistic: 4.2e+03 on 1 and 10 DF, p-value: 1.86e-14
An intercept around 68 years makes much more common sense and is also supported by our plot. What is this jFit
object and how can I get stuff out of it?
class(jFit)
## [1] "lm"
mode(jFit)
## [1] "list"
It turns out jFit
is of class "lm" and its mode is list. So that means I could use indexing to isolate specific components. But what's in there?
## str(jFit) # too ugly to print here but you should look
names(jFit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
jFit$coefficients
## (Intercept) I(year - yearMin)
## 67.7901 0.2385
Using str()
and names()
reveals a great deal about this "lm" object and reading the help file for lm()
would explain a great deal more. In the See Also section we learn there's a generic function coef()
which looks promising.
coef(jFit)
## (Intercept) I(year - yearMin)
## 67.7901 0.2385
As a rule, I use extractor functions like this when they are available.
We have achieved our goal for this specific country -- we've gotten its intercept and slope. Now we need to package that as a function (we will talk about functions properly later, but this should be fairly self-explanatory).
jFun <- function(x) coef(lm(lifeExp ~ I(year - yearMin), x))
jFun(jDat) # trying out our new function ... yes still get same numbers
## (Intercept) I(year - yearMin)
## 67.7901 0.2385
I hate the names of these return values. Good names pay off downstream, so I will enhance my function.
jFun <- function(x) {
estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
names(estCoefs) <- c("intercept", "slope")
return(estCoefs)
}
jFun(jDat) # trying out our improved function ... yes still get same numbers
## intercept slope
## 67.7901 0.2385
It's always a good idea to try out a function on a few small examples.
jFun(subset(gDat, country == "Canada"))
## intercept slope
## 68.8838 0.2189
jFun(subset(gDat, country == "Uruguay"))
## intercept slope
## 65.7416 0.1833
jFun(subset(gDat, country == "India"))
## intercept slope
## 39.2698 0.5053
It seems like we are ready to scale up by placing this function inside a ddply()
call.
jCoefs <- ddply(gDat, ~country, jFun)
str(jCoefs)
## 'data.frame': 142 obs. of 3 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10..
## $ intercept: num 29.9 59.2 43.4 32.1 62.7 ...
## $ slope : num 0.275 0.335 0.569 0.209 0.232 ...
tail(jCoefs)
## country intercept slope
## 137 Venezuela 57.51 0.32972
## 138 Vietnam 39.01 0.67162
## 139 West Bank and Gaza 43.80 0.60110
## 140 Yemen, Rep. 30.13 0.60546
## 141 Zambia 47.66 -0.06043
## 142 Zimbabwe 55.22 -0.09302
We did it! By the time we've packaged the computation in a function, the call itself is deceptively simple. To review, here's the script I would save from our work in this section:
## realistically, you would read the data from a local file
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
## str(gDat) here when working interactively
yearMin <- min(gDat$year)
jFun <- function(x) {
estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
names(estCoefs) <- c("intercept", "slope")
return(estCoefs)
}
## jFun(subset(gDat, country == 'India')) to see what it does
jCoefs <- ddply(gDat, ~country, jFun)
Over the course of its development, the number of lines in the script would start small, get bigger as I fiddle around, make mistakes, write and use lots of sanity checking code and then .... contract down to the above, as I strip it down to bare necessities. That is how the pros actually work. They don't write beautiful elegant scripts the first time and they aren't satisfied with hideous hack-y "transcripts" of the very first attempt that (sort of) worked.
Finally, let's present this information attractively in a table. I consider my advice about how to do this less definitive than the above. I'll be happy to see people explore other table-making tools. Fixed width plain text printing of data.frames is OK for internal use and during development. But at some point you will want a nicer looking table. Markdown doesn't have a proper table syntax, because it is a ruthlessly simple language. Some Markdown dialects / processing engines support table extensions, BTW. But here I will teach you how to make an HTML table. This HTML code will survive unharmed as your R Markdown document is converted from R Markdown to Markdown and finally to HTML.
I've experimented with the xtable
package, which you will need to install
install.packages("xtable", dependencies = TRUE)
and load.
library(xtable)
Let's pick some countries at random and display their estimated coefficients. FYI: the following R chunk has an option results='asis'
and that is important to the correct display of the table.
set.seed(916)
foo <- jCoefs[sample(nrow(jCoefs), size = 15), ]
foo <- xtable(foo)
print(foo, type = "html", include.rownames = FALSE)
country | intercept | slope |
---|---|---|
Lebanon | 58.69 | 0.26 |
Senegal | 36.75 | 0.50 |
Dominican Republic | 48.60 | 0.47 |
Oman | 37.21 | 0.77 |
Germany | 67.57 | 0.21 |
Korea, Dem. Rep. | 54.91 | 0.32 |
Mauritius | 55.37 | 0.35 |
Slovak Republic | 67.01 | 0.13 |
Comoros | 40.00 | 0.45 |
Argentina | 62.69 | 0.23 |
Central African Republic | 38.81 | 0.18 |
Ecuador | 49.07 | 0.50 |
West Bank and Gaza | 43.80 | 0.60 |
Egypt | 40.97 | 0.56 |
Myanmar | 41.41 | 0.43 |
Two easy improvments to make this table more useful are
The easiest way to get the continent information is to enhance our ddply()
call. Here is what we used:
jCoefs <- ddply(gDat, ~country, jFun)
This divides gDat
into country-specific pieces. But we can supply two factors in the second argument: country
and continent
. In theory, sub-data.frames will be made for all possible combinations of the levels of country
and continent
. Many of those will have zero rows because there is, for example, no Belgium in Asia. By default, plyr
functions drop these empty combinations. But the labelling work done by ddply()
will still help us, as we will get both country
and continent
as factors in our result. This is easier to see than explain!
jCoefs <- ddply(gDat, ~country + continent, jFun)
str(jCoefs)
## 'data.frame': 142 obs. of 4 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10..
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 ..
## $ intercept: num 29.9 59.2 43.4 32.1 62.7 ...
## $ slope : num 0.275 0.335 0.569 0.209 0.232 ...
tail(jCoefs)
## country continent intercept slope
## 137 Venezuela Americas 57.51 0.32972
## 138 Vietnam Asia 39.01 0.67162
## 139 West Bank and Gaza Asia 43.80 0.60110
## 140 Yemen, Rep. Asia 30.13 0.60546
## 141 Zambia Africa 47.66 -0.06043
## 142 Zimbabwe Africa 55.22 -0.09302
Now, prior to making the HTML table, we will sort the data.frame, so it starts with the country with the shortest life expectancy in 1952, and goes to the largest.
set.seed(916)
foo <- jCoefs[sample(nrow(jCoefs), size = 15), ]
foo <- arrange(foo, intercept)
## foo <- foo[order(foo$intercept), ] # an uglier non-plyr way
foo <- xtable(foo)
print(foo, type = "html", include.rownames = FALSE)
country | continent | intercept | slope |
---|---|---|---|
Senegal | Africa | 36.75 | 0.50 |
Oman | Asia | 37.21 | 0.77 |
Central African Republic | Africa | 38.81 | 0.18 |
Comoros | Africa | 40.00 | 0.45 |
Egypt | Africa | 40.97 | 0.56 |
Myanmar | Asia | 41.41 | 0.43 |
West Bank and Gaza | Asia | 43.80 | 0.60 |
Dominican Republic | Americas | 48.60 | 0.47 |
Ecuador | Americas | 49.07 | 0.50 |
Korea, Dem. Rep. | Asia | 54.91 | 0.32 |
Mauritius | Africa | 55.37 | 0.35 |
Lebanon | Asia | 58.69 | 0.26 |
Argentina | Americas | 62.69 | 0.23 |
Slovak Republic | Europe | 67.01 | 0.13 |
Germany | Europe | 67.57 | 0.21 |
Soon we will start making the companion plots ... but for now our work is done.
plyr
is a powerful package for data aggregation.
ddply()
is the most important function: data.frames are great, therefore ddply()
is great because it takes data.frame as input and returns data.frame as output.
Simple functions, built-in or user-defined, can be provided directly in a call to ddply()
.
It's better to handle more complicated data aggregation differently. Build it up slowly and revisit earlier steps for improvement. Here is a gentle workflow:
ddply()
call using your functionddply()
againResults you present to the world generally need polishing to make the best impression and have the best impact. This too will be an iterative process. Here are some good things to think about:
What it you wanted to write the table of slopes and intercepts to file? Go to the tutorial on getting data out of R, but if this is an emergency, the main functions to consider are write.table()
and saveRDS()
.
plyr
paper: The split-apply-combine strategy for data analysis, Hadley Wickham, Journal of Statistical Software, vol. 40, no. 1, pp. 1–29, 2011. Go here for supplements, such as example code from the paper.
Data Manipulation with R by Phil Spector, Springer (2008) | author webpage | GoogleBooks search
The main link above to SpringerLink will give full access to the book if you are on a UBC network (or any other network that confers accesss).
See Chapter 8 (“Data Aggregation”)
Student: How do you pass more than one argument for a function into ddply()
. The main example that we used in class was this:
(yearMin <- min(gDat$year))
## [1] 1952
jFun <- function(x) {
estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
names(estCoefs) <- c("intercept", "slope")
return(estCoefs)
}
jCoefs <- ddply(gDat, ~country, jFun)
head(jCoefs)
## country intercept slope
## 1 Afghanistan 29.91 0.2753
## 2 Albania 59.23 0.3347
## 3 Algeria 43.37 0.5693
## 4 Angola 32.13 0.2093
## 5 Argentina 62.69 0.2317
## 6 Australia 68.40 0.2277
and jFun
only requires one argument, x
. What if it had more than one argument?
Answer: Let's imagine that the shift for the year covariate is an argument instead of a previously-assigned variable yearMin
. Here's how it would work.
jFunTwoArgs <- function(x, cvShift = 0) {
estCoefs <- coef(lm(lifeExp ~ I(year - cvShift), x))
names(estCoefs) <- c("intercept", "slope")
return(estCoefs)
}
Since I've assigned cvShift =
a default value of zero, we can get coefficients where the intercept corresponds to the year A.D. 0 with this simple call:
jCoefsSilly <- ddply(gDat, ~country, jFunTwoArgs)
head(jCoefsSilly)
## country intercept slope
## 1 Afghanistan -507.5 0.2753
## 2 Albania -594.1 0.3347
## 3 Algeria -1067.9 0.5693
## 4 Angola -376.5 0.2093
## 5 Argentina -389.6 0.2317
## 6 Australia -376.1 0.2277
We are getting the same estimated slopes but the silly year 0 intercepts we've seen before. Let's use the cvShift =
argument to resolve this.
jCoefsSane <- ddply(gDat, ~country, jFunTwoArgs, cvShift = 1952)
head(jCoefsSane)
## country intercept slope
## 1 Afghanistan 29.91 0.2753
## 2 Albania 59.23 0.3347
## 3 Algeria 43.37 0.5693
## 4 Angola 32.13 0.2093
## 5 Argentina 62.69 0.2317
## 6 Australia 68.40 0.2277
We're back to our usual estimated intercepts, which reflect life expectancy in 1952. Of course hard-wiring 1952 is not a great idea, so here's probably our best code yet:
jCoefsBest <- ddply(gDat, ~country, jFunTwoArgs, cvShift = min(gDat$year))
head(jCoefsBest)
## country intercept slope
## 1 Afghanistan 29.91 0.2753
## 2 Albania 59.23 0.3347
## 3 Algeria 43.37 0.5693
## 4 Angola 32.13 0.2093
## 5 Argentina 62.69 0.2317
## 6 Australia 68.40 0.2277