Go back to STAT545A home
lattice
Follow the existing homework submission instructions. By 9:30am Monday September 30, be prepared to share your links via a Google doc.
gapminderDataFiveYear.txt
.lattice
package to create "companion" graphics, at least one per data aggregation task. So make at least two good figures.
echo = FALSE
very sparingly.Tips
xlab =
etc. Table headings and axis labels don't need to be "publication ready", just transparent.lattice
functions we haven't explicitly covered yet. In particular, xyplot()
might be handy and basic use is pretty easy.This homework and my solutions are about the figures. I include companion tables of numbers from performing data aggregation to model real world behavior. I mostly expose my figure-making code here and mostly hide the data aggregation code. Why? Because I did lots of fiddling with reshaping and changing factor level order and it's probably overkill for people new to R. I am probably not making the best use of, e.g., the reshape2
package. If you want gory details, look at the source.
You are welcome to improvise. Wherever I say "life expectancy", you could substitute "GDP per capita". When I say "max", feel free to look at the 0.9 quantile. You get the idea.
## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(lattice)
library(plyr)
library(xtable)
library(reshape2)
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## drop Oceania
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)
## 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
##This function is handy for the repetitive task of making HTML tables.
htmlPrint <- function(x, ...,
digits = 0, include.rownames = FALSE) {
print(xtable(x, digits = digits, ...), type = 'html',
include.rownames = include.rownames)
}
Report what your measure of typical life expectancy is. If you use something like a trimmed mean, report the trim level properly somewhere close by or in figure.
Life expectancy, trimmed mean, trim = 0.2:yr1952 | yr1957 | yr1962 | yr1967 | yr1972 | yr1977 | yr1982 | yr1987 | yr1992 | yr1997 | yr2002 | yr2007 |
---|---|---|---|---|---|---|---|---|---|---|---|
47 | 50 | 53 | 55 | 58 | 60 | 62 | 64 | 66 | 67 | 68 | 69 |
stripplot(lifeExp ~ factor(year), jDat, jitter.data = TRUE,
type = c("p", "a"), fun = myTrimMean, alpha = 0.6, grid = "h",
main = paste("Life expectancy, trimmed mean, trim = ", jTrim))
stripplot(lifeExp ~ factor(year) | reorder(continent, lifeExp), jDat,
jitter.data = TRUE,
type = c("p", "a"), fun = myTrimMean, alpha = 0.6, grid = "h",
main = paste("Life expectancy, trimmed mean, trim = ", jTrim),
scales = list(x = list(rot = c(45, 0))))
stripplot(lifeExp ~ factor(year), jDat, jitter.data = TRUE,
group = reorder(continent, lifeExp),
type = c("p", "a"), fun = myTrimMean, alpha = 0.6, grid = "h",
main = paste("Life expectancy, trimmed mean, trim = ", jTrim),
scales = list(x = list(rot = c(45, 0))),
auto.key = list(reverse.rows = TRUE,
x = 0.07, y = 0.95, corner = c(0, 1)))
Life expectancy, trimmed mean, trim = 0.2:
## Error: could not find function "myTrimMean"
year | lifeExp | |
---|---|---|
NA | ||
NA.1 | ||
NA.2 | ||
NA.3 |
It is awkward to ignore the year here, since there are strong temporal trends in GDP per capita. You are encouraged to address that in some way. Easy: consider only one year. Harder but better: work the year variable into your visual display.
## get continent*year-specific min and max in tall data.frame
extGdppercapByContinent <- ddply(jDat, ~ continent + year, function(x) {
jLevels <- c("min", "max")
data.frame(gdpPercap = range(x$gdpPercap),
stat = factor(jLevels, levels = jLevels))
})
rawPlot <- xyplot(gdpPercap ~ year | continent, extGdppercapByContinent,
group = stat, type = "b", grid = "h", as.table = TRUE,
auto.key = list(columns = 2))
logPlot <- xyplot(gdpPercap ~ year | continent, extGdppercapByContinent,
group = stat, type = "b", grid = "h", as.table = TRUE,
auto.key = list(columns = 2),
scales = list(y = list(log = TRUE, equispaced.log = FALSE)))
#yscale.components = yscale.components.logpower)
print(rawPlot)
print(logPlot)
continent | stat | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Africa | max | 4725 | 5487 | 6757 | 18773 | 21011 | 21951 | 17364 | 11864 | 13522 | 14723 | 12522 | 13206 |
Africa | min | 299 | 336 | 355 | 413 | 464 | 502 | 462 | 390 | 411 | 312 | 241 | 278 |
Americas | max | 13990 | 14847 | 16173 | 19530 | 21806 | 24073 | 25010 | 29884 | 32004 | 35767 | 39097 | 42952 |
Americas | min | 1398 | 1544 | 1662 | 1452 | 1654 | 1874 | 2011 | 1823 | 1456 | 1342 | 1270 | 1202 |
Asia | max | 108382 | 113523 | 95458 | 80895 | 109348 | 59265 | 33693 | 28118 | 34933 | 40301 | 36023 | 47307 |
Asia | min | 331 | 350 | 388 | 349 | 357 | 371 | 424 | 385 | 347 | 415 | 611 | 944 |
Europe | max | 14734 | 17909 | 20431 | 22966 | 27195 | 26982 | 28398 | 31541 | 33966 | 41283 | 44684 | 49357 |
Europe | min | 974 | 1354 | 1710 | 2172 | 2860 | 3528 | 3631 | 3739 | 2497 | 3193 | 4604 | 5937 |
Wow, what's up with extremely high GDP/capita in Asia prior to 1980? O.I.L.
Too bad it's not easy to display this in a transposed fashion without actually transposing the object.
xtable
does not support that. Wonder if another package does? The hard part is that data.frame variables -- so table columns -- can have different flavors but data.frame or matrix rows -- table rows -- cannot. Therein lies the fiddliness.
yr1952 | yr1957 | yr1962 | yr1967 | yr1972 | yr1977 | yr1982 | yr1987 | yr1992 | yr1997 | yr2002 | yr2007 |
---|---|---|---|---|---|---|---|---|---|---|---|
Kuwait | Kuwait | Kuwait | Kuwait | Kuwait | Kuwait | Saudi Arabia | Kuwait | Kuwait | Kuwait | Singapore | Kuwait |
108382 | 113523 | 95458 | 80895 | 109348 | 59265 | 33693 | 28118 | 34933 | 40301 | 36023 | 47307 |
Ditto above re: using one year only or incorporating the year properly.
First I ignore year, which is a little sketchy....
foo <- ddply(jDat, ~ continent, summarize,
sdGdpPercap = sd(gdpPercap), iqrGdpPercap = IQR(gdpPercap),
madGdpPercap = mad(gdpPercap))
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ continent, foo,
type = "b", ylab = "measure of spread",
auto.key = list(x = 0.15, y = 0.85, corner = c(0, 1)))
Africa | Americas | Asia | Europe | |
---|---|---|---|---|
sdGdpPercap | 2828 | 6397 | 14045 | 9355 |
iqrGdpPercap | 1616 | 4402 | 7492 | 13248 |
madGdpPercap | 775 | 3269 | 2821 | 8846 |
Then I account for year. This shows why robust statistics, like the median, MAD, and IQR, are more handy in real life than your intro stats course may have indicated. Watch the sd chase the oil-rich Arab GDP/per capita outliers, which do not really characterize the entire set of countries in Asia.
foo <- ddply(jDat, ~ continent + year, summarize,
sdGdpPercap = sd(gdpPercap), iqrGdpPercap = IQR(gdpPercap),
madGdpPercap = mad(gdpPercap))
## cheap trick using lattice's extended formula interface
## avoiding reshaping
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ year, foo,
group = reorder(continent, sdGdpPercap), layout = c(3, 1),
type = "b", ylab = "measure of spread",
auto.key = list(x = 0.35, y = 0.85, corner = c(0, 1),
reverse.rows = TRUE))
Wish I knew how to make table shrink to fit the width available. Probably not possible with the technology I am using.
continent | measSpread | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Asia | sd | 18635 | 19507 | 16416 | 14063 | 19088 | 11816 | 8701 | 8090 | 9727 | 11094 | 11151 | 14155 |
Asia | iqr | 2286 | 2497 | 3362 | 5071 | 7548 | 10034 | 11511 | 9939 | 13430 | 17800 | 17141 | 19864 |
Asia | mad | 921 | 1292 | 1428 | 1970 | 2775 | 3639 | 4922 | 4825 | 4282 | 4316 | 4498 | 4566 |
Europe | sd | 3114 | 3678 | 4199 | 4725 | 5510 | 5874 | 6453 | 7483 | 9110 | 10065 | 11197 | 11800 |
Europe | iqr | 3996 | 5202 | 5558 | 6619 | 7465 | 8692 | 9452 | 11047 | 16367 | 17243 | 18652 | 19006 |
Europe | mad | 2984 | 3692 | 4226 | 5135 | 6001 | 6864 | 7647 | 9015 | 12066 | 12348 | 13027 | 12506 |
Americas | sd | 3002 | 3312 | 3422 | 4161 | 4754 | 5356 | 5530 | 6665 | 7047 | 7874 | 8896 | 9713 |
Americas | iqr | 1512 | 2269 | 2430 | 2546 | 2778 | 2918 | 4739 | 3667 | 3698 | 5083 | 3939 | 6249 |
Americas | mad | 1265 | 1917 | 1720 | 2077 | 2230 | 2260 | 3464 | 3292 | 3231 | 3934 | 3167 | 4774 |
Africa | sd | 983 | 1135 | 1462 | 2848 | 3287 | 4142 | 3243 | 2567 | 2644 | 2821 | 2973 | 3618 |
Africa | iqr | 920 | 966 | 1029 | 1110 | 1476 | 2036 | 1959 | 2022 | 1964 | 2064 | 2534 | 3131 |
Africa | mad | 697 | 712 | 786 | 743 | 952 | 1015 | 899 | 814 | 728 | 771 | 819 | 1032 |
Make sure to give your definition of "low life expectancy".
#(bMark <- quantile(jDat$lifeExp, 0.2))
bMark <- 43 # JB wrote this on her 43rd b'day!
stripplot(lifeExp ~ factor(year) | reorder(continent, lifeExp), jDat,
jitter.data = TRUE, alpha = 0.6, grid = "h",
group = lifeExp <= bMark,
main = paste("Life expectancy <= JB's current age =", bMark),
scales = list(x = list(rot = c(45, 0))))
This table will be truly huge if I transpose, as I've done elsewhere. Leaving as is.
Africa | Asia | Americas | Europe | |
---|---|---|---|---|
1952 | 0.83 (43/52) | 0.36 (12/33) | 0.20 (5/25) | 0.00 (0/30) |
1957 | 0.65 (34/52) | 0.33 (11/33) | 0.08 (2/25) | 0.00 (0/30) |
1962 | 0.54 (28/52) | 0.15 (5/33) | 0.00 (0/25) | 0.00 (0/30) |
1967 | 0.38 (20/52) | 0.09 (3/33) | 0.00 (0/25) | 0.00 (0/30) |
1972 | 0.25 (13/52) | 0.09 (3/33) | 0.00 (0/25) | 0.00 (0/30) |
1977 | 0.19 (10/52) | 0.06 (2/33) | 0.00 (0/25) | 0.00 (0/30) |
1982 | 0.13 (7/52) | 0.03 (1/33) | 0.00 (0/25) | 0.00 (0/30) |
1987 | 0.08 (4/52) | 0.03 (1/33) | 0.00 (0/25) | 0.00 (0/30) |
1992 | 0.10 (5/52) | 0.03 (1/33) | 0.00 (0/25) | 0.00 (0/30) |
1997 | 0.12 (6/52) | 0.03 (1/33) | 0.00 (0/25) | 0.00 (0/30) |
2002 | 0.08 (4/52) | 0.03 (1/33) | 0.00 (0/25) | 0.00 (0/30) |
2007 | 0.12 (6/52) | 0.00 (0/33) | 0.00 (0/25) | 0.00 (0/30) |
Find them and then plot their data.
Tip. The hard part here is finding the interesting countries in an elegant manner. If you don't / can't do that, you can hard-wire the interesting countries, just so you get to make the plots! (Nice demo of why it's EVIL to have commas in the country names.) Here you go:
Find them and then plot their data.
I think we are looking at the effect of HIV/AIDS in Africa and the effect of despots, genocide, political upheaval, and foreign invasions elsewhere. With respect to Africa, the 2007 data shows the beginning of the impact of antiretroviral medicines pumped into Africa through massive infusions of foreign aid. Here is an interesting NYT article on how the coffin-making businesses of Africa are struggling now that AIDS deaths are declining.
residInteresting <-
subset(lmGoodies, scaledResid > 4, select = country, drop = TRUE)
xyplot(lifeExp ~ year | country,
subset(jDat, country %in% residInteresting),
group = continent,
grid = TRUE, type = c("p", "r"))