Go back to STAT545A home

Homework #4 Visualize a quantitative variable with lattice

Follow the existing homework submission instructions. By 9:30am Monday September 30, be prepared to share your links via a Google doc.

Tips

Disclaimer

This homework and my solutions are about the figures. I include companion tables of numbers from performing data aggregation to model real world behavior. I mostly expose my figure-making code here and mostly hide the data aggregation code. Why? Because I did lots of fiddling with reshaping and changing factor level order and it's probably overkill for people new to R. I am probably not making the best use of, e.g., the reshape2 package. If you want gory details, look at the source.

Choose your own adventure.

You are welcome to improvise. Wherever I say "life expectancy", you could substitute "GDP per capita". When I say "max", feel free to look at the 0.9 quantile. You get the idea.

Load the Gapminder data, necessary packages and drop Oceania

## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(lattice)
library(plyr)
library(xtable)
library(reshape2)
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
## drop Oceania
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)
## 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
##This function is handy for the repetitive task of making HTML tables.
htmlPrint <- function(x, ...,
                      digits = 0, include.rownames = FALSE) {
  print(xtable(x, digits = digits, ...), type = 'html',
        include.rownames = include.rownames)
  }

Examine "typical" life expectancy for different years.

Report what your measure of typical life expectancy is. If you use something like a trimmed mean, report the trim level properly somewhere close by or in figure.

Life expectancy, trimmed mean, trim = 0.2:
yr1952 yr1957 yr1962 yr1967 yr1972 yr1977 yr1982 yr1987 yr1992 yr1997 yr2002 yr2007
47 50 53 55 58 60 62 64 66 67 68 69
stripplot(lifeExp ~ factor(year), jDat, jitter.data = TRUE,
          type = c("p", "a"), fun = myTrimMean, alpha = 0.6, grid = "h",
          main = paste("Life expectancy, trimmed mean, trim = ", jTrim))

How is life expectancy changing over time on different continents?

stripplot(lifeExp ~ factor(year) | reorder(continent, lifeExp), jDat,
          jitter.data = TRUE,
          type = c("p", "a"), fun = myTrimMean, alpha = 0.6, grid = "h",
          main = paste("Life expectancy, trimmed mean, trim = ", jTrim),
          scales = list(x = list(rot = c(45, 0))))
stripplot(lifeExp ~ factor(year), jDat, jitter.data = TRUE,
          group = reorder(continent, lifeExp),
          type = c("p", "a"), fun = myTrimMean, alpha = 0.6, grid = "h",
          main = paste("Life expectancy, trimmed mean, trim = ", jTrim),
          scales = list(x = list(rot = c(45, 0))),
          auto.key = list(reverse.rows = TRUE,
                          x = 0.07, y = 0.95, corner = c(0, 1)))

Life expectancy, trimmed mean, trim = 0.2:

## Error: could not find function "myTrimMean"
year lifeExp
NA
NA.1
NA.2
NA.3

Depict the maximum and minimum of GDP per capita for all continents.

It is awkward to ignore the year here, since there are strong temporal trends in GDP per capita. You are encouraged to address that in some way. Easy: consider only one year. Harder but better: work the year variable into your visual display.

## get continent*year-specific min and max in tall data.frame
extGdppercapByContinent <- ddply(jDat, ~ continent + year, function(x) {
  jLevels <- c("min", "max")
  data.frame(gdpPercap = range(x$gdpPercap),
             stat = factor(jLevels, levels = jLevels))
  })
rawPlot <- xyplot(gdpPercap ~ year | continent, extGdppercapByContinent,
                  group = stat, type = "b", grid = "h", as.table = TRUE,
                  auto.key = list(columns = 2))
logPlot <- xyplot(gdpPercap ~ year | continent, extGdppercapByContinent,
                  group = stat, type = "b", grid = "h", as.table = TRUE,
                  auto.key = list(columns = 2),
                  scales = list(y = list(log = TRUE, equispaced.log = FALSE)))
                  #yscale.components = yscale.components.logpower)
print(rawPlot)
print(logPlot)

continent stat 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Africa max 4725 5487 6757 18773 21011 21951 17364 11864 13522 14723 12522 13206
Africa min 299 336 355 413 464 502 462 390 411 312 241 278
Americas max 13990 14847 16173 19530 21806 24073 25010 29884 32004 35767 39097 42952
Americas min 1398 1544 1662 1452 1654 1874 2011 1823 1456 1342 1270 1202
Asia max 108382 113523 95458 80895 109348 59265 33693 28118 34933 40301 36023 47307
Asia min 331 350 388 349 357 371 424 385 347 415 611 944
Europe max 14734 17909 20431 22966 27195 26982 28398 31541 33966 41283 44684 49357
Europe min 974 1354 1710 2172 2860 3528 3631 3739 2497 3193 4604 5937

Wow, what's up with extremely high GDP/capita in Asia prior to 1980? O.I.L.

Too bad it's not easy to display this in a transposed fashion without actually transposing the object. xtable does not support that. Wonder if another package does? The hard part is that data.frame variables -- so table columns -- can have different flavors but data.frame or matrix rows -- table rows -- cannot. Therein lies the fiddliness.

yr1952 yr1957 yr1962 yr1967 yr1972 yr1977 yr1982 yr1987 yr1992 yr1997 yr2002 yr2007
Kuwait Kuwait Kuwait Kuwait Kuwait Kuwait Saudi Arabia Kuwait Kuwait Kuwait Singapore Kuwait
108382 113523 95458 80895 109348 59265 33693 28118 34933 40301 36023 47307

Look at the spread of GDP per capita within the continents.

Ditto above re: using one year only or incorporating the year properly.

First I ignore year, which is a little sketchy....

foo <- ddply(jDat, ~ continent, summarize,
             sdGdpPercap = sd(gdpPercap), iqrGdpPercap = IQR(gdpPercap),
             madGdpPercap = mad(gdpPercap))
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ continent, foo,
       type = "b", ylab = "measure of spread",
       auto.key = list(x = 0.15, y = 0.85, corner = c(0, 1)))
Africa Americas Asia Europe
sdGdpPercap 2828 6397 14045 9355
iqrGdpPercap 1616 4402 7492 13248
madGdpPercap 775 3269 2821 8846

Then I account for year. This shows why robust statistics, like the median, MAD, and IQR, are more handy in real life than your intro stats course may have indicated. Watch the sd chase the oil-rich Arab GDP/per capita outliers, which do not really characterize the entire set of countries in Asia.

foo <- ddply(jDat, ~ continent + year, summarize,
             sdGdpPercap = sd(gdpPercap), iqrGdpPercap = IQR(gdpPercap),
             madGdpPercap = mad(gdpPercap))
## cheap trick using lattice's extended formula interface
## avoiding reshaping
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ year, foo,
       group = reorder(continent, sdGdpPercap), layout = c(3, 1),
       type = "b", ylab = "measure of spread",
       auto.key = list(x = 0.35, y = 0.85, corner = c(0, 1),
                       reverse.rows = TRUE))

Wish I knew how to make table shrink to fit the width available. Probably not possible with the technology I am using.

continent measSpread 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Asia sd 18635 19507 16416 14063 19088 11816 8701 8090 9727 11094 11151 14155
Asia iqr 2286 2497 3362 5071 7548 10034 11511 9939 13430 17800 17141 19864
Asia mad 921 1292 1428 1970 2775 3639 4922 4825 4282 4316 4498 4566
Europe sd 3114 3678 4199 4725 5510 5874 6453 7483 9110 10065 11197 11800
Europe iqr 3996 5202 5558 6619 7465 8692 9452 11047 16367 17243 18652 19006
Europe mad 2984 3692 4226 5135 6001 6864 7647 9015 12066 12348 13027 12506
Americas sd 3002 3312 3422 4161 4754 5356 5530 6665 7047 7874 8896 9713
Americas iqr 1512 2269 2430 2546 2778 2918 4739 3667 3698 5083 3939 6249
Americas mad 1265 1917 1720 2077 2230 2260 3464 3292 3231 3934 3167 4774
Africa sd 983 1135 1462 2848 3287 4142 3243 2567 2644 2821 2973 3618
Africa iqr 920 966 1029 1110 1476 2036 1959 2022 1964 2064 2534 3131
Africa mad 697 712 786 743 952 1015 899 814 728 771 819 1032

Depict the number and/or proportion of countries with low life expectancy over time by continent.

Make sure to give your definition of "low life expectancy".

#(bMark <- quantile(jDat$lifeExp, 0.2))
bMark <- 43 # JB wrote this on her 43rd b'day!
stripplot(lifeExp ~ factor(year) | reorder(continent, lifeExp), jDat,
          jitter.data = TRUE, alpha = 0.6, grid = "h",
          group = lifeExp <= bMark,
          main = paste("Life expectancy <= JB's current age =", bMark),
          scales = list(x = list(rot = c(45, 0))))

This table will be truly huge if I transpose, as I've done elsewhere. Leaving as is.

Africa Asia Americas Europe
1952 0.83 (43/52) 0.36 (12/33) 0.20 (5/25) 0.00 (0/30)
1957 0.65 (34/52) 0.33 (11/33) 0.08 (2/25) 0.00 (0/30)
1962 0.54 (28/52) 0.15 (5/33) 0.00 (0/25) 0.00 (0/30)
1967 0.38 (20/52) 0.09 (3/33) 0.00 (0/25) 0.00 (0/30)
1972 0.25 (13/52) 0.09 (3/33) 0.00 (0/25) 0.00 (0/30)
1977 0.19 (10/52) 0.06 (2/33) 0.00 (0/25) 0.00 (0/30)
1982 0.13 (7/52) 0.03 (1/33) 0.00 (0/25) 0.00 (0/30)
1987 0.08 (4/52) 0.03 (1/33) 0.00 (0/25) 0.00 (0/30)
1992 0.10 (5/52) 0.03 (1/33) 0.00 (0/25) 0.00 (0/30)
1997 0.12 (6/52) 0.03 (1/33) 0.00 (0/25) 0.00 (0/30)
2002 0.08 (4/52) 0.03 (1/33) 0.00 (0/25) 0.00 (0/30)
2007 0.12 (6/52) 0.00 (0/33) 0.00 (0/25) 0.00 (0/30)

Find countries with extremely low or high life expectancy in 1952 or that exhibit extremely rapid or slow life expectancy gains.

Find them and then plot their data.

Tip. The hard part here is finding the interesting countries in an elegant manner. If you don't / can't do that, you can hard-wire the interesting countries, just so you get to make the plots! (Nice demo of why it's EVIL to have commas in the country names.) Here you go:

Find countries with sudden, substantial departures from the temporal trend in one of the quantitative measures.

Find them and then plot their data.

I think we are looking at the effect of HIV/AIDS in Africa and the effect of despots, genocide, political upheaval, and foreign invasions elsewhere. With respect to Africa, the 2007 data shows the beginning of the impact of antiretroviral medicines pumped into Africa through massive infusions of foreign aid. Here is an interesting NYT article on how the coffin-making businesses of Africa are struggling now that AIDS deaths are declining.

residInteresting <-
  subset(lmGoodies, scaledResid > 4, select = country, drop = TRUE)
xyplot(lifeExp ~ year | country, 
       subset(jDat, country %in% residInteresting),
       group = continent,
       grid = TRUE, type = c("p", "r"))

Student work on HW 4.