Homework #4 Visualize a quantitative variable with lattice
- Menu of data aggregation tasks to explore with a figure.
- Student work on HW 4.

Homework #4 Visualize a quantitative variable with `lattice`

Follow the existing homework submission instructions. By 9:30am Monday September 30, be prepared to share your links via a Google doc.

Start with the Gapminder data as provided in gapminderDataFiveYear.txt.
Pick at least two data aggregation tasks inspired by homework 3, recapped below. You decide what to work on but spend ~2 hours on this. Of course you are welcome to do more if you wish.
Try to pick tasks that you did not do and modify (or at least read) data aggregation code from a fellow student (best!) or JB.
- If applicable, state and link to the student's code and report.
- Comment on their HW 3 report and/or code if you wish. Be kind but constructive.
- Notice what makes code hard vs. easy to understand and re-use. This foretells what revisiting your own code will feel like 6 months from now.
Use the lattice package to create "companion" graphics, at least one per data aggregation task. So make at least two good figures.
- Your figure does not have to depict every last number from the data aggregation result. Use your judgement.
- The figure can illustrate things that are hard/impossible to see in the data aggregation result.
- Goal: the table and the figure complement each other!
- Notice which data formats make better inputs for plotting functions vs. for human-friendly tables.
Write up with R Markdown.
- Include a narrative, written in English prose. Don't just show code and results.
- Expose your code, i.e. use echo = FALSE very sparingly.
DUE: Before class begins 9:30am Monday September 30.
- Compile into an HTML report. Follow the naming convention. Publish the HTML report on the web somewhere, such as on RPubs. Make the slug follow the naming convention.
- Publish the R Markdown file as a Gist. Follow the naming convention.
- Share your links by editing the Google doc mentioned above.

Tips

Perform a superficial check that data import went OK.
Give informative, short variable names, so you don't need to fiddle with xlab = etc. Table headings and axis labels don't need to be "publication ready", just transparent.
Attend to the order of your factor levels.
Feel free to use lattice functions we haven't explicitly covered yet. In particular, xyplot() might be handy and basic use is pretty easy.
You are encouraged to drop Oceania right from the start.

Disclaimer

This homework and my solutions are about the figures. I include companion tables of numbers from performing data aggregation to model real world behavior. I mostly expose my figure-making code here and mostly hide the data aggregation code. Why? Because I did lots of fiddling with reshaping and changing factor level order and it's probably overkill for people new to R. I am probably not making the best use of, e.g., the reshape2 package. If you want gory details, look at the source.

Choose your own adventure.

You are welcome to improvise. Wherever I say "life expectancy", you could substitute "GDP per capita". When I say "max", feel free to look at the 0.9 quantile. You get the idea.

Load the Gapminder data, necessary packages and drop Oceania

## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(lattice)
library(plyr)
library(xtable)
library(reshape2)
str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

## drop Oceania
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)

## 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

##This function is handy for the repetitive task of making HTML tables.
htmlPrint <- function(x, ...,
                      digits = 0, include.rownames = FALSE) {
  print(xtable(x, digits = digits, ...), type = 'html',
        include.rownames = include.rownames)
  }

Examine "typical" life expectancy for different years.

Report what your measure of typical life expectancy is. If you use something like a trimmed mean, report the trim level properly somewhere close by or in figure.

Life expectancy, trimmed mean, trim = 0.2:

yr1952	yr1957	yr1962	yr1967	yr1972	yr1977	yr1982	yr1987	yr1992	yr1997	yr2002	yr2007
47	50	53	55	58	60	62	64	66	67	68	69

stripplot(lifeExp ~ factor(year), jDat, jitter.data = TRUE,
          type = c("p", "a"), fun = myTrimMean, alpha = 0.6, grid = "h",
          main = paste("Life expectancy, trimmed mean, trim = ", jTrim))

How is life expectancy changing over time on different continents?

stripplot(lifeExp ~ factor(year) | reorder(continent, lifeExp), jDat,
          jitter.data = TRUE,
          type = c("p", "a"), fun = myTrimMean, alpha = 0.6, grid = "h",
          main = paste("Life expectancy, trimmed mean, trim = ", jTrim),
          scales = list(x = list(rot = c(45, 0))))
stripplot(lifeExp ~ factor(year), jDat, jitter.data = TRUE,
          group = reorder(continent, lifeExp),
          type = c("p", "a"), fun = myTrimMean, alpha = 0.6, grid = "h",
          main = paste("Life expectancy, trimmed mean, trim = ", jTrim),
          scales = list(x = list(rot = c(45, 0))),
          auto.key = list(reverse.rows = TRUE,
                          x = 0.07, y = 0.95, corner = c(0, 1)))

Life expectancy, trimmed mean, trim = 0.2:

## Error: could not find function "myTrimMean"

	year	lifeExp
NA
NA.1
NA.2
NA.3

Depict the maximum and minimum of GDP per capita for all continents.

It is awkward to ignore the year here, since there are strong temporal trends in GDP per capita. You are encouraged to address that in some way. Easy: consider only one year. Harder but better: work the year variable into your visual display.

## get continent*year-specific min and max in tall data.frame
extGdppercapByContinent <- ddply(jDat, ~ continent + year, function(x) {
  jLevels <- c("min", "max")
  data.frame(gdpPercap = range(x$gdpPercap),
             stat = factor(jLevels, levels = jLevels))
  })
rawPlot <- xyplot(gdpPercap ~ year | continent, extGdppercapByContinent,
                  group = stat, type = "b", grid = "h", as.table = TRUE,
                  auto.key = list(columns = 2))
logPlot <- xyplot(gdpPercap ~ year | continent, extGdppercapByContinent,
                  group = stat, type = "b", grid = "h", as.table = TRUE,
                  auto.key = list(columns = 2),
                  scales = list(y = list(log = TRUE, equispaced.log = FALSE)))
                  #yscale.components = yscale.components.logpower)
print(rawPlot)
print(logPlot)

continent	stat	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
Africa	max	4725	5487	6757	18773	21011	21951	17364	11864	13522	14723	12522	13206
Africa	min	299	336	355	413	464	502	462	390	411	312	241	278
Americas	max	13990	14847	16173	19530	21806	24073	25010	29884	32004	35767	39097	42952
Americas	min	1398	1544	1662	1452	1654	1874	2011	1823	1456	1342	1270	1202
Asia	max	108382	113523	95458	80895	109348	59265	33693	28118	34933	40301	36023	47307
Asia	min	331	350	388	349	357	371	424	385	347	415	611	944
Europe	max	14734	17909	20431	22966	27195	26982	28398	31541	33966	41283	44684	49357
Europe	min	974	1354	1710	2172	2860	3528	3631	3739	2497	3193	4604	5937

Wow, what's up with extremely high GDP/capita in Asia prior to 1980? O.I.L.

Too bad it's not easy to display this in a transposed fashion without actually transposing the object. xtable does not support that. Wonder if another package does? The hard part is that data.frame variables -- so table columns -- can have different flavors but data.frame or matrix rows -- table rows -- cannot. Therein lies the fiddliness.

yr1952	yr1957	yr1962	yr1967	yr1972	yr1977	yr1982	yr1987	yr1992	yr1997	yr2002	yr2007
Kuwait	Kuwait	Kuwait	Kuwait	Kuwait	Kuwait	Saudi Arabia	Kuwait	Kuwait	Kuwait	Singapore	Kuwait
108382	113523	95458	80895	109348	59265	33693	28118	34933	40301	36023	47307

Look at the spread of GDP per capita within the continents.

Ditto above re: using one year only or incorporating the year properly.

First I ignore year, which is a little sketchy....

foo <- ddply(jDat, ~ continent, summarize,
             sdGdpPercap = sd(gdpPercap), iqrGdpPercap = IQR(gdpPercap),
             madGdpPercap = mad(gdpPercap))
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ continent, foo,
       type = "b", ylab = "measure of spread",
       auto.key = list(x = 0.15, y = 0.85, corner = c(0, 1)))

	Africa	Americas	Asia	Europe
sdGdpPercap	2828	6397	14045	9355
iqrGdpPercap	1616	4402	7492	13248
madGdpPercap	775	3269	2821	8846

Then I account for year. This shows why robust statistics, like the median, MAD, and IQR, are more handy in real life than your intro stats course may have indicated. Watch the sd chase the oil-rich Arab GDP/per capita outliers, which do not really characterize the entire set of countries in Asia.

foo <- ddply(jDat, ~ continent + year, summarize,
             sdGdpPercap = sd(gdpPercap), iqrGdpPercap = IQR(gdpPercap),
             madGdpPercap = mad(gdpPercap))
## cheap trick using lattice's extended formula interface
## avoiding reshaping
xyplot(sdGdpPercap + iqrGdpPercap + madGdpPercap ~ year, foo,
       group = reorder(continent, sdGdpPercap), layout = c(3, 1),
       type = "b", ylab = "measure of spread",
       auto.key = list(x = 0.35, y = 0.85, corner = c(0, 1),
                       reverse.rows = TRUE))

Wish I knew how to make table shrink to fit the width available. Probably not possible with the technology I am using.

continent	measSpread	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
Asia	sd	18635	19507	16416	14063	19088	11816	8701	8090	9727	11094	11151	14155
Asia	iqr	2286	2497	3362	5071	7548	10034	11511	9939	13430	17800	17141	19864
Asia	mad	921	1292	1428	1970	2775	3639	4922	4825	4282	4316	4498	4566
Europe	sd	3114	3678	4199	4725	5510	5874	6453	7483	9110	10065	11197	11800
Europe	iqr	3996	5202	5558	6619	7465	8692	9452	11047	16367	17243	18652	19006
Europe	mad	2984	3692	4226	5135	6001	6864	7647	9015	12066	12348	13027	12506
Americas	sd	3002	3312	3422	4161	4754	5356	5530	6665	7047	7874	8896	9713
Americas	iqr	1512	2269	2430	2546	2778	2918	4739	3667	3698	5083	3939	6249
Americas	mad	1265	1917	1720	2077	2230	2260	3464	3292	3231	3934	3167	4774
Africa	sd	983	1135	1462	2848	3287	4142	3243	2567	2644	2821	2973	3618
Africa	iqr	920	966	1029	1110	1476	2036	1959	2022	1964	2064	2534	3131
Africa	mad	697	712	786	743	952	1015	899	814	728	771	819	1032

Depict the number and/or proportion of countries with low life expectancy over time by continent.

Make sure to give your definition of "low life expectancy".

#(bMark <- quantile(jDat$lifeExp, 0.2))
bMark <- 43 # JB wrote this on her 43rd b'day!
stripplot(lifeExp ~ factor(year) | reorder(continent, lifeExp), jDat,
          jitter.data = TRUE, alpha = 0.6, grid = "h",
          group = lifeExp <= bMark,
          main = paste("Life expectancy <= JB's current age =", bMark),
          scales = list(x = list(rot = c(45, 0))))

This table will be truly huge if I transpose, as I've done elsewhere. Leaving as is.

	Africa	Asia	Americas
1952	0.83 (43/52)	0.36 (12/33)	0.20 (5/25)
1957	0.65 (34/52)	0.33 (11/33)	0.08 (2/25)
1962	0.54 (28/52)	0.15 (5/33)	0.00 (0/25)
1967	0.38 (20/52)	0.09 (3/33)	0.00 (0/25)
1972	0.25 (13/52)	0.09 (3/33)	0.00 (0/25)
1977	0.19 (10/52)	0.06 (2/33)	0.00 (0/25)
1982	0.13 (7/52)	0.03 (1/33)	0.00 (0/25)
1987	0.08 (4/52)	0.03 (1/33)	0.00 (0/25)
1992	0.10 (5/52)	0.03 (1/33)	0.00 (0/25)
1997	0.12 (6/52)	0.03 (1/33)	0.00 (0/25)
2002	0.08 (4/52)	0.03 (1/33)	0.00 (0/25)
2007	0.12 (6/52)	0.00 (0/33)	0.00 (0/25)

Find countries with extremely low or high life expectancy in 1952 or that exhibit extremely rapid or slow life expectancy gains.

Find them and then plot their data.

Tip. The hard part here is finding the interesting countries in an elegant manner. If you don't / can't do that, you can hard-wire the interesting countries, just so you get to make the plots! (Nice demo of why it's EVIL to have commas in the country names.) Here you go:

Africa, intercepts
- low: Gambia | Sierra Leone | Guinea
- high: Reunion | Zimbabwe | Mauritius
Asia, intercepts
- low: Afghanistan | Yemen, Rep. | Nepal
- high: Hong Kong, China | Japan | Israel
Americas, intercepts
- low: Bolivia | Haiti | Guatemala
- high: Puerto Rico | United States | Canada
Europe, intercepts
- low: Turkey | Bosnia and Herzegovina | Albania
- high: Netherlands | Iceland | Norway
Africa, slopes
- low: Zimbabwe | Zambia | Rwanda
- high: Gambia | Tunisia | Libya
Asia, slopes
- low: Iraq | Sri Lanka | Lebanon
- high: Saudi Arabia | Vietnam | Oman
Americas, slopes
- low: Paraguay | Trinidad and Tobago | Uruguay
- high: Guatemala | Honduras | Nicaragua
Europe, slopes
- low: Denmark | Hungary | Norway
- high: Portugal | Bosnia and Herzegovina | Turkey

Find countries with sudden, substantial departures from the temporal trend in one of the quantitative measures.

Find them and then plot their data.

I think we are looking at the effect of HIV/AIDS in Africa and the effect of despots, genocide, political upheaval, and foreign invasions elsewhere. With respect to Africa, the 2007 data shows the beginning of the impact of antiretroviral medicines pumped into Africa through massive infusions of foreign aid. Here is an interesting NYT article on how the coffin-making businesses of Africa are struggling now that AIDS deaths are declining.

residInteresting <-
  subset(lmGoodies, scaledResid > 4, select = country, drop = TRUE)
xyplot(lifeExp ~ year | country, 
       subset(jDat, country %in% residInteresting),
       group = continent,
       grid = TRUE, type = c("p", "r"))

Student work on HW 4.

attali-dea source | report
baik-jon source | report
bolandnazar-moh source | report
brueckman-chr EDIT HERE
chu-jus source | report
Zachary Daly: source | report
dinsdale-dan source | report
gao-wen source | report
Matthew Gingerich: source | report
hu-yum source | report
jewell-sea source report
johnston-reb source | report
mahdiar khosravi source | report
Wooyong Lee: source | report
liao-wei: source | report
ma-hui source | report
meng-viv source | report
mohd abul basher-abd source | report
ni-jac source | report
Christian Okkels source | report
Greg Owens: source | report
Mina Park: source | report
spencer-nei source | report
wang-ton source | report
Leah Weber: source | report
woollard-geo source | report
xue-xinxin source | report
yuen-ama source | report
zhang-yim source| report
zhang-jon source | report
zhang-yif source | report
zhang-jin source | report
inskip-jes source | report
liu-yan source | report
haraty-mon EDIT HERE
yuen-mac source | report