Go back to STAT545A home
Follow the existing homework submission instructions UP TO THE LINK SUBMISSION PART. Submit your links by editing this Google doc -- WARNING: I turned off sharing ~3pm Mon Sept 23 because almost everyone had submitted their links. Contact JB if you need to submit or edit a link.
gapminderDataFiveYear.txt
.plyr
package to tackle some, not all, tasks from the list below. You decide what to work on but spend ~2 hours on this. Of course you are welcome to do more if you wish.echo = FALSE
very sparinglyTips
plyr
and xtable
packages## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(plyr)
library(xtable)
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## define a function for converting and printing to HTML table
htmlPrint <- function(x, ...,
digits = 0, include.rownames = FALSE) {
print(xtable(x, digits = digits, ...), type = 'html',
include.rownames = include.rownames, ...)
}
I wrote a function
htmlPrint()
to create an HTML table and print it.
You are welcome to create your own exercises by riffing on what you see below. In fact, that's great -- you won't be copying from your office mate and it emulates real life where you first need to think up interesting questions, then solve them. The easiest sort of variation is to do what I ask below but with a different variable, different summary statistic, different tuning parameter, etc.
Produce a data.frame with 3 variables: a continent factor, the minimum GDP per capita, the maximum GDP per capita. There will be one row per continent. Sort on either the minimum or the maximum and assess whether the continents rank similarly w/r/t the other extreme. Discuss.
foo <- ddply(gDat, ~ continent, summarize,
minGdpPercap = min(gdpPercap), maxGdpPercap = max(gdpPercap))
htmlPrint(arrange(foo, minGdpPercap))
continent | minGdpPercap | maxGdpPercap |
---|---|---|
Africa | 241 | 21951 |
Asia | 331 | 113523 |
Europe | 974 | 49357 |
Americas | 1202 | 42952 |
Oceania | 10040 | 34435 |
Produce a data.frame with 3 variables: a continent factor, a GDP per capita summary statistic, a factor that describes the statistic (i.e. "min" vs "max"). There will be one row per continent * statistic combination. Don't use reshaping -- get this directly with data aggregation.
foo <- ddply(gDat, ~ continent, function(x) {
gdpPercap <- range(x$gdpPercap)
return(data.frame(gdpPercap, stat = c("min", "max")))
})
htmlPrint(foo)
continent | gdpPercap | stat |
---|---|---|
Africa | 241 | min |
Africa | 21951 | max |
Americas | 1202 | min |
Americas | 42952 | max |
Asia | 331 | min |
Asia | 113523 | max |
Europe | 974 | min |
Europe | 49357 | max |
Oceania | 10040 | min |
Oceania | 34435 | max |
Measures of spread (we will discuss in class soon, just try them out!):
sd()
and var()
mad()
IQR()
Produce a data.frame with one row per continent and variables containing the different measures of spread for GDP/capita. Sort on one of those and assess whether the continents rank similarly w/r/t heterogeneity of GDP/capita by the other measures. Discuss.
foo <- ddply(gDat, ~ continent, summarize,
sdGdpPercap = sd(gdpPercap), madGdpPercap = mad(gdpPercap),
iqrGdpPercap = IQR(gdpPercap))
htmlPrint(arrange(foo, sdGdpPercap))
continent | sdGdpPercap | madGdpPercap | iqrGdpPercap |
---|---|---|---|
Africa | 2828 | 775 | 1616 |
Oceania | 6359 | 6459 | 8072 |
Americas | 6397 | 3269 | 4402 |
Europe | 9355 | 8846 | 13248 |
Asia | 14045 | 2821 | 7492 |
## spot checks!
IQR(gDat$gdpPercap[gDat$continent == "Europe"])
[1] 13248
mad(gDat$gdpPercap[gDat$continent == "Asia"])
[1] 2821
If you don't get an error but you are still worried you aren't doing the right computation, pick a few cases in the middle at random and check "by hand".
Compute the average life expectancy by year, as a warm-up exercise. Then consider an alternative: a trimmed mean. There is a trim
argument in the built-in function mean()
. Pick a modest trim fraction that is greater than 0. Report your sorted results. Make sure your trim level is well-documented.
jTrim <- 0.2
foo <- ddply(gDat, ~ year, summarize, tMean = mean(lifeExp, trim = jTrim))
## Error: object 'jTrim' not found
htmlPrint(arrange(foo, tMean))
## Error: object 'tMean' not found
Hear ye, hear ye: the trim level for the trimmed means above is 0.2. Let it be known that I did not type that number in; rather it was printed from the value assigned in the previous R chunk.
If you haven't used it yet, you should experiment with the backtick method of including inline R code. Look at the R Markdown file to see how I do this. If I change
jTrim
to a different value in the R code, my prose will automatically update.
Get a data.frame with 3 variables: continent, year, and mean or median life expectancy. What are the trends over time? Are they the same or different across the continents? Is this easy to assess from this data.frame?
htmlPrint(ddply(gDat, ~ continent + year,
summarize, medLifeExp = median(lifeExp)))
continent | year | medLifeExp |
---|---|---|
Africa | 1952 | 39 |
Africa | 1957 | 41 |
Africa | 1962 | 43 |
Africa | 1967 | 45 |
Africa | 1972 | 47 |
Africa | 1977 | 49 |
Africa | 1982 | 51 |
Africa | 1987 | 52 |
Africa | 1992 | 52 |
Africa | 1997 | 53 |
Africa | 2002 | 51 |
Africa | 2007 | 53 |
Americas | 1952 | 55 |
Americas | 1957 | 56 |
Americas | 1962 | 58 |
Americas | 1967 | 61 |
Americas | 1972 | 63 |
Americas | 1977 | 66 |
Americas | 1982 | 67 |
Americas | 1987 | 69 |
Americas | 1992 | 70 |
Americas | 1997 | 72 |
Americas | 2002 | 72 |
Americas | 2007 | 73 |
Asia | 1952 | 45 |
Asia | 1957 | 48 |
Asia | 1962 | 49 |
Asia | 1967 | 54 |
Asia | 1972 | 57 |
Asia | 1977 | 61 |
Asia | 1982 | 64 |
Asia | 1987 | 66 |
Asia | 1992 | 69 |
Asia | 1997 | 70 |
Asia | 2002 | 71 |
Asia | 2007 | 72 |
Europe | 1952 | 66 |
Europe | 1957 | 68 |
Europe | 1962 | 70 |
Europe | 1967 | 71 |
Europe | 1972 | 71 |
Europe | 1977 | 72 |
Europe | 1982 | 73 |
Europe | 1987 | 75 |
Europe | 1992 | 75 |
Europe | 1997 | 76 |
Europe | 2002 | 78 |
Europe | 2007 | 79 |
Oceania | 1952 | 69 |
Oceania | 1957 | 70 |
Oceania | 1962 | 71 |
Oceania | 1967 | 71 |
Oceania | 1972 | 72 |
Oceania | 1977 | 73 |
Oceania | 1982 | 74 |
Oceania | 1987 | 75 |
Oceania | 1992 | 77 |
Oceania | 1997 | 78 |
Oceania | 2002 | 80 |
Oceania | 2007 | 81 |
Get a data.frame with 1 + 5 = 6 variables: year and then one variable per continent, giving mean or median life expectancy. Is there a natural way to do this in plyr
without hard-wiring the continent-specific variables and without reshaping an intermediate result? Since I am new-ish to plyr
, I don't know yet if this is possible and/or a totally stupid way to go about this. So this one is for people who are happy to play around a bit. Hint: I've made some progress here and advise you look into daply()
.
foo <- daply(gDat, ~ year + continent,
summarize, medLifeExp = median(lifeExp))
#str(foo)
foo <- as.data.frame(foo)
#str(foo) ## there's still some goofiness, ie the variables are lists .... hmmmm
htmlPrint(foo, include.rownames = TRUE)
Africa | Americas | Asia | Europe | Oceania | |
---|---|---|---|---|---|
1952 | 39 | 55 | 45 | 66 | 69 |
1957 | 41 | 56 | 48 | 68 | 70 |
1962 | 43 | 58 | 49 | 70 | 71 |
1967 | 45 | 61 | 54 | 71 | 71 |
1972 | 47 | 63 | 57 | 71 | 72 |
1977 | 49 | 66 | 61 | 72 | 73 |
1982 | 51 | 67 | 64 | 73 | 74 |
1987 | 52 | 69 | 66 | 75 | 75 |
1992 | 52 | 70 | 69 | 75 | 77 |
1997 | 53 | 72 | 70 | 76 | 78 |
2002 | 51 | 72 | 71 | 78 | 80 |
2007 | 53 | 73 | 72 | 79 | 81 |
Arguably, this table is better transposed. So just switch the order of
continent
andyear
in theplyr
call.
foo <- daply(gDat, ~ continent + year,
summarize, medLifeExp = median(lifeExp))
htmlPrint(as.data.frame(foo), include.rownames = TRUE)
1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Africa | 39 | 41 | 43 | 45 | 47 | 49 | 51 | 52 | 52 | 53 | 51 | 53 |
Americas | 55 | 56 | 58 | 61 | 63 | 66 | 67 | 69 | 70 | 72 | 72 | 73 |
Asia | 45 | 48 | 49 | 54 | 57 | 61 | 64 | 66 | 69 | 70 | 71 | 72 |
Europe | 66 | 68 | 70 | 71 | 71 | 72 | 73 | 75 | 75 | 76 | 78 | 79 |
Oceania | 69 | 70 | 71 | 71 | 72 | 73 | 74 | 75 | 77 | 78 | 80 | 81 |
Compute some measure of worldwide life expectancy -- you decide -- a mean or median or some other quantile or perhaps your current age. The determine how many countries on each continent have a life expectancy less than this benchmark, for each year. Produce a data.frame with 3 variables: continent, year, and a country count. What's the trend over time? Is that trend the same for all continents? How much fun is it to assess this from this tall table?
#(bMark <- quantile(gDat$lifeExp, 0.2))
bMark <- 43 # JB wrote this on her 43rd b'day!
htmlPrint(ddply(gDat, ~ continent + year,
function(x) c(lowLifeExp = sum(x$lifeExp <= bMark))))
continent | year | lowLifeExp |
---|---|---|
Africa | 1952 | 43 |
Africa | 1957 | 34 |
Africa | 1962 | 28 |
Africa | 1967 | 20 |
Africa | 1972 | 13 |
Africa | 1977 | 10 |
Africa | 1982 | 7 |
Africa | 1987 | 4 |
Africa | 1992 | 5 |
Africa | 1997 | 6 |
Africa | 2002 | 4 |
Africa | 2007 | 6 |
Americas | 1952 | 5 |
Americas | 1957 | 2 |
Americas | 1962 | 0 |
Americas | 1967 | 0 |
Americas | 1972 | 0 |
Americas | 1977 | 0 |
Americas | 1982 | 0 |
Americas | 1987 | 0 |
Americas | 1992 | 0 |
Americas | 1997 | 0 |
Americas | 2002 | 0 |
Americas | 2007 | 0 |
Asia | 1952 | 12 |
Asia | 1957 | 11 |
Asia | 1962 | 5 |
Asia | 1967 | 3 |
Asia | 1972 | 3 |
Asia | 1977 | 2 |
Asia | 1982 | 1 |
Asia | 1987 | 1 |
Asia | 1992 | 1 |
Asia | 1997 | 1 |
Asia | 2002 | 1 |
Asia | 2007 | 0 |
Europe | 1952 | 0 |
Europe | 1957 | 0 |
Europe | 1962 | 0 |
Europe | 1967 | 0 |
Europe | 1972 | 0 |
Europe | 1977 | 0 |
Europe | 1982 | 0 |
Europe | 1987 | 0 |
Europe | 1992 | 0 |
Europe | 1997 | 0 |
Europe | 2002 | 0 |
Europe | 2007 | 0 |
Oceania | 1952 | 0 |
Oceania | 1957 | 0 |
Oceania | 1962 | 0 |
Oceania | 1967 | 0 |
Oceania | 1972 | 0 |
Oceania | 1977 | 0 |
Oceania | 1982 | 0 |
Oceania | 1987 | 0 |
Oceania | 1992 | 0 |
Oceania | 1997 | 0 |
Oceania | 2002 | 0 |
Oceania | 2007 | 0 |
The continents have very different numbers of countries. Maybe the proportion of "low life expectancy" countries is more relevant than the raw count. Or maybe the final table should report both? Work on that.
## note: bMark set in previous chunk
htmlPrint(ddply(gDat, ~ continent + year, function(x) {
jCount = sum(x$lifeExp <= bMark)
c(count = jCount, prop = jCount / nrow(x))
}), digits = c(0, 0, 0, 0, 2))
continent | year | count | prop |
---|---|---|---|
Africa | 1952 | 43 | 0.83 |
Africa | 1957 | 34 | 0.65 |
Africa | 1962 | 28 | 0.54 |
Africa | 1967 | 20 | 0.38 |
Africa | 1972 | 13 | 0.25 |
Africa | 1977 | 10 | 0.19 |
Africa | 1982 | 7 | 0.13 |
Africa | 1987 | 4 | 0.08 |
Africa | 1992 | 5 | 0.10 |
Africa | 1997 | 6 | 0.12 |
Africa | 2002 | 4 | 0.08 |
Africa | 2007 | 6 | 0.12 |
Americas | 1952 | 5 | 0.20 |
Americas | 1957 | 2 | 0.08 |
Americas | 1962 | 0 | 0.00 |
Americas | 1967 | 0 | 0.00 |
Americas | 1972 | 0 | 0.00 |
Americas | 1977 | 0 | 0.00 |
Americas | 1982 | 0 | 0.00 |
Americas | 1987 | 0 | 0.00 |
Americas | 1992 | 0 | 0.00 |
Americas | 1997 | 0 | 0.00 |
Americas | 2002 | 0 | 0.00 |
Americas | 2007 | 0 | 0.00 |
Asia | 1952 | 12 | 0.36 |
Asia | 1957 | 11 | 0.33 |
Asia | 1962 | 5 | 0.15 |
Asia | 1967 | 3 | 0.09 |
Asia | 1972 | 3 | 0.09 |
Asia | 1977 | 2 | 0.06 |
Asia | 1982 | 1 | 0.03 |
Asia | 1987 | 1 | 0.03 |
Asia | 1992 | 1 | 0.03 |
Asia | 1997 | 1 | 0.03 |
Asia | 2002 | 1 | 0.03 |
Asia | 2007 | 0 | 0.00 |
Europe | 1952 | 0 | 0.00 |
Europe | 1957 | 0 | 0.00 |
Europe | 1962 | 0 | 0.00 |
Europe | 1967 | 0 | 0.00 |
Europe | 1972 | 0 | 0.00 |
Europe | 1977 | 0 | 0.00 |
Europe | 1982 | 0 | 0.00 |
Europe | 1987 | 0 | 0.00 |
Europe | 1992 | 0 | 0.00 |
Europe | 1997 | 0 | 0.00 |
Europe | 2002 | 0 | 0.00 |
Europe | 2007 | 0 | 0.00 |
Oceania | 1952 | 0 | 0.00 |
Oceania | 1957 | 0 | 0.00 |
Oceania | 1962 | 0 | 0.00 |
Oceania | 1967 | 0 | 0.00 |
Oceania | 1972 | 0 | 0.00 |
Oceania | 1977 | 0 | 0.00 |
Oceania | 1982 | 0 | 0.00 |
Oceania | 1987 | 0 | 0.00 |
Oceania | 1992 | 0 | 0.00 |
Oceania | 1997 | 0 | 0.00 |
Oceania | 2002 | 0 | 0.00 |
Oceania | 2007 | 0 | 0.00 |
Create a table with one row per year and one column -- at least conceptually -- per continent. In that "column" report the absolute and/or relative abundance of "low life expectancy" countries. Is there a natural way to do this in plyr
without hard-wiring the continent-specific variables and without reshaping an intermediate result? Since I am new-ish to plyr
, I don't know yet if this is possible and/or a totally stupid way to go about this. So this one is for people who are happy to play around a bit. Hint: I've made some progress here and advise you look into daply()
. I ended up skipping straight to character output versus an object appropriate for further computation or graphing. Sometimes it's better to approach these ill-defined tasks from a different angle.
## note: bMark set in a previous chunk
## my first success ... but was still a 3-dim array :(
## that would be useful for plotting, at least after some reshaping, but is not
## so easy on the eyes
# daply(gDat, ~ year + continent, function(x) {
# jCount <- sum(x$lifeExp < bMark)
# c(count = jCount, prop = jCount / nrow(x))
# })
foo <- daply(gDat, ~ year + continent, function(x) {
jCount <- sum(x$lifeExp <= bMark)
jTotal <- nrow(x)
jProp <- jCount / jTotal
return(sprintf("%1.2f (%d/%d)", jProp, jCount, jTotal))
})
htmlPrint(foo, include.rownames = TRUE)
Africa | Americas | Asia | Europe | Oceania | |
---|---|---|---|---|---|
1952 | 0.83 (43/52) | 0.20 (5/25) | 0.36 (12/33) | 0.00 (0/30) | 0.00 (0/2) |
1957 | 0.65 (34/52) | 0.08 (2/25) | 0.33 (11/33) | 0.00 (0/30) | 0.00 (0/2) |
1962 | 0.54 (28/52) | 0.00 (0/25) | 0.15 (5/33) | 0.00 (0/30) | 0.00 (0/2) |
1967 | 0.38 (20/52) | 0.00 (0/25) | 0.09 (3/33) | 0.00 (0/30) | 0.00 (0/2) |
1972 | 0.25 (13/52) | 0.00 (0/25) | 0.09 (3/33) | 0.00 (0/30) | 0.00 (0/2) |
1977 | 0.19 (10/52) | 0.00 (0/25) | 0.06 (2/33) | 0.00 (0/30) | 0.00 (0/2) |
1982 | 0.13 (7/52) | 0.00 (0/25) | 0.03 (1/33) | 0.00 (0/30) | 0.00 (0/2) |
1987 | 0.08 (4/52) | 0.00 (0/25) | 0.03 (1/33) | 0.00 (0/30) | 0.00 (0/2) |
1992 | 0.10 (5/52) | 0.00 (0/25) | 0.03 (1/33) | 0.00 (0/30) | 0.00 (0/2) |
1997 | 0.12 (6/52) | 0.00 (0/25) | 0.03 (1/33) | 0.00 (0/30) | 0.00 (0/2) |
2002 | 0.08 (4/52) | 0.00 (0/25) | 0.03 (1/33) | 0.00 (0/30) | 0.00 (0/2) |
2007 | 0.12 (6/52) | 0.00 (0/25) | 0.00 (0/33) | 0.00 (0/30) | 0.00 (0/2) |
It was easy (see above) to get the minimum life expectancy seen on each continent by year. Now try to report the year, the continent, the minimum (or max or whatever) life expectancy AND the country it pertains to. You could do something similar with GDP per capita or population (though result likely to be boring with population). Hint: read up on which.min()
and which.max()
.
## this works but produces a frustrating long/tall result; not good for printing
# ddply(gDat, ~ year + continent, function(x) {
# theMin <- which.min(x$lifeExp)
# x[theMin, c("country", "year", "continent", "lifeExp")]
# })
foo <- daply(gDat, ~ year + continent, function(x) {
theMin <- which.min(x$lifeExp)
return(sprintf("%1.0f %s", x$lifeExp[theMin], x$country[theMin]))
})
htmlPrint(foo, include.rownames = TRUE)
Africa | Americas | Asia | Europe | Oceania | |
---|---|---|---|---|---|
1952 | 30 Gambia | 38 Haiti | 29 Afghanistan | 44 Turkey | 69 Australia |
1957 | 32 Sierra Leone | 41 Haiti | 30 Afghanistan | 48 Turkey | 70 New Zealand |
1962 | 33 Sierra Leone | 43 Bolivia | 32 Afghanistan | 52 Turkey | 71 Australia |
1967 | 34 Sierra Leone | 45 Bolivia | 34 Afghanistan | 54 Turkey | 71 Australia |
1972 | 35 Sierra Leone | 47 Bolivia | 36 Afghanistan | 57 Turkey | 72 New Zealand |
1977 | 37 Sierra Leone | 50 Haiti | 31 Cambodia | 60 Turkey | 72 New Zealand |
1982 | 38 Sierra Leone | 51 Haiti | 40 Afghanistan | 61 Turkey | 74 New Zealand |
1987 | 40 Angola | 54 Haiti | 41 Afghanistan | 63 Turkey | 74 New Zealand |
1992 | 24 Rwanda | 55 Haiti | 42 Afghanistan | 66 Turkey | 76 New Zealand |
1997 | 36 Rwanda | 57 Haiti | 42 Afghanistan | 69 Turkey | 78 New Zealand |
2002 | 39 Zambia | 58 Haiti | 42 Afghanistan | 71 Turkey | 79 New Zealand |
2007 | 40 Swaziland | 61 Haiti | 44 Afghanistan | 72 Turkey | 80 New Zealand |
Consider the linear regression we fit in tutorial of life expectancy vs. time. Find the min (and/or max) of the slope (and/or the intercept) within each continent. Report these interesting countries and some info about them in a data.frame. This might work for other variables too.
yearMin <- min(gDat$year)
jFun <- function(x) {
estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
names(estCoefs) <- c("intercept", "slope")
return(estCoefs)
}
jCoefs <- ddply(gDat, ~ country + continent, jFun)
## giving out the Most Improved Award!
## going after maximum slope within continent
foo <- ddply(jCoefs, ~ continent, function(x) {
theMax <- which.max(x$slope)
x[theMax, c("continent", "country", "intercept", "slope")]
})
htmlPrint(foo, digits = c(0, 0, 0, 0, 2))
continent | country | intercept | slope |
---|---|---|---|
Africa | Libya | 42 | 0.63 |
Americas | Nicaragua | 43 | 0.56 |
Asia | Oman | 37 | 0.77 |
Europe | Turkey | 46 | 0.50 |
Oceania | Australia | 68 | 0.23 |
Sudden, substantial departures from the temporal trend is interesting. This goes for life expectancy, GDP per capita, or population. How might one operationalize this notion of "interesting"?
Make a data.frame of interesting countries, reporting their continent affiliation, and summary statistics on life expectancy (or GDP/capita or whatever you were looking at). Won't it be interesting to start to graphically explore them ... next week.
yearMin <- min(gDat$year)
jFun <- function(x) {
jFit <- lm(lifeExp ~ I(year - yearMin), x)
jCoef <- coef(jFit)
names(jCoef) <- NULL
return(c(intercept = jCoef[1],
slope = jCoef[2],
maxResid = max(abs(resid(jFit)))/summary(jFit)$sigma))
}
maxResids <- ddply(gDat, ~ country + continent, jFun)
foo <- ddply(maxResids, ~ continent, function(x) {
theMax <- which.max(x$maxResid)
x[theMax, ]
})
htmlPrint(foo, digits = c(0, 0, 0, 2, 2, 2))
country | continent | intercept | slope | maxResid |
---|---|---|---|---|
Rwanda | Africa | 42.74 | -0.05 | 2.64 |
Ecuador | Americas | 49.07 | 0.50 | 2.25 |
Cambodia | Asia | 37.02 | 0.40 | 2.79 |
Finland | Europe | 66.45 | 0.24 | 2.72 |
Australia | Oceania | 68.40 | 0.23 | 1.65 |
## I can't resist, I must make a plot
xyplot(lifeExp ~ year | country, gDat,
subset = country %in% foo$country, type = c("p", "r"))
Discuss: Is Ecuador really interesting? Do you think it's the most interesting country in the Americas? I doubt it. What could we do better? Making plots is ABSOLUTELY ESSENTIAL to sanity checking your claims and beliefs based on numerical computation.
Student: consider the exercises where we return info on the country that exhibits, e.g., the minimum life expectancy for a certain continent and/or year. What if there are ties and that minimum is achieved by more than 1 country?
I consider this a special case or extension of something we did above: identifying "low life expectancy" countries. So I will mix that code with our min/max finding code. BTW I cannot find an instance where any countries tie for lowest or longest life expectancy, but my two stage solution reflects my attempt to look into that. I created a copy of gDat
and altered two life expectancies to create a tie! No, people are not usually living past 110 in Norway and Sweden!
## I couldn't find any examples of two or more countries sharing the min or max
## life expectancy in a given year; so I tweaked the data!
testDat <- gDat
toReplace <- with(testDat, country %in% c("Norway", "Sweden") & year == 2007)
testDat[toReplace, "lifeExp"] <- 110
## for each life expectancy, compute its rank within all life expectancies
## recorded that year
ggDat <- ddply(testDat, ~ year, transform,
leRank = rank(lifeExp, ties.method = "average"))
## look at the new leRank variable, if you wish
#table(ggDat$leRank)
## pull out the countries with extreme ranks ... with min and max being the most
## extreme of the extremes!
foo <- ddply(ggDat, ~ year, function(x) {
minRank <- min(x$leRank) # you could change this to find the lowest, e.g. 2
maxRank <- max(x$leRank) # and the highest 2
whoMin <- which(x$leRank == minRank) # I allow this to have length > 1
whoMax <- which(x$leRank == maxRank)
data.frame(year = x$year[whoMin[1]],
minLifeExp = x$lifeExp[whoMin[1]],
minCountries = paste(x$country[whoMin], collapse = " "),
minRank = minRank,
maxLifeExp = x$lifeExp[whoMax[1]],
maxCountries = paste(x$country[whoMax], collapse = " "),
maxRank = maxRank)
})
htmlPrint(foo, digits = c(0, 0, 0, 0, 1, 0, 0, 1))
year | minLifeExp | minCountries | minRank | maxLifeExp | maxCountries | maxRank |
---|---|---|---|---|---|---|
1952 | 29 | Afghanistan | 1.0 | 73 | Norway | 142.0 |
1957 | 30 | Afghanistan | 1.0 | 73 | Iceland | 142.0 |
1962 | 32 | Afghanistan | 1.0 | 74 | Iceland | 142.0 |
1967 | 34 | Afghanistan | 1.0 | 74 | Sweden | 142.0 |
1972 | 35 | Sierra Leone | 1.0 | 75 | Sweden | 142.0 |
1977 | 31 | Cambodia | 1.0 | 76 | Iceland | 142.0 |
1982 | 38 | Sierra Leone | 1.0 | 77 | Japan | 142.0 |
1987 | 40 | Angola | 1.0 | 79 | Japan | 142.0 |
1992 | 24 | Rwanda | 1.0 | 79 | Japan | 142.0 |
1997 | 36 | Rwanda | 1.0 | 81 | Japan | 142.0 |
2002 | 39 | Zambia | 1.0 | 82 | Japan | 142.0 |
2007 | 40 | Swaziland | 1.0 | 110 | Norway Sweden | 141.5 |
Remember, the 2007 data for Norway and Sweden is fake; useful for testing this code.