Ignore if you don't need this bit of support.
This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the Gapminder project. Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an RStudio project. To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the getwd() command or by glancing at the top of RStudio's Console pane.
Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodically copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "colors" and "base graphics".
Assuming the data can be found in the current working directory, this works:
gDat <- read.delim("gapminderDataFiveYear.txt")
Plan B (which I use here, because of where the source of this tutorial lives):
## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
Basic sanity check that the import has gone well:
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ..
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 ..
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
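One caveat if you run this today: the Factor columns in the str() output above reflect the behaviour of R before version 4.0.0, where read.delim() converted strings to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so country and continent come back as character unless you ask otherwise. A minimal sketch of the difference, using a three-row inline excerpt (passed through the text = argument) as a stand-in for the real file:

```r
## three rows copied from the printouts in this tutorial;
## 'txt' is a hypothetical stand-in for gapminderDataFiveYear.txt
txt <- "country\tyear\tpop\tcontinent\tlifeExp\tgdpPercap
Chad\t2007\t10238807\tAfrica\t50.65\t1704.1
Nepal\t2007\t28901790\tAsia\t63.78\t1091.4
Norway\t2007\t4627926\tEurope\t80.20\t49357.2"

asDefault <- read.delim(text = txt)                          # strings stay character in R >= 4.0.0
asFactor  <- read.delim(text = txt, stringsAsFactors = TRUE) # strings become factors

class(asDefault$continent)
class(asFactor$continent)
levels(asFactor$continent)
```

The rest of this tutorial relies on levels() and nlevels() of continent, so adding stringsAsFactors = TRUE to the read.delim() call is probably the simplest way to follow along on a current R.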
I need a small well-behaved excerpt from the Gapminder data for demonstration purposes. I randomly draw 8 countries, keep their data from 2007, and sort the rows based on GDP per capita. Meet jDat:
## country year pop continent lifeExp gdpPercap
## 504 Eritrea 2007 4906585 Africa 58.04 641.4
## 1080 Nepal 2007 28901790 Asia 63.78 1091.4
## 276 Chad 2007 10238807 Africa 50.65 1704.1
## 792 Jamaica 2007 2780132 Americas 72.57 7320.9
## 396 Cuba 2007 11416987 Americas 78.27 8948.1
## 360 Costa Rica 2007 4133884 Americas 78.78 9645.1
## 576 Germany 2007 82400996 Europe 79.41 32170.4
## 1152 Norway 2007 4627926 Europe 80.20 49357.2
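The random draw cannot be reproduced exactly without the seed, which is not shown here. If you just want an object to follow along with, one option is to rebuild the excerpt from the printout above. This is a sketch under that assumption, not the original code; note the row names will be 1 to 8 rather than the original row numbers.

```r
## values copied by hand from the printout above
jDat <- data.frame(
  country   = c("Eritrea", "Nepal", "Chad", "Jamaica",
                "Cuba", "Costa Rica", "Germany", "Norway"),
  year      = 2007L,
  pop       = c(4906585, 28901790, 10238807, 2780132,
                11416987, 4133884, 82400996, 4627926),
  continent = c("Africa", "Asia", "Africa", "Americas",
                "Americas", "Americas", "Europe", "Europe"),
  lifeExp   = c(58.04, 63.78, 50.65, 72.57, 78.27, 78.78, 79.41, 80.20),
  gdpPercap = c(641.4, 1091.4, 1704.1, 7320.9,
                8948.1, 9645.1, 32170.4, 49357.2),
  stringsAsFactors = TRUE   # country and continent become factors
)

## sort the rows on GDP per capita, as described in the text
jDat <- jDat[order(jDat$gdpPercap), ]

## With the full gDat in hand, something along these lines should give the
## same rows (keepers is a hypothetical character vector of the 8 names):
## jDat <- droplevels(subset(gDat, country %in% keepers & year == 2007))
```

Either way, the continent factor should end up with only the four levels that actually occur in the excerpt, which matters for the color scheme built later.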
We will use palettes from the RColorBrewer package, so load it now:
library(RColorBrewer)
Remind yourself how and why we do this in the first block on colors.
## how to change the plot symbol in a simple, non-knitr setting
opar <- par(pch = 19)
The plots we made in the first block on colors were unrealistic (who really wants each point to have its own color?) and elementary (it’s not that hard to get that far by yourself).
In the real world, you'll want to encode a factor via color. This is, of course, one of the most compelling reasons to switch to ggplot2 or lattice, but it's informative to do this "by hand" a few times in your life.
First, remake the basic scatterplot without color.
jXlim <- c(460, 60000)
jYlim <- c(47, 82)
plot(lifeExp ~ gdpPercap, jDat, log = 'x', xlim = jXlim, ylim = jYlim)
Let's color the points according to the continent factor. Using base graphics, there is no escape from hand-crafting an appropriate vector of colors. The only question is: how will you do it?
Before we get bogged down in details, we do some set-up, modelling some general best practices. I create a small data.frame to hold my color scheme. This facilitates all the solutions below and leaves me in a good position for using the scheme in other base graphics plots, for making a color key, for changing the scheme, etc. I also resist the temptation to pick my own colors and, instead, use a qualitative palette from RColorBrewer.
## with() lets us refer to the continent factor inside jDat directly
(jColors <-
   with(jDat,
        data.frame(continent = levels(continent),
                   color = I(brewer.pal(nlevels(continent), name = 'Dark2')))))
## continent color
## 1 Africa #1B9E77
## 2 Americas #D95F02
## 3 Asia #7570B3
## 4 Europe #E7298A
With intention, this data.frame has a factor continent with the same name as the continent factor in jDat and with the same levels, in the same order. I am also using I() to protect the color-specifying hex strings from being converted to factor.
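To check those two properties for yourself, here is a small self-contained sketch. The hex strings are hard-coded from the printout above so that it runs without RColorBrewer, and continent and scheme are hypothetical stand-ins for jDat$continent and jColors.

```r
## stand-in for jDat$continent
continent <- factor(c("Africa", "Asia", "Africa", "Americas", "Europe"))

## first four Dark2 colors, copied from the printout above
dark2 <- c("#1B9E77", "#D95F02", "#7570B3", "#E7298A")

## stand-in for jColors; the factor is built explicitly so that it has
## the same levels, in the same order, on any version of R
scheme <- data.frame(
  continent = factor(levels(continent), levels = levels(continent)),
  color     = I(dark2)
)

identical(levels(scheme$continent), levels(continent))  # same levels, same order?
inherits(scheme$color, "AsIs")                          # I() marks the column "as is"
is.character(scheme$color)                              # the hex strings are still strings
```

On R 4.0.0 and later, data.frame() no longer converts strings to factors by default, so I() is less essential than it once was, but it does no harm.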
Now we must create a vector specifying colors from our scheme in the proper order, i.e. reflecting the continent of each country.
If you are a recovering Excel user, think of match() as one of your table look-up functions. It looks up the values in the first argument x in the second argument, table, and returns positive integers that reflect the index of where an individual x value is first found in table.
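A toy example may help before we apply this to the real data. The vectors here are made up purely for illustration:

```r
## hypothetical look-up: where does each value of x first occur in tbl?
x   <- c("b", "a", "c", "a", "z")
tbl <- c("a", "b", "c", "a")

match(x, tbl)
## [1]  2  1  3  1 NA
```

Note that "a" occurs twice in tbl but match() reports only the first position, and that a value with no match at all ("z") yields NA rather than an error.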
We're going to look up the continents listed in jDat in the corresponding continent factor in the color scheme data.frame jColors:
## continent color
## 1 Africa #1B9E77
## 2 Americas #D95F02
## 3 Asia #7570B3
## 4 Europe #E7298A
data.frame(subset(jDat, select = c(country, continent)),
matchRetVal = match(jDat$continent, jColors$continent))
## country continent matchRetVal
## 504 Eritrea Africa 1
## 1080 Nepal Asia 3
## 276 Chad Africa 1
## 792 Jamaica Americas 2
## 396 Cuba Americas 2
## 360 Costa Rica Americas 2
## 576 Germany Europe 4
## 1152 Norway Europe 4
Stare hard at the match() results in the last column and check a couple of values "by hand" to cement your understanding.
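In this case, I could duplicate the result with unclass(jDat$continent), but that is neither safe nor extensible: it only works because the rows of jColors happen to be in the same order as the levels of the continent factor.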
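To see how the shortcut can go wrong, consider a hypothetical color scheme whose rows are not in factor-level order. The integer codes behind the factor no longer line up with the rows of the scheme, while match() still finds the right ones. All object names and colors below are made up for illustration.

```r
## made-up factor; its levels are sorted alphabetically: Africa, Asia, Europe
continent <- factor(c("Africa", "Asia", "Africa", "Europe"))

## made-up scheme, deliberately NOT in level order
scheme <- data.frame(continent = c("Europe", "Asia", "Africa"),
                     color     = c("red", "green", "blue"))

codes  <- as.integer(unclass(continent))      # 1 2 1 3: position among the levels
lookup <- match(continent, scheme$continent)  # 3 2 3 1: position among scheme rows

scheme$color[codes]   # wrong: Africa gets "red", which belongs to Europe
scheme$color[lookup]  # right: Africa gets "blue"
```

The match() version keeps working if you reorder the scheme or add rows for continents that do not occur in the data.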
Now I can use the match() results to index into the color variable of my color scheme. That creates the vector of colors I need for the col = argument of plot():
plot(lifeExp ~ gdpPercap, jDat, log = 'x', xlim = jXlim, ylim = jYlim,
col = jColors$color[match(jDat$continent, jColors$continent)],
main = 'custom color scheme based on Dark2', cex = 2)
legend(x = 'bottomright',
legend = as.character(jColors$continent),
col = jColors$color, pch = par("pch"), bty = 'n', xjust = 1)
I added a legend "by hand" too. That sort of tedium is greatly reduced in ggplot2 and lattice, both of which can do that fairly automagically.
If you are a recovering Excel user, add merge() to your list of functions related to table look-up. If you have some experience with databases, merge() implements join operations. merge() is much more powerful than match() and, accordingly, a bit harder to master. It takes two data.frames and combines them in a systematic way to make a new data.frame. I will not describe merge() in its full generality but will focus on our current use case, which is fairly typical and down-to-earth.
First, merge() will look for variable names that are shared between the two inputs. In our case, there is exactly one: continent. This is our matching variable.
Second, merge() will find rows in each of the input data.frames that match on continent and join their data. Since each continent occurs exactly once in the color scheme data.frame jColors, life is very good. We don't have to worry about what happens when these matches involve multiple rows in the two sources. In our case, it's easy to accept that the merged result will have one row per row in jDat, the larger of the two data.frames in terms of rows. The only novel information jColors offers is the color information, so the merged result will also have exactly one new variable: color.
(jDatColor <- merge(jDat, jColors))
## continent country year pop lifeExp gdpPercap color
## 1 Africa Eritrea 2007 4906585 58.04 641.4 #1B9E77
## 2 Africa Chad 2007 10238807 50.65 1704.1 #1B9E77
## 3 Americas Jamaica 2007 2780132 72.57 7320.9 #D95F02
## 4 Americas Cuba 2007 11416987 78.27 8948.1 #D95F02
## 5 Americas Costa Rica 2007 4133884 78.78 9645.1 #D95F02
## 6 Asia Nepal 2007 28901790 63.78 1091.4 #7570B3
## 7 Europe Germany 2007 82400996 79.41 32170.4 #E7298A
## 8 Europe Norway 2007 4627926 80.20 49357.2 #E7298A
And now we can provide this new color variable as the value of the col = argument of plot():
plot(lifeExp ~ gdpPercap, jDatColor, log = 'x', xlim = jXlim, ylim = jYlim,
col = color,
main = 'custom color scheme based on Dark2', cex = 2)
legend(x = 'bottomright',
legend = as.character(jColors$continent),
col = jColors$color, pch = par("pch"), bty = 'n', xjust = 1)
Of the match() and merge() approaches, I am drawn more to merge(). The code is easier to read and write.
The merge() approach "contaminates" your data with color information, which feels slightly inelegant. But I'm OK with that.
The merge() approach leaves behind a nice self-contained object that bundles the data with a color scheme. If the creation and application of the color scheme is painful, it can be nice to have this clean result for sharing and downstream reuse.
The merge() function is extremely useful and I urge you to use it in more complicated settings. For the record, it is not necessary for the "matching variables" to have exactly the same names, as ours did here, and you have control over what happens when there are multiple matches or no matches.
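As a small illustration of those last two points, here is a sketch with made-up data in which the matching variable has a different name in each input (handled with the by.x and by.y arguments) and one row has no match in the color scheme (handled with all.x). The objects dat and scheme are hypothetical, not part of the tutorial's data.

```r
## made-up data: the continent variable is called 'cont' here
dat <- data.frame(country = c("Chad", "Nepal", "Fiji"),
                  cont    = c("Africa", "Asia", "Oceania"))

## made-up scheme with no entry for Oceania
scheme <- data.frame(continent = c("Africa", "Asia"),
                     color     = c("#1B9E77", "#7570B3"))

## default: rows of dat without a match are silently dropped
inner <- merge(dat, scheme, by.x = "cont", by.y = "continent")

## all.x = TRUE: keep every row of dat, filling color with NA where unmatched
keepAll <- merge(dat, scheme, by.x = "cont", by.y = "continent", all.x = TRUE)

nrow(inner)    # Fiji is gone
keepAll        # Fiji is kept, with color NA
```

Keeping unmatched rows with all.x = TRUE is often the safer choice when plotting, since an NA color is easier to notice and debug than a point that has quietly disappeared.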
## execute this if you followed my code for
## changing the default plot symbol in a simple, non-knitr setting
## reversing the effects of this: opar <- par(pch = 19)
par(opar)