6 Data Visualization with the Grammar of Graphics (ggplot2)
6.1 Introduction
It is often necessary to create graphs to effectively communicate key patterns within a dataset. While many software packages allow the user to make basic plots, it can be challenging to create plots that are customized to address a specific idea. There are numerous ways to create graphs; this tutorial will focus on the R package ggplot2, created by Hadley Wickham.
There are two key functions that are used in ggplot2:
- qplot(), or quick plot, is similar to base plotting functions in R and is primarily used to produce quick and easy graphics.
- ggplot(), the grammar of graphics plot, is different from other graphics functions because it uses a particular grammar inspired by Leland Wilkinson’s landmark book, The Grammar of Graphics, that focused on thinking about, reasoning with and communicating with graphics. It enables layering of independent components to create custom graphics.
# This tutorial will use the following three packages from the tidyverse
library(ggplot2)
library(readr)
library(dplyr)
6.2 Overview of Data
In this tutorial, we will use the AmesHousing data, which provides information on the sales of individual residential properties in Ames, Iowa from 2006 to 2010. The data set contains 2930 observations, and a large number of explanatory variables involved in assessing home values. A full description of this dataset can be found here.
# Import the csv file into R:
AmesHousing <- read_csv("data/AmesHousing.csv",
                        name_repair = make.names)
6.3 The qplot function
In this section, we will briefly provide examples of how the qplot function can be used to create basic graphs. Run the code below and answer Questions 1)-5).
(Note: Personally, I rarely use the qplot function. While it might be convenient for simple plots, I prefer the structure of the ggplot command, as it gives more consistent control over everything. This also means that you can feel free to just glance quickly through this part.)
# Create a histogram of housing prices
qplot(data=AmesHousing, x=SalePrice, main ="Histogram of Housing Prices in Ames, Iowa")
# Create a scatterplot of above ground living area by sales price
qplot(data=AmesHousing,x=GrLivArea, y=SalePrice)
# Create a scatterplot with log transformed variables, coloring by a third variable
qplot(data=AmesHousing,x=log(GrLivArea),y=log(SalePrice),color=KitchenQual)
# Create distinct scatterplots for each type of kitchen quality and number of fireplaces
qplot(data=AmesHousing,x=GrLivArea,y=SalePrice,facets=KitchenQual~Fireplaces)
# Create a dotplot of sale prices by kitchen quality
qplot(data=AmesHousing,x=KitchenQual,y=SalePrice)
# Create a boxplot of sale prices by kitchen quality
qplot(data=AmesHousing,x=KitchenQual,y=log(SalePrice),geom="boxplot")
6.3.1 Exercises
- In this dataset, how many houses were sold with four fireplaces?
- What is the facets argument used for?
- Based upon the data documentation, what are the five different levels for kitchen quality?
- Do these graphs indicate that the quality of a kitchen could be related to the sale price?
- In the RStudio console, type ?qplot. Modify the above code to create a barchart (geom='bar') to count the number of sales for each level of kitchen quality. What is the difference between the color and fill command?
6.3.2 Using qplot on vectors
Perhaps one of the best uses for the qplot function is that you can provide the x and y variables without having to have them stored in a dataset (i.e., they can just be atomic vectors). This is different from the main ggplot function, which is intended to map the variables within a dataset (data.frame or tibble) to different aspects of the plot.
For example,
x1 <- rnorm(n = 1000, mean = 0, sd = 1) # generates 1000 random values from N(0,1) distribution
qplot(x = x1)
# or with two variables
x2 <- ( x1 + runif(1000, min = -2.5, max = -0.1) )^2
qplot(x = x1, y = x2)
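For comparison, the same two plots can be sketched with ggplot() by first collecting the vectors into a data frame. The names sim_data, p_hist and p_scatter are illustrative only, and set.seed() and bins = 30 are assumptions added here so the example is reproducible and runs without messages about bin width.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the random draws are reproducible
x1 <- rnorm(n = 1000, mean = 0, sd = 1)
x2 <- (x1 + runif(1000, min = -2.5, max = -0.1))^2

# ggplot() maps columns of a data frame, so wrap the vectors first
sim_data <- data.frame(x1 = x1, x2 = x2)

# Histogram of one vector
p_hist <- ggplot(sim_data, aes(x = x1)) +
  geom_histogram(bins = 30)

# Scatterplot of the two vectors
p_scatter <- ggplot(sim_data, aes(x = x1, y = x2)) +
  geom_point()

print(p_hist)
print(p_scatter)
```

The extra step of building a data frame is the price of the more consistent ggplot() interface; for a quick look at a loose vector, qplot() remains shorter.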
6.4 The basic structure of the ggplot function
All ggplot functions must have at least three components:
- data frame: In this activity we will be using the AmesHousing data.
- geom: to determine the type of geometric shape used to display the data, such as line, bar, point, or area.
- aes: to determine how variables in the data are mapped to visual properties (aesthetics) of geoms. This can include x position, y position, color, shape, fill, and size.
Thus the simplest code for a graphic made with ggplot() would have one of the following forms:
ggplot(data, aes(x, y)) + geom_line()
or
ggplot(data) + geom_line(aes(x, y))
Note that the two lines of code above would provide identical results. In the first case, the aes is set as the default for all geoms. In essence, the same x and y variables are used throughout the entire graphic. However, as graphics get more complex, it is often best to create local aes mappings for each geom, as shown in the second line of code.
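As a minimal sketch of the difference between a default (global) mapping and local mappings, the code below builds the same two-layer plot both ways. It uses R's built-in mtcars data as a stand-in for AmesHousing (an assumption made so the code runs without the csv file); the object names are illustrative.

```r
library(ggplot2)

# Global mapping: aes() in ggplot() is inherited by every geom
p_global <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Local mapping: each geom states its own aes()
p_local <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg)) +
  geom_smooth(aes(x = wt, y = mpg), method = "lm", se = FALSE)

# Compare the data computed for the point layer in each version
same_points <- isTRUE(all.equal(layer_data(p_global, 1),
                                layer_data(p_local, 1)))
same_points
```

The global form is shorter when every layer shares the same variables; the local form makes it easier to give one layer an extra mapping without affecting the others.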
# Create a histogram of housing prices
ggplot(data=AmesHousing) + geom_histogram(mapping = aes(SalePrice))
In the above code, the terms data= and mapping= are not required, but are used for clarification. For example, the following code will produce identical results:
ggplot(AmesHousing) + geom_histogram(aes(SalePrice))
# Create a scatterplot of above ground living area by sales price
ggplot(data=AmesHousing) + geom_point(mapping= aes(x=GrLivArea, y=SalePrice))
6.5 Customizing graphics using the ggplot function
In the following code, we layer additional components onto the two graphs shown above.
ggplot(data=AmesHousing) +
geom_histogram(mapping = aes(SalePrice/100000),
breaks=seq(0, 7, by = 1), col="red", fill="lightblue") +
geom_density(mapping = aes(x=SalePrice/100000, y = (..count..))) +
labs(title="Figure 9: Housing Prices in Ames, Iowa (in $100,000)",
x="Sale Price of Individual Homes")
Remarks:
- The histogram geom transforms the SalePrice, modifies the bin size and changes the color.
- geom_density overlays a density curve on top of the histogram.
- Typically density curves and histograms have very different scales; here we use y = (..count..) to put the density curve on the count scale. Alternatively, we could specify aes(x = SalePrice/100000, y = (..density..)) in the histogram geom.
- The labs() command adds a title and an x-axis label. A y-axis label can also be added by using y = " ".
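In ggplot2 version 3.4.0 and later, the ..count.. notation still works but produces a deprecation warning; after_stat(count) is the current spelling (available since version 3.3.0). A minimal sketch of the same overlay, using simulated prices rather than the Ames data (the fake_prices data frame and its distribution are made up for illustration):

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for SalePrice/100000, not the real Ames data
fake_prices <- data.frame(price = rlnorm(500, meanlog = 0.5, sdlog = 0.4))

p_overlay <- ggplot(fake_prices) +
  geom_histogram(aes(x = price), breaks = seq(0, 7, by = 1),
                 col = "red", fill = "lightblue") +
  # after_stat(count) is density times the number of observations,
  # which matches the histogram's count scale when the bin width is 1
  geom_density(aes(x = price, y = after_stat(count))) +
  labs(x = "Simulated price (in $100,000)")

print(p_overlay)
```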
In the code below we create three scatterplots of the log of the above ground living area by the log of sales price:
ggplot(data=AmesHousing, aes(x=log(GrLivArea), y=log(SalePrice)) ) +
geom_point(shape = 3, color = "darkgreen") +
geom_smooth(method=lm, color="green") +
labs(title="Figure 10: Housing Prices in Ames, Iowa")
ggplot(data=AmesHousing) +
geom_point(aes(x=log(GrLivArea), y=log(SalePrice), color=KitchenQual),shape=2, size=2) +
geom_smooth(aes(x=log(GrLivArea), y=log(SalePrice), color=KitchenQual),
method=loess, size=1) +
labs(title="Figure 11: Housing Prices in Ames, Iowa")
ggplot(data=AmesHousing) +
geom_point(mapping = aes(x=log(GrLivArea), y=log(SalePrice), color=KitchenQual)) +
geom_smooth(mapping = aes(x=log(GrLivArea), y=log(SalePrice), color=KitchenQual),
method=lm, se=FALSE, fullrange=TRUE) +
facet_grid(. ~ Fireplaces) +
labs(title="Figure 12: Housing Prices in Ames, Iowa")
Remarks:
- geom_point is used to create a scatterplot. As shown in Figure 10, multiple shapes can be used as points. The Data Visualization Cheat Sheet lists several shape options.
- geom_smooth adds a fitted line through the data.
  - method=lm specifies a linear regression line.
  - method=loess creates a smooth fit curve. It is the default only for groups with fewer than 1,000 observations; for larger groups the default is a generalized additive model (gam).
  - se=FALSE removes the shaded confidence regions around each line.
  - fullrange=TRUE extends all regression lines to the same length.
- facet_grid and facet_wrap commands are used to create multiple plots. Above, we have created separate scatterplots based upon the number of fireplaces.
- When assigning fixed characteristics (such as color, shape or size), the commands occur outside the aes, as in color="green". When characteristics are dependent on the data, the command should occur within the aes, such as in color=KitchenQual.
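The last remark can be checked directly. The sketch below, which uses R's built-in mtcars data as a stand-in for AmesHousing (an assumption so the code runs without the csv file), draws the same scatterplot with a fixed color and with a mapped color, then inspects the colors ggplot2 actually assigned to the points.

```r
library(ggplot2)

# Fixed: color sits outside aes(), so every point gets the same color
p_fixed <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), color = "green")

# Mapped: color sits inside aes(), so it varies with the data
p_mapped <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = factor(cyl)))

# layer_data() returns the values ggplot2 computed for a layer
fixed_colors  <- unique(layer_data(p_fixed)$colour)
mapped_colors <- unique(layer_data(p_mapped)$colour)

fixed_colors           # a single color
length(mapped_colors)  # one color per level of factor(cyl)
```

A common slip is writing aes(color = "green"), which treats "green" as a data value and adds a legend instead of coloring the points green.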
In the above examples, only a few geoms are listed. The ggplot2 website lists each geom and gives detailed examples of how they are used.
6.5.1 Exercises:
- Create a histogram of the above ground living area, GrLivArea.
- Create a scatterplot using Year.Built as the explanatory variable and SalePrice as the response variable. Include a regression line, a title, and labels for the x and y axes.
- Modify the scatterplot in Question 9) so that there is still only one regression line, but the points are colored by the overall condition of the home, Overall.Cond.
6.6 Additional Considerations with R graphics
Influence of data types on graphics: If you use the str command after reading data into R, you will notice that each variable is assigned one of the following types: Character, Numeric (real numbers), Integer, Complex, or Logical (TRUE/FALSE). In particular, the variable Fireplaces is considered an integer. In the code below we try to color and fill a density graph by an integer value. Notice that the color and fill commands appear to be ignored in the graph.
# str(AmesHousing)
ggplot(data=AmesHousing) +
geom_density(aes(SalePrice, color = Fireplaces, fill = Fireplaces))
In the following code, we use the dplyr package to modify the AmesHousing data; we first restrict the dataset to only houses with fewer than three fireplaces and then create a new variable, called Fireplace2. The as.factor command creates a factor, which is a variable that contains a set of numeric codes with character-valued levels. Notice that the color and fill commands now work properly.
# Create a new data frame with only houses with fewer than 3 fireplaces
AmesHousing2 <- filter(AmesHousing, Fireplaces < 3)
# Create a new variable called Fireplace2
AmesHousing2 <- mutate(AmesHousing2, Fireplace2 = as.factor(Fireplaces))
# str(AmesHousing2)
ggplot(data=AmesHousing2) +
geom_density(aes(SalePrice, color = Fireplace2, fill = Fireplace2), alpha = 0.2)
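When the converted variable is only needed for one plot, a possible shortcut is to call factor() directly inside aes() rather than creating a new column. A minimal sketch, using R's built-in mtcars data as a stand-in for AmesHousing2 (an assumption made so the code runs without the csv file):

```r
library(ggplot2)

# cyl is stored as a number, so convert it inside aes() with factor()
p_factor <- ggplot(mtcars) +
  geom_density(aes(x = mpg, color = factor(cyl), fill = factor(cyl)),
               alpha = 0.2)

print(p_factor)

# One density curve is drawn per level of the factor
n_groups <- length(unique(layer_data(p_factor)$group))
n_groups
```

Creating the column with mutate(), as in the tutorial code, is usually preferable when the factor will be reused across several plots, because the legend title and levels are then defined in one place.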
Customizing graphs: In addition to using a data frame, geoms, and aes, several additional components can be added to customize each graph, such as: stats, scales, themes, positions, coordinate systems, labels, and legends. We will not discuss all of these components here, but you will likely come across examples and applications where they are used. In the code below we provide a few examples of how to customize graphs.
ggplot(AmesHousing2, aes(x = Fireplace2, y = SalePrice, color = PavedDrive)) +
geom_boxplot(position = position_dodge(width = 1)) +
coord_flip()+
labs(title="Housing Prices in Ames, Iowa") +
theme(plot.title = element_text( color = "blue", face="bold", size=12, hjust=0))
Remarks:
- position is used to address geoms that would take the same space on a graph. In the above boxplot, position_dodge(width = 1) adds a space between each box. For scatterplots, position = position_jitter() puts spaces between overlapping points.
- theme is used to change the style of a graph, but does not change the data or geoms. The above code is used to modify only the title in a boxplot. A better approach for beginners is to choose among themes that were created to customize the overall graph. Common examples are theme_bw(), theme_classic(), theme_grey(), and theme_minimal(). You can also use/install the ggthemes package for many more options.
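To illustrate the position_jitter() remark, the sketch below plots two discrete-valued variables from R's built-in mtcars data (a stand-in chosen for illustration, not the Ames data), first without and then with jitter. The width, height and seed values are arbitrary choices.

```r
library(ggplot2)

# Without jitter: many cars share the same (cyl, gear) pair,
# so their points are drawn exactly on top of each other
p_overlap <- ggplot(mtcars, aes(x = cyl, y = gear)) +
  geom_point()

# With jitter: a small random offset separates the overlapping points
p_jitter <- ggplot(mtcars, aes(x = cyl, y = gear)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.2, seed = 7))

# Count the distinct plotted positions in each version
n_overlap <- nrow(unique(layer_data(p_overlap)[, c("x", "y")]))
n_jitter  <- nrow(unique(layer_data(p_jitter)[, c("x", "y")]))
c(n_overlap, n_jitter)
```

Jitter changes where points are drawn, not the underlying data, so it is best kept small relative to the spacing between the true values.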
6.6.1 Exercises
- In the density plot above, explain what the color, fill, and alpha commands are used for. Hint: try running the code with and without these commands or use the Data Visualization Cheat Sheet.
- In the boxplot, what is done by the code coord_flip()?
- Create a new boxplot, similar to the one above, but use theme_bw() instead of the given theme command. Explain how the graph changes.
- Use the tab completion feature in RStudio (type theme and hit the Tab key to see various options) to determine what theme is the default for most graphs in ggplot.
6.7 Exercises
- In order to complete this activity, you will need to use the dplyr package to manipulate the dataset before making any graphics.
  - Restrict the AmesHousing data to only sales under normal conditions. In other words, Condition.1 == "Norm".
  - Create a new variable called TotalSqFt = GrLivArea + TotalBsmtSF and remove any homes with more than 3000 total square feet.
  - Create a new variable, where No indicates no fireplaces in the home and Yes indicates at least one fireplace in the home.
- With this modified data file, create a graphic involving no more than three explanatory variables that best portrays how to predict sales price. For example, Figure 12 uses a linear model of kitchen quality, above ground square footage, and number of fireplaces to predict sale price.
6.8 Additional resources
ggplot2 reference page: A well-documented list of ggplot2 components with descriptions
Stackoverflow, an online community to share information.
R Graphics Cookbook, a text by Winston Chang