Data Visualization
“A picture is worth a thousand words”
Five primary guidelines for data visualization
What is ggplot?
Let’s load the required packages before we move into the practicum.
The tidyverse package (an R library) is required because it includes ggplot2.
We are using the penguins data set from the previous EDA section.
The palmerpenguins package is required because it provides the penguins data set.
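A minimal sketch of the loading step, assuming both packages are already installed (for example with install.packages()):

```r
# Load the packages used in this section.
library(tidyverse)       # attaches ggplot2, dplyr, tidyr, and related packages
library(palmerpenguins)  # provides the penguins data set

# Quick check that the data set is available.
head(penguins)
```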
First, we will visually explore our data set to see how the data are distributed before we start telling a story to the audience.
We first check which columns are in our data set, how many rows it has, and the class of each column.
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
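The listing above has the shape of output from dplyr's glimpse(). A sketch of the call that would produce it, assuming that is the function used here:

```r
library(dplyr)
library(palmerpenguins)

# Print one line per column: name, type, and the first few values.
glimpse(penguins)

# The NA entries in the listing are missing values; count them per column.
penguins %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```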
Then let’s create the first ggplot!
Think of ggplot as a canvas.
First you set up a blank canvas before you start painting the picture.
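As a small illustration of the canvas idea, calling ggplot() with only a data set draws an empty panel, because no layers have been added yet. The variable name canvas is just for this example:

```r
library(ggplot2)
library(palmerpenguins)

# A plot object with data attached but no layers: the blank canvas.
canvas <- ggplot(data = penguins)

canvas  # renders an empty panel
```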
Aesthetics describe every aspect of a graphical element.
Think of aesthetics as attributes to the drawings you are painting on your canvas.
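A short sketch of the difference between mapping an aesthetic to a column and setting it to a fixed value. The choice of flipper length and body mass is only for illustration:

```r
library(ggplot2)
library(palmerpenguins)

# Aesthetics inside aes() are mapped to columns and vary with the data.
# Values outside aes() are fixed and apply to every point.
p <- ggplot(penguins) +
  geom_point(
    aes(x = flipper_length_mm, y = body_mass_g, color = species),
    size = 2,
    alpha = 0.7
  )

p
```

Here color changes with species, while size and alpha are the same for all points.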
library(janitor) # More data wrangling tools
library(gghighlight) # For highlighting graphs
library(ggrepel) # For plot labeling
library(knitr) # For basic tables
library(plotly) # For interactive graphs and 3D
library(viridis) # Colorblind friendly palette
library(gapminder) # For the gapminder data set
Now that we have good insight into what our data set looks like, let’s move on to figuring out what information or stories we can tell our audience.
Some questions we can explore together:
Our next step is to reduce the clutter in our plots in order to start shaping a story to tell to our audience.
# Pipes (%>%) let us modify the data frame temporarily and pass the result to ggplot
penguins %>%
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  labs(x = "Bill Length (mm)",
       y = "Bill Depth (mm)",
       title = "Correlation Between Bill Length and Bill Depth",
       color = "Species")
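As the comment in the code above notes, piping lets you adjust the data for one plot without changing penguins itself. A minimal sketch, assuming you want to drop rows with missing bill measurements before plotting; drop_na() comes from tidyr:

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(palmerpenguins)

p <- penguins %>%
  drop_na(bill_length_mm, bill_depth_mm) %>%  # temporary: only affects this pipeline
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))

p

# The original data frame is unchanged and still contains the NA rows.
anyNA(penguins$bill_length_mm)
```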
To make our plots more presentable we can add more features such as
penguins %>%
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  labs(x = "Bill Length (mm)",
       y = "Bill Depth (mm)",
       title = "Correlation Between Bill Length and Bill Depth",
       color = "Species") +
  # A pre-made theme in ggplot
  theme_bw() +
  # theme() can be used to change multiple aspects of the graph
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) + # center and bold the title
  # Pick a ColorBrewer palette by its index
  scale_color_brewer(palette = 13)
You can have a distinct theme and color palette for your organization’s internal use or brand recognition. Let’s create a new theme and color palette.
# creating new theme
new_theme <- theme_linedraw() + theme(
  text = element_text(family = "Times New Roman"), # specify font
  plot.background = element_rect(fill = "#F4ECDA"), # put in background color
  # configure legend appearance (linewidth replaces size in ggplot2 >= 3.4.0)
  legend.box.background = element_rect(colour = "black", linewidth = 1),
  legend.title = element_text(face = "bold"), # configure legend appearance
  legend.text = element_text(size = 8),
  axis.title.x = element_text(face = "bold"), # configure x axis label appearance
  axis.title.y = element_text(face = "bold"), # configure y axis label appearance
  plot.title = element_text(size = 15, face = "bold"),
  plot.subtitle = element_text(face = "italic", size = 13, color = "#433F3F"),
  plot.caption = element_text( # configure the caption of the plot if one is added
    hjust = 0,
    size = 8,
    color = "#433F3F"
  )
)
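A theme object built this way is applied by adding it to a plot with +. The sketch below uses a smaller stand-in theme (demo_theme, a hypothetical name) so that it runs on its own and does not depend on a particular font being installed:

```r
library(ggplot2)
library(palmerpenguins)

# Hypothetical stand-in for a house theme: start from a built-in theme,
# then override a few elements.
demo_theme <- theme_linedraw() +
  theme(
    plot.background = element_rect(fill = "#F4ECDA"),
    plot.title = element_text(size = 15, face = "bold")
  )

p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  labs(title = "Bill Length vs. Bill Depth") +
  demo_theme  # added like any other plot component

p
```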
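# creating color palette
# (these seven colors match the colorblind-friendly Okabe-Ito set)
# apply it to a plot with scale_color_manual(values = discrete_palette)
discrete_palette <-
  c("#E69F00",
    "#56B4E9",
    "#009E73",
    "#F0E442",
    "#0072B2",
    "#D55E00",
    "#CC79A7")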
You can use bar graphs and histograms to show the distribution of data. Here we use a new data set called mpg.
library(gghighlight)
# mpg ships with ggplot2 and contains a subset of the fuel economy data.
# class: "type" of car
glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Here we use geom_col() to plot the bar graph. (geom_histogram() with stat = "identity" draws the same bars, but geom_col() is the geom intended for precomputed counts.)
# This dataset contains a subset of the fuel economy data.
# class: "type" of car
# n: count the number of cars in each class type
mpg %>%
  count(class) %>% # count the number of cars in each class
  arrange(n) %>% # sort by count, ascending
  mutate(class = factor(class, levels = class)) %>% # fix the factor order so bars follow the sorted counts
  ggplot(aes(x = class, y = n)) +
  geom_col(aes(fill = class)) +
  scale_fill_viridis_d() +
  gghighlight(n > 60, unhighlighted_params = list(fill = "lightblue2")) + # highlight the largest class
  coord_flip() + # swap the x and y axes
  labs(x = "Class", y = "Count", title = "Data Distribution by Car Class Type") +
  new_theme
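The chart above plots one bar per category from precomputed counts. For a continuous variable, a histogram bins the values instead. A minimal sketch using hwy from mpg; the bin width of 2 mpg and the fill color are arbitrary choices for illustration:

```r
library(ggplot2)  # also provides the mpg data set

p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "#56B4E9", color = "white") +
  labs(
    x = "Highway mpg",
    y = "Count",
    title = "Distribution of Highway Fuel Economy"
  )

p
```

Trying a few different bin widths is usually worthwhile, since the apparent shape of the distribution can change with the binning.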
Pie charts describe parts of a whole, but are best avoided because
You can show even more detail about the distribution of the data by combining box plots, violin plots, and the individual data points. Here we do that using geom_boxplot(), geom_violin(), and geom_jitter().
mpg %>%
  mutate(drv = as.factor(drv)) %>% # change to factor in order to plot as categorical data
  ggplot(aes(x = drv, y = hwy)) +
  geom_violin() + # plot violin plot
  geom_boxplot(aes(fill = drv)) + # plot box plot and fill by drive train type
  geom_jitter(size = 0.8, alpha = 0.7) + # show data points; adjust size and transparency
  labs(
    x = "Drive train",
    y = "Highway mpg",
    title = "Fuel Economy based on Car Drive Train",
    subtitle = "4 = 4 wheel drive, f = front wheel drive, r = rear wheel drive"
  ) +
  new_theme
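The box plot summarizes each group by its median and quartiles. As a rough cross-check, the same kind of summary can be computed directly with dplyr; the table name hwy_summary is just for this sketch, and the values should correspond closely to the boxes in the plot:

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

hwy_summary <- mpg %>%
  group_by(drv) %>%
  summarise(
    n = n(),                                    # cars per drive train
    q1 = quantile(hwy, 0.25, names = FALSE),    # lower quartile
    median = median(hwy),
    q3 = quantile(hwy, 0.75, names = FALSE)     # upper quartile
  )

hwy_summary
```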
Key Takeaways from this presentation:
GGPlot
: grammar of graphics
Data Visualization
: graphical representation of data to convey information
Dataset
: a collection of data used for a visualization
Variable
: any measure in a dataset
A more complete glossary of Data Visualization can be found here