Data Science for Social Impact

Data Visualization

Leean Jo, Hein H. Aung

What is Data Visualization?

  • Graphical Representation of data to convey information.
  • An Important skill to master in
    • Data Science
    • Any discipline

Why is Good Data Visualization important?

“A Picture is worth a Thousand Words”

  • Easier and Quicker to absorb information.
  • Especially in the age of Big Data, Data Visualization is essential
    • to analyze large amounts of data
    • to make data-driven decisions

What Type of Data do you have?

  • Think about the Data Type you want to visualize
    • Quantitative (Continuous/Discrete)
    • Qualitative (Ordered/Unordered)
    • Time series
    • Text
  • What type of graph would best portray those data?
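
One way to check which data types you have is to ask R for the class of each column. This is a minimal sketch; the survey data frame and its column names are invented here purely for illustration.

```r
# Hypothetical toy data frame, made up for this example only
survey <- data.frame(
  age      = c(23.5, 35.0, 41.2),                    # quantitative, continuous
  visits   = c(1L, 4L, 2L),                          # quantitative, discrete
  rating   = factor(c("low", "high", "mid"),
                    levels = c("low", "mid", "high"),
                    ordered = TRUE),                 # qualitative, ordered
  region   = factor(c("north", "south", "north")),   # qualitative, unordered
  surveyed = as.Date(c("2024-01-05", "2024-02-10", "2024-03-15")), # time
  comment  = c("good", "too slow", "fine"),          # text
  stringsAsFactors = FALSE
)

# Report the first class of every column
col_types <- vapply(survey, function(col) class(col)[1], character(1))
print(col_types)
```

As a rough rule of thumb, numeric columns usually suit histograms or scatter plots, factors suit bar graphs, and dates suit line graphs.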

Think about the Audience

  • You’re telling a story with your data
    • WHY the story matters to your audience is important.
  • Who are you presenting to?
    • Do they have a strong technical background or not?

Colors

  • Colors make it easy to categorize data
    • but not everyone sees color the same way.
    • Use color-blind friendly palettes for accessibility.
  • Color choice affects how the data is read.
    • People unconsciously associate colors with meanings (red = hot, blue = cold).
  • R has some colorblind-friendly packages you can use, such as viridis.
  • Consider the medium the graph will be presented in.
    • Some books are printed in grayscale.
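
To see how a palette might survive grayscale printing, you can estimate how light or dark each color is. This is a rough sketch using base R only. The three hex colors are example inputs, and the luminance weights (0.299, 0.587, 0.114) are one common approximation, not the only one.

```r
# Example colors to check (any hex colors work here)
palette_hex <- c(orange = "#E69F00", sky_blue = "#56B4E9", green = "#009E73")

# Convert each color to its red, green, and blue parts (0 to 255)
rgb_matrix <- col2rgb(palette_hex)

# Approximate perceived lightness of each color
luminance <- 0.299 * rgb_matrix["red", ] +
  0.587 * rgb_matrix["green", ] +
  0.114 * rgb_matrix["blue", ]

# The gray each color would roughly turn into when printed in grayscale
gray_hex <- gray(luminance / 255)

print(round(luminance))
print(gray_hex)
```

Here the orange and the sky blue come out at nearly the same gray, so a reader of a grayscale print could not tell them apart by color alone. Shapes or direct labels help in that case.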

Tips for Improving Data Visualization

Five primary guidelines for data visualization:

  1. Show the data
  2. Reduce the clutter
  3. Integrate the graphics and text
  4. Avoid the spaghetti chart (a graph with too much information)
  5. Start with gray

Introduction to GGplot

What is GGplot?

  • Grammar of Graphics
    • Lets you plot graphs by combining independent components.
    • Your plots build up layer by layer.
  • All plots have two major components:
    • data you want to plot and
    • mapping of your data variables to attributes.
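
The layer-by-layer idea can be sketched with a few lines of code. This example assumes the built-in mtcars data set (not used elsewhere in this presentation) so that it runs on its own.

```r
library(ggplot2)

# Component 1 and 2: the data, plus a mapping of variables to the x and y axes
base_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Add layers one at a time with `+`
with_points <- base_plot + geom_point()
with_trend <- with_points + geom_smooth(method = "lm", formula = y ~ x)

# Each `+ geom_*()` appends one layer to the plot object
layer_counts <- c(
  base = length(base_plot$layers),
  points = length(with_points$layers),
  trend = length(with_trend$layers)
)
print(layer_counts)

with_trend
```

Because every step returns a plot object, you can save an intermediate plot and keep adding to it later.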

Creating your first ggplot

Let’s load in the required packages before we move into the practicum.

The tidyverse package (an R library) includes ggplot2, the package that provides ggplot().

library(tidyverse) # For ggplot and data wrangling tools

In this demo

We are using the Penguins data set from the previous EDA section.

The palmerpenguins package provides this data set.

First, we will visually explore our data set to see our data distribution before we start telling the story to the audience.

We first check which columns are in our data set, how many rows it has, and the class of each column.

library(palmerpenguins)

# glimpse() shows a glimpse of your data set
penguins %>% glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Then let’s create the first ggplot!

Think of ggplot as a canvas.

First you set up a blank canvas before you start painting the picture.

# Load in the penguins data set.
library(palmerpenguins) # For penguin data set

penguins <- palmerpenguins::penguins

ggplot(data = penguins)
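
A small sketch of what the canvas looks like once a mapping is added. With data only, the panel is empty. With data and a mapping, the axes appear, but nothing is drawn until a geometry is added.

```r
library(ggplot2)
library(palmerpenguins)

# Data only: an empty panel
canvas <- ggplot(data = penguins)

# Data + mapping: axes and gridlines appear, but there are still no layers
canvas_with_axes <- ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
)

canvas_with_axes
```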

Aesthetics

Aesthetics describe every aspect of a graphical element.

Think of aesthetics as attributes to the drawings you are painting on your canvas.

Let’s first start from “gray” and show the data that we have

library(janitor) # More data wrangling tools
library(gghighlight) # For highlighting graphs
library(ggrepel) # For plot labeling
library(knitr) # For basic tables
library(plotly) # For interactive graphs and 3D
library(viridis) # Colorblind friendly palette
library(gapminder) # For the gapminder data set

Let’s explore the “island” column

  • ggplot(): blank canvas
  • geom_bar(): geometry, or the type of plot you want to generate
    • Here, it’s a bar graph
  • aes(): aesthetics
    • You can specify the x and y axes
ggplot(data = penguins) +
  geom_bar(mapping = aes(x = island),
           stat = "count")
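
One detail worth a short sketch is the difference between setting a color and mapping a color. The object names below are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Setting: fill sits outside aes(), so every bar gets the same gray
gray_bars <- ggplot(data = penguins) +
  geom_bar(mapping = aes(x = island), fill = "gray60")

# Mapping: fill sits inside aes(), so each island gets its own color and a legend
colored_bars <- ggplot(data = penguins) +
  geom_bar(mapping = aes(x = island, fill = island))

gray_bars
```

Starting from the gray version and adding color only where it carries meaning fits the "start with gray" guideline.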

# select() select column(s)
# distinct() returns distinct values
# nrow() checks the no. of rows
penguins %>%
  select(island) %>%
  distinct() %>%
  nrow()
[1] 3
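
If you also want the numbers behind the bars, count() gives the same counts the bar graph draws. A minimal sketch, where n_penguins is a column name chosen for this example:

```r
library(dplyr)
library(palmerpenguins)

# One row per island, with the number of penguins recorded there
island_counts <- penguins %>%
  count(island, name = "n_penguins")

print(island_counts)
```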

Telling the Story

Now that we have pretty good insight into what our data set looks like, let’s move on to figuring out what information or stories we can tell our audience.

Some questions we can explore together:

  1. What is the correlation between the bill length and depth?
  2. What is the difference between male and female bill length and depth?
  3. How does the island affect the bill length and depth of penguins?

Our next step is to reduce the clutter in our plots in order to start shaping a story to tell to our audience.

Visualizing the bill length and depth of penguins:

  • ggplot(): blank canvas
  • geom_point(): geometry, or the type of plot you want to generate
    • Here, it’s a scatter plot
  • aes(): aesthetics
    • You can specify the x and y axes, and color to categorize by species
  • labs(): change labels
# We can make changes to our data frame temporarily and input into ggplot using pipes (%>%)
penguins %>%
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  labs(x = "Bill Length (mm)",
       y = "Bill Depth (mm)",
       title = "Correlation Between Bill Length and Bill Depth",
       color = "Species")
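
The plot can be backed up with numbers. This sketch computes the Pearson correlation for all penguins together and then within each species. The names overall_cor and species_cor are chosen for this example.

```r
library(dplyr)
library(palmerpenguins)

# Correlation across all penguins, skipping rows with missing measurements
overall_cor <- cor(
  penguins$bill_length_mm,
  penguins$bill_depth_mm,
  use = "complete.obs"
)

# The same correlation, computed separately for each species
species_cor <- penguins %>%
  group_by(species) %>%
  summarise(
    r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs"),
    .groups = "drop"
  )

print(overall_cor)
print(species_cor)
```

In this data set the overall correlation is negative while the correlation within each species is positive. That is why coloring the points by species changes the story the plot tells.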

Visualizing the difference between male and female bill length and depth:

penguins %>% na.omit() %>%
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = sex)) +
  labs(x = "Bill Length (mm)",
       y = "Bill Depth (mm)",
       title = "Correlation Between Bill Length and Bill Depth",
       color = "Species",
       shape = "Sex")
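
A summary table can complement the plot when the point shapes are hard to compare by eye. This is a sketch of one way to do it; the column names mean_length_mm and mean_depth_mm are chosen for this example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Average bill measurements for each species and sex
bill_by_sex <- penguins %>%
  drop_na(sex, bill_length_mm, bill_depth_mm) %>% # keep complete rows only
  group_by(species, sex) %>%
  summarise(
    mean_length_mm = mean(bill_length_mm),
    mean_depth_mm = mean(bill_depth_mm),
    .groups = "drop"
  )

print(bill_by_sex)
```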

Visualizing the penguins’ bill length and depth in different islands:

penguins %>% na.omit() %>%
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = sex)) +
  labs(x = "Bill Length (mm)",
       y = "Bill Depth (mm)",
       title = "Correlation Between Bill Length and Bill Depth",
       color = "Species",
       shape = "Sex") +
  facet_grid(~ island)
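
Before reading the facets, it helps to know which species were recorded on which island. A minimal sketch:

```r
library(dplyr)
library(palmerpenguins)

# One row per island and species combination that appears in the data
species_by_island <- penguins %>%
  count(island, species)

print(species_by_island)
```

Not every species appears on every island, so some panels show fewer colors than others. That is a property of the data, not a plotting mistake.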

Making Graphs more Presentable

To make our plots more presentable, we can add more features such as:

  • theme_bw(): a pre-made theme in ggplot
  • theme(): customize theme elements
  • scale_color_brewer(): choose a color palette
penguins %>%
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  labs(x = "Bill Length (mm)",
       y = "Bill Depth (mm)",
       title = "Correlation Between Bill Length and Bill Depth",
       color = "Species") +
  # A pre-made theme in ggplot
  theme_bw() +
  # theme() can be used to change multiple aspects of the graph
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  # Pick a ColorBrewer palette by index; a palette name such as "Dark2" also works
  scale_color_brewer(palette = 13)
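
To choose a ColorBrewer palette with accessibility in mind, you can look up which palettes are flagged as colorblind friendly. This sketch assumes the RColorBrewer package is installed; safe_qualitative is a name chosen for this example.

```r
library(RColorBrewer)

# brewer.pal.info lists every ColorBrewer palette with its category
# ("seq", "div", or "qual") and a colorblind-friendly flag
safe_qualitative <- subset(
  brewer.pal.info,
  category == "qual" & colorblind
)

print(rownames(safe_qualitative))
```

Qualitative palettes suit unordered categories such as species. Any name printed here can be passed as scale_color_brewer(palette = "...").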

Another cool thing about ggplot is

  • You can also create your own theme and color palette!

You can have a distinct theme and color palette for your organization’s internal use or brand recognition. Let’s create a new theme and color palette.

# Creating a new theme
new_theme <- theme_linedraw() + theme(
  text = element_text(family = "Times New Roman"), # specify font
  plot.background = element_rect(fill = "#F4ECDA"), # put in background color
  legend.box.background = element_rect(colour = "black", linewidth = 1), # configure legend appearance (use size = 1 on ggplot2 versions before 3.4.0)
  legend.title = element_text(face = "bold"), # configure legend appearance
  legend.text = element_text(size = 8),
  axis.title.x = element_text(face = "bold"), # configure x axis label appearance
  axis.title.y = element_text(face = "bold"), # configure y axis label appearance
  plot.title = element_text(size = 15, face = "bold"),
  plot.subtitle = element_text(face = "italic", size = 13, color = "#433F3F"),
  plot.caption = element_text( # configure for caption of plot if there is one added
    hjust = 0,
    size = 8,
    color = "#433F3F"
  )
)
# Creating a color palette (these hex codes match the colorblind-friendly Okabe-Ito colors)
discrete_palette <-
  c("#E69F00",
    "#56B4E9",
    "#009E73",
    "#F0E442",
    "#0072B2",
    "#D55E00",
    "#CC79A7")

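A theme and a palette only matter once they are attached to a plot. The sketch below shows one way to do that. So that it runs on its own, it uses short stand-ins (demo_theme and demo_palette, both hypothetical names); with the objects defined above you would use new_theme and discrete_palette instead.

```r
library(ggplot2)
library(palmerpenguins)

# Stand-ins for new_theme and discrete_palette, kept short for this example
demo_theme <- theme_linedraw() +
  theme(plot.title = element_text(size = 15, face = "bold"))
demo_palette <- c("#E69F00", "#56B4E9", "#009E73")

themed_plot <- ggplot(data = penguins) +
  geom_point(
    aes(x = bill_length_mm, y = bill_depth_mm, color = species),
    na.rm = TRUE # silently skip rows with missing measurements
  ) +
  labs(x = "Bill Length (mm)", y = "Bill Depth (mm)", color = "Species") +
  scale_color_manual(values = demo_palette) + # use the custom colors
  demo_theme # add the custom theme like any other layer

themed_plot
```

For a fill aesthetic, scale_fill_manual(values = ...) plays the same role as scale_color_manual().
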
More Cool Things with ggplot and R

Showing Distribution

Bar Graphs and Histograms

Bar graphs and histograms show the distribution of data. Here we use a new dataset called mpg.

  • mpg: this dataset contains a subset of the fuel economy data.
  • class: “type” of car
library(gghighlight)

# This dataset contains a subset of the fuel economy data.
# class: "type" of car
# n: count the number of cars in each class type

glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Here we use geom_col() to plot the bar graph

  • mpg %>% count(class) %>% ggplot(aes(x = class, y = n)) + geom_col(aes(fill = class))
  • n: the number of cars in each class type, computed by count()
  • geom_col(): draws bars whose heights are the y values (n); geom_histogram(stat = "identity") behaves the same way
mpg %>%
  count(class) %>% # count the number of cars in each class type
  arrange(n) %>% # arrange by count in ascending order
  mutate(class = factor(class, levels = class)) %>% # make "class" a factor so the bars keep this order
  ggplot(aes(x = class, y = n)) +
  geom_col(aes(fill = class)) + # bar height = n
  scale_fill_viridis_d() +
  gghighlight(n > 60, unhighlighted_params = list(fill = "lightblue2")) + # highlight the largest class
  coord_flip() + # swap the x and y axes
  labs(x = "Class", y = "Count", title = "Data Distribution by Car Class Type") +
  new_theme
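
The bar graph above counts a categorical variable. For a continuous variable, a histogram groups the values into bins instead. This is a sketch using the hwy column of mpg, with a bin width of 2 chosen arbitrarily for this example.

```r
library(ggplot2)

# Histogram of highway fuel economy; each bar covers a range of 2 mpg
hwy_histogram <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "gray60", color = "white") +
  labs(
    x = "Highway mpg",
    y = "Count",
    title = "Distribution of Highway Fuel Economy"
  )

# layer_data() returns the values ggplot computed for the bars
bin_counts <- layer_data(hwy_histogram)$count

hwy_histogram
```

Try a few bin widths before settling on one, since the shape of a histogram can change noticeably with the bin size.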

Pie Charts

Pie charts describe parts of a whole, but are best avoided because

  • they don’t tell much about each part (since their purpose is to show “wholeness”)
  • human visual perception of angles is weak
    • we cannot tell apart 15% vs. 14%, or 18% vs. 20%
    • we can only really identify quarters: 25%, 50%, 75%, 100%.
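
A common alternative is a bar graph of shares, where lengths are compared instead of angles. This is a sketch, assuming the mpg data set and its drv column; the names drive_share and share_bars are chosen for this example.

```r
library(dplyr)
library(ggplot2)

# Share of cars for each drive train type
drive_share <- mpg %>%
  count(drv) %>%
  mutate(share = n / sum(n))

# Horizontal bars, ordered by share, with percent labels on the axis
share_bars <- ggplot(drive_share, aes(x = reorder(drv, share), y = share)) +
  geom_col(fill = "gray60") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(x = "Drive train", y = "Share of cars")

print(drive_share)
share_bars
```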

Box Plots, Violin Plots, and Showing Data

You can show even more detail about the data distribution by using box plots, violin plots, and data points. Here we do that using geom_boxplot(), geom_violin(), and geom_jitter().

mpg %>%
  mutate(drv = as.factor(drv)) %>% # change to factor in order to plot as categorical data.
  ggplot(aes(x = drv, y = hwy)) +
  geom_violin() + # plot violin plot
  geom_boxplot(aes(fill = drv)) + # plot box plot and fill by drive train type.
  geom_jitter(size = 0.8, alpha = 0.7) + # show data point and change transparency and size
  labs(
    x = "Drive train",
    y = "Highway mpg",
    title = "Fuel Economy based on Car Drive Train",
    subtitle = "4 = 4 wheel drive, f = front wheel drive, r = rear wheel drive"
  ) +
  new_theme
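
The box in a box plot marks the first quartile, the median, and the third quartile. To read those values directly, you can compute them per group. A minimal sketch, where hwy_summary and its column names are chosen for this example:

```r
library(dplyr)

# Quartiles of highway mpg for each drive train type
hwy_summary <- ggplot2::mpg %>%
  group_by(drv) %>%
  summarise(
    n_cars = n(),
    q1 = quantile(hwy, 0.25, names = FALSE),
    median = median(hwy),
    q3 = quantile(hwy, 0.75, names = FALSE),
    .groups = "drop"
  )

print(hwy_summary)
```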

Conclusion

Key Takeaways from this presentation:

  1. Think about the Audience
    • What’s the story you’re trying to tell?
  2. Think about your Data Type/ Type of Visualization
    • What kind of graph suits your Audience?
    • What kind of graph fits your Data best?
  3. Follow the 5 guidelines for data visualization.
  4. Be mindful of visual perception and color choices.

Glossary

  • GGplot (ggplot2): an R plotting package based on the grammar of graphics
  • Data Visualization: graphical representation of data to convey information
  • Dataset: a collection of data used for a visualization
  • Variable: any measure in a dataset

A more complete glossary of Data Visualization can be found here

Further Resources