Data Science for Social Impact

Data Cleaning

Deepa Bhuvanagiri

Introduction

  • Data cleaning is an crucial step in the data science pipeline
  • Sometimes deemed as the most time consuming step
  • Ensures data is correct and ready for proper analysis
  • A common saying used is “garbage in equals garbage out”
    • Idea that models built on poor data yields poor/inaccurate results

Purpose of Demo

In this Demo

  • We will look into the basics of data cleaning and common conventions used in R

A common package used for cleaning and manipulating data in R is tidyverse

  • Which comes with a set of functions that we will look closer into

Functions and Packages

  • A function is a way to group a set of code to run when called
  • An analogy to help think about how a function works
    • Could be to think about a “donut factory”
    • Where you input ingredients and the output are the donuts

Figure 1
  • We can take this analogy of the donut factory and apply it to coding
  • Instead let’s say we want to calculate the mean
    • We can define a function \(mean()\), where its inputs are the numbers we want the average of
    • And the output would be the average

Figure 2
  • We will be able to see functions in action later in this demo.

  • A package is a group of functionsor packages that we can import to use in our code

  • For example, we could have functions \(mean()\), \(median()\), \(mode()\) in a package called “stat_summaries”

  • We can then import stat_summaries to start using these functions

Packages Used in Demo

Tidyverse

  • R has a collection of packages/tools that make data wrangling and cleaning easier
  • A popular package is Tidyverse

Functions to aid in data manipulation:

  • filter() to select cases based on their values
  • arrange() to reorder the cases
  • select() and rename() to select variables based on their names
  • mutate() and transmute() to add new variables that are functions of existing variables
  • summarise() to condense multiple values to a single value
  • groupby() allows you to group within columns based on common attribute

Pipe Operator

  • Allows you to run one function after another
  • df %>% filter() %>% arrange()
  • Put on your socks, then your shoes, then tie your laces

Data Structures: DataFrames

  • Dataframes are a way for us to organize tabular data
  • Very commonly used in Data Science
  • Stored in a file type called a csv: i.e. filename.csv
  • Has columns and rows Figure 2

Demo

Import Packages

#use this code to install packages onto your computer
#if you have not already done so
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("lubridate")

library(tidyverse)
library(dplyr)
library(lubridate)
library(kableExtra)

Import Dataframe

Contains information on UFO sightings around the world and obtained via Kaggle

Note: this data set had been additionally cleaned before being used in this demo

ufo_df<-read.csv("ufos.csv")
UFOs
city state country shape duration_seconds date.posted latitude longitude
san marcos tx us cylinder 2700 4/27/2004 29.88306 -97.941111
lackland afb tx light 7200 12/16/2005 29.38421 -98.581082
chester (uk/england) gb circle 20 1/21/2008 53.20000 -2.916667
edna tx us circle 20 1/17/2004 28.97833 -96.645833
kaneohe hi us light 900 1/22/2004 21.41806 -157.803611
bristol tn us sphere 300 4/27/2007 36.59500 -82.188889
penarth (uk/wales) gb circle 180 2/14/2006 51.43472 -3.180000
norwalk ct us disk 1200 10/2/1999 41.11750 -73.408333
pell city al us disk 180 3/19/2009 33.58611 -86.286111
live oak fl us disk 120 5/11/2005 30.29472 -82.984167

Summary of Dataframe

We can use the functions below to get an idea of the structure of our data

  • head() let’s us see the first 5 rows of a dataframe
head(ufo_df, n = 5)
UFOs
city state country shape duration_seconds date.posted latitude longitude
san marcos tx us cylinder 2700 4/27/2004 29.88306 -97.941111
lackland afb tx light 7200 12/16/2005 29.38421 -98.581082
chester (uk/england) gb circle 20 1/21/2008 53.20000 -2.916667
edna tx us circle 20 1/17/2004 28.97833 -96.645833
kaneohe hi us light 900 1/22/2004 21.41806 -157.803611
  • tail() let’s us see the last 5 rows of a dataframe
tail(ufo_df, n = 5)
UFOs
city state country shape duration_seconds date.posted latitude longitude
88866 napa ca us other 1200 9/30/2013 38.29722 -122.28444
88867 vienna va us circle 5 9/30/2013 38.90111 -77.26556
88868 edmond ok us cigar 1020 9/30/2013 35.65278 -97.47778
88869 starr sc us diamond 0 9/30/2013 34.37694 -82.69583
88870 ft. lauderdale fl us oval 0 9/30/2013 26.12194 -80.14361
  • dim() shows us how many rows and columns in a dataframe
dim(ufo_df) # dimensions of data (rows, columns)
[1] 88870     8
  • summary() prints summary statistics of the data
summary(ufo_df) # summary stats of each column
UFOs
city state country shape duration_seconds date.posted latitude longitude
Length:88870 Length:88870 Length:88870 Length:88870 Min. : 0 Length:88870 Min. :-82.86 Min. :-176.66
Class :character Class :character Class :character Class :character 1st Qu.: 15 Class :character 1st Qu.: 34.03 1st Qu.:-112.05
Mode :character Mode :character Mode :character Mode :character Median : 120 Mode :character Median : 39.23 Median : -87.65
NA NA NA NA Mean : 8373 NA Mean : 37.45 Mean : -84.83
NA NA NA NA 3rd Qu.: 600 NA 3rd Qu.: 42.72 3rd Qu.: -77.62
NA NA NA NA Max. :97836000 NA Max. : 72.70 Max. : 178.44
NA NA NA NA NA NA NA's :197 NA

Data Cleaning + Intro Piping

  • Example of how to rename a column in a dataframe using rename()
    • Renamed date.posted to date_posted
colnames(ufo_df) #colnames() lets us print the names of the columns in a dataframe
[1] "city"             "state"            "country"          "shape"           
[5] "duration_seconds" "date.posted"      "latitude"         "longitude"       
#renames column to match convention
ufo_df_renamed<-rename(ufo_df,date_posted = date.posted)

colnames(ufo_df_renamed)
[1] "city"             "state"            "country"          "shape"           
[5] "duration_seconds" "date_posted"      "latitude"         "longitude"       
  • Now let’s try using piping instead
  • Notice how with piping
    • The previous variable (ufo_df) is assumed to be the first input in the next function (rename())
ufo_df%>%
  rename(date_posted = date.posted) ->temp1
colnames(temp1)
[1] "city"             "state"            "country"          "shape"           
[5] "duration_seconds" "date_posted"      "latitude"         "longitude"       
  • This use of piping works for lots of different functions
    • Here the mutate function allows us to tell the computer we are dealing with a date variable
    • remember how we can have different types of variables (e.g. int, character, etc.)
str(temp1) #this function lets us see the different variable types present
'data.frame':   88870 obs. of  8 variables:
 $ city            : chr  "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
 $ state           : chr  "tx" "tx" "" "tx" ...
 $ country         : chr  "us" "" "gb" "us" ...
 $ shape           : chr  "cylinder" "light" "circle" "circle" ...
 $ duration_seconds: int  2700 7200 20 20 900 300 180 1200 180 120 ...
 $ date_posted     : chr  "4/27/2004" "12/16/2005" "1/21/2008" "1/17/2004" ...
 $ latitude        : num  29.9 29.4 53.2 29 21.4 ...
 $ longitude       : num  -97.94 -98.58 -2.92 -96.65 -157.8 ...
temp1%>%
  mutate(date_posted = as.POSIXct(date_posted,format = "%m/%d/%Y")) ->temp2
str(temp2)
'data.frame':   88870 obs. of  8 variables:
 $ city            : chr  "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
 $ state           : chr  "tx" "tx" "" "tx" ...
 $ country         : chr  "us" "" "gb" "us" ...
 $ shape           : chr  "cylinder" "light" "circle" "circle" ...
 $ duration_seconds: int  2700 7200 20 20 900 300 180 1200 180 120 ...
 $ date_posted     : POSIXct, format: "2004-04-27" "2005-12-16" ...
 $ latitude        : num  29.9 29.4 53.2 29 21.4 ...
 $ longitude       : num  -97.94 -98.58 -2.92 -96.65 -157.8 ...

Here is the cool part!

  • we can now use piping to do multiple cleaning steps at once
  • Below we can rename a column and change a variable type
  • Then output it to a new variable
#combines cleaning into one chunk
#put on your socks, then your shoes, then tie your laces
# put on your socks %>% then your shoes %>% then tie your laces

ufo_df%>%
  rename(date_posted = date.posted)%>%
  mutate(date_posted = as.POSIXct(date_posted,format = "%m/%d/%Y")) ->ufo_df_cleaned

str(ufo_df_cleaned)
'data.frame':   88870 obs. of  8 variables:
 $ city            : chr  "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
 $ state           : chr  "tx" "tx" "" "tx" ...
 $ country         : chr  "us" "" "gb" "us" ...
 $ shape           : chr  "cylinder" "light" "circle" "circle" ...
 $ duration_seconds: int  2700 7200 20 20 900 300 180 1200 180 120 ...
 $ date_posted     : POSIXct, format: "2004-04-27" "2005-12-16" ...
 $ latitude        : num  29.9 29.4 53.2 29 21.4 ...
 $ longitude       : num  -97.94 -98.58 -2.92 -96.65 -157.8 ...

Filter

The filter function let’s us subset the data based on specific criteria

  • Below we can select all the UFOs that are shaped like cylinders
  • Notice our use of piping here
ufo_df_cleaned%>%
  filter(shape == "cylinder")
UFOs
city state country shape duration_seconds date_posted latitude longitude
san marcos tx us cylinder 2700 2004-04-27 29.88306 -97.94111
monroe county oh us cylinder 60 2004-07-08 39.44028 -84.36222
elgin il us cylinder 1800 1998-11-01 42.03722 -88.28111
victoria (canada) bc ca cylinder 30 2000-12-02 46.21667 -63.48333
sterling il us cylinder 600 2010-11-21 41.78861 -89.69611
suffern ny us cylinder 20 2012-10-30 41.11472 -74.15000
  • Let’s see how we can filter numerical values
    • Here we are selecting all the UFOs that were in the air for more than 1,000 seconds
#finds ufos that were seen for more than 1000 seconds
ufo_df_cleaned%>%
  filter(duration_seconds>1000)
UFOs
city state country shape duration_seconds date_posted latitude longitude
san marcos tx us cylinder 2700 2004-04-27 29.88306 -97.94111
lackland afb tx light 7200 2005-12-16 29.38421 -98.58108
norwalk ct us disk 1200 1999-10-02 41.11750 -73.40833
bellmore ny us disk 1800 2000-05-11 40.66861 -73.52750
harlan county ky us circle 1200 2005-09-15 36.84306 -83.32194
niantic ct us disk 1800 2003-09-24 41.32528 -72.19361
  • We can also filter on an interval
    • Here we can select all UFOs that were in the air for more than 1,000 seconds but less than 2,000 seconds
#finds ufos that were seen for more than 1000 seconds but less than 2000 seconds
ufo_df_cleaned%>%
  filter(duration_seconds>1000 & duration_seconds<2000)
UFOs
city state country shape duration_seconds date_posted latitude longitude
norwalk ct us disk 1200 1999-10-02 41.11750 -73.40833
bellmore ny us disk 1800 2000-05-11 40.66861 -73.52750
harlan county ky us circle 1200 2005-09-15 36.84306 -83.32194
niantic ct us disk 1800 2003-09-24 41.32528 -72.19361
cardiff (uk/wales) gb disk 1200 2007-02-01 51.50000 -3.20000
hudson ks us light 1200 2004-07-25 38.10556 -98.65972
  • We can also filter by date
    • This is where it comes in handy to have this column in POSIXct format (i.e. date variable)
  • Here we can select all the UFO that were sighted in 2004
  • The year() allows us to filter by year
  • If we wanted to filter by month or day we would use month() and day()
#find all ufos that were spotted in 2004
ufo_df_cleaned%>%
  filter(year(date_posted) == 2004) #used with lubridate package
UFOs
city state country shape duration_seconds date_posted latitude longitude
san marcos tx us cylinder 2700 2004-04-27 29.88306 -97.94111
edna tx us circle 20 2004-01-17 28.97833 -96.64583
kaneohe hi us light 900 2004-01-22 21.41806 -157.80361
hudson ks us light 1200 2004-07-25 38.10556 -98.65972
louisville ky us light 30 2004-03-17 38.25417 -85.75944
seattle (ballard area) wa us unknown 2 2004-04-27 47.60639 -122.33083

Summarise

sumarise() let’s us take the data and group/summarises the data in a specific way and outputs a dataframe

  • Let’s look at an example:
    • Here are able to calculate on average how long a UFO was spotted in the sky
    • We create a new column called mean_time and use the mean() to calculate the mean
    • The output is a data frame
ufo_df_cleaned%>%
  summarise(mean_time = mean(duration_seconds))
UFOs
mean_time
8373.412
  • We can also create multiple columns using summarise()
  • More specifically, we can create a new dataframe with columns summarising the data
  • Below we create a dataframe that contains the average, minimum and maximum times a UFO was sighted in the sky
#finds the average time a ufo was seen in the sky
ufo_df_cleaned%>%
  summarise(mean_time = mean(duration_seconds),
            min_time = min(duration_seconds),
            max_time = max(duration_seconds))
UFOs
mean_time min_time max_time
8373.412 0 97836000
  • We can also combine filter() and summarise() using pipes
  • Below we are able to see the average time a UFO with a cylinder shape was seen in the sky
#combine with filtering
#Finds the average (mean) speed of a ufo with a cylinder shape
ufo_df_cleaned%>%
  filter(shape == "cylinder")%>%
  summarise(mean_time = mean(duration_seconds))
UFOs
mean_time
3184.08

Add columns

  • We can also add columns to our dataframe using mutate()
  • Currently, we have information on how long a UFO was seen in the sky in seconds
    • This isn’t too useful when some of the entries are 1,000 seconds
  • To better interpret this, we can create a new column where the duration is in minutes
#adds a column that converts the duration from seconds to minutes
ufo_df_cleaned%>%
  mutate(durations_mins = round(duration_seconds/60, 2))
UFOs
city state country shape duration_seconds date_posted latitude longitude durations_mins
san marcos tx us cylinder 2700 2004-04-27 29.88306 -97.941111 45.00
lackland afb tx light 7200 2005-12-16 29.38421 -98.581082 120.00
chester (uk/england) gb circle 20 2008-01-21 53.20000 -2.916667 0.33
edna tx us circle 20 2004-01-17 28.97833 -96.645833 0.33
kaneohe hi us light 900 2004-01-22 21.41806 -157.803611 15.00
bristol tn us sphere 300 2007-04-27 36.59500 -82.188889 5.00

There is also a built in function for this that was outlined here:(convert from seconds to hours and minutes)

Select

select() allows us to choose specific columns that are of interest to us

  • It is a way to reduce number of columns
  • Let’s say we only care about the state and duration in the sky of a UFO
ufo_df_cleaned%>%
  mutate(durations_mins = round(duration_seconds/60, 2))%>%
  select(state, duration_seconds, durations_mins)
UFOs
state duration_seconds durations_mins
tx 2700 45.00
tx 7200 120.00
20 0.33
tx 20 0.33
hi 900 15.00
tn 300 5.00

What if we want all but the “city” column

  • Placing a “-” before the name of the column will indicate to select everything but that column
ufo_df_cleaned%>%
  select(-city)
UFOs
state country shape duration_seconds date_posted latitude longitude
tx us cylinder 2700 2004-04-27 29.88306 -97.941111
tx light 7200 2005-12-16 29.38421 -98.581082
gb circle 20 2008-01-21 53.20000 -2.916667
tx us circle 20 2004-01-17 28.97833 -96.645833
hi us light 900 2004-01-22 21.41806 -157.803611
tn us sphere 300 2007-04-27 36.59500 -82.188889

Groupby

group_by() can be used with summarise() to get some really cool results

  • Let’s say we want to know if the shape of a UFO affects how long it was seen in the sky
  • We might want to calculate the average time each type of UFO (e.g. a cylinder UFO vs. circle UFO vs…) was in the sky and compare
  • To do this, we can group by UFO shape and calculate the mean for each
#Calculates which shape was witnessed for the longest time in sky
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(mean_time = mean(duration_seconds))
UFOs
shape mean_time
31270.1983
changed 3600.0000
changing 1933.5794
chevron 488.6256
cigar 1805.3039
circle 4307.6143
  • To help view the data more clearly
  • We can also arrange the duration_seconds column from least to greatest using arrange()
  • To get an idea of which UFO shapes were in the sky for the shortest time
#orders shapes from least amount of time in sky seen to most
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(mean_time = mean(duration_seconds))%>%
  arrange(mean_time) 
UFOs
shape mean_time
dome 2.0000
flare 30.0000
pyramid 120.0000
hexagon 240.0000
round 452.5000
chevron 488.6256
  • We can also see how to arrange the duration_seconds column from greatest to least
  • desc() stand for descending, so when used with arrange()
    • Will order entries in descending order
#orders shapes from seen in the sky the longest to shortest
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(mean_time = mean(duration_seconds))%>%
  arrange(desc(mean_time))
UFOs
shape mean_time
cone 71342.83
31270.20
sphere 20410.62
crescent 18905.00
other 18669.29
light 12256.44

Summarise + n()

n() allows us to calculate the total number of entries in a group

  • If we wanted to see how many UFOs there were for each shape we can use:
    • group_by() to group by each UFO shape
    • summarise() to create a new dataframe with the summary stat (in this case n())
    • n() to count the number of entries per group (i.e. UFO shape)
#how many of each ufo shape were spotted
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(total = n())%>%
  arrange(desc(total))
UFOs
shape total
light 17872
triangle 8489
circle 8452
fireball 6562
unknown 6319
other 6247
  • We can also add head() to the code to print the top 10 most UFO shapes spotted
#top 5 shapes that were spotted
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(total = n())%>%
  arrange(desc(total))%>%
  head(10)
UFOs
shape total
light 17872
triangle 8489
circle 8452
fireball 6562
unknown 6319
other 6247
disk 6005
sphere 5755
oval 4119
3116

Bonus Demo

Plotting with ggplot

#can plot the top 5 shapes by inputting dataset straight to ggplot
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(total = n())%>%
  arrange(desc(total))%>%
  head()%>% # up to here is same code as in previous slide
  ggplot(aes(shape,total)) + 
  geom_col()

Map of UFO sightings

#make the dots smaller
#maybe leaflet and mapview
#shows that theres multiple ways to code something
library(mapview)
#filters out entries where there is not latitude or longitdue entries
# Note this may take a second to run

ufo_df_cleaned %>%
  head(100)%>%
  filter(!is.na(longitude) & !is.na(latitude)) %>%
  mapview(
    xcol = "longitude",
    ycol = "latitude",
    crs = 4269,
    grid = FALSE
  )

Glossary

  • Function: A way to group a set of code to run when called
  • Pipe Operator: Allows you to run one function after another
  • Package: A group of functions or packages that we can import to use in our code

References

  • A Beginner’s Guide to Tidyverse – The Most Powerful Collection of R Packages for Data Science. (2019, May 13). Analytics Vidhya. Retrieved July 27, 2023, from https://www.analyticsvidhya.com/ blog/2019/05/beginner-guide-tidyverse-most-powerful-collection-r-packages-data-science/