Data Science for Social Impact

Data Cleaning

Deepa Bhuvanagiri

Introduction

Data cleaning is an crucial step in the data science pipeline
Sometimes deemed as the most time consuming step
Ensures data is correct and ready for proper analysis
A common saying used is “garbage in equals garbage out”
- Idea that models built on poor data yields poor/inaccurate results

Purpose of Demo

In this Demo

We will look into the basics of data cleaning and common conventions used in R

A common package used for cleaning and manipulating data in R is tidyverse

Which comes with a set of functions that we will look closer into

Functions and Packages

A function is a way to group a set of code to run when called
An analogy to help think about how a function works
- Could be to think about a “donut factory”
- Where you input ingredients and the output are the donuts

We can take this analogy of the donut factory and apply it to coding
Instead let’s say we want to calculate the mean
- We can define a function \(mean()\), where its inputs are the numbers we want the average of
- And the output would be the average

We will be able to see functions in action later in this demo.
A package is a group of functionsor packages that we can import to use in our code
For example, we could have functions \(mean()\), \(median()\), \(mode()\) in a package called “stat_summaries”
We can then import stat_summaries to start using these functions

Packages Used in Demo

Tidyverse

R has a collection of packages/tools that make data wrangling and cleaning easier
A popular package is Tidyverse

Popular Functions

Functions to aid in data manipulation:

filter() to select cases based on their values
arrange() to reorder the cases
select() and rename() to select variables based on their names
mutate() and transmute() to add new variables that are functions of existing variables
summarise() to condense multiple values to a single value
groupby() allows you to group within columns based on common attribute

Pipe Operator

Allows you to run one function after another
df %>% filter() %>% arrange()
Put on your socks, then your shoes, then tie your laces

Data Structures: DataFrames

Dataframes are a way for us to organize tabular data
Very commonly used in Data Science
Stored in a file type called a csv: i.e. filename.csv
Has columns and rows

Demo

Import Packages

#use this code to install packages onto your computer
#if you have not already done so
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("lubridate")

library(tidyverse)
library(dplyr)
library(lubridate)
library(kableExtra)

Import Dataframe

Contains information on UFO sightings around the world and obtained via Kaggle

Note: this data set had been additionally cleaned before being used in this demo

ufo_df<-read.csv("ufos.csv")

UFOs
city	state	country	shape	duration_seconds	date.posted	latitude	longitude
san marcos	tx	us	cylinder	2700	4/27/2004	29.88306	-97.941111
lackland afb	tx		light	7200	12/16/2005	29.38421	-98.581082
chester (uk/england)		gb	circle	20	1/21/2008	53.20000	-2.916667
edna	tx	us	circle	20	1/17/2004	28.97833	-96.645833
kaneohe	hi	us	light	900	1/22/2004	21.41806	-157.803611
bristol	tn	us	sphere	300	4/27/2007	36.59500	-82.188889
penarth (uk/wales)		gb	circle	180	2/14/2006	51.43472	-3.180000
norwalk	ct	us	disk	1200	10/2/1999	41.11750	-73.408333
pell city	al	us	disk	180	3/19/2009	33.58611	-86.286111
live oak	fl	us	disk	120	5/11/2005	30.29472	-82.984167

Summary of Dataframe

We can use the functions below to get an idea of the structure of our data

head() let’s us see the first 5 rows of a dataframe

head(ufo_df, n = 5)

UFOs
city	state	country	shape	duration_seconds	date.posted	latitude	longitude
san marcos	tx	us	cylinder	2700	4/27/2004	29.88306	-97.941111
lackland afb	tx		light	7200	12/16/2005	29.38421	-98.581082
chester (uk/england)		gb	circle	20	1/21/2008	53.20000	-2.916667
edna	tx	us	circle	20	1/17/2004	28.97833	-96.645833
kaneohe	hi	us	light	900	1/22/2004	21.41806	-157.803611

tail() let’s us see the last 5 rows of a dataframe

tail(ufo_df, n = 5)

UFOs
	city	state	country	shape	duration_seconds	date.posted	latitude	longitude
88866	napa	ca	us	other	1200	9/30/2013	38.29722	-122.28444
88867	vienna	va	us	circle	5	9/30/2013	38.90111	-77.26556
88868	edmond	ok	us	cigar	1020	9/30/2013	35.65278	-97.47778
88869	starr	sc	us	diamond	0	9/30/2013	34.37694	-82.69583
88870	ft. lauderdale	fl	us	oval	0	9/30/2013	26.12194	-80.14361

dim() shows us how many rows and columns in a dataframe

dim(ufo_df) # dimensions of data (rows, columns)

[1] 88870     8

summary() prints summary statistics of the data

summary(ufo_df) # summary stats of each column

UFOs
city	state	country	shape	duration_seconds	date.posted	latitude	longitude
Length:88870	Length:88870	Length:88870	Length:88870	Min. : 0	Length:88870	Min. :-82.86	Min. :-176.66
Class :character	Class :character	Class :character	Class :character	1st Qu.: 15	Class :character	1st Qu.: 34.03	1st Qu.:-112.05
Mode :character	Mode :character	Mode :character	Mode :character	Median : 120	Mode :character	Median : 39.23	Median : -87.65
NA	NA	NA	NA	Mean : 8373	NA	Mean : 37.45	Mean : -84.83
NA	NA	NA	NA	3rd Qu.: 600	NA	3rd Qu.: 42.72	3rd Qu.: -77.62
NA	NA	NA	NA	Max. :97836000	NA	Max. : 72.70	Max. : 178.44
NA	NA	NA	NA	NA	NA	NA's :197	NA

Data Cleaning + Intro Piping

Example of how to rename a column in a dataframe using rename()
- Renamed date.posted to date_posted

colnames(ufo_df) #colnames() lets us print the names of the columns in a dataframe

[1] "city"             "state"            "country"          "shape"           
[5] "duration_seconds" "date.posted"      "latitude"         "longitude"

#renames column to match convention
ufo_df_renamed<-rename(ufo_df,date_posted = date.posted)

colnames(ufo_df_renamed)

[1] "city"             "state"            "country"          "shape"           
[5] "duration_seconds" "date_posted"      "latitude"         "longitude"

Now let’s try using piping instead
Notice how with piping
- The previous variable (ufo_df) is assumed to be the first input in the next function (rename())

ufo_df%>%
  rename(date_posted = date.posted) ->temp1
colnames(temp1)

[1] "city"             "state"            "country"          "shape"           
[5] "duration_seconds" "date_posted"      "latitude"         "longitude"

This use of piping works for lots of different functions
- Here the mutate function allows us to tell the computer we are dealing with a date variable
- remember how we can have different types of variables (e.g. int, character, etc.)

str(temp1) #this function lets us see the different variable types present

'data.frame':   88870 obs. of  8 variables:
 $ city            : chr  "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
 $ state           : chr  "tx" "tx" "" "tx" ...
 $ country         : chr  "us" "" "gb" "us" ...
 $ shape           : chr  "cylinder" "light" "circle" "circle" ...
 $ duration_seconds: int  2700 7200 20 20 900 300 180 1200 180 120 ...
 $ date_posted     : chr  "4/27/2004" "12/16/2005" "1/21/2008" "1/17/2004" ...
 $ latitude        : num  29.9 29.4 53.2 29 21.4 ...
 $ longitude       : num  -97.94 -98.58 -2.92 -96.65 -157.8 ...

temp1%>%
  mutate(date_posted = as.POSIXct(date_posted,format = "%m/%d/%Y")) ->temp2
str(temp2)

'data.frame':   88870 obs. of  8 variables:
 $ city            : chr  "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
 $ state           : chr  "tx" "tx" "" "tx" ...
 $ country         : chr  "us" "" "gb" "us" ...
 $ shape           : chr  "cylinder" "light" "circle" "circle" ...
 $ duration_seconds: int  2700 7200 20 20 900 300 180 1200 180 120 ...
 $ date_posted     : POSIXct, format: "2004-04-27" "2005-12-16" ...
 $ latitude        : num  29.9 29.4 53.2 29 21.4 ...
 $ longitude       : num  -97.94 -98.58 -2.92 -96.65 -157.8 ...

Here is the cool part!

we can now use piping to do multiple cleaning steps at once
Below we can rename a column and change a variable type
Then output it to a new variable

#combines cleaning into one chunk
#put on your socks, then your shoes, then tie your laces
# put on your socks %>% then your shoes %>% then tie your laces

ufo_df%>%
  rename(date_posted = date.posted)%>%
  mutate(date_posted = as.POSIXct(date_posted,format = "%m/%d/%Y")) ->ufo_df_cleaned

str(ufo_df_cleaned)

'data.frame':   88870 obs. of  8 variables:
 $ city            : chr  "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
 $ state           : chr  "tx" "tx" "" "tx" ...
 $ country         : chr  "us" "" "gb" "us" ...
 $ shape           : chr  "cylinder" "light" "circle" "circle" ...
 $ duration_seconds: int  2700 7200 20 20 900 300 180 1200 180 120 ...
 $ date_posted     : POSIXct, format: "2004-04-27" "2005-12-16" ...
 $ latitude        : num  29.9 29.4 53.2 29 21.4 ...
 $ longitude       : num  -97.94 -98.58 -2.92 -96.65 -157.8 ...

Filter

The filter function let’s us subset the data based on specific criteria

Below we can select all the UFOs that are shaped like cylinders
Notice our use of piping here

ufo_df_cleaned%>%
  filter(shape == "cylinder")

UFOs
city	state	country	shape	duration_seconds	date_posted	latitude	longitude
san marcos	tx	us	cylinder	2700	2004-04-27	29.88306	-97.94111
monroe county	oh	us	cylinder	60	2004-07-08	39.44028	-84.36222
elgin	il	us	cylinder	1800	1998-11-01	42.03722	-88.28111
victoria (canada)	bc	ca	cylinder	30	2000-12-02	46.21667	-63.48333
sterling	il	us	cylinder	600	2010-11-21	41.78861	-89.69611
suffern	ny	us	cylinder	20	2012-10-30	41.11472	-74.15000

Let’s see how we can filter numerical values
- Here we are selecting all the UFOs that were in the air for more than 1,000 seconds

#finds ufos that were seen for more than 1000 seconds
ufo_df_cleaned%>%
  filter(duration_seconds>1000)

UFOs
city	state	country	shape	duration_seconds	date_posted	latitude	longitude
san marcos	tx	us	cylinder	2700	2004-04-27	29.88306	-97.94111
lackland afb	tx		light	7200	2005-12-16	29.38421	-98.58108
norwalk	ct	us	disk	1200	1999-10-02	41.11750	-73.40833
bellmore	ny	us	disk	1800	2000-05-11	40.66861	-73.52750
harlan county	ky	us	circle	1200	2005-09-15	36.84306	-83.32194
niantic	ct	us	disk	1800	2003-09-24	41.32528	-72.19361

We can also filter on an interval
- Here we can select all UFOs that were in the air for more than 1,000 seconds but less than 2,000 seconds

#finds ufos that were seen for more than 1000 seconds but less than 2000 seconds
ufo_df_cleaned%>%
  filter(duration_seconds>1000 & duration_seconds<2000)

UFOs
city	state	country	shape	duration_seconds	date_posted	latitude	longitude
norwalk	ct	us	disk	1200	1999-10-02	41.11750	-73.40833
bellmore	ny	us	disk	1800	2000-05-11	40.66861	-73.52750
harlan county	ky	us	circle	1200	2005-09-15	36.84306	-83.32194
niantic	ct	us	disk	1800	2003-09-24	41.32528	-72.19361
cardiff (uk/wales)		gb	disk	1200	2007-02-01	51.50000	-3.20000
hudson	ks	us	light	1200	2004-07-25	38.10556	-98.65972

We can also filter by date
- This is where it comes in handy to have this column in POSIXct format (i.e. date variable)
Here we can select all the UFO that were sighted in 2004
The year() allows us to filter by year
If we wanted to filter by month or day we would use month() and day()

#find all ufos that were spotted in 2004
ufo_df_cleaned%>%
  filter(year(date_posted) == 2004) #used with lubridate package

UFOs
city	state	country	shape	duration_seconds	date_posted	latitude	longitude
san marcos	tx	us	cylinder	2700	2004-04-27	29.88306	-97.94111
edna	tx	us	circle	20	2004-01-17	28.97833	-96.64583
kaneohe	hi	us	light	900	2004-01-22	21.41806	-157.80361
hudson	ks	us	light	1200	2004-07-25	38.10556	-98.65972
louisville	ky	us	light	30	2004-03-17	38.25417	-85.75944
seattle (ballard area)	wa	us	unknown	2	2004-04-27	47.60639	-122.33083

Summarise

sumarise() let’s us take the data and group/summarises the data in a specific way and outputs a dataframe

Let’s look at an example:
- Here are able to calculate on average how long a UFO was spotted in the sky
- We create a new column called mean_time and use the mean() to calculate the mean
- The output is a data frame

ufo_df_cleaned%>%
  summarise(mean_time = mean(duration_seconds))

UFOs
mean_time
8373.412

We can also create multiple columns using summarise()
More specifically, we can create a new dataframe with columns summarising the data
Below we create a dataframe that contains the average, minimum and maximum times a UFO was sighted in the sky

#finds the average time a ufo was seen in the sky
ufo_df_cleaned%>%
  summarise(mean_time = mean(duration_seconds),
            min_time = min(duration_seconds),
            max_time = max(duration_seconds))

UFOs
mean_time	min_time	max_time
8373.412	0	97836000

We can also combine filter() and summarise() using pipes
Below we are able to see the average time a UFO with a cylinder shape was seen in the sky

#combine with filtering
#Finds the average (mean) speed of a ufo with a cylinder shape
ufo_df_cleaned%>%
  filter(shape == "cylinder")%>%
  summarise(mean_time = mean(duration_seconds))

UFOs
mean_time
3184.08

Add columns

We can also add columns to our dataframe using mutate()
Currently, we have information on how long a UFO was seen in the sky in seconds
- This isn’t too useful when some of the entries are 1,000 seconds
To better interpret this, we can create a new column where the duration is in minutes

#adds a column that converts the duration from seconds to minutes
ufo_df_cleaned%>%
  mutate(durations_mins = round(duration_seconds/60, 2))

UFOs
city	state	country	shape	duration_seconds	date_posted	latitude	longitude	durations_mins
san marcos	tx	us	cylinder	2700	2004-04-27	29.88306	-97.941111	45.00
lackland afb	tx		light	7200	2005-12-16	29.38421	-98.581082	120.00
chester (uk/england)		gb	circle	20	2008-01-21	53.20000	-2.916667	0.33
edna	tx	us	circle	20	2004-01-17	28.97833	-96.645833	0.33
kaneohe	hi	us	light	900	2004-01-22	21.41806	-157.803611	15.00
bristol	tn	us	sphere	300	2007-04-27	36.59500	-82.188889	5.00

There is also a built in function for this that was outlined here:(convert from seconds to hours and minutes)

Select

select() allows us to choose specific columns that are of interest to us

It is a way to reduce number of columns
Let’s say we only care about the state and duration in the sky of a UFO

ufo_df_cleaned%>%
  mutate(durations_mins = round(duration_seconds/60, 2))%>%
  select(state, duration_seconds, durations_mins)

UFOs
state	duration_seconds	durations_mins
tx	2700	45.00
tx	7200	120.00
	20	0.33
tx	20	0.33
hi	900	15.00
tn	300	5.00

What if we want all but the “city” column

Placing a “-” before the name of the column will indicate to select everything but that column

ufo_df_cleaned%>%
  select(-city)

UFOs
state	country	shape	duration_seconds	date_posted	latitude	longitude
tx	us	cylinder	2700	2004-04-27	29.88306	-97.941111
tx		light	7200	2005-12-16	29.38421	-98.581082
	gb	circle	20	2008-01-21	53.20000	-2.916667
tx	us	circle	20	2004-01-17	28.97833	-96.645833
hi	us	light	900	2004-01-22	21.41806	-157.803611
tn	us	sphere	300	2007-04-27	36.59500	-82.188889

Groupby

group_by() can be used with summarise() to get some really cool results

Let’s say we want to know if the shape of a UFO affects how long it was seen in the sky
We might want to calculate the average time each type of UFO (e.g. a cylinder UFO vs. circle UFO vs…) was in the sky and compare
To do this, we can group by UFO shape and calculate the mean for each

#Calculates which shape was witnessed for the longest time in sky
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(mean_time = mean(duration_seconds))

UFOs
shape	mean_time
	31270.1983
changed	3600.0000
changing	1933.5794
chevron	488.6256
cigar	1805.3039
circle	4307.6143

To help view the data more clearly
We can also arrange the duration_seconds column from least to greatest using arrange()
To get an idea of which UFO shapes were in the sky for the shortest time

#orders shapes from least amount of time in sky seen to most
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(mean_time = mean(duration_seconds))%>%
  arrange(mean_time)

UFOs
shape	mean_time
dome	2.0000
flare	30.0000
pyramid	120.0000
hexagon	240.0000
round	452.5000
chevron	488.6256

We can also see how to arrange the duration_seconds column from greatest to least
desc() stand for descending, so when used with arrange()
- Will order entries in descending order

#orders shapes from seen in the sky the longest to shortest
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(mean_time = mean(duration_seconds))%>%
  arrange(desc(mean_time))

UFOs
shape	mean_time
cone	71342.83
	31270.20
sphere	20410.62
crescent	18905.00
other	18669.29
light	12256.44

Summarise + n()

n() allows us to calculate the total number of entries in a group

If we wanted to see how many UFOs there were for each shape we can use:
- group_by() to group by each UFO shape
- summarise() to create a new dataframe with the summary stat (in this case n())
- n() to count the number of entries per group (i.e. UFO shape)

#how many of each ufo shape were spotted
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(total = n())%>%
  arrange(desc(total))

UFOs
shape	total
light	17872
triangle	8489
circle	8452
fireball	6562
unknown	6319
other	6247

We can also add head() to the code to print the top 10 most UFO shapes spotted

#top 5 shapes that were spotted
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(total = n())%>%
  arrange(desc(total))%>%
  head(10)

UFOs
shape	total
light	17872
triangle	8489
circle	8452
fireball	6562
unknown	6319
other	6247
disk	6005
sphere	5755
oval	4119
	3116

Bonus Demo

Plotting with ggplot

#can plot the top 5 shapes by inputting dataset straight to ggplot
ufo_df_cleaned%>%
  group_by(shape)%>%
  summarise(total = n())%>%
  arrange(desc(total))%>%
  head()%>% # up to here is same code as in previous slide
  ggplot(aes(shape,total)) + 
  geom_col()

Map of UFO sightings

#make the dots smaller
#maybe leaflet and mapview
#shows that theres multiple ways to code something
library(mapview)
#filters out entries where there is not latitude or longitdue entries
# Note this may take a second to run

ufo_df_cleaned %>%
  head(100)%>%
  filter(!is.na(longitude) & !is.na(latitude)) %>%
  mapview(
    xcol = "longitude",
    ycol = "latitude",
    crs = 4269,
    grid = FALSE
  )

Glossary

Function: A way to group a set of code to run when called
Pipe Operator: Allows you to run one function after another
Package: A group of functions or packages that we can import to use in our code

References

A Beginner’s Guide to Tidyverse – The Most Powerful Collection of R Packages for Data Science. (2019, May 13). Analytics Vidhya. Retrieved July 27, 2023, from https://www.analyticsvidhya.com/ blog/2019/05/beginner-guide-tidyverse-most-powerful-collection-r-packages-data-science/