Data Cleaning
In this Demo
A common package used for cleaning and manipulating data in R is the tidyverse.
We will see its functions in action later in this demo.
A package is a group of functions (and sometimes data) that we can import to use in our code.
For example, we could have functions mean(), median(), and mode() in a package called “stat_summaries”.
We can then import stat_summaries to start using these functions.
Functions to aid in data manipulation:

- filter() to select cases based on their values
- arrange() to reorder the cases
- select() and rename() to select variables based on their names
- mutate() and transmute() to add new variables that are functions of existing variables
- summarise() to condense multiple values to a single value
- group_by() allows you to group rows based on a common attribute
- the pipe operator, %>%, chains these together: df %>% filter() %>% arrange()
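As a quick sketch of how these verbs chain together (using a small made-up data frame, not the demo data):

```r
library(dplyr)

# a tiny made-up data frame for illustration
df <- data.frame(name = c("a", "b", "c"), value = c(3, 1, 2))

df %>%
  filter(value > 1) %>%   # keep rows where value exceeds 1
  arrange(value)          # sort the remaining rows by value
```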
The data set contains information on UFO sightings around the world and was obtained via Kaggle.
Note: this data set was additionally cleaned before being used in this demo.
city | state | country | shape | duration_seconds | date.posted | latitude | longitude |
---|---|---|---|---|---|---|---|
san marcos | tx | us | cylinder | 2700 | 4/27/2004 | 29.88306 | -97.941111 |
lackland afb | tx | | light | 7200 | 12/16/2005 | 29.38421 | -98.581082 |
chester (uk/england) | | gb | circle | 20 | 1/21/2008 | 53.20000 | -2.916667 |
edna | tx | us | circle | 20 | 1/17/2004 | 28.97833 | -96.645833 |
kaneohe | hi | us | light | 900 | 1/22/2004 | 21.41806 | -157.803611 |
bristol | tn | us | sphere | 300 | 4/27/2007 | 36.59500 | -82.188889 |
penarth (uk/wales) | | gb | circle | 180 | 2/14/2006 | 51.43472 | -3.180000 |
norwalk | ct | us | disk | 1200 | 10/2/1999 | 41.11750 | -73.408333 |
pell city | al | us | disk | 180 | 3/19/2009 | 33.58611 | -86.286111 |
live oak | fl | us | disk | 120 | 5/11/2005 | 30.29472 | -82.984167 |
We can use functions such as head(), tail(), and summary() to get an idea of the structure of our data
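The outputs below come from the usual inspection functions; a sketch of the calls (assuming the data frame is named ufo_df, as later in the demo):

```r
head(ufo_df)     # first six rows
tail(ufo_df)     # last six rows
summary(ufo_df)  # per-column summary statistics
```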
city | state | country | shape | duration_seconds | date.posted | latitude | longitude |
---|---|---|---|---|---|---|---|
san marcos | tx | us | cylinder | 2700 | 4/27/2004 | 29.88306 | -97.941111 |
lackland afb | tx | | light | 7200 | 12/16/2005 | 29.38421 | -98.581082 |
chester (uk/england) | | gb | circle | 20 | 1/21/2008 | 53.20000 | -2.916667 |
edna | tx | us | circle | 20 | 1/17/2004 | 28.97833 | -96.645833 |
kaneohe | hi | us | light | 900 | 1/22/2004 | 21.41806 | -157.803611 |
 | city | state | country | shape | duration_seconds | date.posted | latitude | longitude |
---|---|---|---|---|---|---|---|---|
88866 | napa | ca | us | other | 1200 | 9/30/2013 | 38.29722 | -122.28444 |
88867 | vienna | va | us | circle | 5 | 9/30/2013 | 38.90111 | -77.26556 |
88868 | edmond | ok | us | cigar | 1020 | 9/30/2013 | 35.65278 | -97.47778 |
88869 | starr | sc | us | diamond | 0 | 9/30/2013 | 34.37694 | -82.69583 |
88870 | ft. lauderdale | fl | us | oval | 0 | 9/30/2013 | 26.12194 | -80.14361 |
city | state | country | shape | duration_seconds | date.posted | latitude | longitude |
---|---|---|---|---|---|---|---|
Length:88870 | Length:88870 | Length:88870 | Length:88870 | Min. : 0 | Length:88870 | Min. :-82.86 | Min. :-176.66 |
Class :character | Class :character | Class :character | Class :character | 1st Qu.: 15 | Class :character | 1st Qu.: 34.03 | 1st Qu.:-112.05 |
Mode :character | Mode :character | Mode :character | Mode :character | Median : 120 | Mode :character | Median : 39.23 | Median : -87.65 |
 | | | | Mean : 8373 | | Mean : 37.45 | Mean : -84.83 |
 | | | | 3rd Qu.: 600 | | 3rd Qu.: 42.72 | 3rd Qu.: -77.62 |
 | | | | Max. :97836000 | | Max. : 72.70 | Max. : 178.44 |
 | | | | | | NA's :197 | |
[1] "city" "state" "country" "shape"
[5] "duration_seconds" "date.posted" "latitude" "longitude"
#renames column to match convention
ufo_df_renamed <- rename(ufo_df, date_posted = date.posted)
colnames(ufo_df_renamed)
[1] "city" "state" "country" "shape"
[5] "duration_seconds" "date_posted" "latitude" "longitude"
With the pipe operator %>%, the output of the previous step (ufo_df) is assumed to be the first input to the next function (rename()):

[1] "city"             "state"            "country"          "shape"
[5] "duration_seconds" "date_posted"      "latitude"         "longitude"
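A pipe that produces the column names above would look something like this sketch:

```r
ufo_df %>%
  rename(date_posted = date.posted) %>%
  colnames()
```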
'data.frame': 88870 obs. of 8 variables:
$ city : chr "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
$ state : chr "tx" "tx" "" "tx" ...
$ country : chr "us" "" "gb" "us" ...
$ shape : chr "cylinder" "light" "circle" "circle" ...
$ duration_seconds: int 2700 7200 20 20 900 300 180 1200 180 120 ...
$ date_posted : chr "4/27/2004" "12/16/2005" "1/21/2008" "1/17/2004" ...
$ latitude : num 29.9 29.4 53.2 29 21.4 ...
$ longitude : num -97.94 -98.58 -2.92 -96.65 -157.8 ...
'data.frame': 88870 obs. of 8 variables:
$ city : chr "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
$ state : chr "tx" "tx" "" "tx" ...
$ country : chr "us" "" "gb" "us" ...
$ shape : chr "cylinder" "light" "circle" "circle" ...
$ duration_seconds: int 2700 7200 20 20 900 300 180 1200 180 120 ...
$ date_posted : POSIXct, format: "2004-04-27" "2005-12-16" ...
$ latitude : num 29.9 29.4 53.2 29 21.4 ...
$ longitude : num -97.94 -98.58 -2.92 -96.65 -157.8 ...
Here is the cool part!
#combines cleaning into one chunk
#put on your socks, then your shoes, then tie your laces
# put on your socks %>% then your shoes %>% then tie your laces
ufo_df %>%
  rename(date_posted = date.posted) %>%
  mutate(date_posted = as.POSIXct(date_posted, format = "%m/%d/%Y")) -> ufo_df_cleaned
str(ufo_df_cleaned)
'data.frame': 88870 obs. of 8 variables:
$ city : chr "san marcos" "lackland afb" "chester (uk/england)" "edna" ...
$ state : chr "tx" "tx" "" "tx" ...
$ country : chr "us" "" "gb" "us" ...
$ shape : chr "cylinder" "light" "circle" "circle" ...
$ duration_seconds: int 2700 7200 20 20 900 300 180 1200 180 120 ...
$ date_posted : POSIXct, format: "2004-04-27" "2005-12-16" ...
$ latitude : num 29.9 29.4 53.2 29 21.4 ...
$ longitude : num -97.94 -98.58 -2.92 -96.65 -157.8 ...
The filter() function lets us subset the data based on specific criteria
city | state | country | shape | duration_seconds | date_posted | latitude | longitude |
---|---|---|---|---|---|---|---|
san marcos | tx | us | cylinder | 2700 | 2004-04-27 | 29.88306 | -97.94111 |
monroe county | oh | us | cylinder | 60 | 2004-07-08 | 39.44028 | -84.36222 |
elgin | il | us | cylinder | 1800 | 1998-11-01 | 42.03722 | -88.28111 |
victoria (canada) | bc | ca | cylinder | 30 | 2000-12-02 | 46.21667 | -63.48333 |
sterling | il | us | cylinder | 600 | 2010-11-21 | 41.78861 | -89.69611 |
suffern | ny | us | cylinder | 20 | 2012-10-30 | 41.11472 | -74.15000 |
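A sketch of a call that could produce a table like the one above (the exact criterion is an assumption based on the rows shown):

```r
ufo_df_cleaned %>%
  filter(shape == "cylinder")  # keep only cylinder-shaped sightings
```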
city | state | country | shape | duration_seconds | date_posted | latitude | longitude |
---|---|---|---|---|---|---|---|
san marcos | tx | us | cylinder | 2700 | 2004-04-27 | 29.88306 | -97.94111 |
lackland afb | tx | | light | 7200 | 2005-12-16 | 29.38421 | -98.58108 |
norwalk | ct | us | disk | 1200 | 1999-10-02 | 41.11750 | -73.40833 |
bellmore | ny | us | disk | 1800 | 2000-05-11 | 40.66861 | -73.52750 |
harlan county | ky | us | circle | 1200 | 2005-09-15 | 36.84306 | -83.32194 |
niantic | ct | us | disk | 1800 | 2003-09-24 | 41.32528 | -72.19361 |
city | state | country | shape | duration_seconds | date_posted | latitude | longitude |
---|---|---|---|---|---|---|---|
norwalk | ct | us | disk | 1200 | 1999-10-02 | 41.11750 | -73.40833 |
bellmore | ny | us | disk | 1800 | 2000-05-11 | 40.66861 | -73.52750 |
harlan county | ky | us | circle | 1200 | 2005-09-15 | 36.84306 | -83.32194 |
niantic | ct | us | disk | 1800 | 2003-09-24 | 41.32528 | -72.19361 |
cardiff (uk/wales) | | gb | disk | 1200 | 2007-02-01 | 51.50000 | -3.20000 |
hudson | ks | us | light | 1200 | 2004-07-25 | 38.10556 | -98.65972 |
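The tables above subset on the duration column; a hedged sketch (the exact threshold is an assumption consistent with the rows shown):

```r
ufo_df_cleaned %>%
  filter(duration_seconds >= 1200)  # sightings lasting at least 20 minutes
```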
The lubridate helpers year(), month(), and day() extract parts of a date:

- year() allows us to filter by year
- month() and day() work the same way for months and days
city | state | country | shape | duration_seconds | date_posted | latitude | longitude |
---|---|---|---|---|---|---|---|
san marcos | tx | us | cylinder | 2700 | 2004-04-27 | 29.88306 | -97.94111 |
edna | tx | us | circle | 20 | 2004-01-17 | 28.97833 | -96.64583 |
kaneohe | hi | us | light | 900 | 2004-01-22 | 21.41806 | -157.80361 |
hudson | ks | us | light | 1200 | 2004-07-25 | 38.10556 | -98.65972 |
louisville | ky | us | light | 30 | 2004-03-17 | 38.25417 | -85.75944 |
seattle (ballard area) | wa | us | unknown | 2 | 2004-04-27 | 47.60639 | -122.33083 |
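A sketch of filtering by year with lubridate (2004 matches the dates shown above, but the exact call is an assumption):

```r
library(lubridate)

ufo_df_cleaned %>%
  filter(year(date_posted) == 2004)  # keep only sightings posted in 2004
```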
summarise() lets us condense the data into a summary statistic and outputs a dataframe. Here we use mean() to calculate the mean:

mean_time |
---|
8373.412 |
summarise() can compute several statistics at once:
mean_time | min_time | max_time |
---|---|---|
8373.412 | 0 | 97836000 |
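A sketch of the summarise() calls behind the two tables above (the column names are taken from the output):

```r
ufo_df_cleaned %>%
  summarise(mean_time = mean(duration_seconds))

ufo_df_cleaned %>%
  summarise(
    mean_time = mean(duration_seconds),
    min_time  = min(duration_seconds),
    max_time  = max(duration_seconds)
  )
```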
Combining filter() and summarise() using pipes:

mean_time |
---|
3184.08 |
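A sketch of chaining filter() into summarise(); the filter condition here is an assumption for illustration, since the original criterion is not shown:

```r
ufo_df_cleaned %>%
  filter(shape == "circle") %>%                  # assumed condition
  summarise(mean_time = mean(duration_seconds))
```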
mutate() adds a new column computed from existing ones; here durations_mins is duration_seconds converted to minutes:
city | state | country | shape | duration_seconds | date_posted | latitude | longitude | durations_mins |
---|---|---|---|---|---|---|---|---|
san marcos | tx | us | cylinder | 2700 | 2004-04-27 | 29.88306 | -97.941111 | 45.00 |
lackland afb | tx | | light | 7200 | 2005-12-16 | 29.38421 | -98.581082 | 120.00 |
chester (uk/england) | | gb | circle | 20 | 2008-01-21 | 53.20000 | -2.916667 | 0.33 |
edna | tx | us | circle | 20 | 2004-01-17 | 28.97833 | -96.645833 | 0.33 |
kaneohe | hi | us | light | 900 | 2004-01-22 | 21.41806 | -157.803611 | 15.00 |
bristol | tn | us | sphere | 300 | 2007-04-27 | 36.59500 | -82.188889 | 5.00 |
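The durations_mins values above are consistent with dividing duration_seconds by 60; a sketch:

```r
ufo_df_cleaned %>%
  mutate(durations_mins = round(duration_seconds / 60, 2))
```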
There is also a built-in function for this conversion (from seconds to hours and minutes).
select() allows us to choose specific columns that are of interest to us
state | duration_seconds | durations_mins |
---|---|---|
tx | 2700 | 45.00 |
tx | 7200 | 120.00 |
 | 20 | 0.33 |
tx | 20 | 0.33 |
hi | 900 | 15.00 |
tn | 300 | 5.00 |
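A sketch of the select() call (the mutate() step recreates the durations_mins column so the example is self-contained):

```r
ufo_df_cleaned %>%
  mutate(durations_mins = round(duration_seconds / 60, 2)) %>%
  select(state, duration_seconds, durations_mins)
```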
What if we want all but the “city” column?
state | country | shape | duration_seconds | date_posted | latitude | longitude |
---|---|---|---|---|---|---|
tx | us | cylinder | 2700 | 2004-04-27 | 29.88306 | -97.941111 |
tx | | light | 7200 | 2005-12-16 | 29.38421 | -98.581082 |
 | gb | circle | 20 | 2008-01-21 | 53.20000 | -2.916667 |
tx | us | circle | 20 | 2004-01-17 | 28.97833 | -96.645833 |
hi | us | light | 900 | 2004-01-22 | 21.41806 | -157.803611 |
tn | us | sphere | 300 | 2007-04-27 | 36.59500 | -82.188889 |
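Prefixing a column name with a minus sign drops it; a sketch:

```r
ufo_df_cleaned %>%
  select(-city)  # keep every column except city
```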
group_by() can be used with summarise() to get some really cool results:
shape | mean_time |
---|---|
 | 31270.1983 |
changed | 3600.0000 |
changing | 1933.5794 |
chevron | 488.6256 |
cigar | 1805.3039 |
circle | 4307.6143 |
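A sketch of grouping by shape and averaging the duration within each group:

```r
ufo_df_cleaned %>%
  group_by(shape) %>%
  summarise(mean_time = mean(duration_seconds))
```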
arrange() reorders the rows, by default from smallest to largest:
shape | mean_time |
---|---|
dome | 2.0000 |
flare | 30.0000 |
pyramid | 120.0000 |
hexagon | 240.0000 |
round | 452.5000 |
chevron | 488.6256 |
desc() stands for descending, so when used with arrange() the rows are sorted from largest to smallest:
shape | mean_time |
---|---|
cone | 71342.83 |
 | 31270.20 |
sphere | 20410.62 |
crescent | 18905.00 |
other | 18669.29 |
light | 12256.44 |
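The ascending and descending tables above can be produced by arrange() and arrange(desc()); a sketch:

```r
# ascending (default)
ufo_df_cleaned %>%
  group_by(shape) %>%
  summarise(mean_time = mean(duration_seconds)) %>%
  arrange(mean_time)

# descending
ufo_df_cleaned %>%
  group_by(shape) %>%
  summarise(mean_time = mean(duration_seconds)) %>%
  arrange(desc(mean_time))
```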
n() allows us to calculate the total number of entries in a group. We use:

- group_by() to group by each UFO shape
- summarise() to create a new dataframe with the summary stat (in this case n())
- n() to count the number of entries per group (i.e. UFO shape)

shape | total |
---|---|
light | 17872 |
triangle | 8489 |
circle | 8452 |
fireball | 6562 |
unknown | 6319 |
other | 6247 |
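A sketch of the counting pipeline described above:

```r
ufo_df_cleaned %>%
  group_by(shape) %>%          # one group per UFO shape
  summarise(total = n()) %>%   # count entries in each group
  arrange(desc(total))         # most common shapes first
```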
We can add head() to the code to print the top 10 most-spotted UFO shapes:

shape | total |
---|---|
light | 17872 |
triangle | 8489 |
circle | 8452 |
fireball | 6562 |
unknown | 6319 |
other | 6247 |
disk | 6005 |
sphere | 5755 |
oval | 4119 |
 | 3116 |
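Appending head(10) limits the output to the first 10 rows; a sketch:

```r
ufo_df_cleaned %>%
  group_by(shape) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(10)  # keep only the top 10 shapes
```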
#make the dots smaller
#maybe leaflet and mapview
#shows that there's multiple ways to code something
library(mapview)

#filters out entries with no latitude or longitude values
# Note: this may take a second to run
ufo_df_cleaned %>%
  head(100) %>%
  filter(!is.na(longitude) & !is.na(latitude)) %>%
  mapview(
    xcol = "longitude",
    ycol = "latitude",
    crs = 4269,
    grid = FALSE
  )