Data Science for Social Impact

Introduction to Statistics and Exploratory Data Analysis (EDA)

Ayorinde Emmanuel Olatunde

What is Statistics?

  • Statistics is COSPA

  • It is the science of Collecting, Organizing, Summarizing, Presenting, and Analyzing data.

  • It aims to move from data, through information and knowledge, to applied knowledge (decisions).

Why Statistics?

  • The essence of statistics is

    • To be able to scientifically learn/gain insights about the whole from its part with a measure of certainty

    • Thereby saving us time, cost, personnel etc.

  • Statistics has gained such wide usage that it now cuts across all fields of research,

    • with some relying on it heavily and others only sparingly.
  • Researchers and analysts can use statistical approaches to:

    • derive meaningful findings, identify patterns, and make informed data-based decisions.

Basic Terms in Statistics

  • Data: Raw information gathered through observations, experiments, surveys, or other sources.
    • It can be numerical (quantitative i.e. amounts/measurements) or qualitative (categorical i.e. features/characteristics).
  • Population/Sample: Population is the complete group of interest, while sample is a subset of that population.
    • Since studying the entire population is not always practicable, analyzing a sample to reveal insights into the desired characteristics of the population becomes the whole essence of Statistics.
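    • Numerical characteristics of a Population (e.g. the population mean) are called Parameters, while the corresponding quantities computed from a Sample are called Statistics and serve as estimates of the parameters.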
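
The population/sample distinction can be sketched in a few lines of R. This is a minimal illustration on simulated values, not real measurements: the population size, mean, and standard deviation are assumptions chosen only for the demonstration.

```r
# Hypothetical population: 10,000 simulated body masses in grams (not real data).
set.seed(42)
population <- rnorm(10000, mean = 4200, sd = 800)

# Parameter: a characteristic of the whole population.
population_mean <- mean(population)

# Statistic: the same quantity computed from a random sample of 100 units.
sample_units <- sample(population, size = 100)
sample_mean <- mean(sample_units)

# The standard error is one "measure of certainty" attached to the estimate.
standard_error <- sd(sample_units) / sqrt(length(sample_units))

cat("Population mean (parameter):", round(population_mean, 1), "\n")
cat("Sample mean (statistic):    ", round(sample_mean, 1), "\n")
cat("Standard error of the mean: ", round(standard_error, 1), "\n")
```

The sample mean will usually land close to the population mean, and the standard error gives a rough sense of how far off it is likely to be.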

Basic Terms in Statistics

  • Descriptive Statistics: are used to explore and describe important characteristics of a dataset, i.e.

Basic Terms in Statistics

  • Statistical Graphs and Charts: bar charts, line graphs, histograms, pie charts, scatter plots, etc. are visual representations of data that help in comprehending patterns and trends in datasets.

  • Inferential Statistics: involves drawing predictions/conclusions about a population based on the analysis of a sample drawn from that population.

  • Probability: is the likelihood that an event will occur. Probabilistic statements are often needed to quantify uncertainty in inferential statistics, both for prediction and for hypothesis testing.

  • Hypothesis testing: is a strategy for determining whether there is a statistically significant difference between two or more groups, or whether an observed effect is statistically significant.

    • It entails stating null and alternative hypotheses and running statistical tests to see whether the data provide enough evidence to reject the null hypothesis in favor of the alternative.
  • Correlation & Regression: Correlation assesses the strength and direction of a linear relationship between two continuous variables.

    • Regression develops a model to predict the value of a dependent variable based on the values of one or more independent variables.
    • Under certain conditions, Regression can help in identifying causal effects among variables.
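
As a rough sketch of how correlation, regression, and hypothesis testing look in practice, the following uses `mtcars`, a dataset that ships with base R. The choice of dataset and variables is an assumption made for illustration and is unrelated to the penguins data used later.

```r
# mtcars ships with base R: mpg = miles per gallon, wt = weight (1000 lbs),
# am = transmission (0 = automatic, 1 = manual).

# Correlation: strength and direction of the linear relationship.
r <- cor(mtcars$wt, mtcars$mpg)

# Regression: model mpg as a linear function of weight.
fit <- lm(mpg ~ wt, data = mtcars)
slope <- coef(fit)[["wt"]]

# Hypothesis test (Welch two-sample t-test).
# H0: mean mpg is equal for both transmission types; H1: the means differ.
test <- t.test(mpg ~ am, data = mtcars)

cat("Correlation between wt and mpg:", round(r, 3), "\n")
cat("Estimated slope of mpg on wt:  ", round(slope, 3), "\n")
cat("t-test p-value:                ", signif(test$p.value, 3), "\n")
```

A small p-value is evidence against the null hypothesis; on its own it says nothing about causality, which depends on how the data were collected.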

Introduction to Exploratory Data Analysis (EDA)

What is EDA?

  • EDA is the first set of analysis you perform on your data with the aim of:

    • Discovering patterns

    • Observing possible rare events

    • Testing hypotheses and checking whether your assumptions are valid, through summary statistics and graphs

Why EDA?

  • In short, it helps the analyst decide the direction the analysis should take, i.e. to better understand the variables and the relationships between them.

  • EDA is akin to the set of questions you ask yourself before taking a decision, where a series of Yes answers may imply taking the decision while a series of No answers may mean not taking it. In the process, the analyst is also able to clean up the data.

With EDA, you are able to

  • Reduce the features to the vital few and remove redundant features,
  • Identify outliers/influential points/missing values/human error,
  • Understand the presence/absence of relationship(s) between variables, and
  • Thereby use the insights gained to minimize potential error in the main analysis.

In summary, EDA lays the foundation for drawing relevant information from data analysis effectively and for making statistically sound decisions.

Demonstration of EDA: Palmerpenguins

Palmerpenguins dataset, what about it?

It is a model dataset containing both categorical and numerical features, including body measurements, for 344 penguins of three species from three islands.

Loading Libraries

# Install required packages
#utils::install.packages("tidyverse", repos = "http://cran.us.r-project.org")
#utils::install.packages("palmerpenguins", repos = "http://cran.us.r-project.org")
#utils::install.packages("corrplot", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(palmerpenguins)
library(corrplot)
library(kableExtra)
library(skimr)
library(dplyr)

data("penguins")


kable(head(penguins, n = 5),
    caption = "First five rows of the penguins dataset") %>%
  kable_styling(
    font_size = 15, 
    full_width = TRUE, 
    latex_options = "scale_down") %>%
  kable_classic(c("striped", "hover", "condensed"))
First five rows of the penguins dataset
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007

Count the number of rows and columns in the dataset

dim(penguins)
[1] 344   8
nrow(penguins)
[1] 344
ncol(penguins)
[1] 8

View the last rows of the dataset

kable(tail(penguins, n = 5),
    caption = "Last five rows of the penguins dataset") %>%
  kable_styling(
    font_size = 15, 
    full_width = TRUE, 
    latex_options = "scale_down") %>%
  kable_classic(c("striped", "hover", "condensed"))
Last five rows of the penguins dataset
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Chinstrap Dream 55.8 19.8 207 4000 male 2009
Chinstrap Dream 43.5 18.1 202 3400 female 2009
Chinstrap Dream 49.6 18.2 193 3775 male 2009
Chinstrap Dream 50.8 19.0 210 4100 male 2009
Chinstrap Dream 50.2 18.7 198 3775 female 2009

Check the structure of columns (variables/features) in the dataset

Factor vs Characters vs Numeric

  • Character: Word or string of words that can be transformed into factored variables
  • Factor: used for categorical data, stored as integers, can be ordered or unordered.
  • Numeric: used for continuous/ratio/interval level data
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Levels and number of levels of a factor variable

levels(penguins$species)
[1] "Adelie"    "Chinstrap" "Gentoo"   
nlevels(penguins$species)
[1] 3
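
To make the character/factor distinction concrete, here is a small sketch on hand-made vectors. The values are hypothetical and only borrow the island names for familiarity.

```r
# A character vector: plain strings with no notion of categories.
island_chr <- c("Dream", "Biscoe", "Dream", "Torgersen")

# Converting to a factor stores integer codes plus a set of levels
# (levels are sorted alphabetically by default).
island_fct <- factor(island_chr)
print(levels(island_fct))      # "Biscoe" "Dream" "Torgersen"
print(as.integer(island_fct))  # 2 1 2 3

# An ordered factor additionally records a ranking of its levels.
size_ord <- factor(c("small", "large", "medium"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)
print(size_ord[1] < size_ord[2])  # TRUE: "small" ranks below "large"
```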

Remove rows with missing bill length or sex

penguins <- penguins %>% 
  filter(!is.na(bill_length_mm), !is.na(sex))

Check the proportion of missing values in a specific column

mean(is.na(penguins$body_mass_g))
[1] 0
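
The same check can be run on every column at once. The sketch below uses a small hand-made data frame (hypothetical values, not the penguins data) so that the counts are easy to verify by eye.

```r
# Toy data frame with a few deliberately missing entries (hypothetical values).
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3, 36.7),
  body_mass_g    = c(3750, 3800, NA, 3450),
  sex            = c("male", NA, "female", "female")
)

# Number and proportion of missing values per column.
na_counts <- colSums(is.na(toy))
na_share  <- colMeans(is.na(toy))
print(na_counts)
print(na_share)

# Keep only the rows with no missing value in any column.
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))  # rows 1 and 4 are complete
```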

Count the number of Penguins based on Species/Island

kable(count(penguins, species, island)) %>%
  kable_styling(
    font_size = 15, 
    full_width = TRUE, 
    latex_options = "scale_down") %>%
  kable_classic(c("striped", "hover", "condensed"))
species island n
Adelie Biscoe 44
Adelie Dream 55
Adelie Torgersen 47
Chinstrap Dream 68
Gentoo Biscoe 119

Summary statistics of the dataset

kable(summary(penguins)) %>%
  kable_styling(
    font_size = 15, 
    full_width = TRUE, 
    latex_options = "scale_down") %>%
  kable_classic(c("striped", "hover", "condensed"))
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie :146 Biscoe :163 Min. :32.10 Min. :13.10 Min. :172 Min. :2700 female:165 Min. :2007
Chinstrap: 68 Dream :123 1st Qu.:39.50 1st Qu.:15.60 1st Qu.:190 1st Qu.:3550 male :168 1st Qu.:2007
Gentoo :119 Torgersen: 47 Median :44.50 Median :17.30 Median :197 Median :4050 NA Median :2008
NA NA Mean :43.99 Mean :17.16 Mean :201 Mean :4207 NA Mean :2008
NA NA 3rd Qu.:48.60 3rd Qu.:18.70 3rd Qu.:213 3rd Qu.:4775 NA 3rd Qu.:2009
NA NA Max. :59.60 Max. :21.50 Max. :231 Max. :6300 NA Max. :2009

Select specific columns

penguins_selected <- penguins %>%
  select(species, island, bill_length_mm, bill_depth_mm)
head(penguins_selected)
# A tibble: 6 × 4
  species island    bill_length_mm bill_depth_mm
  <fct>   <fct>              <dbl>         <dbl>
1 Adelie  Torgersen           39.1          18.7
2 Adelie  Torgersen           39.5          17.4
3 Adelie  Torgersen           40.3          18  
4 Adelie  Torgersen           36.7          19.3
5 Adelie  Torgersen           39.3          20.6
6 Adelie  Torgersen           38.9          17.8

Group data by a variable and calculate summary statistics

penguins_summary <- penguins %>%
  group_by(species) %>%
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
            mean_bill_depth = mean(bill_depth_mm, na.rm = TRUE))
print(penguins_summary)
# A tibble: 3 × 3
  species   mean_bill_length mean_bill_depth
  <fct>                <dbl>           <dbl>
1 Adelie                38.8            18.3
2 Chinstrap             48.8            18.4
3 Gentoo                47.6            15.0

Another neat way to summarize the data: the skimr R package

skim(penguins)
Data summary
Name penguins
Number of rows 333
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1 FALSE 3 Ade: 146, Gen: 119, Chi: 68
island 0 1 FALSE 3 Bis: 163, Dre: 123, Tor: 47
sex 0 1 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 0 1 43.99 5.47 32.1 39.5 44.5 48.6 59.6 ▃▇▇▆▁
bill_depth_mm 0 1 17.16 1.97 13.1 15.6 17.3 18.7 21.5 ▅▆▇▇▂
flipper_length_mm 0 1 200.97 14.02 172.0 190.0 197.0 213.0 231.0 ▂▇▃▅▃
body_mass_g 0 1 4207.06 805.22 2700.0 3550.0 4050.0 4775.0 6300.0 ▃▇▅▃▂
year 0 1 2008.04 0.81 2007.0 2007.0 2008.0 2009.0 2009.0 ▇▁▇▁▇

Select specific columns of the skim output

skim(penguins) %>%
  dplyr::select(skim_type, skim_variable, n_missing)
Data summary
Name penguins
Number of rows 333
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing
species 0
island 0
sex 0

Variable type: numeric

skim_variable n_missing
bill_length_mm 0
bill_depth_mm 0
flipper_length_mm 0
body_mass_g 0
year 0

Convert the skim output to a tibble

skim(penguins) %>%
  tibble::as_tibble()
# A tibble: 8 × 15
  skim_type skim_variable n_missing complete_rate factor.ordered factor.n_unique
  <chr>     <chr>             <int>         <dbl> <lgl>                    <int>
1 factor    species               0             1 FALSE                        3
2 factor    island                0             1 FALSE                        3
3 factor    sex                   0             1 FALSE                        2
4 numeric   bill_length_…         0             1 NA                          NA
5 numeric   bill_depth_mm         0             1 NA                          NA
6 numeric   flipper_leng…         0             1 NA                          NA
7 numeric   body_mass_g           0             1 NA                          NA
8 numeric   year                  0             1 NA                          NA
# ℹ 9 more variables: factor.top_counts <chr>, numeric.mean <dbl>,
#   numeric.sd <dbl>, numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>,
#   numeric.p75 <dbl>, numeric.p100 <dbl>, numeric.hist <chr>

Get summary statistics for a specific variable.

skim(penguins) %>%
  dplyr::filter(skim_variable == "bill_depth_mm")
Data summary
Name penguins
Number of rows 333
Number of columns 8
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_depth_mm 0 1 17.16 1.97 13.1 15.6 17.3 18.7 21.5 ▅▆▇▇▂

Get summary statistics by group/category [species].

penguins %>%
  dplyr::group_by(species) %>%
  skim()
Data summary
Name Piped data
Number of rows 333
Number of columns 8
_______________________
Column type frequency:
factor 2
numeric 5
________________________
Group variables species

Variable type: factor

skim_variable species n_missing complete_rate ordered n_unique top_counts
island Adelie 0 1 FALSE 3 Dre: 55, Tor: 47, Bis: 44
island Chinstrap 0 1 FALSE 1 Dre: 68, Bis: 0, Tor: 0
island Gentoo 0 1 FALSE 1 Bis: 119, Dre: 0, Tor: 0
sex Adelie 0 1 FALSE 2 fem: 73, mal: 73
sex Chinstrap 0 1 FALSE 2 fem: 34, mal: 34
sex Gentoo 0 1 FALSE 2 mal: 61, fem: 58

Variable type: numeric

skim_variable species n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm Adelie 0 1 38.82 2.66 32.1 36.73 38.85 40.77 46.0 ▁▆▇▆▁
bill_length_mm Chinstrap 0 1 48.83 3.34 40.9 46.35 49.55 51.08 58.0 ▂▇▇▅▁
bill_length_mm Gentoo 0 1 47.57 3.11 40.9 45.35 47.40 49.60 59.6 ▃▇▇▁▁
bill_depth_mm Adelie 0 1 18.35 1.22 15.5 17.50 18.40 19.00 21.5 ▂▆▇▃▂
bill_depth_mm Chinstrap 0 1 18.42 1.14 16.4 17.50 18.45 19.40 20.8 ▅▇▇▆▂
bill_depth_mm Gentoo 0 1 15.00 0.99 13.1 14.20 15.00 15.75 17.3 ▅▇▇▆▂
flipper_length_mm Adelie 0 1 190.10 6.52 172.0 186.00 190.00 195.00 210.0 ▁▆▇▅▁
flipper_length_mm Chinstrap 0 1 195.82 7.13 178.0 191.00 196.00 201.00 212.0 ▁▅▇▅▂
flipper_length_mm Gentoo 0 1 217.24 6.59 203.0 212.00 216.00 221.50 231.0 ▂▇▇▆▃
body_mass_g Adelie 0 1 3706.16 458.62 2850.0 3362.50 3700.00 4000.00 4775.0 ▅▇▇▃▂
body_mass_g Chinstrap 0 1 3733.09 384.34 2700.0 3487.50 3700.00 3950.00 4800.0 ▁▅▇▃▁
body_mass_g Gentoo 0 1 5092.44 501.48 3950.0 4700.00 5050.00 5500.00 6300.0 ▃▇▇▇▂
year Adelie 0 1 2008.05 0.81 2007.0 2007.00 2008.00 2009.00 2009.0 ▇▁▇▁▇
year Chinstrap 0 1 2007.97 0.86 2007.0 2007.00 2008.00 2009.00 2009.0 ▇▁▆▁▇
year Gentoo 0 1 2008.07 0.79 2007.0 2007.00 2008.00 2009.00 2009.0 ▆▁▇▁▇

Get summary statistics by group/category [island].

penguins %>%
  dplyr::group_by(island) %>%
  skim()
Data summary
Name Piped data
Number of rows 333
Number of columns 8
_______________________
Column type frequency:
factor 2
numeric 5
________________________
Group variables island

Variable type: factor

skim_variable island n_missing complete_rate ordered n_unique top_counts
species Biscoe 0 1 FALSE 2 Gen: 119, Ade: 44, Chi: 0
species Dream 0 1 FALSE 2 Chi: 68, Ade: 55, Gen: 0
species Torgersen 0 1 FALSE 1 Ade: 47, Chi: 0, Gen: 0
sex Biscoe 0 1 FALSE 2 mal: 83, fem: 80
sex Dream 0 1 FALSE 2 mal: 62, fem: 61
sex Torgersen 0 1 FALSE 2 fem: 24, mal: 23

Variable type: numeric

skim_variable island n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm Biscoe 0 1 45.25 4.83 34.5 41.85 45.8 48.75 59.6 ▃▅▇▃▁
bill_length_mm Dream 0 1 44.22 5.95 32.1 39.20 45.2 49.90 58.0 ▅▇▆▇▁
bill_length_mm Torgersen 0 1 39.04 3.03 33.5 36.65 39.0 41.10 46.0 ▅▅▇▅▂
bill_depth_mm Biscoe 0 1 15.91 1.83 13.1 14.50 15.6 17.00 21.1 ▇▇▃▃▁
bill_depth_mm Dream 0 1 18.34 1.14 15.5 17.50 18.4 19.00 21.2 ▁▃▇▅▁
bill_depth_mm Torgersen 0 1 18.45 1.35 15.9 17.45 18.4 19.25 21.5 ▃▇▇▃▃
flipper_length_mm Biscoe 0 1 209.56 14.28 172.0 198.50 213.0 220.00 231.0 ▁▃▁▇▅
flipper_length_mm Dream 0 1 193.19 7.43 178.0 188.00 193.0 198.00 212.0 ▂▇▇▅▂
flipper_length_mm Torgersen 0 1 191.53 6.22 176.0 187.50 191.0 195.50 210.0 ▂▅▇▃▁
body_mass_g Biscoe 0 1 4719.17 790.86 2850.0 4200.00 4800.0 5350.00 6300.0 ▂▅▇▇▃
body_mass_g Dream 0 1 3718.90 412.94 2700.0 3412.50 3700.0 3962.50 4800.0 ▁▆▇▃▂
body_mass_g Torgersen 0 1 3708.51 451.85 2900.0 3337.50 3700.0 4000.00 4700.0 ▅▇▇▅▃
year Biscoe 0 1 2008.09 0.78 2007.0 2007.00 2008.0 2009.00 2009.0 ▆▁▇▁▇
year Dream 0 1 2007.99 0.85 2007.0 2007.00 2008.0 2009.00 2009.0 ▇▁▆▁▇
year Torgersen 0 1 2008.02 0.82 2007.0 2007.00 2008.0 2009.00 2009.0 ▇▁▇▁▇

Get summary statistics by group/category [sex].

penguins %>%
  dplyr::group_by(sex) %>%
  skim()
Data summary
Name Piped data
Number of rows 333
Number of columns 8
_______________________
Column type frequency:
factor 2
numeric 5
________________________
Group variables sex

Variable type: factor

skim_variable sex n_missing complete_rate ordered n_unique top_counts
species female 0 1 FALSE 3 Ade: 73, Gen: 58, Chi: 34
species male 0 1 FALSE 3 Ade: 73, Gen: 61, Chi: 34
island female 0 1 FALSE 3 Bis: 80, Dre: 61, Tor: 24
island male 0 1 FALSE 3 Bis: 83, Dre: 62, Tor: 23

Variable type: numeric

skim_variable sex n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm female 0 1 42.10 4.90 32.1 37.60 42.80 46.20 58.0 ▅▅▇▂▁
bill_length_mm male 0 1 45.85 5.37 34.6 40.98 46.80 50.32 59.6 ▅▇▆▇▁
bill_depth_mm female 0 1 16.43 1.80 13.1 14.50 17.00 17.80 20.7 ▇▃▇▇▁
bill_depth_mm male 0 1 17.89 1.86 14.1 16.08 18.45 19.25 21.5 ▃▅▅▇▂
flipper_length_mm female 0 1 197.36 12.50 172.0 187.00 193.00 210.00 222.0 ▂▇▃▃▃
flipper_length_mm male 0 1 204.51 14.55 178.0 193.00 200.50 219.00 231.0 ▂▇▃▃▅
body_mass_g female 0 1 3862.27 666.17 2700.0 3350.00 3650.00 4550.00 5200.0 ▃▇▂▃▃
body_mass_g male 0 1 4545.68 787.63 3250.0 3900.00 4300.00 5312.50 6300.0 ▅▇▂▅▂
year female 0 1 2008.04 0.81 2007.0 2007.00 2008.00 2009.00 2009.0 ▇▁▇▁▇
year male 0 1 2008.04 0.81 2007.0 2007.00 2008.00 2009.00 2009.0 ▇▁▇▁▇

Creating visualizations

Pie chart

  • Pie charts: are used in illustrating the proportion or distribution of different categories within the dataset.

  • It can be used to visualize the distribution of categorical data, identify dominant categories, and observe any major disparities in the dataset.

Pie chart for the distribution of Islands

ggplot(penguins, aes(x = "", fill = island)) +
  geom_bar(width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Islands", fill = "Island")

Bar plot

  • Bar Plot: helps in visualizing the distribution of categorical data. From it, categories that are more/less frequent can easily be identified and insightful comparison of the frequencies of different categories to draw conclusions from the data can be made.

Bar plot for Species distribution

ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar() +
  labs(title = "Species Distribution", x = "Species", y = "Count") +
  theme(legend.position = "none")

Bar Plot of Mean Body Mass by Species and Sex

# Mean measurements for each species/sex combination
penguins_by_sex <- penguins %>% 
  filter(!is.na(sex)) %>% 
  group_by(species, sex) %>% 
  summarise(bill_depth_mm = mean(bill_depth_mm),
            bill_length_mm = mean(bill_length_mm),
            flipper_length_mm = mean(flipper_length_mm),
            body_mass_g = mean(body_mass_g),
            .groups = "drop")

ggplot(penguins_by_sex, aes(x = species, y = body_mass_g, fill = sex)) +
  geom_col(position = position_dodge()) +
  labs(title = "Mean Body Mass by Species and Sex", x = "Species", y = "Mean Body Mass (g)")

Histogram

  • Histogram: gives a graphical representation of the distribution of a continuous or discrete numerical variable by counting the observations that fall into each bin.
  • Histograms help in uncovering patterns such as whether the data are symmetric, right/left skewed, or uni/bi/trimodal; they also help to locate the centre of the data, understand its spread, and detect outliers.
  • A careful look at a histogram also helps in making informed decisions on pre-processing methods, model selection, data transformation, etc.
  • A bar chart is more appropriate for categorical data or data with a small number of unique values, while kernel density plots or box plots can be better suited to comparing more complex distributions.
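
What a histogram computes can be sketched without drawing anything. The example below uses simulated right-skewed data; the exponential distribution and the bin width of 0.5 are assumptions chosen only to make the skew visible.

```r
# Simulated right-skewed data (not real measurements).
set.seed(123)
x <- rexp(1000, rate = 1)

# Bin edges of width 0.5 that cover the full range of the data.
breaks <- seq(0, ceiling(max(x)), by = 0.5)

# hist(..., plot = FALSE) returns the bin counts instead of drawing them.
h <- hist(x, breaks = breaks, plot = FALSE)
print(head(h$counts))

# For right-skewed data the long upper tail pulls the mean above the median.
cat("Mean:  ", round(mean(x), 2), "\n")
cat("Median:", round(median(x), 2), "\n")
```

Changing the bin width changes the counts and can change the apparent shape, which is why it is worth trying more than one value.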

Histogram of Body Mass

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 50, color = "black", fill = "lightblue") +
  labs(title = "Histogram of Body Mass (g)", x = "Body Mass (g)", y = "Frequency")

Histogram of Body Mass (g) by species

ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  geom_histogram(position='identity', alpha=0.5) +
  labs(title = "Histogram of Body Mass (g) by species", x = "Body Mass (g) by species", y = "Frequency")

Histogram of Bill length (mm) by species

ggplot(penguins, aes(x = bill_length_mm, fill = species)) +
  geom_histogram(position='identity', alpha=0.5) +
  labs(title = "Histogram of Bill length (mm) by species", x = "Bill length (mm) by species", y = "Frequency")

Histogram of Bill depth (mm) by species

ggplot(penguins, aes(x = bill_depth_mm, fill = species)) +
  geom_histogram(position='identity', alpha=0.5) +
  labs(title = "Histogram of Bill depth (mm) by species", x = "Bill depth (mm) by species", y = "Frequency")

Histogram of flipper length (mm) by species

ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(position='identity', alpha=0.5) +
  labs(title = "Histogram of flipper length (mm) by species", x = "Flipper length (mm) by species", y = "Frequency")

Box plot

  • Box plot: summarizes the distribution of a dataset and identifies potential outliers/extreme values in it.
  • It is very useful for comparing numerical data distributions across different categories or groups, and gives the 1st, 2nd (median) and 3rd quartiles of the distribution at a glance.
  • It also clearly displays outliers as individual points beyond the ends of the box’s whiskers (the lines that extend above and below the box).
  • The presence/absence of skewness can also be read from the position of the median within the box and the relative lengths of the whiskers.
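
The quantities a box plot displays can be computed by hand. This sketch uses ten made-up body masses and the common 1.5 × IQR rule for flagging outliers, which is also the default whisker rule in `geom_boxplot()` (`coef = 1.5`).

```r
# Ten hypothetical body masses in grams (made-up values for illustration).
x <- c(2850, 3300, 3450, 3700, 3800, 3950, 4050, 4300, 4700, 6300)

# Quartiles: the edges of the box and the median line inside it.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(x)

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

print(q)         # 3512.5, 3875, 4237.5
print(outliers)  # 6300
```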

Box plot of Body Mass by Species

ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  labs(title = "Box Plot of Body Mass by Species", x = "Species", y = "Body Mass (g)")

Violin plot

  • Violin plot: combines the idea of a box plot with a kernel density plot (a smoothed version of the histogram that estimates the data’s underlying probability density function), offering a more comprehensive view of the data distribution.
  • The width of each violin is proportional to the estimated density of the data at each value, and the shape is mirrored about its central axis. A violin whose bulk sits towards one end of the value axis suggests a skewed distribution, multiple bulges suggest multiple modes or subgroups, and a long thin tail suggests extreme values or outliers.

Violin plot for Bill Length by Species

ggplot(penguins, aes(x = species, y = bill_length_mm, fill = species)) +
  geom_violin() +
  labs(title = "Violin Plot of Bill Length by Species", x = "Species", y = "Bill Length (mm)")

Density plot

  • Density plot: helps in visualizing/understanding the distribution of a continuous variable through a smooth, continuous estimate of the probability density function (PDF) of the data.
  • It also helps in identifying patterns and comparing distributions between different groups/categories.
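
The estimate behind a density plot (and behind the outline of a violin) can be inspected directly with base R's `density()`. The data below are simulated as a mixture of two groups; the group means and spreads are assumptions made for the illustration.

```r
# Simulated mixture of two groups (not real measurements).
set.seed(1)
x <- c(rnorm(200, mean = 3700, sd = 400),
       rnorm(200, mean = 5100, sd = 500))

# Kernel density estimate with the default Gaussian kernel and bandwidth.
d <- density(x)

# A density curve should enclose an area of about 1 (trapezoidal rule).
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

cat("Bandwidth used:", round(d$bw, 1), "\n")
cat("Area under the curve:", round(area, 3), "\n")
```

A larger bandwidth gives a smoother curve that can hide separate peaks, while a smaller one gives a bumpier curve that can suggest peaks that are not really there.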

Density plot for Body Mass by Species

ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Body Mass by Species", x = "Body Mass (g)")

Scatter Plot

  • Scatter Plot: displays the relationship between two continuous variables, helping us to understand the
    • pattern (i.e. clusters, trends, or outliers),
    • direction (i.e. positive/negative slope/correlation),
    • strength (strong/weak) of the relationship between the variables.
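
Direction and strength can differ between the pooled data and the groups within it, which is one reason to colour points by group. The toy data below are constructed by hand purely to show this effect and do not describe the penguins.

```r
# Hand-made toy data: two groups, each with a perfectly increasing trend.
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x = c(1:5, 11:15),
  y = c(11:15, 1:5)
)

# Correlation when the groups are pooled together.
overall <- cor(toy$x, toy$y)

# Correlation within each group separately.
within <- sapply(split(toy, toy$group), function(d) cor(d$x, d$y))

cat("Overall correlation:", round(overall, 3), "\n")  # negative
print(within)                                         # both equal to 1
```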

Scatter Plot: Bill Length vs. Bill Depth

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  labs(title = "Scatter Plot: Bill Length vs. Bill Depth",
       x = "Bill Length (mm)",
       y = "Bill Depth (mm)")

Scatter plot of Flipper Length vs. Body Mass

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(title = "Flipper Length vs. Body Mass", x = "Flipper Length (mm)", y = "Body Mass (g)")

Jittered scatter plot of Flipper Length by Species, faceted by Sex

ggplot(penguins, aes(x = species, y = flipper_length_mm, color = species)) +
  geom_jitter(alpha = 0.7, width = 0.3) +
  facet_grid(. ~ sex) +
  labs(title = "Flipper Length by Species and Sex",
       x = "Species",
       y = "Flipper Length (mm)")

Summary

Depending on the kind of data you want to analyze, there are many more plots, serving different or similar purposes, that can uncover the distribution, relationships, patterns, and absence/presence of outliers/extreme values within the dataset you want to explore.

The following is a list of some such plots: Correlation plot, Heatmap, Line Plot, Pair Plot (Scatter Plot Matrix), Biplot, Time Series Plot, Area Plot, Stacked Area Chart, Hexbin Plot, Q-Q Plot (Quantile-Quantile Plot), Andrews Plot, Parallel Coordinates Plot, Network Plot (Graph Visualization), Choropleth Map, etc.
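
As one example from this list, a correlation plot starts from a correlation matrix. The sketch below builds one from `iris`, a dataset that ships with base R; using `iris` is an assumption made to keep the example self-contained, and the commented-out `corrplot()` call needs the corrplot package installed.

```r
# Keep only the numeric columns of the built-in iris dataset.
numeric_cols <- iris[, sapply(iris, is.numeric)]

# Pairwise Pearson correlations between all numeric columns.
# For data with missing values, cor(..., use = "complete.obs") drops incomplete rows.
cor_matrix <- cor(numeric_cols)
print(round(cor_matrix, 2))

# With the corrplot package installed, the matrix can be drawn as a plot:
# corrplot::corrplot(cor_matrix, method = "circle")
```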

Glossary

  • Data: Raw information gathered through observations, experiments, surveys, or other sources.

    • It can be numerical (quantitative i.e. amounts/measurements) or qualitative (categorical i.e. features/characteristics).
  • Population/Sample: Population is the complete group of interest, while sample is a subset of that population.

    • Since studying the entire population is not always practicable, analyzing a sample to reveal insights into the desired characteristics of the population becomes the whole essence of Statistics.
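    • Numerical characteristics of a Population (e.g. the population mean) are called Parameters, while the corresponding quantities computed from a Sample are called Statistics and serve as estimates of the parameters.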
  • Descriptive Statistics: are used to explore and describe important characteristics of a dataset, i.e.

  • Statistical Graphs and Charts: bar charts, line graphs, histograms, pie charts, scatter plots, etc. are visual representations of data that help in comprehending patterns and trends in datasets.

  • Inferential Statistics: involves drawing predictions/conclusions about a population based on the analysis of a sample drawn from that population.

  • Probability: is the likelihood that an event will occur. Probabilistic statements are often needed to quantify uncertainty in inferential statistics, both for prediction and for hypothesis testing.

  • Hypothesis testing: is a strategy for determining whether there is a statistically significant difference between two or more groups, or whether an observed effect is statistically significant.

    • It entails stating null and alternative hypotheses and running statistical tests to see whether the data provide enough evidence to reject the null hypothesis in favor of the alternative.
  • Correlation & Regression: Correlation assesses the strength and direction of a linear relationship between two continuous variables.

    • Regression develops a model to predict the value of a dependent variable based on the values of one or more independent variables.
    • Under certain conditions, Regression can help in identifying causal effects among variables.

Further Resources