Data Science for Social Impact

Linear Modeling

Mohommad Redad Mehdi, Erika I. Barcelos

Predictive Modeling

  • In a data science pipeline:
    • after the data is collected, explored, and cleaned,
    • we usually proceed with modeling

Predictive modeling is the process of using statistical and machine learning models and tools to:

  • understand a system or process,

  • identify patterns and trends in the data, and

  • use the gained insights to make predictions or forecast future outcomes or behaviors

An important requirement is that:

  • Data related to or derived from the project is available

    • the learning process takes place when the model learns from the data

Statistical, Machine and Deep Learning

Statistical, machine, and deep learning have different goals and requirements, and the choice among them depends on several factors, such as:

  • Amount of data available,

  • how important model interpretability is, and

  • the extent of manual effort required for feature selection

In statistical learning the main goal is to:

  • make inferences,

  • understand and quantify the relationships between variables in the data,

  • work with data sets of any size, including small and medium ones, and

  • retain high interpretability (low-complexity problems)

In machine learning the main focus is on:

  • finding generalizable models,

  • low interpretability, and

  • moderate complexity

In deep learning:

  • highly accurate predictions

  • very low interpretability

  • can handle very complex problems

Regression

  • A set of techniques used to predict a response (dependent variable) based on one or more predictors (independent variables)
  • It can be used to identify and quantify how strongly the predictors relate to the response
  • Depending on the types of the response and predictor variables, different regression models can be applied
  • Generalized Linear Models (GLM) represent a family of models containing the most common regression types, such as linear regression, logistic regression, and multiple linear regression

Regression

Linear & Multiple Linear Regression

Linear regression is a commonly used approach to make inferences and model relationships between predictors and response variables in a dataset

  • Linear regression: one predictor and the response
  • Multiple linear regression: multiple predictors and the response

Regression

Linear & Multiple Linear Regression

In linear and multiple linear regression the coefficients are determined by fitting the linear equation to the data

Linear Regression: \[y = ax + b\]

Multiple Linear Regression: \[y = a_1x_1 + a_2x_2 + \dots + a_nx_n + b\]

Regression

Linear & Multiple Linear Regression

Assumptions

There are four assumptions a linear regression model must adhere to (a sketch of how to check them follows this list):

  • There is a linear relationship between the mean response \(y\) and the explanatory variable \(x\)
  • The errors are independent - there is no connection between how far any two points lie from the regression line
  • The responses are normally distributed at each level of \(x\)
  • The variance of the responses is equal for all levels of \(x\)
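
These assumptions are commonly checked visually with the diagnostic plots of a fitted model. A minimal sketch, where fit, y, x, and df are hypothetical placeholders for your own model and data:

fit <- lm(y ~ x, data = df)  # hypothetical model: replace y, x, df with your own data
par(mfrow = c(2, 2))         # arrange the four diagnostic plots in a 2x2 grid
plot(fit)                    # residuals vs fitted (linearity, equal variance),
                             # Q-Q plot (normality), scale-location, leverage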

Demonstration of EDA: Palmerpenguins

Aim of this demo:

  • The goal of this demo is to find which variables best predict the weight of penguins
  • To quantify the relationship between different variables and the weight of penguins

Loading libraries & Data

Loading libraries

library(MASS)
library(ISLR2)
library(palmerpenguins)
library(tidyverse)
library(ggplot2)
library(GGally)
library(gridExtra)
library(kableExtra)
library(magrittr)

The palmerpenguins library contains the penguins data set.

Quick overview of the data

We will load in the penguins dataset and print out the first 5 instances.

attach(penguins)      # attach so columns can be referenced by name in later plot() calls
penguins <- penguins  # force the lazy-loaded dataset into the global environment
kable(head(penguins, n = 5),
    caption = "Palmer Penguins dataframe") %>%
  kable_styling(
    font_size = 15, 
    full_width = TRUE, 
    latex_options = "scale_down") %>%
  kable_classic(c("striped", "hover", "condensed"))
Palmer Penguins dataframe
species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie   Torgersen  39.1            18.7           181                3750         male    2007
Adelie   Torgersen  39.5            17.4           186                3800         female  2007
Adelie   Torgersen  40.3            18.0           195                3250         female  2007
Adelie   Torgersen  NA              NA             NA                 NA           NA      2007
Adelie   Torgersen  36.7            19.3           193                3450         female  2007

Quick overview of the data

To find out more about the data set, we can type str(penguins).

It tells us about the

  • various columns,

  • their data types,

  • additional information

str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

This tells us that there are 8 variables in total in the dataset.

Checking % of missing data

kable(colMeans(is.na(penguins))*100, booktabs = TRUE,
      col.names = "missing values") %>%
  kable_styling(
    font_size = 18, 
    full_width = FALSE, 
    latex_options = "scale_down") %>%
  kable_classic(c("striped", "hover"))
                   missing values (%)
species            0.0000000
island             0.0000000
bill_length_mm     0.5813953
bill_depth_mm      0.5813953
flipper_length_mm  0.5813953
body_mass_g        0.5813953
sex                3.1976744
year               0.0000000
  • All columns have less than 4% missing data
  • There are several strategies to handle missing data
    • Since the amount of missing data is below 4%, let's just remove the incomplete rows

Removing NA

penguins <- na.omit(penguins)
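
As a quick sanity check (not part of the original demo), we can confirm how many complete rows remain:

nrow(penguins)  # 333 complete rows remain out of the original 344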

Let’s look at our data!

  • Let us perform some basic EDA before modeling
  • So that we have a better understanding of the relationships between the variables

Let’s look at our data!

With the chunk of code below, we can examine numerical and graphical summaries of the relationships between the model covariates and the response

ggpairs(data = penguins,
        columns = c("species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "body_mass_g"))

  • The relationship between two continuous variables is depicted with scatterplots below the diagonal and correlation coefficients above the diagonal.

  • From this graph, we can see that there is a strong correlation between body mass and bill length, body mass and bill depth, and body mass and flipper length; the correlation matrix below quantifies this.
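
To back the visual impression with numbers, we can compute the correlation matrix directly. A minimal sketch using dplyr (already loaded via tidyverse):

penguins %>%
  select(where(is.numeric)) %>%  # keep the numeric measurement columns
  cor() %>%
  round(2)                       # flipper_length_mm correlates most strongly with body_mass_g (~0.87)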

Modeling

First we will check the relationship between each of the bill and flipper variables individually.

Body Mass with Bill Length

model_1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
summary(model_1)

Call:
lm(formula = body_mass_g ~ bill_length_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1759.38  -468.82    27.79   464.20  1641.00 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     388.845    289.817   1.342    0.181    
bill_length_mm   86.792      6.538  13.276   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 651.4 on 331 degrees of freedom
Multiple R-squared:  0.3475,    Adjusted R-squared:  0.3455 
F-statistic: 176.2 on 1 and 331 DF,  p-value: < 2.2e-16

A few observations from the summary of our fit:

  • The Adjusted R-squared is 0.3455, which means about 34.55% of the variability in the response variable (body mass) is explained by the independent variable (bill_length) included in the model.

  • The Residual standard error is 651.4, which means that, on average, the predicted body mass values from the model deviate from the actual values by approximately 651.4 units.

  • The p-value is almost 0, i.e., the model is statistically significant (see more about statistical significance and p-values in this book); we can also use the fitted model to quantify coefficient uncertainty and make predictions, as sketched below
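
A short sketch of both (the 45 mm bill length is a made-up example value, not from the demo):

confint(model_1)  # 95% confidence intervals for the intercept and slope
predict(model_1,
        newdata  = data.frame(bill_length_mm = 45),
        interval = "prediction")  # predicted body mass with a 95% prediction interval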

Modeling

Body Mass with Bill Length

We will now plot bill_length_mm and body_mass_g along with the least squares regression line using the plot() and abline() functions.

plot(bill_length_mm, body_mass_g)
abline(model_1)

  • We can observe that the fit of our model is relatively good
  • There is also a hint of non-linearity, which we can probe with a residual plot
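
One way to probe this hint is the residuals-vs-fitted diagnostic plot; a curved pattern in the residuals suggests a non-linear relationship:

plot(model_1, which = 1)  # residuals vs fitted values; curvature indicates non-linearity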

Modeling

Body Mass with Bill Depth

model_2 <- lm(body_mass_g ~ bill_depth_mm, data = penguins)
summary(model_2)

Call:
lm(formula = body_mass_g ~ bill_depth_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1616.08  -510.38   -68.67   486.51  1811.12 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7519.98     342.33  21.967   <2e-16 ***
bill_depth_mm  -193.01      19.81  -9.741   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 710.9 on 331 degrees of freedom
Multiple R-squared:  0.2228,    Adjusted R-squared:  0.2205 
F-statistic: 94.89 on 1 and 331 DF,  p-value: < 2.2e-16
  • This model only accounts for ~22% of the variability.
  • The residual standard error is greater than that of model 1.

Modeling

Body Mass with Bill Depth

plot(bill_depth_mm, body_mass_g)
abline(model_2)

  • The fit is not very good
  • The predicted values are far from the actual ones

Modeling

Body Mass with Flipper Length

model_3 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
summary(model_3)

Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1057.33  -259.79   -12.24   242.97  1293.89 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5872.09     310.29  -18.93   <2e-16 ***
flipper_length_mm    50.15       1.54   32.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393.3 on 331 degrees of freedom
Multiple R-squared:  0.7621,    Adjusted R-squared:  0.7614 
F-statistic:  1060 on 1 and 331 DF,  p-value: < 2.2e-16
  • This model explains ~76% of the variability in our response variable
  • The residual standard error is the lowest of the three models

Modeling

Body Mass with Flipper Length

plot(flipper_length_mm, body_mass_g)
abline(model_3)

  • Shows a good fit
  • Best out of the three models

Multiple Linear Regression

  • As you can see, all 3 models are statistically significant (p-value < 0.05)

  • Next, let's try adding multiple independent variables to the formula. In order to fit a multiple linear regression model using least squares, we again use the lm() function

The syntax \(lm(y \sim x1 + x2 + x3)\) is used to fit a model with three predictors, x1, x2, and x3

  • The summary() function now outputs the regression coefficients for all the predictors

Multiple Linear Regression

model_4 <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = penguins)
summary(model_4)

Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1806.74  -456.82    15.33   471.07  1541.34 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3413.452    437.911   7.795 8.50e-14 ***
bill_length_mm   74.813      6.076  12.313  < 2e-16 ***
bill_depth_mm  -145.507     16.873  -8.624 2.76e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 589.4 on 330 degrees of freedom
Multiple R-squared:  0.4675,    Adjusted R-squared:  0.4642 
F-statistic: 144.8 on 2 and 330 DF,  p-value: < 2.2e-16

Adding all variables

If we consider that all the variables in the dataset are relevant and want to include them as predictors, we can use the simple shorthand \(lm(y \sim .)\).

  • Instead of typing them out individually as \(lm(y \sim x1 + x2 +...)\).

  • So, in the case of the penguins dataset, if we want to include all the remaining variables as predictors of body mass, we can simply type:

model_5 <- lm(body_mass_g ~ . , data = penguins)
summary(model_5)

Call:
lm(formula = body_mass_g ~ ., data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-809.70 -180.87   -6.25  176.76  864.22 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       84087.945  41912.019   2.006  0.04566 *  
speciesChinstrap   -282.539     88.790  -3.182  0.00160 ** 
speciesGentoo       890.958    144.563   6.163 2.12e-09 ***
islandDream         -21.180     58.390  -0.363  0.71704    
islandTorgersen     -58.777     60.852  -0.966  0.33482    
bill_length_mm       18.964      7.112   2.667  0.00805 ** 
bill_depth_mm        60.798     20.002   3.040  0.00256 ** 
flipper_length_mm    18.504      3.128   5.915 8.46e-09 ***
sexmale             378.977     48.074   7.883 4.95e-14 ***
year                -42.785     20.949  -2.042  0.04194 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 286.5 on 323 degrees of freedom
Multiple R-squared:  0.8768,    Adjusted R-squared:  0.8734 
F-statistic: 255.4 on 9 and 323 DF,  p-value: < 2.2e-16

Interaction Terms

  • It is easy to include interaction terms in a linear model using the lm() function.

  • The syntax bill_depth_mm:bill_length_mm tells R to include an interaction term between bill_depth_mm and bill_length_mm.

  • The syntax bill_depth_mm * bill_length_mm simultaneously includes bill_depth_mm, bill_length_mm, and the interaction term bill_depth_mm\(\times\)bill_length_mm as predictors; it is a shorthand for bill_depth_mm + bill_length_mm + bill_depth_mm:bill_length_mm.

We can also pass in transformed versions of the predictors.
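
For instance, a quadratic term can be added with I() and a log transform applied directly in the formula. A minimal sketch (these particular transformations are illustrative, not part of the original analysis):

model_sq <- lm(body_mass_g ~ flipper_length_mm + I(flipper_length_mm^2),
               data = penguins)                                       # quadratic term via I()
model_log <- lm(body_mass_g ~ log(bill_length_mm), data = penguins)  # log-transformed predictor
summary(model_sq)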

Interaction Terms

In practice

model_6 <- lm(body_mass_g ~ bill_depth_mm * bill_length_mm, data = penguins)
summary(model_6)

Call:
lm(formula = body_mass_g ~ bill_depth_mm * bill_length_mm, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-1814.0  -353.6     2.9   352.4  1603.6 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -25718.181   2696.211  -9.539   <2e-16 ***
bill_depth_mm                  1493.496    150.910   9.897   <2e-16 ***
bill_length_mm                  718.853     59.256  12.131   <2e-16 ***
bill_depth_mm:bill_length_mm    -36.317      3.328 -10.911   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 505.8 on 329 degrees of freedom
Multiple R-squared:  0.609, Adjusted R-squared:  0.6054 
F-statistic: 170.8 on 3 and 329 DF,  p-value: < 2.2e-16

Interaction Terms

In practice

As we saw in our correlation plot, the body mass was heavily correlated with three variables. Let’s use all three for our model now:

model_7 <- lm(body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm, data = penguins)
summary(model_7)

Call:
lm(formula = body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm, 
    data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-993.49 -247.75  -37.51  226.17  976.89 

Coefficients:
                                                 Estimate Std. Error t value
(Intercept)                                    -6.317e+04  3.843e+04  -1.644
bill_depth_mm                                   3.872e+03  2.258e+03   1.715
bill_length_mm                                  8.467e+02  8.879e+02   0.954
flipper_length_mm                               2.843e+02  1.903e+02   1.494
bill_depth_mm:bill_length_mm                   -5.928e+01  5.207e+01  -1.138
bill_depth_mm:flipper_length_mm                -1.667e+01  1.122e+01  -1.485
bill_length_mm:flipper_length_mm               -3.118e+00  4.365e+00  -0.714
bill_depth_mm:bill_length_mm:flipper_length_mm  2.385e-01  2.570e-01   0.928
                                               Pr(>|t|)  
(Intercept)                                      0.1012  
bill_depth_mm                                    0.0873 .
bill_length_mm                                   0.3410  
flipper_length_mm                                0.1362  
bill_depth_mm:bill_length_mm                     0.2558  
bill_depth_mm:flipper_length_mm                  0.1385  
bill_length_mm:flipper_length_mm                 0.4756  
bill_depth_mm:bill_length_mm:flipper_length_mm   0.3542  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 354.4 on 325 degrees of freedom
Multiple R-squared:  0.8104,    Adjusted R-squared:  0.8063 
F-statistic: 198.4 on 7 and 325 DF,  p-value: < 2.2e-16

Interaction Terms

In practice

If there is evidence of a non-linear relationship, we can also pass transformed versions of predictors.

We use the anova() function to further quantify the quality of the fit.

anova(model_1, model_2, model_3, model_4, model_5, model_6, model_7)
Analysis of Variance Table

Model 1: body_mass_g ~ bill_length_mm
Model 2: body_mass_g ~ bill_depth_mm
Model 3: body_mass_g ~ flipper_length_mm
Model 4: body_mass_g ~ bill_length_mm + bill_depth_mm
Model 5: body_mass_g ~ species + island + bill_length_mm + bill_depth_mm + 
    flipper_length_mm + sex + year
Model 6: body_mass_g ~ bill_depth_mm * bill_length_mm
Model 7: body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm
  Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
1    331 140467133                                  
2    331 167300074  0 -26832941                     
3    331  51211963  0 116088111                     
4    330 114633409  1 -63421446                     
5    323  26517018  7  88116390 153.33 < 2.2e-16 ***
6    329  84173822 -6 -57656804 117.05 < 2.2e-16 ***
7    325  40819662  4  43354160 132.02 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The above variance analysis shows the results for all our models
  • The p-values for models 5, 6, and 7 are practically zero
  • Model 5 contains all the variables, model 6 contains only the interaction term of bill_length_mm and bill_depth_mm, and model 7 contains the interaction terms of flipper_length_mm, bill_length_mm, and bill_depth_mm

Let’s compare only these three:

anova(model_5, model_6, model_7)
Analysis of Variance Table

Model 1: body_mass_g ~ species + island + bill_length_mm + bill_depth_mm + 
    flipper_length_mm + sex + year
Model 2: body_mass_g ~ bill_depth_mm * bill_length_mm
Model 3: body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm
  Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
1    323 26517018                                  
2    329 84173822 -6 -57656804 117.05 < 2.2e-16 ***
3    325 40819662  4  43354160 132.02 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Here the \(F\)-statistic is \(132\) and the associated \(p\)-value is virtually zero
  • From the summaries of our models above, model 5 appears to be better in every respect
  • It has a lower residual standard error and a higher Adjusted R-squared
  • Note that each row of the anova table compares a model with the one listed above it: the \(F = 132\) row tests model 7 against model 6, and shows a significant improvement (see the direct comparison below)
  • Also, statistical significance does not always mean practical significance
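
Since model 6's terms are a subset of model 7's, the two are nested, and a direct pairwise F-test is valid:

anova(model_6, model_7)  # tests whether the flipper_length_mm interaction terms improve the fit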

Intro to Generalised Linear Modeling

Recall that in a linear regression model,

  • the objective is to model the expected value of a continuous variable, \(Y\),
  • as a linear function of the predictors via the linear predictor \(\eta = X\beta\).

The model structure is thus: \(Y = X\beta + \epsilon\),

  • where \(\epsilon\) refers to the residual error term.

Intro to Generalised Linear Modeling

The linear regression model assumes that \(Y\)

  • is continuous and comes from a normal distribution,
  • that \(\epsilon\) is normally distributed, and
  • that the relationship between the linear predictor \(\eta\) and
  • the expected outcome \(E(Y)\) is strictly linear.

However, these assumptions are easily violated in many real world data examples,

  • such as those with binary or proportional outcome variables and
  • those with non-linear relationships between
    • the predictors and the outcome variable.

In these scenarios

  • where linear regression models are clearly inappropriate,
  • Generalised linear models (GLM) are needed.
  • The GLM is the generalised version of linear regression
  • that allows for deviations from the assumptions underlying linear regression.

The GLM generalises linear regression

  • by assuming the dependent variable \(Y\)
    • to be generated from any particular distribution in an exponential family
    • (a large class of probability distributions that includes
    • the normal, binomial, Poisson and gamma distributions, among others)

In this way, the distribution of \(Y\)

  • does not necessarily have to be normal.

In addition, the GLM allows the linear predictor \(\eta\)

  • to be connected to the expected value of the outcome variable, \(E(Y)\),
  • via a link function \(g(.)\).

The outcome variable, \(Y\), therefore, depends on \(\eta\)

  • through \(E(Y) = g^{-1}(\eta) = g^{-1}(X\beta)\).
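
For example, with the logit link used in logistic regression, where \(\mu = E(Y)\):

\[ g(\mu) = \log\left(\frac{\mu}{1-\mu}\right), \qquad E(Y) = g^{-1}(X\beta) = \frac{1}{1 + e^{-X\beta}} \]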

In this way, the model does not assume

  • a linear relationship between \(E(Y)\) and \(\eta\);
  • instead, the model assumes a linear relationship
    • between the transformed mean \(g(E(Y))\) and the linear predictor \(\eta = X\beta\).

GLM Modeling

  • A GLM with a Gaussian family and identity link function is equivalent to linear regression
  • If we model the data using such a GLM, we obtain the same results, as the sketch below shows
  • Other examples of GLMs are logistic regression (logit link function) and Poisson regression (log link function)
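
A minimal sketch of both cases using the glm() function (the model names here are our own, not part of the demo above):

# Gaussian family with identity link: reproduces the linear regression of model_3
model_glm <- glm(body_mass_g ~ flipper_length_mm,
                 family = gaussian(link = "identity"), data = penguins)
coef(model_glm)   # same coefficients as coef(model_3)

# Binomial family with logit link: logistic regression for a binary outcome
model_logit <- glm(sex ~ body_mass_g,
                   family = binomial(link = "logit"), data = penguins)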

Glossary

  • Linear Regression: Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship and aims to find the best-fitting line (linear equation) that minimizes the difference between the observed and predicted values.

  • Multiple Linear Regression: Multiple linear regression extends the concept of linear regression to include multiple independent variables. It models the relationship between a dependent variable and multiple independent variables while assuming a linear relationship.

  • Generalized Linear Models (GLMs): Generalized linear models are a broader class of regression models that accommodate a wide range of response variable distributions beyond the normal distribution. They handle situations where the relationship between variables is not strictly linear, such as when the response variable is binary or count data.

  • Statistical Significance: Statistical significance refers to the likelihood that an observed effect in data is not due to random chance. It is assessed using p-values, and a result is considered statistically significant if the p-value is below a pre-defined threshold (often denoted by an alpha level, such as 0.05).

  • Residuals: Residuals are the differences between observed and predicted values in a regression model. They represent the unexplained variation in the dependent variable and are used to assess the model’s goodness of fit.

  • Standard Errors: Standard errors quantify the variability of estimated coefficients in a regression model. They provide a measure of the uncertainty associated with the estimated relationships between variables.

  • p-Values: p-values indicate the probability of observing a test statistic as extreme as the one calculated, assuming that the null hypothesis is true. A small p-value (typically less than 0.05) suggests that the observed effect is unlikely to occur by chance alone, leading to the rejection of the null hypothesis.

  • F-Statistic: The F-statistic is used in analysis of variance (ANOVA) and regression analysis to compare the fit of different models. It assesses whether there is a significant difference in the variation explained by the model compared to the null hypothesis.

  • Adjusted R-squared: Adjusted R-squared is a modified version of the R-squared statistic that accounts for the number of predictors in a model. It penalizes the addition of unnecessary predictors, aiming to provide a more balanced measure of model fit that considers both explanatory power and model complexity.

  • Multiple R-squared: Multiple R-squared (R²) is a statistic that measures the proportion of variability in the dependent variable explained by a regression model’s independent variables. It increases as the model’s ability to explain variability improves.