Data Science for Social Impact

Linear Modeling

Mohommad Redad Mehdi, Erika I. Barcelos

Predictive Modeling

  • In a data science pipeline:
    • after the data is collected, explored, and cleaned,
    • we usually proceed with modeling

Predictive modeling is the process of using statistical and machine learning models and tools to:

  • understand a system or process,

  • identify patterns and trends in the data, and

  • use the gained insights to make predictions or forecast future outcomes or behaviors

An important requirement is that:

  • Data related to or derived from the project is available

    • the learning process takes place when the model learns from the data

Statistical, Machine and Deep Learning

Statistical, machine, and deep learning have different goals and requirements, and the choice among them depends on several factors, such as:

  • Amount of data available,

  • how important model interpretability is, and

  • the extent of manual effort required for feature selection

In statistical learning the main goal is to:

  • make inferences,

  • understand and quantify the relationships between variables in the data,

  • work with data sets of any size, including small and medium ones, and

  • retain high interpretability (low-complexity problems)

In machine learning the main focus is on:

  • finding generalizable models,

  • low interpretability, and

  • moderate complexity

In deep learning:

  • highly accurate predictions

  • very low interpretability

  • can handle very complex problems

Regression

  • A set of techniques used to predict a response (dependent variable) based on one or more predictors (independent variables)
  • It can be used to identify and quantify how strongly the predictors relate to the response
  • Depending on the types of the response and predictor variables, different regression models can be applied
  • Generalized Linear Models (GLM) represent a family of models containing the most common regression types, such as linear regression, logistic regression, and multiple linear regression

Regression

Linear & Multiple Linear Regression

Linear regression is a commonly used approach to make inferences and model relationships between predictors and response variables in a dataset

  • Linear regression: one predictor and the response
  • Multiple linear regression: multiple predictors and the response

Regression

Linear & Multiple Linear Regression

In linear and multiple linear regression the coefficients are determined by fitting the linear equation to the data

Linear Regression: \[y = ax + b\]

Multiple Linear Regression: \[y = a_1x_1 + a_2x_2 + \dots + a_nx_n + b\]

Regression

Linear & Multiple Linear Regression

Assumptions

There are four assumptions a linear regression model must adhere to (a sketch of how to check them follows this list):

  • There is a linear relationship between the mean response \(y\) and the explanatory variable \(x\)
  • The errors are independent - there is no connection between how far any two points lie from the regression line
  • The responses are normally distributed at each level of \(x\)
  • The variance of the responses is equal for all levels of \(x\)
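
These assumptions are commonly checked visually with the diagnostic plots of a fitted model. A minimal sketch, where fit, y, x, and df are hypothetical placeholders for your own model and data:

fit <- lm(y ~ x, data = df)  # hypothetical model: replace y, x, df with your own data
par(mfrow = c(2, 2))         # arrange the four diagnostic plots in a 2x2 grid
plot(fit)                    # residuals vs fitted (linearity, equal variance),
                             # Q-Q plot (normality), scale-location, leverage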

Demonstration of EDA: Palmerpenguins

Aim of this demo:

  • The goal of this demo is to find which variables best predict the weight of penguins
  • To quantify the relationship between different variables and the weight of penguins

Loading libraries & Data

Loading libraries

library(MASS)
library(ISLR2)
library(palmerpenguins)
library(tidyverse)
library(ggplot2)
library(GGally)
library(gridExtra)
library(kableExtra)
library(magrittr)

The palmerpenguins library contains the penguins data set.

Quick overview of the data

We will load in the penguins dataset and print out the first 5 instances.

attach(penguins)      # attach so columns can be referenced by name in later plot() calls
penguins <- penguins  # force the lazy-loaded dataset into the global environment
kable(head(penguins, n = 5),
    caption = "Palmer Penguins dataframe") %>%
  kable_styling(
    font_size = 15, 
    full_width = TRUE, 
    latex_options = "scale_down") %>%
  kable_classic(c("striped", "hover", "condensed"))
Palmer Penguins dataframe
species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie   Torgersen  39.1            18.7           181                3750         male    2007
Adelie   Torgersen  39.5            17.4           186                3800         female  2007
Adelie   Torgersen  40.3            18.0           195                3250         female  2007
Adelie   Torgersen  NA              NA             NA                 NA           NA      2007
Adelie   Torgersen  36.7            19.3           193                3450         female  2007

Quick overview of the data

To find out more about the data set, we can type str(penguins).

It tells us about the

  • various columns,

  • their data types,

  • additional information

str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

This tells us that there are 8 variables in total in the dataset.

Checking % of missing data

kable(colMeans(is.na(penguins))*100, booktabs = TRUE,
      col.names = "missing values") %>%
  kable_styling(
    font_size = 18, 
    full_width = FALSE, 
    latex_options = "scale_down") %>%
  kable_classic(c("striped", "hover"))
                   missing values (%)
species            0.0000000
island             0.0000000
bill_length_mm     0.5813953
bill_depth_mm      0.5813953
flipper_length_mm  0.5813953
body_mass_g        0.5813953
sex                3.1976744
year               0.0000000
  • All columns have less than 4% missing data
  • There are several strategies to handle missing data
    • Since the amount of missing data is below 4%, let's just remove the incomplete rows

Removing NA

penguins <- na.omit(penguins)
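
As a quick sanity check (not part of the original demo), we can confirm how many complete rows remain:

nrow(penguins)  # 333 complete rows remain out of the original 344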

Let’s look at our data!

  • Let us perform some basic EDA before modeling
  • So that we have a better understanding of the relationships between the variables

Let’s look at our data!

With the chunk of code below, we can examine numerical and graphical summaries of the relationships between the model covariates and the response

ggpairs(data = penguins,
        columns = c("species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "body_mass_g"))

  • The relationship between two continuous variables is depicted with scatterplots below the diagonal and correlation coefficients above the diagonal.

  • From this graph, we can see that there is a strong correlation between body mass and bill length, body mass and bill depth, and body mass and flipper length; the correlation matrix below quantifies this.
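
To back the visual impression with numbers, we can compute the correlation matrix directly. A minimal sketch using dplyr (already loaded via tidyverse):

penguins %>%
  select(where(is.numeric)) %>%  # keep the numeric measurement columns
  cor() %>%
  round(2)                       # flipper_length_mm correlates most strongly with body_mass_g (~0.87)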

Modeling

First we will check the relationship between each of the bill and flipper variables individually.

Body Mass with Bill Length

model_1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
summary(model_1)

Call:
lm(formula = body_mass_g ~ bill_length_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1759.38  -468.82    27.79   464.20  1641.00 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     388.845    289.817   1.342    0.181    
bill_length_mm   86.792      6.538  13.276   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 651.4 on 331 degrees of freedom
Multiple R-squared:  0.3475,    Adjusted R-squared:  0.3455 
F-statistic: 176.2 on 1 and 331 DF,  p-value: < 2.2e-16

A few observations from the summary of our fit:

  • The Adjusted R-squared is 0.3455, which means about 34.55% of the variability in the response variable (body mass) is explained by the independent variable (bill_length) included in the model.

  • The Residual standard error is 651.4, which means that, on average, the predicted body mass values from the model deviate from the actual values by approximately 651.4 units.

  • The p-value is almost 0, i.e., the model is statistically significant (see more about statistical significance and p-values in this book); we can also use the fitted model to quantify coefficient uncertainty and make predictions, as sketched below
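
A short sketch of both (the 45 mm bill length is a made-up example value, not from the demo):

confint(model_1)  # 95% confidence intervals for the intercept and slope
predict(model_1,
        newdata  = data.frame(bill_length_mm = 45),
        interval = "prediction")  # predicted body mass with a 95% prediction interval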

Modeling

Body Mass with Bill Length

We will now plot bill_length_mm and body_mass_g along with the least squares regression line using the plot() and abline() functions.

plot(bill_length_mm, body_mass_g)
abline(model_1)

  • We can observe that the fit of our model is relatively good
  • There is also a hint of non-linearity, which we can probe with a residual plot
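
One way to probe this hint is the residuals-vs-fitted diagnostic plot; a curved pattern in the residuals suggests a non-linear relationship:

plot(model_1, which = 1)  # residuals vs fitted values; curvature indicates non-linearity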

Modeling

Body Mass with Bill Depth

model_2 <- lm(body_mass_g ~ bill_depth_mm, data = penguins)
summary(model_2)

Call:
lm(formula = body_mass_g ~ bill_depth_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1616.08  -510.38   -68.67   486.51  1811.12 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7519.98     342.33  21.967   <2e-16 ***
bill_depth_mm  -193.01      19.81  -9.741   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 710.9 on 331 degrees of freedom
Multiple R-squared:  0.2228,    Adjusted R-squared:  0.2205 
F-statistic: 94.89 on 1 and 331 DF,  p-value: < 2.2e-16
  • This model only accounts for ~22% of the variability.
  • The residual standard error is greater than that of model 1.

Modeling

Body Mass with Bill Depth

plot(bill_depth_mm, body_mass_g)
abline(model_2)

  • The fit is not very good
  • The predicted values are far from the actual ones

Modeling

Body Mass with Flipper Length

model_3 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
summary(model_3)

Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1057.33  -259.79   -12.24   242.97  1293.89 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5872.09     310.29  -18.93   <2e-16 ***
flipper_length_mm    50.15       1.54   32.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393.3 on 331 degrees of freedom
Multiple R-squared:  0.7621,    Adjusted R-squared:  0.7614 
F-statistic:  1060 on 1 and 331 DF,  p-value: < 2.2e-16
  • This model explains ~76% of the variability in our response variable
  • The residual standard error is the lowest of the three models

Modeling

Body Mass with Flipper Length

plot(flipper_length_mm, body_mass_g)
abline(model_3)

  • Shows a good fit
  • Best out of the three models

Multiple Linear Regression

  • As you can see, all 3 models are statistically significant (p-value < 0.05)

  • Next, let's try adding multiple independent variables to the formula. In order to fit a multiple linear regression model using least squares, we again use the lm() function

The syntax \(lm(y \sim x1 + x2 + x3)\) is used to fit a model with three predictors, x1, x2, and x3

  • The summary() function now outputs the regression coefficients for all the predictors

Multiple Linear Regression

model_4 <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = penguins)
summary(model_4)

Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1806.74  -456.82    15.33   471.07  1541.34 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3413.452    437.911   7.795 8.50e-14 ***
bill_length_mm   74.813      6.076  12.313  < 2e-16 ***
bill_depth_mm  -145.507     16.873  -8.624 2.76e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 589.4 on 330 degrees of freedom
Multiple R-squared:  0.4675,    Adjusted R-squared:  0.4642 
F-statistic: 144.8 on 2 and 330 DF,  p-value: < 2.2e-16

Adding all variables

If we consider that all the variables in the dataset are relevant and want to include them as predictors, we can use the simple shorthand \(lm(y \sim .)\).

  • Instead of typing them out individually as \(lm(y \sim x1 + x2 +...)\).

  • So, in the case of the penguins dataset, if we want to include all the remaining variables as predictors of body mass, we can simply type:

model_5 <- lm(body_mass_g ~ . , data = penguins)
summary(model_5)

Call:
lm(formula = body_mass_g ~ ., data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-809.70 -180.87   -6.25  176.76  864.22 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       84087.945  41912.019   2.006  0.04566 *  
speciesChinstrap   -282.539     88.790  -3.182  0.00160 ** 
speciesGentoo       890.958    144.563   6.163 2.12e-09 ***
islandDream         -21.180     58.390  -0.363  0.71704    
islandTorgersen     -58.777     60.852  -0.966  0.33482    
bill_length_mm       18.964      7.112   2.667  0.00805 ** 
bill_depth_mm        60.798     20.002   3.040  0.00256 ** 
flipper_length_mm    18.504      3.128   5.915 8.46e-09 ***
sexmale             378.977     48.074   7.883 4.95e-14 ***
year                -42.785     20.949  -2.042  0.04194 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 286.5 on 323 degrees of freedom
Multiple R-squared:  0.8768,    Adjusted R-squared:  0.8734 
F-statistic: 255.4 on 9 and 323 DF,  p-value: < 2.2e-16

Interaction Terms

  • It is easy to include interaction terms in a linear model using the lm() function.

  • The syntax bill_depth_mm:bill_length_mm tells R to include an interaction term between bill_depth_mm and bill_length_mm.

  • The syntax bill_depth_mm * bill_length_mm simultaneously includes bill_depth_mm, bill_length_mm, and the interaction term bill_depth_mm\(\times\)bill_length_mm as predictors; it is a shorthand for bill_depth_mm + bill_length_mm + bill_depth_mm:bill_length_mm.

We can also pass in transformed versions of the predictors.
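
For instance, a quadratic term can be added with I() and a log transform applied directly in the formula. A minimal sketch (these particular transformations are illustrative, not part of the original analysis):

model_sq <- lm(body_mass_g ~ flipper_length_mm + I(flipper_length_mm^2),
               data = penguins)                                       # quadratic term via I()
model_log <- lm(body_mass_g ~ log(bill_length_mm), data = penguins)  # log-transformed predictor
summary(model_sq)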

Interaction Terms

In practice

model_6 <- lm(body_mass_g ~ bill_depth_mm * bill_length_mm, data = penguins)
summary(model_6)

Call:
lm(formula = body_mass_g ~ bill_depth_mm * bill_length_mm, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-1814.0  -353.6     2.9   352.4  1603.6 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -25718.181   2696.211  -9.539   <2e-16 ***
bill_depth_mm                  1493.496    150.910   9.897   <2e-16 ***
bill_length_mm                  718.853     59.256  12.131   <2e-16 ***
bill_depth_mm:bill_length_mm    -36.317      3.328 -10.911   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 505.8 on 329 degrees of freedom
Multiple R-squared:  0.609, Adjusted R-squared:  0.6054 
F-statistic: 170.8 on 3 and 329 DF,  p-value: < 2.2e-16

Interaction Terms

In practice

As we saw in our correlation plot, the body mass was heavily correlated with three variables. Let’s use all three for our model now:

model_7 <- lm(body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm, data = penguins)
summary(model_7)

Call:
lm(formula = body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm, 
    data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-993.49 -247.75  -37.51  226.17  976.89 

Coefficients:
                                                 Estimate Std. Error t value
(Intercept)                                    -6.317e+04  3.843e+04  -1.644
bill_depth_mm                                   3.872e+03  2.258e+03   1.715
bill_length_mm                                  8.467e+02  8.879e+02   0.954
flipper_length_mm                               2.843e+02  1.903e+02   1.494
bill_depth_mm:bill_length_mm                   -5.928e+01  5.207e+01  -1.138
bill_depth_mm:flipper_length_mm                -1.667e+01  1.122e+01  -1.485
bill_length_mm:flipper_length_mm               -3.118e+00  4.365e+00  -0.714
bill_depth_mm:bill_length_mm:flipper_length_mm  2.385e-01  2.570e-01   0.928
                                               Pr(>|t|)  
(Intercept)                                      0.1012  
bill_depth_mm                                    0.0873 .
bill_length_mm                                   0.3410  
flipper_length_mm                                0.1362  
bill_depth_mm:bill_length_mm                     0.2558  
bill_depth_mm:flipper_length_mm                  0.1385  
bill_length_mm:flipper_length_mm                 0.4756  
bill_depth_mm:bill_length_mm:flipper_length_mm   0.3542  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 354.4 on 325 degrees of freedom
Multiple R-squared:  0.8104,    Adjusted R-squared:  0.8063 
F-statistic: 198.4 on 7 and 325 DF,  p-value: < 2.2e-16

Interaction Terms

In practice

If there is evidence of a non-linear relationship, we can also pass transformed versions of predictors.

We use the anova() function to further quantify the quality of the fit.

anova(model_1, model_2, model_3, model_4, model_5, model_6, model_7)
Analysis of Variance Table

Model 1: body_mass_g ~ bill_length_mm
Model 2: body_mass_g ~ bill_depth_mm
Model 3: body_mass_g ~ flipper_length_mm
Model 4: body_mass_g ~ bill_length_mm + bill_depth_mm
Model 5: body_mass_g ~ species + island + bill_length_mm + bill_depth_mm + 
    flipper_length_mm + sex + year
Model 6: body_mass_g ~ bill_depth_mm * bill_length_mm
Model 7: body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm
  Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
1    331 140467133                                  
2    331 167300074  0 -26832941                     
3    331  51211963  0 116088111                     
4    330 114633409  1 -63421446                     
5    323  26517018  7  88116390 153.33 < 2.2e-16 ***
6    329  84173822 -6 -57656804 117.05 < 2.2e-16 ***
7    325  40819662  4  43354160 132.02 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The above variance analysis shows the results for all our models
  • The p-values for models 5, 6, and 7 are practically zero
  • Model 5 contains all the variables, model 6 contains only the interaction term of bill_length_mm and bill_depth_mm, and model 7 contains the interaction terms of flipper_length_mm, bill_length_mm, and bill_depth_mm

Let’s compare only these three:

anova(model_5, model_6, model_7)
Analysis of Variance Table

Model 1: body_mass_g ~ species + island + bill_length_mm + bill_depth_mm + 
    flipper_length_mm + sex + year
Model 2: body_mass_g ~ bill_depth_mm * bill_length_mm
Model 3: body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm
  Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
1    323 26517018                                  
2    329 84173822 -6 -57656804 117.05 < 2.2e-16 ***
3    325 40819662  4  43354160 132.02 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Here the \(F\)-statistic is \(132\) and the associated \(p\)-value is virtually zero
  • From the summaries of our models above, model 5 appears to be better in every respect
  • It has a lower residual standard error and a higher Adjusted R-squared
  • Note that each row of the anova table compares a model with the one listed above it: the \(F = 132\) row tests model 7 against model 6, and shows a significant improvement (see the direct comparison below)
  • Also, statistical significance does not always mean practical significance
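
Since model 6's terms are a subset of model 7's, the two are nested, and a direct pairwise F-test is valid:

anova(model_6, model_7)  # tests whether the flipper_length_mm interaction terms improve the fit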

Intro to Generalised Linear Modeling

Recall that in a linear regression model,

  • the objective is to model the expected value of a continuous variable, \(Y\),
  • as a linear function of the predictors via the linear predictor \(\eta = X\beta\).

The model structure is thus: \(Y = X\beta + \epsilon\),

  • where \(\epsilon\) refers to the residual error term.

Intro to Generalised Linear Modeling

The linear regression model assumes that \(Y\)

  • is continuous and comes from a normal distribution,
  • that \(\epsilon\) is normally distributed, and
  • that the relationship between the linear predictor \(\eta\) and
  • the expected outcome \(E(Y)\) is strictly linear.

However, these assumptions are easily violated in many real world data examples,

  • such as those with binary or proportional outcome variables and
  • those with non-linear relationships between
    • the predictors and the outcome variable.

In these scenarios

  • where linear regression models are clearly inappropriate,
  • Generalised linear models (GLM) are needed.
  • The GLM is the generalised version of linear regression
  • that allows for deviations from the assumptions underlying linear regression.

The GLM generalises linear regression

  • by assuming the dependent variable \(Y\)
    • to be generated from any particular distribution in an exponential family
    • (a large class of probability distributions that includes
    • the normal, binomial, Poisson and gamma distributions, among others)

In this way, the distribution of \(Y\)

  • does not necessarily have to be normal.

In addition, the GLM allows the linear predictor \(\eta\)

  • to be connected to the expected value of the outcome variable, \(E(Y)\),
  • via a link function \(g(.)\).

The outcome variable, \(Y\), therefore, depends on \(\eta\)

  • through \(E(Y) = g^{-1}(\eta) = g^{-1}(X\beta)\).
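
For example, with the logit link used in logistic regression, where \(\mu = E(Y)\):

\[ g(\mu) = \log\left(\frac{\mu}{1-\mu}\right), \qquad E(Y) = g^{-1}(X\beta) = \frac{1}{1 + e^{-X\beta}} \]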

In this way, the model does not assume

  • a linear relationship between \(E(Y)\) and \(\eta\);
  • instead, the model assumes a linear relationship
    • between the transformed mean \(g(E(Y))\) and the linear predictor \(\eta = X\beta\).

GLM Modeling

  • A GLM with a Gaussian family and identity link function is equivalent to linear regression
  • If we model the data using such a GLM, we obtain the same results, as the sketch below shows
  • Other examples of GLMs are logistic regression (logit link function) and Poisson regression (log link function)
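
A minimal sketch of both cases using the glm() function (the model names here are our own, not part of the demo above):

# Gaussian family with identity link: reproduces the linear regression of model_3
model_glm <- glm(body_mass_g ~ flipper_length_mm,
                 family = gaussian(link = "identity"), data = penguins)
coef(model_glm)   # same coefficients as coef(model_3)

# Binomial family with logit link: logistic regression for a binary outcome
model_logit <- glm(sex ~ body_mass_g,
                   family = binomial(link = "logit"), data = penguins)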

Glossary

  • Linear Regression: Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship and aims to find the best-fitting line (linear equation) that minimizes the difference between the observed and predicted values.

  • Multiple Linear Regression: Multiple linear regression extends the concept of linear regression to include multiple independent variables. It models the relationship between a dependent variable and multiple independent variables while assuming a linear relationship.

  • Generalized Linear Models (GLMs): Generalized linear models are a broader class of regression models that accommodate a wide range of response variable distributions beyond the normal distribution. They handle situations where the relationship between variables is not strictly linear, such as when the response variable is binary or count data.

  • Statistical Significance: Statistical significance refers to the likelihood that an observed effect in data is not due to random chance. It is assessed using p-values, and a result is considered statistically significant if the p-value is below a pre-defined threshold (often denoted by an alpha level, such as 0.05).

  • Residuals: Residuals are the differences between observed and predicted values in a regression model. They represent the unexplained variation in the dependent variable and are used to assess the model’s goodness of fit.

  • Standard Errors: Standard errors quantify the variability of estimated coefficients in a regression model. They provide a measure of the uncertainty associated with the estimated relationships between variables.

  • p-Values: p-values indicate the probability of observing a test statistic as extreme as the one calculated, assuming that the null hypothesis is true. A small p-value (typically less than 0.05) suggests that the observed effect is unlikely to occur by chance alone, leading to the rejection of the null hypothesis.

  • F-Statistic: The F-statistic is used in analysis of variance (ANOVA) and regression analysis to compare the fit of different models. It assesses whether there is a significant difference in the variation explained by the model compared to the null hypothesis.

  • Adjusted R-squared: Adjusted R-squared is a modified version of the R-squared statistic that accounts for the number of predictors in a model. It penalizes the addition of unnecessary predictors, aiming to provide a more balanced measure of model fit that considers both explanatory power and model complexity.

  • Multiple R-squared: Multiple R-squared (R²) is a statistic that measures the proportion of variability in the dependent variable explained by a regression model’s independent variables. It increases as the model’s ability to explain variability improves.