Linear Modeling
Predictive modeling is the process of using statistical and machine learning models and tools to:
understand a system or process,
identify patterns and trends in the data, and
use the gained insights to make predictions or forecast future outcomes or behaviors.
Data related to or derived from the project is assumed to be available.
Statistical, machine, and deep learning have different goals and requirements, and the choice between them depends on several factors, such as:
the amount of data available,
how important model interpretability is, and
the extent of manual effort in feature selection.
Statistical learning is suited to:
making inferences,
understanding and quantifying the relationships between variables in the data,
data sets of any size, including small and medium, and
high interpretability (low-complexity problems).
Machine learning is suited to:
finding generalizable models,
low interpretability, and
moderate complexity.
Deep learning is suited to:
highly accurate predictions,
very low interpretability, and
very complex problems.
Linear regression is a commonly used approach for making inferences and modeling the relationships between predictors and response variables in a dataset.
In simple and multiple linear regression, the coefficients are determined by fitting the linear equation to the data:
Linear Regression :\[ y=ax + b\]
Multiple Linear Regression : \[y = b + a_1x_1 + a_2x_2 + \dots + a_nx_n\]
There are 4 different assumptions a linear regression model must adhere to:
linearity: the relationship between the predictors and the response is linear,
independence: the observations (and hence the residuals) are independent of one another,
homoscedasticity: the residuals have constant variance, and
normality: the residuals are normally distributed.
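These assumptions are typically checked with residual diagnostics; a minimal sketch using base R's `plot()` method for `lm` objects (illustrated on the built-in `cars` data as a stand-in, since the penguins data is loaded below):

```r
# plot.lm() draws residuals-vs-fitted (linearity, constant variance),
# a normal Q-Q plot (normality of residuals), scale-location, and leverage
fit <- lm(dist ~ speed, data = cars)  # built-in cars data as a stand-in
par(mfrow = c(2, 2))                  # arrange the four plots in a grid
plot(fit)
```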
Loading libraries
The palmerpenguins library contains the penguins data set.
We will load in the penguins dataset and print out the first 5 instances.
library(palmerpenguins)  # provides the penguins data set
library(knitr)           # kable()
library(kableExtra)      # kable_styling(), kable_classic()

penguins <- penguins     # copy the data set into the global environment
attach(penguins)

kable(head(penguins, n = 5),
      caption = "Penguins dataframe") %>%
  kable_styling(
    font_size = 15,
    full_width = TRUE,
    latex_options = "scale_down") %>%
  kable_classic(c("striped", "hover", "condensed"))
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
To find out more about the data set, we can type str(penguins). It tells us about the various columns, their data types, and additional information.
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
This tells us that there are 344 observations and 8 variables in the dataset. Next, let's check the percentage of missing values in each column:
kable(colMeans(is.na(penguins))*100, booktabs = TRUE,
col.names = "missing values") %>%
kable_styling(
font_size = 18,
full_width = FALSE,
latex_options = "scale_down") %>%
kable_classic(c("striped", "hover"))
variable | missing values (%) |
---|---|
species | 0.0000000 |
island | 0.0000000 |
bill_length_mm | 0.5813953 |
bill_depth_mm | 0.5813953 |
flipper_length_mm | 0.5813953 |
body_mass_g | 0.5813953 |
sex | 3.1976744 |
year | 0.0000000 |
With the chunk of code below, we can examine numerical and graphical summaries of the relationships between the model covariates and the response.
library(GGally)  # ggpairs()

ggpairs(data = penguins,
        columns = c("species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "body_mass_g"))
The relationship between two continuous variables is depicted with scatterplots below the diagonal and correlation coefficients above the diagonal.
From this graph, we can see that body mass is correlated with bill length, bill depth (negatively), and, most strongly, flipper length.
First we will check the relationship between each of the bill and flipper variables individually.
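The three model summaries that follow can be reproduced with calls of this form (the object names model_1 through model_3 are our own, not from the original text):

```r
library(palmerpenguins)  # if not already loaded above

# One simple linear regression per predictor of body mass
model_1 <- lm(body_mass_g ~ bill_length_mm,    data = penguins)
model_2 <- lm(body_mass_g ~ bill_depth_mm,     data = penguins)
model_3 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

summary(model_1)  # rows with missing values are dropped automatically
```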
Call:
lm(formula = body_mass_g ~ bill_length_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1759.38 -468.82 27.79 464.20 1641.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 388.845 289.817 1.342 0.181
bill_length_mm 86.792 6.538 13.276 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 651.4 on 331 degrees of freedom
Multiple R-squared: 0.3475, Adjusted R-squared: 0.3455
F-statistic: 176.2 on 1 and 331 DF, p-value: < 2.2e-16
A few observations from the summary of our fit:
The Adjusted R-squared is 0.3455, which means about 34.55% of the variability in the response variable (body mass) is explained by the independent variable (bill_length) included in the model.
The Residual standard error is 651.4, which means that, on average, the predicted body mass values from the model deviate from the actual values by approximately 651.4 units.
The p-value is almost 0, i.e., the model is statistically significant (see more about statistical significance and p-values in this book).
Call:
lm(formula = body_mass_g ~ bill_depth_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1616.08 -510.38 -68.67 486.51 1811.12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7519.98 342.33 21.967 <2e-16 ***
bill_depth_mm -193.01 19.81 -9.741 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 710.9 on 331 degrees of freedom
Multiple R-squared: 0.2228, Adjusted R-squared: 0.2205
F-statistic: 94.89 on 1 and 331 DF, p-value: < 2.2e-16
Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1057.33 -259.79 -12.24 242.97 1293.89
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5872.09 310.29 -18.93 <2e-16 ***
flipper_length_mm 50.15 1.54 32.56 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 393.3 on 331 degrees of freedom
Multiple R-squared: 0.7621, Adjusted R-squared: 0.7614
F-statistic: 1060 on 1 and 331 DF, p-value: < 2.2e-16
As you can see, all 3 models are statistically significant (p-value < 0.05).
Next, let's try adding multiple independent variables to the formula. To fit a multiple linear regression model using least squares, we again use the lm() function. The syntax \(lm(y \sim x1 + x2 + x3)\) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.
Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1806.74 -456.82 15.33 471.07 1541.34
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3413.452 437.911 7.795 8.50e-14 ***
bill_length_mm 74.813 6.076 12.313 < 2e-16 ***
bill_depth_mm -145.507 16.873 -8.624 2.76e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 589.4 on 330 degrees of freedom
Multiple R-squared: 0.4675, Adjusted R-squared: 0.4642
F-statistic: 144.8 on 2 and 330 DF, p-value: < 2.2e-16
If we consider that all the variables in the dataset are relevant and want to include them as predictors, we can use the simple shorthand \(lm(y \sim .)\) instead of typing them out individually as \(lm(y \sim x1 + x2 + ...)\). So, in the case of the penguins dataset, if we want to use all the other variables to predict body mass, we can simply type:
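The call looks like this (storing the fit as model_5, an illustrative name of our own):

```r
library(palmerpenguins)  # if not already loaded above

# "." stands for every column in the data other than the response
model_5 <- lm(body_mass_g ~ ., data = penguins)
summary(model_5)
```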
Call:
lm(formula = body_mass_g ~ ., data = penguins)
Residuals:
Min 1Q Median 3Q Max
-809.70 -180.87 -6.25 176.76 864.22
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 84087.945 41912.019 2.006 0.04566 *
speciesChinstrap -282.539 88.790 -3.182 0.00160 **
speciesGentoo 890.958 144.563 6.163 2.12e-09 ***
islandDream -21.180 58.390 -0.363 0.71704
islandTorgersen -58.777 60.852 -0.966 0.33482
bill_length_mm 18.964 7.112 2.667 0.00805 **
bill_depth_mm 60.798 20.002 3.040 0.00256 **
flipper_length_mm 18.504 3.128 5.915 8.46e-09 ***
sexmale 378.977 48.074 7.883 4.95e-14 ***
year -42.785 20.949 -2.042 0.04194 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 286.5 on 323 degrees of freedom
Multiple R-squared: 0.8768, Adjusted R-squared: 0.8734
F-statistic: 255.4 on 9 and 323 DF, p-value: < 2.2e-16
It is easy to include interaction terms in a linear model using the lm() function. The syntax bill_depth_mm:bill_length_mm tells R to include an interaction term between bill_depth_mm and bill_length_mm. The syntax bill_depth_mm * bill_length_mm simultaneously includes bill_depth_mm, bill_length_mm, and the interaction term bill_depth_mm \(\times\) bill_length_mm as predictors; it is a shorthand for bill_depth_mm + bill_length_mm + bill_depth_mm:bill_length_mm.
We can also pass in transformed versions of the predictors.
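The interaction-model summary that follows can be produced with a call like this (the name model_6 is our own):

```r
library(palmerpenguins)  # if not already loaded above

# "*" expands to both main effects plus their interaction
model_6 <- lm(body_mass_g ~ bill_depth_mm * bill_length_mm, data = penguins)
summary(model_6)
```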
Call:
lm(formula = body_mass_g ~ bill_depth_mm * bill_length_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1814.0 -353.6 2.9 352.4 1603.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25718.181 2696.211 -9.539 <2e-16 ***
bill_depth_mm 1493.496 150.910 9.897 <2e-16 ***
bill_length_mm 718.853 59.256 12.131 <2e-16 ***
bill_depth_mm:bill_length_mm -36.317 3.328 -10.911 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 505.8 on 329 degrees of freedom
Multiple R-squared: 0.609, Adjusted R-squared: 0.6054
F-statistic: 170.8 on 3 and 329 DF, p-value: < 2.2e-16
As we saw in our correlation plot, the body mass was heavily correlated with three variables. Let’s use all three for our model now:
model_7 <- lm(body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm, data = penguins)
summary(model_7)
Call:
lm(formula = body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm,
data = penguins)
Residuals:
Min 1Q Median 3Q Max
-993.49 -247.75 -37.51 226.17 976.89
Coefficients:
Estimate Std. Error t value
(Intercept) -6.317e+04 3.843e+04 -1.644
bill_depth_mm 3.872e+03 2.258e+03 1.715
bill_length_mm 8.467e+02 8.879e+02 0.954
flipper_length_mm 2.843e+02 1.903e+02 1.494
bill_depth_mm:bill_length_mm -5.928e+01 5.207e+01 -1.138
bill_depth_mm:flipper_length_mm -1.667e+01 1.122e+01 -1.485
bill_length_mm:flipper_length_mm -3.118e+00 4.365e+00 -0.714
bill_depth_mm:bill_length_mm:flipper_length_mm 2.385e-01 2.570e-01 0.928
Pr(>|t|)
(Intercept) 0.1012
bill_depth_mm 0.0873 .
bill_length_mm 0.3410
flipper_length_mm 0.1362
bill_depth_mm:bill_length_mm 0.2558
bill_depth_mm:flipper_length_mm 0.1385
bill_length_mm:flipper_length_mm 0.4756
bill_depth_mm:bill_length_mm:flipper_length_mm 0.3542
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 354.4 on 325 degrees of freedom
Multiple R-squared: 0.8104, Adjusted R-squared: 0.8063
F-statistic: 198.4 on 7 and 325 DF, p-value: < 2.2e-16
If there is evidence of a non-linear relationship, we can also pass transformed versions of predictors.
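For instance, a quadratic term or a log transform can be written directly in the formula; a hypothetical sketch (the model names are ours):

```r
library(palmerpenguins)  # if not already loaded above

# I() protects arithmetic inside a formula; log() can be used directly
model_sq  <- lm(body_mass_g ~ flipper_length_mm + I(flipper_length_mm^2),
                data = penguins)
model_log <- lm(body_mass_g ~ log(flipper_length_mm), data = penguins)
```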
We use the anova() function to further quantify the quality of the fit.
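The tables that follow can be generated with calls of this form; a sketch assuming the section's models were saved under illustrative names (the residual degrees of freedom in the outputs imply the models were fit on complete cases only):

```r
library(palmerpenguins)
penguins <- na.omit(penguins)  # complete cases (333 rows), as the outputs imply

model_1 <- lm(body_mass_g ~ bill_length_mm,                 data = penguins)
model_2 <- lm(body_mass_g ~ bill_depth_mm,                  data = penguins)
model_3 <- lm(body_mass_g ~ flipper_length_mm,              data = penguins)
model_4 <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = penguins)
model_5 <- lm(body_mass_g ~ .,                              data = penguins)
model_6 <- lm(body_mass_g ~ bill_depth_mm * bill_length_mm, data = penguins)
model_7 <- lm(body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm,
              data = penguins)

anova(model_1, model_2, model_3, model_4, model_5, model_6, model_7)
anova(model_5, model_6, model_7)
```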
Analysis of Variance Table
Model 1: body_mass_g ~ bill_length_mm
Model 2: body_mass_g ~ bill_depth_mm
Model 3: body_mass_g ~ flipper_length_mm
Model 4: body_mass_g ~ bill_length_mm + bill_depth_mm
Model 5: body_mass_g ~ species + island + bill_length_mm + bill_depth_mm +
flipper_length_mm + sex + year
Model 6: body_mass_g ~ bill_depth_mm * bill_length_mm
Model 7: body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm
Res.Df RSS Df Sum of Sq F Pr(>F)
1 331 140467133
2 331 167300074 0 -26832941
3 331 51211963 0 116088111
4 330 114633409 1 -63421446
5 323 26517018 7 88116390 153.33 < 2.2e-16 ***
6 329 84173822 -6 -57656804 117.05 < 2.2e-16 ***
7 325 40819662 4 43354160 132.02 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Model 1: body_mass_g ~ species + island + bill_length_mm + bill_depth_mm +
flipper_length_mm + sex + year
Model 2: body_mass_g ~ bill_depth_mm * bill_length_mm
Model 3: body_mass_g ~ bill_depth_mm * bill_length_mm * flipper_length_mm
Res.Df RSS Df Sum of Sq F Pr(>F)
1 323 26517018
2 329 84173822 -6 -57656804 117.05 < 2.2e-16 ***
3 325 40819662 4 43354160 132.02 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model structure is thus \(Y = X\beta + \epsilon\), so that \(E(Y) = X\beta\). The linear regression model assumes that \(Y\) is normally distributed, with a mean that depends linearly on the predictors and a constant variance. However, these assumptions are easily violated in many real-world data examples, for instance when the response is binary or a count. Generalized linear models (GLMs) relax these assumptions. In a GLM, the linear predictor \(\eta = X\beta\) is related to the mean of the outcome through a link function \(g\), so that \(g(E(Y)) = \eta\). The outcome variable, \(Y\), therefore depends on \(\eta\) only through the link function. In this way, the model does not assume that \(Y\) itself is normally distributed or linearly related to the predictors.
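As a sketch of what a GLM looks like in R (a hypothetical example on the same data, not from the original text): modeling penguin sex, a binary outcome, from body mass with a logit link.

```r
library(palmerpenguins)  # if not already loaded above

# family = binomial makes this a logistic-regression GLM; glm() links the
# linear predictor to the probability of "male" via the logit function
glm_fit <- glm(sex ~ body_mass_g, family = binomial, data = penguins)
summary(glm_fit)
```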
Linear Regression
: Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship and aims to find the best-fitting line (linear equation) that minimizes the difference between the observed and predicted values.
Multiple Linear Regression
: Multiple linear regression extends the concept of linear regression to include multiple independent variables. It models the relationship between a dependent variable and multiple independent variables while assuming a linear relationship.
Generalized Linear Models (GLMs)
: Generalized linear models are a broader class of regression models that accommodate a wide range of response variable distributions beyond the normal distribution. They handle situations where the relationship between variables is not strictly linear, such as when the response variable is binary or count data.
Statistical Significance
: Statistical significance refers to the likelihood that an observed effect in data is not due to random chance. It is assessed using p-values, and a result is considered statistically significant if the p-value is below a pre-defined threshold (often denoted by an alpha level, such as 0.05).
Residuals
: Residuals are the differences between observed and predicted values in a regression model. They represent the unexplained variation in the dependent variable and are used to assess the model’s goodness of fit.
Standard Errors
: Standard errors quantify the variability of estimated coefficients in a regression model. They provide a measure of the uncertainty associated with the estimated relationships between variables.
p-Values
: p-values indicate the probability of observing a test statistic as extreme as the one calculated, assuming that the null hypothesis is true. A small p-value (typically less than 0.05) suggests that the observed effect is unlikely to occur by chance alone, leading to the rejection of the null hypothesis.
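As a concrete check, a regression p-value can be recovered from the t distribution; for example, the intercept of the bill-length model above had t = 1.342 on 331 residual degrees of freedom:

```r
# Two-sided p-value from a t statistic and its residual degrees of freedom
p <- 2 * pt(-abs(1.342), df = 331)
round(p, 3)  # 0.181, as reported by summary()
```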
F-Statistic
: The F-statistic is used in analysis of variance (ANOVA) and regression analysis to compare the fit of different models. It assesses whether there is a significant difference in the variation explained by the model compared to the null hypothesis.
Adjusted R-squared
: Adjusted R-squared is a modified version of the R-squared statistic that accounts for the number of predictors in a model. It penalizes the addition of unnecessary predictors, aiming to provide a more balanced measure of model fit that considers both explanatory power and model complexity.
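The adjustment can be written out explicitly as \(\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\). Plugging in the bill-length model's numbers from above (R² = 0.3475, n = 333 observations, p = 1 predictor):

```r
r2 <- 0.3475   # multiple R-squared of the bill-length model above
n  <- 333      # observations used (331 residual df + 2 coefficients)
p  <- 1        # number of predictors

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.3455, matching the reported adjusted R-squared
```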
Multiple R-squared
: Multiple R-squared (R²) is a statistic that measures the proportion of variability in the dependent variable explained by a regression model’s independent variables. It increases as the model’s ability to explain variability improves.