Multiple Linear Regression
What Is Multiple Linear Regression?
Multiple linear regression extends simple linear regression by modelling the relationship between two or more predictor variables (X₁, X₂, …, Xₖ) and a continuous outcome variable (Y). The goal is to find the combination of predictors that best explains variability in the outcome.
The regression equation takes the general form:

Ŷ = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ
Where:
- Ŷ is the predicted value of the outcome
- b₀ is the intercept — the predicted value of Y when all predictors equal zero
- b₁, b₂, …, bₖ are partial regression coefficients — each represents the predicted change in Y for a one-unit increase in that predictor, holding all other predictors constant
The "holding constant" part is critical. Unlike running separate simple regressions, multiple regression isolates the unique contribution of each predictor after accounting for the others.
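To see why this matters, here is a minimal pure-Python sketch. The data are made up for illustration, and the two-predictor normal equations are solved by hand to show that the simple slope of Y on X₁ differs from its partial slope once a correlated X₂ enters the model:

```python
# Made-up data where x2 is correlated with x1 and y = 1*x1 + 2*x2 exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 1, 2, 2, 3, 3]
y  = [3, 4, 7, 8, 11, 12]

n = len(y)
mx1, mx2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
s11 = sum((a - mx1) ** 2 for a in x1)
s22 = sum((b - mx2) ** 2 for b in x2)
s12 = sum((a - mx1) * (b - mx2) for a, b in zip(x1, x2))
s1y = sum((a - mx1) * (c - my) for a, c in zip(x1, y))
s2y = sum((b - mx2) * (c - my) for b, c in zip(x2, y))

# Simple regression slope of y on x1 alone: absorbs part of x2's effect.
simple_b1 = s1y / s11

# Multiple regression: solve the 2x2 normal equations for partial slopes.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det   # partial slope for x1
b2 = (s11 * s2y - s12 * s1y) / det   # partial slope for x2

print(round(simple_b1, 2), round(b1, 2), round(b2, 2))   # 1.91 1.0 2.0
```

The simple slope (≈ 1.91) is inflated because x1 carries x2's influence; the partial slope recovers the true 1.0 by holding x2 constant.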
When to Use It
Use multiple linear regression when:
- You have one continuous outcome and two or more predictors (continuous or dummy-coded categorical).
- You want to know which predictors uniquely contribute to the outcome after controlling for the others.
- You want to improve prediction beyond what a single predictor provides.
- You want to statistically control for confounding variables (e.g., predicting job performance from training hours while controlling for years of experience).
If your outcome is binary (e.g., pass/fail), use logistic regression. If you have a single predictor, simple linear regression is sufficient.
Assumptions
- Linearity. Each predictor has a linear relationship with the outcome (holding other predictors constant). Check partial regression plots (added-variable plots) for each predictor.
- Independence of residuals. Observations are independent. Violated in time-series or hierarchical data. Test with the Durbin-Watson statistic (values near 2 indicate independence).
- Normality of residuals. The residuals should be approximately normally distributed. Inspect a Q-Q plot of the residuals. Regression is robust to this with large samples.
- Homoscedasticity. The variance of residuals is constant across all predicted values. A "funnel" shape in the residuals-vs.-predicted plot signals heteroscedasticity.
- No multicollinearity. Predictors should not be too highly correlated with each other. Multicollinearity inflates standard errors and makes individual coefficients unstable. Check the Variance Inflation Factor (VIF):
VIFⱼ = 1 / (1 − Rⱼ²)

Where Rⱼ² is the R² from regressing predictor j on all the other predictors. A VIF above 10 (or above 5, by stricter standards) indicates problematic multicollinearity.
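As a sketch of the computation: with only two predictors, each predictor's R² against the other reduces to their squared correlation, so both VIFs coincide. The data below are the study and sleep hours from the worked example later in this article:

```python
# Study and sleep hours from the worked example (8 students).
study = [10, 15, 20, 25, 12, 30, 18, 22]
sleep = [7, 6, 8, 7, 5, 8, 6, 7]

n = len(study)
m1, m2 = sum(study) / n, sum(sleep) / n
ss1 = sum((a - m1) ** 2 for a in study)
ss2 = sum((b - m2) ** 2 for b in sleep)
sp  = sum((a - m1) * (b - m2) for a, b in zip(study, sleep))

r_squared = sp ** 2 / (ss1 * ss2)   # squared correlation between predictors
vif = 1 / (1 - r_squared)
print(round(vif, 2))                # 1.69 — far below the cutoff of 10
```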
Formula
Model Equation

Ŷ = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ
Coefficient of Determination (R²)

R² = SS_regression / SS_total

R² tells you the proportion of variance in Y explained by the set of predictors combined.
Adjusted R²

Because R² always increases when you add predictors (even useless ones), adjusted R² penalizes for the number of predictors:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

Where n is the sample size and k is the number of predictors. Use adjusted R² when comparing models with different numbers of predictors.
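A quick helper makes the penalty concrete. The numbers are illustrative, assuming a near-useless predictor nudges R² from .50 to only .51 at n = 30:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R^2 creeps up from .50 to .51 when a noise predictor is added,
# but the lost degree of freedom drives adjusted R^2 down.
print(round(adjusted_r2(0.50, 30, 2), 3))   # 0.463
print(round(adjusted_r2(0.51, 30, 3), 3))   # 0.453
```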
F-Test for Overall Model
The F-test evaluates whether the set of predictors collectively explains a significant amount of variance:

F = (R² / k) / ((1 − R²) / (n − k − 1))

With df₁ = k and df₂ = n − k − 1.
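A sketch of the computation, using values from the worked example below (R² = .96, n = 8, k = 2); note the closed-form p-value shortcut holds only in the special case df₁ = 2:

```python
def overall_f(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), with df1 = k, df2 = n - k - 1."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

F = overall_f(0.96, 8, 2)
print(round(F, 2))   # 60.0

# When df1 = 2, the F survival function has a closed form:
# p = (1 + df1 * F / df2) ** (-df2 / 2)
p = (1 + 2 * F / 5) ** (-5 / 2)
print(round(p, 4))   # 0.0003
```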
Standardized Coefficients (β)

To compare the relative importance of predictors measured on different scales, use standardized coefficients:

βⱼ = bⱼ × (SD of Xⱼ / SD of Y)

A larger absolute β indicates a stronger unique contribution to the prediction of Y.
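A small illustration with hypothetical slopes: a dollar-scale predictor with a tiny unstandardized b can still carry the larger standardized weight once scale is removed:

```python
def standardized_beta(b, sd_x, sd_y):
    """beta = b * (SD of X / SD of Y): expresses a slope in SD units."""
    return b * sd_x / sd_y

# Hypothetical: income in dollars (tiny b, huge SD) vs. age in years.
print(round(standardized_beta(0.0005, 12000, 15), 2))   # 0.4
print(round(standardized_beta(0.3, 10, 15), 2))         # 0.2
```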
Worked Example
Scenario: A university admissions researcher wants to predict student GPA (Y) from weekly study hours (X₁) and nightly sleep hours (X₂) for n = 8 students.
| Student | Study Hours (X₁) | Sleep Hours (X₂) | GPA (Y) |
|---|---|---|---|
| 1 | 10 | 7 | 2.8 |
| 2 | 15 | 6 | 3.0 |
| 3 | 20 | 8 | 3.5 |
| 4 | 25 | 7 | 3.7 |
| 5 | 12 | 5 | 2.5 |
| 6 | 30 | 8 | 3.9 |
| 7 | 18 | 6 | 3.1 |
| 8 | 22 | 7 | 3.4 |
Step 1: Compute the means.

X̄₁ = 152/8 = 19.0, X̄₂ = 54/8 = 6.75, Ȳ = 25.9/8 ≈ 3.24
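The means can be verified directly from the table:

```python
# The eight students' data from the table above.
study = [10, 15, 20, 25, 12, 30, 18, 22]
sleep = [7, 6, 8, 7, 5, 8, 6, 7]
gpa   = [2.8, 3.0, 3.5, 3.7, 2.5, 3.9, 3.1, 3.4]

def mean(xs):
    return sum(xs) / len(xs)

print(mean(study))            # 19.0
print(mean(sleep))            # 6.75
print(round(mean(gpa), 4))    # 3.2375
```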
Step 2: Fit the regression model.
Using the least-squares method (typically computed with software), suppose the solution yields:

Ŷ = 0.345 + 0.108X₁ + 0.175X₂
Step 3: Interpret the coefficients.
- b₀ = 0.345: A student with zero study hours and zero sleep hours would have a predicted GPA of 0.345 (not meaningful in practice — purely a mathematical anchor).
- b₁ = 0.108: Each additional weekly study hour is associated with a 0.108-point increase in GPA, holding sleep hours constant.
- b₂ = 0.175: Each additional nightly sleep hour is associated with a 0.175-point increase in GPA, holding study hours constant.
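Plugging hypothetical student values into the illustrative equation gives a point prediction:

```python
# Using the illustrative coefficients from Step 2
# (b0 = 0.345, b1 = 0.108, b2 = 0.175):
def predict_gpa(study_hours, sleep_hours):
    return 0.345 + 0.108 * study_hours + 0.175 * sleep_hours

# A student who studies 20 hours/week and sleeps 7 hours/night:
print(round(predict_gpa(20, 7), 2))   # 3.73
```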
Step 4: Evaluate model fit.
The model explains 96% of the variance in GPA (R² = .96). After adjusting for the number of predictors, 94% of variance is explained (adjusted R² ≈ .94).
Step 5: Test the overall model.
F(2, 5) = (0.96 / 2) / (0.04 / 5) = 60.0, p < .001. The set of predictors significantly predicts GPA.
Step 6: Check multicollinearity.
With only two predictors, each VIF equals 1 / (1 − r²) for the correlation between study hours and sleep hours (r ≈ .64 here), giving VIF ≈ 1.69 for each. Both VIF values are well below 10, so multicollinearity is not a concern.
Interpretation
The regression equation tells us that both study hours and sleep hours independently contribute to predicting GPA. Study hours is the stronger predictor in absolute terms (b₁ = 0.108 per hour, accumulated across many weekly hours), while sleep hours also makes a meaningful unique contribution (b₂ = 0.175 per hour).
The high R² suggests excellent model fit, though the small sample (n = 8) means these estimates should be interpreted cautiously and cross-validated with a larger sample.
R² vs. Adjusted R²

Always report adjusted R² in multiple regression. Regular R² will increase whenever you add a predictor, even if it is noise. Adjusted R² can decrease if a new predictor does not improve the model enough to justify the lost degree of freedom.
Common Mistakes
- Including too many predictors for the sample size. A common guideline is at least 10-20 observations per predictor. With 10 predictors and well under 100 observations, the model is likely overfit.
- Ignoring multicollinearity. When predictors are highly correlated, individual coefficients become unstable and may flip sign. Always check VIF values.
- Interpreting coefficients as causal effects. Regression coefficients reflect associations, not causation. Without experimental control, confounds may explain the relationships.
- Using stepwise selection uncritically. Automated stepwise procedures capitalize on chance and produce models that may not replicate. Use theory-driven predictor selection when possible.
- Reporting R² instead of adjusted R². In multiple regression, R² is inflated. Always report adjusted R² as the primary measure of model fit.
- Forgetting to check residual plots. A high R² does not guarantee that assumptions are met. Always inspect residual-vs.-predicted plots and Q-Q plots.
- Confusing unstandardized and standardized coefficients. Report b (with units) for interpretation and β for comparing relative importance. Do not mix them.
How to Run It
```r
# Fit the model (assumes a data frame `df` with these columns)
model <- lm(gpa ~ study_hours + sleep_hours, data = df)
summary(model)

# Confidence intervals for coefficients
confint(model)

# Variance Inflation Factor
library(car)
vif(model)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
```
```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Fit the model
X = df[['study_hours', 'sleep_hours']]
X = sm.add_constant(X)
model = sm.OLS(df['gpa'], X).fit()
print(model.summary())
# Variance Inflation Factor
for i, col in enumerate(X.columns[1:], 1):
    print(f"VIF {col}: {variance_inflation_factor(X.values, i):.2f}")
```
1. Go to Analyze > Regression > Linear
2. Move your dependent variable (e.g., GPA) into the Dependent box
3. Move all predictor variables (e.g., Study Hours, Sleep Hours) into the Independent(s) box
4. Click Statistics and check Estimates, Model fit, Descriptives, and Collinearity diagnostics
5. Click Plots: set *ZRESID as Y and *ZPRED as X to check homoscedasticity; also request a normal probability plot
6. Click OK
SPSS outputs a Model Summary (R, R², Adjusted R²), ANOVA table (F-test for the overall model), Coefficients table (b, SE, Beta, t, p for each predictor), and Collinearity Statistics (Tolerance and VIF).
1. Go to Data > Data Analysis > Regression (requires the Analysis ToolPak)
2. Set Input Y Range to your dependent variable column (GPA)
3. Set Input X Range to the columns containing all predictor variables (Study Hours and Sleep Hours together)
4. Check Labels if your first row contains headers
5. Check Residual Plots and Normal Probability Plots
6. Click OK
Excel outputs R², Adjusted R², the ANOVA table, and coefficients with standard errors, t-statistics, and p-values. Excel does not compute VIF directly — compute it manually by regressing each predictor on the others and using VIF = 1 / (1 − R²).
Ready to calculate?
Now that you understand the concept, use the free Subthesis Research Tools on Subthesis to run your own analysis.
Related Concepts
Simple Linear Regression
Master simple linear regression: learn how to predict a continuous outcome from one predictor variable, interpret slope, intercept, and R-squared values.
Pearson Correlation
Learn how to calculate and interpret the Pearson correlation coefficient (r) to measure the strength and direction of linear relationships between two variables.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.