Multiple Linear Regression
What Is Multiple Linear Regression?
Multiple linear regression extends simple linear regression by modelling the relationship between two or more predictor variables (X₁, X₂, …, Xₖ) and a continuous outcome variable (Y). The goal is to find the combination of predictors that best explains variability in the outcome.
The regression equation takes the general form:

Ŷ = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ
Where:
- Ŷ is the predicted value of the outcome
- b₀ is the intercept — the predicted value of Y when all predictors equal zero
- b₁, b₂, …, bₖ are partial regression coefficients — each represents the predicted change in Y for a one-unit increase in that predictor, holding all other predictors constant
The "holding constant" part is critical. Unlike running separate simple regressions, multiple regression isolates the unique contribution of each predictor after accounting for the others.
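To see why this matters, here is a minimal pure-Python sketch. The data are made up for illustration, and the two-predictor normal equations are solved by hand to show that the simple slope of Y on X₁ differs from its partial slope once a correlated X₂ enters the model:

```python
# Made-up data where x2 is correlated with x1 and y = 1*x1 + 2*x2 exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 1, 2, 2, 3, 3]
y  = [3, 4, 7, 8, 11, 12]

n = len(y)
mx1, mx2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
s11 = sum((a - mx1) ** 2 for a in x1)
s22 = sum((b - mx2) ** 2 for b in x2)
s12 = sum((a - mx1) * (b - mx2) for a, b in zip(x1, x2))
s1y = sum((a - mx1) * (c - my) for a, c in zip(x1, y))
s2y = sum((b - mx2) * (c - my) for b, c in zip(x2, y))

# Simple regression slope of y on x1 alone: absorbs part of x2's effect.
simple_b1 = s1y / s11

# Multiple regression: solve the 2x2 normal equations for partial slopes.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det   # partial slope for x1
b2 = (s11 * s2y - s12 * s1y) / det   # partial slope for x2

print(round(simple_b1, 2), round(b1, 2), round(b2, 2))   # 1.91 1.0 2.0
```

The simple slope (≈ 1.91) is inflated because x1 carries x2's influence; the partial slope recovers the true 1.0 by holding x2 constant.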
When to Use It
Use multiple linear regression when:
- You have one continuous outcome and two or more predictors (continuous or dummy-coded categorical).
- You want to know which predictors uniquely contribute to the outcome after controlling for the others.
- You want to improve prediction beyond what a single predictor provides.
- You want to statistically control for confounding variables (e.g., predicting job performance from training hours while controlling for years of experience).
If your outcome is binary (e.g., pass/fail), use logistic regression. If you have a single predictor, simple linear regression is sufficient.
Assumptions
- Linearity. Each predictor has a linear relationship with the outcome (holding other predictors constant). Check partial regression plots (added-variable plots) for each predictor.
- Independence of residuals. Observations are independent. Violated in time-series or hierarchical data. Test with the Durbin-Watson statistic (values near 2 indicate independence).
- Normality of residuals. The residuals should be approximately normally distributed. Inspect a Q-Q plot of the residuals. Regression is robust to this with large samples.
- Homoscedasticity. The variance of residuals is constant across all predicted values. A "funnel" shape in the residuals-vs.-predicted plot signals heteroscedasticity.
- No multicollinearity. Predictors should not be too highly correlated with each other. Multicollinearity inflates standard errors and makes individual coefficients unstable. Check the Variance Inflation Factor (VIF):
VIFⱼ = 1 / (1 − Rⱼ²)

Where Rⱼ² is the R² from regressing predictor j on all the other predictors. A VIF above 10 (or above 5, by stricter standards) indicates problematic multicollinearity.
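As a sketch of the computation: with only two predictors, each predictor's R² against the other reduces to their squared correlation, so both VIFs coincide. The data below are the study and sleep hours from the worked example later in this article:

```python
# Study and sleep hours from the worked example (8 students).
study = [10, 15, 20, 25, 12, 30, 18, 22]
sleep = [7, 6, 8, 7, 5, 8, 6, 7]

n = len(study)
m1, m2 = sum(study) / n, sum(sleep) / n
ss1 = sum((a - m1) ** 2 for a in study)
ss2 = sum((b - m2) ** 2 for b in sleep)
sp  = sum((a - m1) * (b - m2) for a, b in zip(study, sleep))

r_squared = sp ** 2 / (ss1 * ss2)   # squared correlation between predictors
vif = 1 / (1 - r_squared)
print(round(vif, 2))                # 1.69 — far below the cutoff of 10
```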
Formula
Model Equation

Ŷ = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ
Coefficient of Determination (R²)

R² = SS_regression / SS_total

R² tells you the proportion of variance in Y explained by the set of predictors combined.
Adjusted R²

Because R² always increases when you add predictors (even useless ones), adjusted R² penalizes for the number of predictors:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

Where n is the sample size and k is the number of predictors. Use adjusted R² when comparing models with different numbers of predictors.
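A quick helper makes the penalty concrete. The numbers are illustrative, assuming a near-useless predictor nudges R² from .50 to only .51 at n = 30:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R^2 creeps up from .50 to .51 when a noise predictor is added,
# but the lost degree of freedom drives adjusted R^2 down.
print(round(adjusted_r2(0.50, 30, 2), 3))   # 0.463
print(round(adjusted_r2(0.51, 30, 3), 3))   # 0.453
```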
F-Test for Overall Model
The F-test evaluates whether the set of predictors collectively explains a significant amount of variance:

F = (R² / k) / ((1 − R²) / (n − k − 1))

With df₁ = k and df₂ = n − k − 1.
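A sketch of the computation, using values from the worked example below (R² = .96, n = 8, k = 2); note the closed-form p-value shortcut holds only in the special case df₁ = 2:

```python
def overall_f(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), with df1 = k, df2 = n - k - 1."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

F = overall_f(0.96, 8, 2)
print(round(F, 2))   # 60.0

# When df1 = 2, the F survival function has a closed form:
# p = (1 + df1 * F / df2) ** (-df2 / 2)
p = (1 + 2 * F / 5) ** (-5 / 2)
print(round(p, 4))   # 0.0003
```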
Standardized Coefficients (β)

To compare the relative importance of predictors measured on different scales, use standardized coefficients:

βⱼ = bⱼ × (SD of Xⱼ / SD of Y)

A larger absolute β indicates a stronger unique contribution to the prediction of Y.
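A small illustration with hypothetical slopes: a dollar-scale predictor with a tiny unstandardized b can still carry the larger standardized weight once scale is removed:

```python
def standardized_beta(b, sd_x, sd_y):
    """beta = b * (SD of X / SD of Y): expresses a slope in SD units."""
    return b * sd_x / sd_y

# Hypothetical: income in dollars (tiny b, huge SD) vs. age in years.
print(round(standardized_beta(0.0005, 12000, 15), 2))   # 0.4
print(round(standardized_beta(0.3, 10, 15), 2))         # 0.2
```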
Worked Example
Scenario: A university admissions researcher wants to predict student GPA (Y) from weekly study hours (X₁) and nightly sleep hours (X₂) for n = 8 students.
| Student | Study Hours (X₁) | Sleep Hours (X₂) | GPA (Y) |
|---|---|---|---|
| 1 | 10 | 7 | 2.8 |
| 2 | 15 | 6 | 3.0 |
| 3 | 20 | 8 | 3.5 |
| 4 | 25 | 7 | 3.7 |
| 5 | 12 | 5 | 2.5 |
| 6 | 30 | 8 | 3.9 |
| 7 | 18 | 6 | 3.1 |
| 8 | 22 | 7 | 3.4 |
Step 1: Compute the means.

X̄₁ = 152/8 = 19.0, X̄₂ = 54/8 = 6.75, Ȳ = 25.9/8 ≈ 3.24
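The means can be verified directly from the table:

```python
# The eight students' data from the table above.
study = [10, 15, 20, 25, 12, 30, 18, 22]
sleep = [7, 6, 8, 7, 5, 8, 6, 7]
gpa   = [2.8, 3.0, 3.5, 3.7, 2.5, 3.9, 3.1, 3.4]

def mean(xs):
    return sum(xs) / len(xs)

print(mean(study))            # 19.0
print(mean(sleep))            # 6.75
print(round(mean(gpa), 4))    # 3.2375
```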
Step 2: Fit the regression model.
Using the least-squares method (typically computed with software), suppose the solution yields:

Ŷ = 0.345 + 0.108X₁ + 0.175X₂
Step 3: Interpret the coefficients.
- b₀ = 0.345: A student with zero study hours and zero sleep hours would have a predicted GPA of 0.345 (not meaningful in practice — purely a mathematical anchor).
- b₁ = 0.108: Each additional weekly study hour is associated with a 0.108-point increase in GPA, holding sleep hours constant.
- b₂ = 0.175: Each additional nightly sleep hour is associated with a 0.175-point increase in GPA, holding study hours constant.
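Plugging hypothetical student values into the illustrative equation gives a point prediction:

```python
# Using the illustrative coefficients from Step 2
# (b0 = 0.345, b1 = 0.108, b2 = 0.175):
def predict_gpa(study_hours, sleep_hours):
    return 0.345 + 0.108 * study_hours + 0.175 * sleep_hours

# A student who studies 20 hours/week and sleeps 7 hours/night:
print(round(predict_gpa(20, 7), 2))   # 3.73
```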
Step 4: Evaluate model fit.
The model explains 96% of the variance in GPA (R² = .96). After adjusting for the number of predictors, 94% of variance is explained (adjusted R² ≈ .94).
Step 5: Test the overall model.
F(2, 5) = (0.96 / 2) / (0.04 / 5) = 60.0, p < .001. The set of predictors significantly predicts GPA.
Step 6: Check multicollinearity.
With only two predictors, each VIF equals 1 / (1 − r²) for the correlation between study hours and sleep hours (r ≈ .64 here), giving VIF ≈ 1.69 for each. Both VIF values are well below 10, so multicollinearity is not a concern.
Interpretation
The regression equation tells us that both study hours and sleep hours independently contribute to predicting GPA. Study hours is the stronger predictor in absolute terms (b₁ = 0.108 per hour, accumulated across many weekly hours), while sleep hours also makes a meaningful unique contribution (b₂ = 0.175 per hour).
The high R² suggests excellent model fit, though the small sample (n = 8) means these estimates should be interpreted cautiously and cross-validated with a larger sample.
R² vs. Adjusted R²

Always report adjusted R² in multiple regression. Regular R² will increase whenever you add a predictor, even if it is noise. Adjusted R² can decrease if a new predictor does not improve the model enough to justify the lost degree of freedom.
Common Mistakes
- Including too many predictors for the sample size. A common guideline is at least 10-20 observations per predictor. With 10 predictors and well under 100 observations, the model is likely overfit.
- Ignoring multicollinearity. When predictors are highly correlated, individual coefficients become unstable and may flip sign. Always check VIF values.
- Interpreting coefficients as causal effects. Regression coefficients reflect associations, not causation. Without experimental control, confounds may explain the relationships.
- Using stepwise selection uncritically. Automated stepwise procedures capitalize on chance and produce models that may not replicate. Use theory-driven predictor selection when possible.
- Reporting R² instead of adjusted R². In multiple regression, R² is inflated. Always report adjusted R² as the primary measure of model fit.
- Forgetting to check residual plots. A high R² does not guarantee that assumptions are met. Always inspect residual-vs.-predicted plots and Q-Q plots.
- Confusing unstandardized and standardized coefficients. Report b (with units) for interpretation and β for comparing relative importance. Do not mix them.
How to Run It
```r
# Fit the model (assumes a data frame `df` with these columns)
model <- lm(gpa ~ study_hours + sleep_hours, data = df)
summary(model)

# Confidence intervals for coefficients
confint(model)

# Variance Inflation Factor
library(car)
vif(model)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
```
```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Fit the model
X = df[['study_hours', 'sleep_hours']]
X = sm.add_constant(X)
model = sm.OLS(df['gpa'], X).fit()
print(model.summary())
# Variance Inflation Factor
for i, col in enumerate(X.columns[1:], 1):
    print(f"VIF {col}: {variance_inflation_factor(X.values, i):.2f}")
```
1. Go to Analyze > Regression > Linear
2. Move your dependent variable (e.g., GPA) into the Dependent box
3. Move all predictor variables (e.g., Study Hours, Sleep Hours) into the Independent(s) box
4. Click Statistics and check Estimates, Model fit, Descriptives, and Collinearity diagnostics
5. Click Plots: set *ZRESID as Y and *ZPRED as X to check homoscedasticity; also request a normal probability plot
6. Click OK
SPSS outputs a Model Summary (R, R², Adjusted R²), ANOVA table (F-test for the overall model), Coefficients table (b, SE, Beta, t, p for each predictor), and Collinearity Statistics (Tolerance and VIF).
1. Go to Data > Data Analysis > Regression (requires the Analysis ToolPak)
2. Set Input Y Range to your dependent variable column (GPA)
3. Set Input X Range to the columns containing all predictor variables (Study Hours and Sleep Hours together)
4. Check Labels if your first row contains headers
5. Check Residual Plots and Normal Probability Plots
6. Click OK
Excel outputs R², Adjusted R², the ANOVA table, and coefficients with standard errors, t-statistics, and p-values. Excel does not compute VIF directly — compute it manually by regressing each predictor on the others and using VIF = 1 / (1 − R²).
Ready to calculate?
Now that you understand the concept, use the free Subthesis Research Tools on Subthesis to run your own analysis.
Related Concepts
Simple Linear Regression
Master simple linear regression: learn how to predict a continuous outcome from one predictor variable, interpret slope, intercept, and R-squared values.
Pearson Correlation
Learn how to calculate and interpret the Pearson correlation coefficient (r) to measure the strength and direction of linear relationships between two variables.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.