Stats for Scholars

Multiple Linear Regression

Purpose: Predicts a continuous dependent variable from two or more independent variables and quantifies the unique contribution of each predictor.
When to Use: When you want to predict or explain variation in a continuous outcome using multiple predictor variables simultaneously.
Data Type: Two or more continuous (or dummy-coded categorical) predictors and one continuous outcome.
Key Assumptions: Linearity, independence of residuals, normality of residuals, homoscedasticity, no multicollinearity (VIF < 10).
Tools: Subthesis Research Tools →

What Is Multiple Linear Regression?

Multiple linear regression extends simple linear regression by modelling the relationship between two or more predictor variables ($X_1, X_2, \dots, X_p$) and a continuous outcome variable ($Y$). The goal is to find the combination of predictors that best explains variability in the outcome.

The regression equation takes the general form:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p$$

Where:

  • $\hat{Y}$ is the predicted value of the outcome
  • $b_0$ is the intercept: the predicted value of $Y$ when all predictors equal zero
  • $b_1, b_2, \dots, b_p$ are partial regression coefficients: each represents the predicted change in $Y$ for a one-unit increase in that predictor, holding all other predictors constant

The "holding constant" part is critical. Unlike running separate simple regressions, multiple regression isolates the unique contribution of each predictor after accounting for the others.
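The prediction arithmetic itself is just a weighted sum. A minimal sketch (the coefficient and predictor values below are hypothetical, chosen only to illustrate the equation):

```python
# Compute a prediction from a fitted multiple regression equation.
# All numbers here are made up for illustration.
def predict(intercept, coefs, xs):
    """Return the predicted Y for one observation: b0 + b1*x1 + ... + bp*xp."""
    return intercept + sum(b * x for b, x in zip(coefs, xs))

# Suppose b0 = 1.0, b1 = 0.5, b2 = 2.0 and an observation with X1 = 4, X2 = 3:
y_hat = predict(1.0, [0.5, 2.0], [4, 3])
print(y_hat)  # 1.0 + 0.5*4 + 2.0*3 = 9.0
```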

When to Use It

Use multiple linear regression when:

  • You have one continuous outcome and two or more predictors (continuous or dummy-coded categorical).
  • You want to know which predictors uniquely contribute to the outcome after controlling for the others.
  • You want to improve prediction beyond what a single predictor provides.
  • You want to statistically control for confounding variables (e.g., predicting job performance from training hours while controlling for years of experience).

If your outcome is binary (e.g., pass/fail), use logistic regression. If you have a single predictor, simple linear regression is sufficient.

Assumptions

  1. Linearity. Each predictor has a linear relationship with the outcome (holding other predictors constant). Check partial regression plots (added-variable plots) for each predictor.

  2. Independence of residuals. Observations are independent. Violated in time-series or hierarchical data. Test with the Durbin-Watson statistic (values near 2 indicate independence).

  3. Normality of residuals. The residuals should be approximately normally distributed. Inspect a Q-Q plot of the residuals. Regression is robust to this with large samples.

  4. Homoscedasticity. The variance of residuals is constant across all predicted values. A "funnel" shape in the residuals-vs.-predicted plot signals heteroscedasticity.

  5. No multicollinearity. Predictors should not be too highly correlated with each other. Multicollinearity inflates standard errors and makes individual coefficients unstable. Check the Variance Inflation Factor (VIF):

$$VIF_j = \frac{1}{1 - R_j^2}$$

Where $R_j^2$ is the $R^2$ from regressing predictor $j$ on all other predictors. A VIF above 10 (or above 5, by stricter standards) indicates problematic multicollinearity.

Formula

Model Equation

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p$$

Coefficient of Determination ($R^2$)

$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}}$$

$R^2$ tells you the proportion of variance in $Y$ explained by the set of predictors combined.

Adjusted $R^2$

Because $R^2$ always increases when you add predictors (even useless ones), adjusted $R^2$ penalizes for the number of predictors:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

Where $n$ is the sample size and $p$ is the number of predictors. Use adjusted $R^2$ when comparing models with different numbers of predictors.
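Both fit measures are one-line formulas. A quick sketch; plugging in the worked example's rounded values ($R^2 = .96$, $n = 8$, $p = 2$) recovers the reported adjusted $R^2$ of about .94:

```python
def r_squared(ss_residual, ss_total):
    """Proportion of variance in Y explained by the predictors."""
    return 1 - ss_residual / ss_total

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r_squared(0.96, 8, 2))  # ≈ 0.944
```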

F-Test for Overall Model

The F-test evaluates whether the set of predictors collectively explains a significant amount of variance:

$$F = \frac{MS_{regression}}{MS_{residual}} = \frac{SS_{regression} / p}{SS_{residual} / (n - p - 1)}$$

With $df_1 = p$ and $df_2 = n - p - 1$.
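Dividing the numerator and denominator of the SS form by $SS_{total}$ gives an equivalent expression in terms of $R^2$: $F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}$. A minimal sketch:

```python
def f_statistic(r2, n, p):
    """Overall-model F computed from R^2 (algebraically equivalent to
    MS_regression / MS_residual)."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# A model with R^2 = .50, n = 13 observations, and p = 2 predictors:
print(f_statistic(0.5, 13, 2))  # F(2, 10) = 5.0
```

Note that the rounded $R^2 = .96$ from the worked example gives $F = 60$ here; the slightly different published value of 58.80 is presumably computed from unrounded sums of squares.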

Standardized Coefficients ($\beta$)

To compare the relative importance of predictors measured on different scales, use standardized coefficients:

$$\beta_j = b_j \cdot \frac{s_{X_j}}{s_Y}$$

A larger absolute $\beta$ indicates a stronger unique contribution to the prediction of $Y$.
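The conversion is a single multiplication; a sketch with made-up numbers:

```python
def standardized_beta(b, s_x, s_y):
    """Convert an unstandardized slope b into a standardized coefficient
    by rescaling with the predictor's and outcome's standard deviations."""
    return b * s_x / s_y

# Hypothetical values: slope 0.5, predictor SD 2, outcome SD 4:
print(standardized_beta(0.5, 2, 4))  # 0.25
```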

Worked Example

Scenario: A university admissions researcher wants to predict student GPA ($Y$) from weekly study hours ($X_1$) and nightly sleep hours ($X_2$) for $n = 8$ students.

| Student | Study Hours ($X_1$) | Sleep Hours ($X_2$) | GPA ($Y$) |
|---------|---------------------|---------------------|-----------|
| 1       | 10                  | 7                   | 2.8       |
| 2       | 15                  | 6                   | 3.0       |
| 3       | 20                  | 8                   | 3.5       |
| 4       | 25                  | 7                   | 3.7       |
| 5       | 12                  | 5                   | 2.5       |
| 6       | 30                  | 8                   | 3.9       |
| 7       | 18                  | 6                   | 3.1       |
| 8       | 22                  | 7                   | 3.4       |

Step 1: Compute the means.

$$\bar{X}_1 = \frac{10+15+20+25+12+30+18+22}{8} = 19.0$$

$$\bar{X}_2 = \frac{7+6+8+7+5+8+6+7}{8} = 6.75$$

$$\bar{Y} = \frac{2.8+3.0+3.5+3.7+2.5+3.9+3.1+3.4}{8} = 3.2375$$
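These means are easy to verify from the table:

```python
# Data from the worked example's table.
study = [10, 15, 20, 25, 12, 30, 18, 22]
sleep = [7, 6, 8, 7, 5, 8, 6, 7]
gpa   = [2.8, 3.0, 3.5, 3.7, 2.5, 3.9, 3.1, 3.4]

mean = lambda xs: sum(xs) / len(xs)
print(mean(study), mean(sleep), mean(gpa))  # ≈ 19.0, 6.75, 3.2375
```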

Step 2: Fit the regression model.

Using the least-squares method (typically computed with software), suppose the solution yields:

$$\hat{Y} = 0.345 + 0.108 X_1 + 0.175 X_2$$

Step 3: Interpret the coefficients.

  • $b_0 = 0.345$: A student with zero study hours and zero sleep hours would have a predicted GPA of 0.345 (not meaningful in practice; purely a mathematical anchor).
  • $b_1 = 0.108$: Each additional weekly study hour is associated with a 0.108-point increase in GPA, holding sleep hours constant.
  • $b_2 = 0.175$: Each additional nightly sleep hour is associated with a 0.175-point increase in GPA, holding study hours constant.

Step 4: Evaluate model fit.

$$R^2 = .96, \quad R^2_{adj} = .94$$

The model explains 96% of the variance in GPA. After adjusting for the number of predictors, 94% of variance is explained.

Step 5: Test the overall model.

$$F(2, 5) = 58.80, \quad p < .001$$

The set of predictors significantly predicts GPA.

Step 6: Check multicollinearity.

$$VIF_{X_1} = 1.12, \quad VIF_{X_2} = 1.12$$

Both VIF values are well below 10, so multicollinearity is not a concern.

Interpretation

The regression equation $\hat{Y} = 0.345 + 0.108X_1 + 0.175X_2$ tells us that both study hours and sleep hours make unique contributions to predicting GPA. Although sleep hours has the larger per-unit coefficient ($b_2 = 0.175$ vs. $b_1 = 0.108$), study hours is the stronger predictor in standardized terms: it varies over a much wider range, so its per-hour effect accumulates across many weekly hours.

The high $R^2_{adj} = .94$ suggests excellent model fit, though the small sample ($n = 8$) means these estimates should be interpreted cautiously and cross-validated with a larger sample.

$R^2$ vs. Adjusted $R^2$

Always report adjusted $R^2$ in multiple regression. Regular $R^2$ will increase whenever you add a predictor, even if it is noise. Adjusted $R^2$ can decrease if a new predictor does not improve the model enough to justify the lost degree of freedom.

Common Mistakes

  1. Including too many predictors for the sample size. A common guideline is at least 10-20 observations per predictor. With $n = 30$ and 10 predictors, the model is likely overfit.

  2. Ignoring multicollinearity. When predictors are highly correlated, individual coefficients become unstable and may flip sign. Always check VIF values.

  3. Interpreting coefficients as causal effects. Regression coefficients reflect associations, not causation. Without experimental control, confounds may explain the relationships.

  4. Using stepwise selection uncritically. Automated stepwise procedures capitalize on chance and produce models that may not replicate. Use theory-driven predictor selection when possible.

  5. Reporting R2R^2R2 instead of adjusted R2R^2R2. In multiple regression, R2R^2R2 is inflated. Always report adjusted R2R^2R2 as the primary measure of model fit.

  6. Forgetting to check residual plots. A high $R^2$ does not guarantee that assumptions are met. Always inspect residual-vs.-predicted plots and Q-Q plots.

  7. Confusing unstandardized and standardized coefficients. Report $b$ (with units) for interpretation and $\beta$ for comparing relative importance. Do not mix them.

How to Run It

In R:

```r
# Multiple linear regression in R
model <- lm(gpa ~ study_hours + sleep_hours, data = mydata)
summary(model)

# Confidence intervals for coefficients
confint(model)

# Variance Inflation Factor
library(car)
vif(model)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
```

In Python:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Fit the model
X = df[['study_hours', 'sleep_hours']]
X = sm.add_constant(X)
model = sm.OLS(df['gpa'], X).fit()
print(model.summary())

# Variance Inflation Factor (skip the constant column)
for i, col in enumerate(X.columns[1:], 1):
    print(f"VIF {col}: {variance_inflation_factor(X.values, i):.2f}")
```

In SPSS:
  1. Go to Analyze > Regression > Linear
  2. Move your dependent variable (e.g., GPA) into the Dependent box
  3. Move all predictor variables (e.g., Study Hours, Sleep Hours) into the Independent(s) box
  4. Click Statistics and check Estimates, Model fit, Descriptives, and Collinearity diagnostics
  5. Click Plots: set *ZRESID as Y and *ZPRED as X to check homoscedasticity; also request a normal probability plot
  6. Click OK

SPSS outputs a Model Summary (R, R², Adjusted R²), ANOVA table (F-test for the overall model), Coefficients table (b, SE, Beta, t, p for each predictor), and Collinearity Statistics (Tolerance and VIF).

In Excel:

  1. Go to Data > Data Analysis > Regression (requires the Analysis ToolPak)
  2. Set Input Y Range to your dependent variable column (GPA)
  3. Set Input X Range to the columns containing all predictor variables (Study Hours and Sleep Hours together)
  4. Check Labels if your first row contains headers
  5. Check Residual Plots and Normal Probability Plots
  6. Click OK

Excel outputs R², Adjusted R², the ANOVA table, and coefficients with standard errors, t-statistics, and p-values. Excel does not compute VIF directly — compute it manually by regressing each predictor on the others and using VIF = 1 / (1 − R²).

How to Report in APA Format

> A multiple linear regression was conducted to predict GPA from weekly study hours and nightly sleep hours. The overall model was statistically significant, $F(2, 5) = 58.80$, $p < .001$, $R^2 = .96$, $R^2_{adj} = .94$. Study hours significantly predicted GPA ($b = 0.108$, $SE = 0.014$, $\beta = .87$, $p < .001$), as did sleep hours ($b = 0.175$, $SE = 0.072$, $\beta = .27$, $p = .048$). Multicollinearity was not a concern (all VIFs < 2). The regression equation was: predicted GPA $= 0.345 + 0.108 \times$ study hours $+ 0.175 \times$ sleep hours.

Key elements to include:

  • The $F$-test for the overall model with degrees of freedom
  • $R^2$ and adjusted $R^2$
  • Unstandardized coefficient ($b$), standard error, standardized coefficient ($\beta$), and p-value for each predictor
  • VIF or a statement about multicollinearity
  • The regression equation

Ready to calculate?

Now that you understand the concept, use the free Subthesis Research Tools to run your own analysis.


Related Concepts

Simple Linear Regression

Master simple linear regression: learn how to predict a continuous outcome from one predictor variable, interpret slope, intercept, and R-squared values.

Pearson Correlation

Learn how to calculate and interpret the Pearson correlation coefficient (r) to measure the strength and direction of linear relationships between two variables.

Effect Size

Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.


© 2026 Angel Reyes / Subthesis. All rights reserved.
