Stats for Scholars

Simple Linear Regression

Inferential Statistics · Intermediate

Purpose
Predicts a continuous dependent variable from a single continuous independent variable and quantifies the strength of that prediction.
When to Use
When you want to predict or explain variation in a continuous outcome using one predictor variable.
Data Type
One continuous predictor (X) and one continuous outcome (Y)
Key Assumptions
Linearity, independence of residuals, homoscedasticity (constant variance of residuals), normality of residuals, no significant outliers.
Tools
Subthesis Research Tools

What Is Simple Linear Regression?

Simple linear regression is a method for modelling the relationship between a single predictor variable ($X$) and a continuous outcome variable ($Y$) by fitting a straight line through the data. The goal is to find the line that minimizes the total squared distance between the observed data points and the predicted values on the line.

The regression equation takes the form:

$$\hat{Y} = b_0 + b_1 X$$

Where:

  • $\hat{Y}$ (Y-hat) is the predicted value of the outcome
  • $b_0$ is the y-intercept — the predicted value of $Y$ when $X = 0$
  • $b_1$ is the slope — the predicted change in $Y$ for each one-unit increase in $X$

While Pearson correlation tells you the strength and direction of a linear association, regression goes further by giving you a prediction equation. Correlation asks "Are these related?" Regression asks "By how much does $Y$ change when $X$ changes, and can I predict $Y$ from $X$?"

When to Use It

Use simple linear regression when:

  • You have one continuous predictor and one continuous outcome.
  • You want to predict the value of the outcome from the predictor (e.g., predicting exam score from study hours).
  • You want to quantify the rate of change — how much the outcome changes per unit change in the predictor.
  • You have a theoretical reason to treat one variable as the predictor and the other as the outcome.

If you have multiple predictors, you need multiple linear regression. If the outcome is categorical (e.g., pass/fail), you need logistic regression.

Assumptions

Simple linear regression requires the following assumptions. Violations can lead to biased coefficients, incorrect p-values, or poor predictions.

  1. Linearity. The relationship between $X$ and $Y$ is linear. Check by inspecting a scatter plot of $X$ vs. $Y$ and a residual plot (residuals vs. predicted values). If you see a curve, consider transforming variables or using polynomial regression.

  2. Independence of residuals. Each observation is independent of the others. This is violated in time-series or clustered data (e.g., students nested in classrooms).

  3. Homoscedasticity. The variance of the residuals is constant across all levels of $X$. In the residual plot, the spread of points should be roughly the same width throughout. A "funnel" shape indicates heteroscedasticity.

  4. Normality of residuals. The residuals (not the raw variables) should be approximately normally distributed. Check with a Q-Q plot or a Shapiro-Wilk test on the residuals. With large samples ($n > 30$), regression is fairly robust to this.

  5. No significant outliers or influential points. Extreme values can drag the regression line. Use Cook's distance ($D_i > 1$ is concerning) and leverage values to identify influential cases.
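Several of these checks can be done numerically rather than by eye. The sketch below is illustrative only (the synthetic data and variable names are ours): it fits a line with NumPy, confirms two algebraic properties of least-squares residuals (they average to zero and are uncorrelated with the predictor), then does a crude funnel check by comparing residual spread in the lower and upper halves of $X$.

```python
import numpy as np

# Synthetic data that satisfies the assumptions: linear, constant variance
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 200)

# Fit Y = b0 + b1*X by least squares (polyfit returns the slope first)
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Algebraic properties of OLS residuals: mean ~ 0, uncorrelated with X
print(abs(residuals.mean()) < 1e-8)                 # True
print(abs(np.corrcoef(x, residuals)[0, 1]) < 1e-8)  # True

# Crude homoscedasticity check: residual spread in low-X vs. high-X half
low = residuals[x < np.median(x)].std()
high = residuals[x >= np.median(x)].std()
print(round(low, 1), round(high, 1))  # similar widths suggest no funnel
```

For formal versions of these checks (Shapiro-Wilk, Breusch-Pagan, Cook's distance) you would typically reach for scipy or statsmodels rather than hand-rolling them.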

Formula

Slope

The slope $b_1$ is calculated using the same quantities as the Pearson correlation:

$$b_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = r \cdot \frac{s_Y}{s_X}$$

Where $r$ is the Pearson correlation, and $s_X$ and $s_Y$ are the standard deviations of $X$ and $Y$.

Intercept

$$b_0 = \bar{Y} - b_1 \bar{X}$$

The intercept ensures the regression line passes through the point $(\bar{X}, \bar{Y})$.
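As a quick numerical check (a sketch using the study-hours data from the worked example later on this page; the array names are ours), both forms of the slope formula agree, and the fitted line passes through $(\bar{X}, \bar{Y})$:

```python
import numpy as np

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)        # study hours
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)  # exam scores

# Form 1: sum of cross-products over sum of squares
b1_sums = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

# Form 2: r * (s_Y / s_X), algebraically identical
r = np.corrcoef(x, y)[0, 1]
b1_corr = r * y.std(ddof=1) / x.std(ddof=1)

b0 = y.mean() - b1_sums * x.mean()

print(round(b1_sums, 2), round(b1_corr, 2))  # 4.59 4.59
print(round(b0, 2))                          # 50.49
print(np.isclose(b0 + b1_sums * x.mean(), y.mean()))  # True: line hits the means
```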

Coefficient of Determination ($R^2$)

$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}} = \frac{SS_{regression}}{SS_{total}}$$

Where:

  • $SS_{total} = \sum(Y_i - \bar{Y})^2$ — total variability in $Y$
  • $SS_{regression} = \sum(\hat{Y}_i - \bar{Y})^2$ — variability explained by the model
  • $SS_{residual} = \sum(Y_i - \hat{Y}_i)^2$ — unexplained variability

In simple linear regression, $R^2 = r^2$ (the square of the Pearson correlation).
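The identity $R^2 = r^2$ is easy to confirm numerically. This sketch (worked-example data; the variable names are ours) computes $R^2$ from the sums of squares and compares it to the squared Pearson correlation:

```python
import numpy as np

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)

b1, b0 = np.polyfit(x, y, 1)  # least-squares slope and intercept
y_hat = b0 + b1 * x

ss_total = ((y - y.mean()) ** 2).sum()
ss_residual = ((y - y_hat) ** 2).sum()
r_squared = 1 - ss_residual / ss_total

r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_squared, r ** 2))  # True: R^2 equals r squared
print(round(r_squared, 2))            # 0.98
```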

Standard Error of the Estimate

$$S_e = \sqrt{\frac{SS_{residual}}{n - 2}}$$

This tells you the average distance of observed values from the regression line, in the units of $Y$.

Testing the Slope

To test whether $b_1$ is significantly different from zero:

$$t = \frac{b_1}{SE_{b_1}}$$

where $SE_{b_1} = \frac{S_e}{\sqrt{\sum(X_i - \bar{X})^2}}$, with $df = n - 2$.
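In code (a sketch on the worked-example data; for the p-value you would pass $t$ and $df$ to a t-distribution, e.g. scipy.stats.t, omitted here), the slope test looks like this. A useful cross-check is the identity $t = r\sqrt{n-2}/\sqrt{1-r^2}$, the same $t$ used to test a Pearson correlation:

```python
import numpy as np

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

s_e = np.sqrt((residuals ** 2).sum() / (n - 2))     # standard error of estimate
se_b1 = s_e / np.sqrt(((x - x.mean()) ** 2).sum())  # standard error of the slope
t = b1 / se_b1

# Cross-check via the correlation form of the same test statistic
r = np.corrcoef(x, y)[0, 1]
t_from_r = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

print(round(s_e, 2))                    # 1.82
print(round(t, 2), round(t_from_r, 2))  # 14.57 14.57, with df = 4
```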

Worked Example

Scenario: An educational researcher wants to predict final exam scores ($Y$) from the number of hours spent studying ($X$) for $n = 6$ students.

| Student | Study Hours ($X$) | Exam Score ($Y$) |
| --- | --- | --- |
| 1 | 2 | 58 |
| 2 | 4 | 70 |
| 3 | 5 | 74 |
| 4 | 6 | 80 |
| 5 | 8 | 85 |
| 6 | 9 | 92 |

Step 1: Compute the means.

$$\bar{X} = \frac{2+4+5+6+8+9}{6} = \frac{34}{6} = 5.667$$

$$\bar{Y} = \frac{58+70+74+80+85+92}{6} = \frac{459}{6} = 76.5$$

Step 2: Compute the required sums.

| $X_i - \bar{X}$ | $Y_i - \bar{Y}$ | $(X_i - \bar{X})(Y_i - \bar{Y})$ | $(X_i - \bar{X})^2$ |
| --- | --- | --- | --- |
| -3.667 | -18.5 | 67.83 | 13.44 |
| -1.667 | -6.5 | 10.83 | 2.78 |
| -0.667 | -2.5 | 1.67 | 0.44 |
| 0.333 | 3.5 | 1.17 | 0.11 |
| 2.333 | 8.5 | 19.83 | 5.44 |
| 3.333 | 15.5 | 51.67 | 11.11 |

$$\sum(X_i - \bar{X})(Y_i - \bar{Y}) = 153.00$$

$$\sum(X_i - \bar{X})^2 = 33.33$$

Step 3: Calculate the slope.

$$b_1 = \frac{153.00}{33.33} = 4.59$$

For every additional hour of studying, the predicted exam score increases by about 4.59 points.

Step 4: Calculate the intercept.

$$b_0 = 76.5 - (4.59)(5.667) = 76.5 - 26.01 = 50.49$$

Step 5: Write the regression equation.

$$\hat{Y} = 50.49 + 4.59X$$

Step 6: Calculate $R^2$.

$$SS_{total} = \sum(Y_i - \bar{Y})^2 = 18.5^2 + 6.5^2 + 2.5^2 + 3.5^2 + 8.5^2 + 15.5^2 = 715.50$$

Predicted values and residuals:

| $X$ | $Y$ | $\hat{Y}$ | $Y - \hat{Y}$ | $(Y - \hat{Y})^2$ |
| --- | --- | --- | --- | --- |
| 2 | 58 | 59.67 | -1.67 | 2.79 |
| 4 | 70 | 68.85 | 1.15 | 1.32 |
| 5 | 74 | 73.44 | 0.56 | 0.31 |
| 6 | 80 | 78.03 | 1.97 | 3.88 |
| 8 | 85 | 87.21 | -2.21 | 4.88 |
| 9 | 92 | 91.80 | 0.20 | 0.04 |

$$SS_{residual} = 13.23$$

$$R^2 = 1 - \frac{13.23}{715.50} = 1 - 0.018 = 0.982$$

About 98% of the variance in exam scores is explained by study hours.
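The whole worked example can be verified in a few lines of NumPy (a sketch; the array names are ours):

```python
import numpy as np

hours = np.array([2, 4, 5, 6, 8, 9], dtype=float)
scores = np.array([58, 70, 74, 80, 85, 92], dtype=float)

# Step 2: the required sums
sxy = ((hours - hours.mean()) * (scores - scores.mean())).sum()  # 153.00
sxx = ((hours - hours.mean()) ** 2).sum()                        # 33.33

# Steps 3-5: slope, intercept, regression equation
b1 = sxy / sxx                          # 4.59
b0 = scores.mean() - b1 * hours.mean()  # 50.49

# Step 6: R^2 from the sums of squares
predicted = b0 + b1 * hours
ss_residual = ((scores - predicted) ** 2).sum()   # 13.23
ss_total = ((scores - scores.mean()) ** 2).sum()  # 715.50
r_squared = 1 - ss_residual / ss_total

print(round(b1, 2), round(b0, 2), round(r_squared, 2))  # 4.59 50.49 0.98
```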

Interpretation

The regression equation $\hat{Y} = 50.49 + 4.59X$ tells us:

  • Intercept ($b_0 = 50.49$): A student who studies zero hours is predicted to score about 50.5 on the exam. (Note: interpret the intercept cautiously if $X = 0$ falls outside your data range.)
  • Slope ($b_1 = 4.59$): Each additional hour of studying is associated with a 4.59-point increase in the predicted exam score.
  • $R^2 = .98$: Study hours account for 98% of the variability in exam scores in this sample — an exceptionally strong relationship (likely inflated by the small sample).

What $R^2$ Does and Does Not Tell You

  • $R^2$ tells you the proportion of variance explained but not whether the model is correctly specified.
  • A high $R^2$ does not mean the relationship is causal.
  • A low $R^2$ does not mean the predictor is unimportant — it may explain a small but theoretically meaningful portion of variance.

Common Mistakes

  1. Extrapolating beyond the data range. The regression equation is only valid within the range of observed $X$ values. Predicting exam scores for someone who studied 20 hours when your data range from 2 to 9 is unreliable.

  2. Ignoring residual plots. Looking only at $R^2$ without checking residual plots can hide violated assumptions. Always plot residuals vs. predicted values and inspect a Q-Q plot.

  3. Confusing correlation with prediction. A significant correlation does not automatically mean predictions are useful. Check the standard error of the estimate to gauge prediction accuracy.

  4. Interpreting the intercept literally when $X = 0$ is meaningless. If your predictor is "years of experience" and no one in your sample has zero years, the intercept is a mathematical anchor, not a meaningful prediction.

  5. Assuming causation. Regression shows association. Without random assignment and experimental control, you cannot claim $X$ causes changes in $Y$.

  6. Ignoring influential observations. One outlier can dramatically change the slope. Always check Cook's distance and leverage values.

  7. Not reporting the standard error of the estimate. $R^2$ alone does not tell the reader how precise your predictions are. $S_e$ provides the average prediction error in the units of $Y$.

How to Run It

In R:

```r
# Simple linear regression in R
model <- lm(exam_score ~ study_hours, data = mydata)
summary(model)

# Confidence intervals for coefficients
confint(model)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
```

In Python:

```python
import statsmodels.api as sm

# Fit the model
X = sm.add_constant(df['study_hours'])  # adds intercept
model = sm.OLS(df['exam_score'], X).fit()
print(model.summary())

# Using pingouin
import pingouin as pg
result = pg.linear_regression(df[['study_hours']], df['exam_score'])
print(result)
```
In SPSS:

  1. Go to Analyze > Regression > Linear
  2. Move your dependent variable (e.g., Exam Score) into the Dependent box
  3. Move your independent variable (e.g., Study Hours) into the Independent(s) box
  4. Click Statistics and check Estimates, Model fit, and Descriptives
  5. Click Plots and add a scatter plot of *ZRESID vs. *ZPRED to check assumptions
  6. Click OK

SPSS outputs a Model Summary (R, R²), ANOVA table (F-test for the overall model), and Coefficients table (b, SE, Beta, t, p for each predictor).

In Excel:

  1. Go to Data > Data Analysis > Regression (requires the Analysis ToolPak)
  2. Set Input Y Range to your dependent variable column
  3. Set Input X Range to your independent variable column
  4. Check Labels if your first row contains headers
  5. Click OK

Excel outputs R², the ANOVA table, and coefficients with standard errors, t-statistics, and p-values.

For a quick slope and intercept only: =SLOPE(y_range, x_range) and =INTERCEPT(y_range, x_range).
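For readers working in Python instead, a sketch of the same quick calculation (np.polyfit with degree 1 plays the role of SLOPE and INTERCEPT; the data are hard-coded from the worked example):

```python
import numpy as np

study_hours = [2, 4, 5, 6, 8, 9]
exam_scores = [58, 70, 74, 80, 85, 92]

# Degree-1 polyfit returns (slope, intercept), like =SLOPE() and =INTERCEPT()
slope, intercept = np.polyfit(study_hours, exam_scores, 1)
print(round(slope, 2), round(intercept, 2))  # 4.59 50.49
```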

How to Report in APA Format

> A simple linear regression was conducted to predict exam score from hours of study. Hours of study significantly predicted exam scores, $F(1, 4) = 212.33$, $p < .001$, $R^2 = .98$. For each additional hour of study, exam scores increased by 4.59 points ($b = 4.59$, $SE = 0.32$, $\beta = .99$, $p < .001$). The regression equation was: predicted exam score $= 50.49 + 4.59 \times$ study hours.

Key elements to include:

  • The $F$-test for the overall model with degrees of freedom
  • $R^2$ (and adjusted $R^2$ if reporting multiple regression)
  • Unstandardized coefficient ($b$), its standard error, standardized coefficient ($\beta$), and p-value
  • The regression equation in words or symbols

Ready to calculate?

Now that you understand the concept, use the free Subthesis Research Tools on Subthesis to run your own analysis.


Related Concepts

Pearson Correlation

Learn how to calculate and interpret the Pearson correlation coefficient (r) to measure the strength and direction of linear relationships between two variables.

Effect Size

Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.

Descriptive Statistics

Master descriptive statistics: learn about mean, median, mode, standard deviation, variance, and range. Know when to use each measure for your research data.

Multiple Linear Regression

Learn how to conduct and interpret multiple linear regression: predict a continuous outcome from two or more predictor variables, assess model fit with R-squared, and check for multicollinearity.

Logistic Regression

Learn how to conduct and interpret binary logistic regression: predict a dichotomous outcome from one or more predictors, calculate odds ratios, and assess model fit.

© 2026 Angel Reyes / Subthesis. All rights reserved.
