Simple Linear Regression
What Is Simple Linear Regression?
Simple linear regression is a method for modelling the relationship between a single predictor variable ($X$) and a continuous outcome variable ($Y$) by fitting a straight line through the data. The goal is to find the line that minimizes the total squared distance between the observed data points and the predicted values on the line.
The regression equation takes the form:

$$\hat{Y} = b_0 + b_1 X$$

Where:
- $\hat{Y}$ (Y-hat) is the predicted value of the outcome
- $b_0$ is the y-intercept — the predicted value of $Y$ when $X = 0$
- $b_1$ is the slope — the predicted change in $Y$ for each one-unit increase in $X$
While Pearson correlation tells you the strength and direction of a linear association, regression goes further by giving you a prediction equation. Correlation asks "Are these related?" Regression asks "By how much does $Y$ change when $X$ changes, and can I predict $Y$ from $X$?"
When to Use It
Use simple linear regression when:
- You have one continuous predictor and one continuous outcome.
- You want to predict the value of the outcome from the predictor (e.g., predicting exam score from study hours).
- You want to quantify the rate of change — how much the outcome changes per unit change in the predictor.
- You have a theoretical reason to treat one variable as the predictor and the other as the outcome.
If you have multiple predictors, you need multiple linear regression. If the outcome is categorical (e.g., pass/fail), you need logistic regression.
Assumptions
Simple linear regression requires the following assumptions. Violations can lead to biased coefficients, incorrect p-values, or poor predictions.
- Linearity. The relationship between $X$ and $Y$ is linear. Check by inspecting a scatter plot of $Y$ vs. $X$ and a residual plot (residuals vs. predicted values). If you see a curve, consider transforming variables or using polynomial regression.
- Independence of residuals. Each observation is independent of the others. This is violated in time-series or clustered data (e.g., students nested in classrooms).
- Homoscedasticity. The variance of the residuals is constant across all levels of $X$. In the residual plot, the spread of points should be roughly the same width throughout. A "funnel" shape indicates heteroscedasticity.
- Normality of residuals. The residuals (not the raw variables) should be approximately normally distributed. Check with a Q-Q plot or a Shapiro-Wilk test on the residuals. With large samples (roughly $n > 30$), regression is fairly robust to this.
- No significant outliers or influential points. Extreme values can drag the regression line. Use Cook's distance ($D > 1$ is concerning) and leverage values to identify influential cases.
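These checks can be sketched in Python. This is a minimal illustration on synthetic data (the variable names and the generated values are placeholders, not part of the worked example below):

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 50)
y = 50 + 4.5 * x + rng.normal(0, 3, 50)

# Fit by ordinary least squares
fit = stats.linregress(x, y)
fitted = fit.intercept + fit.slope * x
residuals = y - fitted

# Normality: Shapiro-Wilk on the residuals, not the raw variables
w_stat, w_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {w_p:.3f}")

# Linearity and homoscedasticity: inspect residuals vs. fitted values
# (uncomment to plot)
# import matplotlib.pyplot as plt
# plt.scatter(fitted, residuals); plt.axhline(0, color="gray"); plt.show()
```

A flat, evenly spread residual cloud is consistent with linearity and homoscedasticity; a curve or funnel is not.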
Formula
Slope
The slope is calculated using the same quantities as the Pearson correlation:

$$b_1 = r\,\frac{s_Y}{s_X} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$

Where $r$ is the Pearson correlation, and $s_Y$ and $s_X$ are the standard deviations of $Y$ and $X$.
Intercept
$$b_0 = \bar{Y} - b_1\bar{X}$$

The intercept ensures the regression line passes through the point $(\bar{X}, \bar{Y})$.
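As a quick sketch, the two equivalent forms of the slope can be computed directly (using the worked-example data from later in this article):

```python
import numpy as np

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)

# Form 1: slope from the correlation and standard deviations
r = np.corrcoef(x, y)[0, 1]
b1 = r * (np.std(y, ddof=1) / np.std(x, ddof=1))

# Form 2: slope from the deviation sums
b1_sums = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Intercept: forces the line through (x-bar, y-bar)
b0 = y.mean() - b1 * x.mean()

print(round(b1, 2), round(b0, 2))  # 4.59 50.49
```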
Coefficient of Determination (R²)
$$R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$$

Where:
- $SS_{\text{total}} = \sum (Y - \bar{Y})^2$ — total variability in $Y$
- $SS_{\text{regression}} = \sum (\hat{Y} - \bar{Y})^2$ — variability explained by the model
- $SS_{\text{residual}} = \sum (Y - \hat{Y})^2$ — unexplained variability

In simple linear regression, $R^2 = r^2$ (the square of the Pearson correlation).
Standard Error of the Estimate
$$s_{\text{est}} = \sqrt{\frac{SS_{\text{residual}}}{n - 2}}$$

This tells you the average distance of observed $Y$ values from the regression line, in the units of $Y$.
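A short sketch of the sums-of-squares decomposition and the standard error of the estimate (again using the worked-example data from the next section):

```python
import numpy as np

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)     # least-squares slope and intercept
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)     # total variability in Y
ss_residual = np.sum((y - y_hat) ** 2)     # unexplained variability
ss_regression = ss_total - ss_residual     # explained by the model

r_squared = 1 - ss_residual / ss_total
s_est = np.sqrt(ss_residual / (n - 2))     # standard error of the estimate

print(round(r_squared, 3), round(s_est, 2))  # 0.982 1.82
```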
Testing the Slope
To test whether $b_1$ is significantly different from zero:

$$t = \frac{b_1}{SE_{b_1}}$$

where $SE_{b_1} = \dfrac{s_{\text{est}}}{\sqrt{\sum (X - \bar{X})^2}}$, with $df = n - 2$.
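The slope test can be sketched as follows (same example data; scipy's `linregress` reports a matching standard error):

```python
import numpy as np
from scipy import stats

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)
s_est = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# SE of the slope: s_est divided by the square root of the X deviation sum
se_b1 = s_est / np.sqrt(np.sum((x - x.mean()) ** 2))
t = b1 / se_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p-value
print(f"t({n - 2}) = {t:.2f}, p = {p:.4f}")
```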
Worked Example
Scenario: An educational researcher wants to predict final exam scores ($Y$) from the number of hours spent studying ($X$) for $n = 6$ students.
| Student | Study Hours ($X$) | Exam Score ($Y$) |
|---|---|---|
| 1 | 2 | 58 |
| 2 | 4 | 70 |
| 3 | 5 | 74 |
| 4 | 6 | 80 |
| 5 | 8 | 85 |
| 6 | 9 | 92 |
Step 1: Compute the means.

$$\bar{X} = \frac{34}{6} \approx 5.67 \qquad \bar{Y} = \frac{459}{6} = 76.5$$

Step 2: Compute the required sums.

$$\sum (X - \bar{X})(Y - \bar{Y}) = 153.0 \qquad \sum (X - \bar{X})^2 = 33.33$$

Step 3: Calculate the slope.

$$b_1 = \frac{153.0}{33.33} \approx 4.59$$

For every additional hour of studying, the predicted exam score increases by about 4.59 points.

Step 4: Calculate the intercept.

$$b_0 = \bar{Y} - b_1\bar{X} = 76.5 - 4.59(5.667) \approx 50.49$$

Step 5: Write the regression equation.

$$\hat{Y} = 50.49 + 4.59X$$

Step 6: Calculate $R^2$.
Predicted values and residuals:

| $X$ | $Y$ | $\hat{Y}$ | $Y - \hat{Y}$ | $(Y - \hat{Y})^2$ |
|---|---|---|---|---|
| 2 | 58 | 59.67 | −1.67 | 2.79 |
| 4 | 70 | 68.85 | 1.15 | 1.32 |
| 5 | 74 | 73.44 | 0.56 | 0.31 |
| 6 | 80 | 78.03 | 1.97 | 3.88 |
| 8 | 85 | 87.21 | −2.21 | 4.88 |
| 9 | 92 | 91.80 | 0.20 | 0.04 |

$$SS_{\text{residual}} \approx 13.22 \qquad SS_{\text{total}} = 715.5 \qquad R^2 = 1 - \frac{13.22}{715.5} \approx .98$$
About 98% of the variance in exam scores is explained by study hours.
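The hand calculations above can be verified in a few lines:

```python
import numpy as np

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)

# Slope and intercept from the deviation sums (Steps 2-4)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# R-squared from the residuals (Step 6)
y_hat = b0 + b1 * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(round(b1, 2), round(b0, 2), round(r_squared, 2))  # 4.59 50.49 0.98
```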
Interpretation
The regression equation $\hat{Y} = 50.49 + 4.59X$ tells us:
- Intercept ($b_0 = 50.49$): A student who studies zero hours is predicted to score about 50.5 on the exam. (Note: interpret the intercept cautiously if $X = 0$ falls outside your data range.)
- Slope ($b_1 = 4.59$): Each additional hour of studying is associated with a 4.59-point increase in the predicted exam score.
- $R^2 = .98$: Study hours account for 98% of the variability in exam scores in this sample — an exceptionally strong relationship (likely inflated by the small sample).
What R² Does and Does Not Tell You
- $R^2$ tells you the proportion of variance explained but not whether the model is correctly specified.
- A high $R^2$ does not mean the relationship is causal.
- A low $R^2$ does not mean the predictor is unimportant — it may explain a small but theoretically meaningful portion of variance.
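A small synthetic demonstration of the first point: fitting a straight line to data generated from a quadratic trend can still yield a high R²; the misspecification only shows up in the residuals.

```python
import numpy as np

# Synthetic data: the true relationship is quadratic, not linear
rng = np.random.default_rng(42)
x = np.linspace(1, 10, 101)
y = x ** 2 + rng.normal(0, 2, 101)

# Fit a (misspecified) straight line anyway
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
residuals = y - y_hat
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(f"R-squared = {r_squared:.3f}")  # high, despite the wrong functional form
# The residuals curve systematically: positive at both ends of the X range,
# negative in the middle, which a residual plot would reveal immediately.
```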
Common Mistakes
- Extrapolating beyond the data range. The regression equation is only valid within the range of observed $X$ values. Predicting exam scores for someone who studied 20 hours when your data range from 2 to 9 is unreliable.
- Ignoring residual plots. Looking only at $R^2$ without checking residual plots can hide violated assumptions. Always plot residuals vs. predicted values and inspect a Q-Q plot.
- Confusing correlation with prediction. A significant correlation does not automatically mean predictions are useful. Check the standard error of the estimate to gauge prediction accuracy.
- Interpreting the intercept literally when $X = 0$ is meaningless. If your predictor is "years of experience" and no one in your sample has zero years, the intercept is a mathematical anchor, not a meaningful prediction.
- Assuming causation. Regression shows association. Without random assignment and experimental control, you cannot claim that $X$ causes changes in $Y$.
- Ignoring influential observations. One outlier can dramatically change the slope. Always check Cook's distance and leverage values.
- Not reporting the standard error of the estimate. $R^2$ alone does not tell the reader how precise your predictions are. $s_{\text{est}}$ provides the average prediction error in the units of $Y$.
How to Run It
In R:

```r
# Assumes a data frame with the worked-example data
df <- data.frame(study_hours = c(2, 4, 5, 6, 8, 9),
                 exam_score  = c(58, 70, 74, 80, 85, 92))

# Fit the model
model <- lm(exam_score ~ study_hours, data = df)
summary(model)

# Confidence intervals for coefficients
confint(model)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
```
In Python:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({'study_hours': [2, 4, 5, 6, 8, 9],
                   'exam_score': [58, 70, 74, 80, 85, 92]})

# Fit the model
X = sm.add_constant(df['study_hours'])  # adds intercept
model = sm.OLS(df['exam_score'], X).fit()
print(model.summary())

# Using pingouin
import pingouin as pg
result = pg.linear_regression(df[['study_hours']], df['exam_score'])
print(result)
```
In SPSS:

1. Go to Analyze > Regression > Linear
2. Move your dependent variable (e.g., Exam Score) into the Dependent box
3. Move your independent variable (e.g., Study Hours) into the Independent(s) box
4. Click Statistics and check Estimates, Model fit, and Descriptives
5. Click Plots and add a scatter plot of *ZRESID vs. *ZPRED to check assumptions
6. Click OK
SPSS outputs a Model Summary (R, R²), ANOVA table (F-test for the overall model), and Coefficients table (b, SE, Beta, t, p for each predictor).
In Excel:

1. Go to Data > Data Analysis > Regression (requires the Analysis ToolPak)
2. Set Input Y Range to your dependent variable column
3. Set Input X Range to your independent variable column
4. Check Labels if your first row contains headers
5. Click OK
Excel outputs R², the ANOVA table, and coefficients with standard errors, t-statistics, and p-values.
For a quick slope and intercept only: =SLOPE(y_range, x_range) and =INTERCEPT(y_range, x_range).
Ready to calculate?
Now that you understand the concept, use the free Subthesis Research Tools on Subthesis to run your own analysis.
Related Concepts
Pearson Correlation
Learn how to calculate and interpret the Pearson correlation coefficient (r) to measure the strength and direction of linear relationships between two variables.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.
Descriptive Statistics
Master descriptive statistics: learn about mean, median, mode, standard deviation, variance, and range. Know when to use each measure for your research data.
Multiple Linear Regression
Learn how to conduct and interpret multiple linear regression: predict a continuous outcome from two or more predictor variables, assess model fit with R-squared, and check for multicollinearity.
Logistic Regression
Learn how to conduct and interpret binary logistic regression: predict a dichotomous outcome from one or more predictors, calculate odds ratios, and assess model fit.