Simple Linear Regression
What Is Simple Linear Regression?
Simple linear regression is a method for modelling the relationship between a single predictor variable ($X$) and a continuous outcome variable ($Y$) by fitting a straight line through the data. The goal is to find the line that minimizes the total squared distance between the observed data points and the predicted values on the line.
The regression equation takes the form:

$$\hat{Y} = b_0 + b_1 X$$

Where:
- $\hat{Y}$ (Y-hat) is the predicted value of the outcome
- $b_0$ is the y-intercept — the predicted value of $Y$ when $X = 0$
- $b_1$ is the slope — the predicted change in $Y$ for each one-unit increase in $X$
While Pearson correlation tells you the strength and direction of a linear association, regression goes further by giving you a prediction equation. Correlation asks "Are these related?" Regression asks "By how much does $Y$ change when $X$ changes, and can I predict $Y$ from $X$?"
When to Use It
Use simple linear regression when:
- You have one continuous predictor and one continuous outcome.
- You want to predict the value of the outcome from the predictor (e.g., predicting exam score from study hours).
- You want to quantify the rate of change — how much the outcome changes per unit change in the predictor.
- You have a theoretical reason to treat one variable as the predictor and the other as the outcome.
If you have multiple predictors, you need multiple linear regression. If the outcome is categorical (e.g., pass/fail), you need logistic regression.
Assumptions
Simple linear regression requires the following assumptions. Violations can lead to biased coefficients, incorrect p-values, or poor predictions.
- Linearity. The relationship between $X$ and $Y$ is linear. Check by inspecting a scatter plot of $Y$ vs. $X$ and a residual plot (residuals vs. predicted values). If you see a curve, consider transforming variables or using polynomial regression.
- Independence of residuals. Each observation is independent of the others. This is violated in time-series or clustered data (e.g., students nested in classrooms).
- Homoscedasticity. The variance of the residuals is constant across all levels of $X$. In the residual plot, the spread of points should be roughly the same width throughout. A "funnel" shape indicates heteroscedasticity.
- Normality of residuals. The residuals (not the raw variables) should be approximately normally distributed. Check with a Q-Q plot or a Shapiro-Wilk test on the residuals. With large samples (roughly $n > 30$), regression is fairly robust to this.
- No significant outliers or influential points. Extreme values can drag the regression line. Use Cook's distance ($D > 1$ is concerning) and leverage values to identify influential cases.
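These checks can be sketched in Python. This is a minimal illustration on synthetic data (the variable names and the generated values are placeholders, not part of the worked example below):

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 50)
y = 50 + 4.5 * x + rng.normal(0, 3, 50)

# Fit by ordinary least squares
fit = stats.linregress(x, y)
fitted = fit.intercept + fit.slope * x
residuals = y - fitted

# Normality: Shapiro-Wilk on the residuals, not the raw variables
w_stat, w_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {w_p:.3f}")

# Linearity and homoscedasticity: inspect residuals vs. fitted values
# (uncomment to plot)
# import matplotlib.pyplot as plt
# plt.scatter(fitted, residuals); plt.axhline(0, color="gray"); plt.show()
```

A flat, evenly spread residual cloud is consistent with linearity and homoscedasticity; a curve or funnel is not.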
Formula
Slope
The slope is calculated using the same quantities as the Pearson correlation:

$$b_1 = r\,\frac{s_Y}{s_X} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$

Where $r$ is the Pearson correlation, and $s_Y$ and $s_X$ are the standard deviations of $Y$ and $X$.
Intercept
$$b_0 = \bar{Y} - b_1\bar{X}$$

The intercept ensures the regression line passes through the point $(\bar{X}, \bar{Y})$.
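As a quick sketch, the two equivalent forms of the slope can be computed directly (using the worked-example data from later in this article):

```python
import numpy as np

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)

# Form 1: slope from the correlation and standard deviations
r = np.corrcoef(x, y)[0, 1]
b1 = r * (np.std(y, ddof=1) / np.std(x, ddof=1))

# Form 2: slope from the deviation sums
b1_sums = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Intercept: forces the line through (x-bar, y-bar)
b0 = y.mean() - b1 * x.mean()

print(round(b1, 2), round(b0, 2))  # 4.59 50.49
```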
Coefficient of Determination (R²)
$$R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$$

Where:
- $SS_{\text{total}} = \sum (Y - \bar{Y})^2$ — total variability in $Y$
- $SS_{\text{regression}} = \sum (\hat{Y} - \bar{Y})^2$ — variability explained by the model
- $SS_{\text{residual}} = \sum (Y - \hat{Y})^2$ — unexplained variability

In simple linear regression, $R^2 = r^2$ (the square of the Pearson correlation).
Standard Error of the Estimate
$$s_{\text{est}} = \sqrt{\frac{SS_{\text{residual}}}{n - 2}}$$

This tells you the average distance of observed $Y$ values from the regression line, in the units of $Y$.
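A short sketch of the sums-of-squares decomposition and the standard error of the estimate (again using the worked-example data from the next section):

```python
import numpy as np

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)     # least-squares slope and intercept
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)     # total variability in Y
ss_residual = np.sum((y - y_hat) ** 2)     # unexplained variability
ss_regression = ss_total - ss_residual     # explained by the model

r_squared = 1 - ss_residual / ss_total
s_est = np.sqrt(ss_residual / (n - 2))     # standard error of the estimate

print(round(r_squared, 3), round(s_est, 2))  # 0.982 1.82
```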
Testing the Slope
To test whether $b_1$ is significantly different from zero:

$$t = \frac{b_1}{SE_{b_1}}$$

where $SE_{b_1} = \dfrac{s_{\text{est}}}{\sqrt{\sum (X - \bar{X})^2}}$, with $df = n - 2$.
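The slope test can be sketched as follows (same example data; scipy's `linregress` reports a matching standard error):

```python
import numpy as np
from scipy import stats

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)
s_est = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# SE of the slope: s_est divided by the square root of the X deviation sum
se_b1 = s_est / np.sqrt(np.sum((x - x.mean()) ** 2))
t = b1 / se_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p-value
print(f"t({n - 2}) = {t:.2f}, p = {p:.4f}")
```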
Worked Example
Scenario: An educational researcher wants to predict final exam scores ($Y$) from the number of hours spent studying ($X$) for $n = 6$ students.
| Student | Study Hours ($X$) | Exam Score ($Y$) |
|---|---|---|
| 1 | 2 | 58 |
| 2 | 4 | 70 |
| 3 | 5 | 74 |
| 4 | 6 | 80 |
| 5 | 8 | 85 |
| 6 | 9 | 92 |
Step 1: Compute the means.

$$\bar{X} = \frac{34}{6} \approx 5.67 \qquad \bar{Y} = \frac{459}{6} = 76.5$$

Step 2: Compute the required sums.

$$\sum (X - \bar{X})(Y - \bar{Y}) = 153.0 \qquad \sum (X - \bar{X})^2 = 33.33$$

Step 3: Calculate the slope.

$$b_1 = \frac{153.0}{33.33} \approx 4.59$$

For every additional hour of studying, the predicted exam score increases by about 4.59 points.

Step 4: Calculate the intercept.

$$b_0 = \bar{Y} - b_1\bar{X} = 76.5 - 4.59(5.667) \approx 50.49$$

Step 5: Write the regression equation.

$$\hat{Y} = 50.49 + 4.59X$$

Step 6: Calculate $R^2$.
Predicted values and residuals:

| $X$ | $Y$ | $\hat{Y}$ | $Y - \hat{Y}$ | $(Y - \hat{Y})^2$ |
|---|---|---|---|---|
| 2 | 58 | 59.67 | −1.67 | 2.79 |
| 4 | 70 | 68.85 | 1.15 | 1.32 |
| 5 | 74 | 73.44 | 0.56 | 0.31 |
| 6 | 80 | 78.03 | 1.97 | 3.88 |
| 8 | 85 | 87.21 | −2.21 | 4.88 |
| 9 | 92 | 91.80 | 0.20 | 0.04 |

$$SS_{\text{residual}} \approx 13.22 \qquad SS_{\text{total}} = 715.5 \qquad R^2 = 1 - \frac{13.22}{715.5} \approx .98$$
About 98% of the variance in exam scores is explained by study hours.
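The hand calculations above can be verified in a few lines:

```python
import numpy as np

x = np.array([2, 4, 5, 6, 8, 9], dtype=float)
y = np.array([58, 70, 74, 80, 85, 92], dtype=float)

# Slope and intercept from the deviation sums (Steps 2-4)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# R-squared from the residuals (Step 6)
y_hat = b0 + b1 * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(round(b1, 2), round(b0, 2), round(r_squared, 2))  # 4.59 50.49 0.98
```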
Interpretation
The regression equation $\hat{Y} = 50.49 + 4.59X$ tells us:
- Intercept ($b_0 = 50.49$): A student who studies zero hours is predicted to score about 50.5 on the exam. (Note: interpret the intercept cautiously if $X = 0$ falls outside your data range.)
- Slope ($b_1 = 4.59$): Each additional hour of studying is associated with a 4.59-point increase in the predicted exam score.
- $R^2 = .98$: Study hours account for 98% of the variability in exam scores in this sample — an exceptionally strong relationship (likely inflated by the small sample).
What R² Does and Does Not Tell You
- $R^2$ tells you the proportion of variance explained but not whether the model is correctly specified.
- A high $R^2$ does not mean the relationship is causal.
- A low $R^2$ does not mean the predictor is unimportant — it may explain a small but theoretically meaningful portion of variance.
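A small synthetic demonstration of the first point: fitting a straight line to data generated from a quadratic trend can still yield a high R²; the misspecification only shows up in the residuals.

```python
import numpy as np

# Synthetic data: the true relationship is quadratic, not linear
rng = np.random.default_rng(42)
x = np.linspace(1, 10, 101)
y = x ** 2 + rng.normal(0, 2, 101)

# Fit a (misspecified) straight line anyway
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
residuals = y - y_hat
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(f"R-squared = {r_squared:.3f}")  # high, despite the wrong functional form
# The residuals curve systematically: positive at both ends of the X range,
# negative in the middle, which a residual plot would reveal immediately.
```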
Common Mistakes
- Extrapolating beyond the data range. The regression equation is only valid within the range of observed $X$ values. Predicting exam scores for someone who studied 20 hours when your data range from 2 to 9 is unreliable.
- Ignoring residual plots. Looking only at $R^2$ without checking residual plots can hide violated assumptions. Always plot residuals vs. predicted values and inspect a Q-Q plot.
- Confusing correlation with prediction. A significant correlation does not automatically mean predictions are useful. Check the standard error of the estimate to gauge prediction accuracy.
- Interpreting the intercept literally when $X = 0$ is meaningless. If your predictor is "years of experience" and no one in your sample has zero years, the intercept is a mathematical anchor, not a meaningful prediction.
- Assuming causation. Regression shows association. Without random assignment and experimental control, you cannot claim that $X$ causes changes in $Y$.
- Ignoring influential observations. One outlier can dramatically change the slope. Always check Cook's distance and leverage values.
- Not reporting the standard error of the estimate. $R^2$ alone does not tell the reader how precise your predictions are. $s_{\text{est}}$ provides the average prediction error in the units of $Y$.
How to Run It
In R:

```r
# Assumes a data frame with the worked-example data
df <- data.frame(study_hours = c(2, 4, 5, 6, 8, 9),
                 exam_score  = c(58, 70, 74, 80, 85, 92))

# Fit the model
model <- lm(exam_score ~ study_hours, data = df)
summary(model)

# Confidence intervals for coefficients
confint(model)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
```
In Python:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({'study_hours': [2, 4, 5, 6, 8, 9],
                   'exam_score': [58, 70, 74, 80, 85, 92]})

# Fit the model
X = sm.add_constant(df['study_hours'])  # adds intercept
model = sm.OLS(df['exam_score'], X).fit()
print(model.summary())

# Using pingouin
import pingouin as pg
result = pg.linear_regression(df[['study_hours']], df['exam_score'])
print(result)
```
In SPSS:

1. Go to Analyze > Regression > Linear
2. Move your dependent variable (e.g., Exam Score) into the Dependent box
3. Move your independent variable (e.g., Study Hours) into the Independent(s) box
4. Click Statistics and check Estimates, Model fit, and Descriptives
5. Click Plots and add a scatter plot of *ZRESID vs. *ZPRED to check assumptions
6. Click OK
SPSS outputs a Model Summary (R, R²), ANOVA table (F-test for the overall model), and Coefficients table (b, SE, Beta, t, p for each predictor).
In Excel:

1. Go to Data > Data Analysis > Regression (requires the Analysis ToolPak)
2. Set Input Y Range to your dependent variable column
3. Set Input X Range to your independent variable column
4. Check Labels if your first row contains headers
5. Click OK
Excel outputs R², the ANOVA table, and coefficients with standard errors, t-statistics, and p-values.
For a quick slope and intercept only: =SLOPE(y_range, x_range) and =INTERCEPT(y_range, x_range).
Ready to calculate?
Now that you understand the concept, use the free Subthesis Research Tools on Subthesis to run your own analysis.
Related Concepts
Pearson Correlation
Learn how to calculate and interpret the Pearson correlation coefficient (r) to measure the strength and direction of linear relationships between two variables.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.
Descriptive Statistics
Master descriptive statistics: learn about mean, median, mode, standard deviation, variance, and range. Know when to use each measure for your research data.
Multiple Linear Regression
Learn how to conduct and interpret multiple linear regression: predict a continuous outcome from two or more predictor variables, assess model fit with R-squared, and check for multicollinearity.
Logistic Regression
Learn how to conduct and interpret binary logistic regression: predict a dichotomous outcome from one or more predictors, calculate odds ratios, and assess model fit.