Pearson Correlation
Pearson Product-Moment Correlation
What Is Pearson Correlation?
The Pearson product-moment correlation coefficient, denoted $r$, quantifies the strength and direction of the linear relationship between two continuous variables. It answers the question: "As one variable increases, does the other tend to increase (positive), decrease (negative), or show no consistent pattern?"
The value of $r$ ranges from $-1$ to $+1$:
| Value of $r$ | Interpretation |
|---|---|
| $+1$ | Perfect positive linear relationship |
| $+.7$ to $+.9$ | Strong positive |
| $+.4$ to $+.6$ | Moderate positive |
| $+.1$ to $+.3$ | Weak positive |
| $0$ | No linear relationship |
| $-.1$ to $-.3$ | Weak negative |
| $-.4$ to $-.6$ | Moderate negative |
| $-.7$ to $-.9$ | Strong negative |
| $-1$ | Perfect negative linear relationship |
It is important to remember that $r$ measures linear relationships only. Two variables can have a strong curvilinear relationship yet produce an $r$ near zero.
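This caveat is easy to demonstrate. In the sketch below (illustrative data, using scipy), $y$ is a perfect function of $x$, yet $r$ is essentially zero because the relationship is a parabola, not a line:

```python
from scipy import stats
import numpy as np

# y depends entirely on x, but the dependence is curved, not linear.
x = np.arange(-5, 6)   # -5, -4, ..., 5
y = x ** 2

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}")  # essentially zero despite perfect dependence
```

A scatter plot would reveal the U-shape immediately, which is why plotting the data first is non-negotiable.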
When to Use It
Use the Pearson correlation when:
- You have two continuous variables measured on interval or ratio scales (e.g., test scores, age, income, reaction time).
- You want to describe the direction and strength of a relationship rather than predict one variable from another (for prediction, use regression).
- Your data appear to follow a roughly linear pattern when plotted in a scatter plot.
If one or both variables are ordinal (e.g., Likert-scale items, ranked data), consider Spearman's rank correlation ($\rho$) instead.
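The two coefficients can disagree noticeably when a relationship is monotonic but curved. A quick sketch with made-up data (scipy): Spearman works on ranks, so it scores a perfectly monotonic curve as a perfect association, while Pearson penalizes the curvature:

```python
from scipy import stats
import numpy as np

x = np.arange(1, 11)
y = x ** 3  # perfectly monotonic, but curved

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)
print(f"Pearson r     = {r_pearson:.3f}")   # less than 1: penalized for curvature
print(f"Spearman rho  = {r_spearman:.3f}")  # ranks agree perfectly
```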
Assumptions
Before computing , verify these assumptions:
- Level of measurement. Both variables must be continuous (interval or ratio).
- Linearity. The relationship between the two variables should be linear. Always inspect a scatter plot first.
- Bivariate normality. The pair of variables should be approximately normally distributed. With large samples (e.g., $n \ge 30$), the test is robust to moderate violations.
- No significant outliers. Outliers can dramatically inflate or deflate $r$. A single extreme point can turn a weak correlation into a strong one (or vice versa).
- Homoscedasticity. The spread of one variable should be roughly constant across levels of the other variable. A "funnel" shape in the scatter plot signals a violation.
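The outlier assumption in particular is worth seeing in action. In this sketch (fabricated data for illustration, using scipy), ten points with essentially no linear relationship are turned into a "strong" correlation by one extreme observation:

```python
from scipy import stats

# Ten points with essentially no linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 3, 6, 4, 5, 6, 4, 5, 3, 6]

r_clean, _ = stats.pearsonr(x, y)

# Add one extreme point and recompute.
r_outlier, _ = stats.pearsonr(x + [30], y + [40])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

This is why the advice to run the analysis with and without suspect points (see Common Mistakes) matters in practice.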
Formula
The Pearson correlation coefficient is calculated as:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Where:
- $x_i$ and $y_i$ are individual data points
- $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$
- $n$ is the number of paired observations
The numerator is the sum of cross-products of deviations, which captures how $X$ and $Y$ co-vary. The denominator standardizes this quantity by the total variability in each variable.
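The deviation-score formula translates directly into code. A minimal sketch (plain Python, no libraries), applied to the sleep/stress data from the worked example below:

```python
import math

def pearson_r(x, y):
    """Pearson r via the deviation-score formula: sum of cross-products
    of deviations, divided by the product of the deviation norms."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - my) ** 2 for yi in y))
    return num / den

sleep  = [5, 6, 7, 6, 8, 7, 9, 8]
stress = [38, 32, 25, 30, 20, 28, 15, 22]
print(round(pearson_r(sleep, stress), 3))  # -0.986
```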
Coefficient of Determination
The square of the correlation, $r^2$, is called the coefficient of determination. It tells you the proportion of variance in one variable that is explained by (shared with) the other.
For example, if $r = .60$, then $r^2 = .36$, meaning 36% of the variance in $Y$ is accounted for by its linear relationship with $X$.
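"Variance explained" is not just a figure of speech: for a simple linear fit, $r^2$ equals the regression $R^2 = 1 - SS_{res}/SS_{tot}$. A quick numerical check (numpy, using the worked-example data below):

```python
import numpy as np

x = np.array([5, 6, 7, 6, 8, 7, 9, 8], dtype=float)
y = np.array([38, 32, 25, 30, 20, 28, 15, 22], dtype=float)

r = np.corrcoef(x, y)[0, 1]

# R^2 from a least-squares line: 1 - SS_residual / SS_total
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r_squared_fit = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(f"r^2 = {r**2:.3f}, regression R^2 = {r_squared_fit:.3f}")
```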
Testing Significance
To test whether $r$ is significantly different from zero, compute the $t$-statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

This follows a $t$-distribution with $df = n - 2$. If $|t|$ exceeds the critical value, you reject $H_0$.
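The test statistic is simple enough to compute by hand; the sketch below does so (scipy for the $t$-distribution, with the $r$ and $n$ from the worked example that follows used as illustrative inputs):

```python
import math
from scipy import stats

r, n = -0.986, 8  # illustrative values (from the worked example below)

# t-statistic for H0: rho = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# two-tailed p-value from the t-distribution with n - 2 df
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, p = {p:.6f}")
```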
Worked Example
Scenario: A health psychologist wants to know whether the number of hours of sleep per night is associated with self-reported stress scores (0--50 scale) among university students. She collects data from $n = 8$ participants.
| Participant | Hours of Sleep ($X$) | Stress Score ($Y$) |
|---|---|---|
| 1 | 5 | 38 |
| 2 | 6 | 32 |
| 3 | 7 | 25 |
| 4 | 6 | 30 |
| 5 | 8 | 20 |
| 6 | 7 | 28 |
| 7 | 9 | 15 |
| 8 | 8 | 22 |
Step 1: Compute the means.

$$\bar{x} = \frac{56}{8} = 7.00 \qquad \bar{y} = \frac{210}{8} = 26.25$$

Step 2: Compute deviations and products.

| $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|---|
| $-2$ | $11.75$ | $-23.50$ | $4$ | $138.0625$ |
| $-1$ | $5.75$ | $-5.75$ | $1$ | $33.0625$ |
| $0$ | $-1.25$ | $0.00$ | $0$ | $1.5625$ |
| $-1$ | $3.75$ | $-3.75$ | $1$ | $14.0625$ |
| $1$ | $-6.25$ | $-6.25$ | $1$ | $39.0625$ |
| $0$ | $1.75$ | $0.00$ | $0$ | $3.0625$ |
| $2$ | $-11.25$ | $-22.50$ | $4$ | $126.5625$ |
| $1$ | $-4.25$ | $-4.25$ | $1$ | $18.0625$ |

Step 3: Sum the columns.

$$\sum(x_i - \bar{x})(y_i - \bar{y}) = -66.00 \qquad \sum(x_i - \bar{x})^2 = 12 \qquad \sum(y_i - \bar{y})^2 = 373.50$$

Step 4: Compute $r$.

$$r = \frac{-66.00}{\sqrt{12 \times 373.50}} = \frac{-66.00}{\sqrt{4482}} = \frac{-66.00}{66.95} \approx -.99$$
Step 5: Test significance.

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.9858\sqrt{6}}{\sqrt{1 - 0.9718}} \approx -14.4$$

With $df = 6$ and a critical $t$ of $2.447$ at $\alpha = .05$ (two-tailed), $|t| = 14.4$ far exceeds the critical value, so we reject $H_0$ ($p < .001$).
Step 6: Coefficient of determination.

$$r^2 = (-0.9858)^2 \approx .97$$

About 97% of the variance in stress scores is explained by hours of sleep in this sample.
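The hand calculation above can be verified in a few lines (scipy, same eight data pairs):

```python
from scipy import stats

sleep  = [5, 6, 7, 6, 8, 7, 9, 8]
stress = [38, 32, 25, 30, 20, 28, 15, 22]

r, p = stats.pearsonr(sleep, stress)
print(f"r   = {r:.4f}")   # matches the hand-computed value
print(f"r^2 = {r**2:.4f}")
print(f"p   = {p:.6f}")
```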
Interpretation
The results show a strong negative correlation between hours of sleep and stress scores, $r(6) = -.99$, $p < .001$. As sleep increases, stress decreases in a nearly perfect linear pattern. The coefficient of determination ($r^2 = .97$) indicates that 97% of the variability in stress scores can be accounted for by the linear relationship with sleep hours.
Keep in mind:
- Correlation does not imply causation. We cannot conclude that more sleep causes lower stress. A third variable (e.g., workload) might drive both.
- Strength benchmarks are field-dependent. In psychology, an $r$ of .30 can be practically meaningful; in physics, an $r$ of .90 might be considered weak.
- $r$ only captures linear relationships. If the scatter plot shows a curve, consider polynomial regression or Spearman's correlation.
Common Mistakes
- Assuming causation. Correlation quantifies association, not cause and effect. Always consider confounding variables and the direction-of-causality problem.
- Ignoring outliers. A single extreme data point can flip the sign or magnitude of $r$. Always inspect scatter plots and consider running analyses with and without potential outliers.
- Restricting the range. If you only sample from a narrow range of one variable (e.g., only high-achieving students), $r$ will be artificially attenuated. This is called range restriction.
- Confusing $r$ with $r^2$. Reporting $r = .50$ as "50% of variance explained" is wrong. The correct figure is $r^2 = .25$, or 25%.
- Using Pearson's $r$ with non-linear data. If the scatter plot shows a curved pattern, $r$ will underestimate the true strength of the relationship. Use a non-linear method or transform the data.
- Using Pearson's $r$ with ordinal data. Likert-scale items (e.g., 1--5 agreement ratings) are ordinal. Use Spearman's $\rho$ or polychoric correlation instead.
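Range restriction is easy to simulate. In this sketch (numpy, synthetic data with a fixed seed), a strong full-range correlation shrinks substantially once sampling is limited to the top of the $x$ distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Strongly correlated bivariate data over the full range.
x = rng.uniform(0, 100, 500)
y = x + rng.normal(0, 15, 500)

r_full = np.corrcoef(x, y)[0, 1]

# Keep only the top of the x distribution (e.g., "high achievers").
mask = x > 70
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"full range: r = {r_full:.2f}")
print(f"restricted: r = {r_restricted:.2f}")
```

The attenuation happens because truncating $x$ shrinks its variance while the noise in $y$ stays the same size.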
How to Run It
In R:

```r
# Correlation matrix for multiple variables
cor(mydata[, c("sleep", "stress", "gpa")], use = "complete.obs")

# Visualize with a scatter plot
plot(mydata$sleep, mydata$stress, xlab = "Sleep", ylab = "Stress")
abline(lm(stress ~ sleep, data = mydata))
```
In Python:

```python
from scipy import stats
import pandas as pd
# Pearson correlation with p-value
r, p = stats.pearsonr(df['sleep'], df['stress'])
print(f"r = {r:.3f}, p = {p:.4f}")
# Correlation matrix
print(df[['sleep', 'stress', 'gpa']].corr())
# Using pingouin for detailed output
import pingouin as pg
result = pg.corr(df['sleep'], df['stress'], method='pearson')
print(result)
```
In SPSS:
1. Go to Analyze > Correlate > Bivariate.
2. Move your variables into the Variables box.
3. Ensure Pearson is checked under Correlation Coefficients.
4. Check Flag significant correlations.
5. Click OK.
SPSS produces a correlation matrix with r-values, p-values, and sample sizes. Significant correlations are flagged with asterisks (* p < .05, ** p < .01).
In Excel, for the correlation coefficient, use the CORREL function:
=CORREL(array1, array2)
Example: =CORREL(A2:A9, B2:B9)
For a full correlation matrix, use Data > Data Analysis > Correlation (requires the Analysis ToolPak add-in).
Excel does not provide a p-value for correlations. To test significance, compute the t-statistic manually: =r*SQRT((n-2)/(1-r^2)) and use =T.DIST.2T(ABS(t), n-2) for the p-value.
Ready to calculate?
Now that you understand the concept, use the free Subthesis Research Tools on Subthesis to run your own analysis.
Related Concepts
Simple Linear Regression
Master simple linear regression: learn how to predict a continuous outcome from one predictor variable, interpret slope, intercept, and R-squared values.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.
Descriptive Statistics
Master descriptive statistics: learn about mean, median, mode, standard deviation, variance, and range. Know when to use each measure for your research data.