Pearson Correlation
Pearson Product-Moment Correlation
What Is Pearson Correlation?
The Pearson product-moment correlation coefficient, denoted $r$, quantifies the strength and direction of the linear relationship between two continuous variables. It answers the question: "As one variable increases, does the other tend to increase (positive), decrease (negative), or show no consistent pattern?"
The value of $r$ ranges from $-1$ to $+1$:
| Value of $r$ | Interpretation |
|---|---|
| $+1$ | Perfect positive linear relationship |
| $+.7$ to $+.9$ | Strong positive |
| $+.4$ to $+.6$ | Moderate positive |
| $+.1$ to $+.3$ | Weak positive |
| $0$ | No linear relationship |
| $-.1$ to $-.3$ | Weak negative |
| $-.4$ to $-.6$ | Moderate negative |
| $-.7$ to $-.9$ | Strong negative |
| $-1$ | Perfect negative linear relationship |
It is important to remember that $r$ measures linear relationships only. Two variables can have a strong curvilinear relationship yet produce an $r$ near zero.
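This caveat is easy to demonstrate. In the sketch below (illustrative data, using scipy), $y$ is a perfect function of $x$, yet $r$ is essentially zero because the relationship is a parabola, not a line:

```python
from scipy import stats
import numpy as np

# y depends entirely on x, but the dependence is curved, not linear.
x = np.arange(-5, 6)   # -5, -4, ..., 5
y = x ** 2

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}")  # essentially zero despite perfect dependence
```

A scatter plot would reveal the U-shape immediately, which is why plotting the data first is non-negotiable.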
When to Use It
Use the Pearson correlation when:
- You have two continuous variables measured on interval or ratio scales (e.g., test scores, age, income, reaction time).
- You want to describe the direction and strength of a relationship rather than predict one variable from another (for prediction, use regression).
- Your data appear to follow a roughly linear pattern when plotted in a scatter plot.
If one or both variables are ordinal (e.g., Likert-scale items, ranked data), consider Spearman's rank correlation ($\rho$) instead.
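The two coefficients can disagree noticeably when a relationship is monotonic but curved. A quick sketch with made-up data (scipy): Spearman works on ranks, so it scores a perfectly monotonic curve as a perfect association, while Pearson penalizes the curvature:

```python
from scipy import stats
import numpy as np

x = np.arange(1, 11)
y = x ** 3  # perfectly monotonic, but curved

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)
print(f"Pearson r     = {r_pearson:.3f}")   # less than 1: penalized for curvature
print(f"Spearman rho  = {r_spearman:.3f}")  # ranks agree perfectly
```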
Assumptions
Before computing , verify these assumptions:
- Level of measurement. Both variables must be continuous (interval or ratio).
- Linearity. The relationship between the two variables should be linear. Always inspect a scatter plot first.
- Bivariate normality. The pair of variables should be approximately normally distributed. With large samples (e.g., $n \ge 30$), the test is robust to moderate violations.
- No significant outliers. Outliers can dramatically inflate or deflate $r$. A single extreme point can turn a weak correlation into a strong one (or vice versa).
- Homoscedasticity. The spread of one variable should be roughly constant across levels of the other variable. A "funnel" shape in the scatter plot signals a violation.
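The outlier assumption in particular is worth seeing in action. In this sketch (fabricated data for illustration, using scipy), ten points with essentially no linear relationship are turned into a "strong" correlation by one extreme observation:

```python
from scipy import stats

# Ten points with essentially no linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 3, 6, 4, 5, 6, 4, 5, 3, 6]

r_clean, _ = stats.pearsonr(x, y)

# Add one extreme point and recompute.
r_outlier, _ = stats.pearsonr(x + [30], y + [40])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

This is why the advice to run the analysis with and without suspect points (see Common Mistakes) matters in practice.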
Formula
The Pearson correlation coefficient is calculated as:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Where:
- $x_i$ and $y_i$ are individual data points
- $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$
- $n$ is the number of paired observations
The numerator is the sum of cross-products of deviations, which captures how $X$ and $Y$ co-vary. The denominator standardizes this quantity by the total variability in each variable.
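The deviation-score formula translates directly into code. A minimal sketch (plain Python, no libraries), applied to the sleep/stress data from the worked example below:

```python
import math

def pearson_r(x, y):
    """Pearson r via the deviation-score formula: sum of cross-products
    of deviations, divided by the product of the deviation norms."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - my) ** 2 for yi in y))
    return num / den

sleep  = [5, 6, 7, 6, 8, 7, 9, 8]
stress = [38, 32, 25, 30, 20, 28, 15, 22]
print(round(pearson_r(sleep, stress), 3))  # -0.986
```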
Coefficient of Determination
The square of the correlation, $r^2$, is called the coefficient of determination. It tells you the proportion of variance in one variable that is explained by (shared with) the other.
For example, if $r = .60$, then $r^2 = .36$, meaning 36% of the variance in $Y$ is accounted for by its linear relationship with $X$.
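"Variance explained" is not just a figure of speech: for a simple linear fit, $r^2$ equals the regression $R^2 = 1 - SS_{res}/SS_{tot}$. A quick numerical check (numpy, using the worked-example data below):

```python
import numpy as np

x = np.array([5, 6, 7, 6, 8, 7, 9, 8], dtype=float)
y = np.array([38, 32, 25, 30, 20, 28, 15, 22], dtype=float)

r = np.corrcoef(x, y)[0, 1]

# R^2 from a least-squares line: 1 - SS_residual / SS_total
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r_squared_fit = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(f"r^2 = {r**2:.3f}, regression R^2 = {r_squared_fit:.3f}")
```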
Testing Significance
To test whether $r$ is significantly different from zero, compute the $t$-statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

This follows a $t$-distribution with $df = n - 2$. If $|t|$ exceeds the critical value, you reject $H_0$.
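The test statistic is simple enough to compute by hand; the sketch below does so (scipy for the $t$-distribution, with the $r$ and $n$ from the worked example that follows used as illustrative inputs):

```python
import math
from scipy import stats

r, n = -0.986, 8  # illustrative values (from the worked example below)

# t-statistic for H0: rho = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# two-tailed p-value from the t-distribution with n - 2 df
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, p = {p:.6f}")
```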
Worked Example
Scenario: A health psychologist wants to know whether the number of hours of sleep per night is associated with self-reported stress scores (0--50 scale) among university students. She collects data from $n = 8$ participants.
| Participant | Hours of Sleep ($X$) | Stress Score ($Y$) |
|---|---|---|
| 1 | 5 | 38 |
| 2 | 6 | 32 |
| 3 | 7 | 25 |
| 4 | 6 | 30 |
| 5 | 8 | 20 |
| 6 | 7 | 28 |
| 7 | 9 | 15 |
| 8 | 8 | 22 |
Step 1: Compute the means.

$$\bar{x} = \frac{56}{8} = 7.00 \qquad \bar{y} = \frac{210}{8} = 26.25$$

Step 2: Compute deviations and products.

| $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|---|
| $-2$ | $11.75$ | $-23.50$ | $4$ | $138.0625$ |
| $-1$ | $5.75$ | $-5.75$ | $1$ | $33.0625$ |
| $0$ | $-1.25$ | $0.00$ | $0$ | $1.5625$ |
| $-1$ | $3.75$ | $-3.75$ | $1$ | $14.0625$ |
| $1$ | $-6.25$ | $-6.25$ | $1$ | $39.0625$ |
| $0$ | $1.75$ | $0.00$ | $0$ | $3.0625$ |
| $2$ | $-11.25$ | $-22.50$ | $4$ | $126.5625$ |
| $1$ | $-4.25$ | $-4.25$ | $1$ | $18.0625$ |

Step 3: Sum the columns.

$$\sum(x_i - \bar{x})(y_i - \bar{y}) = -66.00 \qquad \sum(x_i - \bar{x})^2 = 12 \qquad \sum(y_i - \bar{y})^2 = 373.50$$

Step 4: Compute $r$.

$$r = \frac{-66.00}{\sqrt{12 \times 373.50}} = \frac{-66.00}{\sqrt{4482}} = \frac{-66.00}{66.95} \approx -.99$$
Step 5: Test significance.

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.9858\sqrt{6}}{\sqrt{1 - 0.9718}} \approx -14.4$$

With $df = 6$ and a critical $t$ of $2.447$ at $\alpha = .05$ (two-tailed), $|t| = 14.4$ far exceeds the critical value, so we reject $H_0$ ($p < .001$).
Step 6: Coefficient of determination.

$$r^2 = (-0.9858)^2 \approx .97$$

About 97% of the variance in stress scores is explained by hours of sleep in this sample.
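The hand calculation above can be verified in a few lines (scipy, same eight data pairs):

```python
from scipy import stats

sleep  = [5, 6, 7, 6, 8, 7, 9, 8]
stress = [38, 32, 25, 30, 20, 28, 15, 22]

r, p = stats.pearsonr(sleep, stress)
print(f"r   = {r:.4f}")   # matches the hand-computed value
print(f"r^2 = {r**2:.4f}")
print(f"p   = {p:.6f}")
```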
Interpretation
The results show a strong negative correlation between hours of sleep and stress scores, $r(6) = -.99$, $p < .001$. As sleep increases, stress decreases in a nearly perfect linear pattern. The coefficient of determination ($r^2 = .97$) indicates that 97% of the variability in stress scores can be accounted for by the linear relationship with sleep hours.
Keep in mind:
- Correlation does not imply causation. We cannot conclude that more sleep causes lower stress. A third variable (e.g., workload) might drive both.
- Strength benchmarks are field-dependent. In psychology, an $r$ of .30 can be practically meaningful; in physics, an $r$ of .90 might be considered weak.
- $r$ only captures linear relationships. If the scatter plot shows a curve, consider polynomial regression or Spearman's correlation.
Common Mistakes
- Assuming causation. Correlation quantifies association, not cause and effect. Always consider confounding variables and the direction-of-causality problem.
- Ignoring outliers. A single extreme data point can flip the sign or magnitude of $r$. Always inspect scatter plots and consider running analyses with and without potential outliers.
- Restricting the range. If you only sample from a narrow range of one variable (e.g., only high-achieving students), $r$ will be artificially attenuated. This is called range restriction.
- Confusing $r$ with $r^2$. Reporting $r = .50$ as "50% of variance explained" is wrong. The correct figure is $r^2 = .25$, or 25%.
- Using Pearson's $r$ with non-linear data. If the scatter plot shows a curved pattern, $r$ will underestimate the true strength of the relationship. Use a non-linear method or transform the data.
- Using Pearson's $r$ with ordinal data. Likert-scale items (e.g., 1--5 agreement ratings) are ordinal. Use Spearman's $\rho$ or polychoric correlation instead.
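Range restriction is easy to simulate. In this sketch (numpy, synthetic data with a fixed seed), a strong full-range correlation shrinks substantially once sampling is limited to the top of the $x$ distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Strongly correlated bivariate data over the full range.
x = rng.uniform(0, 100, 500)
y = x + rng.normal(0, 15, 500)

r_full = np.corrcoef(x, y)[0, 1]

# Keep only the top of the x distribution (e.g., "high achievers").
mask = x > 70
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"full range: r = {r_full:.2f}")
print(f"restricted: r = {r_restricted:.2f}")
```

The attenuation happens because truncating $x$ shrinks its variance while the noise in $y$ stays the same size.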
How to Run It
In R:

```r
# Correlation matrix for multiple variables
cor(mydata[, c("sleep", "stress", "gpa")], use = "complete.obs")

# Visualize with a scatter plot
plot(mydata$sleep, mydata$stress, xlab = "Sleep", ylab = "Stress")
abline(lm(stress ~ sleep, data = mydata))
```
In Python:

```python
from scipy import stats
import pandas as pd
# Pearson correlation with p-value
r, p = stats.pearsonr(df['sleep'], df['stress'])
print(f"r = {r:.3f}, p = {p:.4f}")
# Correlation matrix
print(df[['sleep', 'stress', 'gpa']].corr())
# Using pingouin for detailed output
import pingouin as pg
result = pg.corr(df['sleep'], df['stress'], method='pearson')
print(result)
```
In SPSS:
1. Go to Analyze > Correlate > Bivariate.
2. Move your variables into the Variables box.
3. Ensure Pearson is checked under Correlation Coefficients.
4. Check Flag significant correlations.
5. Click OK.
SPSS produces a correlation matrix with r-values, p-values, and sample sizes. Significant correlations are flagged with asterisks (* p < .05, ** p < .01).
In Excel, for the correlation coefficient, use the CORREL function:
=CORREL(array1, array2)
Example: =CORREL(A2:A9, B2:B9)
For a full correlation matrix, use Data > Data Analysis > Correlation (requires the Analysis ToolPak add-in).
Excel does not provide a p-value for correlations. To test significance, compute the t-statistic manually: =r*SQRT((n-2)/(1-r^2)) and use =T.DIST.2T(ABS(t), n-2) for the p-value.
Ready to calculate?
Now that you understand the concept, use the free Subthesis Research Tools on Subthesis to run your own analysis.
Related Concepts
Simple Linear Regression
Master simple linear regression: learn how to predict a continuous outcome from one predictor variable, interpret slope, intercept, and R-squared values.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.
Descriptive Statistics
Master descriptive statistics: learn about mean, median, mode, standard deviation, variance, and range. Know when to use each measure for your research data.