Stats for Scholars


Pearson Correlation

Beginner · Inferential Statistics

Pearson Product-Moment Correlation

Purpose
Measures the strength and direction of the linear relationship between two continuous variables.
When to Use
When you want to know whether two continuous variables are linearly associated and how strongly.
Data Type
Two continuous (interval or ratio) variables
Key Assumptions
Both variables are continuous, the relationship is linear, no significant outliers, bivariate normality, homoscedasticity.

What Is Pearson Correlation?

The Pearson product-moment correlation coefficient, denoted $r$, quantifies the strength and direction of the linear relationship between two continuous variables. It answers the question: "As one variable increases, does the other tend to increase (positive), decrease (negative), or show no consistent pattern?"

The value of $r$ ranges from $-1$ to $+1$:

| Value of $r$ | Interpretation |
|---|---|
| $+1.0$ | Perfect positive linear relationship |
| $+0.7$ to $+0.9$ | Strong positive |
| $+0.4$ to $+0.6$ | Moderate positive |
| $+0.1$ to $+0.3$ | Weak positive |
| $0.0$ | No linear relationship |
| $-0.1$ to $-0.3$ | Weak negative |
| $-0.4$ to $-0.6$ | Moderate negative |
| $-0.7$ to $-0.9$ | Strong negative |
| $-1.0$ | Perfect negative linear relationship |

It is important to remember that $r$ measures linear relationships only. Two variables can have a strong curvilinear relationship yet produce an $r$ near zero.
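This limitation is easy to demonstrate numerically: a perfect quadratic relationship can yield a correlation of exactly zero. A minimal sketch with NumPy (the variable names are illustrative, not from the article):

```python
import numpy as np

# y is a deterministic function of x, yet the *linear* correlation is zero
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2  # perfect curvilinear (quadratic) relationship

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")  # r = 0.000 despite a perfect functional relationship
```

Because the parabola is symmetric around the mean of $x$, the positive and negative cross-products cancel exactly, which is why a scatter plot should always be inspected before trusting $r$.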

When to Use It

Use the Pearson correlation when:

  • You have two continuous variables measured on interval or ratio scales (e.g., test scores, age, income, reaction time).
  • You want to describe the direction and strength of a relationship rather than predict one variable from another (for prediction, use regression).
  • Your data appear to follow a roughly linear pattern when plotted in a scatter plot.

If one or both variables are ordinal (e.g., Likert-scale items, ranked data), consider Spearman's rank correlation ($r_s$) instead.

Assumptions

Before computing $r$, verify these assumptions:

  1. Level of measurement. Both variables must be continuous (interval or ratio).
  2. Linearity. The relationship between the two variables should be linear. Always inspect a scatter plot first.
  3. Bivariate normality. The pair of variables should be approximately normally distributed. With large samples ($n > 30$), the test is robust to moderate violations.
  4. No significant outliers. Outliers can dramatically inflate or deflate $r$. A single extreme point can turn a weak correlation into a strong one (or vice versa).
  5. Homoscedasticity. The spread of one variable should be roughly constant across levels of the other variable. A "funnel" shape in the scatter plot signals a violation.
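The outlier assumption in particular is worth checking numerically, not just visually. In this sketch (toy numbers chosen for illustration, not real data), a single extreme point turns a moderate correlation into a near-perfect one:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 1.0, 4.0, 2.0, 5.0])
r_clean = np.corrcoef(x, y)[0, 1]            # moderate: r = 0.50

# Append one extreme point and recompute
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]  # inflated: r is roughly 0.98

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

This is why running the analysis with and without suspect points, as recommended under Common Mistakes below, is good practice.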

Formula

The Pearson correlation coefficient is calculated as:

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \cdot \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$

Where:

  • $X_i$ and $Y_i$ are individual data points
  • $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$
  • $n$ is the number of paired observations

The numerator is the sum of cross-products of deviations, which captures how $X$ and $Y$ co-vary. The denominator standardizes this quantity by the total variability in each variable.
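The formula translates directly into code. This sketch (the function name and toy data are illustrative) computes $r$ from the deviation sums and checks the result against NumPy's built-in:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r via the deviation-score formula."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    # numerator: sum of cross-products; denominator: sqrt of the product of sums of squares
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(round(pearson_r(x, y), 3))  # 0.775
assert np.isclose(pearson_r(x, y), np.corrcoef(x, y)[0, 1])
```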

Coefficient of Determination

The square of the correlation, $r^2$, is called the coefficient of determination. It tells you the proportion of variance in one variable that is explained by (shared with) the other.

$$r^2 = \text{proportion of shared variance}$$

For example, if $r = .60$, then $r^2 = .36$, meaning 36% of the variance in $Y$ is accounted for by its linear relationship with $X$.

Testing Significance

To test whether $r$ is significantly different from zero, compute the $t$-statistic:

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

This follows a $t$-distribution with $df = n - 2$. If $|t|$ exceeds the critical value, you reject $H_0: \rho = 0$.
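This test is straightforward to sketch in Python using scipy.stats for the $t$-distribution (the function name is illustrative; the numbers mirror the worked example that the article computes by hand):

```python
import numpy as np
from scipy import stats

def r_significance(r, n):
    """Two-tailed t-test of H0: rho = 0 for a Pearson r."""
    df = n - 2
    t = r * np.sqrt(df) / np.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df)  # two-tailed p-value from the t survival function
    return t, p

t, p = r_significance(-0.986, 8)
print(f"t(6) = {t:.2f}, p = {p:.6f}")
```

Note that computing from the unrounded $r$ gives a $t$ slightly different from a hand calculation that rounds $r^2$ along the way; the conclusion is unchanged.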

Worked Example

Scenario: A health psychologist wants to know whether the number of hours of sleep per night is associated with self-reported stress scores (0–50 scale) among university students. She collects data from $n = 8$ participants.

| Participant | Hours of Sleep ($X$) | Stress Score ($Y$) |
|---|---|---|
| 1 | 5 | 38 |
| 2 | 6 | 32 |
| 3 | 7 | 25 |
| 4 | 6 | 30 |
| 5 | 8 | 20 |
| 6 | 7 | 28 |
| 7 | 9 | 15 |
| 8 | 8 | 22 |

Step 1: Compute the means.

$$\bar{X} = \frac{5+6+7+6+8+7+9+8}{8} = \frac{56}{8} = 7.0$$

$$\bar{Y} = \frac{38+32+25+30+20+28+15+22}{8} = \frac{210}{8} = 26.25$$

Step 2: Compute deviations and products.

| $X_i - \bar{X}$ | $Y_i - \bar{Y}$ | $(X_i - \bar{X})(Y_i - \bar{Y})$ | $(X_i - \bar{X})^2$ | $(Y_i - \bar{Y})^2$ |
|---|---|---|---|---|
| $-2$ | $11.75$ | $-23.50$ | $4$ | $138.06$ |
| $-1$ | $5.75$ | $-5.75$ | $1$ | $33.06$ |
| $0$ | $-1.25$ | $0.00$ | $0$ | $1.56$ |
| $-1$ | $3.75$ | $-3.75$ | $1$ | $14.06$ |
| $1$ | $-6.25$ | $-6.25$ | $1$ | $39.06$ |
| $0$ | $1.75$ | $0.00$ | $0$ | $3.06$ |
| $2$ | $-11.25$ | $-22.50$ | $4$ | $126.56$ |
| $1$ | $-4.25$ | $-4.25$ | $1$ | $18.06$ |

Step 3: Sum the columns.

$$\sum(X_i - \bar{X})(Y_i - \bar{Y}) = -66.00$$

$$\sum(X_i - \bar{X})^2 = 12.00$$

$$\sum(Y_i - \bar{Y})^2 = 373.50$$

Step 4: Compute $r$.

$$r = \frac{-66.00}{\sqrt{12.00 \times 373.50}} = \frac{-66.00}{\sqrt{4482.00}} = \frac{-66.00}{66.95} = -0.986$$

Step 5: Test significance.

$$t = \frac{-0.986\sqrt{8 - 2}}{\sqrt{1 - 0.972}} = \frac{-0.986 \times 2.449}{\sqrt{0.028}} = \frac{-2.415}{0.167} = -14.46$$

With $df = 6$ and a critical $t$ of $\pm 2.447$ at $\alpha = .05$ (two-tailed), $|t| = 14.46$ far exceeds the critical value, so $p < .001$.

Step 6: Coefficient of determination.

$$r^2 = (-0.986)^2 = 0.972$$

About 97% of the variance in stress scores is explained by hours of sleep in this sample.
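The entire hand calculation can be verified in one call with scipy.stats.pearsonr:

```python
from scipy import stats

sleep = [5, 6, 7, 6, 8, 7, 9, 8]
stress = [38, 32, 25, 30, 20, 28, 15, 22]

r, p = stats.pearsonr(sleep, stress)
print(f"r = {r:.3f}, p = {p:.5f}")  # r = -0.986, matching the hand calculation
```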

Interpretation

The results show a strong negative correlation between hours of sleep and stress scores, $r = -.99$. As sleep increases, stress decreases in a nearly perfect linear pattern. The coefficient of determination ($r^2 = .97$) indicates that 97% of the variability in stress scores can be accounted for by the linear relationship with sleep hours.

Keep in mind:

  • Correlation does not imply causation. We cannot conclude that more sleep causes lower stress. A third variable (e.g., workload) might drive both.
  • Strength benchmarks are field-dependent. In psychology, $r = .30$ can be practically meaningful; in physics, $r = .90$ might be considered weak.
  • $r$ only captures linear relationships. If the scatter plot shows a curve, consider polynomial regression or Spearman's correlation.

Common Mistakes

  1. Assuming causation. Correlation quantifies association, not cause and effect. Always consider confounding variables and the direction-of-causality problem.
  2. Ignoring outliers. A single extreme data point can flip the sign or magnitude of $r$. Always inspect scatter plots and consider running analyses with and without potential outliers.
  3. Restricting the range. If you only sample from a narrow range of one variable (e.g., only high-achieving students), $r$ will be artificially attenuated. This is called range restriction.
  4. Confusing $r$ with $r^2$. Reporting $r = .50$ as "50% of variance explained" is wrong. The correct figure is $r^2 = .25$, or 25%.
  5. Using Pearson's $r$ with non-linear data. If the scatter plot shows a curved pattern, $r$ will underestimate the true strength of the relationship. Use a non-linear method or transform the data.
  6. Using Pearson's $r$ with ordinal data. Likert-scale items (e.g., 1–5 agreement ratings) are ordinal. Use Spearman's $r_s$ or polychoric correlation instead.
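Range restriction (mistake 3) is easy to see in simulation. In this sketch (simulated data with a fixed seed; the coefficients are arbitrary choices), keeping only the high end of $x$ visibly attenuates the correlation:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=2000)
y = 0.8 * x + 0.6 * rng.normal(size=2000)  # built-in linear relationship, true r = .80

r_full = np.corrcoef(x, y)[0, 1]

# Keep only the top of the x range, as if sampling only high achievers
mask = x > 1.0
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")  # noticeably smaller
```

The relationship between the variables has not changed; only the variance of $x$ available to the analysis has, which is exactly why narrow sampling understates $r$.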

How to Run It

In R

```r
# Pearson correlation in R
cor.test(mydata$sleep, mydata$stress, method = "pearson")

# Correlation matrix for multiple variables
cor(mydata[, c("sleep", "stress", "gpa")], use = "complete.obs")

# Visualize with a scatter plot
plot(mydata$sleep, mydata$stress, xlab = "Sleep", ylab = "Stress")
abline(lm(stress ~ sleep, data = mydata))
```

In Python

```python
from scipy import stats
import pandas as pd

# Pearson correlation with p-value
r, p = stats.pearsonr(df['sleep'], df['stress'])
print(f"r = {r:.3f}, p = {p:.4f}")

# Correlation matrix
print(df[['sleep', 'stress', 'gpa']].corr())

# Using pingouin for detailed output
import pingouin as pg
result = pg.corr(df['sleep'], df['stress'], method='pearson')
print(result)
```

In SPSS
  1. Go to Analyze > Correlate > Bivariate
  2. Move your variables into the Variables box
  3. Ensure Pearson is checked under Correlation Coefficients
  4. Check Flag significant correlations
  5. Click OK

SPSS produces a correlation matrix with r-values, p-values, and sample sizes. Significant correlations are flagged with asterisks (* p < .05, ** p < .01).

In Excel

For the correlation coefficient:

=CORREL(array1, array2)

Example: =CORREL(A2:A9, B2:B9)

For a full correlation matrix, use Data > Data Analysis > Correlation (requires the Analysis ToolPak add-in).

Excel does not provide a p-value for correlations. To test significance, compute the t-statistic manually: =r*SQRT((n-2)/(1-r^2)) and use =T.DIST.2T(ABS(t), n-2) for the p-value.

How to Report in APA Format

For a significant correlation:

> A Pearson correlation indicated a strong negative relationship between hours of sleep and stress scores, $r(6) = -.99$, $p < .001$, $r^2 = .97$.

For a non-significant correlation:

> There was no significant linear relationship between study hours and GPA, $r(48) = .12$, $p = .41$.

Note that the degrees of freedom in parentheses after $r$ equal $n - 2$. Always report the effect size ($r$ itself serves as the effect size, or report $r^2$ for variance explained).

Ready to calculate?

Now that you understand the concept, use the free Research Tools on Subthesis to run your own analysis.

Explore Research Tools on Subthesis

Related Concepts

Simple Linear Regression

Master simple linear regression: learn how to predict a continuous outcome from one predictor variable, interpret slope, intercept, and R-squared values.

Effect Size

Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.

Descriptive Statistics

Master descriptive statistics: learn about mean, median, mode, standard deviation, variance, and range. Know when to use each measure for your research data.

© 2026 Angel Reyes / Subthesis. All rights reserved.