Descriptive Statistics
What Are Descriptive Statistics?
Descriptive statistics are numerical summaries that describe, organize, and simplify a dataset. They reduce a collection of raw numbers into a few meaningful values that capture the essential features of your data: where the center is, how spread out the values are, and what the distribution looks like.
Every research paper, thesis, or report begins with descriptive statistics. Before you run a t-test, ANOVA, or regression, you must first describe your data. Descriptive statistics serve two purposes:
- Summarize the data for your readers so they understand what you measured and what the values look like.
- Check assumptions for inferential tests — many statistical procedures require data that are approximately normally distributed, free of outliers, and sufficiently variable.
Descriptive statistics fall into two broad families:
- Measures of central tendency — Where is the center of the distribution? (Mean, median, mode)
- Measures of variability (dispersion) — How spread out are the values? (Standard deviation, variance, range, interquartile range)
When to Use It
Descriptive statistics are used in every quantitative study. Specifically:
- Report means and standard deviations for all continuous variables in your study (APA 7th edition requires this).
- Use medians and interquartile ranges when your data are skewed or contain outliers.
- Report frequencies and percentages for categorical variables (e.g., 58% female, 42% male).
- Include a descriptive statistics table in your results section before presenting any inferential analyses.
Assumptions
Descriptive statistics themselves have minimal assumptions, but choosing the right measure depends on the characteristics of your data:
- Level of measurement. The mean requires interval or ratio data. The median requires at least ordinal data. The mode can be used with any level of measurement.
- Distribution shape. The mean is appropriate for roughly symmetric distributions. For highly skewed data, the median is a better measure of center.
- Outliers. The mean is sensitive to extreme values; the median is resistant. If you have outliers, report both and explain the discrepancy.
Formula
Measures of Central Tendency
Mean (Arithmetic Average)
The sum of all values divided by the number of values:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The mean uses every data point, which makes it the most informative measure of center for symmetric data — but also the most sensitive to outliers.
Median
The middle value when data are ordered from lowest to highest. For $n$ ordered values $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$:
- If $n$ is odd, the median is the value at position $(n+1)/2$.
- If $n$ is even, the median is the average of the values at positions $n/2$ and $n/2 + 1$.
The median is resistant to outliers and is the preferred measure of center for skewed distributions (e.g., income, reaction times).
Mode
The most frequently occurring value. A distribution can be:
- Unimodal — one mode
- Bimodal — two modes (suggesting two subgroups)
- Multimodal — three or more modes
The mode is the only measure of central tendency that can be used with nominal data (e.g., the most common political affiliation).
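All three measures of center are available in Python's standard library; a minimal sketch on a small made-up dataset:

```python
import statistics

# Mean, median, and mode of a small illustrative dataset
scores = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(scores)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median = statistics.median(scores)  # middle two values: (3 + 5) / 2 = 4.0
mode = statistics.mode(scores)      # 3 occurs most often

# statistics.multimode returns every mode, useful for bimodal data
modes = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2]

print(mean, median, mode, modes)
```

Note that `statistics.mode` raises an error on some older Python versions when there is a tie, so `multimode` is the safer choice for possibly bimodal data.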
Measures of Variability
Range
The simplest measure of spread:

$$\text{Range} = x_{\max} - x_{\min}$$
The range uses only the two most extreme values, making it highly sensitive to outliers and unstable across samples.
Variance
The average squared deviation from the mean. For a sample:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
We divide by $n - 1$ (not $n$) to correct for the bias that occurs when estimating the population variance from a sample. This is called Bessel's correction.
Variance is measured in squared units, which makes it difficult to interpret directly (e.g., "squared years" or "squared points"). That is why we typically take the square root to get the standard deviation.
Standard Deviation
The square root of the variance:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The standard deviation is expressed in the same units as the original data, making it far more interpretable than the variance. It tells you, on average, how far each data point falls from the mean.
Interquartile Range (IQR)
The range of the middle 50% of the data:

$$\text{IQR} = Q_3 - Q_1$$

where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile. Like the median, the IQR is resistant to outliers and is preferred for skewed distributions.
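The spread measures above can be computed with the standard library. Quartile conventions differ across packages; this sketch takes the quartiles as medians of the lower and upper halves (excluding the overall median when $n$ is odd):

```python
import statistics

scores = [4, 8, 6, 5, 3, 7, 9]
ordered = sorted(scores)           # [3, 4, 5, 6, 7, 8, 9]

rng = ordered[-1] - ordered[0]     # 9 - 3 = 6
var = statistics.variance(scores)  # sample variance, divides by n - 1
sd = statistics.stdev(scores)      # square root of the variance

# Quartiles as medians of the halves (middle value 6 excluded)
half = len(ordered) // 2
q1 = statistics.median(ordered[:half])   # median of [3, 4, 5] = 4
q3 = statistics.median(ordered[-half:])  # median of [7, 8, 9] = 8
iqr = q3 - q1                            # 8 - 4 = 4

print(rng, var, round(sd, 2), iqr)
```

`statistics.quantiles` uses an interpolation method by default and can return slightly different quartiles for small samples, which is worth noting when comparing output across tools.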
The Normal Distribution
Many inferential statistics assume that data follow a normal (Gaussian) distribution — the familiar bell curve. Key properties:
In a normal distribution:
- About 68% of values fall within $1$ standard deviation of the mean ($\mu \pm 1\sigma$)
- About 95% of values fall within $2$ standard deviations ($\mu \pm 2\sigma$)
- About 99.7% of values fall within $3$ standard deviations ($\mu \pm 3\sigma$)
This is known as the 68-95-99.7 rule (or the empirical rule).
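The empirical rule can be verified by simulation; a sketch that draws a large sample from a normal distribution (the mean and SD chosen here are arbitrary) and counts the share of values within 1, 2, and 3 SDs:

```python
import random
import statistics

# Simulate a normal sample and check the 68-95-99.7 rule
random.seed(42)
data = [random.gauss(100, 15) for _ in range(50_000)]

mu = statistics.mean(data)
sigma = statistics.stdev(data)

shares = {}
for k in (1, 2, 3):
    shares[k] = sum(abs(x - mu) <= k * sigma for x in data) / len(data)
    print(f"within {k} SD: {shares[k]:.1%}")
```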
You can assess normality using:
- Visual methods: Histograms, Q-Q plots, box plots
- Statistical tests: Shapiro-Wilk test (best for small samples, roughly $n < 50$), Kolmogorov-Smirnov test (larger samples)
- Skewness and kurtosis values: Values between $-2$ and $+2$ are generally considered acceptable
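Skewness can be computed by hand; a pure-Python sketch of the moment-based coefficient $g_1$, applied to the reaction-time data from the worked example below (statistical software often uses a slightly different adjusted formula, so reported values can differ for small samples):

```python
import statistics

# Moment-based skewness g1 = (1/n) * sum(((x - mean) / sd)^3)
data = [385, 390, 395, 400, 405, 410, 415, 420, 510, 880]

n = len(data)
mean = statistics.mean(data)
sd_pop = statistics.pstdev(data)  # population SD (divides by n)

g1 = sum(((x - mean) / sd_pop) ** 3 for x in data) / n
print(f"skewness g1 = {g1:.2f}")  # above +2: strong right skew
```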
Worked Example
Scenario: A cognitive psychologist measures reaction time (in milliseconds) for $N = 10$ participants in a word recognition task.
Raw data (in ms): 420, 385, 510, 390, 400, 395, 880, 410, 405, 415
Step 1: Order the data.
385, 390, 395, 400, 405, 410, 415, 420, 510, 880
Step 2: Compute the mean.

$$\bar{x} = \frac{385 + 390 + \dots + 510 + 880}{10} = \frac{4610}{10} = 461.0 \text{ ms}$$
Step 3: Compute the median.
With $n = 10$ (even), the median is the average of the 5th and 6th values:

$$\text{Median} = \frac{405 + 410}{2} = 407.5 \text{ ms}$$
Step 4: Note the discrepancy. The mean (461.0) is considerably higher than the median (407.5). This signals a positively skewed distribution, pulled by the outlier value of 880 ms.
Step 5: Compute the standard deviation.
First, compute each squared deviation from the mean ($\bar{x} = 461.0$):

| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|
| 385 | -76 | 5,776 |
| 390 | -71 | 5,041 |
| 395 | -66 | 4,356 |
| 400 | -61 | 3,721 |
| 405 | -56 | 3,136 |
| 410 | -51 | 2,601 |
| 415 | -46 | 2,116 |
| 420 | -41 | 1,681 |
| 510 | 49 | 2,401 |
| 880 | 419 | 175,561 |

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{206{,}390}{9} \approx 22{,}932.2 \text{ ms}^2$$

$$s = \sqrt{22{,}932.2} \approx 151.4 \text{ ms}$$
Step 6: Compute the range and IQR.

$$\text{Range} = 880 - 385 = 495 \text{ ms}$$

$Q_1 = 395$ (median of lower half: 385, 390, 395, 400, 405)
$Q_3 = 420$ (median of upper half: 410, 415, 420, 510, 880)

$$\text{IQR} = Q_3 - Q_1 = 420 - 395 = 25 \text{ ms}$$
Step 7: Interpret the results.
The large discrepancy between the range (495 ms) and the IQR (25 ms) confirms that the extreme value of 880 ms is an outlier. For these data, the median and IQR are better summaries than the mean and SD because the distribution is heavily right-skewed.
| Measure | Value |
|---|---|
| Mean | 461.0 ms |
| Median | 407.5 ms |
| Mode | None (all unique) |
| SD | 151.4 ms |
| Variance | 22,932.2 ms$^2$ |
| Range | 495 ms |
| IQR | 25 ms |
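The entire worked example can be reproduced with Python's standard library; the quartiles here are medians of the lower and upper halves, matching the hand calculation above:

```python
import statistics

# Reaction times (ms) from the worked example
data = [420, 385, 510, 390, 400, 395, 880, 410, 405, 415]
ordered = sorted(data)

mean = statistics.mean(ordered)        # 4610 / 10 = 461.0
median = statistics.median(ordered)    # (405 + 410) / 2 = 407.5
variance = statistics.variance(ordered)  # divides by n - 1
sd = statistics.stdev(ordered)
rng = ordered[-1] - ordered[0]         # 880 - 385 = 495

half = len(ordered) // 2
q1 = statistics.median(ordered[:half])  # 395
q3 = statistics.median(ordered[half:])  # 420
iqr = q3 - q1                           # 25

print(f"mean={mean}, median={median}, SD={sd:.1f}, "
      f"range={rng}, IQR={iqr}")
```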
Interpretation
Choosing the Right Measure
| Data Characteristic | Central Tendency | Variability |
|---|---|---|
| Symmetric, no outliers | Mean | SD |
| Skewed or has outliers | Median | IQR |
| Nominal (categories) | Mode | -- |
| Ordinal (ranked) | Median | IQR |
Reading a Standard Deviation
The standard deviation tells you the "typical" distance from the mean. In the worked example, $s = 151.4$ ms is very large relative to $\bar{x} = 461.0$ ms, indicating high variability. The coefficient of variation (CV) puts this in perspective:

$$CV = \frac{s}{\bar{x}} \times 100\% = \frac{151.4}{461.0} \times 100\% \approx 32.8\%$$
A CV above 30% usually signals high variability. In this case, the outlier is the primary driver.
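The CV takes one line to check, using the mean and SD from the summary table above:

```python
# Coefficient of variation: SD as a percentage of the mean
mean_rt = 461.0  # ms
sd_rt = 151.4    # ms

cv = sd_rt / mean_rt * 100
print(f"CV = {cv:.1f}%")  # 32.8%, above the 30% rule of thumb
```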
Standard Deviation vs. Standard Error
Students frequently confuse these two:
- Standard deviation ($SD$) describes variability in the data — how spread out individual scores are.
- Standard error of the mean ($SE = SD/\sqrt{n}$) describes variability in the sampling distribution of the mean — how much the sample mean would fluctuate across repeated samples.
Report $SD$ when describing your data. Report $SE$ (or confidence intervals) when making inferences about the population mean.
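The difference is easy to see numerically; a sketch using the worked example's values:

```python
import math

# SE shrinks as the sample grows; SD does not
sd = 151.4  # sample SD in ms
n = 10

se = sd / math.sqrt(n)
print(f"SD = {sd} ms, SE = {se:.1f} ms")  # SE is about 47.9 ms
```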
Common Mistakes
- Reporting the mean for skewed data. If income data are right-skewed, the mean is inflated by high earners. The median provides a more representative picture. Always inspect histograms before choosing your summary statistics.
- Confusing SD and SE. Reporting $SE$ in a descriptive table makes the variability look artificially small (since $SE = SD/\sqrt{n} < SD$). APA style requires $SD$ in descriptive tables unless you are specifically reporting the precision of a mean estimate.
- Ignoring outliers. A single extreme value can dramatically change the mean and SD. Always check for outliers using box plots or z-scores (values with $|z| > 3$ are typically flagged).
- Reporting too many decimal places. A mean reaction time of 461.00000 ms implies false precision. Generally, report one more decimal place than the original measurement. For whole-number data, one decimal place is sufficient.
- Forgetting to report variability. A mean without a measure of spread is incomplete. Saying "the average score was 75" does not tell the reader whether scores ranged from 70 to 80 or from 30 to 100. Always pair a central tendency measure with a variability measure.
- Computing the mean of ordinal data. Strictly speaking, you should not average Likert-scale items (e.g., 1--5 ratings) because the intervals between values may not be equal. In practice, researchers often do compute means of Likert-type scales, but this should be done thoughtfully and acknowledged as a limitation.
- Using the population formula instead of the sample formula. When computing variance and SD from a sample, divide by $n - 1$, not $n$. Dividing by $n$ systematically underestimates the population variance.
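Python's `statistics` module exposes both formulas, which makes the distinction easy to demonstrate:

```python
import statistics

# stdev/variance divide by n - 1 (sample); pstdev/pvariance divide
# by n (population). For sample data, use the n - 1 forms.
data = [2, 4, 4, 4, 5, 5, 7, 9]

var_s = statistics.variance(data)   # 32 / 7, Bessel-corrected
var_p = statistics.pvariance(data)  # 32 / 8 = 4, always smaller

print(var_s, var_p)
```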
How to Report in APA Format
In-text
Participants' average reaction time was $M = 461.0$ ms ($SD = 151.4$). Due to positive skew, the median ($Mdn = 407.5$ ms) may better represent the typical response.
Descriptive Statistics Table
APA recommends a table for studies with multiple variables:
Table 1
Descriptive Statistics for Study Variables
| Variable | $n$ | $M$ | $SD$ | Min | Max | Skewness |
|---|---|---|---|---|---|---|
| Reaction time (ms) | 10 | 461.0 | 151.4 | 385 | 880 | 2.54 |
| Accuracy (%) | 10 | 88.3 | 6.2 | 78 | 97 | -0.31 |
Key formatting guidelines:
- Use $M$ and $SD$ (italicized) as column headers
- Report values to one or two decimal places consistently
- Include $n$, range, and skewness when space permits
- Note if medians and IQRs are reported instead of means and SDs, and explain why
- For categorical variables, report frequencies and percentages rather than means
Ready to calculate?
Now that you understand the concept, use the free Effect Size Calculator on Subthesis to run your own analysis.
Related Concepts
Pearson Correlation
Learn how to calculate and interpret the Pearson correlation coefficient (r) to measure the strength and direction of linear relationships between two variables.
Cronbach's Alpha
Understand Cronbach's alpha for measuring internal consistency reliability. Learn the formula, interpretation guidelines, and what to do when alpha is low.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.