Descriptive Statistics
What Are Descriptive Statistics?
Descriptive statistics are numerical summaries that describe, organize, and simplify a dataset. They reduce a collection of raw numbers into a few meaningful values that capture the essential features of your data: where the center is, how spread out the values are, and what the distribution looks like.
Every research paper, thesis, or report begins with descriptive statistics. Before you run a t-test, ANOVA, or regression, you must first describe your data. Descriptive statistics serve two purposes:
- Summarize the data for your readers so they understand what you measured and what the values look like.
- Check assumptions for inferential tests — many statistical procedures require data that are approximately normally distributed, free of outliers, and sufficiently variable.
Descriptive statistics fall into two broad families:
- Measures of central tendency — Where is the center of the distribution? (Mean, median, mode)
- Measures of variability (dispersion) — How spread out are the values? (Standard deviation, variance, range, interquartile range)
When to Use It
Descriptive statistics are used in every quantitative study. Specifically:
- Report means and standard deviations for all continuous variables in your study (APA 7th edition requires this).
- Use medians and interquartile ranges when your data are skewed or contain outliers.
- Report frequencies and percentages for categorical variables (e.g., 58% female, 42% male).
- Include a descriptive statistics table in your results section before presenting any inferential analyses.
Assumptions
Descriptive statistics themselves have minimal assumptions, but choosing the right measure depends on the characteristics of your data:
- Level of measurement. The mean requires interval or ratio data. The median requires at least ordinal data. The mode can be used with any level of measurement.
- Distribution shape. The mean is appropriate for roughly symmetric distributions. For highly skewed data, the median is a better measure of center.
- Outliers. The mean is sensitive to extreme values; the median is resistant. If you have outliers, report both and explain the discrepancy.
Formula
Measures of Central Tendency
Mean (Arithmetic Average)
The sum of all values divided by the number of values:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The mean uses every data point, which makes it the most informative measure of center for symmetric data — but also the most sensitive to outliers.
Median
The middle value when data are ordered from lowest to highest. For $n$ ordered values $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$:
- If $n$ is odd, the median is the value at position $(n+1)/2$.
- If $n$ is even, the median is the average of the values at positions $n/2$ and $n/2 + 1$.
The median is resistant to outliers and is the preferred measure of center for skewed distributions (e.g., income, reaction times).
Mode
The most frequently occurring value. A distribution can be:
- Unimodal — one mode
- Bimodal — two modes (suggesting two subgroups)
- Multimodal — three or more modes
The mode is the only measure of central tendency that can be used with nominal data (e.g., the most common political affiliation).
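All three measures of center are available in Python's standard library; a minimal sketch on a small made-up dataset:

```python
import statistics

# Mean, median, and mode of a small illustrative dataset
scores = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(scores)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median = statistics.median(scores)  # middle two values: (3 + 5) / 2 = 4.0
mode = statistics.mode(scores)      # 3 occurs most often

# statistics.multimode returns every mode, useful for bimodal data
modes = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2]

print(mean, median, mode, modes)
```

Note that `statistics.mode` raises an error on some older Python versions when there is a tie, so `multimode` is the safer choice for possibly bimodal data.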
Measures of Variability
Range
The simplest measure of spread:

$$\text{Range} = x_{\max} - x_{\min}$$
The range uses only the two most extreme values, making it highly sensitive to outliers and unstable across samples.
Variance
The average squared deviation from the mean. For a sample:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
We divide by $n - 1$ (not $n$) to correct for the bias that occurs when estimating the population variance from a sample. This is called Bessel's correction.
Variance is measured in squared units, which makes it difficult to interpret directly (e.g., "squared years" or "squared points"). That is why we typically take the square root to get the standard deviation.
Standard Deviation
The square root of the variance:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The standard deviation is expressed in the same units as the original data, making it far more interpretable than the variance. It tells you, on average, how far each data point falls from the mean.
Interquartile Range (IQR)
The range of the middle 50% of the data:

$$\text{IQR} = Q_3 - Q_1$$

where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile. Like the median, the IQR is resistant to outliers and is preferred for skewed distributions.
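The spread measures above can be computed with the standard library. Quartile conventions differ across packages; this sketch takes the quartiles as medians of the lower and upper halves (excluding the overall median when $n$ is odd):

```python
import statistics

scores = [4, 8, 6, 5, 3, 7, 9]
ordered = sorted(scores)           # [3, 4, 5, 6, 7, 8, 9]

rng = ordered[-1] - ordered[0]     # 9 - 3 = 6
var = statistics.variance(scores)  # sample variance, divides by n - 1
sd = statistics.stdev(scores)      # square root of the variance

# Quartiles as medians of the halves (middle value 6 excluded)
half = len(ordered) // 2
q1 = statistics.median(ordered[:half])   # median of [3, 4, 5] = 4
q3 = statistics.median(ordered[-half:])  # median of [7, 8, 9] = 8
iqr = q3 - q1                            # 8 - 4 = 4

print(rng, var, round(sd, 2), iqr)
```

`statistics.quantiles` uses an interpolation method by default and can return slightly different quartiles for small samples, which is worth noting when comparing output across tools.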
The Normal Distribution
Many inferential statistics assume that data follow a normal (Gaussian) distribution — the familiar bell curve. Key properties:
In a normal distribution:
- About 68% of values fall within $1$ standard deviation of the mean ($\mu \pm 1\sigma$)
- About 95% of values fall within $2$ standard deviations ($\mu \pm 2\sigma$)
- About 99.7% of values fall within $3$ standard deviations ($\mu \pm 3\sigma$)
This is known as the 68-95-99.7 rule (or the empirical rule).
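The empirical rule can be verified by simulation; a sketch that draws a large sample from a normal distribution (the mean and SD chosen here are arbitrary) and counts the share of values within 1, 2, and 3 SDs:

```python
import random
import statistics

# Simulate a normal sample and check the 68-95-99.7 rule
random.seed(42)
data = [random.gauss(100, 15) for _ in range(50_000)]

mu = statistics.mean(data)
sigma = statistics.stdev(data)

shares = {}
for k in (1, 2, 3):
    shares[k] = sum(abs(x - mu) <= k * sigma for x in data) / len(data)
    print(f"within {k} SD: {shares[k]:.1%}")
```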
You can assess normality using:
- Visual methods: Histograms, Q-Q plots, box plots
- Statistical tests: Shapiro-Wilk test (best for small samples, roughly $n < 50$), Kolmogorov-Smirnov test (larger samples)
- Skewness and kurtosis values: Values between $-2$ and $+2$ are generally considered acceptable
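Skewness can be computed by hand; a pure-Python sketch of the moment-based coefficient $g_1$, applied to the reaction-time data from the worked example below (statistical software often uses a slightly different adjusted formula, so reported values can differ for small samples):

```python
import statistics

# Moment-based skewness g1 = (1/n) * sum(((x - mean) / sd)^3)
data = [385, 390, 395, 400, 405, 410, 415, 420, 510, 880]

n = len(data)
mean = statistics.mean(data)
sd_pop = statistics.pstdev(data)  # population SD (divides by n)

g1 = sum(((x - mean) / sd_pop) ** 3 for x in data) / n
print(f"skewness g1 = {g1:.2f}")  # above +2: strong right skew
```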
Worked Example
Scenario: A cognitive psychologist measures reaction time (in milliseconds) for $N = 10$ participants in a word recognition task.
Raw data (in ms): 420, 385, 510, 390, 400, 395, 880, 410, 405, 415
Step 1: Order the data.
385, 390, 395, 400, 405, 410, 415, 420, 510, 880
Step 2: Compute the mean.

$$\bar{x} = \frac{385 + 390 + \dots + 510 + 880}{10} = \frac{4610}{10} = 461.0 \text{ ms}$$
Step 3: Compute the median.
With $n = 10$ (even), the median is the average of the 5th and 6th values:

$$\text{Median} = \frac{405 + 410}{2} = 407.5 \text{ ms}$$
Step 4: Note the discrepancy. The mean (461.0) is considerably higher than the median (407.5). This signals a positively skewed distribution, pulled by the outlier value of 880 ms.
Step 5: Compute the standard deviation.
First, compute each squared deviation from the mean ($\bar{x} = 461.0$):

| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|
| 385 | -76 | 5,776 |
| 390 | -71 | 5,041 |
| 395 | -66 | 4,356 |
| 400 | -61 | 3,721 |
| 405 | -56 | 3,136 |
| 410 | -51 | 2,601 |
| 415 | -46 | 2,116 |
| 420 | -41 | 1,681 |
| 510 | 49 | 2,401 |
| 880 | 419 | 175,561 |

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{206{,}390}{9} \approx 22{,}932.2 \text{ ms}^2$$

$$s = \sqrt{22{,}932.2} \approx 151.4 \text{ ms}$$
Step 6: Compute the range and IQR.

$$\text{Range} = 880 - 385 = 495 \text{ ms}$$

$Q_1 = 395$ (median of lower half: 385, 390, 395, 400, 405)
$Q_3 = 420$ (median of upper half: 410, 415, 420, 510, 880)

$$\text{IQR} = Q_3 - Q_1 = 420 - 395 = 25 \text{ ms}$$
Step 7: Interpret the results.
The large discrepancy between the range (495 ms) and the IQR (25 ms) confirms that the extreme value of 880 ms is an outlier. For these data, the median and IQR are better summaries than the mean and SD because the distribution is heavily right-skewed.
| Measure | Value |
|---|---|
| Mean | 461.0 ms |
| Median | 407.5 ms |
| Mode | None (all unique) |
| SD | 151.4 ms |
| Variance | 22,932.2 ms$^2$ |
| Range | 495 ms |
| IQR | 25 ms |
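The entire worked example can be reproduced with Python's standard library; the quartiles here are medians of the lower and upper halves, matching the hand calculation above:

```python
import statistics

# Reaction times (ms) from the worked example
data = [420, 385, 510, 390, 400, 395, 880, 410, 405, 415]
ordered = sorted(data)

mean = statistics.mean(ordered)        # 4610 / 10 = 461.0
median = statistics.median(ordered)    # (405 + 410) / 2 = 407.5
variance = statistics.variance(ordered)  # divides by n - 1
sd = statistics.stdev(ordered)
rng = ordered[-1] - ordered[0]         # 880 - 385 = 495

half = len(ordered) // 2
q1 = statistics.median(ordered[:half])  # 395
q3 = statistics.median(ordered[half:])  # 420
iqr = q3 - q1                           # 25

print(f"mean={mean}, median={median}, SD={sd:.1f}, "
      f"range={rng}, IQR={iqr}")
```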
Interpretation
Choosing the Right Measure
| Data Characteristic | Central Tendency | Variability |
|---|---|---|
| Symmetric, no outliers | Mean | SD |
| Skewed or has outliers | Median | IQR |
| Nominal (categories) | Mode | -- |
| Ordinal (ranked) | Median | IQR |
Reading a Standard Deviation
The standard deviation tells you the "typical" distance from the mean. In the worked example, $s = 151.4$ ms is very large relative to $\bar{x} = 461.0$ ms, indicating high variability. The coefficient of variation (CV) puts this in perspective:

$$CV = \frac{s}{\bar{x}} \times 100\% = \frac{151.4}{461.0} \times 100\% \approx 32.8\%$$
A CV above 30% usually signals high variability. In this case, the outlier is the primary driver.
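The CV takes one line to check, using the mean and SD from the summary table above:

```python
# Coefficient of variation: SD as a percentage of the mean
mean_rt = 461.0  # ms
sd_rt = 151.4    # ms

cv = sd_rt / mean_rt * 100
print(f"CV = {cv:.1f}%")  # 32.8%, above the 30% rule of thumb
```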
Standard Deviation vs. Standard Error
Students frequently confuse these two:
- Standard deviation ($SD$) describes variability in the data — how spread out individual scores are.
- Standard error of the mean ($SE = SD/\sqrt{n}$) describes variability in the sampling distribution of the mean — how much the sample mean would fluctuate across repeated samples.
Report $SD$ when describing your data. Report $SE$ (or confidence intervals) when making inferences about the population mean.
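The difference is easy to see numerically; a sketch using the worked example's values:

```python
import math

# SE shrinks as the sample grows; SD does not
sd = 151.4  # sample SD in ms
n = 10

se = sd / math.sqrt(n)
print(f"SD = {sd} ms, SE = {se:.1f} ms")  # SE is about 47.9 ms
```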
Common Mistakes
- Reporting the mean for skewed data. If income data are right-skewed, the mean is inflated by high earners. The median provides a more representative picture. Always inspect histograms before choosing your summary statistics.
- Confusing SD and SE. Reporting $SE$ in a descriptive table makes the variability look artificially small (since $SE = SD/\sqrt{n} < SD$). APA style requires $SD$ in descriptive tables unless you are specifically reporting the precision of a mean estimate.
- Ignoring outliers. A single extreme value can dramatically change the mean and SD. Always check for outliers using box plots or z-scores (values with $|z| > 3$ are typically flagged).
- Reporting too many decimal places. A mean reaction time of 461.00000 ms implies false precision. Generally, report one more decimal place than the original measurement. For whole-number data, one decimal place is sufficient.
- Forgetting to report variability. A mean without a measure of spread is incomplete. Saying "the average score was 75" does not tell the reader whether scores ranged from 70 to 80 or from 30 to 100. Always pair a central tendency measure with a variability measure.
- Computing the mean of ordinal data. Strictly speaking, you should not average Likert-scale items (e.g., 1--5 ratings) because the intervals between values may not be equal. In practice, researchers often do compute means of Likert-type scales, but this should be done thoughtfully and acknowledged as a limitation.
- Using the population formula instead of the sample formula. When computing variance and SD from a sample, divide by $n - 1$, not $n$. Dividing by $n$ systematically underestimates the population variance.
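Python's `statistics` module exposes both formulas, which makes the distinction easy to demonstrate:

```python
import statistics

# stdev/variance divide by n - 1 (sample); pstdev/pvariance divide
# by n (population). For sample data, use the n - 1 forms.
data = [2, 4, 4, 4, 5, 5, 7, 9]

var_s = statistics.variance(data)   # 32 / 7, Bessel-corrected
var_p = statistics.pvariance(data)  # 32 / 8 = 4, always smaller

print(var_s, var_p)
```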
How to Report in APA Format
In-text
Participants' average reaction time was $M = 461.0$ ms ($SD = 151.4$). Due to positive skew, the median ($Mdn = 407.5$ ms) may better represent the typical response.
Descriptive Statistics Table
APA recommends a table for studies with multiple variables:
Table 1
Descriptive Statistics for Study Variables
| Variable | $n$ | $M$ | $SD$ | Min | Max | Skewness |
|---|---|---|---|---|---|---|
| Reaction time (ms) | 10 | 461.0 | 151.4 | 385 | 880 | 2.54 |
| Accuracy (%) | 10 | 88.3 | 6.2 | 78 | 97 | -0.31 |
Key formatting guidelines:
- Use $M$ and $SD$ (italicized) as column headers
- Report values to one or two decimal places consistently
- Include $n$, range, and skewness when space permits
- Note if medians and IQRs are reported instead of means and SDs, and explain why
- For categorical variables, report frequencies and percentages rather than means
Ready to calculate?
Now that you understand the concept, use the free Effect Size Calculator on Subthesis to run your own analysis.
Related Concepts
Pearson Correlation
Learn how to calculate and interpret the Pearson correlation coefficient (r) to measure the strength and direction of linear relationships between two variables.
Cronbach's Alpha
Understand Cronbach's alpha for measuring internal consistency reliability. Learn the formula, interpretation guidelines, and what to do when alpha is low.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.