Inter-Rater Reliability (Cohen's Kappa / ICC)
What Is Inter-Rater Reliability?
Inter-rater reliability (IRR), also called inter-rater agreement or inter-observer reliability, measures the extent to which two or more independent raters assign the same scores or categories to the same set of subjects. It answers the question: "If different people apply the same coding scheme, do they reach the same conclusions?"
IRR is essential in research that involves human judgment — content analysis, behavioral coding, clinical diagnosis, essay grading, and qualitative classification. Without adequate IRR, you cannot trust that your measurements reflect the construct of interest rather than idiosyncratic rater differences.
There are several approaches to assessing IRR, each suited to a different data type:
| Method | Data Type | Number of Raters |
|---|---|---|
| Percent agreement | Categorical | 2+ |
| Cohen's Kappa ($\kappa$) | Categorical (nominal) | 2 |
| Weighted Kappa | Categorical (ordinal) | 2 |
| Fleiss' Kappa | Categorical (nominal) | 3+ |
| Intraclass Correlation (ICC) | Continuous | 2+ |
| Krippendorff's Alpha | Any | 2+ |
When to Use It
Report inter-rater reliability when:
- Two or more coders independently classify qualitative data (e.g., coding interview themes, diagnosing disorders from case files).
- Judges rate performances, essays, or other subjective material on a scale.
- Observers record behaviors in an observational study (e.g., counting occurrences of aggression in a playground).
- You need to demonstrate that your measurement procedure produces consistent, replicable results regardless of who does the rating.
IRR should be established before the main data collection and reported in the methods section. Typically, raters independently code a subset (10--20%) of the data, IRR is computed, discrepancies are discussed, and then the remaining data are coded.
Assumptions
For Cohen's Kappa
- Exactly two raters. For three or more raters, use Fleiss' Kappa or Krippendorff's Alpha.
- Same subjects rated by both raters. Every subject must be rated by both raters.
- Mutually exclusive and exhaustive categories. Each subject is assigned to exactly one category.
- Independent ratings. Raters must not discuss cases or see each other's ratings.
For ICC
- Continuous (interval or ratio) data. Ratings must be numeric and meaningful in magnitude.
- Appropriate model selection. You must choose the correct ICC form based on your design:
- ICC(1,1): Each subject is rated by a different set of randomly selected raters (one-way random).
- ICC(2,1): Each subject is rated by the same set of raters, who are considered a random sample from a larger population (two-way random).
- ICC(3,1): Each subject is rated by the same set of raters, who are the only raters of interest (two-way mixed).
- Normality. Ratings should be approximately normally distributed.
Formula
Percent Agreement
The simplest measure, but it does not account for agreement that occurs by chance:

$$\text{Percent Agreement} = \frac{\text{number of agreements}}{\text{total number of cases}} \times 100\%$$
Cohen's Kappa ($\kappa$)
Cohen's Kappa corrects percent agreement for the amount of agreement expected by chance:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

Where:
- $P_o$ = observed proportion of agreement
- $P_e$ = expected proportion of agreement by chance

$P_e$ is calculated from the marginal totals. If Rater 1 assigns category A to 60% of cases and Rater 2 assigns category A to 50% of cases, the chance probability of both assigning A is $0.60 \times 0.50 = 0.30$.
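Cohen's kappa can be computed directly from two lists of paired ratings; the following is a minimal pure-Python sketch (illustrative helper names, not a library API), assuming two equal-length lists of categorical labels:

```python
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Cohen's kappa for two raters' paired categorical ratings."""
    n = len(ratings1)
    # P_o: observed proportion of cases where the raters agree
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    # P_e: chance agreement from each rater's marginal proportions
    c1, c2 = Counter(ratings1), Counter(ratings2)
    p_e = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Toy data: raters agree on 4 of 6 cases, each uses A and B half the time,
# so P_o = 4/6, P_e = .50, kappa = (4/6 - .50) / (1 - .50) = 1/3
r1 = ["A", "A", "B", "B", "A", "B"]
r2 = ["A", "B", "B", "B", "A", "A"]
print(round(cohens_kappa(r1, r2), 3))  # 0.333
```

In practice you would typically use an established implementation (e.g., `sklearn.metrics.cohen_kappa_score`), but spelling out $P_o$ and $P_e$ makes the chance correction explicit.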
Interpretation of Kappa
The most widely used benchmarks come from Landis and Koch (1977):
| $\kappa$ | Interpretation |
|---|---|
| $< 0$ | Poor (less than chance) |
| .00 -- .20 | Slight |
| .21 -- .40 | Fair |
| .41 -- .60 | Moderate |
| .61 -- .80 | Substantial |
| .81 -- 1.00 | Almost perfect |
Intraclass Correlation Coefficient (ICC)
For continuous ratings, the ICC compares variance between subjects to total variance:

$$ICC = \frac{\sigma^2_{\text{between subjects}}}{\sigma^2_{\text{between subjects}} + \sigma^2_{\text{error}}}$$

In a two-way model, the consistency form ICC(3,1) is estimated from ANOVA mean squares as:

$$ICC(3,1) = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E}$$

and the absolute-agreement form ICC(2,1) adds a term for rater variance:

$$ICC(2,1) = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E + \frac{k}{n}(MS_C - MS_E)}$$

Where $MS_R$, $MS_C$, and $MS_E$ = between-subjects, between-raters, and error mean squares from a two-way ANOVA, $k$ = number of raters, $n$ = number of subjects.
Interpretation of ICC
| ICC | Interpretation |
|---|---|
| $< .50$ | Poor |
| .50 -- .75 | Moderate |
| .75 -- .90 | Good |
| $> .90$ | Excellent |
(Koo & Li, 2016)
Worked Example
Example 1: Cohen's Kappa (Categorical Data)
Scenario: Two clinical psychologists independently diagnose 50 patients as either having "Major Depressive Disorder" (MDD) or "No MDD" based on structured interviews.
Contingency table:
| | Rater 2: MDD | Rater 2: No MDD | Row Total |
|---|---|---|---|
| Rater 1: MDD | 20 | 5 | 25 |
| Rater 1: No MDD | 3 | 22 | 25 |
| Column Total | 23 | 27 | 50 |
Step 1: Calculate observed agreement ($P_o$).

$$P_o = \frac{20 + 22}{50} = \frac{42}{50} = .84$$

The raters agreed on 84% of cases.

Step 2: Calculate expected agreement ($P_e$).

$$P_e = \left(\frac{25}{50} \times \frac{23}{50}\right) + \left(\frac{25}{50} \times \frac{27}{50}\right) = .23 + .27 = .50$$

Step 3: Calculate Kappa.

$$\kappa = \frac{.84 - .50}{1 - .50} = \frac{.34}{.50} = .68$$
Interpretation: $\kappa = .68$ falls in the "substantial" agreement range (.61 -- .80). The two clinicians show good agreement beyond what would be expected by chance. However, there is room for improvement: the five cases where Rater 1 said MDD and Rater 2 said No MDD should be reviewed for diagnostic clarity.
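The three steps of the worked example can be reproduced from the contingency table; a minimal sketch using only the counts given above:

```python
# Contingency table from the example: rows = Rater 1, columns = Rater 2
#                  MDD  No MDD
table = [[20, 5],        # Rater 1: MDD
         [3, 22]]        # Rater 1: No MDD

n = sum(sum(row) for row in table)                     # 50 cases total
p_o = (table[0][0] + table[1][1]) / n                  # diagonal = agreements
rows = [sum(r) for r in table]                         # Rater 1 marginals
cols = [sum(c) for c in zip(*table)]                   # Rater 2 marginals
p_e = sum(r * c for r, c in zip(rows, cols)) / n ** 2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(p_o, p_e, round(kappa, 2))  # 0.84 0.5 0.68
```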
Example 2: ICC (Continuous Data)
Scenario: Three trained observers rate the severity of disruptive classroom behavior on a 1--10 scale for 6 students.
| Student | Rater 1 | Rater 2 | Rater 3 |
|---|---|---|---|
| 1 | 7 | 6 | 7 |
| 2 | 3 | 4 | 3 |
| 3 | 8 | 8 | 9 |
| 4 | 5 | 5 | 4 |
| 5 | 2 | 3 | 2 |
| 6 | 9 | 8 | 9 |
Using a two-way mixed model (ICC(3,1)) because the same three raters rate all students and we are interested only in these raters:
From the ANOVA decomposition: $MS_R = 20.80$, $MS_E = 0.40$, with $k = 3$ raters, so

$$ICC(3,1) = \frac{20.80 - 0.40}{20.80 + (3-1)(0.40)} = \frac{20.40}{21.60} = .94$$

Interpretation: ICC $= .94$ indicates "excellent" agreement ($> .90$). The three raters are highly consistent in their severity ratings.
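The ICC(3,1) computation can be carried out from the raw ratings table via a two-way ANOVA decomposition; this is a pure-Python sketch (illustrative, assuming complete data with no missing ratings):

```python
def icc_3_1(data):
    """ICC(3,1): two-way mixed model, single measures, consistency.
    data: one row per subject, one column per rater."""
    n, k = len(data), len(data[0])
    grand = sum(x for row in data for x in row) / (n * k)
    subj_means = [sum(row) / k for row in data]
    rater_means = [sum(col) / n for col in zip(*data)]
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_rater = n * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_subj - ss_rater
    ms_r = ss_subj / (n - 1)                 # between-subjects mean square
    ms_e = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)

# Ratings from the example: 6 students x 3 raters
ratings = [[7, 6, 7], [3, 4, 3], [8, 8, 9],
           [5, 5, 4], [2, 3, 2], [9, 8, 9]]
print(round(icc_3_1(ratings), 3))  # 0.944
```

For ICC(2,1), the denominator also needs the between-raters mean square term; in practice, established implementations such as `psych::ICC` in R report all the Shrout--Fleiss forms at once.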
Interpretation
When interpreting IRR:
- Kappa is always lower than percent agreement because it removes chance agreement. A percent agreement of 80% might correspond to a Kappa of only .60 if categories are unbalanced.
- Base rates matter. When one category is very common (e.g., 90% of cases are "normal"), percent agreement is inflated and Kappa can be paradoxically low even with good agreement. This is known as the Kappa paradox.
- ICC model matters. Always specify which ICC form you used (e.g., ICC(2,1) or ICC(3,1)) and whether it reflects single-measure or average-measure reliability.
- Confidence intervals. Always report 95% CIs alongside point estimates. A Kappa of .70 with a CI of [.45, .95] is much less informative than one with a CI of [.62, .78].
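Analytic CI formulas for kappa exist, but a percentile bootstrap is a simple, assumption-light alternative: resample rated cases with replacement, recompute kappa each time, and take the empirical 2.5th and 97.5th percentiles. A sketch (illustrative helper names, not a library API):

```python
import random
from collections import Counter

def kappa(r1, r2):
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / n ** 2
    # Guard against degenerate resamples where both raters use one category
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def bootstrap_kappa_ci(r1, r2, n_boot=2000, seed=1):
    """Percentile bootstrap 95% CI: resample cases with replacement."""
    rng = random.Random(seed)
    n = len(r1)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(kappa([r1[i] for i in idx], [r2[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

# Case-level lists matching the 50-patient contingency table (20/5/3/22)
r1 = ["MDD"] * 25 + ["No"] * 25
r2 = ["MDD"] * 20 + ["No"] * 5 + ["MDD"] * 3 + ["No"] * 22
lo, hi = bootstrap_kappa_ci(r1, r2)
print(round(kappa(r1, r2), 2), round(lo, 2), round(hi, 2))
```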
Common Mistakes
- Reporting only percent agreement. Percent agreement ignores chance and is almost always inflated. Journals and reviewers expect Kappa or ICC.
- Using the wrong ICC model. Choosing ICC(1,1) when the same raters rate all subjects (should be ICC(2,1) or ICC(3,1)) produces incorrect estimates. Map your design to the correct model.
- Not training raters. Computing IRR before raters are adequately trained wastes effort. Conduct practice sessions, discuss disagreements, and refine the coding manual before the reliability check.
- Using too small a reliability sample. Reliability estimates based on 10 cases are unstable. Aim for at least 30 cases or 20% of the total sample, whichever is larger.
- Ignoring systematic rater bias. Kappa and percent agreement do not capture systematic differences (e.g., Rater 1 always assigns higher scores). ICC detects this, but only with the correct model. Consider computing mean scores per rater to check for bias.
- Computing Kappa for continuous data. If raters assign numerical severity scores on a continuous scale, use ICC, not Kappa. Kappa treats a rating of 4 vs. 5 the same as 4 vs. 9: both count as disagreements.
- Not computing reliability for each category. Overall Kappa can mask poor agreement on rare categories. Report category-specific Kappa values when possible.
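The last point can be implemented by collapsing the ratings to "this category vs. any other" and computing a binary two-rater kappa per category; a sketch with illustrative helper names:

```python
from collections import Counter

def kappa(r1, r2):
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)

def per_category_kappa(r1, r2):
    """Binary kappa for each category: 'this category' vs. 'any other'."""
    return {cat: kappa([x == cat for x in r1], [x == cat for x in r2])
            for cat in sorted(set(r1) | set(r2))}

# Toy data: overall agreement looks decent, but category A is noticeably weaker
r1 = ["A", "A", "B", "C", "C", "B", "A", "C"]
r2 = ["A", "B", "B", "C", "C", "B", "A", "A"]
for cat, k in per_category_kappa(r1, r2).items():
    print(cat, round(k, 2))  # A 0.47, B 0.71, C 0.71
```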
How to Report in APA Format
For Cohen's Kappa:
Two clinical psychologists independently diagnosed 50 patients. Inter-rater reliability was substantial, $\kappa = .68$, 95% CI [.49, .87], $p < .001$.
For ICC:
Three observers rated classroom behavior severity. Intraclass correlation using a two-way mixed model (ICC(3,1)) indicated excellent inter-rater reliability, ICC $= .94$, 95% CI [.78, .98].
For percent agreement alongside Kappa:
Coders independently classified 200 social media posts into five content categories. Percent agreement was 82%, and Cohen's Kappa was $\kappa = .74$, indicating substantial agreement beyond chance.
Key elements:
- Number of raters and number of cases
- The specific statistic used (Kappa, weighted Kappa, Fleiss' Kappa, or ICC with model specified)
- The point estimate and 95% confidence interval
- A verbal interpretation (e.g., "substantial," "excellent")
- For ICC, always state the model form (e.g., ICC(3,1) two-way mixed, single measures)
Ready to calculate?
Now that you understand the concept, use the free Reliability Calculator on Subthesis to run your own analysis.
Related Concepts
Cronbach's Alpha
Understand Cronbach's alpha for measuring internal consistency reliability. Learn the formula, interpretation guidelines, and what to do when alpha is low.
Descriptive Statistics
Master descriptive statistics: learn about mean, median, mode, standard deviation, variance, and range. Know when to use each measure for your research data.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.