Inter-Rater Reliability (Cohen's Kappa / ICC)
What Is Inter-Rater Reliability?
Inter-rater reliability (IRR), also called inter-rater agreement or inter-observer reliability, measures the extent to which two or more independent raters assign the same scores or categories to the same set of subjects. It answers the question: "If different people apply the same coding scheme, do they reach the same conclusions?"
IRR is essential in research that involves human judgment — content analysis, behavioral coding, clinical diagnosis, essay grading, and qualitative classification. Without adequate IRR, you cannot trust that your measurements reflect the construct of interest rather than idiosyncratic rater differences.
There are several approaches to assessing IRR, each suited to a different data type:
| Method | Data Type | Number of Raters |
|---|---|---|
| Percent agreement | Categorical | 2+ |
| Cohen's Kappa ($\kappa$) | Categorical (nominal) | 2 |
| Weighted Kappa | Categorical (ordinal) | 2 |
| Fleiss' Kappa | Categorical (nominal) | 3+ |
| Intraclass Correlation (ICC) | Continuous | 2+ |
| Krippendorff's Alpha | Any | 2+ |
When to Use It
Report inter-rater reliability when:
- Two or more coders independently classify qualitative data (e.g., coding interview themes, diagnosing disorders from case files).
- Judges rate performances, essays, or other subjective material on a scale.
- Observers record behaviors in an observational study (e.g., counting occurrences of aggression in a playground).
- You need to demonstrate that your measurement procedure produces consistent, replicable results regardless of who does the rating.
IRR should be established before the main data collection and reported in the methods section. Typically, raters independently code a subset (10--20%) of the data, IRR is computed, discrepancies are discussed, and then the remaining data are coded.
Assumptions
For Cohen's Kappa
- Exactly two raters. For three or more raters, use Fleiss' Kappa or Krippendorff's Alpha.
- Same subjects rated by both raters. Every subject must be rated by both raters.
- Mutually exclusive and exhaustive categories. Each subject is assigned to exactly one category.
- Independent ratings. Raters must not discuss cases or see each other's ratings.
For ICC
- Continuous (interval or ratio) data. Ratings must be numeric and meaningful in magnitude.
- Appropriate model selection. You must choose the correct ICC form based on your design:
- ICC(1,1): Each subject is rated by a different set of randomly selected raters (one-way random).
- ICC(2,1): Each subject is rated by the same set of raters, who are considered a random sample from a larger population (two-way random).
- ICC(3,1): Each subject is rated by the same set of raters, who are the only raters of interest (two-way mixed).
- Normality. Ratings should be approximately normally distributed.
Formula
Percent Agreement
The simplest measure, but it does not account for agreement that occurs by chance:

$$\text{Percent Agreement} = \frac{\text{number of agreements}}{\text{total number of cases}} \times 100\%$$
Cohen's Kappa ($\kappa$)
Cohen's Kappa corrects percent agreement for the amount of agreement expected by chance:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

Where:
- $P_o$ = observed proportion of agreement
- $P_e$ = expected proportion of agreement by chance

$P_e$ is calculated from the marginal totals. If Rater 1 assigns category A to 60% of cases and Rater 2 assigns category A to 50% of cases, the chance probability of both assigning A is $0.60 \times 0.50 = 0.30$.
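Cohen's kappa can be computed directly from two lists of paired ratings; the following is a minimal pure-Python sketch (illustrative helper names, not a library API), assuming two equal-length lists of categorical labels:

```python
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Cohen's kappa for two raters' paired categorical ratings."""
    n = len(ratings1)
    # P_o: observed proportion of cases where the raters agree
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    # P_e: chance agreement from each rater's marginal proportions
    c1, c2 = Counter(ratings1), Counter(ratings2)
    p_e = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Toy data: raters agree on 4 of 6 cases, each uses A and B half the time,
# so P_o = 4/6, P_e = .50, kappa = (4/6 - .50) / (1 - .50) = 1/3
r1 = ["A", "A", "B", "B", "A", "B"]
r2 = ["A", "B", "B", "B", "A", "A"]
print(round(cohens_kappa(r1, r2), 3))  # 0.333
```

In practice you would typically use an established implementation (e.g., `sklearn.metrics.cohen_kappa_score`), but spelling out $P_o$ and $P_e$ makes the chance correction explicit.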
Interpretation of Kappa
The most widely used benchmarks come from Landis and Koch (1977):
| $\kappa$ | Interpretation |
|---|---|
| $< 0$ | Poor (less than chance) |
| .00 -- .20 | Slight |
| .21 -- .40 | Fair |
| .41 -- .60 | Moderate |
| .61 -- .80 | Substantial |
| .81 -- 1.00 | Almost perfect |
Intraclass Correlation Coefficient (ICC)
For continuous ratings, the ICC compares variance between subjects to total variance:

$$ICC = \frac{\sigma^2_{\text{between subjects}}}{\sigma^2_{\text{between subjects}} + \sigma^2_{\text{error}}}$$

In a two-way model, the consistency form ICC(3,1) is estimated from ANOVA mean squares as:

$$ICC(3,1) = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E}$$

and the absolute-agreement form ICC(2,1) adds a term for rater variance:

$$ICC(2,1) = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E + \frac{k}{n}(MS_C - MS_E)}$$

Where $MS_R$, $MS_C$, and $MS_E$ = between-subjects, between-raters, and error mean squares from a two-way ANOVA, $k$ = number of raters, $n$ = number of subjects.
Interpretation of ICC
| ICC | Interpretation |
|---|---|
| $< .50$ | Poor |
| .50 -- .75 | Moderate |
| .75 -- .90 | Good |
| $> .90$ | Excellent |
(Koo & Li, 2016)
Worked Example
Example 1: Cohen's Kappa (Categorical Data)
Scenario: Two clinical psychologists independently diagnose 50 patients as either having "Major Depressive Disorder" (MDD) or "No MDD" based on structured interviews.
Contingency table:
| | Rater 2: MDD | Rater 2: No MDD | Row Total |
|---|---|---|---|
| Rater 1: MDD | 20 | 5 | 25 |
| Rater 1: No MDD | 3 | 22 | 25 |
| Column Total | 23 | 27 | 50 |
Step 1: Calculate observed agreement ($P_o$).

$$P_o = \frac{20 + 22}{50} = \frac{42}{50} = .84$$

The raters agreed on 84% of cases.

Step 2: Calculate expected agreement ($P_e$).

$$P_e = \left(\frac{25}{50} \times \frac{23}{50}\right) + \left(\frac{25}{50} \times \frac{27}{50}\right) = .23 + .27 = .50$$

Step 3: Calculate Kappa.

$$\kappa = \frac{.84 - .50}{1 - .50} = \frac{.34}{.50} = .68$$
Interpretation: $\kappa = .68$ falls in the "substantial" agreement range (.61 -- .80). The two clinicians show good agreement beyond what would be expected by chance. However, there is room for improvement: the five cases where Rater 1 said MDD and Rater 2 said No MDD should be reviewed for diagnostic clarity.
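The three steps of the worked example can be reproduced from the contingency table; a minimal sketch using only the counts given above:

```python
# Contingency table from the example: rows = Rater 1, columns = Rater 2
#                  MDD  No MDD
table = [[20, 5],        # Rater 1: MDD
         [3, 22]]        # Rater 1: No MDD

n = sum(sum(row) for row in table)                     # 50 cases total
p_o = (table[0][0] + table[1][1]) / n                  # diagonal = agreements
rows = [sum(r) for r in table]                         # Rater 1 marginals
cols = [sum(c) for c in zip(*table)]                   # Rater 2 marginals
p_e = sum(r * c for r, c in zip(rows, cols)) / n ** 2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(p_o, p_e, round(kappa, 2))  # 0.84 0.5 0.68
```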
Example 2: ICC (Continuous Data)
Scenario: Three trained observers rate the severity of disruptive classroom behavior on a 1--10 scale for 6 students.
| Student | Rater 1 | Rater 2 | Rater 3 |
|---|---|---|---|
| 1 | 7 | 6 | 7 |
| 2 | 3 | 4 | 3 |
| 3 | 8 | 8 | 9 |
| 4 | 5 | 5 | 4 |
| 5 | 2 | 3 | 2 |
| 6 | 9 | 8 | 9 |
Using a two-way mixed model (ICC(3,1)) because the same three raters rate all students and we are interested only in these raters:
From the ANOVA decomposition: $MS_R = 20.80$, $MS_E = 0.40$, with $k = 3$ raters, so

$$ICC(3,1) = \frac{20.80 - 0.40}{20.80 + (3-1)(0.40)} = \frac{20.40}{21.60} = .94$$

Interpretation: ICC $= .94$ indicates "excellent" agreement ($> .90$). The three raters are highly consistent in their severity ratings.
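The ICC(3,1) computation can be carried out from the raw ratings table via a two-way ANOVA decomposition; this is a pure-Python sketch (illustrative, assuming complete data with no missing ratings):

```python
def icc_3_1(data):
    """ICC(3,1): two-way mixed model, single measures, consistency.
    data: one row per subject, one column per rater."""
    n, k = len(data), len(data[0])
    grand = sum(x for row in data for x in row) / (n * k)
    subj_means = [sum(row) / k for row in data]
    rater_means = [sum(col) / n for col in zip(*data)]
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_rater = n * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_subj - ss_rater
    ms_r = ss_subj / (n - 1)                 # between-subjects mean square
    ms_e = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)

# Ratings from the example: 6 students x 3 raters
ratings = [[7, 6, 7], [3, 4, 3], [8, 8, 9],
           [5, 5, 4], [2, 3, 2], [9, 8, 9]]
print(round(icc_3_1(ratings), 3))  # 0.944
```

For ICC(2,1), the denominator also needs the between-raters mean square term; in practice, established implementations such as `psych::ICC` in R report all the Shrout--Fleiss forms at once.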
Interpretation
When interpreting IRR:
- Kappa is always lower than percent agreement because it removes chance agreement. A percent agreement of 80% might correspond to a Kappa of only .60 if categories are unbalanced.
- Base rates matter. When one category is very common (e.g., 90% of cases are "normal"), percent agreement is inflated and Kappa can be paradoxically low even with good agreement. This is known as the Kappa paradox.
- ICC model matters. Always specify which ICC form you used (e.g., ICC(2,1) or ICC(3,1)) and whether it reflects single-measure or average-measure reliability.
- Confidence intervals. Always report 95% CIs alongside point estimates. A Kappa of .70 with a CI of [.45, .95] is much less informative than one with a CI of [.62, .78].
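Analytic CI formulas for kappa exist, but a percentile bootstrap is a simple, assumption-light alternative: resample rated cases with replacement, recompute kappa each time, and take the empirical 2.5th and 97.5th percentiles. A sketch (illustrative helper names, not a library API):

```python
import random
from collections import Counter

def kappa(r1, r2):
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / n ** 2
    # Guard against degenerate resamples where both raters use one category
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def bootstrap_kappa_ci(r1, r2, n_boot=2000, seed=1):
    """Percentile bootstrap 95% CI: resample cases with replacement."""
    rng = random.Random(seed)
    n = len(r1)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(kappa([r1[i] for i in idx], [r2[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

# Case-level lists matching the 50-patient contingency table (20/5/3/22)
r1 = ["MDD"] * 25 + ["No"] * 25
r2 = ["MDD"] * 20 + ["No"] * 5 + ["MDD"] * 3 + ["No"] * 22
lo, hi = bootstrap_kappa_ci(r1, r2)
print(round(kappa(r1, r2), 2), round(lo, 2), round(hi, 2))
```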
Common Mistakes
- Reporting only percent agreement. Percent agreement ignores chance and is almost always inflated. Journals and reviewers expect Kappa or ICC.
- Using the wrong ICC model. Choosing ICC(1,1) when the same raters rate all subjects (should be ICC(2,1) or ICC(3,1)) produces incorrect estimates. Map your design to the correct model.
- Not training raters. Computing IRR before raters are adequately trained wastes effort. Conduct practice sessions, discuss disagreements, and refine the coding manual before the reliability check.
- Using too small a reliability sample. Reliability estimates based on 10 cases are unstable. Aim for at least 30 cases or 20% of the total sample, whichever is larger.
- Ignoring systematic rater bias. Kappa and percent agreement do not capture systematic differences (e.g., Rater 1 always assigns higher scores). ICC detects this, but only with the correct model. Consider computing mean scores per rater to check for bias.
- Computing Kappa for continuous data. If raters assign numerical severity scores on a continuous scale, use ICC, not Kappa. Kappa treats a rating of 4 vs. 5 the same as 4 vs. 9: both count as disagreements.
- Not computing reliability for each category. Overall Kappa can mask poor agreement on rare categories. Report category-specific Kappa values when possible.
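The last point can be implemented by collapsing the ratings to "this category vs. any other" and computing a binary two-rater kappa per category; a sketch with illustrative helper names:

```python
from collections import Counter

def kappa(r1, r2):
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)

def per_category_kappa(r1, r2):
    """Binary kappa for each category: 'this category' vs. 'any other'."""
    return {cat: kappa([x == cat for x in r1], [x == cat for x in r2])
            for cat in sorted(set(r1) | set(r2))}

# Toy data: overall agreement looks decent, but category A is noticeably weaker
r1 = ["A", "A", "B", "C", "C", "B", "A", "C"]
r2 = ["A", "B", "B", "C", "C", "B", "A", "A"]
for cat, k in per_category_kappa(r1, r2).items():
    print(cat, round(k, 2))  # A 0.47, B 0.71, C 0.71
```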
How to Report in APA Format
For Cohen's Kappa:
Two clinical psychologists independently diagnosed 50 patients. Inter-rater reliability was substantial, $\kappa = .68$, 95% CI [.49, .87], $p < .001$.
For ICC:
Three observers rated classroom behavior severity. Intraclass correlation using a two-way mixed model (ICC(3,1)) indicated excellent inter-rater reliability, ICC $= .94$, 95% CI [.78, .98].
For percent agreement alongside Kappa:
Coders independently classified 200 social media posts into five content categories. Percent agreement was 82%, and Cohen's Kappa was $\kappa = .74$, indicating substantial agreement beyond chance.
Key elements:
- Number of raters and number of cases
- The specific statistic used (Kappa, weighted Kappa, Fleiss' Kappa, or ICC with model specified)
- The point estimate and 95% confidence interval
- A verbal interpretation (e.g., "substantial," "excellent")
- For ICC, always state the model form (e.g., ICC(3,1) two-way mixed, single measures)
Ready to calculate?
Now that you understand the concept, use the free Reliability Calculator on Subthesis to run your own analysis.
Related Concepts
Cronbach's Alpha
Understand Cronbach's alpha for measuring internal consistency reliability. Learn the formula, interpretation guidelines, and what to do when alpha is low.
Descriptive Statistics
Master descriptive statistics: learn about mean, median, mode, standard deviation, variance, and range. Know when to use each measure for your research data.
Effect Size
Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.