Stats for Scholars

Inter-Rater Reliability (Cohen's Kappa / ICC)

Purpose
Quantifies the degree of agreement between two or more raters or observers who classify, rate, or measure the same set of subjects.
When to Use
When your study involves subjective coding, classification, or ratings by human judges and you need to demonstrate that the measurement process is consistent.
Data Type
Categorical (nominal or ordinal) for Cohen's Kappa; continuous (interval or ratio) for Intraclass Correlation Coefficient (ICC)
Key Assumptions
Kappa: two raters, same set of subjects, mutually exclusive categories, independent ratings. ICC: ratings are at least interval-level, raters are a sample from a population of raters (for random-effects models).
Tools
Reliability Calculator on Subthesis →

What Is Inter-Rater Reliability?

Inter-rater reliability (IRR), also called inter-rater agreement or inter-observer reliability, measures the extent to which two or more independent raters assign the same scores or categories to the same set of subjects. It answers the question: "If different people apply the same coding scheme, do they reach the same conclusions?"

IRR is essential in research that involves human judgment — content analysis, behavioral coding, clinical diagnosis, essay grading, and qualitative classification. Without adequate IRR, you cannot trust that your measurements reflect the construct of interest rather than idiosyncratic rater differences.

There are several approaches to assessing IRR, each suited to a different data type:

| Method | Data Type | Number of Raters |
|---|---|---|
| Percent agreement | Categorical | 2+ |
| Cohen's Kappa (κ) | Categorical (nominal) | 2 |
| Weighted Kappa | Categorical (ordinal) | 2 |
| Fleiss' Kappa | Categorical (nominal) | 3+ |
| Intraclass Correlation (ICC) | Continuous | 2+ |
| Krippendorff's Alpha | Any | 2+ |

When to Use It

Report inter-rater reliability when:

  • Two or more coders independently classify qualitative data (e.g., coding interview themes, diagnosing disorders from case files).
  • Judges rate performances, essays, or other subjective material on a scale.
  • Observers record behaviors in an observational study (e.g., counting occurrences of aggression in a playground).
  • You need to demonstrate that your measurement procedure produces consistent, replicable results regardless of who does the rating.

IRR should be established before the main data collection and reported in the methods section. Typically, raters independently code a subset (10--20%) of the data, IRR is computed, discrepancies are discussed, and then the remaining data are coded.

Assumptions

For Cohen's Kappa

  1. Exactly two raters. For three or more raters, use Fleiss' Kappa or Krippendorff's Alpha.
  2. Same subjects rated by both raters. Every subject must be rated by both raters.
  3. Mutually exclusive and exhaustive categories. Each subject is assigned to exactly one category.
  4. Independent ratings. Raters must not discuss cases or see each other's ratings.

For ICC

  1. Continuous (interval or ratio) data. Ratings must be numeric and meaningful in magnitude.
  2. Appropriate model selection. You must choose the correct ICC form based on your design:
    • ICC(1,1): Each subject is rated by a different set of randomly selected raters (one-way random).
    • ICC(2,1): Each subject is rated by the same set of raters, who are considered a random sample from a larger population (two-way random).
    • ICC(3,1): Each subject is rated by the same set of raters, who are the only raters of interest (two-way mixed).
  3. Normality. Ratings should be approximately normally distributed.

Formula

Percent Agreement

The simplest measure, but it does not account for agreement that occurs by chance:

$$\text{Percent Agreement} = \frac{\text{Number of agreements}}{\text{Total number of ratings}} \times 100$$
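As a quick sketch, percent agreement for two raters takes only a few lines of Python (the example codes below are invented for illustration):

```python
def percent_agreement(ratings1, ratings2):
    """Share of subjects on which both raters gave the same code, as a percentage."""
    if len(ratings1) != len(ratings2):
        raise ValueError("Both raters must rate the same set of subjects.")
    matches = sum(a == b for a, b in zip(ratings1, ratings2))
    return 100.0 * matches / len(ratings1)

# Hypothetical codes from two raters over five subjects (they agree on 4 of 5)
r1 = ["MDD", "MDD", "No MDD", "No MDD", "MDD"]
r2 = ["MDD", "No MDD", "No MDD", "No MDD", "MDD"]
print(percent_agreement(r1, r2))  # 80.0
```

Remember that this number is inflated by chance agreement, which is exactly what Kappa corrects for below.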

Cohen's Kappa (κ)

Cohen's Kappa corrects percent agreement for the amount of agreement expected by chance:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

Where:

  • $P_o$ = observed proportion of agreement
  • $P_e$ = expected proportion of agreement by chance

$P_e$ is calculated from the marginal totals. If Rater 1 assigns category A to 60% of cases and Rater 2 assigns category A to 50% of cases, the chance probability of both assigning A is $0.60 \times 0.50 = 0.30$.
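This computation generalizes to any number of categories: sum the product of the marginal proportions over all categories to get $P_e$. A minimal pure-Python sketch (the rating lists are invented for illustration):

```python
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Cohen's kappa for two raters rating the same subjects (nominal categories)."""
    n = len(ratings1)
    # Observed agreement: proportion of subjects given the same category by both raters
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    # Chance agreement: product of the two raters' marginal proportions, summed over categories
    c1, c2 = Counter(ratings1), Counter(ratings2)
    p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

r1 = ["A", "A", "B", "B", "A"]
r2 = ["A", "B", "B", "B", "A"]
print(round(cohens_kappa(r1, r2), 2))  # 0.62
```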

Interpretation of Kappa

The most widely used benchmarks come from Landis and Koch (1977):

| κ | Interpretation |
|---|---|
| < 0.00 | Poor (less than chance) |
| 0.00–0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |
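These cut-points are simple to encode if you want labels attached automatically in a report. A small helper, assuming the Landis and Koch benchmarks above:

```python
def landis_koch_label(kappa):
    """Verbal benchmark for a kappa value, per Landis and Koch (1977)."""
    if kappa < 0.00:
        return "poor (less than chance)"
    # (upper bound, label) pairs, checked in ascending order
    bounds = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
              (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bounds:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1.0")

print(landis_koch_label(0.68))  # substantial
```

Treat the labels as conventions, not hard thresholds; report the numeric value and CI alongside them.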

Intraclass Correlation Coefficient (ICC)

For continuous ratings, the ICC compares variance between subjects to total variance:

$$\text{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}$$

In a two-way model (ICC(2,1) or ICC(3,1)):

$$\text{ICC}(2,1) = \frac{MS_{\text{subjects}} - MS_{\text{error}}}{MS_{\text{subjects}} + (k - 1)\,MS_{\text{error}} + \frac{k}{n}\left(MS_{\text{raters}} - MS_{\text{error}}\right)}$$

Where $MS$ = mean square from a two-way ANOVA, $k$ = number of raters, and $n$ = number of subjects.
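Once you have the mean squares from the two-way ANOVA, ICC(2,1) is a direct plug-in. A sketch with made-up values (all of the mean squares and design sizes below are illustrative only):

```python
def icc2_1(ms_subjects, ms_raters, ms_error, k, n):
    """ICC(2,1): two-way random effects, single measures, absolute agreement."""
    return (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + (k / n) * (ms_raters - ms_error)
    )

# Hypothetical ANOVA output: 10 subjects each rated by the same 3 raters
print(round(icc2_1(ms_subjects=12.0, ms_raters=0.8, ms_error=0.5, k=3, n=10), 3))  # 0.879
```

Note that the rater term $\frac{k}{n}(MS_{\text{raters}} - MS_{\text{error}})$ is what distinguishes ICC(2,1) from ICC(3,1), which omits it.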

Interpretation of ICC

| ICC | Interpretation |
|---|---|
| < 0.50 | Poor |
| 0.50–0.74 | Moderate |
| 0.75–0.89 | Good |
| 0.90–1.00 | Excellent |

(Koo & Li, 2016)

Worked Example

Example 1: Cohen's Kappa (Categorical Data)

Scenario: Two clinical psychologists independently diagnose 50 patients as either having "Major Depressive Disorder" (MDD) or "No MDD" based on structured interviews.

Contingency table:

|  | Rater 2: MDD | Rater 2: No MDD | Row Total |
|---|---|---|---|
| Rater 1: MDD | 20 | 5 | 25 |
| Rater 1: No MDD | 3 | 22 | 25 |
| Column Total | 23 | 27 | 50 |

Step 1: Calculate observed agreement ($P_o$).

$$P_o = \frac{20 + 22}{50} = \frac{42}{50} = 0.84$$

The raters agreed on 84% of cases.

Step 2: Calculate expected agreement ($P_e$).

$$P(\text{both say MDD}) = \frac{25}{50} \times \frac{23}{50} = 0.50 \times 0.46 = 0.23$$

$$P(\text{both say No MDD}) = \frac{25}{50} \times \frac{27}{50} = 0.50 \times 0.54 = 0.27$$

$$P_e = 0.23 + 0.27 = 0.50$$

Step 3: Calculate Kappa.

$$\kappa = \frac{0.84 - 0.50}{1 - 0.50} = \frac{0.34}{0.50} = 0.68$$

Interpretation: $\kappa = .68$ falls in the "substantial" agreement range. The two clinicians show good agreement beyond what would be expected by chance. However, there is room for improvement — the five cases where Rater 1 said MDD and Rater 2 said No MDD should be reviewed for diagnostic clarity.
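The three steps above can be checked with a short function that takes the four cells of the 2×2 contingency table:

```python
def kappa_from_2x2(a, b, c, d):
    """Cohen's kappa from a 2x2 contingency table.
    a = both raters chose category 1, d = both chose category 2,
    b and c = the two kinds of disagreement."""
    n = a + b + c + d
    p_o = (a + d) / n                                   # observed agreement
    p_e = ((a + b) / n) * ((a + c) / n) \
        + ((c + d) / n) * ((b + d) / n)                 # chance agreement from marginals
    return (p_o - p_e) / (1 - p_e)

# Cells from the MDD example: 20 and 22 agreements, 5 and 3 disagreements
print(round(kappa_from_2x2(20, 5, 3, 22), 2))  # 0.68
```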

Example 2: ICC (Continuous Data)

Scenario: Three trained observers rate the severity of disruptive classroom behavior on a 1--10 scale for 6 students.

| Student | Rater 1 | Rater 2 | Rater 3 |
|---|---|---|---|
| 1 | 7 | 6 | 7 |
| 2 | 3 | 4 | 3 |
| 3 | 8 | 8 | 9 |
| 4 | 5 | 5 | 4 |
| 5 | 2 | 3 | 2 |
| 6 | 9 | 8 | 9 |

Using a two-way mixed model (ICC(3,1)) because the same three raters rate all students and we are interested only in these raters:

From the ANOVA decomposition:

  • $MS_{\text{subjects}} = 20.80$
  • $MS_{\text{raters}} = 0.00$
  • $MS_{\text{error}} = 0.40$

$$\text{ICC}(3,1) = \frac{20.80 - 0.40}{20.80 + (3-1)(0.40)} = \frac{20.40}{21.60} = 0.944$$

Interpretation: ICC $= .94$ indicates "excellent" agreement. The three raters are highly consistent in their severity ratings.
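Computing ICC(3,1) directly from the ratings table above takes nothing more than the two-way ANOVA decomposition; a pure-Python sketch, no libraries required:

```python
def icc3_1(ratings):
    """ICC(3,1): two-way mixed, single measures, from an n-subjects x k-raters table."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ms_subjects = ss_subjects / (n - 1)
    ms_error = (ss_total - ss_subjects - ss_raters) / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)

# The 6 students x 3 raters data from Example 2
data = [[7, 6, 7], [3, 4, 3], [8, 8, 9], [5, 5, 4], [2, 3, 2], [9, 8, 9]]
print(round(icc3_1(data), 3))  # 0.944
```

For a production analysis you would typically use a dedicated routine (e.g., pingouin's `intraclass_corr` in Python or the `irr` package in R), which also returns confidence intervals.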

Interpretation

When interpreting IRR:

  • Kappa is always lower than percent agreement because it removes chance agreement. A percent agreement of 80% might correspond to a Kappa of only .60 if categories are unbalanced.
  • Base rates matter. When one category is very common (e.g., 90% of cases are "normal"), percent agreement is inflated and Kappa can be paradoxically low even with good agreement. This is known as the Kappa paradox.
  • ICC model matters. Always specify which ICC form you used (e.g., ICC(2,1) or ICC(3,1)) and whether it reflects single-measure or average-measure reliability.
  • Confidence intervals. Always report 95% CIs alongside point estimates. A Kappa of .70 with a CI of [.45, .95] is much less informative than one with a CI of [.62, .78].

Common Mistakes

  1. Reporting only percent agreement. Percent agreement ignores chance and is almost always inflated. Journals and reviewers expect Kappa or ICC.

  2. Using the wrong ICC model. Choosing ICC(1,1) when the same raters rate all subjects (should be ICC(2,1) or ICC(3,1)) produces incorrect estimates. Map your design to the correct model.

  3. Not training raters. Computing IRR before raters are adequately trained wastes effort. Conduct practice sessions, discuss disagreements, and refine the coding manual before the reliability check.

  4. Using too small a reliability sample. Reliability estimates based on 10 cases are unstable. Aim for at least 30 cases or 20% of the total sample, whichever is larger.

  5. Ignoring systematic rater bias. Kappa and percent agreement do not capture systematic differences (e.g., Rater 1 always assigns higher scores). ICC detects this, but only with the correct model. Consider computing mean scores per rater to check for bias.

  6. Computing Kappa for continuous data. If raters assign numerical severity scores on a continuous scale, use ICC, not Kappa. Kappa treats a rating of 4 vs. 5 the same as 4 vs. 9 — both count as disagreements.

  7. Not computing reliability for each category. Overall Kappa can mask poor agreement on rare categories. Report category-specific Kappa values when possible.
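To check agreement per category (mistake 7), one common approach is to collapse each category to a binary "this category vs. everything else" rating and compute Kappa on that. A sketch with invented emotion codes:

```python
def binary_kappa(r1, r2, category):
    """Kappa for one category, treating ratings as 'category vs. everything else'."""
    n = len(r1)
    a = [x == category for x in r1]
    b = [x == category for x in r2]
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p1, p2 = sum(a) / n, sum(b) / n
    p_e = p1 * p2 + (1 - p1) * (1 - p2)   # chance agreement for the binary split
    return (p_o - p_e) / (1 - p_e)

# Hypothetical emotion codes from two raters over six posts
r1 = ["joy", "joy", "anger", "fear", "joy", "anger"]
r2 = ["joy", "anger", "anger", "joy", "joy", "anger"]
for cat in sorted(set(r1) | set(r2)):
    print(cat, round(binary_kappa(r1, r2, cat), 2))
# anger 0.67
# fear 0.0
# joy 0.33
```

Here the rare "fear" category shows zero agreement beyond chance even though the other categories look acceptable — exactly the pattern an overall Kappa can hide.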

How to Report in APA Format

For Cohen's Kappa:

Two clinical psychologists independently diagnosed 50 patients. Inter-rater reliability was substantial, κ = .68, 95% CI [.49, .87], p < .001.

For ICC:

Three observers rated classroom behavior severity. Intraclass correlation using a two-way mixed model (ICC(3,1)) indicated excellent inter-rater reliability, ICC = .94, 95% CI [.78, .98].

For percent agreement alongside Kappa:

Coders independently classified 200 social media posts into five content categories. Percent agreement was 82%, and Cohen's Kappa was κ = .76, indicating substantial agreement beyond chance.

Key elements:

  • Number of raters and number of cases
  • The specific statistic used (Kappa, weighted Kappa, Fleiss' Kappa, or ICC with model specified)
  • The point estimate and 95% confidence interval
  • A verbal interpretation (e.g., "substantial," "excellent")
  • For ICC, always state the model form (e.g., ICC(3,1) two-way mixed, single measures)

Ready to calculate?

Now that you understand the concept, use the free Reliability Calculator on Subthesis to run your own analysis.

Calculate Reliability on Subthesis

Related Concepts

Cronbach's Alpha

Understand Cronbach's alpha for measuring internal consistency reliability. Learn the formula, interpretation guidelines, and what to do when alpha is low.

Descriptive Statistics

Master descriptive statistics: learn about mean, median, mode, standard deviation, variance, and range. Know when to use each measure for your research data.

Effect Size

Learn what effect size is, why it matters more than p-values alone, and how to calculate and interpret Cohen's d, Hedges' g, and eta-squared for your research.

© 2026 Angel Reyes / Subthesis. All rights reserved.
