Introduction
Fleiss’ kappa, originally proposed by Fleiss in 1971 and further developed by Fleiss and colleagues in 2003, is a measure of agreement among two or more raters who evaluate data recorded on a categorical scale. Unlike Cohen’s kappa, which assumes the same fixed pair of raters judges every subject, Fleiss’ kappa accommodates designs in which multiple observers are involved and both the subjects being evaluated and the raters who assess them are randomly selected. This makes it particularly useful in research designs where sampling is random and assessment is carried out by different, non-unique raters drawn from a wider population.
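For reference, the statistic compares the agreement actually observed with the agreement expected by chance. A standard formulation (following Fleiss’ 1971 notation, with N subjects, n ratings per subject, k categories, and n_ij the number of raters assigning subject i to category j) is:

```latex
% N subjects, n raters per subject, k categories,
% n_{ij} = number of raters who assign subject i to category j.
\[
p_j = \frac{1}{N n}\sum_{i=1}^{N} n_{ij}, \qquad
P_i = \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}\,(n_{ij}-1),
\]
\[
\bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i, \qquad
\bar{P}_e = \sum_{j=1}^{k} p_j^{2}, \qquad
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}.
\]
```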
Example of Application
To understand the use of Fleiss’ kappa, consider the example of a large medical practice where the director wants to determine whether physicians agree on when to prescribe antibiotics. Four doctors are randomly selected from the population of all available doctors and are asked to examine a patient presenting symptoms of a possible infection. Their decisions are categorized into three groups: “prescribe,” “follow-up,” and “do not prescribe.” This procedure is repeated for ten patients, who are also randomly selected from the overall population of the practice. Each time, different doctors are chosen to evaluate the patients, highlighting the non-unique character of the raters. Analysis using Fleiss’ kappa makes it possible to calculate the overall level of agreement among the doctors, while also revealing in which categories agreement was stronger or weaker. For instance, doctors may show higher agreement when the decision is “prescribe” or “do not prescribe,” but much lower agreement when the decision is “follow-up.”
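As an illustration only, the sketch below computes Fleiss’ kappa for a table like the one just described. The counts are made up, and the use of Python with the statsmodels package is an assumption about the analysis environment; the article’s own workflow is SPSS, discussed later.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Hypothetical counts: 10 patients (rows) x 3 decision categories (columns):
# "prescribe", "follow-up", "do not prescribe".
# Each cell is the number of the 4 doctors who chose that category,
# so every row sums to 4, the number of raters per patient.
counts = np.array([
    [4, 0, 0],
    [3, 1, 0],
    [0, 1, 3],
    [4, 0, 0],
    [1, 2, 1],
    [0, 0, 4],
    [2, 2, 0],
    [0, 1, 3],
    [3, 0, 1],
    [0, 2, 2],
])

kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")  # roughly 0.36 for these made-up counts
```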
Basic Requirements and Assumptions
The use of Fleiss’ kappa requires certain fundamental conditions to be satisfied. First, the response variable being evaluated must be categorical, either nominal or ordinal. However, Fleiss’ kappa does not take into account the ordered nature of ordinal variables, treating all categories as equivalent. A second requirement is that categories must be mutually exclusive: each observation can be classified into only one category and cannot belong to several categories at the same time. Moreover, all raters must use the same set of response categories, so that every subject is rated on an identical scale. For example, one rater cannot use three categories while another uses only two.
Another important assumption is that raters are non-unique, meaning they are not the same for all observations but are randomly selected each time from a larger pool. This is what distinguishes Fleiss’ kappa from Cohen’s kappa, which requires a fixed pair of raters. It is equally essential that raters are independent, so that the judgment of one does not influence the judgment of another. Finally, the subjects being evaluated must be randomly sampled from the population of interest, ensuring that results are generalizable and free from sampling bias.
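A minimal sketch of what these requirements imply for the data layout, again assuming Python with statsmodels and made-up ratings: each cell holds exactly one of the mutually exclusive category codes, every rater uses the same coding scheme, and which particular doctor filled a given column is irrelevant because the raters are non-unique.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters

# Hypothetical raw ratings: one row per patient, one column per rating slot.
# Codes: 0 = "prescribe", 1 = "follow-up", 2 = "do not prescribe".
ratings = np.array([
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [2, 2, 1, 2],
    [0, 0, 0, 0],
    [1, 0, 1, 2],
])

# aggregate_raters collapses the rater columns into a subjects-by-categories
# count table suitable for fleiss_kappa; the identity of the rater behind
# each column is deliberately discarded.
table, categories = aggregate_raters(ratings)
print(categories)  # category codes found in the data
print(table)       # each row sums to the number of raters (4)
```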
Application in SPSS
The analysis of Fleiss’ kappa can be carried out in SPSS through the “Reliability Analysis” procedure in six steps. The software provides the overall κ statistic, accompanied by a significance test and a 95% confidence interval. These outputs allow researchers to assess the overall level of agreement among non-unique raters. At the same time, SPSS makes it possible to compute individual kappas for each category of the response variable. In this way, it becomes possible to identify cases where agreement was higher and those where it was lower.
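SPSS menus and syntax are not reproduced here. As a rough cross-check outside SPSS, the category-level kappas mentioned above can be computed directly from the count table using Fleiss’ (1971) category-wise formula; the sketch below is a hypothetical helper applied to the made-up table from the earlier example.

```python
import numpy as np

def fleiss_kappa_per_category(counts):
    """Category-wise kappa_j following Fleiss (1971).

    counts: subjects-by-categories array of rating counts, where each
    row sums to n, the (constant) number of raters per subject.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                  # number of subjects
    n = counts[0].sum()                  # raters per subject (assumed constant)
    p = counts.sum(axis=0) / (N * n)     # overall proportion of ratings per category
    q = 1.0 - p
    # kappa_j = 1 - sum_i n_ij * (n - n_ij) / (N * n * (n - 1) * p_j * q_j)
    return 1.0 - (counts * (n - counts)).sum(axis=0) / (N * n * (n - 1) * p * q)

# Applied to the made-up 10-patient table from the earlier example:
counts = np.array([
    [4, 0, 0], [3, 1, 0], [0, 1, 3], [4, 0, 0], [1, 2, 1],
    [0, 0, 4], [2, 2, 0], [0, 1, 3], [3, 0, 1], [0, 2, 2],
])
print(fleiss_kappa_per_category(counts).round(3))
```

With this made-up table, the “follow-up” category shows clearly weaker agreement than “prescribe” and “do not prescribe,” mirroring the pattern described in the example section.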
Interpretation of the results is based on established guidelines, according to which values below 0.20 indicate poor agreement, values between 0.21 and 0.40 represent fair agreement, values from 0.41 to 0.60 suggest moderate agreement, values from 0.61 to 0.80 denote substantial agreement, and values from 0.81 to 1.00 correspond to almost perfect agreement.
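For illustration, these bands translate into a simple lookup; the function below is hypothetical and merely restates the cut-offs from the text.

```python
def interpret_kappa(kappa):
    """Return the agreement label for a kappa value, using the bands above."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(interpret_kappa(0.36))  # "fair"
```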
Conclusions
Fleiss’ kappa is a powerful statistical tool for evaluating agreement when more than two raters are involved and when their selection is random and based on non-unique observers. Its proper application, however, requires strict adherence to basic assumptions, such as the use of categorical variables, mutual exclusivity of categories, equal numbers of categories across raters, independence of judgments, and random selection of both raters and subjects. Through SPSS, the calculation and interpretation of the κ statistic become easily accessible, providing researchers with the ability not only to determine the overall level of agreement but also to detect differences across categories. Thus, Fleiss’ kappa emerges as an essential measure for research designs aiming to study the reliability of multiple observers, contributing to the robustness and credibility of scientific conclusions.