Introduction

In research designs where two or more raters, that is, evaluators or observers, assess a categorical variable, it is crucial to examine the degree of agreement between them. Cohen’s kappa (κ) is a statistical measure that captures the level of agreement between two raters whose decisions are recorded on categorical scales. It goes beyond a simple calculation of the percentage of agreement, because it also accounts for the agreement expected by chance, and therefore provides a more reliable estimate. A typical example comes from the medical field, where two physicians evaluate thirty patients with skin problems and decide whether or not to refer them to a specialist. Comparing their decisions using the κ statistic offers a clear picture of the consistency of their diagnostic practice.
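
To make the distinction between observed and chance agreement concrete, the following minimal Python sketch computes κ by hand for a hypothetical 2×2 table of the two physicians’ referral decisions; the counts are invented for the illustration, and the calculation follows κ = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the proportion expected by chance.

    # Hypothetical counts for 30 patients (rows: physician A, columns: physician B).
    table = [[12, 3],    # A: refer        -> B: refer, B: do not refer
             [4, 11]]    # A: do not refer -> B: refer, B: do not refer

    n = sum(sum(row) for row in table)            # total number of paired decisions
    p_o = sum(table[i][i] for i in range(2)) / n  # observed agreement (diagonal cells)

    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(2)) / n for j in range(2)]
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))

    kappa = (p_o - p_e) / (1 - p_e)
    print(round(p_o, 3), round(p_e, 3), round(kappa, 3))   # 0.767 0.5 0.533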

The Importance of Measuring Agreement

Agreement between raters is of fundamental importance, as it directly affects the validity and reliability of a study’s conclusions. In the example with the two physicians, if a high level of agreement is found, the study leader may feel confident that the doctors follow similar evaluation criteria. However, it is important to note that agreement does not necessarily imply correctness of diagnosis. Two raters may agree with each other and still be wrong, for example by referring more patients than necessary. Cohen’s kappa measures only the degree of agreement and not the accuracy of the decisions.

Assumptions of Cohen’s Kappa

For the calculation of κ to be valid, five assumptions must be met; a small data-format check is sketched after this list.

1. Categorical scale. The raters’ judgments must be recorded on a categorical scale, either nominal or dichotomous, and the categories must be mutually exclusive: a rater cannot, for example, classify the same case as both “normal” and “suspicious” simultaneously.

2. Paired observations. The data must consist of pairs of judgments of the same phenomenon, with each observation assessed by the same two raters. If thirty patients are being evaluated, thirty pairs of decisions should exist.

3. Symmetric categories. Both raters must use the same set of categories, so that the agreement table is square, such as 2×2, 3×3, or 4×4; one rater cannot use two categories while the other uses three.

4. Independence of the raters. The judgment of one rater must not influence the judgment of the other, which would be violated, for example, if the raters discussed their evaluations before recording them.

5. Same raters throughout. The same two raters must evaluate all observations. If different raters were involved in each case, the appropriate statistic would not be Cohen’s kappa but Fleiss’ kappa, which accounts for multiple raters.
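
As a rough illustration of the paired-data and square-table assumptions, here is a small Python sketch (using pandas) that builds a hypothetical set of thirty paired decisions, with invented column names and randomly generated values, and checks that every patient has a complete pair and that both raters draw on the same category set.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    categories = ["refer", "do_not_refer"]

    # Hypothetical paired decisions: one row per patient, one column per rater.
    ratings = pd.DataFrame({
        "physician_A": rng.choice(categories, size=30),
        "physician_B": rng.choice(categories, size=30),
    })

    # Thirty complete pairs, and both raters restricted to the same categories,
    # so the agreement table is square (here 2x2).
    assert len(ratings) == 30 and not ratings.isna().any().any()
    assert set(ratings["physician_A"]) <= set(categories)
    assert set(ratings["physician_B"]) <= set(categories)

    print(pd.crosstab(ratings["physician_A"], ratings["physician_B"]))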

Application in SPSS

The application of Cohen’s kappa in SPSS is performed by creating a crosstab that compares the decisions of the two raters (Analyze → Descriptive Statistics → Crosstabs, with Kappa selected under Statistics). SPSS then provides the κ statistic, along with significance levels and confidence intervals. The interpretation of the result is based on the value range. Values close to zero indicate no agreement beyond chance (negative values indicate agreement below chance), while values close to one reflect almost perfect agreement. Specifically, values of 0.20 or below indicate poor agreement, values from 0.21 to 0.40 fair agreement, from 0.41 to 0.60 moderate agreement, from 0.61 to 0.80 substantial agreement, and from 0.81 to 1.00 almost perfect agreement. In the case of the two physicians, a κ value of 0.85 would mean that their decisions are highly consistent.
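
For readers working outside SPSS, the same calculation and the interpretation bands can be reproduced with a short Python sketch; it assumes scikit-learn is available, and the two physicians’ decisions listed here are invented for the example.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical referral decisions, one entry per patient, same order for both raters.
    physician_a = ["refer", "refer", "no", "refer", "no", "no", "refer", "no"]
    physician_b = ["refer", "refer", "no", "no",    "no", "no", "refer", "refer"]

    kappa = cohen_kappa_score(physician_a, physician_b)

    # Map the value onto the bands described above.
    if kappa <= 0.20:
        label = "poor"
    elif kappa <= 0.40:
        label = "fair"
    elif kappa <= 0.60:
        label = "moderate"
    elif kappa <= 0.80:
        label = "substantial"
    else:
        label = "almost perfect"

    print(f"kappa = {kappa:.2f} ({label} agreement)")

The thresholds in the sketch mirror the bands listed above; they are conventional guidelines rather than strict cut-offs.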

Conclusions

Cohen’s kappa is an extremely useful tool for estimating the agreement between two raters on categorical data. Its correct application, however, requires that five basic assumptions be satisfied, concerning the categorical nature of the data, the presence of paired observations, the symmetry of categories, the independence of the raters, and the use of the same two raters across all cases. The method is applied in many fields, from medical diagnosis to service quality assessment and psychometric testing, while more complex scenarios call for variants such as the weighted kappa and Fleiss’ kappa. In any case, the use of the statistic highlights the importance of reliability and consistency in every scientific or professional field where decisions carry critical weight.
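
As a closing illustration of the variants mentioned above, the sketch below computes a weighted kappa for ordinal grades and Fleiss’ kappa for three raters; it assumes scikit-learn and statsmodels are installed, and all of the ratings are invented.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Weighted kappa for ordinal categories (e.g. severity graded 0-3), where
    # near-misses count against agreement less than large disagreements.
    rater_1 = [0, 1, 2, 3, 2, 1, 0, 3, 2, 1]
    rater_2 = [0, 2, 2, 3, 1, 1, 0, 2, 2, 1]
    print(cohen_kappa_score(rater_1, rater_2, weights="quadratic"))

    # Fleiss' kappa for more than two raters: one row per case, one column per rater.
    ratings = np.array([
        [0, 0, 1],
        [1, 1, 1],
        [2, 2, 1],
        [0, 1, 0],
        [2, 2, 2],
    ])
    table, _ = aggregate_raters(ratings)   # convert to counts per category for each case
    print(fleiss_kappa(table, method="fleiss"))

Quadratic weights penalise distant disagreements more heavily than adjacent ones, which usually matches the intuition behind ordered severity grades.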