Evaluating inter-rater reliability in the context of “Sysmex UN2000 detection of protein/creatinine ratio and of renal tubular epithelial cells can be used for screening lupus nephritis”: a statistical examination – BMC Nephrology

Cohen devised a classification system for interpreting Kappa values, indicating various levels of agreement [4]. However, McHugh [15] raised practical concerns, arguing that labeling a 61% agreement rate as “substantial” might be misleading, especially in critical settings such as clinical laboratories, where an error rate of nearly 40% would be significant. He emphasized the need for a higher standard, noting that many sources recommend a minimum interrater agreement of 80%. McHugh proposed an alternative interpretation of Kappa values: ≤ 0.20 as no agreement, 0.21 to 0.39 as minimal agreement, 0.40 to 0.59 as weak agreement, 0.60 to 0.79 as moderate agreement, 0.80 to 0.90 as strong agreement, and values exceeding 0.90 as almost perfect agreement. Because it accounts for the practical implications of different Kappa values, this interpretation offers a more contextually relevant framework for settings where accuracy has substantial real-world consequences.
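As an illustration, McHugh's bands can be expressed as a simple lookup. The sketch below is not taken from McHugh or from Chen et al.; the function name `interpret_kappa_mchugh` is hypothetical, and treating each band's lower bound as inclusive (so that a value such as 0.395 falls into "minimal") is an assumption, since the published bands are stated to two decimals only.

```python
def interpret_kappa_mchugh(kappa: float) -> str:
    """Return McHugh's agreement label for a Kappa value.

    Assumption: values between the published two-decimal bands
    (e.g. 0.395) are assigned to the lower band.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("Kappa must lie in [-1, 1]")
    if kappa <= 0.20:
        return "none"
    if kappa < 0.40:
        return "minimal"
    if kappa < 0.60:
        return "weak"
    if kappa < 0.80:
        return "moderate"
    if kappa <= 0.90:
        return "strong"
    return "almost perfect"


# The Cohen's Kappa of 0.858 reported by Chen et al. falls in the "strong" band.
label = interpret_kappa_mchugh(0.858)
```

Under this scheme, a Kappa of 0.61 is labeled "moderate" rather than "substantial".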

Comparative Kappa statistics between Cohen’s and weighted approaches

Chen et al. assessed the viability of using automated urine sediment analysis (UN2000) for lupus nephritis screening. A total of 284 urine samples from patients with systemic lupus erythematosus were examined with the UN2000, evaluating protein/creatinine ratio (P/C) and renal tubular epithelial cells (RTEC). With biochemical analysis and microscopy as the gold standard, the Kappa consistency test demonstrated strong and good agreement for P/C and RTEC, respectively (Cohen’s Kappa, 0.858). Setting P/C ≥ 2+ as the sole screening standard yielded the highest specificity, positive predictive value, and coincidence for lupus nephritis. Combining P/C ≥ 2+ or RTEC > 2.8 cells/µl as the standard maximized sensitivity and negative predictive value. The authors concluded that the UN2000 is effective in lupus nephritis screening by detecting P/C and RTEC. Yet, as mentioned earlier, for three-category ordinal variables, weighted Kappa is often more suitable than Cohen’s Kappa for evaluating inter-rater reliability (IRR).

Upon examining the data provided by the authors, we assessed the agreement between AU5800 and UN2000 using three Kappa values calculated with SPSSAU (https://spssau.com/) (Table 1). Cohen’s Kappa indicated strong agreement between the two instruments, whereas LWK and QWK indicated almost perfect agreement. LWK and QWK are therefore preferable when a more sensitive evaluation is needed, because they penalize larger differences in judgment more heavily.
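To make the difference between the three coefficients concrete, the following minimal sketch computes unweighted, linearly weighted, and quadratically weighted Kappa from a square contingency table. It is not the SPSSAU procedure and does not use the study's data: the function name `weighted_kappa` and the 3 × 3 counts in `example_table` are hypothetical and chosen only for illustration. LWK and QWK are taken here to mean linear and quadratic weighted Kappa, with disagreement weights |i − j|/(k − 1) and its square for k ordered categories.

```python
def weighted_kappa(table, weights="none"):
    """Kappa for a k x k contingency table (rows: rater A, columns: rater B).

    weights: "none" (Cohen's Kappa), "linear", or "quadratic".
    """
    k = len(table)
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("table must be square with at least 2 categories")
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        # Penalty assigned to a pair of ratings that are |i - j| categories apart.
        if weights == "none":
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        if weights == "linear":
            return d
        if weights == "quadratic":
            return d * d
        raise ValueError("weights must be 'none', 'linear', or 'quadratic'")

    cells = [(i, j) for i in range(k) for j in range(k)]
    # Weighted disagreement actually observed.
    observed = sum(disagreement(i, j) * table[i][j] for i, j in cells) / n
    # Weighted disagreement expected by chance from the marginal totals.
    expected = sum(disagreement(i, j) * row_tot[i] * col_tot[j] for i, j in cells) / n**2
    if expected == 0:
        raise ValueError("Kappa is undefined when all ratings fall in one category")
    return 1.0 - observed / expected


# Hypothetical counts for a three-category ordinal rating (NOT the study's data).
example_table = [
    [40, 5, 0],
    [4, 30, 6],
    [0, 3, 12],
]
cohen = weighted_kappa(example_table, "none")
lwk = weighted_kappa(example_table, "linear")
qwk = weighted_kappa(example_table, "quadratic")
```

In this made-up table every disagreement lies between adjacent categories, so both weighted coefficients come out higher than the unweighted one, with QWK the highest. Tables in which disagreements span two categories would be penalized more heavily by the weighted coefficients.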

Table 1 The Kappa coefficient between the AU5800 and UN2000