Natural Language Processing Faculty Publications

Analysis of predictive performance and reliability of classifiers for quality assessment of medical evidence revealed important variation by medical area

Simon Šuster, School of Computing and Information Systems
Timothy Baldwin, University of Melbourne & Mohamed Bin Zayed University of Artificial Intelligence
Karin Verspoor, University of Melbourne & RMIT University

Document Type

Article

Publication Title

Journal of Clinical Epidemiology

Abstract

Objectives: A major obstacle to deploying models for automated quality assessment is their reliability. We analyze the calibration and selective classification performance of such models.

Study Design and Setting: We examine two systems for assessing the quality of medical evidence, EvidenceGRADEr and RobotReviewer, both developed from the Cochrane Database of Systematic Reviews (CDSR) to measure the strength of bodies of evidence and the risk of bias (RoB) of individual studies, respectively. We report their calibration error and Brier scores, present their reliability diagrams, and analyze the risk–coverage trade-off in selective classification.

Results: The models are reasonably well calibrated on most quality criteria (expected calibration error [ECE] 0.04–0.09 for EvidenceGRADEr, 0.03–0.10 for RobotReviewer). However, both calibration and predictive performance vary significantly by medical area. This has ramifications for applying such models in practice, because average performance is a poor indicator of group-level performance (e.g., health and safety at work, allergy and intolerance, and public health see much worse performance than cancer, pain and anesthesia, and neurology). We explore the reasons behind this disparity.

Conclusion: Practitioners adopting automated quality assessment should expect large fluctuations in system reliability and predictive performance depending on the medical area. Prospective indicators of such behavior should be researched further.
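The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a binary classifier that outputs a probability for the positive class, a 0.5 decision threshold, equal-width confidence bins for ECE, and abstention on the least confident predictions for the risk–coverage curve. All function names are hypothetical, and the abstract does not say how the two systems' outputs were binned or grouped.

```python
from collections import defaultdict


def _confidence_and_error(p, y):
    """Confidence in the predicted class and a 0/1 error flag (0.5 threshold assumed)."""
    pred = 1 if p >= 0.5 else 0
    conf = p if pred == 1 else 1.0 - p
    return conf, 0.0 if pred == y else 1.0


def brier_score(probs, labels):
    """Mean squared difference between positive-class probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf, err = _confidence_and_error(p, y)
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 - err))
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(a for _, a in b) / len(b)
            ece += len(b) / len(probs) * abs(accuracy - mean_conf)
    return ece


def risk_coverage_curve(probs, labels):
    """(coverage, risk) pairs when only the k most confident predictions are kept."""
    scored = sorted(
        (_confidence_and_error(p, y) for p, y in zip(probs, labels)),
        key=lambda t: -t[0],
    )
    curve, errors = [], 0.0
    for k, (_, err) in enumerate(scored, start=1):
        errors += err
        curve.append((k / len(scored), errors / k))
    return curve


def ece_by_group(probs, labels, groups, n_bins=10):
    """ECE computed separately per group label (e.g., a medical area)."""
    buckets = defaultdict(lambda: ([], []))
    for p, y, g in zip(probs, labels, groups):
        buckets[g][0].append(p)
        buckets[g][1].append(y)
    return {g: expected_calibration_error(ps, ys, n_bins) for g, (ps, ys) in buckets.items()}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    probs = [0.95, 0.85, 0.65, 0.25]
    labels = [1, 1, 0, 0]
    areas = ["area_a", "area_a", "area_b", "area_b"]
    print(brier_score(probs, labels))
    print(expected_calibration_error(probs, labels))
    print(risk_coverage_curve(probs, labels))
    print(ece_by_group(probs, labels, areas))
```

Computing a metric per group, as in `ece_by_group`, is one way to surface the kind of gap between average and group-level performance that the abstract describes; with small groups the per-group estimates are noisy, so they are best read alongside group sizes.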

First Page

58

Last Page

69
DOI

10.1016/j.jclinepi.2023.04.006

Publication Date

7-1-2023

Keywords

Automated quality assessment of medical evidence, Calibration, Critical appraisal, Disparity, Reliability, Risk of bias, Selective classification, Systematic reviews, Uncertainty estimation

Comments

Hybrid Gold Open Access

Archived with thanks to Elsevier

Preprint License: CC BY-NC-ND

Uploaded 16 November 2023

Recommended Citation

S. Šuster, T. Baldwin, and K. Verspoor, “Analysis of predictive performance and reliability of classifiers for quality assessment of medical evidence revealed important variation by medical area,” Journal of Clinical Epidemiology, vol. 159, pp. 58–69, 2023. doi:10.1016/j.jclinepi.2023.04.006

Additional Links

Publisher's link: https://www.sciencedirect.com/science/article/pii/S0895435623000914?via%3Dihub

Included in

Artificial Intelligence and Robotics Commons
