The Evaluation of Transformer Models for the Detection of Adverse Drug Events: A Benchmark Study Using Dutch Free-Text Documents of Hospitalized Patients.

Adverse drug events (ADEs) are a leading cause of preventable patient harm in hospitals. Because they are often recorded only in clinical free-text documents, retrieval and quantification are significantly limited. Automating ADE detection with natural language processing (NLP) is promising. Recent work shows that bidirectional encoder representations from transformers (BERT)-based models outperform bidirectional long short-term memory (Bi-LSTM) models and even larger generative pretrained transformers while being more computationally efficient. However, most ADE-NLP research focuses on the English language, often applies metrics less suitable for rare outcomes such as ADEs, and lacks external validation.

To evaluate four transformer models for the detection of ADEs by reusing Dutch clinical free-text documents, and to create a benchmark with realistic clinical scenarios, appropriate performance measures, and external validation.

We used three anonymized datasets: (1) the Dutch ADE corpus, with 102 densely annotated progress notes of patients admitted to the intensive care unit (ICU) of one Dutch academic hospital; (2) the ICU AKI corpus, with 411 sparsely annotated ICU notes from the same hospital; and (3) the WINGS corpus, with 100 discharge letters of internal medicine patients from two Dutch non-academic hospitals, labeled for ADE presence. A Bi-LSTM model and four transformer-based Dutch or multilingual encoder models (BERTje, RobBERT-base, MedRoBERTa.nl, NuNER) were trained for named entity recognition (NER) and relation classification (RC) using the Dutch ADE corpus. We used fivefold cross-validation with 60%/20%/20% train/validation/test splits and performed hyperparameter tuning on the first fold for NER and across all folds for RC. We evaluated our ADE RC models internally using gold standard entities (two-step task) and predicted entities (end-to-end task). In addition, all models were externally validated on the WINGS corpus for detecting ADEs at the document level. We report both micro- and macro-averaged F1 scores to account for the rarity of ADEs.

In our internal validation, MedRoBERTa.nl achieved the best performance, with a macro-averaged F1 score of 0.63 using gold standard entities and 0.62 using predicted entities, while all models reached micro-averaged F1 scores of at least 0.99. MedRoBERTa.nl also performed best in our external validation, with recall ranging from 0.67 to 0.74 using predicted entities (end-to-end task), meaning that between 67% and 74% of discharge letters with ADEs were detected.
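The gap between near-perfect micro-averaged F1 and much lower macro-averaged F1 is a direct consequence of class imbalance. The following sketch, using entirely synthetic numbers (not data from this study), shows how micro-averaged F1 can look excellent even when a rare class such as ADEs is detected poorly:

```python
# Illustrative sketch (synthetic data, not from the paper): why micro-averaged
# F1 can mask poor performance on a rare class such as ADEs.

def f1_per_class(y_true, y_pred, label):
    """Compute the per-class F1 score: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# 100 synthetic documents: 2 contain an ADE, 98 do not.
y_true = ["ade"] * 2 + ["none"] * 98
# Hypothetical model: finds 1 of the 2 ADEs and raises 1 false alarm.
y_pred = ["ade", "none"] + ["ade"] + ["none"] * 97

labels = ["ade", "none"]
macro_f1 = sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)
# For single-label classification over all classes, micro-F1 equals accuracy.
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"micro-F1: {micro_f1:.2f}")  # 0.98 -- dominated by the majority class
print(f"macro-F1: {macro_f1:.2f}")  # 0.74 -- exposes the weak ADE class
```

Because micro-averaging weights every document equally, the abundant "no ADE" class dominates the score, whereas macro-averaging gives the rare ADE class equal weight per class.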

The Dutch domain-specific MedRoBERTa.nl showed the best performance in detecting ADEs in Dutch clinical texts and, in line with previous studies in English-language settings, outperformed the Bi-LSTM. The inclusion of external validation highlights its generalization potential. Our findings also underline the need for further model improvement and for performance measures suited to rare outcomes such as ADEs, as micro-averaged F1 scores inflate performance compared with macro-averaged F1 scores. We provide a robust and clinically meaningful benchmark approach for NLP-based ADE detection in clinical free-text documents. Our approach can serve as guidance for future NLP benchmarks in the ADE domain.

Authors

Murphy, Mishra, de Keizer, Dongelmans, Jager, Abu-Hanna, Klopotowska, Calixto