Accelerating real-world data collection using large language models in rare neoplasms: a bone sarcoma example.
Real-world data collection in oncology remains a challenge due to the complex and unstructured format of medical notes. Recently, large language models (LLMs) have demonstrated success in extracting information from free-text data across various domains. This study evaluates the performance of multiple small LLMs as information extractors on Polish medical notes.
Electronic health records (EHRs) of 302 bone sarcoma patients treated in a reference center between 2016 and 2022 were selected. Five variables (pathology type, tumor size, localization, grade, and primary resection) were annotated by an experienced oncologist. Four LLMs, combined with multiple prompting techniques, were queried with the task of returning the value of each variable inside an XML tag. Additionally, among non-concordant values we distinguished valid results, i.e., results of the expected format that contained a keyword/phrase from a per-variable, expert-devised list. An ensemble voting approach was then applied, selecting the value appearing in the majority of valid outputs.
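The pipeline described above (extract the XML-tagged value, filter for validity, then take a majority vote) can be sketched as follows. This is a minimal illustration, not the authors' code: the tag names, the example keyword list, and the strict-majority rule are assumptions, since the abstract does not publish the expert-devised keyword lists or the exact voting threshold.

```python
import re
from collections import Counter

# Hypothetical per-variable keyword lists; the paper's expert-devised
# lists are not given in the abstract.
VALID_KEYWORDS = {
    "localization": ["femur", "tibia", "humerus", "pelvis"],
}

def extract_tagged_value(model_output, variable):
    """Pull the value the model was asked to wrap in an XML tag,
    e.g. <localization>femur</localization>."""
    match = re.search(rf"<{variable}>(.*?)</{variable}>", model_output, re.DOTALL)
    return match.group(1).strip().lower() if match else None

def is_valid(value, variable):
    """A result counts as 'valid' if it has the expected format and
    contains a keyword from the per-variable list."""
    if not value:
        return False
    return any(kw in value for kw in VALID_KEYWORDS.get(variable, []))

def ensemble_vote(outputs, variable):
    """Majority vote over the valid values from all model/prompt runs."""
    values = [extract_tagged_value(o, variable) for o in outputs]
    valid = [v for v in values if is_valid(v, variable)]
    if not valid:
        return None
    winner, count = Counter(valid).most_common(1)[0]
    # Require a strict majority of the valid outputs (an assumption).
    return winner if count > len(valid) / 2 else None
```

With four models and several prompts per variable, each note yields a pool of candidate values; discarding non-valid ones before voting is what lets weak individual extractors combine into a much more accurate ensemble.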
Single-model accuracy was modest (17.5%-30.3%) and highly prompt-dependent. Tumor localization proved the easiest variable to assess, with an accuracy of up to 36.2%. The majority of non-concordant values were non-valid. The voting strategy improved performance substantially, reaching 83.6% overall accuracy and peaking at 90.0% for the resection-type variable.
Our study highlights the potential of lightweight LLMs for automating data extraction from medical notes, which could significantly accelerate clinical research. A single small LLM is not yet sufficient for real use cases in non-English settings; however, prompt engineering and ensemble methods can greatly improve performance.
Authors
Teterycz, Rynkun, Szostakowski, Wągrodzki, Rutkowski, Rosińska