Expert evaluation of GPT-4o and Gemini responses to patient questions on carotid endarterectomy.
The aim of this study was to compare the accuracy, scientific quality, and clarity of responses generated by GPT-4o and Gemini to frequently asked patient questions related to carotid artery disease and carotid endarterectomy.
In total, 40 unique carotid endarterectomy-related questions were compiled from online sources and clinical experience. Each was entered into a separate new session with GPT-4o and Gemini 2.5 Flash in Turkish, and responses were collected without modification. Four blinded cardiovascular surgeons independently rated each answer on a 1-5 Likert scale in three domains: Accuracy, Scientific Quality, and Clarity. Mean response lengths and domain scores were compared using appropriate paired tests.
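The abstract does not name the specific paired tests used. As a minimal illustrative sketch only (not the authors' analysis), the snippet below assumes per-question mean scores and uses a normality check on the paired differences to choose between a paired t-test and a Wilcoxon signed-rank test; the data shown are simulated.

```python
# Illustrative sketch: test choice and data are assumptions, not the study's code.
import numpy as np
from scipy import stats

# Hypothetical per-question mean Accuracy scores (40 questions, averaged over 4 raters)
gpt4o_scores = np.random.default_rng(0).normal(4.33, 0.39, 40).clip(1, 5)
gemini_scores = np.random.default_rng(1).normal(4.16, 0.33, 40).clip(1, 5)

# Check whether the paired differences look approximately normal
_, p_norm = stats.shapiro(gpt4o_scores - gemini_scores)

if p_norm > 0.05:
    # Differences approximately normal: paired t-test
    stat, p_value = stats.ttest_rel(gpt4o_scores, gemini_scores)
else:
    # Otherwise fall back to the non-parametric Wilcoxon signed-rank test
    stat, p_value = stats.wilcoxon(gpt4o_scores, gemini_scores)

print(f"paired comparison: statistic={stat:.3f}, p={p_value:.3f}")
```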
GPT-4o produced longer responses than Gemini (258.1±101.6 vs. 193.2±43.7 words; p<0.001). Overall, GPT-4o had higher Accuracy scores (4.33±0.39 vs. 4.16±0.33; p=0.04), with no significant differences in Scientific Quality or Clarity (p=0.377 and p=0.154, respectively). In rater-level analyses, Gemini scored higher in Clarity for one rater, whereas GPT-4o was superior in Accuracy and Scientific Quality for another. Overall mean scores were comparable (4.17±0.36 vs. 4.13±0.31; p=0.636). Physician referral was recommended in 62.5% of GPT-4o responses and 52.5% of Gemini responses (p=0.366).
Both GPT-4o and Gemini provided "good"-quality responses to carotid endarterectomy patient questions, with GPT-4o showing a modest accuracy advantage and no significant difference in the other domains. Explicit disclaimers on both platforms underscore their supportive, not definitive, role in patient education. Physicians should remain the primary source for individualized decisions, and AI-generated information should always be verified.
Authors
Rahman, Özbakkaloğlu, Arslangilay, Daylan, Keleş, Bozkurt, Bozok