Evaluating GPT-4 Performance on Plastic Surgery Oral Examination Vignettes: A Quantitative and Qualitative Analysis.
Abstract
[BACKGROUND] The use of artificial intelligence (AI) in medicine is rapidly evolving. However, its role in plastic and reconstructive surgery remains underexplored. Plastic surgery requires nuanced, dynamic decision-making and individualized care, making AI integration challenging. This study evaluates GPT-4's ability to respond to American Board of Plastic Surgery (ABPS) Oral Board-style clinical vignettes and assesses its potential as a decision-making and educational adjunct.
[METHODS] Twelve clinical vignettes spanning aesthetic, reconstructive, hand, craniofacial, pediatric, and trauma surgery were input into GPT-4. Each response was scored by 2 board-certified plastic surgeons across 6 clinical domains: Case Introduction, Diagnosis, Treatment Planning, Patient Counseling, Operative Steps, and Complication Management. Domains were scored on a 0-4 scale; scores <2 were considered failing. Interrater reliability was assessed via weighted κ. A structured qualitative interview was conducted.
[RESULTS] GPT-4 achieved mean domain scores ranging from 2.3 to 3.7 across vignettes, with passing rates of 91.7% (rater 1) and 83.3% (rater 2). The highest-performing domains were Diagnosis (84.4%) and Complication Management (82.3%), and the lowest-performing domain was Case Introduction (62.5%). Interrater agreement was strong across domains (κ = 0.70-0.89). Qualitative findings emphasized GPT-4's accuracy in guideline-based cases, limitations in trauma triage, and potential as an educational resource.
[CONCLUSION] GPT-4 demonstrates concordance with board-level clinical reasoning in structured plastic surgery scenarios. However, limitations in heuristic judgment and real-time adaptability underscore the need for optimization with specialty-specific training datasets and performance rubrics. With appropriate curation, GPT-4 could serve as a valuable supplement in surgical education and oral board preparation.
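The interrater reliability analysis above uses a weighted κ on the 0-4 domain rubric. As a minimal sketch of how such a statistic can be computed, the function below implements a linearly weighted Cohen's κ; note that the abstract does not specify the weighting scheme (linear vs. quadratic), so linear weights are an assumption here, and the example ratings are hypothetical, not the study's data.

```python
def weighted_kappa(r1, r2, k=5):
    """Linearly weighted Cohen's kappa for two raters on an ordinal 0..k-1 scale.

    r1, r2: lists of integer ratings (e.g., the 0-4 domain rubric -> k=5).
    Weights are assumed linear: disagreement cost |i - j| / (k - 1).
    """
    n = len(r1)
    # Observed weighted disagreement, normalized to [0, 1]
    obs = sum(abs(a - b) for a, b in zip(r1, r2)) / (n * (k - 1))
    # Marginal rating distributions for each rater
    p1 = [r1.count(i) / n for i in range(k)]
    p2 = [r2.count(i) / n for i in range(k)]
    # Expected weighted disagreement under chance (independent marginals)
    exp = sum(p1[i] * p2[j] * abs(i - j)
              for i in range(k) for j in range(k)) / (k - 1)
    return 1 - obs / exp


# Hypothetical domain scores from two raters (0-4 rubric)
rater1 = [3, 3, 2, 4, 3, 2]
rater2 = [3, 2, 2, 4, 3, 3]
print(round(weighted_kappa(rater1, rater2), 2))
```

Perfect agreement yields κ = 1, chance-level agreement yields κ ≈ 0; values in the 0.70-0.89 range reported here are conventionally read as substantial-to-strong agreement.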
MeSH Terms
Humans; Surgery, Plastic; Clinical Competence; Plastic Surgery Procedures; Artificial Intelligence; Female; United States; Qualitative Research; Male; Educational Measurement