Evaluating the Detection Accuracy of AI-Generated Content in Plastic Surgery: A Comparative Study of Medical Professionals and AI Tools.

Plastic and Reconstructive Surgery 2026 Vol. 157(1) p. 129e-136e

Fine KS, Zona EE, O'Shea AW, Via EC, Attaluri PK, Wirth PJ, Dingle AM, Poore SO

Abstract

[BACKGROUND] The growing use of artificial intelligence (AI) in academic writing raises concerns about the integrity of scientific articles and the ability to accurately distinguish human-written from AI-generated content. This study evaluated the ability of medical professionals and AI detection tools to identify AI involvement in plastic surgery written content.

[METHODS] Eight manuscript passages across 4 topics were assessed, including 4 on plastic surgery. Passages were human-written, human-written with AI edits, or fully AI-generated. Twenty-four raters, including medical students, residents, and attendings, classified the passages by origin, and interrater reliability was measured using the Fleiss kappa. Human-written and AI-generated passages were also analyzed with 3 online AI detection tools, each reporting the percentage of a passage it judged to be AI-generated. A receiver operating characteristic curve analysis was conducted to assess the tools' accuracy in detecting AI-generated content, and intraclass correlation coefficients were calculated to assess agreement among the detection tools.
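As a sketch of the agreement statistic named above, Fleiss' kappa can be computed directly from a subjects-by-categories matrix of rating counts. The rating matrix below is hypothetical (illustrative numbers only, not the study's data):

```python
from typing import List

def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = number of raters assigning
    subject i to category j; every row must sum to the same rater count."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    # Per-subject observed agreement P_i
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects
    # Marginal category proportions p_j and chance agreement P_e
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 passages, 6 raters, 3 origin categories
# (human-written, human-written with AI edits, fully AI-generated)
ratings = [
    [2, 2, 2],
    [3, 2, 1],
    [1, 4, 1],
    [2, 1, 3],
]
print(round(fleiss_kappa(ratings), 3))  # → -0.081 (near-chance agreement)
```

Values near zero, like the study's κ = 0.078, indicate agreement barely better than chance.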

[RESULTS] Raters correctly identified the origin of passages 26.5% of the time. Accuracy was 34.4% for AI-generated passages and 14.5% for human-written passages (P = 0.012). Interrater reliability was poor (κ = 0.078). AI detection tools showed strong discriminatory power (area under the receiver operating characteristic curve = 0.962), but false positives were frequent at optimal cutoffs (25% to 50%). The intraclass correlation coefficient between tools was low (-0.118).
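The reported area under the curve has a direct probabilistic reading: it is the chance that a detector scores a randomly chosen AI-generated passage above a randomly chosen human-written one (the Mann-Whitney interpretation of ROC area). A minimal sketch, with illustrative scores rather than the study's data:

```python
from typing import Sequence

def roc_auc(ai_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """AUC as the probability that an AI-generated passage outscores a
    human-written one; ties count as half a win."""
    wins = sum(
        1.0 if a > h else 0.5 if a == h else 0.0
        for a in ai_scores
        for h in human_scores
    )
    return wins / (len(ai_scores) * len(human_scores))

# Hypothetical detector outputs (fraction of passage flagged as AI)
ai = [0.92, 0.88, 0.75, 0.40]      # scores on AI-generated passages
human = [0.30, 0.55, 0.10, 0.05]   # scores on human-written passages
print(round(roc_auc(ai, human), 3))  # → 0.938
```

Note that a high AUC is compatible with frequent false positives at any single cutoff, which is exactly the pattern the study reports at the 25% to 50% thresholds.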

[CONCLUSIONS] Medical professionals and AI detection tools struggle to reliably identify AI-generated content. AI tools demonstrated high discriminatory power, but often misclassified human-written passages. These findings highlight the need for improved methods to protect the integrity of scientific writing and prevent false plagiarism claims.


MeSH Terms

Humans; Surgery, Plastic; Artificial Intelligence; Reproducibility of Results; Writing; Students, Medical