Evaluating the Detection Accuracy of AI-Generated Content in Plastic Surgery: A Comparative Study of Medical Professionals and AI Tools.
Abstract
[BACKGROUND] The growing use of artificial intelligence (AI) in academic writing raises concerns about the integrity of scientific articles and the ability to accurately distinguish human-written from AI-generated content. This study evaluated the ability of medical professionals and AI detection tools to identify AI involvement in written plastic surgery content.
[METHODS] Eight manuscript passages across 4 topics were assessed, 4 of them on plastic surgery. Passages were human-written, human-written with AI edits, or fully AI-generated. Twenty-four raters, including medical students, residents, and attendings, classified the passages by origin. Interrater reliability was measured using the Fleiss kappa. Human-written and AI-generated passages were analyzed using 3 different online AI detection tools, each of which reported the percentage of a passage it judged to be AI-generated. A receiver operating characteristic (ROC) curve analysis was conducted to assess the tools' accuracy in detecting AI-generated content, and intraclass correlation coefficients were calculated to assess agreement among the detection tools (an illustrative code sketch of these analyses appears after the abstract).
[RESULTS] Raters correctly identified the origin of passages 26.5% of the time. Accuracy was 34.4% for AI-generated passages and 14.5% for human-written passages (P = 0.012). Interrater reliability was poor (κ = 0.078). AI detection tools showed strong discriminatory power (area under the ROC curve = 0.962), but false positives were frequent at the optimal cutoffs (25% to 50%). The intraclass correlation coefficient between tools was low (-0.118).
[CONCLUSIONS] Medical professionals and AI detection tools struggle to reliably identify AI-generated content. AI tools demonstrated high discriminatory power but often misclassified human-written passages. These findings highlight the need for improved methods to protect the integrity of scientific writing and to prevent false claims of plagiarism.
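The abstract reports three statistics: a Fleiss kappa for the human raters, the area under the ROC curve for the detection tools, and an intraclass correlation coefficient for between-tool agreement. The sketch below is a minimal illustration of how the first two could be computed in Python with statsmodels and scikit-learn. It is not the authors' analysis code; the rating matrix and detector scores are invented placeholders, not study data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from sklearn.metrics import roc_auc_score, roc_curve

# --- Interrater reliability (Fleiss kappa) ---
# Hypothetical rating matrix: rows = 8 passages, columns = 24 raters.
# Category codes: 0 = human-written, 1 = human-written with AI edits,
# 2 = fully AI-generated.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 3, size=(8, 24))      # placeholder data
counts, _ = aggregate_raters(ratings)           # passages x categories table
print(f"Fleiss kappa: {fleiss_kappa(counts):.3f}")

# --- Detection-tool accuracy (ROC analysis) ---
# y_true: 1 = AI-generated passage, 0 = human-written passage.
# scores: a tool's output, i.e., the percentage of the passage it flagged
# as AI-generated (values below are invented for illustration).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([92.0, 71.0, 55.0, 30.0, 48.0, 12.0, 5.0, 2.0])
print(f"AUC: {roc_auc_score(y_true, scores):.3f}")

# Youden's J statistic (sensitivity + specificity - 1) selects the cutoff
# percentage that best separates the two classes.
fpr, tpr, thresholds = roc_curve(y_true, scores)
best = np.argmax(tpr - fpr)
print(f"Optimal cutoff: {thresholds[best]:.0f}% flagged as AI-generated")
```

Agreement among the three tools' percentage scores could similarly be quantified with an intraclass correlation coefficient, for example via pingouin's intraclass_corr on a long-format table of passage-by-tool scores; the study reports that value as -0.118.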
MeSH Terms
Humans; Surgery, Plastic; Artificial Intelligence; Reproducibility of Results; Writing; Students, Medical