Using Large Language Models to Summarize Evidence in Biomedical Articles: Exploratory Comparison Between AI- and Human-Annotated Bibliographies
Background: Annotated bibliographies summarize literature, but training, experience, and time are needed to create concise yet accurate annotations. Summaries generated by artificial intelligence (AI) can save human resources, but AI-generated content can also contain serious errors. Objective: To determine the feasibility of using AI as an alternative to human annotators, we explored whether ChatGPT can generate annotations with characteristics comparable to those written by humans. Methods: We had 2 humans and 3 versions of ChatGPT (3.5, 4, and 5) independently write annotations on the same set of 15 publications. We collected data on word count and Flesch Reading Ease (FRE). Two assessors who were masked to the source of the annotations independently evaluated (1) capture of main points, (2) presence of errors, and (3) whether the annotation included a discussion of both the quality and context of the article within the broader literature. We evaluated agreement and disagreement between the assessors and used descriptive statistics and assessor-stratified binary and cumulative mixed-effects logit models to compare annotations written by ChatGPT and humans. Results: On average, humans wrote shorter annotations (mean 90.2 words) than ChatGPT (mean 113 words), and human annotations were easier to read (human FRE mean 15.3; ChatGPT FRE mean 5.76). Our assessments of agreement and disagreement revealed that one assessor was consistently stricter than the other; however, assessor-stratified models of main points, errors, and quality/context yielded similar qualitative conclusions. There was no statistically significant difference in the odds of presenting a better summary of main points between ChatGPT- and human-generated annotations for either assessor (Assessor 1: OR=0.96, 95% CI 0.12-7.71; Assessor 2: OR=1.64, 95% CI 0.67-4.06). 
However, both assessors found that human annotations had lower odds of containing one or more types of errors than ChatGPT annotations (Assessor 1: OR=0.31, 95% CI 0.09-1.02; Assessor 2: OR=0.10, 95% CI 0.03-0.33). On the other hand, human annotations also had lower odds of summarizing the paper’s quality and context (Assessor 1: OR=0.11, 95% CI 0.03-0.33; Assessor 2: OR=0.03, 95% CI 0.01-0.10). That said, ChatGPT’s summaries of quality and context were sometimes inaccurate. Conclusions: Rapidly learning a body of scientific literature is a vital yet daunting task that AI tools may make more efficient. In our study, ChatGPT quickly generated concise summaries of academic literature and discussed quality and context more consistently than humans did. However, ChatGPT’s discussions of quality and context were not always accurate, and ChatGPT annotations included more errors. AI-generated annotated bibliographies that are carefully verified by humans may thus be an efficient way to provide a rapid overview of the literature. More research is needed to determine the extent to which prompt engineering can reduce errors and improve chatbot performance.
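For reference, the Flesch Reading Ease (FRE) scores reported above come from a standard published formula, in which higher scores indicate easier-to-read text. The following is a minimal Python sketch of that formula, not the study's actual tooling; the vowel-group syllable counter is a rough heuristic assumed here for illustration (published FRE implementations count syllables more carefully).

```python
# Sketch of the Flesch Reading Ease (FRE) score:
#   FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
# Higher FRE = easier to read. The syllable counter below is a crude
# vowel-group heuristic (an assumption for this sketch only).
import re


def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels; every word gets at least one.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Under this formula, short sentences built from short words score high, while long words drive the score down, which is consistent with the human annotations (FRE 15.3) reading more easily than ChatGPT's (FRE 5.76).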