Clinicians and patients are already using artificial intelligence (AI) tools online for advice and guidance on a wide range of issues, including healthcare. Understanding how AI is used, identifying the strengths and weaknesses of AI in these roles, and developing applications that provide optimal information are all important priorities. An overview of these issues in the context of ophthalmology can be found in Ophthalmology Science. More general summaries on large language models (LLMs) and generative AI are available in Nature Medicine. For reassurance that doctors are not about to be replaced by computers, see this article in Journal of the Royal Society of Medicine.
ChatGPT initially made headlines in the medical world for passing medical school examinations, and we trialled the same model in a test aimed at fully qualified doctors, which indicated the early strengths and weaknesses of this technology: JMIR Medical Education.
However, exam results are poor indicators of clinical performance without context. To establish the clinical reasoning and recall ability of flagship LLMs, we recruited ophthalmologists at every stage of training to provide benchmark comparators. We found that the strongest LLM (GPT-4) matched consultant and attending ophthalmologists (E1-E5), indicating expert-level performance. The full study was published in PLOS Digital Health.
In a follow-up study, we tested more emerging models on both text-based and multimodal questions, finding that generative AI exhibited much weaker performance when tasked with interpreting images. These results were published in JAMA Ophthalmology.
We have also published a comprehensive review on evaluation of LLMs with a focus on ophthalmological applications, available in Current Opinion in Ophthalmology.
Thirunavukarasu, A. J. et al. Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study. PLOS Digital Health 3, e0000341 (2024).
Performance in examination questions demonstrates potential, but understanding how generative AI (genAI) performs in the contexts in which it is actually used is crucial to identify potential harms and to help develop better tools for patients and practitioners. Through work replicating patient and practitioner behaviour, we are providing specific insight into these issues.
Comparing genAI advice with patient information leaflets in response to common queries reveals that commercial platforms match gold-standard accuracy and comprehensiveness, albeit with occasional subtle errors that limit trust. Preprint.
An enormous number of published studies try to establish the ability of generative AI to provide useful advice to clinicians and patients. This is a new genre of research, but a lack of reporting guidelines in this growing space means that many studies do not provide actionable or interpretable information for clinicians and researchers to build upon. In a systematic review of the early literature base (JAMA Network Open), we found that poor reporting is common: for example, 99.3% of studies failed to provide sufficient information to identify the AI model tested.
CHART is an EQUATOR Network-endorsed reporting guideline for chatbot health advice (CHA) studies developed through an ambitious multinational and multidisciplinary Delphi method consensus process. By involving clinicians, researchers, methodologists, and patients, CHART was designed to provide an accessible and comprehensive tool to empower researchers to design and report studies that provide useful information to inform clinical practice and development work. We explore how guidelines like CHART fit into efforts to translate generative AI into clinical applications in The Lancet Digital Health.
Our explanation and elaboration report for CHART was published by The BMJ. The ready-to-use checklist was co-published by BJS, BMJ Medicine, BMC Medicine, JAMA Network Open, Annals of Family Medicine, and Artificial Intelligence in Medicine. We have also developed a website to maximise ease of use of the checklist with automatic printouts of compliant checklists and flow charts: https://chartguideline.org/
As AI applications proliferate in healthcare, it is imperative that patients are safeguarded from new risks. However, regulation should not act to stymie innovation and progress, necessitating careful balancing of stakeholders' priorities. A Partnership for Oversight, Leadership, and Accountability in Regulating Intelligent Systems–Generative Models in Medicine (POLARIS-GM) has been set up to bring technical and ethical expertise together to help regulators navigate the rapidly evolving landscape of generative AI. Our initial statement was published in Nature Medicine.