Stanford University

Introduction

The integration of large language models (LLMs) into healthcare offers immense opportunity to streamline clinical tasks, but it also carries risks such as inaccurate responses and the perpetuation of biases. To address this, we conducted a red-teaming exercise to assess LLMs in healthcare and developed a dataset of clinically relevant scenarios for future teams to use. Results demonstrate a significant proportion of inappropriate responses across GPT-3.5, GPT-4.0, and GPT-4.0 with Internet (25.7%, 16.2%, and 17.5%, respectively) and illustrate the valuable role that non-technical clinicians can play in evaluating models.


Dataset

Labeling: We convened 80 multi-disciplinary experts to evaluate the performance of popular LLMs across multiple medical scenarios. Teams composed of clinicians, medical and engineering students, and technical professionals stress-tested LLMs with real-world clinical use cases. Teams were given a framework comprising four categories of inappropriate responses to analyze: Safety, Privacy, Hallucinations, and Bias. Six medically trained reviewers subsequently reanalyzed the prompt-response pairs, with two reviewers assigned to each prompt and a third resolving discrepancies, as sketched below.
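
To make the adjudication step concrete, here is a minimal sketch of the dual-review process described above. The field names and label values ("appropriate" / "inappropriate") are illustrative assumptions, not the released dataset's actual schema.

    # Minimal sketch of the dual-review adjudication described above.
    # Field names and label values are illustrative, not the dataset's schema.
    from dataclasses import dataclass

    @dataclass
    class Review:
        prompt_id: str
        label: str          # e.g. "appropriate" or "inappropriate"
        category: str = ""  # e.g. "Safety", "Privacy", "Hallucinations", "Bias"

    def adjudicate(first: Review, second: Review, tiebreaker: Review) -> str:
        """Keep the label when the two primary reviewers agree;
        otherwise defer to the third reviewer."""
        if first.label == second.label:
            return first.label
        return tiebreaker.label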


Dataset: There were a total of 382 unique prompts, with 1,146 total responses across three iterations of ChatGPT (GPT-3.5, GPT-4.0, GPT-4.0 with Internet). 19.8% of the responses were labeled as inappropriate, with GPT-3.5 accounting for the highest percentage at 25.7%, while GPT-4.0 and GPT-4.0 with Internet performed comparably at 16.2% and 17.5%, respectively. Interestingly, 11.8% of responses were deemed appropriate with GPT-3.5 but inappropriate in updated models, highlighting the ongoing need to evaluate evolving LLMs.
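
As a rough illustration of the per-model rates reported above, the following sketch shows how they could be recomputed from the released dataset. The file name ("red_team_dataset.csv") and column names ("model", "label") are assumptions for illustration only; consult the datasheet linked below for the actual schema.

    # Sketch for recomputing per-model inappropriate-response rates.
    # File and column names are assumed; see the datasheet for the real schema.
    import pandas as pd

    df = pd.read_csv("red_team_dataset.csv")

    # Fraction of responses labeled inappropriate, overall and per model.
    is_inappropriate = df["label"].str.lower().eq("inappropriate")
    print(f"Overall: {is_inappropriate.mean():.1%}")
    print(is_inappropriate.groupby(df["model"]).mean().apply("{:.1%}".format))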


Download the datasheet for the dataset by clicking here
Download the dataset by clicking here
Download the synthetic notes provided to participants by clicking here

Paper

***preprint available here***
Crystal T. Chang*, Hodan Farah*, Haiwen Gui*, Shawheen Justin Rezaei, Charbel Bou-Khalil, Ye-Jean Park, Akshay Swaminathan, Jesutofunmi A. Omiye, Akaash Kolluri, Akash Chaurasia, Alejandro Lozano, Alice Heiman, Allison Sihan Jia, Amit Kaushal, Angela Jia, Angelica Iacovelli, Archer Yang, Arghavan Salles, Arpita Singhal, Balasubramanian Narasimhan, Benjamin Belai, Benjamin H. Jacobson, Binglan Li, Celeste H. Poe, Chandan Sanghera, Chenming Zheng, Conor Messer, Damien Varid Kettud, Deven Pandya, Dhamanpreet Kaur, Diana Hla, Diba Dindoust, Dominik Moehrle, Duncan Ross, Ellaine Chou, Eric Lin, Fateme Nateghi Haredasht, Ge Cheng, Irena Gao, Jacob Chang, Jake Silberg, Jason A. Fries, Jenelle Jindal, Jiapeng Xu, Joe Jamison, John S. Tamaresis, Jonathan H Chen, Joshua Lazaro, Juan M. Banda, Julie J. Lee, Karen Ebert Matthy, Kirsten R. Steffner, Lu Tian, Luca Pegolotti, Malathi Srinivasan, Maniragav Manimaran, Matthew Schwede, Minghe Zhang, Minh Nguyen, Mohsen Fathzadeh, Qian Zhao, Rika Bajra, Rohit Khurana, Ruhana Azam, Rush Bartlett, Sang T. Truong, Scott L Fleming, Shriti Raj, Solveig Behr, Sonia Onyeka, Sri Muppidi, Tarek Bandali, Tiffany Y. Eulalio, Wenyuan Chen, Xuanyu Zhou, Yanan Ding, Ying Cui, Yuqi Tan, Yutong Liu, Nigam Shah, Roxana Daneshjou. arXiv(2024)

*These authors contributed equally as co-first authors to this manuscript and are presented in alphabetical order.

For inquiries, contact us at roxanad@stanford.edu.