From drafting responses to patient messages in electronic health record systems to clinical decision support, Large Language Models (LLMs) present many opportunities for use in medicine. Patient-facing use cases are also relevant, such as a patient using an LLM to obtain information on potential treatments for a medical issue. In these applications, it is important to consider potential harms to minority groups through the propagation of medical misinformation or misconceptions. Leading LLMs have been shown to propagate harmful, debunked notions of race-based medicine and to exhibit binary gender bias. This has been explored both by prompting LLMs directly with questions about race-based medical misconceptions and by incorporating race-identifying information into clinical notes and investigating how its presence can lead to bias and inaccuracy.
Though the presence of anti-LGBTQIA+ bias and inaccuracy has long been suspected in LLMs tasked with medical use cases, our study is the first to investigate this across multiple real-world clinical scenarios in cooperation with clinical experts. We include both explicit questions, which mimic the use of LLMs as a search tool, and extended clinical scenarios, which simulate medical encounters through realistic patient notes. We probe both for incidental bias associated only with the mention of LGBTQIA+ identity and for expected historical bias surrounding stereotyped medical conditions, and we thoroughly classify and qualitatively annotate inaccuracies at a level of detail not captured by previous numerical-only evaluations of bias. We test both publicly accessible LLMs, which have previously been shown to be used by community clinicians, and a secure model intended for clinical use.
We prompted 4 LLMs (Gemini 1.5 Flash, Claude 3 Haiku, GPT-4o, and Stanford Medicine Secure GPT (GPT-4.0)) with a set of 38 prompts designed to explore anti-LGBTQIA+ bias. The prompts consisted of explicit questions and synthetic clinical notes with follow-up questions. They covered clinical situations along two axes: (i) situations where historical bias has been observed vs. not observed, and (ii) situations where LGBTQIA+ identity is relevant to clinical care vs. not relevant (Figure 1). We focused on LLMs with commercial API access because they are increasingly being considered for use in real-world clinical settings. Secure GPT is Stanford Medicine’s private and secure instance for healthcare professionals to use LLMs in clinical care and is built on OpenAI’s GPT-4.0 infrastructure. We chose to evaluate Secure GPT because it has been deployed in the clinical care setting.
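For illustration only, the sketch below shows how a prompt set could be sent to one of the commercial APIs using the OpenAI Python SDK; it is not the querying harness used in the study, and the file name, prompt JSON layout, and default model name are assumptions made for the example.

```python
# Illustrative sketch only: not the study's actual querying harness.
# Assumes the OpenAI Python SDK (>= 1.0) and an API key in the OPENAI_API_KEY environment variable.
import json
from openai import OpenAI

client = OpenAI()

def query_model(prompt_text: str, model: str = "gpt-4o") -> str:
    """Send a single evaluation prompt and return the model's text response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt_text}],
    )
    return response.choices[0].message.content

# Example: iterate over a hypothetical JSON file containing the 38 prompts,
# e.g., [{"prompt_number": 1, "prompt_text": "..."}, ...].
with open("prompts.json") as f:  # file name is an assumption
    prompts = json.load(f)

responses = {p["prompt_number"]: query_model(p["prompt_text"]) for p in prompts}
```

Equivalent calls to the Anthropic and Google SDKs would follow the same pattern of one request per prompt, with responses stored for later review.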
Each response was graded by a primary reviewer against criteria developed in conjunction with LGBTQIA+ health experts, followed by a secondary reviewer (with tiebreakers made by a third reviewer). Inappropriate responses were subcategorized as inappropriate due to concerns about safety, privacy, hallucination/accuracy, and/or bias, following criteria used in previous work to evaluate LLM responses; more than one category was allowed. Each response was also given a clinical utility score (five-point Likert scale, with 5 being optimal) based on a holistic evaluation of its acceptability for inclusion in a patient message or its helpfulness for medical diagnosis and treatment. For more details on the classification schema and prompt development process, please see our manuscript and supplementary materials.
Figure 1: Prompt construction framework. The four subgroups in this diagram represent the four categories of prompts that were generated along these two axes: Subgroup 1 (historical bias observed and LGBTQIA+ identity should not affect optimal clinical care), Subgroup 2 (historical bias observed and LGBTQIA+ identity could be important for optimal clinical care but not necessarily), Subgroup 3 (historical bias observed and LGBTQIA+ identity should affect optimal clinical care), and Subgroup 4 (no historical bias noted beyond what would be expected by mentioning the LGBTQIA+ identity and LGBTQIA+ identity should not affect optimal clinical care).
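The subgroup definitions in the Figure 1 caption can be restated as a simple lookup along the two axes. The sketch below is only a restatement of that caption; the axis value labels are chosen here for illustration and do not appear in the dataset files.

```python
# Restatement of the Figure 1 subgroups as a lookup table.
# Keys: (historical bias observed?, relevance of LGBTQIA+ identity to optimal clinical care).
# The string labels are illustrative, not field values from the dataset.
SUBGROUPS = {
    (True,  "not relevant"):      1,  # historical bias observed; identity should not affect care
    (True,  "possibly relevant"): 2,  # historical bias observed; identity could matter, but not necessarily
    (True,  "relevant"):          3,  # historical bias observed; identity should affect care
    (False, "not relevant"):      4,  # no historical bias beyond mention of identity; identity should not affect care
}
```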
Quantitative Results:
Most model responses were of low to intermediate clinical utility (the mean clinical utility score across all appropriate and inappropriate responses for all four models was 3.08). Major factors compromising utility included the inclusion of extraneous information, the inclusion of inaccurate and/or biased information, vagueness of response, and refusal to answer certain questions (“I’m sorry, but I cannot…”).
Figure 2: Quantitative Results. Panel A: the percentage of responses per model that were categorized as appropriate, inappropriate, or refused to answer. Panel B: the percentage of inappropriate responses that were sub-categorized as inappropriate due to concerns of safety, privacy, hallucination/accuracy, and/or bias. Multiple concerns could apply to a single response, so the summed percentages across the four sub-categories may exceed 100% for each model. Panel C: the average clinical utility score for appropriate and inappropriate responses per model, as well as across all models.
Qualitative Results:
The majority of model responses displayed concerning levels of bias and inaccuracy (see Table 2 in our manuscript). Answers were also verbose and lacked specific, up-to-date, guideline-directed recommendations. Models often over-anchored on conditions in the prompt, inappropriately creating and justifying differential diagnoses based on conditions from the patient note (e.g., past medical history or family history) and/or patient sexual orientation and gender identity (SOGI), while excluding conditions that would have been higher on the differential but were not mentioned in the past medical history. This effect was strongest for information mentioned earliest in the prompt. Some models displayed significant sycophantic behavior, such as including misleading statements that overemphasized risk when prompts focused on risks, compared to prompts that focused on safety or were neutral. For more details, see our manuscript.
Models were most adept at handling simple questions or vignettes where the correct assessment depended heavily on conditions mentioned in the prompt. Responses varied in format and style according to the user request, although there were inconsistencies (e.g., a model drafting a patient message as if written by a physician reverted, halfway through the response, to recommending that the patient discuss their situation with a doctor). Responses captured the gist of various situations, including those based on cluttered real-world medical documentation. However, these strengths were undermined by the factors described above.
We present our prompt set and each LLM’s responses to it, along with the categories of inappropriateness, qualitative reviewer comments, and clinical utility scores.
Dataset Composition: There are a total of 152 instances (4 LLMs evaluated and 38 prompts provided to each LLM). Data instances represent combinations of prompts, LLM responses, and reviewer evaluations. More specifically, each instance consists of the prompt number and prompt text, the LLM name and LLM response, and reviewer evaluations. The reviewer evaluations consist of: the appropriateness of LLM response (appropriate, inappropriate, or ERROR), the sub-categorization into the four categories of inappropriate responses (safety, privacy, hallucination/accuracy, and bias), the clinical utility score, and additional comments by reviewers.
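As a convenience, the sketch below shows one way the fields described above could be represented and loaded in Python. It assumes the data are distributed as a CSV whose column names mirror the field descriptions and that multiple inappropriateness sub-categories are separated by semicolons; adjust the file name, column names, and separator to match the distributed files.

```python
# Minimal loading sketch; column names, file format, and the ";" separator are assumptions.
from dataclasses import dataclass
from typing import List, Optional
import csv

@dataclass
class Instance:
    prompt_number: int
    prompt_text: str
    llm_name: str
    llm_response: str
    appropriateness: str                    # "appropriate", "inappropriate", or "ERROR"
    inappropriate_subcategories: List[str]  # subset of {"safety", "privacy", "hallucination/accuracy", "bias"}
    clinical_utility: Optional[int]         # 1-5 Likert scale, 5 = optimal
    reviewer_comments: str

def load_instances(path: str) -> List[Instance]:
    """Parse the dataset CSV into typed records."""
    instances = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            instances.append(Instance(
                prompt_number=int(row["prompt_number"]),
                prompt_text=row["prompt_text"],
                llm_name=row["llm_name"],
                llm_response=row["llm_response"],
                appropriateness=row["appropriateness"],
                inappropriate_subcategories=[
                    s.strip() for s in row["inappropriate_subcategories"].split(";") if s.strip()
                ],
                clinical_utility=int(row["clinical_utility"]) if row["clinical_utility"] else None,
                reviewer_comments=row["reviewer_comments"],
            ))
    return instances  # expected length: 152 (4 LLMs x 38 prompts)
```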
Further description of the dataset is available in the datasheet and our paper.
By registering for downloads from the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset, you are agreeing to this Research Use Agreement, as well as to the Terms of Use of the Stanford University School of Medicine website as posted and updated periodically at http://www.stanford.edu/site/terms/.
1. Permission is granted to view and use the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset without charge for personal, non-commercial research purposes only. Any commercial use, sale, or other monetization is prohibited.
2. Other than the rights granted herein, the Stanford University School of Medicine (“School of Medicine”) retains all rights, title, and interest in the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset.
3. You may make a verbatim copy of the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset for personal, non-commercial research use as permitted in this Research Use Agreement. If another user within your organization wishes to use the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset, they must register as an individual user and comply with all the terms of this Research Use Agreement.
4. YOU MAY NOT DISTRIBUTE, PUBLISH, OR REPRODUCE A COPY of any portion or all of the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset to others without specific prior written permission from the School of Medicine.
5. YOU MAY NOT SHARE THE DOWNLOAD LINK to the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset to others. If another user within your organization wishes to use the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset, they must register as an individual user and comply with all the terms of this Research Use Agreement.
6. You must not modify, reverse engineer, decompile, or create derivative works from the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset. You must not remove or alter any copyright or other proprietary notices in the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset.
7. The Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset has not been reviewed or approved by the Food and Drug Administration, and is for non-clinical, Research Use Only. In no event shall data or images generated through the use of the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset be used or relied upon in the diagnosis or provision of patient care.
8. THE Evaluating Anti-LGBTQIA+ Medical Bias in LLMs DATASET IS PROVIDED "AS IS," AND STANFORD UNIVERSITY AND ITS COLLABORATORS DO NOT MAKE ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, NOR DO THEY ASSUME ANY LIABILITY OR RESPONSIBILITY FOR THE USE OF THIS Evaluating Anti-LGBTQIA+ Medical Bias in LLMs DATASET.
9. You will not make any attempt to re-identify any of the individual data subjects. Re-identification of individuals is strictly prohibited. Any re-identification of any individual data subject shall be immediately reported to the School of Medicine.
10. Any violation of this Research Use Agreement or other impermissible use shall be grounds for immediate termination of use of this Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset. In the event that the School of Medicine determines that the recipient has violated this Research Use Agreement or other impermissible use has been made, the School of Medicine may direct that the undersigned data recipient immediately return all copies of the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset and retain no copies thereof even if you did not cause the violation or impermissible use.
In consideration for your agreement to the terms and conditions contained here, Stanford grants you permission to view and use the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset for personal, non-commercial research. You may not otherwise copy, reproduce, retransmit, distribute, publish, commercially exploit or otherwise transfer any material.
You may use the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset for legal purposes only.
You agree to indemnify and hold Stanford harmless from any claims, losses or damages, including legal fees, arising out of or resulting from your use of the Evaluating Anti-LGBTQIA+ Medical Bias in LLMs Dataset or your violation or role in violation of these Terms. You agree to fully cooperate in Stanford’s defense against any such claims. These Terms shall be governed by and interpreted in accordance with the laws of California.
***preprint here***
Crystal T Chang*, Neha Srivathsa*, Charbel Bou-Khalil, Akshay Swaminathan, Mitchell R Lunn, Kavita Mishra, Roxana Daneshjou**, Sanmi Koyejo**. Evaluating Anti-LGBTQIA+ Medical Bias in Large Language Models. medRxiv (2024).
***download the datasheet describing the dataset here***
For inquiries, contact us at roxanad@stanford.edu.