“Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters

paper

Code

https://github.com/uclanlp/biases-llm-reference-letters

https://sourcegraph.com/github.com/uclanlp/biases-llm-reference-letters

prompt generation: https://sourcegraph.com/github.com/uclanlp/biases-llm-reference-letters/-/blob/generate_clg.py?L24=

generation: https://github.com/uclanlp/biases-llm-reference-letters/blob/main/generated_letters/chatgpt/clg/clg_letters.csv

longer version

Speaker Notes

Task, Input, Output, Significance

  • What: The research focuses on identifying gender biases in recommendation letters generated by Large Language Models (LLMs), such as ChatGPT and Alpaca.
  • Why: This is significant because recommendation letters play a crucial role in professional advancement, and biases in these letters can lead to unequal opportunities based on gender, thus perpetuating societal inequalities.

Existing Effort & Limitation

  • What: Previous studies have attempted to evaluate and mitigate biases in natural language processing models. This research adds by specifically examining gender biases in the context of LLM-generated professional documents.
    • allocational harms: An allocational harm occurs when a system allocates an opportunity or a resource to some groups while withholding it from others.
    • representational harms: A representational harm occurs when a system stereotypes, demeans, or misrepresents certain groups. For example, as of 2023 Google Photos was still blocked from labeling gorillas in photos, a workaround for its earlier misclassification of Black people as gorillas; another prevalent example is stereotypes being encoded in word embeddings, which are trained on a wide range of text.
  • Why: Despite these efforts, the study reveals that current LLMs still manifest significant gender biases. This limitation is critical as it suggests that existing mitigation strategies are not fully effective in addressing the biases within LLM outputs.

Limitation Triviality

  • What: The question of whether the limitation (i.e., gender bias in LLM outputs) is trivial is addressed.
  • Why: The study concludes that the limitation is not trivial: the identified biases can have real-world consequences, such as lowering the success rates of job or academic applications for women, highlighting the need for more sophisticated solutions.

Challenge (If Non-Trivial)

  • What: The challenge lies in the inherent complexity of societal and linguistic biases that are embedded in the large datasets used to train LLMs. This complexity makes it difficult to detect and mitigate biases in a nuanced and effective manner.
  • Why: Addressing these biases is challenging because they are not only a reflection of the data on which models are trained but also a result of the complex interactions between model architecture, training process, and the data itself. Developing solutions requires an understanding of both the technical aspects of machine learning models and the societal implications of their biases.

Proposed Solutions

  • What: The paper proposes the development of a comprehensive testbed for identifying gender biases in LLM-generated documents and suggests that future research should focus on creating effective bias mitigation techniques.
  • Why: A testbed would allow for systematic and consistent evaluation of biases across different models and datasets, facilitating the development of more targeted mitigation strategies. The call for future research acknowledges the evolving nature of LLMs and the continuous effort required to ensure fairness and reduce societal harms in their applications.

Task, Input, Output, Significance

  • Task: Investigate gender biases in LLM-generated recommendation letters
  • Input: Prompts with minimal or detailed context about candidates
  • Output: Gender-biased language in recommendation letters
  • Significance: Identifies fairness issues in professional document automation

Existing Effort & Limitation

  • Effort: Evaluation methods to reveal language style and lexical content biases
  • Limitation: Current LLMs (ChatGPT, Alpaca) exhibit significant gender biases

Limitation Triviality

  • Not trivial: Bias perpetuates societal inequalities, requires advanced mitigation strategies

Challenge (If Non-Trivial)

  • Complex societal and linguistic biases embedded in training data
  • Difficulty in automatically detecting and mitigating nuanced biases

Proposed Solutions

  • Comprehensive testbed for bias identification
  • Future research to develop effective bias mitigation techniques

questions

How are biases in language style (formality and positivity) evaluated?

Biases in language style, including formality and positivity, are evaluated with a combination of automated classifiers and statistical tests that quantify and compare the style of text the LLMs generate for different genders. Here's a step-by-step breakdown of how each component is evaluated:

Biases in Language Formality

  1. Classification of Formality: To evaluate formality, each sentence in the generated document is classified as formal or informal. This is usually done with a language formality classifier, commonly a model fine-tuned on a formality corpus such as Grammarly's Yahoo Answers Formality Corpus (GYAFC), which provides examples of formal and informal sentences.

  2. Percentage Calculation: Once sentences are classified, the percentage of formal sentences in each generated document is calculated. This involves counting the number of sentences classified as formal and dividing by the total number of sentences in the document, then multiplying by 100 to get a percentage.

  3. Statistical Testing: The final step compares the percentages of formal sentences in documents generated for male and female subjects using statistical tests, typically t-tests, to assess whether there is a statistically significant difference in formality between genders (a minimal code sketch of this pipeline follows the list).
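
A minimal sketch of that three-step formality pipeline, assuming NLTK for sentence splitting, a publicly available GYAFC-style formality classifier (the checkpoint name is my assumption, not necessarily the paper's choice), and Welch's t-test from SciPy:

```python
# Hedged sketch: classify sentences as formal/informal, compute the percentage
# of formal sentences per letter, then t-test male vs. female letters.
from transformers import pipeline
from nltk import sent_tokenize          # requires: nltk.download("punkt")
from scipy.stats import ttest_ind

# Illustrative formality classifier fine-tuned on GYAFC-like data.
formality_clf = pipeline("text-classification",
                         model="s-nlp/roberta-base-formality-ranker")

def pct_formal(letter: str) -> float:
    """Percentage of sentences in one letter classified as formal."""
    sents = sent_tokenize(letter)
    labels = [formality_clf(s, truncation=True)[0]["label"] for s in sents]
    # Label strings depend on the checkpoint; adjust the check below if needed.
    return 100.0 * sum(lab.lower() == "formal" for lab in labels) / len(sents)

def formality_gap(male_letters, female_letters):
    """Welch's t-test on per-letter formality percentages for the two groups."""
    m = [pct_formal(x) for x in male_letters]
    f = [pct_formal(x) for x in female_letters]
    return ttest_ind(m, f, equal_var=False)
```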

Biases in Language Positivity

  1. Classification of Positivity: To evaluate positivity, sentiment analysis is applied to each sentence in the generated documents. Sentiment analysis classifiers, often pretrained and available in NLP libraries, are used to determine the sentiment of each sentence. Sentiments can typically be categorized as positive, neutral, or negative.

  2. Percentage Calculation: The percentage of sentences with positive sentiment is calculated for each document. This involves counting the number of sentences classified with positive sentiment and dividing by the total number of sentences, then converting to a percentage.

  3. Statistical Testing: As with formality, the percentages of positive sentences in male and female documents are compared using statistical tests such as t-tests, quantifying whether there is a significant positivity difference between documents generated for different genders, which would indicate bias (a matching sketch follows).
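
A matching sketch for positivity, swapping the formality classifier for an off-the-shelf sentiment model (the checkpoint here is an illustrative choice rather than the paper's documented one):

```python
# Hedged sketch of the positivity evaluation, mirroring the formality step.
from transformers import pipeline
from nltk import sent_tokenize          # requires: nltk.download("punkt")
from scipy.stats import ttest_ind

sentiment_clf = pipeline("sentiment-analysis",
                         model="distilbert-base-uncased-finetuned-sst-2-english")

def pct_positive(letter: str) -> float:
    """Percentage of sentences in one letter with positive sentiment."""
    sents = sent_tokenize(letter)
    labels = [sentiment_clf(s, truncation=True)[0]["label"] for s in sents]
    return 100.0 * sum(lab == "POSITIVE" for lab in labels) / len(sents)

def positivity_gap(male_letters, female_letters):
    m = [pct_positive(x) for x in male_letters]
    f = [pct_positive(x) for x in female_letters]
    return ttest_ind(m, f, equal_var=False)
```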

These evaluations help to identify and quantify biases in the language style of LLM-generated texts. By examining the differences in formality and positivity, researchers can understand how LLMs may perpetuate or amplify existing social biases, contributing to the broader conversation on fairness and bias in AI technologies.

proposed comprehensive testbed

The comprehensive testbed described in the study is designed to identify and quantify biases, particularly gender biases, in recommendation letters generated by large language models (LLMs). It is a framework of metrics and prompt datasets tailored to evaluate the fairness of LLMs in producing professional documents such as recommendation letters. Here's a detailed breakdown of its components and functionality:

Metrics for Bias Evaluation

The testbed employs several metrics to assess biases in different dimensions:

  1. Biases in Lexical Content: This metric evaluates the differences in word choices between recommendation letters generated for male and female candidates. It looks at nouns and adjectives used in the letters to identify whether there's a salient frequency difference that could indicate a bias towards using certain types of words for one gender over another.

  2. Biases in Language Style: This metric measures stylistic differences in the language used in letters for candidates of different genders. It assesses aspects such as formality, positivity, and agency, which are crucial for understanding how language might subtly convey biases.

  3. Hallucination Bias: Given that LLMs can generate content not strictly based on the input (known as "hallucinations"), this metric specifically looks at whether these hallucinations exacerbate bias by analyzing the nature of information that is fabricated by the model.

Prompt Datasets

The testbed also includes prompt datasets designed to elicit recommendation letters from LLMs under controlled conditions. These datasets consist of prompts that vary in the amount of context provided about the candidate, allowing for the evaluation of biases under different scenarios:

  1. Context-Less Generation (CLG): In this scenario, the model is prompted to produce a letter based solely on minimal information, such as a name and a few descriptors. This setting aims to reveal the model's inherent biases.

  2. Context-Based Generation (CBG): Here, the model is given more detailed contextual information about the candidate, including personal achievements and experiences. This scenario tests how the model's biases manifest when more specific details are provided.
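
To make the CLG setting concrete, a context-less prompt might be built from a template like the hypothetical one below; the actual templates are in generate_clg.py in the repository linked above and may be worded differently:

```python
# Hypothetical CLG-style prompt construction. The template, descriptor lists,
# and ages here are illustrative (Kelly/Joseph come from the paper's title);
# the real templates live in generate_clg.py of the linked repo.
CLG_TEMPLATE = ("Generate a reference letter for {name}, "
                "a {age} year old {gender} {occupation}.")

names = {"female": ["Kelly"], "male": ["Joseph"]}
occupations = ["student", "teacher", "chef"]   # illustrative occupations

prompts = [
    CLG_TEMPLATE.format(name=n, age=22, gender=g, occupation=o)
    for g, ns in names.items() for n in ns for o in occupations
]
print(prompts[0])
# -> Generate a reference letter for Kelly, a 22 year old female student.
```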

Evaluation Framework

The comprehensive testbed uses a structured approach to evaluate the biases present in LLM-generated recommendation letters:

  • Odds Ratio Analysis: For biases in lexical content, the testbed applies odds ratio analysis to compare the frequency of specific words in letters for male versus female candidates, identifying words that are disproportionately associated with one gender (a code sketch of this computation follows the list).

  • T-Testing for Language Style: For assessing biases in language style, the testbed employs statistical t-tests to compare the average scores of formality, positivity, and agency between letters for different genders, determining if there are significant stylistic differences.

  • Hallucination Detection and Evaluation: The testbed uses a Context-Sentence Natural Language Inference (NLI) framework to identify hallucinated content in the letters, then evaluates that content for biases in formality, positivity, and agency to understand how biases might be amplified through hallucinations (a hedged NLI sketch closes these notes).
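
As a rough illustration of the odds-ratio computation for lexical content (my own sketch, assuming letters have already been tokenized and filtered to nouns or adjectives; not the repository's exact implementation):

```python
# Hedged sketch: odds ratio of each word appearing in male vs. female letters.
# Values much greater than 1 suggest a word is disproportionately associated
# with male letters; values much smaller than 1 with female letters.
from collections import Counter

def odds_ratios(male_tokens, female_tokens, smoothing=1.0):
    m_counts, f_counts = Counter(male_tokens), Counter(female_tokens)
    m_total, f_total = sum(m_counts.values()), sum(f_counts.values())
    ratios = {}
    for w in set(m_counts) | set(f_counts):
        m_in = m_counts[w] + smoothing             # occurrences in male letters
        f_in = f_counts[w] + smoothing             # occurrences in female letters
        m_out = m_total - m_counts[w] + smoothing  # all other male tokens
        f_out = f_total - f_counts[w] + smoothing  # all other female tokens
        ratios[w] = (m_in / m_out) / (f_in / f_out)
    return ratios

# Usage sketch:
# ratios = odds_ratios(male_tokens, female_tokens)
# top_male_skewed = sorted(ratios, key=ratios.get, reverse=True)[:25]
```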

This comprehensive testbed represents a significant effort to systematically identify, quantify, and understand biases in LLM-generated professional documents. By using a combination of metrics and controlled prompt datasets, the study aims to uncover the nuances of how gender biases manifest in LLM outputs and to highlight the importance of addressing these biases for fair and equitable use of AI in professional settings.
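
Finally, a hedged sketch of how a context-sentence NLI check for hallucinated content might look; the NLI checkpoint and the "not entailed means hallucinated" criterion are my assumptions, and the paper's exact model and decision rule may differ:

```python
# Hedged sketch: flag letter sentences that are not entailed by the candidate
# context, treating them as candidate hallucinations for further bias analysis.
from transformers import pipeline
from nltk import sent_tokenize          # requires: nltk.download("punkt")

nli = pipeline("text-classification", model="roberta-large-mnli")

def hallucinated_sentences(context: str, letter: str):
    """Return letter sentences that the candidate's context does not entail."""
    flagged = []
    for sent in sent_tokenize(letter):
        # Premise = candidate context, hypothesis = letter sentence.
        result = nli([{"text": context, "text_pair": sent}], truncation=True)[0]
        if result["label"] != "ENTAILMENT":    # NEUTRAL or CONTRADICTION
            flagged.append(sent)
    return flagged
```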