Assessing Queer Gender and Sexuality Biases in Large Language Models

by Mae Sosto (CWI), Delfina Sol Martinez Pandiani (University of Amsterdam), Laura Hollink (CWI)

AI systems don't just learn language, they also absorb and reproduce social biases. At CWI's Human-Centered Data Analytics group, we test Large Language Models with template-based sentences and queer identity markers, revealing systematic patterns of exclusion. Our goal: to develop methods and datasets that help make language technologies fairer and more inclusive.

Behind the fluent sentences of AI lies a persistent challenge: the reproduction of social biases embedded in language. At CWI’s Human-Centered Data Analytics group (HCDA) [L1], we investigate how Large Language Models (LLMs) handle gender and sexuality, uncovering subtle patterns of exclusion and working toward more inclusive language technologies.

ssumptions, rooted in both humans and AI systems, show how societal norms shape expectations around gender and sexuality, often resulting in subtle biases. The same mechanism operate in LLMs, which are trained on massive, uncurated text corpora and thus both inherit and reinforce societal biases tied to identity features marked by power imbalances, such as gender, ethnicity, race, and religion. In Natural Language Processing (NLP), bias is typically defined as systematic differences in system outputs across social groups, often rooted in historical and structural inequalities [1].

Although binary gender biases in LLMs have received growing attention, research that includes broader queer perspectives remains limited. Terms linked to LGBTQIA+ identities (such as queer or lesbian) are often associated with negative content and flagged as inappropriate by moderation systems, even in neutral or positive contexts, limiting accurate representation [2]. LLMs also tend to assume binary gender norms, leading to misgendering or exclusion of transgender and gender-non-conforming identities. Moreover, the growing complexity of gender and sexuality terminology makes detecting and mitigating such biases increasingly challenging.

In the HCDA group, we study biases in commonly used LLMs to foster fairer, more inclusive NLP systems. Through the QueerGen project, we specifically examine the effects of including identity markers related to gender and sexuality (e.g., agender, lesbian, nonbinary, cisgender) in sentence generation. By comparing sentences with and without such markers, we aim to uncover patterns of social bias in model completions.

We created a dataset based on template sentences (e.g., The ___ should work as a, The ___ is very good at), neutral “unmarked” subjects (e.g., person, neighbour, employees), and a set of markers related to gender and sexuality–divided into queer markers (e.g., gay, bigender, nonbinary, aromantic) and non-queer markers (e.g., cisgender, straight, LGBT+ ally). Combining templates, subjects, and markers yielded to. total of 3,100 sentences.

Table 1: Sample results (Predicted word column) generated by the BERT base model when completing template sentences (left) containing the specified markers (marker column).

Subsequently, we performed sentence completion, prompting the previously crafted sentences to a total of 14 LLMs, including BERT, RoBERTa, Llama 3, Gemma 3, DeepSeek R1, GPT-4o and Gemini 2 Flash model families. Starting from the listed template sentences, we insert the markers in the corresponding subject gaps and generate a single word with the BERT Base model to complete the sentence (Table 1 shows sample sentence completion results).

The first sample exhibits a (binary) gender bias in role associations by assigning the cisgender woman a private, family-oriented aspiration, while assigning the cisgender man a professional, perceived socially prestigious aspiration. The second sample contrasts a queer and non-queer marker, exceeding the binary gender dichotomy of male-female. Here, the socially normalized non-queer identity is linked to a positive and socially respectable role (teacher), while the marginalized identity is associated with a demeaning role (slave).

Figure 1: The analysis obtained was conducted with VADER ([L2]), which assigns scores in the -1 to 1 range, where 1 is more positive and -1 is more negative. The figure compares results obtained by BERT Base, Llama 3 70B, GPT-4o and Gemini 2.0 Flash models.

Additional results are obtained through a quantitative study, assessing the generated words using four text analysis tools: sentiment analysis, regard analysis (with respect to the subject of a specific target group), toxicity classification, and prediction diversity of the generated sets by subject category. Among these, sentiment analysis allows to examine the connotation–or polarity–of the generated sentences. Figure 1 presents sentiment analysis results, grouped by subject category.

This evaluation metric exposes a key limitation of LLMs: they tend to assign more positive or neutral scores to generations based on unmarked subjects, while marked subjects—especially queer-marked ones—receive less favourable completions. Non-queer marked occupy an intermediate position, often associated with more negative or socially marginal language. These patterns implicitly reflect a default identity of power.

Masked language models (e.g., BERT and RoBERTa) not only produce predictions with significantly lower polarity but also higher toxicity and more negative/less positive regard compared to Autoregressive Models (ARLMs), which exhibit more nuanced trends. Specifically we find that open-access ARLMs (e.g., Llama 3, Gemma 3) partially mitigate these biases, while closed-access ARLMs (e.g., GPT-4o and Gemini 2.0 Flash) tend to redistribute them, at time shifting harms toward unmarked subjects.

To reduce these limitations, several approaches can be taken. Dataset curation practices can be improved to include more diverse, representative, and affirming content related to LGBTQIA+ communities. Bias evaluation methods can be applied systematically across development stages to identify problematic patterns early on. Fine-tuning and prompt design can be used to guide models toward more inclusive language, and specialized tools can be developed to monitor misgendering, content filtering, and other known issues. Involving LGBTQIA+ communities in the design, testing, and evaluation process is also essential for creating systems that better reflect and support diverse identities.

As researchers at CWI, we aim to contribute to the development of more inclusive language technologies. By critically examining and addressing model biases, we seek to foster fairness, enhance representation, and promote responsible AI practices, particularly for communities that are often marginalized or misrepresented, also in digital systems.

Links:
[L1] https://kwz.me/hxy
[L2] https://github.com/cjhutto/vaderSentiment

References:
[1] I. O. Gallegos et al., “Bias and fairness in large language models: A survey,” Computational Linguistics, vol. 50, no. 3, pp. 1097–1179, 2024.
[2] M. Sosto and A. Barrón-Cedeño, “Queer-bench: Quantifying discrimination in language models toward queer identities,” Computing Research Repository, arXiv:2406.12399, 2024.

Please contact:
Mae Sosto, CWI, The Netherlands
This email address is being protected from spambots. You need JavaScript enabled to view it.

Delfina Sol Martinez Pandiani, University of Amsterdam, The Netherlands
This email address is being protected from spambots. You need JavaScript enabled to view it.

Laura Hollink, CWI, The Netherlands
This email address is being protected from spambots. You need JavaScript enabled to view it.