by Mae Sosto (CWI), Delfina Sol Martinez Pandiani (University of Amsterdam), Laura Hollink (CWI)  

Behind the fluent sentences of AI lurks a challenge: the reproduction of social biases. At CWI’s Human-Centered Data Analytics group (HCDA) [L1], we study how Large Language Models (LLMs) handle gender and sexuality, exposing subtle patterns of exclusion and working toward more inclusive language technologies.

Although binary gender biases in LLMs have received growing attention, research that includes broader queer perspectives remains limited. Terms linked to LGBTQIA+ identities (such as queer or lesbian) are often associated with negative content and flagged as inappropriate by moderation systems, even in neutral or positive contexts, limiting accurate representation [2]. LLMs also tend to assume binary gender norms, leading to misgendering or exclusion of transgender and gender-non-conforming identities. Moreover, the growing complexity of gender and sexuality terminology makes detecting and mitigating such biases increasingly challenging.
 
In the HCDA group at CWI, we study biases in commonly used LLMs to foster fairer, more inclusive NLP systems. Through the QueerGen project, we specifically examine the effects of including identity markers related to gender and sexuality (e.g., agender, lesbian, nonbinary, cisgender) in sentence generation. By comparing sentences with and without such markers, we aim to uncover patterns of social bias in model completions.
 
We created a dataset based on template sentences (e.g., The ___ should work as a, The ___ is very good at) and neutral “unmarked” subjects (e.g., person, neighbour, employees). We then curated a set of markers related to gender and sexuality, divided into queer markers (e.g., gay, bigender, nonbinary, aromantic) and non-queer markers (e.g., cisgender, straight, LGBT+ ally). Combining templates, subjects, and markers yielded 3,100 sentences on which our dataset is based. 

Subsequently, we performed sentence completion, prompting the previously crafted sentences to a total of 14 LLMs suitable for the task, including BERT, RoBERTa, Llama 3, Gemma 3, DeepSeek R1, GPT-4o and Gemini 2 Flash model families. Table 1 shows sample sentence completion results. Starting from the listed template sentences, we insert the markers in the corresponding subject gaps and generate a single word with the BERT base model to complete the sentence. 

Table 1: Sample results (Predicted word column) generated by the BERT base model when completing template sentences (left) containing markers (Marker column).

The first template sample exhibits a (binary) gender bias in role associations by assigning the cisgender woman a private, family-oriented aspiration, while assigning the cisgender man a professional, socially prestigious aspiration. The second sample contrasts a queer and non-queer marker, exceeding the binary gender dichotomy of male-female. Here, the socially normalized non-queer identity is linked to a positive and respectable role (teacher), while the marginalized identity is associated with a demeaning role (slave). Furthermore, results obtained with non-queer markers are not only more positive overall, but often align with the unmarked subject category, which implicitly represents a default identity of power.
 
Additional results are obtained  through a quantitative study, assessing the generated words using four text analysis tools: sentiment analysis, regard analysis (with respect to the subject or target group), toxicity classification,  and lexical diversity of the generated sets by subject category. Among these, we performed sentiment analysis to examine the connotation –or polarity– of the generated sentences. Figure 1 presents sentiment analysis results, grouped by subject category.

Figure 1: The analysis obtained was conducted with VADER ([L2]), which assigns scores in the -1 to 1 range, where 1 is more positive and -1 is more negative. The figure compares results obtained by BERT base, Llama 3 (70B), GPT-4o and Gemini 2.0 Flash models
Figure 1: The analysis obtained was conducted with VADER ([L2]), which assigns scores in the -1 to 1 range, where 1 is more positive and -1 is more negative. The figure compares results obtained by BERT base, Llama 3 (70B), GPT-4o and Gemini 2.0 Flash models.

This evaluation metric exposes a limitation of LLMs, as they tend to assign higher scores to generations associated with unmarked subjects, followed by non-queer, and lastly queer-marked subjects. Furthermore, masked language models, such as BERT and RoBERTa, generally predict words with significantly lower polarity compared to autoregressive models, such as Llama, Gemma, GPT and Gemini. Lastly, we found the Gemini 2.0 Flash model to produce more positive scores for generations with non-queer subject categories, followed by queer and then unmarked subjects. 

To reduce these limitations, several approaches can be taken. Dataset curation practices can be improved to include more diverse, representative, and affirming content related to LGBTQIA+ communities. Bias evaluation methods can be applied systematically across development stages to identify problematic patterns early on. Fine-tuning and prompt design can be used to guide models toward more inclusive language, and specialized tools can be developed to monitor misgendering, content filtering, and other known issues. Involving LGBTQIA+ communities in the design, testing, and evaluation process is also essential for creating systems that better reflect and support diverse identities. 

As researchers at CWI, we aim to contribute to the development of more inclusive language technologies. By critically examining and addressing model biases, we seek to foster fairness, enhance representation, and promote responsible AI practices, particularly for communities that are often marginalized or misrepresented in digital systems.

Links: 
[L1] https://kwz.me/hxy 
[L2] https://github.com/cjhutto/vaderSentiment

References: 
[1] I. O. Gallegos et al., “Bias and fairness in large language models: A survey,” Computational Linguistics, vol. 50, no. 3, pp. 1097–1179, 2024.
[2] M. Sosto and A. Barrón-Cedeño, “Queer-bench: Quantifying discrimination in language models toward queer identities,” Computing Research Repository, arXiv:2406.12399, 2024.

Please contact: 
Mae Sosto, CWI, The Netherlands
This email address is being protected from spambots. You need JavaScript enabled to view it.

Delfina Sol Martinez Pandiani, University of Amsterdam, The Netherlands
This email address is being protected from spambots. You need JavaScript enabled to view it.

Laura Hollink, CWI, The Netherlands
This email address is being protected from spambots. You need JavaScript enabled to view it.

Next issue: October 2025
Special theme:
Inclusive Digital Futures: Intersectionality, Accessibility, and Responsible Innovation
Call for the next issue


Image ERCIM News 101 epub
This issue in ePub format

Get the latest issue to your desktop
RSS Feed