S-131 | Analysis of Semantic Bias in ChatGPT: Generating Word Definitions by Extension

SAN 2024 Annual Meeting

Theoretical and Computational Neuroscience
Author: Facundo Ariel Totaro | Email: facutotaro@gmail.com


Facundo Ariel Totaro, Julieta Laurino, Laura Kaczer, Juan Kamienkowski¹,²,⁴, Bruno Bianchi¹,²

1. Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Departamento de Computación, Buenos Aires, Argentina.
2. CONICET-Universidad de Buenos Aires, Instituto de Ciencias de la Computación (ICC), Buenos Aires, Argentina.
3. Universidad de Buenos Aires, Departamento de Fisiología, Biología Molecular y Celular, Buenos Aires, Argentina.
4. Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Maestría en Explotación de Datos y Descubrimiento del Conocimiento, Buenos Aires, Argentina.

The development of the Transformer architecture in 2017 enabled the creation of Language Models (LMs) capable of interacting fluently with humans. This fluency prompts the question of whether LMs process language the way humans do. One process of interest is the mechanism by which these models assign meaning to words. Given that a large percentage of words have more than one meaning (i.e., are polysemous or homonymous), this study aimed to investigate the mechanisms by which LMs disambiguate word meaning. However, current LMs do not process words directly but smaller units called tokens, produced by a tokenization method called Byte-Pair Encoding (BPE), under which a word is generally represented by more than one token. This complicates word-level analyses. In this study, we proposed replacing GPT-2's BPE tokenization with word-level tokenization and analyzing how this change affects the results of behavioral experiments on meaning disambiguation. To this end, a Spanish pretrained model was fine-tuned on a new text corpus. Both models (the original and the fine-tuned one) were then analyzed, showing that the model with word-level tokenization disambiguates meanings to a greater extent than the model with the BPE tokenizer. We conclude that word-level tokenization significantly impacts the disambiguation of polysemous words, making these models better suited for studying such tasks.
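To make the tokenization issue concrete, the following minimal sketch (an illustration, not the authors' code) shows how a BPE tokenizer typically splits a single Spanish word into several subword tokens. It assumes the Hugging Face transformers library, and the checkpoint name DeepESP/gpt2-spanish is a stand-in, since the abstract does not name the exact pretrained model used:

    from transformers import AutoTokenizer

    # Load a Spanish GPT-2 BPE tokenizer (checkpoint name is illustrative).
    tokenizer = AutoTokenizer.from_pretrained("DeepESP/gpt2-spanish")

    word = "desambiguación"
    tokens = tokenizer.tokenize(word)
    print(tokens)
    # A BPE vocabulary typically yields several subword pieces for this word,
    # so any word-level measure must first aggregate over those pieces.
    # A word-level tokenizer would instead map the word to a single token,
    # which is what makes word-level behavioral analyses more direct.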
