Theoretical and Computational Neuroscience
Author: Gianolini Agustín Andrés | Email:
Gianolini Agustín Andrés 1°, Laurino Julieta 3°, Kaczer Laura 3°, Kamienkowski Juan 1°3°4°, Bianchi Bruno 1°2°
1° Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación. Buenos Aires, Argentina.
2° CONICET-Universidad de Buenos Aires. Instituto de Ciencias de la Computación (ICC). Buenos Aires, Argentina.
3° Universidad de Buenos Aires. Departamento de Fisiología, Biología Molecular y Celular. Buenos Aires, Argentina.
4° Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Maestría en Explotación de Datos y Descubrimiento del Conocimiento. Buenos Aires, Argentina.
The development of the Transformer architecture in 2017 has enabled the creation of Language Models (LMs) capable of interacting fluently with humans. This similarity between humans and LMs raises the question of whether they process language in a similar manner. One process of interest is the mechanism by which these models assign meaning to words. Given that a large percentage of words have more than one meaning (i.e., are polysemous or homonymous), this study aimed to investigate the mechanisms by which LMs disambiguate word meaning. However, current LMs do not process words directly, but smaller units called tokens, produced by a tokenization method called Byte-Pair Encoding (BPE) in which a word is often represented by more than one token. This creates challenges for word-level analyses. In this study, we proposed replacing GPT-2’s BPE tokenization with word-level tokenization and analyzed how this change affects the results of behavioral experiments on meaning disambiguation. To this end, a model pretrained in Spanish was fine-tuned on a new text corpus. Analyses of both models (the original and the fine-tuned one) showed that the model with word-level tokenization disambiguates meanings more effectively than the model with the BPE tokenizer. We conclude that word-level tokenization significantly impacts the disambiguation of polysemous words, making these models better suited for analyzing such tasks.
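To illustrate the tokenization difference the abstract describes, the sketch below contrasts a simplified subword segmentation (a greedy longest-match stand-in for BPE; real BPE merges are learned from a corpus) with word-level tokenization, where each word maps to exactly one token. The toy vocabulary and the function name `bpe_like_tokenize` are hypothetical and not taken from the authors' code.

```python
def bpe_like_tokenize(word, vocab):
    """Greedy longest-match segmentation: a simplified stand-in for BPE.

    Falls back to single characters when no vocabulary piece matches,
    so every word can be segmented.
    """
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Toy subword vocabulary: the full word "banco" was never merged,
# so it gets split into two tokens.
vocab = {"ban", "co", "la", "or", "illa"}

print(bpe_like_tokenize("banco", vocab))   # subword: more than one token per word
print("el banco cerró".split())            # word-level: one token per word
```

With a subword vocabulary like this, "banco" is represented by two tokens ("ban", "co"), so any word-level measure (e.g., per-word surprisal or meaning representations) must first aggregate over token pieces; word-level tokenization avoids that aggregation step entirely.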