22 results on '"Limisiewicz, Tomasz"'
Search Results
2. MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
- Author
-
Ahia, Orevaoghene, Kumar, Sachin, Gonen, Hila, Hoffman, Valentin, Limisiewicz, Tomasz, Tsvetkov, Yulia, and Smith, Noah A.
- Subjects
Computer Science - Computation and Language - Abstract
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that the current tokenization algorithms introduce to non-Latin script languages, the main one being over-segmentation. In this work, we propose MAGNET; multilingual adaptive gradient-based tokenization to reduce over-segmentation via adaptive gradient-based subword tokenization. MAGNET learns to predict segment boundaries between byte tokens in a sequence via sub-modules within the model, which act as internal boundary predictors (tokenizers). Previous gradient-based tokenization methods aimed for uniform compression across sequences by integrating a single boundary predictor during training and optimizing it end-to-end through stochastic reparameterization alongside the next token prediction objective. However, this approach still results in over-segmentation for non-Latin script languages in multilingual settings. In contrast, MAGNET offers a customizable architecture where byte-level sequences are routed through language-script-specific predictors, each optimized for its respective language script. This modularity enforces equitable segmentation granularity across different language scripts compared to previous methods. Through extensive experiments, we demonstrate that in addition to reducing segmentation disparities, MAGNET also enables faster language modelling and improves downstream utility.
- Published
- 2024
3. MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
- Author
-
Limisiewicz, Tomasz, Blevins, Terra, Gonen, Hila, Ahia, Orevaoghene, and Zettlemoyer, Luke
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
- Published
- 2024
4. Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models
- Author
-
Blevins, Terra, Limisiewicz, Tomasz, Gururangan, Suchin, Li, Margaret, Gonen, Hila, Smith, Noah A., and Zettlemoyer, Luke
- Subjects
Computer Science - Computation and Language - Abstract
Despite their popularity in non-English NLP, multilingual language models often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert Language Models (X-ELM), which mitigate this competition by independently training language models on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits over performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling., Comment: EMNLP 2024
- Published
- 2024
5. Debiasing Algorithm through Model Adaptation
- Author
-
Limisiewicz, Tomasz, Mareček, David, and Musil, Tomáš
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Statistics - Machine Learning - Abstract
Large language models are becoming the go-to solution for the ever-growing number of tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are most prone to convey bias. Based on the analysis results, we intervene in the model by applying a linear projection to the weight matrices of these layers. Our titular method, DAMA, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. We release code for our method and models, which retrain LLaMA's state-of-the-art performance while being significantly less biased., Comment: Accepted to ICLR 2024
- Published
- 2023
6. Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation
- Author
-
Iluz, Bar, Limisiewicz, Tomasz, Stanovsky, Gabriel, and Mareček, David
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous works. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer's vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profession names (e.g., Spanish "doctora" for "female doctor") tend to be split into multiple subword tokens. Our results indicate that the imbalance of gender forms in the model's training corpus is a major factor contributing to gender bias and has a greater impact than subword splitting. We show that analyzing subword splits provides good estimates of gender-form imbalance in the training data and can be used even when the corpus is not publicly available. We also demonstrate that fine-tuning just the token embedding layer can decrease the gap in gender prediction accuracy between female and male forms without impairing the translation quality., Comment: Accepted to AACL 2023
- Published
- 2023
7. Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
- Author
-
Limisiewicz, Tomasz, Balhar, Jiří, and Mareček, David
- Subjects
Computer Science - Computation and Language - Abstract
Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers. Our findings show that the overlap of vocabulary across languages can be actually detrimental to certain downstream tasks (POS, dependency tree labeling). In contrast, NER and sentence-level tasks (cross-lingual retrieval, NLI) benefit from sharing vocabulary. We also observe that the coverage of the language-specific tokens in the multilingual vocabulary significantly impacts the word-level tasks. Our study offers a deeper understanding of the role of tokenizers in multilingual language models and guidelines for future model developers to choose the most suitable tokenizer for their specific application before undertaking costly model pre-training, Comment: in ACL Findings 2023
- Published
- 2023
8. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- Author
-
Workshop, BigScience, Scao, Teven Le, Fan, Angela, Akiki, Christopher, Pavlick, Ellie, Ilić, Suzana, Hesslow, Daniel, Castagné, Roman, Luccioni, Alexandra Sasha, Yvon, François, Gallé, Matthias, Tow, Jonathan, Rush, Alexander M., Biderman, Stella, Webson, Albert, Ammanamanchi, Pawan Sasanka, Wang, Thomas, Sagot, Benoît, Muennighoff, Niklas, del Moral, Albert Villanova, Ruwase, Olatunji, Bawden, Rachel, Bekman, Stas, McMillan-Major, Angelina, Beltagy, Iz, Nguyen, Huu, Saulnier, Lucile, Tan, Samson, Suarez, Pedro Ortiz, Sanh, Victor, Laurençon, Hugo, Jernite, Yacine, Launay, Julien, Mitchell, Margaret, Raffel, Colin, Gokaslan, Aaron, Simhi, Adi, Soroa, Aitor, Aji, Alham Fikri, Alfassy, Amit, Rogers, Anna, Nitzav, Ariel Kreisberg, Xu, Canwen, Mou, Chenghao, Emezue, Chris, Klamm, Christopher, Leong, Colin, van Strien, Daniel, Adelani, David Ifeoluwa, Radev, Dragomir, Ponferrada, Eduardo González, Levkovizh, Efrat, Kim, Ethan, Natan, Eyal Bar, De Toni, Francesco, Dupont, Gérard, Kruszewski, Germán, Pistilli, Giada, Elsahar, Hady, Benyamina, Hamza, Tran, Hieu, Yu, Ian, Abdulmumin, Idris, Johnson, Isaac, Gonzalez-Dios, Itziar, de la Rosa, Javier, Chim, Jenny, Dodge, Jesse, Zhu, Jian, Chang, Jonathan, Frohberg, Jörg, Tobing, Joseph, Bhattacharjee, Joydeep, Almubarak, Khalid, Chen, Kimbo, Lo, Kyle, Von Werra, Leandro, Weber, Leon, Phan, Long, allal, Loubna Ben, Tanguy, Ludovic, Dey, Manan, Muñoz, Manuel Romero, Masoud, Maraim, Grandury, María, Šaško, Mario, Huang, Max, Coavoux, Maximin, Singh, Mayank, Jiang, Mike Tian-Jian, Vu, Minh Chien, Jauhar, Mohammad A., Ghaleb, Mustafa, Subramani, Nishant, Kassner, Nora, Khamis, Nurulaqilla, Nguyen, Olivier, Espejel, Omar, de Gibert, Ona, Villegas, Paulo, Henderson, Peter, Colombo, Pierre, Amuok, Priscilla, Lhoest, Quentin, Harliman, Rheza, Bommasani, Rishi, López, Roberto Luis, Ribeiro, Rui, Osei, Salomey, Pyysalo, Sampo, Nagel, Sebastian, Bose, Shamik, Muhammad, Shamsuddeen Hassan, Sharma, Shanya, Longpre, Shayne, Nikpoor, Somaieh, Silberberg, Stanislav, Pai, Suhas, Zink, Sydney, Torrent, Tiago Timponi, Schick, Timo, Thrush, Tristan, Danchev, Valentin, Nikoulina, Vassilina, Laippala, Veronika, Lepercq, Violette, Prabhu, Vrinda, Alyafeai, Zaid, Talat, Zeerak, Raja, Arun, Heinzerling, Benjamin, Si, Chenglei, Taşar, Davut Emre, Salesky, Elizabeth, Mielke, Sabrina J., Lee, Wilson Y., Sharma, Abheesht, Santilli, Andrea, Chaffin, Antoine, Stiegler, Arnaud, Datta, Debajyoti, Szczechla, Eliza, Chhablani, Gunjan, Wang, Han, Pandey, Harshit, Strobelt, Hendrik, Fries, Jason Alan, Rozen, Jos, Gao, Leo, Sutawika, Lintang, Bari, M Saiful, Al-shaibani, Maged S., Manica, Matteo, Nayak, Nihal, Teehan, Ryan, Albanie, Samuel, Shen, Sheng, Ben-David, Srulik, Bach, Stephen H., Kim, Taewoon, Bers, Tali, Fevry, Thibault, Neeraj, Trishala, Thakker, Urmish, Raunak, Vikas, Tang, Xiangru, Yong, Zheng-Xin, Sun, Zhiqing, Brody, Shaked, Uri, Yallow, Tojarieh, Hadar, Roberts, Adam, Chung, Hyung Won, Tae, Jaesung, Phang, Jason, Press, Ofir, Li, Conglong, Narayanan, Deepak, Bourfoune, Hatim, Casper, Jared, Rasley, Jeff, Ryabinin, Max, Mishra, Mayank, Zhang, Minjia, Shoeybi, Mohammad, Peyrounette, Myriam, Patry, Nicolas, Tazi, Nouamane, Sanseviero, Omar, von Platen, Patrick, Cornette, Pierre, Lavallée, Pierre François, Lacroix, Rémi, Rajbhandari, Samyam, Gandhi, Sanchit, Smith, Shaden, Requena, Stéphane, Patil, Suraj, Dettmers, Tim, Baruwa, Ahmed, Singh, Amanpreet, Cheveleva, Anastasia, Ligozat, Anne-Laure, Subramonian, Arjun, Névéol, Aurélie, Lovering, Charles, Garrette, Dan, Tunuguntla, Deepak, Reiter, Ehud, Taktasheva, Ekaterina, Voloshina, Ekaterina, Bogdanov, Eli, Winata, Genta Indra, Schoelkopf, Hailey, Kalo, Jan-Christoph, Novikova, Jekaterina, Forde, Jessica Zosa, Clive, Jordan, Kasai, Jungo, Kawamura, Ken, Hazan, Liam, Carpuat, Marine, Clinciu, Miruna, Kim, Najoung, Cheng, Newton, Serikov, Oleg, Antverg, Omer, van der Wal, Oskar, Zhang, Rui, Zhang, Ruochen, Gehrmann, Sebastian, Mirkin, Shachar, Pais, Shani, Shavrina, Tatiana, Scialom, Thomas, Yun, Tian, Limisiewicz, Tomasz, Rieser, Verena, Protasov, Vitaly, Mikhailov, Vladislav, Pruksachatkun, Yada, Belinkov, Yonatan, Bamberger, Zachary, Kasner, Zdeněk, Rueda, Alice, Pestana, Amanda, Feizpour, Amir, Khan, Ammar, Faranak, Amy, Santos, Ana, Hevia, Anthony, Unldreaj, Antigona, Aghagol, Arash, Abdollahi, Arezoo, Tammour, Aycha, HajiHosseini, Azadeh, Behroozi, Bahareh, Ajibade, Benjamin, Saxena, Bharat, Ferrandis, Carlos Muñoz, McDuff, Daniel, Contractor, Danish, Lansky, David, David, Davis, Kiela, Douwe, Nguyen, Duong A., Tan, Edward, Baylor, Emi, Ozoani, Ezinwanne, Mirza, Fatima, Ononiwu, Frankline, Rezanejad, Habib, Jones, Hessie, Bhattacharya, Indrani, Solaiman, Irene, Sedenko, Irina, Nejadgholi, Isar, Passmore, Jesse, Seltzer, Josh, Sanz, Julio Bonis, Dutra, Livia, Samagaio, Mairon, Elbadri, Maraim, Mieskes, Margot, Gerchick, Marissa, Akinlolu, Martha, McKenna, Michael, Qiu, Mike, Ghauri, Muhammed, Burynok, Mykola, Abrar, Nafis, Rajani, Nazneen, Elkott, Nour, Fahmy, Nour, Samuel, Olanrewaju, An, Ran, Kromann, Rasmus, Hao, Ryan, Alizadeh, Samira, Shubber, Sarmad, Wang, Silas, Roy, Sourav, Viguier, Sylvain, Le, Thanh, Oyebade, Tobi, Le, Trieu, Yang, Yoyo, Nguyen, Zach, Kashyap, Abhinav Ramesh, Palasciano, Alfredo, Callahan, Alison, Shukla, Anima, Miranda-Escalada, Antonio, Singh, Ayush, Beilharz, Benjamin, Wang, Bo, Brito, Caio, Zhou, Chenxi, Jain, Chirag, Xu, Chuxin, Fourrier, Clémentine, Periñán, Daniel León, Molano, Daniel, Yu, Dian, Manjavacas, Enrique, Barth, Fabio, Fuhrimann, Florian, Altay, Gabriel, Bayrak, Giyaseddin, Burns, Gully, Vrabec, Helena U., Bello, Imane, Dash, Ishani, Kang, Jihyun, Giorgi, John, Golde, Jonas, Posada, Jose David, Sivaraman, Karthik Rangasai, Bulchandani, Lokesh, Liu, Lu, Shinzato, Luisa, de Bykhovetz, Madeleine Hahn, Takeuchi, Maiko, Pàmies, Marc, Castillo, Maria A, Nezhurina, Marianna, Sänger, Mario, Samwald, Matthias, Cullan, Michael, Weinberg, Michael, De Wolf, Michiel, Mihaljcic, Mina, Liu, Minna, Freidank, Moritz, Kang, Myungsun, Seelam, Natasha, Dahlberg, Nathan, Broad, Nicholas Michio, Muellner, Nikolaus, Fung, Pascale, Haller, Patrick, Chandrasekhar, Ramya, Eisenberg, Renata, Martin, Robert, Canalli, Rodrigo, Su, Rosaline, Su, Ruisi, Cahyawijaya, Samuel, Garda, Samuele, Deshmukh, Shlok S, Mishra, Shubhanshu, Kiblawi, Sid, Ott, Simon, Sang-aroonsiri, Sinee, Kumar, Srishti, Schweter, Stefan, Bharati, Sushil, Laud, Tanmay, Gigant, Théo, Kainuma, Tomoya, Kusa, Wojciech, Labrak, Yanis, Bajaj, Yash Shailesh, Venkatraman, Yash, Xu, Yifan, Xu, Yingxin, Xu, Yu, Tan, Zhe, Xie, Zhongli, Ye, Zifan, Bras, Mathilde, Belkada, Younes, and Wolf, Thomas
- Subjects
Computer Science - Computation and Language - Abstract
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
- Published
- 2022
9. You Can Have Your Data and Balance It Too: Towards Balanced and Efficient Multilingual Models
- Author
-
Limisiewicz, Tomasz, Malkin, Dan, and Stanovsky, Gabriel
- Subjects
Computer Science - Computation and Language - Abstract
Multilingual models have been widely used for cross-lingual transfer to low-resource languages. However, the performance on these languages is hindered by their underrepresentation in the pretraining data. To alleviate this problem, we propose a novel multilingual training technique based on teacher-student knowledge distillation. In this setting, we utilize monolingual teacher models optimized for their language. We use those teachers along with balanced (sub-sampled) data to distill the teachers' knowledge into a single multilingual student. Our method outperforms standard training methods in low-resource languages and retrains performance on high-resource languages while using the same amount of data. If applied widely, our approach can increase the representation of low-resource languages in NLP systems., Comment: SIGTYP 2023
- Published
- 2022
10. Don't Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information
- Author
-
Limisiewicz, Tomasz and Mareček, David
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
The representations in large language models contain multiple types of gender information. We focus on two types of such signals in English texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and specific gender. We can disentangle the model's embeddings and identify components encoding both types of information with probing. We aim to diminish the stereotypical bias in the representations while preserving the factual gender signal. Our filtering method shows that it is possible to decrease the bias of gender-neutral profession names without significant deterioration of language modeling capabilities. The findings can be applied to language generation to mitigate reliance on stereotypes while preserving gender agreement in coreferences., Comment: Presented at GeBNLP 2022
- Published
- 2022
11. A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank
- Author
-
Malkin, Dan, Limisiewicz, Tomasz, and Stanovsky, Gabriel
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance in balanced data conditions to mitigate data size confounds, classifying pretraining languages that improve downstream performance as donors, and languages that are improved in zero-shot performance as recipients. We develop a method of quadratic time complexity in the number of languages to estimate these relations, instead of an exponential exhaustive computation of all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of large-scale multilingual language models in choosing better pretraining configurations., Comment: Accepted to NAACL 2022
- Published
- 2022
12. Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes
- Author
-
Limisiewicz, Tomasz and Mareček, David
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
State-of-the-art contextual embeddings are obtained from large language models available only for a few languages. For others, we need to learn representations using a multilingual model. There is an ongoing debate on whether multilingual embeddings can be aligned in a space shared across many languages. The novel Orthogonal Structural Probe (Limisiewicz and Mare\v{c}ek, 2021) allows us to answer this question for specific linguistic features and learn a projection based only on mono-lingual annotated datasets. We evaluate syntactic (UD) and lexical (WordNet) structural information encoded inmBERT's contextual representations for nine diverse languages. We observe that for languages closely related to English, no transformation is needed. The evaluated information is encoded in a shared cross-lingual embedding space. For other languages, it is beneficial to apply orthogonal transformation learned separately for each language. We successfully apply our findings to zero-shot and few-shot cross-lingual parsing., Comment: EMNLP 2021 Main Conference
- Published
- 2021
13. Introducing Orthogonal Constraint in Structural Probes
- Author
-
Limisiewicz, Tomasz and Mareček, David
- Subjects
Computer Science - Computation and Language - Abstract
With the recent success of pre-trained models in NLP, a significant focus was put on interpreting their representations. One of the most prominent approaches is structural probing (Hewitt and Manning, 2019), where a linear projection of word embeddings is performed in order to approximate the topology of dependency structures. In this work, we introduce a new type of structural probing, where the linear projection is decomposed into 1. isomorphic space rotation; 2. linear scaling that identifies and scales the most relevant dimensions. In addition to syntactic dependency, we evaluate our method on novel tasks (lexical hypernymy and position in a sentence). We jointly train the probes for multiple tasks and experimentally show that lexical and syntactic information is separated in the representations. Moreover, the orthogonal constraint makes the Structural Probes less vulnerable to memorization.
- Published
- 2020
14. Gender Coreference and Bias Evaluation at WMT 2020
- Author
-
Kocmi, Tom, Limisiewicz, Tomasz, and Stanovsky, Gabriel
- Subjects
Computer Science - Computation and Language - Abstract
Gender bias in machine translation can manifest when choosing gender inflections based on spurious gender correlations. For example, always translating doctors as men and nurses as women. This can be particularly harmful as models become more popular and deployed within commercial systems. Our work presents the largest evidence for the phenomenon in more than 19 systems submitted to the WMT over four diverse target languages: Czech, German, Polish, and Russian. To achieve this, we use WinoMT, a recent automatic test suite which examines gender coreference and bias when translating from English to languages with grammatical gender. We extend WinoMT to handle two new languages tested in WMT: Polish and Czech. We find that all systems consistently use spurious correlations in the data rather than meaningful contextual information., Comment: Accepted WMT20
- Published
- 2020
15. Syntax Representation in Word Embeddings and Neural Networks -- A Survey
- Author
-
Limisiewicz, Tomasz and Mareček, David
- Subjects
Computer Science - Computation and Language - Abstract
Neural networks trained on natural language processing tasks capture syntax even though it is not provided as a supervision signal. This indicates that syntactic analysis is essential to the understating of language in artificial intelligence systems. This overview paper covers approaches of evaluating the amount of syntactic information included in the representations of words for different neural network architectures. We mainly summarize re-search on English monolingual data on language modeling tasks and multilingual data for neural machine translation systems and multilingual language models. We describe which pre-trained models and representations of language are best suited for transfer to syntactic tasks.
- Published
- 2020
16. Universal Dependencies according to BERT: both more specific and more general
- Author
-
Limisiewicz, Tomasz, Rosa, Rudolf, and Mareček, David
- Subjects
Computer Science - Computation and Language - Abstract
This work focuses on analyzing the form and extent of syntactic abstraction captured by BERT by extracting labeled dependency trees from self-attentions. Previous work showed that individual BERT heads tend to encode particular dependency relation types. We extend these findings by explicitly comparing BERT relations to Universal Dependencies (UD) annotations, showing that they often do not match one-to-one. We suggest a method for relation identification and syntactic tree construction. Our approach produces significantly more consistent dependency trees than previous work, showing that it better explains the syntactic abstractions in BERT. At the same time, it can be successfully applied with only a minimal amount of supervision and generalizes well across languages.
- Published
- 2020
- Full Text
- View/download PDF
17. Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
- Author
-
Limisiewicz, Tomasz, primary, Balhar, Jiří, additional, and Mareček, David, additional
- Published
- 2023
- Full Text
- View/download PDF
18. Don’t Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information
- Author
-
Limisiewicz, Tomasz, primary and Mareček, David, additional
- Published
- 2022
- Full Text
- View/download PDF
19. A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank
- Author
-
Malkin, Dan, primary, Limisiewicz, Tomasz, additional, and Stanovsky, Gabriel, additional
- Published
- 2022
- Full Text
- View/download PDF
20. Introducing Orthogonal Constraint in Structural Probes
- Author
-
Limisiewicz, Tomasz, primary and Mareček, David, additional
- Published
- 2021
- Full Text
- View/download PDF
21. Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes
- Author
-
Limisiewicz, Tomasz, primary and Mareček, David, additional
- Published
- 2021
- Full Text
- View/download PDF
22. Universal Dependencies According to BERT: Both More Specific and More General
- Author
-
Limisiewicz, Tomasz, primary, Mareček, David, additional, and Rosa, Rudolf, additional
- Published
- 2020
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.