
Insights from the LocLearn™ Upskill Course on Multilingual NLP


Content contributors: Minting Lu, Hongru Xu, and Ruiyi Zhang

Word cloud created in the course

You don't need to have an advanced degree in linguistics or be a Python coding expert to use Natural Language Processing (NLP) to prepare the high-quality language data essential for training chatbots, analyzing customer sentiment, or fine-tuning a Large Language Model (LLM). In this course, Dr. Rafał Jaworski taught us how to use Google Colab, a browser-based tool for executing Python code, to pre-process and organize language data using NLP operations.


This course covered key NLP pre-processing steps, including tokenization, segmentation, stemming, lemmatization, parsing, and named-entity recognition. We also learned to use clustering and classification techniques to organize our multilingual content. Clustering involves grouping documents that are similar, as determined by the most significant terms used in them. Classification builds on this by assigning the clusters to categories (e.g., a category of “Music & Pop Culture” for a cluster with significant terms such as music, film, song, band, and people), making it easier to manage large amounts of language data.
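
To make these steps concrete, here is a minimal sketch of what a Colab cell for them might look like, using the open-source NLTK and scikit-learn libraries; the library choices and sample texts are illustrative assumptions, not necessarily what the course used.

```python
# A minimal sketch of the pre-processing and clustering steps described above,
# using NLTK and scikit-learn. Sample texts and library choices are illustrative.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

for resource in ("punkt", "punkt_tab", "wordnet"):
    nltk.download(resource, quiet=True)  # tokenizer and lemmatizer data

# Tokenization, stemming, and lemmatization on a single sentence
sentence = "The bands were playing their best songs for the people."
tokens = nltk.word_tokenize(sentence)
stems = [PorterStemmer().stem(t) for t in tokens]
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]
print(tokens)
print(stems)
print(lemmas)

# Clustering: group documents by their most significant (TF-IDF weighted) terms
documents = [
    "The band released a new song and a music film.",
    "The singer's latest album topped the pop charts.",
    "The parliament passed a new law on data privacy.",
    "The government announced new regulations for online data.",
]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(documents)
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf_matrix)
print(cluster_labels)  # e.g. [0 0 1 1]: music documents vs. policy documents
```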


Students could immediately apply what they learned and see how each pre-processing step affected the quality of the resulting language data. We tested how various AI models, including ChatGPT, Gemini, and Claude, handle language processing tasks, and we observed meaningful differences in their behavior. For example, some tokenize "self-driving" as a single token, while others split it into separate word and punctuation tokens ("self," "-," and "driving").
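
The AI models above rely on their own tokenizers, but the same kind of divergence can be reproduced with two open-source tokenizers from NLTK, as in this sketch:

```python
# Two NLTK tokenizers handling "self-driving" differently. This only illustrates
# the kind of divergence described above, not the models' own tokenizers.
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

sentence = "The self-driving car stopped."

print(TreebankWordTokenizer().tokenize(sentence))
# ['The', 'self-driving', 'car', 'stopped', '.']  -> hyphenated word kept whole

print(WordPunctTokenizer().tokenize(sentence))
# ['The', 'self', '-', 'driving', 'car', 'stopped', '.']  -> split at punctuation
```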


Tokenization is a foundational NLP operation that influences downstream tasks, such as part-of-speech tagging, and can significantly shape final results. Encountering these differences in implementation across AI models underscored an essential lesson: early-stage NLP operations must be defined and executed carefully to achieve the best long-term outcome. We applied this understanding to a parallel corpus of our choice, obtained from OPUS, an open collection of parallel corpora used in many NLP research projects. We then cleaned and filtered the selected corpus by removing empty rows, identifying extraneous or non-parallel punctuation, and applying additional NLP pre-processing operations. The goal was to produce the highest-quality language data possible.
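
As a rough illustration, the sketch below cleans a parallel corpus stored as two aligned plain-text files (one per language), the way OPUS typically distributes Moses-format exports; the file names and filtering thresholds are illustrative assumptions rather than course specifics.

```python
# A minimal sketch of cleaning a line-aligned parallel corpus downloaded from
# OPUS. File names and filtering thresholds are illustrative assumptions.
def clean_parallel_corpus(src_path, tgt_path, max_length_ratio=3.0):
    cleaned = []
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            src_line, tgt_line = src_line.strip(), tgt_line.strip()
            # Drop empty rows on either side
            if not src_line or not tgt_line:
                continue
            # Drop likely non-parallel rows: very different lengths
            ratio = max(len(src_line), len(tgt_line)) / min(len(src_line), len(tgt_line))
            if ratio > max_length_ratio:
                continue
            # Drop rows whose punctuation counts diverge sharply (possible misalignment)
            punct = set(".,!?;:")
            src_punct = sum(ch in punct for ch in src_line)
            tgt_punct = sum(ch in punct for ch in tgt_line)
            if abs(src_punct - tgt_punct) > 5:
                continue
            cleaned.append((src_line, tgt_line))
    return cleaned

# Hypothetical usage with an English-Polish corpus exported from OPUS
pairs = clean_parallel_corpus("corpus.en", "corpus.pl")
print(f"Kept {len(pairs)} sentence pairs after cleaning")
```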


This optimized language data can be used in many ways: training chatbots and automated virtual assistants, gathering customer feedback, assessing consumer sentiment on social media, and training or fine-tuning a language model, among other applications. One student in the class, who once taught English as a Second Language (ESL) in China, observed that NLP's part-of-speech tagging, for example, could help English language learners better understand sentence structure and verb usage. Regardless of the use case, carefully refined language data enhances the quality of the output.
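
For instance, a quick part-of-speech tagging pass with NLTK (one possible toolkit among several; the tag names follow Penn Treebank conventions) might look like this:

```python
# Part-of-speech tagging sketched with NLTK; tags follow the Penn Treebank
# tagset (PRP = pronoun, VBZ = 3rd-person singular present verb, VBN = past participle).
import nltk

for resource in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

sentence = "She has studied English since she moved to Shanghai."
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# e.g. [('She', 'PRP'), ('has', 'VBZ'), ('studied', 'VBN'), ('English', 'NNP'), ...]
```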


Many thanks to Dr. Rafał Jaworski for creating and facilitating this "Multilingual NLP" course, his second in a series focused on AI, for the LocLearn™ Upskill School.  

 
 
 
