Natural Language Processing Lab for the Nordics
Natural Language Processing (NLP) is what happens when Siri, Alexa or Google Translate understands human or so-called “natural” language instead of formal or programming languages. NLP is an interdisciplinary field, bringing together computer science, artificial intelligence and linguistics. In order to study the interactions between computers and human language and develop NLP technology, it is necessary to process and analyse very large amounts of natural language data.
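As a minimal sketch of the kind of processing NLP builds on, the snippet below splits raw text into word tokens and counts their frequencies — the first step before any large-scale analysis. The sample sentence is purely illustrative, not drawn from any NLPL corpus.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

# Illustrative input; real NLP pipelines run this over billions of tokens.
sample = "Computers process natural language by turning text into tokens."
tokens = tokenize(sample)
frequencies = Counter(tokens)

print(tokens)                # the individual word tokens
print(frequencies["tokens"]) # how often a given token occurs
```

Real research pipelines replace this toy tokenizer with linguistically informed tools, but the principle — turning running text into countable units — is the same at any scale.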
The Nordic Language Processing Laboratory (NLPL) is a use case in EOSC-Nordic that connects NLP research groups from various Northern European universities. The NLPL project aims to make Nordic NLP research more competitive at the international level. The vision of NLPL is to implement a virtual language technology laboratory for large-scale NLP research by developing innovative methods for sharing High-Performance Computing (HPC) resources across Nordic country borders.
“We want to enable our doctoral students and master’s students, and also researchers on funded research projects, to carry out computational experiments that are very resource-demanding. They need much bigger computers than are typically available through a local university’s computing infrastructure. And this use case in EOSC-Nordic is paving the road to make Nordic NLP researchers more efficient users of these large-scale national e-infrastructures,” says Stephan Oepen, professor of Machine Learning at the Department of Informatics, University of Oslo.
He is working on this project alongside his colleague Jörg Tiedemann, professor of Language Technology at the University of Helsinki. Together, they represent a larger community of university NLP research groups in four countries that have collaborated for years.
The NLPL collaboration has created new ways to enable data- and compute-intensive NLP research by implementing a common software, data and service stack in multiple Nordic HPC centres. The development of the virtual laboratory has helped improve techniques and recipes for data management across storage systems. Moreover, the project has enabled internationally competitive, data-intensive research and experimentation on a scale that would be difficult to sustain on commodity computing resources.
The NLPL virtual software infrastructure was validated through the creation of the first very large language model for Norwegian. Around two billion tokens of running Norwegian text were used to train the model. This was achieved by combining GPU resources in Norway and Finland, namely the national Saga and Puhti superclusters. The model is now publicly available.
Pooling of competencies within the user community and among expert support teams is another useful feature that, among other things, fosters the training of the next generation of scientists to work on language technology and related services. Since 2018, NLPL has organized an annual winter school on machine learning and scientific computing techniques for NLP. These events target doctoral and post-doctoral fellows and other research staff at Nordic NLP research centres.
The NLPL consortium comprises Nordic research groups in NLP and the national e-infrastructure providers of Finland and Norway. The academic partners are Helsinki University (Finland), IT University Copenhagen (Denmark), University of Copenhagen (Denmark), University of Oslo (Norway), Turku University (Finland), and Uppsala University (Sweden).