Since the ChatGPT moment, generative language models have been used, among other things, to make the "world knowledge" they contain accessible in an understandable form. How precise a language model is depends above all on the data it can draw on and the computing power invested in it. Because English dominates the Internet, language models work better for queries in English, and well-funded companies are better placed to provide the necessary computing power.
Less widely spoken languages and non-commercial projects therefore need innovative approaches to compensate for these disadvantages and to serve areas that do not appear profitable. This is where the Occiglot project comes in: it forms a community of researchers, language experts, software developers and users. By pooling common interests, the project aims to cover all 24 official languages of the European Union as well as further unofficial and regional languages. The first version of Occiglot was made possible by computing resources at DFKI and the AI service center hessian.AISC, which is funded by the Federal Ministry of Education and Research (BMBF).
The Parliamentary State Secretary at the BMBF, Mario Brandenburg, emphasizes: "This is an example of the high value of academic freedom in our society. Through the free exchange between scientists from the disciplines of artificial intelligence (AI) and language technology, an idea has emerged that directly serves European language sovereignty. I wish the Occiglot project wide dissemination and the participation of committed people with diverse language backgrounds. Open source is the right framework for a project with this objective and this history."
"The development of European language models is key to maintaining Europe's academic and economic competitiveness and its digital and AI sovereignty. It is also necessary to achieve the goal of digital language equality in Europe," adds Prof. Dr. Georg Rehm, Principal Researcher and Research Fellow at DFKI in Berlin.
European research collective and call for collaboration
Occiglot sees itself as an open European collective of researchers from organizations and initiatives such as the German Research Center for Artificial Intelligence (DFKI), hessian.AI, TU Darmstadt, the Catholic University of Leuven (Belgium), the Barcelona Supercomputing Center (BSC, Spain) and a number of other teams.
The Occiglot initiative is actively seeking collaborations within the international AI and NLP community and feedback from users.
Supported by DFKI, hessian.AI and BMBF
The conception of Occiglot was driven largely by researchers at the DFKI labs in Darmstadt and Berlin. The hessian.AI Innovation Lab (funded by the Hessian Ministry for Digital Strategy and Innovation) and the hessian.AISC Service Center (funded by the Federal Ministry of Education and Research, BMBF) support Occiglot by providing computing time on their AI supercomputer fortytwo.
Kristian Kersting, head of the Fundamentals of Systemic AI research department at DFKI in Darmstadt and co-director of hessian.AI, emphasizes the success of the collaboration in the network: "Future language models - whether larger than ChatGPT or so small that they fit on a cell phone, whether open or proprietary - will still have a few surprises in store in terms of their performance. We need more of these synergies so that we can exploit the enormous potential for Germany and Europe. We need a strong AI ecosystem with a corresponding computing infrastructure and models that are also accessible and economically viable for companies."
The curation of the training data is also partially funded by the German Federal Ministry for Economic Affairs and Climate Protection (BMWK) via the OpenGPT-X project (project number 68GX21007D).
Occiglot-LLM Release v0.1
The first release comprises ten language models, each with seven billion parameters. They form the first version of a planned series and focus initially on the five largest European languages: English, German, French, Spanish and Italian.
Starting from Mistral-7B, an open-source model already trained for English, bilingual continual pre-training and subsequent instruction tuning were performed for each language. In addition, a multilingual model covering all five languages was trained.
In total, 700 billion additional multilingual tokens were used during continual pre-training and about one billion tokens for instruction tuning.
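To illustrate the general recipe, the following is a minimal sketch of bilingual continual pre-training with the Hugging Face transformers library. It is not the Occiglot team's actual training pipeline: the corpus file, hyperparameters and output name are placeholders, and a real run at this scale requires a distributed setup on hardware such as fortytwo.

```python
# Minimal continual pre-training sketch, NOT the Occiglot training pipeline.
# "mistralai/Mistral-7B-v0.1" is the public base model; the corpus file and
# hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships no pad token
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Placeholder: a mixed German/English plain-text corpus, one document per line.
dataset = load_dataset("text", data_files={"train": "bilingual_corpus.txt"})

def tokenize(batch):
    # Plain causal-LM objective: tokenize raw text, truncate to the block size.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="occiglot-7b-de-en-sketch",  # hypothetical output name
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False => standard next-token (causal) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```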
All language models (with and without instruction tuning) are available under the Apache 2.0 license on the Hugging Face platform.
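Because the checkpoints are hosted on the Hugging Face Hub, a released model can be tried out in a few lines with transformers. The identifier below, occiglot/occiglot-7b-eu5-instruct, is assumed here for the multilingual instruction-tuned variant; the exact names of all ten models should be taken from the Occiglot organization page on the Hub.

```python
# Usage sketch with Hugging Face transformers. The model identifier is an
# assumption; check the Occiglot page on the Hub for the released names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "occiglot/occiglot-7b-eu5-instruct"  # assumed multilingual instruct model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Chat-style prompt in German, assuming the instruct checkpoints ship a chat template.
messages = [{"role": "user", "content": "Wie heißt die Hauptstadt von Hessen?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```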
Roadmap
The main focus of the Occiglot initiative in the coming months will be to develop a coherent approach to building a language model that supports all 24 official languages of the European Union as well as several unofficial and regional languages.
To this end, approximately 1 trillion tokens of non-English pre-training data have already been collected. This corpus will be continuously expanded with additional data contributed by Occiglot community members and gathered through further crawling of the Internet.