French private AI lab PleIAs “is committed to training LLMs in the open,” they write in a blog post at Mozilla.org. “This means not only releasing our models but also being open about every aspect, from the training data to the training code. We define ‘open’ strictly: all data must be both accessible and under permissive licenses.”
On Wednesday PleIAs announced the release of what it calls the largest open multilingual pretraining dataset, according to its blog post on Hugging Face:
Many have claimed that training large language models requires copyrighted data, making truly open AI development impossible. Today, Pleias is proving otherwise with the release of Common Corpus (part of the AI Alliance Open Trusted Data Initiative) — the largest fully open multilingual dataset for training LLMs, containing over 2 trillion tokens of permissibly licensed content with provenance information (2,003,039,184,047 tokens).
As developers are responding to pressures from new regulations like the EU AI Act, Common Corpus goes beyond compliance by making our entire permissibly licensed dataset freely available on HuggingFace, with detailed documentation of every data source. We have taken extensive steps to ensure that the dataset is high-quality and is curated to train powerful models. Through this release, we are demonstrating that there doesn’t have to be such a [heavy] trade-off between openness and performance.
Common Corpus is:
— Truly Open: contains only data that is permissively licensed and provenance is documented
— Multilingual: mostly representing English and French data, but contains at least 1B tokens for over 30 languages
— Diverse: consisting of scientific articles, government and legal documents, code, and cultural heritage data, including books and newspapers
— Extensively Curated: spelling and formatting has been corrected from digitized texts, harmful and toxic content has been removed, and content with low educational content has also been removed.
Common corpus builds on a growing ecosystem of large, open datasets, such as Dolma, FineWeb, RefinedWeb. The Common Pile currently in preparation under the coordination of Eleuther is built around the same principle of using permissible content in English language and, unsurprisingly, there were many opportunities for collaborations and shared efforts. But even together, these datasets do not provide enough training data for models much larger than a few billion parameters. So in order to expand the options for open model training, we still need more open data…
Based on an analysis of 1 million user interactions with ChatGPT, the plurality of user requests are for creative compositions… The kind of content we actually need — like creative writing — is usually tied up in copyright restrictions. Common Corpus tackles these challenges through five carefully curated collections…
Last week AMD also released AMD OLMo, its first series of fully open 1-billion-parameter language models.
And last month VentureBeat reported that the non-profit Allen Institute for AI had unveiled Molmo, “an open-source family of state-of-the-art multimodal AI models which outperform top proprietary rivals including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 on several third-party benchmarks.”