
EleutherAI, an AI research organization, has released what it describes as one of the largest collections of licensed and open-domain text for training AI models. The dataset, called the Common Pile v0.1, took roughly two years to assemble in collaboration with AI startups such as Poolside and Hugging Face, as well as multiple academic institutions. Weighing in at 8 terabytes, the Common Pile v0.1 is available for download on Hugging Face's AI dev platform and GitHub. It was developed in consultation with legal professionals and draws on a range of sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open source speech-to-text model, to transcribe audio content.

Copyright Lawsuits Fuel AI Transparency Crisis

AI companies, including OpenAI, are currently facing lawsuits over their training practices, which involve scraping the web, including copyrighted material such as books and research journals, to build training datasets. According to EleutherAI, these lawsuits have drastically reduced transparency among AI companies, harming the broader field of AI research by making it harder to understand how models work and to identify their flaws. Stella Biderman, executive director of EleutherAI, said that copyright lawsuits have not significantly changed how companies source data for training their models, but they have sharply diminished the transparency those companies are willing to provide. In conversations with various companies, she added, researchers have pointed to the lawsuits as the main barrier preventing them from publicly sharing their work in data-intensive areas of research.

"Comma" Models Prove Viability of Licensed Data

EleutherAI used the Common Pile v0.1 to train two new AI models, Comma v0.1-1T and Comma v0.1-2T, which it claims perform comparably to models developed using unlicensed, copyrighted data. According to EleutherAI, the two models demonstrate that the Common Pile v0.1 was curated carefully enough to let developers build models competitive with proprietary alternatives. Each model contains 7 billion parameters and was trained on only a fraction of the Common Pile v0.1, yet is on par with Meta's first Llama AI model on benchmarks for coding, image understanding, and math. (Parameters, sometimes called weights, are the internal components of an AI model that guide its behavior and responses.) In her post, Biderman argued that the common belief that unlicensed text drives performance is unjustified, and that as the amount of accessible openly licensed and public domain data grows, the quality of models trained on such content should improve.

Correcting Past Mistakes & Commitment to Open Data

The Common Pile v0.1 appears to be, in part, an effort to correct EleutherAI's past mistakes. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material, and AI companies have faced criticism and legal action for using it to train their models. Going forward, EleutherAI says it is committed to releasing open datasets more frequently in partnership with its research and infrastructure collaborators.