A team at the Allen Institute for AI (Ai2) in Seattle has introduced a large language model, called FlexOlmo, that gives data owners the option to remove their contributions even after the model has been built. The approach splits training into distinct sub-models so that each contributor's data remains under its owner's control. In a conventional model, once a dataset has been absorbed, that information is effectively locked in; removing it is akin to trying to retrieve the ingredients from a finished cake.
Currently, most AI developers collect vast volumes of text from the web, books, and other repositories, then train a single monolithic network on that data. Once that text is folded into the model’s neural weights, it cannot be separated out again without a full retraining run that can cost millions of dollars. “Conventionally, your data is either in or out,” says Ali Farhadi, CEO of Ai2. “Once I train on that data, you lose control. And you have no way out, unless you force me to go through another multi-million-dollar round of training.”
FlexOlmo addresses this by having each contributor work on a copy of a shared base model, sometimes called the anchor. Data providers download the anchor, fine-tune it on their own licensed or proprietary texts, and then send back only the resulting sub-model and merge metadata. The central team never handles raw documents. Instead, it applies a specialized merge operator that stitches together the various sub-models into a final composite model. Because each contribution remains discrete, it can be extracted or removed at any point.
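In code, that workflow looks roughly like the sketch below: a composite model keeps each contribution as a separate, removable module rather than blending it into the shared weights. The names here (Contribution, CompositeModel, router_embedding) are illustrative stand-ins, not Ai2's actual implementation.

```python
# Minimal sketch of the contribute / merge / remove workflow described above.
# Class and field names are hypothetical, not Ai2's API.

from dataclasses import dataclass

@dataclass
class Contribution:
    owner: str              # who fine-tuned this expert
    expert_weights: dict    # parameters of the fine-tuned expert module
    router_embedding: list  # merge metadata used to route inputs to this expert

class CompositeModel:
    def __init__(self, anchor_weights: dict):
        self.anchor = anchor_weights          # shared public base model
        self.experts: dict[str, Contribution] = {}

    def merge(self, contribution: Contribution) -> None:
        # Each contribution stays a discrete module; nothing is blended
        # into the anchor weights, which is what makes removal cheap.
        self.experts[contribution.owner] = contribution

    def remove(self, owner: str) -> None:
        # Opting out: drop the owner's expert without retraining the rest.
        self.experts.pop(owner, None)

# Usage
model = CompositeModel(anchor_weights={"layer0": [0.1, 0.2]})
model.merge(Contribution("publisher_a", {"layer0": [0.3, 0.1]}, [0.9, 0.1]))
model.merge(Contribution("publisher_b", {"layer0": [0.0, 0.4]}, [0.2, 0.8]))
model.remove("publisher_a")   # publisher A's influence leaves with its expert
print(list(model.experts))    # ['publisher_b']
```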
“The training is completely asynchronous,” says Sewon Min, a research scientist at Ai2 who led the technical work. “Data owners do not have to coordinate, and the training can be done completely independently.”
At its core, FlexOlmo uses a mixture-of-experts design, in which different modules specialize in distinct parts of the input space. Most existing systems that adopt that structure assume the experts must be trained together so that their internal representations stay aligned. Ai2’s key innovation is a representation scheme that lets independently trained experts be merged without loss of quality. During inference, a gating mechanism decides which expert modules to activate, producing predictions that reflect all of the incorporated knowledge yet remain traceable to their original sources.
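A toy version of that routing step might look like the following, assuming each expert is a simple feed-forward function and the gate scores each expert by a dot product with a per-expert embedding; the real FlexOlmo router and merge format are more elaborate.

```python
# Illustrative mixture-of-experts forward pass; not Ai2's actual routing code.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_forward(hidden, experts, router_embeddings):
    """Route one token's hidden state through independently trained experts.

    hidden:            activation vector, shape (d,)
    experts:           list of callables, each an expert feed-forward module
    router_embeddings: one vector per expert; similarity to the hidden state
                       decides how much each expert contributes
    """
    scores = np.array([emb @ hidden for emb in router_embeddings])
    gate = softmax(scores)                       # gating weights, sum to 1
    outputs = np.stack([f(hidden) for f in experts])
    # Weighted combination; the per-expert weights also make each prediction
    # traceable to the data source that trained the expert.
    return gate @ outputs, gate

# Toy usage with two 4-dimensional experts
d = 4
rng = np.random.default_rng(0)
experts = [lambda h, W=rng.normal(size=(d, d)): np.tanh(W @ h) for _ in range(2)]
router_embeddings = [rng.normal(size=d) for _ in range(2)]
out, gate = moe_forward(rng.normal(size=d), experts, router_embeddings)
print("gate weights:", gate.round(3))
```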
To put the idea to the test, researchers assembled a dataset called Flexmix, drawing on both subscription-based texts and materials in the public domain. They trained a FlexOlmo instance with around 37 billion parameters, roughly one tenth the size of Meta’s largest open-source release, and benchmarked it against the standalone models and against other approaches for integrating separately trained networks. FlexOlmo outperformed each of the individual models across a range of tasks, from question answering to text summarization, and scored about ten percent higher on common benchmarks than two alternative merging techniques.
Farhadi highlights the practical advantage of this reversible pipeline: a data contributor can pull its sub-model out of the composite when required, without disrupting service or adding latency. “You could just opt out of the system without any major damage and inference time,” he says. “It’s a whole new way of thinking about how to train these models.”
Percy Liang, an AI researcher at Stanford, sees modular data control as a step toward greater clarity in how models are built. “Providing more modular control over data—especially without retraining—is a refreshing direction that challenges the status quo of thinking of language models as monolithic black boxes,” he says. “Openness of the development process—how the model was built, what experiments were run, how decisions were made—is something that’s missing.”
Ai2 researchers believe that the same merge-and-remove strategy could offer a route for integrating sensitive or restricted data without exposing raw texts to outside parties. Still, the combined model retains those updates in its parameters, and clever probing might reconstruct fragments of the original inputs. To guard against that risk, Ai2 suggests weaving in privacy safeguards such as differential privacy, which adds mathematical limits on how much any single example can influence the final output.
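In practice, differential privacy in this setting would typically mean clipping each training example's gradient and adding calibrated noise during a contributor's fine-tuning, along the lines of the hypothetical sketch below. The clipping norm and noise scale shown are placeholders, not values from the FlexOlmo work.

```python
# Rough sketch of the differential-privacy idea mentioned above: bound each
# example's influence, then add noise calibrated to that bound.
import numpy as np

def dp_gradient_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0,
                     lr=0.1, rng=np.random.default_rng(0)):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Clip so no single example can move the expert's weights too far.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Gaussian noise calibrated to the clipping bound hides any one example.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=avg.shape)
    return -lr * (avg + noise)   # parameter update for one step

grads = [np.array([3.0, -4.0]), np.array([0.5, 0.5])]
print(dp_gradient_step(grads))
```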
Use of copyrighted content in model training has become a flashpoint in recent legal battles. Several publishers have filed suits alleging unauthorized scraping, while others have negotiated licenses to supply text. In June, a federal judge ruled that Meta did not infringe copyright when it trained an open-source model on material by thirteen authors, setting a notable precedent for training-data disputes. As the field works through these intellectual-property questions, methods like FlexOlmo could help define fresh terms for data collaboration.
Sewon Min notes that data access itself often represents the key constraint in pushing model capabilities forward. “I really think the data is the bottleneck in building the state of the art models,” she says. “This could be a way to have better shared models where different data owners can codevelop, and they don’t have to sacrifice their data privacy or control.”

