Saturday, September 7

An Industry Insider Drives an Open Alternative to Big Tech’s A.I.

Ali Farhadi is no tech rebel.

The 42-year-old computer scientist is a highly respected researcher, a professor at the University of Washington and the founder of a start-up that was acquired by Apple, where he worked until four months ago.

But Mr. Farhadi, who in July became chief executive of the Allen Institute for AI, is calling for “radical openness” to democratize research and development in a new wave of artificial intelligence that many believe is the most important technology advance in decades.

The Allen Institute has begun an ambitious initiative to build a freely available A.I. alternative to tech giants like Google and start-ups like OpenAI. In an industry process called open source, other researchers will be allowed to scrutinize and use this new system and the data fed into it.

The stance adopted by the Allen Institute, an influential nonprofit research center in Seattle, puts it squarely on one side of a fierce debate over how open or closed new A.I. should be. Would opening up so-called generative A.I., which powers chatbots like OpenAI’s ChatGPT and Google’s Bard, lead to more innovation and opportunity? Or would it open a Pandora’s box of digital harm?

Definitions of what “open” means in the context of generative A.I. vary. Traditionally, software projects have opened up the underlying “source” code for programs. Anyone can then look at the code, spot bugs and make suggestions. There are rules governing whether changes get made.

That is how popular open-source projects behind the widely used Linux operating system, the Apache web server and the Firefox browser operate.

But generative A.I. technology involves more than code. The A.I. models are trained and fine-tuned on round after round of enormous amounts of data.

However well intentioned, experts warn, the path the Allen Institute is taking is inherently risky.

“Decisions about the openness of A.I. systems are irreversible, and will likely be among the most consequential of our time,” said Aviv Ovadya, a researcher at the Berkman Klein Center for Internet & Society at Harvard. He believes international agreements are needed to determine what technology should not be publicly released.

Generative A.I. is powerful but often unpredictable. It can instantly write emails, poetry and term papers, and reply to any imaginable question with humanlike fluency. But it also has an unnerving tendency to make things up in what researchers call “hallucinations.”

The leading chatbot makers, Microsoft-backed OpenAI and Google, have kept their newer technology closed, not revealing how their A.I. models are trained and tuned. Google, in particular, had a long history of publishing its research and sharing its A.I. software, but it has increasingly kept its technology to itself as it has developed Bard.

That approach, the companies say, reduces the risk that criminals hijack the technology to further flood the internet with misinformation and scams or engage in more dangerous behavior.

Supporters of open systems acknowledge the risks but say having more smart people working to combat them is the better solution.

When Meta released an A.I. model called LLaMA (Large Language Model Meta AI) this year, it created a stir. Mr. Farhadi praised Meta’s move, but does not think it goes far enough.

“Their approach is basically: I’ve done some magic. I’m not going to tell you what it is,” he said.

Mr. Farhadi proposes disclosing the technical details of A.I. models, the data they were trained on, the fine-tuning that was done and the tools used to evaluate their behavior.

The Allen Institute has taken a first step by releasing a huge data set for training A.I. models. It is made of publicly available data from the web, books, academic journals and computer code. The data set is curated to remove personally identifiable information and toxic language like racist and obscene phrases.

In the editing, judgment calls are made. Will removing some language deemed toxic decrease the ability of a model to detect hate speech?

The Allen Institute data trove is the largest open data set currently available, Mr. Farhadi said. Since it was released in August, it has been downloaded more than 500,000 times on Hugging Face, a site for open-source A.I. resources and collaboration.

At the Allen Institute, the data set will be used to train and fine-tune a large generative A.I. program, OLMo (Open Language Model), which will be released this year or early next.

The big commercial A.I. models, Mr. Farhadi said, are “black box” technology. “We’re pushing for a glass box,” he said. “Open up the whole thing, and then we can talk about the behavior and explain partly what’s happening inside.”

Only a handful of core generative A.I. models of the size that the Allen Institute has in mind are openly available. They include Meta’s LLaMA and Falcon, a project backed by the Abu Dhabi government.

The Allen Institute seems like a logical home for a big A.I. project. “It’s well funded but operates with academic values, and has a history of helping to advance open science and A.I. technology,” said Zachary Lipton, a computer scientist at Carnegie Mellon University.

The Allen Institute is working with others to push its open vision. This year, the nonprofit Mozilla Foundation put $30 million into a start-up, Mozilla.ai, to build open-source software that will initially focus on developing tools that surround open A.I. engines, like the Allen Institute’s, to make them easier to use, monitor and deploy.

The Mozilla Foundation, which was founded in 2003 to promote keeping the internet a global resource open to all, worries about a further concentration of technology and economic power.

“A tiny set of players, all on the West Coast of the U.S., is trying to lock down the generative A.I. space even before it really gets out the gate,” said Mark Surman, the foundation’s president.

Mr. Farhadi and his team have spent time trying to control the risks of their openness strategy. For example, they are working on ways to evaluate a model’s behavior in the training stage and then prevent certain actions like racial discrimination and the making of bioweapons.

Mr. Farhadi regards the guardrails in the big chatbot models as Band-Aids that clever hackers can easily tear off. “My argument is that we should not let that kind of knowledge be encoded in these models,” he said.

People will do bad things with this technology, Mr. Farhadi said, as they have with all powerful technologies. The task for society, he added, is to better understand and manage the risks. Openness, he contends, is the best bet to find safety and share economic opportunity.

“Regulation won’t solve this by itself,” Mr. Farhadi said.

The Allen Institute effort faces some formidable hurdles. A major one is that building and improving a big generative model requires lots of computing firepower.

Mr. Farhadi and his colleagues say emerging software techniques are more efficient. Still, he estimates that the Allen Institute initiative will require $1 billion worth of computing over the next couple of years. He has begun trying to assemble support from government agencies, private companies and tech philanthropists. But he declined to say whether he had lined up backers or to name them.

If he succeeds, the larger test will be nurturing a lasting community to support the project.

“It takes an ecosystem of open players to really make a dent in the big players,” said Mr. Surman of the Mozilla Foundation. “And the challenge in that kind of play is just patience and tenacity.”