Why Are Large AI Models Being Red Teamed?

In February, OpenAI announced the arrival of Sora, a stunning “text-to-video” tool. Simply enter a prompt, and Sora generates a realistic video within seconds. But it wasn’t immediately available to the public. Some of the delay is because OpenAI reportedly has a set of experts called a red team who, the company has said, will probe the model to understand its capacity for deepfake videos, misinformation, bias, and hateful content.

Red teaming, while having proved useful for cybersecurity applications, is a military tool that was never intended for widespread adoption by the private sector.

“Done well, red teaming can identify and help address vulnerabilities in AI,” says Brian Chen, director of policy from the New York–based think tank Data & Society. “What it does not do is address the structural gap in regulating the technology in the public interest.”

What is red teaming?

The practice of red teaming derives its early origins from Sun Tzu’s military stratagem from The Art of War: “If you know the enemy and know yourself, you need not fear the result of a hundred battles.” The purpose of red-teaming exercises is to play the role of the adversary (the red team) and find hidden vulnerabilities in the defenses of the blue team (the defenders) who then think creatively about how to fix the gaps.

The practice originated in U.S. government and military circles during the 1960s as a way to anticipate threats from the Soviet Union. Today, it is mostly known as a trusted cybersecurity technique used to help protect computer networks, software, and proprietary data.

That’s the idea, at least. And in cybersecurity, where the role of hackers and the defenders are clear-cut, red teaming has a substantial track record. But how blue and red teams might be apportioned for AI—and what motivates the players in this whole exercise to ultimately act toward, ideally, furthering the public good—is unclear.

In a scenario where red teaming is being used to ostensibly help safeguard society from the potential harms of AI, who plays the blue and red teams? Is the blue team the developers and the red team hackers? Or is the red team the AI model? And who oversees the blue team?

Micah Zenko, author of Red Team: How to Succeed by Thinking Like the Enemy, says the concept of red teaming is not always well-defined and can be varied in its applications. He says AI red teamers should “proceed with caution: Be clear on reasoning, scope, intent, and learning outcomes. Be sure to pressure-test thinking and challenge assumptions.”

Zenko also reveals a glaring mismatch between red teaming and the pace of AI advancement. The whole point, he says, is to identify existing vulnerabilities and then fix them. “If the system being tested isn’t sufficiently static,” he says, “then we’re just chasing the past.”

Why is red teaming now part of AI public policy?

On 30 October last year, President Joe Biden issued Executive Order 14110 instructing the U.S. National Institute of Standards and Technology (NIST) to develop science-based guidelines to support the deployment of safe, secure, and trustworthy systems, including for AI red teaming.

Three months later, NIST has concluded the first few steps toward implementing its new responsibilities—red teaming and otherwise. It has collected public comments on the federal register, announced the inaugural leadership of the U.S. Artificial Intelligence Safety Institute, and started a consortium to evaluate AI systems and improve their trustworthiness and safety.

This, however, is not the Biden administration’s first instance of turning to AI red teaming.

The technique’s popularity in Biden administration circles started earlier in the year. According to Politico, White House officials met with organizers of the hacker conference DEFCON in March and agreed at that time to support a public red-teaming exercise. By May, administration officials announced their support to attempt an AI red teaming exercise at the upcoming DEFCON 31 conference in Las Vegas. Then, as scheduled, in August, thousands descended upon Caesar’s Forum in Las Vegas to test the capacity of AI models to cause harm. As of press time, the results of this exercise have yet to be made public.

What can AI red teaming do?

Like any computer software, AI models share the same cybervulnerabilities: They can be hacked by nefarious actors to achieve a variety of objectives including data theft or sabotage. As such, red teaming can offer one approach for protecting AI models from external threats. For example, Google uses red teaming to protect its AI models from threats such as prompt attacks, data poisoning, and backdooring. Once such vulnerabilities are identified, they can close the gaps in the software.

To address the potential risks of AI, tech developers have built networks of external experts to help them assess the safety and security of their models. However, they tend to hire contractors and require them to sign nondisclosure agreements . The exercises still take place behind closed doors, and results are reported to the public in broad terms.

Especially for the case of AI, experts from Data & Society, a technology think tank, say that red teaming should not take place internally within a company. Zenko suggests that “not only is there a need for independent third-party validation, companies should build cross-functional and multidisciplinary teams—not just engineers and hackers.”

Dan Hendrycks, executive and research director of the San Francisco–based Center for AI Safety, says red teaming shouldn’t be treated as a turnkey solution either. “The technique is certainly useful,” he says. “But it represents only one line of defense against the potential risks of AI, and a broader ecosystem of policies and methods is essential.”

NIST’s new AI Safety Institute now has an opportunity to change the way red teaming is used in AI. The Institute’s consortium of more than 200 organizations has already reportedly begun developing standards for AI red teaming. Tech developers have also begun exploring best practices on their own. For example, Anthropic, Google, Microsoft, and OpenAI have established the Frontier Model Forum (FMF) to develop standards for AI safety and share best practices across the industry.

Chris Meserole, FMF executive director, says that “red teaming can be a great starting point for assessing the potential risks a model might introduce.” However, he adds, AI models at the bleeding edge of technology development demand a range of strategies, not just a tool recycled from cybersecurity—and ultimately from the Cold War.

Red teaming, Meserole says, is far from “a panacea, which is why we’ve been keen to support the development of other evaluation, assessment, and mitigation techniques to assure the safety of frontier AI models.”