Connect with us


Red team methods introduced by Anthropic will close security gaps




Red team methods introduced by Anthropic will close security gaps

It’s time to celebrate the incredible women leading the way in AI! Nominate your inspirational leaders for VentureBeat’s Women in AI Awards today by June 18. More information

AI red teaming is proving effective at discovering security gaps that other security approaches cannot see, preventing AI companies from having their models used to produce objectionable content.

Anthropic has been released AI Red Team Guidelines last week joined a group of AI providers including Googling, Microsoft, NIST, NVIDIA And Open AI, who have also released similar frameworks. But Anthropic’s approach appears to be more comprehensive than these others, offering a human-in-the-middle approach, along with methods to encourage real-time knowledge sharing between red teams.

The goal is to identify and close security gaps in the AI ​​model

All announced frameworks share the common goal of identifying and closing the growing security gaps in AI models.

It is these growing security gaps that have lawmakers and policymakers concerned and pushing for safer, secure, and reliable AI. President Biden’s Safe, Secure, and Trustworthy Artificial Intelligence (14110) Executive Order (EO), which came out on October 30, 2018, says that NIST will “establish appropriate guidance (except for AI used as part of a national security system) , including appropriate procedures and processes, to enable AI developers, especially dual-use foundation models, to conduct AI red-teaming testing to enable the deployment of safe, secure and reliable systems.”

VB Transform 2024 Registration is open

Join business leaders in San Francisco from July 9 to 11 for our flagship AI event. Connect with colleagues, explore the opportunities and challenges of generative AI, and learn how to integrate AI applications into your industry. register now

NIST released two draft publications in late April to help manage the risks of generative AI. They are additional resources to NIST’s AI Risk Management Framework (AI RMF) and Secure Software Development Framework (SSDF).

The German Federal Office for Information Security (BSI) offers red teaming as part of the broader IT Grundschutz framework. Australia, Canada, the European Union, Japan, the Netherlands and Singapore have notable frameworks. The European parliament passed the EU law on artificial intelligence in March this year.

Red teaming AI models are based on iterations of randomized techniques

Red teaming is a technique that interactively tests AI models to simulate diverse, unpredictable attacks, with the aim of determining where their areas of strength and weakness are located. Generative AI (genAI) models are exceptionally difficult to test because they mimic human-generated content at scale.

The goal is to get models to do and say things they are not programmed to do, including exposing biases. They rely on LLMs to automate prompt generation and attack scenarios to find and correct model weaknesses at scale. Models can easily be ‘jailbroken’ to create hate speech, pornography, use copyrighted material, or retrieve source dataincluding social security and telephone numbers.

A recent VentureBeat interview with ChatGPT’s most prolific jailbreaker and other leading LLMs illustrates why red teaming must take a multimodal, multifaceted approach to the challenge.

The value of Red Teaming in improving the security of AI models continues to be proven in industry-wide competitions. One of the four methods Anthropic mentions in their blog post is crowdsourced red teaming. In recent years DEF CON hosted the very first Generative Red Team (GRT) Challenge, considered one of the more successful applications of crowdsourcing techniques. Models were provided by Anthropic, Cohere, Google, Hugging Face, Meta, Nvidia, OpenAI and Stability. Participants in the challenge tested the models on an evaluation platform developed by Scale AI.

Anthropic publishes their AI red team strategy

In releasing their methods, Anthropic emphasizes the need for systematic, standardized testing processes that are scalable and reveals that the lack of standards has slowed progress in AI redteaming across the industry.

“In an effort to contribute to this goal, we are sharing an overview of some of the red teaming methods we have explored and showing how they can be integrated into an iterative process from qualitative red teaming to the development of automated evaluations,” Anthropic writes in the blog post.

The four methods Anthropic mentions include domain-specific expert red teaming, using red team language models, red teaming in new modalities, and general open-ended red teaming.

Anthropic’s approach to red teaming ensures that human-in-the-middle insights enrich the quantitative results of other red teaming techniques and provide contextual intelligence. There is a balance between human intuition and knowledge and automated text data that needs that context to determine how models are updated and made more secure.

An example of this is how Anthropic is going all-in on domain-specific expertise teams by relying on experts while prioritizing Policy Vulnerability Testing (PVT), a qualitative technique to identify and implement security safeguards for many of the most challenging areas where they are located. are compromised. Electoral interference, extremism, hate speech and pornography are among the many areas where models need to be refined to reduce bias and abuse.

Every AI company that has released an AI Red Team framework automates their testing with models. Essentially, they create models to perform random, unpredictable attacks that are most likely to result in target behavior. “As models become more capable, we are interested in ways we can use them to complement manual testing with automated red teaming performed by the models themselves,” says Anthropic.

Anthropic relies on the red team/blue team dynamic and uses models to generate attacks in an attempt to induce a target behavior, relying on red team techniques to deliver results. These results are used to refine the model and make it more robust and robust against similar attacks, which is the core of blue teaming. Anthropic notes that “we can run this process repeatedly to come up with new attack vectors and, ideally, make our systems more robust against a range of adversarial attacks.”

Multimodal red teaming is one of the most fascinating and necessary areas Anthropic is pursuing. Testing AI models with image and audio input is one of the most challenging to get right, as attackers have successfully embedded text in images that can redirect models to bypass protections, as multimodal prompt injection attacks have proven . The Claude 3 Series models accept visual information in a wide variety of formats and provide text-based output in responses. Anthropic writes that they extensively tested Claude 3’s multimodalities before releasing it to reduce potential risks, including fraudulent activity, extremism, and threats to child safety.

Open general red teaming balances the four methods with greater contextual insight and intelligence from the human at the center. Crowdsourcing red teaming and community-based red teaming are essential for gaining insights that are not available with other techniques.

Protecting AI models is a moving target

Red teaming is essential to protect models and ensure they remain safe and secure. Attacker craft continues to accelerate faster than many AI companies can keep up, further demonstrating how this field is still in its infancy. Automating red teaming is a first step. Combining human insight and automated testing is the key to the future of model stability, security and safety.