What is AI red teaming?

AI red teaming is the practice of systematically testing AI systems to find vulnerabilities, safety issues, and unintended behaviors before deployment. Microsoft's AI Red Team has red teamed over 100 generative AI products since 2018, finding that simple system-level attacks are often more effective than complex model-targeting algorithms.

What are current jailbreak success rates against LLMs?

Reported results vary widely. Published studies measured an Attack Success Rate (ASR) of 87.2% against GPT-4 and 82.5% against Claude 2, while Claude 3.7 Sonnet resisted 100% of jailbreak attempts in testing. Multi-turn human jailbreaks achieve over 70% ASR even against models with strong single-turn defenses. Defense systems such as Anthropic's Constitutional Classifiers reduced jailbreak success from 86% to 4.4%.

What tools are used for AI red teaming?

Leading tools include Microsoft PyRIT (open-source toolkit with TAP, PAIR, and Crescendo attack strategies) and NVIDIA Garak (LLM vulnerability scanner with 120+ vulnerability categories). Both support multiple model providers and are actively maintained by their respective AI security teams.

Is AI red teaming required by regulations?

Yes. The EU AI Act requires documented testing for high-risk AI systems, and Article 55 mandates pre-release adversarial testing for general-purpose AI models with systemic risk. NIST AI RMF, a voluntary framework, recommends continuous adversarial testing throughout the AI lifecycle. Fines under the EU AI Act reach up to €35 million or 7% of global turnover at the top tier.

AI Red Teaming Guide: Testing AI Systems

State of the field

Q1 → Q2 2026 update brief

Federal red-teaming requirement under EO 14110 was rescinded on 20 January 2025 by EO 14148 and replaced by EO 14179 "Removing Barriers to American Leadership in AI." OMB M-25-21 (AI use) and M-25-22 (AI acquisition), issued 3 April 2025, supersede M-24-10. M-25-22 binds federal contracts awarded or renewed on or after 1 October 2025.^[A1][A2][A3]

DEF CON 33 GRT 3 (7–10 Aug 2025) pivoted from direct model jailbreaks to red-teaming the evaluations that establish a model’s performance — bounties paid for findings that demonstrate evals are incomplete or wrong. Multi-model setup with model cards scoping each model’s intent. Designed to address the GRT-1/2 problem where vendor-actionable artefacts were rarely produced.^[A4]

Anthropic Constitutional Classifiers reduced jailbreak attack-success rate from 86% → 4.4% on Claude. Feb 2025 bug bounty: 339 participants, ~300,000 interactions across 8 CBRN difficulty levels — a single universal jailbreak found. Next-generation classifiers cut overhead to ~1% with 0.05% benign refusal rate on Sonnet 4.5 traffic.^[A5][A6]

MITRE ATLAS v5.4.0 (Feb 2026): 14 agentic-AI techniques added across 2025; new "Publish Poisoned AI Agent Tool" technique (e.g. malicious MCP servers).^[A7]

AI red teaming has evolved from an ad-hoc practice to a critical security discipline. Major AI labs, enterprises, and regulators now recognize structured adversarial testing as essential for safe AI deployment.

Industry Investment

Big Tech’s AI spending surpassed $240 billion in 2024 alone (Dataconomy), but approaches to red teaming vary dramatically. Microsoft leads in transparency, having published detailed case studies from 100+ products. Anthropic invests extensively in domain expert testing. Many organizations, however, still treat red teaming as a checkbox exercise.^[1]

Major lab efforts

Microsoft AI Red Team

Formed 2018 · 100+ products tested

One of the first red teams to cover both security and responsible AI. Published comprehensive white paper in January 2025 with 8 key lessons. Open-sourced PyRIT framework for automated attack orchestration.^[2]

Anthropic

Frontier Red Team · Constitutional AI

Pioneered automated red teaming with model-vs-model loops. In the cyber domain, Claude improved from "high schooler to undergraduate level" in CTF exercises in one year. Developed Constitutional Classifiers, reducing jailbreak success from 86% to 4.4%.^[3]

NVIDIA

Garak LLM Vulnerability Scanner

Released Garak open-source scanner with 120+ vulnerability categories. Leon Derczynski leads development—also on OWASP LLM Top 10 core team. Presented at Black Hat USA 2024 to "heavily attended" sessions.^[4]

Large-scale public testing

Event	Scale	Key Finding
DEF CON 2023	2,244 hackers, 17,000+ conversations, 8 LLMs	21 harm categories tested across Anthropic, OpenAI, Google, Meta
Crucible Platform	214,271 attempts, 1,674 users, 400 days	Average user: 128 attack attempts across 4 challenges
HackerOne AI Challenge	300,000+ interactions, 3,700+ hours	Zero universal jailbreaks discovered
Anthropic Bug Bounty	183 participants, 3,000+ hours, 2 months	Constitutional Classifiers reduced success to 4.4%

Attack success rates by model

Research reveals significant variation in model vulnerability. Attack Success Rate (ASR) measures the percentage of adversarial prompts that successfully bypass safety measures. Without dedicated defenses, most models show high vulnerability.
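The ASR definition above can be written out as a short calculation. This is a minimal sketch: the `Attempt` record and the sample data are hypothetical, and real studies differ in how they judge whether a prompt "bypassed" the safety measures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One adversarial prompt and whether it bypassed the safety measures."""
    prompt_id: str
    bypassed: bool


def attack_success_rate(attempts: list[Attempt]) -> float:
    """Return ASR as a percentage: successful bypasses / total attempts * 100."""
    if not attempts:
        raise ValueError("ASR is undefined for an empty attempt list")
    successes = sum(1 for a in attempts if a.bypassed)
    return 100.0 * successes / len(attempts)


# Illustrative data only: 3 of 4 hypothetical prompts bypass the safeguards.
attempts = [
    Attempt("p1", True),
    Attempt("p2", True),
    Attempt("p3", False),
    Attempt("p4", True),
]
print(f"ASR: {attack_success_rate(attempts):.1f}%")  # ASR: 75.0%
```

Because the denominator is the number of prompts tried, ASR figures are only comparable when the prompt sets and the success criteria match.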

Baseline Vulnerability (No Additional Defenses)

Model	ASR
GPT-4	87.2%
Claude 2	82.5%
Mistral 7B	71.3%
Vicuna	69.4%
DeepSeek R1	68%

Source: Red Teaming the Mind of the Machine (2024), various model-specific studies^[5]

With Defensive Measures

Claude with Constitutional Classifiers: 4.4% ASR, down from an 86% baseline

Claude 3.7 Sonnet: 100% jailbreak resistance in testing

Key Insight

The gap between defended and undefended models is dramatic. Microsoft warns that RLHF and alignment techniques "make jailbreaking more difficult but not impossible." A zero-trust approach—assuming any model could be jailbroken—combined with layered defenses is essential.^[6]

Attack techniques and effectiveness

Research has quantified the effectiveness of different attack techniques. Roleplay-based attacks consistently outperform technical encoding tricks, and multi-turn human attacks keep success rates above 70% even against well-defended models, where automated single-turn attacks fall below 10%.

Technique	ASR	Mechanism
Roleplay/Persona	89.6%	Impersonation, fictional characters, hypothetical scenarios
Logic Traps	81.4%	Conditional structures, moral dilemmas, contradictions
Encoding Tricks	76.2%	Base64, zero-width characters, leetspeak, ROT13
Multi-Turn Human	70%+	Gradual escalation across conversation turns
Automated Single-Turn	<10%	Against well-defended models with safety layers
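The encoding tricks listed in the table can be reproduced with standard-library transforms, which is useful for checking whether an input filter normalizes text before matching. The sketch below is an assumption-laden illustration: the converter names, the leetspeak mapping, and the harmless "canary test" probe are all hypothetical and not taken from any tool's API.

```python
import base64
import codecs

ZERO_WIDTH_SPACE = "\u200b"
# Hypothetical leetspeak mapping; real-world variants differ.
LEET_MAP = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})


def to_base64(text: str) -> str:
    """Encode the text as Base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def to_rot13(text: str) -> str:
    """Rotate ASCII letters by 13 positions."""
    return codecs.encode(text, "rot13")


def to_leetspeak(text: str) -> str:
    """Lowercase the text and swap selected letters for digits."""
    return text.lower().translate(LEET_MAP)


def with_zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


CONVERTERS = {
    "base64": to_base64,
    "rot13": to_rot13,
    "leetspeak": to_leetspeak,
    "zero_width": with_zero_width,
}


def encoded_variants(probe: str) -> dict[str, str]:
    """Return every encoded form of a probe string, keyed by converter name."""
    return {name: convert(probe) for name, convert in CONVERTERS.items()}


# A harmless canary string stands in for a real test prompt.
variants = encoded_variants("canary test")
for name, value in variants.items():
    print(f"{name}: {value!r}")
```

A filter that blocks the plain canary but passes any of these variants is matching on surface form only, which is the weakness this class of attack relies on.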

Multi-Turn Gap

Human red teamers achieved attack success rates 19-65% higher than ensembles of automated attacks across multiple LLM defenses. This suggests single-turn automated testing significantly underestimates real-world risk.^[7]

Attack taxonomy

Prompt-Level Attacks

Direct prompt injection
Indirect prompt injection
Jailbreaking / DAN
Goal hijacking
Instruction extraction
Context manipulation

Data Extraction Attacks

Training data extraction
System prompt leakage
Context window exfiltration
PII extraction
Model inversion
Membership inference

Safety Bypass Attacks

Harmful content generation
Bias amplification
Ethical guardrail bypass
Content policy evasion
Multi-turn manipulation
Language switching

Capability Abuse Attacks

Tool/function abuse
Unauthorized API calls
Privilege escalation
Resource exhaustion
Chained attack sequences
Cross-system exploitation

Microsoft’s 8 key lessons from 100+ products

In January 2025, Microsoft’s AI Red Team published their comprehensive white paper detailing lessons from testing over 100 generative AI products since 2018. These findings represent the most extensive industry experience with structured AI adversarial testing.^[2]

System-Level Attacks Win

Relatively simple attacks targeting weaknesses in end-to-end systems are more likely to succeed than complex algorithms targeting only the underlying AI model. Red teams should adopt a system-wide perspective.

Red Teaming ≠ Benchmarking

Benchmarks measure preexisting notions of harm on curated datasets. Red teaming explores unfamiliar scenarios and helps define novel harm categories. Both are necessary but serve different purposes.

Human Judgment Remains Essential

Despite automation benefits, human judgment is essential for prioritizing risks, designing system-level attacks, and assessing nuanced harms. Many risks require subject matter expertise, cultural understanding, and emotional intelligence.

Simple Attacks Often Work

Attackers often use simple, practical methods like hand-crafted prompts and fuzzing to exploit weaknesses. Sophisticated academic attacks are less common in practice than straightforward exploitation.

Mental Health Matters

Organizations need to consider red team members’ mental health—they "may be exposed to disproportionate amounts of unsettling and disturbing AI-generated content." Support structures are essential.

Security is Never Complete

AI models amplify existing security risks and create new ones. Theoretical research shows that for any output with non-zero probability, a sufficiently long prompt exists to elicit it. The goal is raising attack cost, not elimination.

Use AI as Force Multiplier

AI-generated attacks can lack creativity or context understanding. Use AI to brute-force simple variations while human experts analyze and guide the process. Completely hands-off AI red teaming isn’t yet viable.

Document with TTPs

Use a structured ontology to model attacks including adversarial actors, TTPs (Tactics, Techniques, and Procedures), system weaknesses, and downstream impacts. This enables systematic tracking and improvement.
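One way to apply this lesson is to record each finding as a structured object rather than free text. The sketch below is an assumption, not Microsoft's actual ontology schema: the `Finding` class, its field names, and the example values are hypothetical, chosen to mirror the elements named above (actor, TTPs, weakness, impact).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    """A single red-team finding described with actor, TTPs, weakness, and impact."""
    finding_id: str
    actor: str          # who could plausibly carry out the attack
    tactic: str         # the attacker's goal at this step
    technique: str      # how the goal is achieved
    procedure: str      # the concrete steps used in this test
    weakness: str       # the system weakness that made it possible
    impact: str         # the downstream consequence
    mitigations: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the finding for tracking or reporting."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example entry.
finding = Finding(
    finding_id="F-001",
    actor="External user with no special access",
    tactic="Data extraction",
    technique="Indirect prompt injection",
    procedure="Instructions hidden in a document the assistant retrieves",
    weakness="Retrieved content is not separated from system instructions",
    impact="System prompt leakage",
    mitigations=["Output classifier", "Input filtering on retrieved content"],
)
print(finding.to_json())
```

Keeping the fields fixed makes findings comparable across engagements and lets later phases sort or filter them by technique or impact.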

Red teaming tools

Two open-source tools have emerged as industry standards for AI red teaming: Microsoft’s PyRIT and NVIDIA’s Garak. Both are actively maintained and integrate with major model providers.

Microsoft PyRIT

Python Risk Identification Tool

Attack strategies: TAP, PAIR, Crescendo

Prompt converters (encodings, transformations)

Multimodal output scorers

Prompt datasets for harm categories

"Enabled a major shift from fully manual probing to red teaming supported by automation" — Microsoft AI Red Team

github.com/microsoft/pyrit

NVIDIA Garak

LLM Vulnerability Scanner

120+ vulnerability categories

Static, dynamic, and adaptive probes

Supports HuggingFace, OpenAI, Cohere, NIM

Detailed remediation reports

"Similar to nmap or Metasploit Framework, garak does comparable things for LLMs" — NVIDIA

github.com/NVIDIA/garak

Tool comparison

Capability	PyRIT	Garak
Attack orchestration	✓ Extensive	✓ Good
Pre-built attack library	✓ TAP, PAIR, Crescendo	✓ 120+ categories
Multi-turn attacks	✓ Native support	◐ Limited
Multimodal support	✓ Images, audio	✓ Images
Reporting	✓ JSON, scoring	✓ Detailed remediation
Best for	Complex orchestrated attacks	Broad vulnerability scanning

Defense effectiveness

Research demonstrates that well-designed defenses dramatically reduce attack success rates. The key is layered defense—no single technique provides complete protection.

Constitutional Classifiers (Anthropic)

Without classifiers: 86% jailbreak success

With classifiers: 4.4% jailbreak success

Reduction in jailbreak success: about 95% (relative)

Human red teaming: 183 participants, 3,000+ hours of testing
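The roughly 95% figure is a relative reduction, not a drop in percentage points. A short worked calculation, using only the 86% and 4.4% values quoted above (the function name is illustrative):

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Return the relative reduction, in percent, from before_pct to after_pct."""
    if before_pct <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (before_pct - after_pct) / before_pct


baseline_asr = 86.0   # jailbreak success without classifiers
defended_asr = 4.4    # jailbreak success with classifiers

print(f"Absolute drop: {baseline_asr - defended_asr:.1f} percentage points")  # 81.6
print(f"Relative reduction: {relative_reduction(baseline_asr, defended_asr):.1f}%")  # 94.9%
```

The distinction matters when comparing defenses: an 81.6-point drop from an 86% baseline and the same drop from a higher baseline are different relative improvements.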

Defense-in-depth layers

Layer 1 Input Filtering

Block known attack patterns, encoding tricks, and malicious payloads before they reach the model.

Layer 2 Model Safety Training

RLHF, Constitutional AI, and other alignment techniques that train refusal behavior into the model.

Layer 3 Output Classifiers

Constitutional Classifiers and similar systems that detect and block harmful outputs before delivery.

Layer 4 Application Controls

Rate limiting, user authentication, capability restrictions, and audit logging at the application layer.

Layer 5 Human Oversight

Human review for high-risk actions, escalation procedures, and continuous monitoring of system behavior.
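The five layers can be sketched as a single request path. Everything here is a simplified assumption for illustration: `GuardedPipeline`, the regex pattern, the `[UNSAFE]` marker, and the stub model are hypothetical stand-ins, and a production system would use trained classifiers rather than string matching.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

REFUSAL = "Request declined by policy."
# Hypothetical patterns and markers; production filters are far broader.
INPUT_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)]
OUTPUT_MARKERS = ("[UNSAFE]",)


@dataclass
class GuardedPipeline:
    model: Callable[[str], str]  # Layer 2: the safety-trained model itself
    audit_log: list[tuple[str, str]] = field(default_factory=list)  # Layer 4
    review_queue: list[str] = field(default_factory=list)           # Layer 5

    def respond(self, prompt: str) -> str:
        # Layer 1: reject known attack patterns before the model sees them.
        if any(p.search(prompt) for p in INPUT_PATTERNS):
            self.audit_log.append(("input_filter", "blocked"))
            return REFUSAL
        output = self.model(prompt)
        # Layer 3: screen the output before delivery.
        if any(marker in output for marker in OUTPUT_MARKERS):
            self.audit_log.append(("output_classifier", "blocked"))
            self.review_queue.append(prompt)  # escalate to a human reviewer
            return REFUSAL
        self.audit_log.append(("delivered", "allowed"))
        return output


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; emits a flagged output on one trigger word."""
    return "[UNSAFE] simulated harmful text" if "trigger" in prompt else f"echo: {prompt}"


pipeline = GuardedPipeline(model=stub_model)
print(pipeline.respond("Please ignore previous instructions"))  # blocked at layer 1
print(pipeline.respond("trigger the stub"))                     # blocked at layer 3
print(pipeline.respond("hello"))                                # echo: hello
```

The second call shows why layering matters: the prompt passes the input filter, and only the output check stops it.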

Regulatory requirements

AI red teaming is increasingly mandated by regulation. The EU AI Act sets binding adversarial-testing requirements backed by significant penalties, while NIST AI RMF is a voluntary framework that recommends continuous adversarial testing.

Regulation	Requirement	Effective	Penalty
EU AI Act (Art. 55)	Pre-release red teaming for systemic GPAI models	Aug 2025	€15M or 3%
EU AI Act (High-Risk)	Documented testing for high-risk AI systems	Aug 2026	€15M or 3%
NIST AI RMF	Continuous adversarial testing recommended	Now	Framework
NIST 600-1 GenAI	Specific GenAI risk mitigations including red teaming	July 2024	Framework
Colorado AI Act	NIST AI RMF compliance provides safe harbor	June 2026	Per violation
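The EU AI Act penalty caps cited in this guide pair a fixed amount with a percentage of global turnover. The sketch below shows the arithmetic under the assumption that the higher of the two applies, which is the rule for larger undertakings; the guide's table does not state this, smaller companies are treated differently, and the turnover figures are hypothetical. This is an illustration of the calculation, not legal advice.

```python
def max_fine_eur(fixed_cap_eur: int, turnover_pct: int, global_turnover_eur: int) -> int:
    """Return the higher of the fixed cap and the turnover-based cap, in whole euros."""
    turnover_cap = global_turnover_eur * turnover_pct // 100
    return max(fixed_cap_eur, turnover_cap)


# Hypothetical company with EUR 2 billion in global annual turnover.
turnover = 2_000_000_000
print(max_fine_eur(35_000_000, 7, turnover))  # 140000000 (7% exceeds the fixed cap)
print(max_fine_eur(15_000_000, 3, turnover))  # 60000000 (3% exceeds the fixed cap)

# Hypothetical smaller company: the fixed amount is the higher figure.
print(max_fine_eur(35_000_000, 7, 100_000_000))  # 35000000
```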

NIST Adversarial ML Guidance

In January 2024, NIST published guidance identifying four specific types of AI cyberattacks with recommended mitigations:^[8]

1. Data Poisoning

Corrupting training data to manipulate model behavior

2. Data Abuse

Exploiting legitimate data access for unauthorized purposes

3. Privacy Attacks

Extracting sensitive information from models

4. Evasion Attacks

Crafting inputs that bypass detection or classification

Building a red team program

Based on research from Microsoft, Anthropic, and academic studies, effective AI red team programs share common characteristics. The SEI notes that no standardized protocols yet exist for generative AI red teaming—organizations must build custom programs.^[9]

Team composition

Security Expertise

Traditional security testing skills, threat modeling, attack methodology

AI/ML Knowledge

Understanding of model behavior, training dynamics, alignment techniques

Domain Expertise

Knowledge of your specific use case, regulatory requirements, user context

Creative Thinking

Ability to find unexpected attack paths that automated tools miss

Testing cadence

PRE-DEPLOY

Full red team before any production release — mandatory for regulatory compliance

MAJOR

Red team when capabilities change significantly or new features added

MODEL

Test when underlying models are updated—model changes can introduce new vulnerabilities

PERIODIC

Quarterly for high-risk systems—new attack techniques emerge constantly

CONTINUOUS

Automated testing for known attack patterns using PyRIT/Garak in CI/CD
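For the continuous tier, a CI job can fail the build when automated scan results exceed an agreed ceiling. The sketch below assumes scan results have already been normalized into simple records with `category` and `bypassed` keys; that record shape, the category names, and the thresholds are hypothetical, and the actual report formats of PyRIT and Garak would need an adapter that is not shown here.

```python
from collections import defaultdict

# Hypothetical per-category ASR ceilings (percent); tune these to your own risk appetite.
MAX_ASR_PCT = {"prompt_injection": 5.0, "encoding": 2.0}
DEFAULT_MAX_ASR_PCT = 1.0


def asr_by_category(results: list[dict]) -> dict[str, float]:
    """Compute ASR (percent) per category from normalized scan records."""
    totals: dict[str, int] = defaultdict(int)
    bypasses: dict[str, int] = defaultdict(int)
    for record in results:
        totals[record["category"]] += 1
        bypasses[record["category"]] += int(record["bypassed"])
    return {cat: 100.0 * bypasses[cat] / totals[cat] for cat in totals}


def failing_categories(results: list[dict]) -> dict[str, float]:
    """Return the categories whose ASR exceeds their ceiling."""
    return {
        cat: asr
        for cat, asr in asr_by_category(results).items()
        if asr > MAX_ASR_PCT.get(cat, DEFAULT_MAX_ASR_PCT)
    }


def ci_exit_code(results: list[dict]) -> int:
    """Return 0 when every category is within its ceiling, 1 otherwise."""
    failures = failing_categories(results)
    for cat, asr in sorted(failures.items()):
        print(f"FAIL {cat}: ASR {asr:.1f}% exceeds ceiling")
    return 1 if failures else 0


# Illustrative records; a real run would load these from the scanner's report.
sample = [
    {"category": "prompt_injection", "bypassed": False},
    {"category": "prompt_injection", "bypassed": False},
    {"category": "encoding", "bypassed": True},
    {"category": "encoding", "bypassed": False},
]
exit_code = ci_exit_code(sample)  # prints one FAIL line for "encoding"
```

In a pipeline, the job would pass the returned value to `sys.exit` so that a failing category blocks the release.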

Regression Testing Gap

Feffer et al. (2024) found that fewer than 1/3 of 42 enterprise AI programs track post-fix regression. Many organizations fix vulnerabilities but don’t verify the fixes or check for reintroduction in later updates.^[10]

GLACIS red team framework

The GLACIS Red Team Framework synthesizes best practices from Microsoft, Anthropic, and regulatory requirements into a structured five-phase approach that produces compliance-ready evidence.

Scope → Threat Model → Attack → Report → Verify

PHASE 1

Scoping

Define what you’re testing and establish boundaries aligned with regulatory requirements.

Identify AI systems in scope (models, applications, integrations)
Map to regulatory requirements (EU AI Act risk level, NIST functions)
Define threat actors and attack motivations
Establish rules of engagement and access levels

Deliverable: Scoping document with regulatory mapping, signed authorization

PHASE 2

Threat Modeling

Map the attack surface using Microsoft’s TTP ontology and prioritize testing areas.

Document AI system architecture and data flows
Map attack taxonomy categories to system components
Prioritize attacks based on likelihood and impact
Include system-level attacks, not just model-level (per Microsoft lessons)

Deliverable: Threat model document, prioritized attack plan
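Prioritizing by likelihood and impact can be as simple as a scored, sorted list. The sketch below assumes a 1 to 5 scale for both factors and a multiplicative score; the scale, the `PlannedAttack` class, and the example scores are hypothetical, and the attack names are drawn from the taxonomy earlier in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedAttack:
    name: str
    component: str    # the system component the attack targets
    likelihood: int   # 1 (rare) to 5 (expected), an assumed scale
    impact: int       # 1 (minor) to 5 (severe), an assumed scale

    @property
    def risk_score(self) -> int:
        """Likelihood multiplied by impact."""
        return self.likelihood * self.impact


def prioritize(attacks: list[PlannedAttack]) -> list[PlannedAttack]:
    """Sort by risk score, then impact, both descending; name breaks remaining ties."""
    return sorted(attacks, key=lambda a: (-a.risk_score, -a.impact, a.name))


# Hypothetical scores for three attacks from the taxonomy.
plan = prioritize([
    PlannedAttack("Resource exhaustion", "API gateway", likelihood=3, impact=2),
    PlannedAttack("Indirect prompt injection", "Retrieval pipeline", likelihood=4, impact=5),
    PlannedAttack("System prompt leakage", "Chat frontend", likelihood=5, impact=3),
])
for attack in plan:
    print(f"{attack.risk_score:>2}  {attack.name} ({attack.component})")
```

Tying each entry to a component keeps the plan aligned with the system-level view the threat model is meant to capture.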

PHASE 3

Attack Execution

Execute structured attacks combining automated tools with manual human testing.

Run automated scans with Garak (120+ categories)
Execute multi-turn attacks with PyRIT orchestration
Conduct manual creative testing (19-65% higher success than automated)
Document all findings with reproduction steps

Deliverable: Attack logs, finding documentation, severity ratings

PHASE 4

Reporting

Document findings with actionable remediation guidance and compliance evidence.

Create detailed finding reports with CVSS-style severity ratings
Map findings to regulatory requirements (EU AI Act articles, NIST categories)
Recommend specific remediations with defense layer mapping
Generate executive summary for board/customer communication

Deliverable: Technical report, executive summary, compliance evidence package

PHASE 5

Verification

Confirm remediations are effective and establish regression testing baseline.

Retest all findings after remediation
Verify fixes don’t introduce new vulnerabilities
Create automated regression test suite for CI/CD integration
Document closure evidence for compliance records

Deliverable: Verification report, regression test suite, compliance closure documentation
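The regression suite from this phase can start as a replay of every closed finding against the current system. The sketch below makes several assumptions: `RegressionCase`, the refusal markers, and the stub target are hypothetical, and keyword matching is a crude stand-in for a proper response scorer.

```python
from dataclasses import dataclass
from typing import Callable

# Crude, hypothetical refusal markers; a real suite would use a dedicated scorer.
REFUSAL_MARKERS = ("i can't", "i cannot", "unable to help")


@dataclass(frozen=True)
class RegressionCase:
    finding_id: str
    prompt: str  # the reproduction prompt recorded when the finding was closed


def looks_like_refusal(response: str) -> bool:
    """Return True when the response contains any known refusal marker."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def reopened_findings(
    target: Callable[[str], str], cases: list[RegressionCase]
) -> list[str]:
    """Replay each case and return the IDs whose prompts are no longer refused."""
    return [c.finding_id for c in cases if not looks_like_refusal(target(c.prompt))]


def stub_target(prompt: str) -> str:
    """Stand-in for the system under test: only the first canary is still blocked."""
    if "canary-001" in prompt:
        return "I cannot help with that request."
    return "Sure, here is the content."


cases = [
    RegressionCase("F-001", "replay canary-001"),
    RegressionCase("F-002", "replay canary-002"),
]
print(reopened_findings(stub_target, cases))  # ['F-002']
```

Running the same cases after every model or prompt update is what catches a fix that has quietly been reintroduced as a vulnerability.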

References

[1] IEEE Spectrum. "Why Are Large AI Models Being Red Teamed?" 2024.
[2] Microsoft AI Red Team. "3 Takeaways from Red Teaming 100 Generative AI Products." Microsoft Security Blog, January 2025.
[3] Anthropic. "Constitutional Classifiers." Anthropic Research, 2024.
[4] NVIDIA. "Defining LLM Red Teaming." NVIDIA Technical Blog, 2024.
[5] "Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs." arXiv, 2024.
[6] Microsoft Security. "AI Jailbreaks: What They Are and How They Can Be Mitigated." June 2024.
[7] "Practical AI Red Teaming: The Power of Multi-Turn Tests vs Single-Turn Evaluations." Pillar Security, 2024.
[8] NIST. "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations." NIST AI 100-2e2023, January 2024.
[9] Software Engineering Institute. "What Can Generative AI Red-Teaming Learn from Cyber Red-Teaming?" Carnegie Mellon University, 2024.
[10] Feffer et al. "The Automation Advantage in AI Red Teaming." arXiv, 2024.
[11] Anthropic. "Frontier Threats: Red Teaming for AI Safety." 2024.
[12] Pillar Security. "AI Red Teaming Regulations and Standards." 2024.
[13] HackerOne. "AI Red Teaming: Offensive Testing for AI Models." 2024.

Disclaimer: Attack success rates and vulnerability statistics cited are from controlled research environments and may not reflect real-world deployment conditions. Methodology variations across studies may affect comparability. All figures reflect data available as of publication date (24 April 2026). This guide is for defensive security purposes only and does not constitute legal advice.
