Executive Summary
AI red teaming has emerged as the critical practice for identifying risks in generative AI systems before deployment. Microsoft's AI Red Team has tested 100+ generative AI products since 2018, publishing their methodology and lessons learned. Research shows attack success rates ranging from 87% on prior-generation models (GPT-4) to near-zero with modern defenses (Constitutional Classifiers), demonstrating that defenses work—but only when properly implemented. Current models like GPT-5.2 and Claude Opus 4.5 incorporate these lessons but require ongoing red teaming.
This guide synthesizes findings from industry leaders (Microsoft, Anthropic, NVIDIA), academic research, and regulatory requirements (EU AI Act, NIST AI RMF) to provide actionable guidance for enterprise AI security teams.
In This Guide
State of the Field
AI red teaming has evolved from an ad-hoc practice to a critical security discipline. Major AI labs, enterprises, and regulators now recognize structured adversarial testing as essential for safe AI deployment.
Industry Investment
Big Tech's AI spending surpassed $240 billion in 2024 alone (Dataconomy), yet approaches to red teaming vary dramatically. Microsoft leads in transparency, having published detailed case studies from 100+ products. Anthropic invests extensively in domain expert testing. Yet many organizations still treat red teaming as a checkbox exercise.[1]
Major Lab Efforts
Microsoft AI Red Team
Formed 2018 · 100+ products tested
One of the first red teams to cover both security and responsible AI. Published comprehensive white paper in January 2025 with 8 key lessons. Open-sourced PyRIT framework for automated attack orchestration.[2]
Anthropic
Frontier Red Team · Constitutional AI
Pioneered automated red teaming with model-vs-model loops. In cyber domain, Claude improved from "high schooler to undergraduate level" in CTF exercises in one year. Developed Constitutional Classifiers reducing jailbreak success from 86% to 4.4%.[3]
NVIDIA
Garak LLM Vulnerability Scanner
Released Garak open-source scanner with 120+ vulnerability categories. Leon Derczynski leads development—also on OWASP LLM Top 10 core team. Presented at Black Hat USA 2024 to "heavily attended" sessions.[4]
Large-Scale Public Testing
| Event | Scale | Key Finding |
|---|---|---|
| DEFCON 2023 | 2,244 hackers, 17,000+ conversations, 8 LLMs | 21 harm categories tested across Anthropic, OpenAI, Google, Meta |
| Crucible Platform | 214,271 attempts, 1,674 users, 400 days | Average user: 128 attack attempts across 4 challenges |
| HackerOne AI Challenge | 300,000+ interactions, 3,700+ hours | Zero universal jailbreaks discovered |
| Anthropic Bug Bounty | 183 participants, 3,000+ hours, 2 months | Constitutional Classifiers reduced success to 4.4% |
Attack Success Rates by Model
Research reveals significant variation in model vulnerability. Attack Success Rate (ASR) measures the percentage of adversarial prompts that successfully bypass safety measures. Without dedicated defenses, most models show high vulnerability.
Baseline Vulnerability (No Additional Defenses)
Source: Red Teaming the Mind of the Machine (2024), various model-specific studies[5]
With Defensive Measures
Claude with Constitutional Classifiers
Down from 86% baseline
Claude 3.7 Sonnet resistance
Perfect jailbreak resistance in testing
Key Insight
The gap between defended and undefended models is dramatic. Microsoft warns that RLHF and alignment techniques "make jailbreaking more difficult but not impossible." A zero-trust approach—assuming any model could be jailbroken—combined with layered defenses is essential.[6]
Attack Techniques & Effectiveness
Research has quantified the effectiveness of different attack techniques. Roleplay-based attacks consistently outperform technical encoding tricks, though multi-turn attacks show the highest success rates overall.
| Technique | ASR | Mechanism |
|---|---|---|
| Roleplay/Persona | 89.6% | Impersonation, fictional characters, hypothetical scenarios |
| Logic Traps | 81.4% | Conditional structures, moral dilemmas, contradictions |
| Encoding Tricks | 76.2% | Base64, zero-width characters, leetspeak, ROT13 |
| Multi-Turn Human | 70%+ | Gradual escalation across conversation turns |
| Automated Single-Turn | <10% | Against well-defended models with safety layers |
Multi-Turn Gap
Human red teamers achieved attack success rates 19-65% higher than ensembles of automated attacks across multiple LLM defenses. This suggests single-turn automated testing significantly underestimates real-world risk.[7]
Attack Taxonomy
Prompt-Level Attacks
Data Extraction Attacks
Safety Bypass Attacks
Capability Abuse Attacks
Microsoft's 8 Key Lessons from 100+ Products
In January 2025, Microsoft's AI Red Team published their comprehensive white paper detailing lessons from testing over 100 generative AI products since 2018. These findings represent the most extensive industry experience with structured AI adversarial testing.[2]
System-Level Attacks Win
Relatively simple attacks targeting weaknesses in end-to-end systems are more likely to succeed than complex algorithms targeting only the underlying AI model. Red teams should adopt a system-wide perspective.
Red Teaming ≠ Benchmarking
Benchmarks measure preexisting notions of harm on curated datasets. Red teaming explores unfamiliar scenarios and helps define novel harm categories. Both are necessary but serve different purposes.
Human Judgment Remains Essential
Despite automation benefits, human judgment is essential for prioritizing risks, designing system-level attacks, and assessing nuanced harms. Many risks require subject matter expertise, cultural understanding, and emotional intelligence.
Simple Attacks Often Work
Attackers often use simple, practical methods like hand-crafted prompts and fuzzing to exploit weaknesses. Sophisticated academic attacks are less common in practice than straightforward exploitation.
Mental Health Matters
Organizations need to consider red team members' mental health—they "may be exposed to disproportionate amounts of unsettling and disturbing AI-generated content." Support structures are essential.
Security is Never Complete
AI models amplify existing security risks and create new ones. Theoretical research shows that for any output with non-zero probability, a sufficiently long prompt exists to elicit it. The goal is raising attack cost, not elimination.
Use AI as Force Multiplier
AI-generated attacks can lack creativity or context understanding. Use AI to brute-force simple variations while human experts analyze and guide the process. Completely hands-off AI red teaming isn't yet viable.
Document with TTPs
Use a structured ontology to model attacks including adversarial actors, TTPs (Tactics, Techniques, and Procedures), system weaknesses, and downstream impacts. This enables systematic tracking and improvement.
Red Teaming Tools
Two open-source tools have emerged as industry standards for AI red teaming: Microsoft's PyRIT and NVIDIA's Garak. Both are actively maintained and integrate with major model providers.
Microsoft PyRIT
Python Risk Identification Tool
"Enabled a major shift from fully manual probing to red teaming supported by automation" — Microsoft AI Red Team
NVIDIA Garak
LLM Vulnerability Scanner
"Similar to nmap or Metasploit Framework, garak does comparable things for LLMs" — NVIDIA
Tool Comparison
| Capability | PyRIT | Garak |
|---|---|---|
| Attack orchestration | ✓ Extensive | ✓ Good |
| Pre-built attack library | ✓ TAP, PAIR, Crescendo | ✓ 120+ categories |
| Multi-turn attacks | ✓ Native support | ◐ Limited |
| Multimodal support | ✓ Images, audio | ✓ Images |
| Reporting | ✓ JSON, scoring | ✓ Detailed remediation |
| Best for | Complex orchestrated attacks | Broad vulnerability scanning |
Defense Effectiveness
Research demonstrates that well-designed defenses dramatically reduce attack success rates. The key is layered defense—no single technique provides complete protection.
Constitutional Classifiers (Anthropic)
Reduction in jailbreak success
183 participants, 3,000+ hours testing
Defense-in-Depth Layers
Block known attack patterns, encoding tricks, and malicious payloads before they reach the model.
RLHF, Constitutional AI, and other alignment techniques that train refusal behavior into the model.
Constitutional Classifiers and similar systems that detect and block harmful outputs before delivery.
Rate limiting, user authentication, capability restrictions, and audit logging at the application layer.
Human review for high-risk actions, escalation procedures, and continuous monitoring of system behavior.
Regulatory Requirements
AI red teaming is increasingly mandated by regulation. The EU AI Act and NIST AI RMF establish specific requirements for adversarial testing, with significant penalties for non-compliance.
| Regulation | Requirement | Effective | Penalty |
|---|---|---|---|
| EU AI Act (Art. 55) | Pre-release red teaming for systemic GPAI models | Aug 2025 | €15M or 3% |
| EU AI Act (High-Risk) | Documented testing for high-risk AI systems | Aug 2026 | €35M or 7% |
| NIST AI RMF | Continuous adversarial testing recommended | Now | Framework |
| NIST 600-1 GenAI | Specific GenAI risk mitigations including red teaming | July 2024 | Framework |
| Colorado AI Act | NIST AI RMF compliance provides safe harbor | Feb 2026 | Per violation |
NIST Adversarial ML Guidance
In January 2024, NIST published guidance identifying four specific types of AI cyberattacks with recommended mitigations:[8]
Corrupting training data to manipulate model behavior
Exploiting legitimate data access for unauthorized purposes
Extracting sensitive information from models
Crafting inputs that bypass detection or classification
Building a Red Team Program
Based on research from Microsoft, Anthropic, and academic studies, effective AI red team programs share common characteristics. The SEI notes that no standardized protocols yet exist for generative AI red teaming—organizations must build custom programs.[9]
Team Composition
Security Expertise
Traditional security testing skills, threat modeling, attack methodology
AI/ML Knowledge
Understanding of model behavior, training dynamics, alignment techniques
Domain Expertise
Knowledge of your specific use case, regulatory requirements, user context
Creative Thinking
Ability to find unexpected attack paths that automated tools miss
Testing Cadence
Full red team before any production release — mandatory for regulatory compliance
Red team when capabilities change significantly or new features added
Test when underlying models are updated—model changes can introduce new vulnerabilities
Quarterly for high-risk systems—new attack techniques emerge constantly
Automated testing for known attack patterns using PyRIT/Garak in CI/CD
Regression Testing Gap
Feffer et al. (2024) found that fewer than 1/3 of 42 enterprise AI programs track post-fix regression. Many organizations fix vulnerabilities but don't verify the fixes or check for reintroduction in later updates.[10]
GLACIS Red Team Framework
The GLACIS Red Team Framework synthesizes best practices from Microsoft, Anthropic, and regulatory requirements into a structured five-phase approach that produces compliance-ready evidence.
GLACIS Red Team Framework
Scoping
Define what you're testing and establish boundaries aligned with regulatory requirements.
- Identify AI systems in scope (models, applications, integrations)
- Map to regulatory requirements (EU AI Act risk level, NIST functions)
- Define threat actors and attack motivations
- Establish rules of engagement and access levels
Deliverable: Scoping document with regulatory mapping, signed authorization
Threat Modeling
Map the attack surface using Microsoft's TTP ontology and prioritize testing areas.
- Document AI system architecture and data flows
- Map attack taxonomy categories to system components
- Prioritize attacks based on likelihood and impact
- Include system-level attacks, not just model-level (per Microsoft lessons)
Deliverable: Threat model document, prioritized attack plan
Attack Execution
Execute structured attacks combining automated tools with manual human testing.
- Run automated scans with Garak (120+ categories)
- Execute multi-turn attacks with PyRIT orchestration
- Conduct manual creative testing (19-65% higher success than automated)
- Document all findings with reproduction steps
Deliverable: Attack logs, finding documentation, severity ratings
Reporting
Document findings with actionable remediation guidance and compliance evidence.
- Create detailed finding reports with CVSS-style severity ratings
- Map findings to regulatory requirements (EU AI Act articles, NIST categories)
- Recommend specific remediations with defense layer mapping
- Generate executive summary for board/customer communication
Deliverable: Technical report, executive summary, compliance evidence package
Verification
Confirm remediations are effective and establish regression testing baseline.
- Retest all findings after remediation
- Verify fixes don't introduce new vulnerabilities
- Create automated regression test suite for CI/CD integration
- Document closure evidence for compliance records
Deliverable: Verification report, regression test suite, compliance closure documentation
AI Red Team Playbook
Comprehensive playbook with 100+ attack techniques based on Microsoft and Anthropic research, testing scripts for PyRIT/Garak, and compliance-ready reporting templates.
Need AI Red Team Evidence?
Our Evidence Pack Sprint includes structured AI red teaming with documented findings, remediation verification, and compliance-ready reports that satisfy EU AI Act and NIST AI RMF requirements.
Learn About the Evidence PackReferences
- [1] IEEE Spectrum. "Why Are Large AI Models Being Red Teamed?" 2024.
- [2] Microsoft AI Red Team. "3 Takeaways from Red Teaming 100 Generative AI Products." Microsoft Security Blog, January 2025.
- [3] Anthropic. "Constitutional Classifiers." Anthropic Research, 2024.
- [4] NVIDIA. "Defining LLM Red Teaming." NVIDIA Technical Blog, 2024.
- [5] "Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs." arXiv, 2024.
- [6] Microsoft Security. "AI Jailbreaks: What They Are and How They Can Be Mitigated." June 2024.
- [7] "Practical AI Red Teaming: The Power of Multi-Turn Tests vs Single-Turn Evaluations." Pillar Security, 2024.
- [8] NIST. "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations." NIST AI 100-2e2023, January 2024.
- [9] Software Engineering Institute. "What Can Generative AI Red-Teaming Learn from Cyber Red-Teaming?" Carnegie Mellon University, 2024.
- [10] Feffer et al. "The Automation Advantage in AI Red Teaming." arXiv, 2024.
- [11] Anthropic. "Frontier Threats: Red Teaming for AI Safety." 2024.
- [12] Pillar Security. "AI Red Teaming Regulations and Standards." 2024.
- [13] HackerOne. "AI Red Teaming: Offensive Testing for AI Models." 2024.
Disclaimer: Attack success rates and vulnerability statistics cited are from controlled research environments and may not reflect real-world deployment conditions. Methodology variations across studies may affect comparability. All figures reflect data available as of publication date (December 2025). This guide is for defensive security purposes only and does not constitute legal advice.