Security Guide • Updated December 2025

AI Red Teaming Guide

Adversarial testing methodologies for AI systems. Attack techniques, tools (PyRIT, Garak), and defense effectiveness data.

30 min read 8,500+ words

Executive Summary

AI red teaming has emerged as the critical practice for identifying risks in generative AI systems before deployment. Microsoft's AI Red Team has tested 100+ generative AI products since 2018, publishing their methodology and lessons learned. Research shows attack success rates ranging from 87% on prior-generation models (GPT-4) to near-zero with modern defenses (Constitutional Classifiers), demonstrating that defenses work—but only when properly implemented. Current models like GPT-5.2 and Claude Opus 4.5 incorporate these lessons but require ongoing red teaming.

This guide synthesizes findings from industry leaders (Microsoft, Anthropic, NVIDIA), academic research, and regulatory requirements (EU AI Act, NIST AI RMF) to provide actionable guidance for enterprise AI security teams.

100+
Products Red Teamed
Microsoft since 2018
87%
Prior-Gen Attack Rate
GPT-4 without defenses
95%
Attacks Blocked
With Constitutional Classifiers
214K
Attack Attempts
Crucible platform dataset

State of the Field

AI red teaming has evolved from an ad-hoc practice to a critical security discipline. Major AI labs, enterprises, and regulators now recognize structured adversarial testing as essential for safe AI deployment.

Industry Investment

Big Tech's AI spending surpassed $240 billion in 2024 alone (Dataconomy), yet approaches to red teaming vary dramatically. Microsoft leads in transparency, having published detailed case studies from 100+ products. Anthropic invests extensively in domain expert testing. Yet many organizations still treat red teaming as a checkbox exercise.[1]

Major Lab Efforts

MS

Microsoft AI Red Team

Formed 2018 · 100+ products tested

One of the first red teams to cover both security and responsible AI. Published comprehensive white paper in January 2025 with 8 key lessons. Open-sourced PyRIT framework for automated attack orchestration.[2]

A

Anthropic

Frontier Red Team · Constitutional AI

Pioneered automated red teaming with model-vs-model loops. In cyber domain, Claude improved from "high schooler to undergraduate level" in CTF exercises in one year. Developed Constitutional Classifiers reducing jailbreak success from 86% to 4.4%.[3]

NV

NVIDIA

Garak LLM Vulnerability Scanner

Released Garak open-source scanner with 120+ vulnerability categories. Leon Derczynski leads development—also on OWASP LLM Top 10 core team. Presented at Black Hat USA 2024 to "heavily attended" sessions.[4]

Large-Scale Public Testing

Event Scale Key Finding
DEFCON 2023 2,244 hackers, 17,000+ conversations, 8 LLMs 21 harm categories tested across Anthropic, OpenAI, Google, Meta
Crucible Platform 214,271 attempts, 1,674 users, 400 days Average user: 128 attack attempts across 4 challenges
HackerOne AI Challenge 300,000+ interactions, 3,700+ hours Zero universal jailbreaks discovered
Anthropic Bug Bounty 183 participants, 3,000+ hours, 2 months Constitutional Classifiers reduced success to 4.4%

Attack Success Rates by Model

Research reveals significant variation in model vulnerability. Attack Success Rate (ASR) measures the percentage of adversarial prompts that successfully bypass safety measures. Without dedicated defenses, most models show high vulnerability.

Baseline Vulnerability (No Additional Defenses)

GPT-4 87.2% ASR
Claude 2 82.5% ASR
Mistral 7B 71.3% ASR
Vicuna 69.4% ASR
DeepSeek R1 68% ASR

Source: Red Teaming the Mind of the Machine (2024), various model-specific studies[5]

With Defensive Measures

4.4%

Claude with Constitutional Classifiers

Down from 86% baseline

100%

Claude 3.7 Sonnet resistance

Perfect jailbreak resistance in testing

Key Insight

The gap between defended and undefended models is dramatic. Microsoft warns that RLHF and alignment techniques "make jailbreaking more difficult but not impossible." A zero-trust approach—assuming any model could be jailbroken—combined with layered defenses is essential.[6]

Attack Techniques & Effectiveness

Research has quantified the effectiveness of different attack techniques. Roleplay-based attacks consistently outperform technical encoding tricks, though multi-turn attacks show the highest success rates overall.

Technique ASR Mechanism
Roleplay/Persona 89.6% Impersonation, fictional characters, hypothetical scenarios
Logic Traps 81.4% Conditional structures, moral dilemmas, contradictions
Encoding Tricks 76.2% Base64, zero-width characters, leetspeak, ROT13
Multi-Turn Human 70%+ Gradual escalation across conversation turns
Automated Single-Turn <10% Against well-defended models with safety layers

Multi-Turn Gap

Human red teamers achieved attack success rates 19-65% higher than ensembles of automated attacks across multiple LLM defenses. This suggests single-turn automated testing significantly underestimates real-world risk.[7]

Attack Taxonomy

Prompt-Level Attacks

Direct prompt injection Indirect prompt injection Jailbreaking / DAN Goal hijacking Instruction extraction Context manipulation

Data Extraction Attacks

Training data extraction System prompt leakage Context window exfiltration PII extraction Model inversion Membership inference

Safety Bypass Attacks

Harmful content generation Bias amplification Ethical guardrail bypass Content policy evasion Multi-turn manipulation Language switching

Capability Abuse Attacks

Tool/function abuse Unauthorized API calls Privilege escalation Resource exhaustion Chained attack sequences Cross-system exploitation

Microsoft's 8 Key Lessons from 100+ Products

In January 2025, Microsoft's AI Red Team published their comprehensive white paper detailing lessons from testing over 100 generative AI products since 2018. These findings represent the most extensive industry experience with structured AI adversarial testing.[2]

1

System-Level Attacks Win

Relatively simple attacks targeting weaknesses in end-to-end systems are more likely to succeed than complex algorithms targeting only the underlying AI model. Red teams should adopt a system-wide perspective.

2

Red Teaming ≠ Benchmarking

Benchmarks measure preexisting notions of harm on curated datasets. Red teaming explores unfamiliar scenarios and helps define novel harm categories. Both are necessary but serve different purposes.

3

Human Judgment Remains Essential

Despite automation benefits, human judgment is essential for prioritizing risks, designing system-level attacks, and assessing nuanced harms. Many risks require subject matter expertise, cultural understanding, and emotional intelligence.

4

Simple Attacks Often Work

Attackers often use simple, practical methods like hand-crafted prompts and fuzzing to exploit weaknesses. Sophisticated academic attacks are less common in practice than straightforward exploitation.

5

Mental Health Matters

Organizations need to consider red team members' mental health—they "may be exposed to disproportionate amounts of unsettling and disturbing AI-generated content." Support structures are essential.

6

Security is Never Complete

AI models amplify existing security risks and create new ones. Theoretical research shows that for any output with non-zero probability, a sufficiently long prompt exists to elicit it. The goal is raising attack cost, not elimination.

7

Use AI as Force Multiplier

AI-generated attacks can lack creativity or context understanding. Use AI to brute-force simple variations while human experts analyze and guide the process. Completely hands-off AI red teaming isn't yet viable.

8

Document with TTPs

Use a structured ontology to model attacks including adversarial actors, TTPs (Tactics, Techniques, and Procedures), system weaknesses, and downstream impacts. This enables systematic tracking and improvement.

Red Teaming Tools

Two open-source tools have emerged as industry standards for AI red teaming: Microsoft's PyRIT and NVIDIA's Garak. Both are actively maintained and integrate with major model providers.

PyRIT

Microsoft PyRIT

Python Risk Identification Tool

Attack strategies: TAP, PAIR, Crescendo
Prompt converters (encodings, transformations)
Multimodal output scorers
Prompt datasets for harm categories

"Enabled a major shift from fully manual probing to red teaming supported by automation" — Microsoft AI Red Team

github.com/microsoft/pyrit
Garak

NVIDIA Garak

LLM Vulnerability Scanner

120+ vulnerability categories
Static, dynamic, and adaptive probes
Supports HuggingFace, OpenAI, Cohere, NIM
Detailed remediation reports

"Similar to nmap or Metasploit Framework, garak does comparable things for LLMs" — NVIDIA

github.com/NVIDIA/garak

Tool Comparison

Capability PyRIT Garak
Attack orchestration ✓ Extensive ✓ Good
Pre-built attack library ✓ TAP, PAIR, Crescendo ✓ 120+ categories
Multi-turn attacks ✓ Native support ◐ Limited
Multimodal support ✓ Images, audio ✓ Images
Reporting ✓ JSON, scoring ✓ Detailed remediation
Best for Complex orchestrated attacks Broad vulnerability scanning

Defense Effectiveness

Research demonstrates that well-designed defenses dramatically reduce attack success rates. The key is layered defense—no single technique provides complete protection.

Constitutional Classifiers (Anthropic)

Without Classifiers 86% success
With Classifiers 4.4% success
95%

Reduction in jailbreak success

183 participants, 3,000+ hours testing

Defense-in-Depth Layers

Layer 1 Input Filtering

Block known attack patterns, encoding tricks, and malicious payloads before they reach the model.

Layer 2 Model Safety Training

RLHF, Constitutional AI, and other alignment techniques that train refusal behavior into the model.

Layer 3 Output Classifiers

Constitutional Classifiers and similar systems that detect and block harmful outputs before delivery.

Layer 4 Application Controls

Rate limiting, user authentication, capability restrictions, and audit logging at the application layer.

Layer 5 Human Oversight

Human review for high-risk actions, escalation procedures, and continuous monitoring of system behavior.

Regulatory Requirements

AI red teaming is increasingly mandated by regulation. The EU AI Act and NIST AI RMF establish specific requirements for adversarial testing, with significant penalties for non-compliance.

Regulation Requirement Effective Penalty
EU AI Act (Art. 55) Pre-release red teaming for systemic GPAI models Aug 2025 €15M or 3%
EU AI Act (High-Risk) Documented testing for high-risk AI systems Aug 2026 €35M or 7%
NIST AI RMF Continuous adversarial testing recommended Now Framework
NIST 600-1 GenAI Specific GenAI risk mitigations including red teaming July 2024 Framework
Colorado AI Act NIST AI RMF compliance provides safe harbor Feb 2026 Per violation

NIST Adversarial ML Guidance

In January 2024, NIST published guidance identifying four specific types of AI cyberattacks with recommended mitigations:[8]

1. Data Poisoning

Corrupting training data to manipulate model behavior

2. Data Abuse

Exploiting legitimate data access for unauthorized purposes

3. Privacy Attacks

Extracting sensitive information from models

4. Evasion Attacks

Crafting inputs that bypass detection or classification

Building a Red Team Program

Based on research from Microsoft, Anthropic, and academic studies, effective AI red team programs share common characteristics. The SEI notes that no standardized protocols yet exist for generative AI red teaming—organizations must build custom programs.[9]

Team Composition

Security Expertise

Traditional security testing skills, threat modeling, attack methodology

AI/ML Knowledge

Understanding of model behavior, training dynamics, alignment techniques

Domain Expertise

Knowledge of your specific use case, regulatory requirements, user context

Creative Thinking

Ability to find unexpected attack paths that automated tools miss

Testing Cadence

PRE-DEPLOY

Full red team before any production release — mandatory for regulatory compliance

MAJOR

Red team when capabilities change significantly or new features added

MODEL

Test when underlying models are updated—model changes can introduce new vulnerabilities

PERIODIC

Quarterly for high-risk systems—new attack techniques emerge constantly

CONTINUOUS

Automated testing for known attack patterns using PyRIT/Garak in CI/CD

Regression Testing Gap

Feffer et al. (2024) found that fewer than 1/3 of 42 enterprise AI programs track post-fix regression. Many organizations fix vulnerabilities but don't verify the fixes or check for reintroduction in later updates.[10]

GLACIS Red Team Framework

The GLACIS Red Team Framework synthesizes best practices from Microsoft, Anthropic, and regulatory requirements into a structured five-phase approach that produces compliance-ready evidence.

GLACIS Red Team Framework

1
Scope
2
Threat Model
3
Attack
4
Report
5
Verify
PHASE 1

Scoping

Define what you're testing and establish boundaries aligned with regulatory requirements.

  • Identify AI systems in scope (models, applications, integrations)
  • Map to regulatory requirements (EU AI Act risk level, NIST functions)
  • Define threat actors and attack motivations
  • Establish rules of engagement and access levels

Deliverable: Scoping document with regulatory mapping, signed authorization

PHASE 2

Threat Modeling

Map the attack surface using Microsoft's TTP ontology and prioritize testing areas.

  • Document AI system architecture and data flows
  • Map attack taxonomy categories to system components
  • Prioritize attacks based on likelihood and impact
  • Include system-level attacks, not just model-level (per Microsoft lessons)

Deliverable: Threat model document, prioritized attack plan

PHASE 3

Attack Execution

Execute structured attacks combining automated tools with manual human testing.

  • Run automated scans with Garak (120+ categories)
  • Execute multi-turn attacks with PyRIT orchestration
  • Conduct manual creative testing (19-65% higher success than automated)
  • Document all findings with reproduction steps

Deliverable: Attack logs, finding documentation, severity ratings

PHASE 4

Reporting

Document findings with actionable remediation guidance and compliance evidence.

  • Create detailed finding reports with CVSS-style severity ratings
  • Map findings to regulatory requirements (EU AI Act articles, NIST categories)
  • Recommend specific remediations with defense layer mapping
  • Generate executive summary for board/customer communication

Deliverable: Technical report, executive summary, compliance evidence package

PHASE 5

Verification

Confirm remediations are effective and establish regression testing baseline.

  • Retest all findings after remediation
  • Verify fixes don't introduce new vulnerabilities
  • Create automated regression test suite for CI/CD integration
  • Document closure evidence for compliance records

Deliverable: Verification report, regression test suite, compliance closure documentation

AI Red Team Playbook

Comprehensive playbook with 100+ attack techniques based on Microsoft and Anthropic research, testing scripts for PyRIT/Garak, and compliance-ready reporting templates.

Need AI Red Team Evidence?

Our Evidence Pack Sprint includes structured AI red teaming with documented findings, remediation verification, and compliance-ready reports that satisfy EU AI Act and NIST AI RMF requirements.

Learn About the Evidence Pack

References

  1. [1] IEEE Spectrum. "Why Are Large AI Models Being Red Teamed?" 2024.
  2. [2] Microsoft AI Red Team. "3 Takeaways from Red Teaming 100 Generative AI Products." Microsoft Security Blog, January 2025.
  3. [3] Anthropic. "Constitutional Classifiers." Anthropic Research, 2024.
  4. [4] NVIDIA. "Defining LLM Red Teaming." NVIDIA Technical Blog, 2024.
  5. [5] "Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs." arXiv, 2024.
  6. [6] Microsoft Security. "AI Jailbreaks: What They Are and How They Can Be Mitigated." June 2024.
  7. [7] "Practical AI Red Teaming: The Power of Multi-Turn Tests vs Single-Turn Evaluations." Pillar Security, 2024.
  8. [8] NIST. "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations." NIST AI 100-2e2023, January 2024.
  9. [9] Software Engineering Institute. "What Can Generative AI Red-Teaming Learn from Cyber Red-Teaming?" Carnegie Mellon University, 2024.
  10. [10] Feffer et al. "The Automation Advantage in AI Red Teaming." arXiv, 2024.
  11. [11] Anthropic. "Frontier Threats: Red Teaming for AI Safety." 2024.
  12. [12] Pillar Security. "AI Red Teaming Regulations and Standards." 2024.
  13. [13] HackerOne. "AI Red Teaming: Offensive Testing for AI Models." 2024.

Disclaimer: Attack success rates and vulnerability statistics cited are from controlled research environments and may not reflect real-world deployment conditions. Methodology variations across studies may affect comparability. All figures reflect data available as of publication date (December 2025). This guide is for defensive security purposes only and does not constitute legal advice.