AI red teaming, the working playbook for April 2026.

Adversarial testing methodologies, tooling (PyRIT, Garak), defense data (Constitutional Classifiers, OpenAI’s Instruction Hierarchy), and what changed when EO 14110 was rescinded — including the post-DEF CON 33 GRT 3 evaluation pivot and the federal posture under EO 14179, OMB M-25-21 and M-25-22.

The Assurance Loop See, Control, Prove, Improve — inside your stack. Architecture How Glacis runs, and what stays inside your environment. Documentation Setup, SDK, integrations. Agentic AI Security Runtime controls and signed proof for agents that act Clinical AI Ambient scribe, CDSS, clinical chatbots, and FDA-track AI AI Operations & Observability Runtime evidence for AI behavior in production Evidence Packs Regulator, customer, auditor, and internal artifacts Sample Receipt A signed runtime receipt and assembled pack OVERT Standard Portable verification for AI evidence Verification Check receipt integrity and provenance Resources Company Book the Sprint

Navigate

Home Resources Company

Product

The Assurance LoopSee, Control, Prove, Improve — inside your stack. ArchitectureHow Glacis runs, and what stays inside your environment. DocumentationSetup, SDK, integrations.

Solutions

Agentic AI SecurityRuntime controls and signed proof for agents that act Clinical AIAmbient scribe, CDSS, clinical chatbots, and FDA-track AI AI Operations & ObservabilityRuntime evidence for AI behavior in production

Evidence

Evidence PacksArtifacts assembled from signed runtime receipts Sample ReceiptSee runtime proof become an evidence pack OVERT StandardWhy proof can travel VerificationCheck receipt integrity and provenance Book the Sprint

GLACIS·AI security frameworks·Red teaming·Updated April 2026

Glacis runs Agent Runtime Security & Evidence Sprints — 10‑business‑day engagements that include adversarial probes against the agent boundary (prompt‑injection, tool‑misuse, exfiltration paths) and ship a signed evidence pack your security review can actually use.

By Joe Braidwood, CEO GLACIS·30 min read·Updated 24 April 2026

Executive summary

AI red teaming has emerged as the critical practice for identifying risks in generative AI systems before deployment. Microsoft’s AI Red Team has tested 100+ generative AI products since 2018, publishing their methodology and lessons learned. Research shows attack success rates ranging from 87% on prior-generation models (GPT-4) to near-zero with modern defenses (Constitutional Classifiers), demonstrating that defenses work—but only when properly implemented. Current models like GPT-5.2 and Claude Opus 4.5 incorporate these lessons but require ongoing red teaming.

This guide synthesizes findings from industry leaders (Microsoft, Anthropic, NVIDIA), academic research, and regulatory requirements (EU AI Act, NIST AI RMF) to provide actionable guidance for enterprise AI security teams.

In This Guide

→ State of the Field → Attack Success Rates by Model → Attack Techniques & Effectiveness → Microsoft’s 8 Key Lessons → Red Teaming Tools (PyRIT, Garak) → Defense Effectiveness → Regulatory Requirements → Building a Program → GLACIS Red Team Framework → References

State of the field

Federal red-teaming requirement under EO 14110 was rescinded on 20 January 2025 by EO 14148 and replaced by EO 14179 "Removing Barriers to American Leadership in AI." OMB M-25-21 (AI use) and M-25-22 (AI acquisition), issued 3 April 2025, supersede M-24-10. M-25-22 binds federal contracts awarded or renewed on or after 1 October 2025.[A1][A2][A3]

DEF CON 33 GRT 3 (7–10 Aug 2025) pivoted from direct model jailbreaks to red-teaming the evaluations that establish a model’s performance — bounties paid for findings that demonstrate evals are incomplete or wrong. Multi-model setup with model cards scoping each model’s intent. Designed to address the GRT-1/2 problem where vendor-actionable artefacts were rarely produced.[A4]

Anthropic Constitutional Classifiers reduced jailbreak attack-success rate from 86% → 4.4% on Claude. Feb 2025 bug bounty: 339 participants, ~300,000 interactions across 8 CBRN difficulty levels — a single universal jailbreak found. Next-generation classifiers cut overhead to ~1% with 0.05% benign refusal rate on Sonnet 4.5 traffic.[A5][A6]

MITRE ATLAS v5.4.0 (Feb 2026): 14 agentic-AI techniques added across 2025; new "Publish Poisoned AI Agent Tool" technique (e.g. malicious MCP servers).[A7]

AI red teaming has evolved from an ad-hoc practice to a critical security discipline. Major AI labs, enterprises, and regulators now recognize structured adversarial testing as essential for safe AI deployment.

Industry Investment

Big Tech’s AI spending surpassed $240 billion in 2024 alone (Dataconomy), yet approaches to red teaming vary dramatically. Microsoft leads in transparency, having published detailed case studies from 100+ products. Anthropic invests extensively in domain expert testing. Yet many organizations still treat red teaming as a checkbox exercise.[1]

Major lab efforts

Microsoft AI Red Team

Formed 2018 · 100+ products tested

One of the first red teams to cover both security and responsible AI. Published comprehensive white paper in January 2025 with 8 key lessons. Open-sourced PyRIT framework for automated attack orchestration.[2]

Anthropic

Frontier Red Team · Constitutional AI

Pioneered automated red teaming with model-vs-model loops. In cyber domain, Claude improved from "high schooler to undergraduate level" in CTF exercises in one year. Developed Constitutional Classifiers reducing jailbreak success from 86% to 4.4%.[3]

NVIDIA

Garak LLM Vulnerability Scanner

Released Garak open-source scanner with 120+ vulnerability categories. Leon Derczynski leads development—also on OWASP LLM Top 10 core team. Presented at Black Hat USA 2024 to "heavily attended" sessions.[4]

Large-scale public testing

Event Scale Key Finding
DEFCON 2023 2,244 hackers, 17,000+ conversations, 8 LLMs 21 harm categories tested across Anthropic, OpenAI, Google, Meta
Crucible Platform 214,271 attempts, 1,674 users, 400 days Average user: 128 attack attempts across 4 challenges
HackerOne AI Challenge 300,000+ interactions, 3,700+ hours Zero universal jailbreaks discovered
Anthropic Bug Bounty 183 participants, 3,000+ hours, 2 months Constitutional Classifiers reduced success to 4.4%

Attack success rates by model

Research reveals significant variation in model vulnerability. Attack Success Rate (ASR) measures the percentage of adversarial prompts that successfully bypass safety measures. Without dedicated defenses, most models show high vulnerability.

Baseline Vulnerability (No Additional Defenses)

Source: Red Teaming the Mind of the Machine (2024), various model-specific studies[5]

With Defensive Measures

Claude with Constitutional Classifiers

Down from 86% baseline

Claude 3.7 Sonnet resistance

Perfect jailbreak resistance in testing

Key Insight

The gap between defended and undefended models is dramatic. Microsoft warns that RLHF and alignment techniques "make jailbreaking more difficult but not impossible." A zero-trust approach—assuming any model could be jailbroken—combined with layered defenses is essential.[6]

Attack techniques and effectiveness

Research has quantified the effectiveness of different attack techniques. Roleplay-based attacks consistently outperform technical encoding tricks, though multi-turn attacks show the highest success rates overall.

Technique ASR Mechanism
Roleplay/Persona 89.6% Impersonation, fictional characters, hypothetical scenarios
Logic Traps 81.4% Conditional structures, moral dilemmas, contradictions
Encoding Tricks 76.2% Base64, zero-width characters, leetspeak, ROT13
Multi-Turn Human 70%+ Gradual escalation across conversation turns
Automated Single-Turn <10% Against well-defended models with safety layers

Multi-Turn Gap

Human red teamers achieved attack success rates 19-65% higher than ensembles of automated attacks across multiple LLM defenses. This suggests single-turn automated testing significantly underestimates real-world risk.[7]

Attack taxonomy

Prompt-Level Attacks

Data Extraction Attacks

Safety Bypass Attacks

Capability Abuse Attacks

Microsoft’s 8 key lessons from 100+ products

In January 2025, Microsoft’s AI Red Team published their comprehensive white paper detailing lessons from testing over 100 generative AI products since 2018. These findings represent the most extensive industry experience with structured AI adversarial testing.[2]

System-Level Attacks Win

Relatively simple attacks targeting weaknesses in end-to-end systems are more likely to succeed than complex algorithms targeting only the underlying AI model. Red teams should adopt a system-wide perspective.

Red Teaming ≠ Benchmarking

Benchmarks measure preexisting notions of harm on curated datasets. Red teaming explores unfamiliar scenarios and helps define novel harm categories. Both are necessary but serve different purposes.

Human Judgment Remains Essential

Despite automation benefits, human judgment is essential for prioritizing risks, designing system-level attacks, and assessing nuanced harms. Many risks require subject matter expertise, cultural understanding, and emotional intelligence.

Simple Attacks Often Work

Attackers often use simple, practical methods like hand-crafted prompts and fuzzing to exploit weaknesses. Sophisticated academic attacks are less common in practice than straightforward exploitation.

Mental Health Matters

Organizations need to consider red team members’ mental health—they "may be exposed to disproportionate amounts of unsettling and disturbing AI-generated content." Support structures are essential.

Security is Never Complete

AI models amplify existing security risks and create new ones. Theoretical research shows that for any output with non-zero probability, a sufficiently long prompt exists to elicit it. The goal is raising attack cost, not elimination.

Use AI as Force Multiplier

AI-generated attacks can lack creativity or context understanding. Use AI to brute-force simple variations while human experts analyze and guide the process. Completely hands-off AI red teaming isn’t yet viable.

Document with TTPs

Use a structured ontology to model attacks including adversarial actors, TTPs (Tactics, Techniques, and Procedures), system weaknesses, and downstream impacts. This enables systematic tracking and improvement.

Red teaming tools

Two open-source tools have emerged as industry standards for AI red teaming: Microsoft’s PyRIT and NVIDIA’s Garak. Both are actively maintained and integrate with major model providers.

Microsoft PyRIT

Python Risk Identification Tool

"Enabled a major shift from fully manual probing to red teaming supported by automation" — Microsoft AI Red Team

github.com/microsoft/pyrit

NVIDIA Garak

LLM Vulnerability Scanner

"Similar to nmap or Metasploit Framework, garak does comparable things for LLMs" — NVIDIA

github.com/NVIDIA/garak

Tool comparison

Capability PyRIT Garak
Attack orchestration ✓ Extensive ✓ Good
Pre-built attack library ✓ TAP, PAIR, Crescendo ✓ 120+ categories
Multi-turn attacks ✓ Native support ◐ Limited
Multimodal support ✓ Images, audio ✓ Images
Reporting ✓ JSON, scoring ✓ Detailed remediation
Best for Complex orchestrated attacks Broad vulnerability scanning

Defense effectiveness

Research demonstrates that well-designed defenses dramatically reduce attack success rates. The key is layered defense—no single technique provides complete protection.

Constitutional Classifiers (Anthropic)

Reduction in jailbreak success

183 participants, 3,000+ hours testing

Defense-in-depth layers

Block known attack patterns, encoding tricks, and malicious payloads before they reach the model.

RLHF, Constitutional AI, and other alignment techniques that train refusal behavior into the model.

Constitutional Classifiers and similar systems that detect and block harmful outputs before delivery.

Rate limiting, user authentication, capability restrictions, and audit logging at the application layer.

Human review for high-risk actions, escalation procedures, and continuous monitoring of system behavior.

Regulatory requirements

AI red teaming is increasingly mandated by regulation. The EU AI Act and NIST AI RMF establish specific requirements for adversarial testing, with significant penalties for non-compliance.

Regulation Requirement Effective Penalty
EU AI Act (Art. 55) Pre-release red teaming for systemic GPAI models Aug 2025 €15M or 3%
EU AI Act (High-Risk) Documented testing for high-risk AI systems Aug 2026 €35M or 7%
NIST AI RMF Continuous adversarial testing recommended Now Framework
NIST 600-1 GenAI Specific GenAI risk mitigations including red teaming July 2024 Framework
Colorado AI Act NIST AI RMF compliance provides safe harbor June 2026 Per violation

NIST Adversarial ML Guidance

In January 2024, NIST published guidance identifying four specific types of AI cyberattacks with recommended mitigations:[8]

Corrupting training data to manipulate model behavior

Exploiting legitimate data access for unauthorized purposes

Extracting sensitive information from models

Crafting inputs that bypass detection or classification

Building a red team program

Based on research from Microsoft, Anthropic, and academic studies, effective AI red team programs share common characteristics. The SEI notes that no standardized protocols yet exist for generative AI red teaming—organizations must build custom programs.[9]

Team composition

Security Expertise

Traditional security testing skills, threat modeling, attack methodology

AI/ML Knowledge

Understanding of model behavior, training dynamics, alignment techniques

Domain Expertise

Knowledge of your specific use case, regulatory requirements, user context

Creative Thinking

Ability to find unexpected attack paths that automated tools miss

Testing cadence

Full red team before any production release — mandatory for regulatory compliance

Red team when capabilities change significantly or new features added

Test when underlying models are updated—model changes can introduce new vulnerabilities

Quarterly for high-risk systems—new attack techniques emerge constantly

Automated testing for known attack patterns using PyRIT/Garak in CI/CD

Regression Testing Gap

Feffer et al. (2024) found that fewer than 1/3 of 42 enterprise AI programs track post-fix regression. Many organizations fix vulnerabilities but don’t verify the fixes or check for reintroduction in later updates.[10]

GLACIS red team framework

Adversarial probes are one input to a Glacis Agent Runtime Security & Evidence Sprint. The Sprint runs locally inside your infrastructure with zero sensitive‑data egress, executes structured red‑team exercises against the agent boundary, captures the result as signed evidence receipts, and assembles a board‑ready evidence pack. The five‑phase framework below mirrors what a Sprint produces — synthesized from Microsoft, Anthropic, and regulatory practice.

GLACIS red team framework

Scoping

Define what you’re testing and establish boundaries aligned with regulatory requirements.

Deliverable: Scoping document with regulatory mapping, signed authorization

Threat Modeling

Map the attack surface using Microsoft’s TTP ontology and prioritize testing areas.

Deliverable: Threat model document, prioritized attack plan

Attack Execution

Execute structured attacks combining automated tools with manual human testing.

Deliverable: Attack logs, finding documentation, severity ratings

Reporting

Document findings with actionable remediation guidance and compliance evidence.

Deliverable: Technical report, executive summary, compliance evidence package

Verification

Confirm remediations are effective and establish regression testing baseline.

Deliverable: Verification report, regression test suite, compliance closure documentation

See what a Sprint produces

A sample evidence pack from an Agent Runtime Security & Evidence Sprint — signed receipts from adversarial probes, runtime control mappings, and the artifact your customers’ security teams ask for.

Book a Sprint

GLACIS·Agent Runtime Security & Evidence Sprint

Turn red‑team findings into a signed evidence pack.

A 10‑business‑day Sprint, run locally inside your infrastructure: adversarial probes mapped to OWASP Top 10 for LLM Apps 2025, OWASP Top 10 for Agentic Apps 2026 and MITRE ATLAS; runtime controls in place at the agent boundary; signed evidence receipts; and a board‑ready evidence pack that satisfies EU AI Act, NIST AI RMF and ISO/IEC 42001 review. Founder design‑partner engagements available for the first three customers.

Book a Sprint →

References

  1. [1] IEEE Spectrum. "Why Are Large AI Models Being Red Teamed?" 2024.
  2. [2] Microsoft AI Red Team. "3 Takeaways from Red Teaming 100 Generative AI Products." Microsoft Security Blog, January 2025.
  3. [3] Anthropic. "Constitutional Classifiers." Anthropic Research, 2024.
  4. [4] NVIDIA. "Defining LLM Red Teaming." NVIDIA Technical Blog, 2024.
  5. [5] "Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs." arXiv, 2024.
  6. [6] Microsoft Security. "AI Jailbreaks: What They Are and How They Can Be Mitigated." June 2024.
  7. [7] "Practical AI Red Teaming: The Power of Multi-Turn Tests vs Single-Turn Evaluations." Pillar Security, 2024.
  8. [8] NIST. "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations." NIST AI 100-2e2023, January 2024.
  9. [9] Software Engineering Institute. "What Can Generative AI Red-Teaming Learn from Cyber Red-Teaming?" Carnegie Mellon University, 2024.
  10. [10] Feffer et al. "The Automation Advantage in AI Red Teaming." arXiv, 2024.
  11. [11] Anthropic. "Frontier Threats: Red Teaming for AI Safety." 2024.
  12. [12] Pillar Security. "AI Red Teaming Regulations and Standards." 2024.
  13. [13] HackerOne. "AI Red Teaming: Offensive Testing for AI Models." 2024.

Disclaimer: Attack success rates and vulnerability statistics cited are from controlled research environments and may not reflect real-world deployment conditions. Methodology variations across studies may affect comparability. All figures reflect data available as of publication date (24 April 2026). This guide is for defensive security purposes only and does not constitute legal advice.

Related guides

Security

LLM Security Guide

Defense Stack Framework

Governance

AI Governance Tools

2026 Buyer’s Guide

Framework

NIST AI RMF Guide

Risk Management Framework

Runtime assurance infrastructure for AI systems that act. Harden the runtime, prove which controls ran, and keep sensitive data inside your stack.

Solutions

AI security topics

Evidence

Company

Developers

© 2026 Glacis Technologies, Inc.

Terms Privacy Cookies Do Not Sell or Share Trust Center

Customer data stays inside the customer environment — Glacis runs at the edge of your stack, not in the middle of it. SOC 2 Type II attestation and live trust posture are available in the Trust Center.

We use cookies for analytics and marketing. Details

Ready to make your AI auditable?

Talk to our team. 30 minutes. One named workflow. Decide if the next 10 days save you a quarter.