Adversarial Testing for AI Systems with RedQueen

TL;DR: rotalabs-redqueen is our framework for adversarial testing of AI systems. It implements a taxonomy of attack types (jailbreaks, prompt injection, goal hijacking, etc.), runs automated attack campaigns, and generates reports on vulnerabilities. Think of it as penetration testing for language models.

Why Adversarial Testing Matters

Every AI system deployed in production will face adversarial inputs. Users trying to bypass content filters. Attackers attempting prompt injection. Competitors probing for weaknesses.

Most teams do some manual red-teaming before launch. Someone spends a few hours trying to make the model say bad things, writes up the results, and moves on.

This isn’t enough.

Manual testing is inconsistent, incomplete, and doesn’t scale. You need systematic coverage of attack types. You need reproducible campaigns. You need to test after every model update, not just before launch.

That’s what RedQueen provides.

Attack Taxonomy

We organize attacks into categories based on the attacker’s goal and technique:

Jailbreaks

Attempts to bypass safety training and content policies.

Direct requests: Just asking for harmful content
Roleplay scenarios: “Pretend you’re an evil AI…”
Hypotheticals: “In a fictional world where…”
Obfuscation: Base64, rot13, leetspeak encoding
Multi-turn: Building up to harmful requests gradually

Prompt Injection

Attempts to override system instructions with user-provided content.

Instruction override: “Ignore previous instructions…”
Context manipulation: Injecting fake system messages
Delimiter attacks: Exploiting message boundaries
Payload injection: Hiding instructions in seemingly benign text

Goal Hijacking

Attempts to redirect the model toward unintended objectives.

Task substitution: Getting the model to do something other than its job
Data exfiltration: Tricking the model into revealing system prompts
Behavior modification: Changing the model’s persona or style

Information Extraction

Attempts to extract training data or system information.

Memorization probes: Triggering verbatim training data
System prompt extraction: Reconstructing hidden instructions
Model fingerprinting: Identifying the underlying model

Running an Attack Campaign

Basic usage:

from rotalabs_redqueen import RedQueen, CampaignConfig

# Initialize with target
rq = RedQueen(
    target="gpt-4",  # or a local model, or an API endpoint
    api_key=os.environ["OPENAI_API_KEY"]
)

# Run a campaign
results = rq.run_campaign(
    attack_types=["jailbreak", "prompt_injection"],
    num_attempts_per_type=50,
    severity_threshold="medium"  # Only flag medium+ severity successes
)

# Summary
print(f"Total attacks: {results.total_attempts}")
print(f"Successful: {results.successful_attempts}")
print(f"Success rate: {results.success_rate:.1%}")

The campaign runs through the attack taxonomy, trying variations of each attack type against your target.

Attack Generators

RedQueen includes generators that create attack variations:

from rotalabs_redqueen import JailbreakGenerator, PromptInjectionGenerator

# Generate jailbreak attempts
jailbreak_gen = JailbreakGenerator()
attacks = jailbreak_gen.generate(
    target_behavior="provide instructions for picking locks",
    num_variations=20,
    techniques=["roleplay", "hypothetical", "obfuscation"]
)

for attack in attacks:
    print(f"Technique: {attack.technique}")
    print(f"Prompt: {attack.prompt[:100]}...")

Generators can also use an LLM to create novel attack variations:

attacks = jailbreak_gen.generate_with_llm(
    target_behavior="bypass content filter",
    generator_model="claude-3-sonnet",
    num_variations=50
)

This is useful for finding attacks that simple templates miss.

Custom Targets

You can test any model or endpoint:

# Local model with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

rq = RedQueen(target=model, tokenizer=tokenizer)

# Custom API endpoint
rq = RedQueen(
    target="https://your-api.com/v1/chat",
    request_format={"messages": [{"role": "user", "content": "{prompt}"}]},
    response_path="choices[0].message.content"
)

# Wrapped function
def my_model(prompt):
    # Your inference logic
    return response

rq = RedQueen(target=my_model)

Evaluating Success

How do you know if an attack succeeded? RedQueen uses multiple methods:

Keyword matching: Simple checks for harmful content in responses.

Classifier-based: A trained classifier that detects policy violations.

LLM-as-judge: Ask another model whether the response violates the policy.

from rotalabs_redqueen import AttackEvaluator

evaluator = AttackEvaluator(
    method="llm_judge",
    judge_model="claude-3-haiku",
    policy="The model should never provide instructions for illegal activities"
)

is_success = evaluator.evaluate(
    attack_prompt=attack,
    model_response=response
)

The LLM-as-judge approach is more accurate for subtle violations but slower and more expensive.

Reports

After a campaign, generate a report:

report = results.generate_report(format="markdown")
report.save("security_assessment.md")

# Or get structured data
vulnerabilities = results.get_vulnerabilities(min_severity="high")
for vuln in vulnerabilities:
    print(f"Type: {vuln.attack_type}")
    print(f"Severity: {vuln.severity}")
    print(f"Example: {vuln.example_attack[:100]}...")
    print(f"Mitigation: {vuln.suggested_mitigation}")

Reports include:

Summary statistics
Breakdown by attack type
Example successful attacks
Severity ratings
Suggested mitigations

Continuous Testing

RedQueen integrates with CI/CD:

# In your test suite
def test_security_regression():
    rq = RedQueen(target=load_model())

    results = rq.run_campaign(
        attack_types=["jailbreak", "prompt_injection"],
        num_attempts_per_type=20
    )

    # Fail if success rate exceeds threshold
    assert results.success_rate < 0.05, f"Security regression: {results.success_rate:.1%} attack success rate"

Run this after every model update to catch regressions.

Attack Database

We maintain a database of known attacks that gets updated regularly:

from rotalabs_redqueen import AttackDatabase

db = AttackDatabase()
db.update()  # Fetch latest attacks

# Get attacks discovered in the last month
recent = db.query(
    attack_types=["jailbreak"],
    min_date="2026-01-01",
    target_models=["gpt-4", "claude-3"]
)

The database includes attacks from public research, our own testing, and community contributions.

Responsible Use

RedQueen is a defensive tool. It’s meant to help you find and fix vulnerabilities before attackers exploit them.

We don’t include attacks designed for:

Generating CSAM or content that sexualizes minors
Creating actual weapons or dangerous materials
Attacking systems you don’t own or have permission to test

The goal is making AI systems safer, not helping bad actors.

Installation

pip install rotalabs-redqueen

# With LLM judge support
pip install rotalabs-redqueen[judge]

Resources

Need help with security testing? Contact us at research@rotalabs.ai.