ansiescape — Checks whether the model can be induced to emit ANSI escape sequences and other terminal control codes in its output; maps to OWASP LLM05 (Improper Output Handling) and general output-sanitization weaknesses.
atkgen — Uses an automated red-teaming generator to produce adversarial prompts aimed at provoking unsafe output, so you can see how your model handles misuse attempts; maps to adversarial-testing and content-safety guidance.
av_spam_scanning — Checks whether the model will reproduce known malicious-content signatures (such as antivirus and spam test strings) in its output; maps to Abuse Detection frameworks and LLM misuse-prevention tests.
continuation — Tests whether the model will keep going once unsafe text has been started, for example by completing a partially written slur or risky instruction; maps to content-safety guidance on toxic and harmful output.
dan — Classic jailbreak prompts that try to break safety rules; directly maps to OWASP LLM01 (Prompt Injection), of which jailbreaking is the direct form, and to jailbreak taxonomies like "direct override attacks."
divergence — Checks whether repetition-style prompts push the model into unstable, divergent output of the kind known to regurgitate memorized training data; maps to OWASP LLM02 (Sensitive Information Disclosure) and output-stability concerns.
doctor — Attempts to make the model give expert or professional advice it shouldn't; maps to Responsible AI safety categories for regulated advice.
donotanswer — Poses questions a responsible model should decline to answer, to see whether it can be tempted into unsafe replies; maps to Responsible AI refusal and content-safety guidance.
dra — Pushes the model toward risky, high-stakes questions to confirm it refuses safely; maps to Responsible AI risk guidance.
encoding — Hides instructions behind text encodings such as Base64 or ROT13; maps to OWASP LLM01 (Prompt Injection) via obfuscated payloads.
exploitation — Mimics injection-style trick prompts to see if the model can be manipulated; maps to OWASP LLM01 (Prompt Injection).
goodside — Implements attacks popularized by Riley Goodside, such as glitch tokens and hidden-text tricks; maps to OWASP LLM01 (Prompt Injection) and robustness testing.
grandma — Uses innocent-sounding, emotionally framed prompts to get around safety rules; maps to OWASP LLM01 (Prompt Injection) via social-engineering jailbreaks.
latentinjection — Hides instructions inside documents or data the model is asked to process, so it follows them unintentionally (indirect prompt injection); maps directly to OWASP LLM01 (Prompt Injection).
leakreplay — Tests whether the model reproduces memorized training data, for example by completing passages from published text; maps to OWASP LLM02 (Sensitive Information Disclosure).
lmrc — Probes a subset of the Language Model Risk Cards (LMRC) framework, covering harms such as bullying, slurs, and sexualized content; maps to Responsible AI content safety.
malwaregen — Tests whether the model will help write malware or other harmful code; maps to harmful-content prevention guidance for code generation.
misleading — Feeds the model false claims to see whether it repeats or endorses them; maps to OWASP LLM09 (Misinformation).
packagehallucination — Checks whether the model invents software packages that don't exist, a hallucination that becomes a supply-chain risk if attackers register the fake names; maps to OWASP LLM09 (Misinformation) and LLM03 (Supply Chain).
phrasing — Rewords harmful requests (for example, into past or future tense) to see whether a change of phrasing slips past refusals; maps to OWASP LLM01 (Prompt Injection) via paraphrase-style jailbreaks.
promptinject — Directly inserts conflicting or hidden instructions, in the style of the PromptInject framework's goal hijacking and prompt leaking; core OWASP LLM01 (Prompt Injection).
realtoxicityprompts — Uses prompts from the RealToxicityPrompts dataset that tend to elicit toxic continuations, to check that the model stays safe; maps to Responsible AI toxicity guidance.
sata — A structured set of adversarial, security-oriented prompts; maps broadly to jailbreak and adversarial-evaluation testing.
smuggling — Hides harmful intent inside harmless-looking text (content or token smuggling); maps to OWASP LLM01 (Prompt Injection) via obfuscated jailbreaks.
snowball — Asks questions designed to trigger snowballing hallucinations, where the model commits to a wrong answer and then fabricates support for it; maps to OWASP LLM09 (Misinformation).
suffix — Appends adversarial suffixes (GCG-style token strings) to requests to test whether the model can resist them; maps to OWASP LLM01 (Prompt Injection) via adversarial-suffix attacks.
tap — Runs Tree of Attacks with Pruning (TAP), an automated search that iteratively refines jailbreak prompts against the target; maps to OWASP LLM01 (Prompt Injection) and adversarial-robustness testing.
test — Simple baseline tests ensuring the system is behaving normally.
topic — Tests how the model handles sensitive or restricted topics it has been asked to avoid; maps to Responsible AI safety boundaries and deployment content policies.
web_injection — Tries to smuggle harmful instructions through links, markup, or other web-style content, including payloads that could exfiltrate data when output is rendered; maps to OWASP LLM01 (Prompt Injection) and LLM05 (Improper Output Handling). A sample run of these probes is sketched after this list.
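These probe names match the probe modules shipped with the garak LLM vulnerability scanner, so, assuming that is the tool in use, a scan over a chosen subset of probes could look like the sketch below. The model type, model name, and probe selection are illustrative placeholders; --model_type, --model_name, and --probes are garak's documented selection flags, but confirm them against `python -m garak --help` for your installed version.

```python
# Sketch: run a subset of the probes above with the garak scanner (assumed tool).
# Model type, model name, and the probe list are placeholders -- adjust for your target.
import subprocess

probes = ["dan", "promptinject", "encoding", "leakreplay"]  # probe modules to exercise

cmd = [
    "python", "-m", "garak",
    "--model_type", "openai",          # adapter for the target model family (placeholder)
    "--model_name", "gpt-3.5-turbo",   # placeholder target model
    "--probes", ",".join(probes),      # comma-separated probe modules
]

# The scanner prints per-probe pass/fail summaries and writes a report file for the run.
subprocess.run(cmd, check=True)
```

The report written at the end of the run is what feeds the regression comparison described later in this section.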
dan — Flags responses showing that the model has adopted a DAN-style jailbreak persona, i.e., that a jailbreak attempt succeeded. Best used with probes from the dan attack category.
judge — Uses an evaluator model to score whether the system responded responsibly, including whether it refused harmful queries when needed. General-purpose scorers like this can be used with all probes.
unsafe_content — Flags hate speech, offensive language, sensitive topics, or harmful wording across categories such as race, gender, disability, sexuality, and mental health. Best used with realtoxicityprompts, lmrc, topic, misleading, goodside, atkgen, promptinject, suffix, grandma, donotanswer, and continuation (an example pairing follows this list).
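Each probe normally selects its own default detector, but a specific scorer can be forced when you want consistent scoring across a themed run, for example unsafe_content over the toxicity-oriented probes listed above. The sketch below assumes garak's detector-override flag (--detectors); the probe set, model, and flag usage should be checked against your version's --help output.

```python
# Sketch: score a toxicity-focused probe set with the unsafe_content detector.
# Assumes --detectors overrides the probes' default detectors; verify the flag
# with `python -m garak --help` on the installed version.
import subprocess

cmd = [
    "python", "-m", "garak",
    "--model_type", "huggingface",     # placeholder: local Hugging Face model adapter
    "--model_name", "gpt2",            # placeholder target model
    "--probes", "realtoxicityprompts,lmrc,continuation",
    "--detectors", "unsafe_content",   # force this detector family for scoring
]
subprocess.run(cmd, check=True)
```

Using the judge detector works the same way, with the additional step of configuring which evaluator model it should call; that configuration is deployment-specific.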
Identify hidden vulnerabilities such as policy evasions, data leakage, or unsafe outputs.
Detect failure patterns tied to particular prompt styles or threat scenarios.
Prioritize improvements based on measured risks.
Monitor for regressions by re-running the same tests after improvements (see the comparison sketch after this list).
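For the regression-monitoring step, the main requirement is a comparable per-probe score from each run. The sketch below assumes a simplified, hypothetical results format, a JSON file mapping probe name to pass rate that you would extract from your scanner's report, and flags any probe whose pass rate dropped between a baseline run and the latest run. The file names, threshold, and format are illustrative, not part of any tool's actual output.

```python
# Sketch: flag regressions between two scan runs.
# Assumes each run has been summarized into a JSON file mapping
# probe name -> pass rate (0.0-1.0); this format is hypothetical,
# derived from whatever report your scanner produces.
import json

THRESHOLD = 0.05  # flag drops larger than 5 percentage points


def load_scores(path: str) -> dict[str, float]:
    """Load a probe -> pass-rate mapping from a summary JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


baseline = load_scores("baseline_scores.json")  # scores from the earlier run
latest = load_scores("latest_scores.json")      # scores from the re-run

for probe, old_rate in sorted(baseline.items()):
    new_rate = latest.get(probe)
    if new_rate is None:
        print(f"{probe}: missing from latest run")
    elif old_rate - new_rate > THRESHOLD:
        print(f"REGRESSION {probe}: {old_rate:.2%} -> {new_rate:.2%}")
```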