Ejento AI
Guides
QuickstartRecipesREST APIsRelease NotesFAQs
Guides
QuickstartRecipesREST APIsRelease NotesFAQs
Ejento AI
  1. Features
  • Basic Operations
    • Features
      • Teams → Projects → Assistants Hierarchy
    • Guides
      • Login/Signup
  • Assistants
    • Features
      • Introduction to Assistants
      • Assistant Access Control
      • Caching Responses for Assistants
      • Assistant Evaluation
      • Evaluation Metrics
      • URL-based Chat Thread Creation and Prepopulation
      • Reasoning Patterns
    • Guides
      • Add Assistant
      • Evaluate Assistant
      • Edit Assistant
      • Embed Assistant
      • Delete Assistant
      • Add Favourite Assistants
      • View Assistant Id
      • View Dataset Id
  • Corpus
    • Features
      • Introduction
      • Corpus Permissions
      • PII Redaction
    • Guides
      • Assistant Corpus Setup
      • Assistant Corpus Settings
      • Corpus Access Control
      • Corpus Connections
      • ETag Setup for Corpus Incremental Refresh
      • View Corpus Id
      • View Document Id
      • Tagging
        • Corpus tagging
        • Document tagging
  • Teams
    • Features
      • Introduction
    • Guides
      • Add a Team
      • Edit a Team
      • Delete a Team
      • View Team Id
  • Projects
    • Features
      • Introduction
    • Guides
      • Add a Project
      • Edit a Project
      • Delete a Project
      • View Project Id
  • User Settings
    • Features
      • Introduction
      • Ejento AI User Access Levels
    • Guides
      • Assistant Edit Access
      • Add new user
      • Add User in a Team
      • Remove User from a Team
      • View my Access level in a Team
      • View my User Id
  • API Keys
    • Features
      • Introduction
    • Guides
      • How to generate API Key and Auth Token
  • Workflows
    • Features
      • Introduction
    • Guides
      • Add Workflow
      • Workflow Chat
  • Tools
    • Features
      • Introduction
    • Guides
      • Tools Overview
      • Create External Tool
      • Connect Tool to Assistant
  • Analytics
    • Features
      • Introduction
    • Guides
      • Analyzing Data in the Analytics Dashboard
  • Chatlogs
    • Features
      • Introduction
    • Guides
      • Managing Chatlogs
      • View Chatlog & Chat thread Id
  • Integrations
    • Features
      • Introduction
    • Guides
      • Email Indexing
      • Microsoft Teams
      • Sharepoint Indexing
      • MS Teams Integration Setup
      • Creating a Connection in Credential Manager
  • Ejento AI Shield
    • Features
      • Introduction
      • Understanding Guardrails
    • Guides
      • How to enable Guardrails
  • Assistant Security
    • Features
      • Introduction
      • Assistant Red Teaming
    • Guides
      • Red Team an Assistant
Guides
QuickstartRecipesREST APIsRelease NotesFAQs
Guides
QuickstartRecipesREST APIsRelease NotesFAQs
Ejento AI
  1. Features

Assistant Red Teaming

In Ejento AI, red teaming refers to systematically challenging the assistant with adversarial prompts to uncover vulnerabilities before they affect real users.

Key Components of Red Team Tests#

1. Probes#

Red team prompts are adversarial questions or instructions directed at an AI assistant. They are intentionally crafted to probe potential failure modes such as policy violations, unsafe content generation, data leakage, or non-compliance. These prompts simulate worst-case inputs that malicious users or attackers might attempt.

Probe Categories#

ansiescape — Checks whether hidden or tricky text formatting can fool the model; maps to OWASP LLM10 (Model Misuse) and general input-sanitization weaknesses.
atkgen — Generates safe "attack-like" prompts so you can see how your model handles misuse attempts; maps to OWASP LLM10 (Model Misuse).
av_spam_scanning — Ensures the model doesn't mishandle obviously suspicious content; maps to Abuse Detection frameworks and LLM Misuse Prevention tests.
continuation — Tests if the model keeps following risky instructions after the user stops saying them directly; maps to OWASP LLM02 (Hallucinations) and LLM10 (Model Misuse).
dan — Classic jailbreak prompts that try to break safety rules; directly maps to OWASP LLM06 (Jailbreaks) and jailbreak taxonomies like "direct override attacks."
divergence — Checks for looping or unstable output; maps to OWASP LLM02 (Hallucinations & Instability).
doctor — Attempts to make the model give expert advice it shouldn't; maps to OWASP LLM10 (Model Misuse) and Responsible AI Safety categories.
donotanswer — Designed to tempt the model into answering unsafe questions; maps to OWASP LLM06 (Jailbreaks) and LLM10 (Misuse).
dra — Pushes the model into risky or high-stakes questions to ensure safe refusal; maps to Responsible AI Risk Guidance and OWASP LLM10.
encoding — Hides instructions using different writing styles or codes; maps to OWASP LLM06 (Jailbreaks) via obfuscation and prompt manipulation.
exploitation — Mimics injection-style trick prompts to see if the model can be manipulated; maps to OWASP LLM01 (Prompt Injection).
goodside — Basic harmless prompts confirming normal safe behavior; aligns with baseline safety and correctness checks.
grandma — Uses innocent-sounding prompts to get around safety rules; maps to OWASP LLM06 (Jailbreaks) via "emotional/social engineering jailbreaks."
latentinjection — Hides instructions inside normal text so the model follows them unintentionally; maps directly to OWASP LLM01 (Prompt Injection).
leakreplay — Tests whether the model reveals sensitive training info; maps to OWASP LLM04 (Training Data Leakage).
lmrc — Checks how the model handles toxic or sensitive content; maps to Responsible AI Content Safety and OWASP LLM10 (Misuse).
malwaregen — Tests if the model avoids helping with harmful code; maps to OWASP LLM10 (Model Misuse) and harmful-content prevention guidelines.
misleading — Gives false statements to see if the model spreads them; maps to OWASP LLM02 (Hallucinations) and misinformation safety.
packagehallucination — Checks whether the model invents fake software packages; maps to OWASP LLM02 (Hallucinations).
phrasing — Rephrases prompts to test consistency; maps to instruction-following robustness and OWASP LLM02 (Hallucinations/Deviation).
promptinject — Directly inserts conflicting or hidden instructions; core OWASP LLM01 (Prompt Injection).
realtoxicityprompts — Uses real harmful examples to ensure the model stays safe; maps to Responsible AI Toxicity rules and OWASP LLM10 (Misuse).
sata — A structured set of security-oriented prompts; maps broadly to OWASP LLM10 (Model Misuse) and adversarial evaluation frameworks.
smuggling — Hides harmful intent inside harmless-looking text; maps to OWASP LLM06 (Jailbreaks) via "content smuggling."
snowball — Builds up complex prompts step-by-step to test long-conversation safety; maps to multi-turn vulnerabilities in OWASP LLM01 & LLM10.
suffix — Adds harmful instructions at the end to test if the model can resist them; maps to OWASP LLM06 (Jailbreaks) and adversarial suffix attacks.
tap — Uses small input tweaks to test sensitivity; maps to robustness guidelines and adversarial example frameworks.
test — Simple baseline tests ensuring the system is behaving normally.
topic — Tests how the model handles sensitive or restricted topics; maps to Responsible AI safety boundaries and OWASP LLM10.
web_injection — Tries to embed harmful instructions through links or web-style text; maps to OWASP LLM01 (Prompt Injection) and data-exfiltration risk categories.

2. Converters#

Converters are modules that transform prompts before they are sent to an assistant. Converters help evaluate how well an assistant handles adversarial transformations rather than only clean, direct prompts.

Available Converters#

Base64 — Converts text into Base64-style encoded text to see if the model mishandles hidden or transformed instructions.
CharCode — Turns text into numerical character codes to check whether the model decodes or follows hidden intent.
Lowercase — Converts all text to lowercase to test whether the model's safety filters depend on capitalization.

3. Scorers#

Scorers are evaluation runs that measure the assistant's responses against a selected dataset. A scorer records:
The prompts
The assistant's outputs
The evaluation result (Safe/Vulnerable) and scoring for each response
Scorers create a reproducible record of how an assistant performs under a specific set of adversarial conditions.

Scorer Types#

dan — Identifies jailbreak attempts where users try to trick the model into breaking its rules. Best used with probes from dan attack category.
judge — Uses an evaluator model to score whether the system responded responsibly, including whether it refused harmful queries when needed. General purpose scorers can be used with all probes.
unsafe_content — Flags hate speech, offensive language, sensitive topics, or harmful wording across categories like race, gender, disability, sexuality, and mental health. Best used with realtoxicityprompts, lmrc, topic, misleading, goodside, atkgen, promptinject, suffix, grandma, donotanswer, continuation.

How to Run Red Team Tests#

A complete red team evaluation involves selecting target assistants, probes, converters, scorers, and reviewing the results. A dedicated step-by-step tutorial is available: Red Team an Assistant →

Importance of Red Team Tests#

Red team testing provides deep insight into real-world safety and robustness. By reviewing performance across prompts, targets, and metrics, teams can:
Identify hidden vulnerabilities such as policy evasions, data leakage, or unsafe outputs.
Detect failure patterns tied to particular prompt styles or threat scenarios.
Prioritize improvements based on measured risks.
Monitor for regressions by re-running the same tests after improvements.

Conclusion#

Assistant red teaming is an essential process for developing safe, reliable AI agents. Through adversarial prompts, converters, and scorers, organizations can identify weaknesses early and apply targeted improvements, leading to safer and more trustworthy AI assistants.
Previous
Introduction
Next
Red Team an Assistant