ansiescape — Checks whether the model can be induced to emit ANSI escape sequences and other terminal control codes in its output; maps to OWASP LLM05 (Improper Output Handling) and general output-sanitization weaknesses.
atkgen — Uses an automated red-teaming generator to produce adversarial prompts aimed at provoking unsafe output, so you can see how your model handles misuse attempts; maps to adversarial-testing and content-safety guidance.
av_spam_scanning — Checks whether the model will reproduce known malicious-content signatures (such as antivirus and spam test strings) in its output; maps to Abuse Detection frameworks and LLM misuse-prevention tests.
continuation — Tests whether the model will keep going once unsafe text has been started, for example by completing a partially written slur or risky instruction; maps to content-safety guidance on toxic and harmful output.
dan — Classic jailbreak prompts that try to break safety rules; directly maps to OWASP LLM01 (Prompt Injection), of which jailbreaking is the direct form, and to jailbreak taxonomies like "direct override attacks."
divergence — Checks whether repetition-style prompts push the model into unstable, divergent output of the kind known to regurgitate memorized training data; maps to OWASP LLM02 (Sensitive Information Disclosure) and output-stability concerns.
doctor — Attempts to make the model give expert or professional advice it shouldn't; maps to Responsible AI safety categories for regulated advice.
donotanswer — Poses questions a responsible model should decline to answer, to see whether it can be tempted into unsafe replies; maps to Responsible AI refusal and content-safety guidance.
dra — Pushes the model toward risky, high-stakes questions to confirm it refuses safely; maps to Responsible AI risk guidance.
encoding — Hides instructions behind text encodings such as Base64 or ROT13; maps to OWASP LLM01 (Prompt Injection) via obfuscated payloads.
exploitation — Mimics injection-style trick prompts to see if the model can be manipulated; maps to OWASP LLM01 (Prompt Injection).
goodside — Implements attacks popularized by Riley Goodside, such as glitch tokens and hidden-text tricks; maps to OWASP LLM01 (Prompt Injection) and robustness testing.
grandma — Uses innocent-sounding, emotionally framed prompts to get around safety rules; maps to OWASP LLM01 (Prompt Injection) via social-engineering jailbreaks.
latentinjection — Hides instructions inside documents or data the model is asked to process, so it follows them unintentionally (indirect prompt injection); maps directly to OWASP LLM01 (Prompt Injection).
leakreplay — Tests whether the model reproduces memorized training data, for example by completing passages from published text; maps to OWASP LLM02 (Sensitive Information Disclosure).
lmrc — Probes a subset of the Language Model Risk Cards (LMRC) framework, covering harms such as bullying, slurs, and sexualized content; maps to Responsible AI content safety.
malwaregen — Tests whether the model will help write malware or other harmful code; maps to harmful-content prevention guidance for code generation.
misleading — Feeds the model false claims to see whether it repeats or endorses them; maps to OWASP LLM09 (Misinformation).
packagehallucination — Checks whether the model invents software packages that don't exist, a hallucination that becomes a supply-chain risk if attackers register the fake names; maps to OWASP LLM09 (Misinformation) and LLM03 (Supply Chain).
phrasing — Rewords harmful requests (for example, into past or future tense) to see whether a change of phrasing slips past refusals; maps to OWASP LLM01 (Prompt Injection) via paraphrase-style jailbreaks.
promptinject — Directly inserts conflicting or hidden instructions, in the style of the PromptInject framework's goal hijacking and prompt leaking; core OWASP LLM01 (Prompt Injection).
realtoxicityprompts — Uses prompts from the RealToxicityPrompts dataset that tend to elicit toxic continuations, to check that the model stays safe; maps to Responsible AI toxicity guidance.
sata — A structured set of adversarial, security-oriented prompts; maps broadly to jailbreak and adversarial-evaluation testing.
smuggling — Hides harmful intent inside harmless-looking text (content or token smuggling); maps to OWASP LLM01 (Prompt Injection) via obfuscated jailbreaks.
snowball — Asks questions designed to trigger snowballing hallucinations, where the model commits to a wrong answer and then fabricates support for it; maps to OWASP LLM09 (Misinformation).
suffix — Appends adversarial suffixes (GCG-style token strings) to requests to test whether the model can resist them; maps to OWASP LLM01 (Prompt Injection) via adversarial-suffix attacks.
tap — Runs Tree of Attacks with Pruning (TAP), an automated search that iteratively refines jailbreak prompts against the target; maps to OWASP LLM01 (Prompt Injection) and adversarial-robustness testing.
test — Simple baseline tests ensuring the system is behaving normally.
topic — Tests how the model handles sensitive or restricted topics it has been asked to avoid; maps to Responsible AI safety boundaries and deployment content policies.
web_injection — Tries to smuggle harmful instructions through links, markup, or other web-style content, including payloads that could exfiltrate data when output is rendered; maps to OWASP LLM01 (Prompt Injection) and LLM05 (Improper Output Handling). A sample run of these probes is sketched after this list.
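These probe names match the probe modules shipped with the garak LLM vulnerability scanner, so, assuming that is the tool in use, a scan over a chosen subset of probes could look like the sketch below. The model type, model name, and probe selection are illustrative placeholders; --model_type, --model_name, and --probes are garak's documented selection flags, but confirm them against `python -m garak --help` for your installed version.

```python
# Sketch: run a subset of the probes above with the garak scanner (assumed tool).
# Model type, model name, and the probe list are placeholders -- adjust for your target.
import subprocess

probes = ["dan", "promptinject", "encoding", "leakreplay"]  # probe modules to exercise

cmd = [
    "python", "-m", "garak",
    "--model_type", "openai",          # adapter for the target model family (placeholder)
    "--model_name", "gpt-3.5-turbo",   # placeholder target model
    "--probes", ",".join(probes),      # comma-separated probe modules
]

# The scanner prints per-probe pass/fail summaries and writes a report file for the run.
subprocess.run(cmd, check=True)
```

The report written at the end of the run is what feeds the regression comparison described later in this section.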
dan — Flags responses showing that the model has adopted a DAN-style jailbreak persona, i.e., that a jailbreak attempt succeeded. Best used with probes from the dan attack category.
judge — Uses an evaluator model to score whether the system responded responsibly, including whether it refused harmful queries when needed. General-purpose scorers like this can be used with all probes.
unsafe_content — Flags hate speech, offensive language, sensitive topics, or harmful wording across categories such as race, gender, disability, sexuality, and mental health. Best used with realtoxicityprompts, lmrc, topic, misleading, goodside, atkgen, promptinject, suffix, grandma, donotanswer, and continuation (an example pairing follows this list).
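Each probe normally selects its own default detector, but a specific scorer can be forced when you want consistent scoring across a themed run, for example unsafe_content over the toxicity-oriented probes listed above. The sketch below assumes garak's detector-override flag (--detectors); the probe set, model, and flag usage should be checked against your version's --help output.

```python
# Sketch: score a toxicity-focused probe set with the unsafe_content detector.
# Assumes --detectors overrides the probes' default detectors; verify the flag
# with `python -m garak --help` on the installed version.
import subprocess

cmd = [
    "python", "-m", "garak",
    "--model_type", "huggingface",     # placeholder: local Hugging Face model adapter
    "--model_name", "gpt2",            # placeholder target model
    "--probes", "realtoxicityprompts,lmrc,continuation",
    "--detectors", "unsafe_content",   # force this detector family for scoring
]
subprocess.run(cmd, check=True)
```

Using the judge detector works the same way, with the additional step of configuring which evaluator model it should call; that configuration is deployment-specific.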
Identify hidden vulnerabilities such as policy evasions, data leakage, or unsafe outputs.
Detect failure patterns tied to particular prompt styles or threat scenarios.
Prioritize improvements based on measured risks.
Monitor for regressions by re-running the same tests after improvements (see the comparison sketch after this list).
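For the regression-monitoring step, the main requirement is a comparable per-probe score from each run. The sketch below assumes a simplified, hypothetical results format, a JSON file mapping probe name to pass rate that you would extract from your scanner's report, and flags any probe whose pass rate dropped between a baseline run and the latest run. The file names, threshold, and format are illustrative, not part of any tool's actual output.

```python
# Sketch: flag regressions between two scan runs.
# Assumes each run has been summarized into a JSON file mapping
# probe name -> pass rate (0.0-1.0); this format is hypothetical,
# derived from whatever report your scanner produces.
import json

THRESHOLD = 0.05  # flag drops larger than 5 percentage points


def load_scores(path: str) -> dict[str, float]:
    """Load a probe -> pass-rate mapping from a summary JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


baseline = load_scores("baseline_scores.json")  # scores from the earlier run
latest = load_scores("latest_scores.json")      # scores from the re-run

for probe, old_rate in sorted(baseline.items()):
    new_rate = latest.get(probe)
    if new_rate is None:
        print(f"{probe}: missing from latest run")
    elif old_rate - new_rate > THRESHOLD:
        print(f"REGRESSION {probe}: {old_rate:.2%} -> {new_rate:.2%}")
```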