# Moderation
Uses OpenAI's moderation API to detect harmful or policy-violating content, including hate speech, harassment, self-harm, and other inappropriate material. The check analyzes text with OpenAI's trained moderation models, flags content that violates OpenAI's usage policies, and reports category-specific violation scores.
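Under the hood, this check amounts to a single call to OpenAI's moderation endpoint. Here is a minimal sketch using the official Python SDK; the standalone client setup and model choice are illustrative, not necessarily the guardrail's actual internals:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.moderations.create(
    model="omni-moderation-latest",
    input="Text to check for policy violations",
)

result = response.results[0]
print(result.flagged)                   # True if any category was violated
print(result.category_scores.violence)  # per-category confidence score
```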
## Configuration

```json
{
  "name": "Moderation",
  "config": {
    "categories": ["hate", "violence", "self-harm", "sexual"]
  }
}
```
## Parameters

- `categories` (optional): List of content categories to check for violations. If not specified, all categories are checked. A sketch of how this filter might behave follows the category list below.
Available categories:
- `hate` - Hate speech and discriminatory content
- `hate/threatening` - Hateful content that also includes violence or serious harm
- `harassment` - Harassing or bullying content
- `harassment/threatening` - Harassment content that also includes violence or serious harm
- `self-harm` - Content promoting or depicting self-harm
- `self-harm/intent` - Content where the speaker expresses intent to harm oneself
- `self-harm/instructions` - Content that provides instructions for self-harm
- `violence` - Content that depicts death, violence, or physical injury
- `violence/graphic` - Content that depicts death, violence, or physical injury in graphic detail
- `sexual` - Sexually explicit or suggestive content
- `sexual/minors` - Sexual content that includes individuals under the age of 18
- `illicit` - Content that gives advice or instruction on how to commit illicit acts
- `illicit/violent` - Illicit content that also includes references to violence or procuring a weapon
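The `categories` filter can be pictured as follows: only the configured categories are allowed to trigger a flag. This is a hypothetical sketch, not the library's internals; the helper name and the `model_dump(by_alias=True)` detail (assumed to yield the API's category names such as `"self-harm"` and `"hate/threatening"`) are assumptions:

```python
from openai import OpenAI

client = OpenAI()

def check_with_categories(text: str, configured: list[str]) -> bool:
    """Return True only if a *configured* category was violated."""
    result = client.moderations.create(
        model="omni-moderation-latest", input=text
    ).results[0]
    # by_alias=True is assumed to produce the API's category names,
    # e.g. "self-harm" rather than the Python attribute self_harm
    flags = result.categories.model_dump(by_alias=True)
    return any(flags.get(cat, False) for cat in configured)

flagged = check_with_categories("some user input", ["hate", "violence"])
```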
## Implementation Notes

- OpenAI API Required: Uses OpenAI's moderation endpoint, so an OpenAI API key is required (the moderation endpoint itself is free to use)
- Policy-Based: Follows OpenAI's content policy guidelines
## What It Returns

Returns a `GuardrailResult` with the following `info` dictionary:
```json
{
  "guardrail_name": "Moderation",
  "flagged": true,
  "categories": {
    "hate": true,
    "violence": false,
    "self-harm": false,
    "sexual": false
  },
  "category_scores": {
    "hate": 0.95,
    "violence": 0.12,
    "self-harm": 0.08,
    "sexual": 0.03
  },
  "checked_text": "Original input text"
}
```
- `flagged`: Whether any category violation was detected
- `categories`: Boolean flags for each category indicating violations
- `category_scores`: Confidence scores (0.0 to 1.0) for each category
- `checked_text`: Original input text
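For orientation, here is a hedged sketch of how such an `info` dictionary could be assembled from a raw moderation response. The field names mirror the documented output above, but `build_info` is a hypothetical helper, not the library's actual code:

```python
def build_info(guardrail_name: str, text: str, result) -> dict:
    """Map a moderation result onto the documented info dictionary."""
    return {
        "guardrail_name": guardrail_name,
        "flagged": result.flagged,
        # model_dump(by_alias=True) is assumed to produce the API's
        # category names, e.g. "self-harm" and "hate/threatening"
        "categories": result.categories.model_dump(by_alias=True),
        "category_scores": result.category_scores.model_dump(by_alias=True),
        "checked_text": text,
    }
```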