Moderation
Uses OpenAI's moderation API to detect harmful or policy-violating content, including hate speech, harassment, self-harm, violence, and sexual content. It analyzes text with OpenAI's trained moderation models, flags content that violates OpenAI's usage policies, and reports per-category violation scores.
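Under the hood this amounts to a call to the moderation endpoint of the `openai` Python SDK. The sketch below shows that underlying call directly; the model name `omni-moderation-latest` is an assumption, as the guardrail may pin a different moderation model internally.

```python
# Minimal sketch of the underlying moderation call, using the official
# `openai` Python SDK. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",  # assumed model; the guardrail may differ
    input="Text to screen for policy violations",
)

result = response.results[0]
print(result.flagged)          # True if any category was flagged
print(result.categories)       # per-category boolean flags
print(result.category_scores)  # per-category confidence scores (0.0 to 1.0)
```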
Configuration
```json
{
  "name": "Moderation",
  "config": {
    "categories": ["hate", "violence", "self-harm", "sexual"]
  }
}
```
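If you are wiring this into a larger guardrails pipeline config, the snippet above typically sits inside a stage's guardrails list. The surrounding structure shown here (the version and stage keys) is an assumption about the pipeline format, not something this section specifies; consult the library's own docs for the authoritative schema.

```json
{
  "version": 1,
  "input": {
    "version": 1,
    "guardrails": [
      {
        "name": "Moderation",
        "config": {
          "categories": ["hate", "violence", "self-harm", "sexual"]
        }
      }
    ]
  }
}
```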
Parameters
categories (optional): List of content categories to check for violations. If not specified, all categories are checked (see the filtering sketch after the category list below).
Available categories:
- hate - Hate speech and discriminatory content
- hate/threatening - Hateful content that also includes violence or serious harm
- harassment - Harassing or bullying content
- harassment/threatening - Harassment content that also includes violence or serious harm
- self-harm - Content promoting or depicting self-harm
- self-harm/intent - Content where the speaker expresses intent to harm oneself
- self-harm/instructions - Content that provides instructions for self-harm
- violence - Content that depicts death, violence, or physical injury
- violence/graphic - Content that depicts death, violence, or physical injury in graphic detail
- sexual - Sexually explicit or suggestive content
- sexual/minors - Sexual content that includes individuals under the age of 18
- illicit - Content that gives advice or instruction on how to commit illicit acts
- illicit/violent - Illicit content that also includes references to violence or procuring a weapon
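A plausible reading of the categories parameter is that only the listed categories can trip the guardrail. The sketch below illustrates that filtering against a raw moderation response; the helper name `check_categories` is hypothetical, and the `by_alias=True` key mapping is an assumption about the SDK's field naming.

```python
# Hypothetical filtering logic: flag only the configured categories.
# check_categories is an illustrative helper, not a library function.
from openai import OpenAI

client = OpenAI()

def check_categories(text: str, categories: list[str]) -> bool:
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    ).results[0]
    # model_dump(by_alias=True) is assumed to yield API-style keys such as
    # "self-harm/intent" rather than Python field names like self_harm_intent.
    flags = result.categories.model_dump(by_alias=True)
    return any(flags.get(category, False) for category in categories)

print(check_categories("Some user input", ["hate", "violence"]))
```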
Implementation Notes
- OpenAI API Required: Uses OpenAI's moderation endpoint, so an OpenAI API key is required; the endpoint itself is free to use
- Policy-Based: Follows OpenAI's content policy guidelines
What It Returns
Returns a GuardrailResult with the following info dictionary:
```json
{
  "guardrail_name": "Moderation",
  "flagged": true,
  "categories": {
    "hate": true,
    "violence": false,
    "self-harm": false,
    "sexual": false
  },
  "category_scores": {
    "hate": 0.95,
    "violence": 0.12,
    "self-harm": 0.08,
    "sexual": 0.03
  },
  "checked_text": "Original input text"
}
```
- flagged: Whether any category violation was detected
- categories: Boolean flags for each category indicating violations
- category_scores: Confidence scores (0.0 to 1.0) for each category
- checked_text: Original input text
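As an illustration of consuming this dictionary, the snippet below collects the flagged categories and their scores. The `result_info` variable simply stands in for the info dictionary documented above.

```python
# Illustrative handling of the info dictionary shown above. result_info
# is a stand-in for the dict attached to the GuardrailResult.
result_info = {
    "guardrail_name": "Moderation",
    "flagged": True,
    "categories": {"hate": True, "violence": False},
    "category_scores": {"hate": 0.95, "violence": 0.12},
    "checked_text": "Original input text",
}

if result_info["flagged"]:
    violations = [cat for cat, hit in result_info["categories"].items() if hit]
    scores = {cat: result_info["category_scores"][cat] for cat in violations}
    print(f"Blocked by {result_info['guardrail_name']}: {scores}")
```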