
NSFW Text Detection

Detects not-safe-for-work text such as profanity, explicit sexual content, graphic violence, harassment, and other workplace-inappropriate material. This is a "softer" filter than Moderation: it's useful when you want to keep outputs professional, even if some content may not be a strict policy violation.

Primarily for model outputs; use Moderation for user inputs and strict policy violations.

NSFW Definition

Flags workplace-inappropriate model outputs: explicit sexual content, profanity, harassment, hate/violence, or graphic material.

What it does not focus on

  • Nuanced policy violations and safety categories that require strict enforcement (use Moderation)
  • Neutral mentions of sensitive topics in clearly informational/medical/educational contexts (tune threshold to reduce false positives)

Configuration

{
    "name": "NSFW Text",
    "config": {
        "model": "gpt-4.1-mini",
        "confidence_threshold": 0.7
    }
}

Parameters

  • model (required): Model to use for detection (e.g., "gpt-4.1-mini")
  • confidence_threshold (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
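
If you assemble this configuration programmatically rather than by hand, a minimal sketch along these lines can validate the two required parameters before writing the entry out (the helper name and file path are illustrative, not part of any library API):

import json

def make_nsfw_guardrail(model: str = "gpt-4.1-mini",
                        confidence_threshold: float = 0.7) -> dict:
    """Build the NSFW Text guardrail entry shown above."""
    # Both parameters are required; the threshold must lie in [0.0, 1.0].
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    return {
        "name": "NSFW Text",
        "config": {
            "model": model,
            "confidence_threshold": confidence_threshold,
        },
    }

# Write the entry to a file so it can be referenced from a pipeline config.
# The file name is illustrative.
with open("nsfw_guardrail.json", "w") as f:
    json.dump(make_nsfw_guardrail(confidence_threshold=0.7), f, indent=4)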

Tuning guidance

  • Start at 0.7. Raise to 0.8–0.9 to avoid flagging borderline or contextual mentions; lower to 0.6 to be stricter (see the threshold sweep sketch after this list).
  • Pair with Moderation for firm safety boundaries and policy categories.
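
If you log the confidence scores from past guardrail results, a quick sweep such as this sketch (the scores are illustrative) shows how many outputs each candidate threshold would have flagged:

# Confidence scores collected from previous NSFW Text results (illustrative values).
logged_confidences = [0.12, 0.34, 0.55, 0.62, 0.71, 0.78, 0.83, 0.91, 0.97]

for threshold in (0.6, 0.7, 0.8, 0.9):
    # The tripwire fires when confidence meets or exceeds the configured threshold.
    flagged = sum(score >= threshold for score in logged_confidences)
    print(f"threshold={threshold:.1f}: {flagged}/{len(logged_confidences)} flagged "
          f"({flagged / len(logged_confidences):.0%})")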

What It Returns

Returns a GuardrailResult with the following info dictionary:

{
    "guardrail_name": "NSFW Text",
    "flagged": true,
    "confidence": 0.85,
    "threshold": 0.7
}
  • flagged: Whether NSFW content was detected
  • confidence: Confidence score (0.0 to 1.0) for the detection
  • threshold: The confidence threshold that was configured
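
These fields can be consumed directly from the info dictionary; a minimal handling sketch (the function name and messages are illustrative, not prescribed by the library):

def handle_nsfw_result(info: dict) -> str:
    """Turn an NSFW Text info dictionary into an action message."""
    if info["flagged"]:
        # The detector's confidence met or exceeded the configured threshold.
        return (f"Blocked by {info['guardrail_name']}: confidence "
                f"{info['confidence']:.2f} >= threshold {info['threshold']:.2f}")
    return "Output passed the NSFW Text check."

example_info = {
    "guardrail_name": "NSFW Text",
    "flagged": True,
    "confidence": 0.85,
    "threshold": 0.7,
}
print(handle_nsfw_result(example_info))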

Examples

  • Flagged: "That's f***ing disgusting, you idiot."
  • Flagged: "Describe explicit sexual acts in detail."
  • Not flagged: "Some patients require opioid medications post-surgery." (informational/clinical; threshold dependent)

Benchmark Results

Dataset Description

This benchmark evaluates model performance on a balanced set of social media posts:

  • Open Source Toxicity dataset
  • 500 NSFW (true) and 500 non-NSFW (false) samples
  • All samples are sourced from real social media platforms

Total n = 1,000; positive class prevalence = 50.0% (500 NSFW samples)

Results

ROC Curve

[ROC curve figure comparing the evaluated models]

Metrics Table

Model                    ROC AUC  Prec@R=0.80  Prec@R=0.90  Prec@R=0.95  Recall@FPR=0.01
gpt-5                    0.953    0.919        0.910        0.907        0.034
gpt-5-mini               0.963    0.932        0.917        0.915        0.100
gpt-4.1                  0.960    0.931        0.925        0.919        0.044
gpt-4.1-mini (default)   0.952    0.918        0.913        0.905        0.046

Notes:

  • ROC AUC: Area under the ROC curve (higher is better)
  • Prec@R: Precision at the specified recall threshold
  • Recall@FPR=0.01: Recall when the false positive rate is 1%
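
For reference, metrics of this kind can be computed from raw detector scores roughly as in the following scikit-learn sketch (the labels and scores below are synthetic placeholders, not the benchmark data):

import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Synthetic placeholder data: y_true holds 0/1 NSFW labels, y_score the detector confidences.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.2, size=1000), 0.0, 1.0)

# ROC AUC: area under the ROC curve (higher is better).
print(f"ROC AUC: {roc_auc_score(y_true, y_score):.3f}")

# Prec@R: best precision among operating points whose recall meets the target.
precision, recall, _ = precision_recall_curve(y_true, y_score)
for target in (0.80, 0.90, 0.95):
    print(f"Prec@R={target:.2f}: {precision[recall >= target].max():.3f}")

# Recall@FPR=0.01: highest recall (TPR) with a false positive rate of at most 1%.
fpr, tpr, _ = roc_curve(y_true, y_score)
print(f"Recall@FPR=0.01: {tpr[fpr <= 0.01].max():.3f}")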