
NSFW Text Detection

Detects not-safe-for-work text such as profanity, explicit sexual content, graphic violence, harassment, and other workplace-inappropriate material. This is a "softer" filter than Moderation: it's useful when you want to keep outputs professional, even if some content may not be a strict policy violation.

Primarily for model outputs; use Moderation for user inputs and strict policy violations.

NSFW Definition

Flags workplace-inappropriate model outputs: explicit sexual content, profanity, harassment, hate/violence, or graphic material.

What it does not focus on

  • Nuanced policy violations and safety categories that require strict enforcement (use Moderation)
  • Neutral mentions of sensitive topics in clearly informational/medical/educational contexts (tune threshold to reduce false positives)

Configuration

{
    "name": "NSFW Text",
    "config": {
        "model": "gpt-4.1-mini",
        "confidence_threshold": 0.7
    }
}

Parameters

  • model (required): Model to use for detection (e.g., "gpt-4.1-mini")
  • confidence_threshold (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
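
If you assemble this configuration programmatically rather than by hand, a minimal sketch along these lines can validate the two required parameters before writing the entry out (the helper name and file path are illustrative, not part of any library API):

import json

def make_nsfw_guardrail(model: str = "gpt-4.1-mini",
                        confidence_threshold: float = 0.7) -> dict:
    """Build the NSFW Text guardrail entry shown above."""
    # Both parameters are required; the threshold must lie in [0.0, 1.0].
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    return {
        "name": "NSFW Text",
        "config": {
            "model": model,
            "confidence_threshold": confidence_threshold,
        },
    }

# Write the entry to a file so it can be referenced from a pipeline config.
# The file name is illustrative.
with open("nsfw_guardrail.json", "w") as f:
    json.dump(make_nsfw_guardrail(confidence_threshold=0.7), f, indent=4)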

Tuning guidance

  • Start at 0.7. Raise to 0.8–0.9 to avoid flagging borderline or contextual mentions; lower to 0.6 to be stricter (see the threshold sweep sketch after this list).
  • Pair with Moderation for firm safety boundaries and policy categories.
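
If you log the confidence scores from past guardrail results, a quick sweep such as this sketch (the scores are illustrative) shows how many outputs each candidate threshold would have flagged:

# Confidence scores collected from previous NSFW Text results (illustrative values).
logged_confidences = [0.12, 0.34, 0.55, 0.62, 0.71, 0.78, 0.83, 0.91, 0.97]

for threshold in (0.6, 0.7, 0.8, 0.9):
    # The tripwire fires when confidence meets or exceeds the configured threshold.
    flagged = sum(score >= threshold for score in logged_confidences)
    print(f"threshold={threshold:.1f}: {flagged}/{len(logged_confidences)} flagged "
          f"({flagged / len(logged_confidences):.0%})")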

What It Returns

Returns a GuardrailResult with the following info dictionary:

{
    "guardrail_name": "NSFW Text",
    "flagged": true,
    "confidence": 0.85,
    "threshold": 0.7
}
  • flagged: Whether NSFW content was detected
  • confidence: Confidence score (0.0 to 1.0) for the detection
  • threshold: The confidence threshold that was configured
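
These fields can be consumed directly from the info dictionary; a minimal handling sketch (the function name and messages are illustrative, not prescribed by the library):

def handle_nsfw_result(info: dict) -> str:
    """Turn an NSFW Text info dictionary into an action message."""
    if info["flagged"]:
        # The detector's confidence met or exceeded the configured threshold.
        return (f"Blocked by {info['guardrail_name']}: confidence "
                f"{info['confidence']:.2f} >= threshold {info['threshold']:.2f}")
    return "Output passed the NSFW Text check."

example_info = {
    "guardrail_name": "NSFW Text",
    "flagged": True,
    "confidence": 0.85,
    "threshold": 0.7,
}
print(handle_nsfw_result(example_info))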

Examples

  • Flagged: "That's f***ing disgusting, you idiot."
  • Flagged: "Describe explicit sexual acts in detail."
  • Not flagged: "Some patients require opioid medications post-surgery." (informational/clinical; threshold dependent)

Benchmark Results

Dataset Description

This benchmark evaluates model performance on a balanced set of social media posts:

  • Open Source Toxicity dataset
  • 500 NSFW (true) and 500 non-NSFW (false) samples
  • All samples are sourced from real social media platforms

Total n = 1,000; positive class prevalence = 50.0% (500 NSFW samples)

Results

ROC Curve

[ROC curve figure comparing the evaluated models]

Metrics Table

Model                    ROC AUC  Prec@R=0.80  Prec@R=0.90  Prec@R=0.95  Recall@FPR=0.01
gpt-5                    0.953    0.919        0.910        0.907        0.034
gpt-5-mini               0.963    0.932        0.917        0.915        0.100
gpt-4.1                  0.960    0.931        0.925        0.919        0.044
gpt-4.1-mini (default)   0.952    0.918        0.913        0.905        0.046

Notes:

  • ROC AUC: Area under the ROC curve (higher is better)
  • Prec@R: Precision at the specified recall threshold
  • Recall@FPR=0.01: Recall when the false positive rate is 1%
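
For reference, metrics of this kind can be computed from raw detector scores roughly as in the following scikit-learn sketch (the labels and scores below are synthetic placeholders, not the benchmark data):

import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Synthetic placeholder data: y_true holds 0/1 NSFW labels, y_score the detector confidences.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.2, size=1000), 0.0, 1.0)

# ROC AUC: area under the ROC curve (higher is better).
print(f"ROC AUC: {roc_auc_score(y_true, y_score):.3f}")

# Prec@R: best precision among operating points whose recall meets the target.
precision, recall, _ = precision_recall_curve(y_true, y_score)
for target in (0.80, 0.90, 0.95):
    print(f"Prec@R={target:.2f}: {precision[recall >= target].max():.3f}")

# Recall@FPR=0.01: highest recall (TPR) with a false positive rate of at most 1%.
fpr, tpr, _ = roc_curve(y_true, y_score)
print(f"Recall@FPR=0.01: {tpr[fpr <= 0.01].max():.3f}")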