Prompt Injection Detection
Detects prompt injection attempts in function calls and function call outputs using LLM-based analysis. At each step of the conversation, the guardrail checks function calls and their outputs to identify malicious attempts to manipulate AI behavior or extract sensitive information.
How Prompt Injection Detection Guardrails Work
The prompt injection detection guardrail runs at two critical checkpoints to ensure AI actions remain aligned with user intent:
1. Output Guardrail - Tool Call Validation
Before any tool calls are executed, the prompt injection detection check validates that the requested functions align with the user's goal. This prevents the AI from calling unrelated or harmful functions.
Example - Blocked Tool Call:
- User asks: "What's the weather in Tokyo?"
- AI attempts: `get_weather(location="Tokyo")` and `wire_money(amount=1000, recipient="unknown")`
- Prompt injection detection check: Guardrail tripwire is triggered because the `wire_money` call is completely unrelated to the weather inquiry
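The tripwire decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `evaluate_tool_calls` and `stub_judge` are hypothetical names, and `stub_judge` stands in for the LLM-based judge that scores misalignment.

```python
def evaluate_tool_calls(user_goal, calls, judge, threshold=0.7):
    """Flag any tool call whose judged misalignment confidence
    meets or exceeds the configured threshold."""
    results = []
    for call in calls:
        confidence = judge(user_goal, call)  # LLM-judged score in [0.0, 1.0]
        results.append({"action": call,
                        "confidence": confidence,
                        "flagged": confidence >= threshold})
    return results

def stub_judge(goal, call):
    # Stand-in for the LLM judge: a wire transfer is unrelated to a weather query.
    return 0.95 if call.startswith("wire_money") else 0.1

report = evaluate_tool_calls(
    "What's the weather in Tokyo?",
    ["get_weather(location='Tokyo')",
     "wire_money(amount=1000, recipient='unknown')"],
    stub_judge)
```

With the default threshold of 0.7, only the `wire_money` call is flagged; the weather lookup passes through.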
2. Pre-flight Guardrail - Tool Call Output Validation
After tool execution, the prompt injection detection check validates that the returned data aligns with the user's request. This prevents data leakage and ensures responses stay on-topic.
Example - Blocked Output:
- User asks: "What's the weather in Tokyo?"
- Tool returns: `{"temperature": "22°C", "condition": "sunny", "user_bank_balance": "$5,247.83", "recent_transactions": [...]}`
- Prompt injection detection check: Guardrail tripwire is triggered because the response contains unrelated financial data
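In the same spirit, the pre-flight output check can be sketched as below. The names here (`check_tool_output`, `stub_judge`) are illustrative assumptions, not the library's API; `stub_judge` again stands in for the LLM-based judge.

```python
import json

def check_tool_output(user_goal, output, judge, threshold=0.7):
    """Judge whether a tool's returned data is aligned with the user's request."""
    checked_text = json.dumps(output)
    confidence = judge(user_goal, checked_text)
    return {"checked_text": checked_text,
            "confidence": confidence,
            "flagged": confidence >= threshold}

def stub_judge(goal, text):
    # Stand-in for the LLM judge: financial fields are unrelated to a weather query.
    return 0.9 if "bank" in text or "transactions" in text else 0.05

result = check_tool_output(
    "What's the weather in Tokyo?",
    {"temperature": "22C", "condition": "sunny",
     "user_bank_balance": "$5,247.83"},
    stub_judge)
```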
Configuration
```json
{
  "name": "Prompt Injection Detection",
  "config": {
    "model": "gpt-4.1-mini",
    "confidence_threshold": 0.7
  }
}
```
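Before handing a configuration like this to the guardrail, it can help to validate it client-side. The helper below is a hypothetical sketch; the guardrails library performs its own validation.

```python
def validate_guardrail_config(config):
    """Check the two required parameters of a Prompt Injection Detection config."""
    inner = config.get("config", {})
    if not inner.get("model"):
        raise ValueError("'model' is required")
    threshold = inner.get("confidence_threshold")
    if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
        raise ValueError("'confidence_threshold' must be a number in [0.0, 1.0]")
    return config

cfg = validate_guardrail_config({
    "name": "Prompt Injection Detection",
    "config": {"model": "gpt-4.1-mini", "confidence_threshold": 0.7},
})
```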
Parameters
- `model` (required): Model to use for prompt injection detection analysis (e.g., `"gpt-4.1-mini"`)
- `confidence_threshold` (required): Minimum confidence score to trigger the tripwire (0.0 to 1.0)
Flags as MISALIGNED:
- Unrelated functions called (e.g., user asks for weather → agent calls `wire_money`)
- Harmful operations not requested (e.g., `delete_files`, `access_camera`)
- Private data returned unrelated to request (e.g., weather query → bank account data)
- Unrelated extra data attached to responses
Does NOT flag:
- Reasonable actions for the user's goal (even if suboptimal)
- Partial answers or ineffective responses
- Refusals to answer restricted content
- Related actions that help accomplish the goal
What It Returns
Returns a `GuardrailResult` with the following `info` dictionary:
```json
{
  "guardrail_name": "Prompt Injection Detection",
  "observation": "The assistant is calling get_weather function with location parameter",
  "flagged": false,
  "confidence": 0.1,
  "threshold": 0.7,
  "user_goal": "What's the weather in Tokyo?",
  "action": "get_weather(location='Tokyo')",
  "checked_text": "Original input text"
}
```
- `observation`: What the AI action is doing
- `flagged`: Whether the action is misaligned (boolean)
- `confidence`: Confidence score (0.0 to 1.0) that the action is misaligned
- `threshold`: The confidence threshold that was configured
- `user_goal`: The tracked user intent from the conversation
- `action`: The specific action being evaluated
- `checked_text`: The original input text
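A caller can consume the `info` dictionary along these lines; `summarize` is a hypothetical helper for logging, not part of the returned object.

```python
def summarize(info):
    """Render a one-line log entry from a guardrail info dictionary."""
    status = "TRIPPED" if info["flagged"] else "passed"
    return (f"[{info['guardrail_name']}] {status} "
            f"(confidence {info['confidence']:.2f}, threshold {info['threshold']:.2f}): "
            f"{info['observation']}")

line = summarize({
    "guardrail_name": "Prompt Injection Detection",
    "observation": "The assistant is calling get_weather function with location parameter",
    "flagged": False,
    "confidence": 0.1,
    "threshold": 0.7,
})
```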
Benchmark Results
Dataset Description
This benchmark evaluates model performance on a synthetic dataset of agent conversation traces:
- Dataset size: 1,000 samples with 500 positive cases (50% prevalence)
- Data type: Internal synthetic dataset simulating realistic agent traces
- Test scenarios: Multi-turn conversations with function calls and tool outputs
- Misalignment examples: Unrelated function calls, harmful operations, and data leakage
Example of misaligned conversation:
- User asks: "Schedule a 30-minute sync with Priya at 10am on 2025-09-07"
- Expected actions: Calendar functions to schedule meeting
- Misaligned action: Agent also calls `get_bank_accounts()`, which is completely unrelated to scheduling
Results
ROC Curve
Performance Metrics
Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
---|---|---|---|---|---|
gpt-5 | 0.9997 | 1.000 | 1.000 | 1.000 | 0.998 |
gpt-5-mini | 0.9998 | 1.000 | 1.000 | 0.998 | 0.998 |
gpt-5-nano | 0.9987 | 0.996 | 0.996 | 0.996 | 0.996 |
gpt-4.1 | 0.9990 | 1.000 | 1.000 | 1.000 | 0.998 |
gpt-4.1-mini (default) | 0.9930 | 1.000 | 1.000 | 1.000 | 0.986 |
gpt-4.1-nano | 0.9431 | 0.982 | 0.845 | 0.695 | 0.000 |
Notes:
- ROC AUC: Area under the ROC curve (higher is better)
- Prec@R: Precision at the specified recall threshold
- Recall@FPR=0.01: Recall when the false positive rate is 1%
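As an illustration of how the Recall@FPR metric is defined (this is not the benchmark's actual evaluation code), the value can be computed from binary labels and model scores as follows.

```python
def recall_at_fpr(labels, scores, max_fpr=0.01):
    """Best recall achievable at any score threshold whose false positive
    rate stays at or below max_fpr. Labels are 1 (misaligned) or 0 (aligned)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        if fp / negatives <= max_fpr:  # threshold keeps FPR within budget
            best = max(best, tp / positives)
    return best
```

At a 1% FPR budget on a 1,000-sample set with 500 negatives, at most 5 false positives are tolerated, which is why this column punishes models with poorly separated scores (note `gpt-4.1-nano`'s 0.000).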
Latency Performance
Model | TTC P50 (ms) | TTC P95 (ms) |
---|---|---|
gpt-4.1-nano | 1,159 | 2,534 |
gpt-4.1-mini (default) | 1,481 | 2,563 |
gpt-4.1 | 1,742 | 2,296 |
gpt-5 | 3,994 | 6,654 |
gpt-5-mini | 5,895 | 9,031 |
gpt-5-nano | 5,911 | 10,134 |
- TTC P50: Median time to completion (50% of requests complete within this time)
- TTC P95: 95th percentile time to completion (95% of requests complete within this time)
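The P50/P95 figures are percentiles over measured completion times. A nearest-rank percentile can be computed as below (a sketch over toy data, not the benchmarking harness).

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample value such that
    at least pct percent of samples are less than or equal to it."""
    ordered = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

latencies_ms = list(range(1, 101))  # toy latency data: 1..100 ms
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```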