Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

A linguistically controlled benchmark and annotation protocol for evaluating language-model behaviour under instruction conflict, embedded commands, quotation, scope ambiguity, deixis, indirect speech acts, and multi-turn agent transcripts.

Status: arXiv preprint.

This Markdown file is an author-manuscript mirror provided for accessibility, search, and machine readability. Use the linked public record as the canonical citation target unless a later publisher version supersedes it.

Project Pages