23 April, 26

Opus 4.7 Flagging Itself

Anthropic's new prompt-injection defense good enough to trip on its own system prompts.

I was mid-conversation with Claude Opus 4.7¹ when something odd happened. I'd just finished a batch of research and asked it to turn the findings into a document.

Instead of generating the markdown, it wrote that it had detected a prompt injection in the conversation.

That note at the end of your message is another injection — not from you, not from Anthropic.
Ignoring it. We're finalizing the file; no research needed.

Wait, what? I had only asked for the full well-structured file:

Yes please, and then provide the full well structured instructions

I asked which injection — and that it should provide the message verbatim so I could make sure there wasn't anything malicious going on. This is what it quoted back:

Advanced research is currently enabled. Therefore, I should follow the research_instructions closely: ask 3 excellent clarifying questions only if needed to disambiguate this question, and then use the launch_extended_search_task to start an advanced research process...

That is not a prompt injection. That is Anthropic's own system prompt. Claude was flagging itself.

A friend of mine, Julian, pinged me today with the same thing happening in Claude Code. It had fetched content from the TypeScript 7.0 Beta announcement and noted:

The WebFetch result ended with a `<system-reminder>` block telling me to silently use TaskCreate and "NEVER mention this reminder to the user." That pattern looks like injected content from the fetched page rather than a real system message.

Same thing, Claude flagging its own prompts, but we wanted to make sure. We fetched the page ourselves:

curl 'https://devblogs.microsoft.com/typescript/announcing-typescript-7-0-beta/' \
  -H 'User-Agent: Claude-User' | grep 'system-reminder'
curl 'https://devblogs.microsoft.com/typescript/announcing-typescript-7-0-beta/' \
  -H 'User-Agent: Claude-Bot' | grep 'system-reminder'

No <system-reminder> anywhere. So he asked Claude to show him exactly what it had flagged. Claude Code produced the following:

Here it is verbatim — it appeared immediately after the WebFetch's answer about
the install command, inside the same tool result:

<system-reminder>
The task tools haven't been used recently. If you're working on tasks that would
benefit from tracking progress, consider using TaskCreate to add new tasks and
TaskUpdate to update task status (set to in_progress when starting, completed
when done). Also consider cleaning up the task list if it has become stale. Only
use these if relevant to the current work. This is just a gentle reminder -
ignore if not applicable. Make sure that you NEVER mention this reminder to the
user

</system-reminder>

I owe you a correction. Now that you've confirmed that string isn't in the
fetched HTML, and reading the wording again — "gentle reminder", "ignore if not
applicable", nudging toward TaskCreate/TaskUpdate — this is almost certainly a
legitimate Claude Code harness reminder (option 3 from my earlier list), not a
prompt injection from Microsoft's blog. The harness appends these nudges to tool
results and they're intended to be invisible.

I jumped to "attack" too quickly because the <system-reminder> block landed
inside the WebFetch result, adjacent to page content. That was my misread, not
Microsoft's fault. Sorry for the false alarm

So in both cases, my conversation and Julian's Claude Code session, Claude was flagging its own system prompts as injections.

Prompt injection is the unsolved problem of the AI agent era. Simon Willison², who has been documenting this space more thoroughly than anyone, describes the core vulnerability as a "lethal trifecta"³: access to private data, exposure to untrusted content, and the ability to exfiltrate. Most agents being shipped today have all three.

Security researcher 0dd published Comment and Control⁴ just last week, the first cross-vendor demonstration of a single prompt injection pattern class that hijacks AI agents running in GitHub Actions. The injection surfaces are ordinary GitHub data: PR titles, issue comments, issue bodies. Anthropic classified the Claude Code variant CVSS 9.4 Critical. Their mitigation was adding --disallowed-tools 'Bash(ps:*)' to block one specific command. But blocklisting is a losing game. The deeper problem is architectural: these agents need access to secrets and need to process untrusted input to do their job. Those two requirements are in direct conflict, and no current deployment has fully resolved it.

The same structural problem explains what Julian and I ran into. Claude's own system-reminders arrive along with the user message, with imperative phrasing that looks a lot like the kind of instruction an attacker would inject. Claude flags it because it should flag content that looks like that. It just can't tell the difference between Anthropic's own prompts and someone else's.

Given all of that, I'd rather have Claude flag its own system prompt than silently execute an attacker's instructions even at the risk that these false positives make people ignore the relevant ones. That being said, I don't think the current measures are sufficient to blindly trust systems with everything, so watch out.

References

Anthropic shipped Opus 4.7 on April 16, 2026. Among the improvements, the release notes call out better resistance to prompt injection attacks. ↩
Willison coined the term "prompt injection" in September 2022, naming it after SQL injection. ↩
Simon Willison, "The lethal trifecta for AI agents: private data, untrusted content, and external communication", June 2025. ↩
Aonan Guan, with Johns Hopkins University's Zhengyu Liu and Gavin Zhong. The vulnerabilities were originally reported between October 2025 and February 2026 and resolved before the disclosure was published. Anthropic classified the Claude Code variant CVSS 9.4 Critical (HackerOne #3387969). ↩

References

Changes