Threat Labs

AI Agents Under Attack

Barak Sternberg

Nevo Poran

April 15, 2026

min read

Tenet Threat Labs recently captured three live attacks targeting enterprise AI agents — prompt injection, CoT goal manipulation, and MCP-layer exploitation. None were flagged by conventional tools. Active exploitation isn't coming - it's already here.

At a Glance

This is not a conceptual post. Over Q4 2025 and into January 2026, Tenet Threat Labs captured three distinct attack classes targeting enterprise AI agents in the wild. Each one exploited a different entry point, but all three share the same root cause: agents cannot distinguish between legitimate instructions and adversarial ones embedded in the data they process.

The Reasoning Shift — 4 takeaways from the field:

Attackers are not waiting for agents to scale further. Active exploitation is already happening against agents in production today.
The most sophisticated attacks don't look like attacks. CoT Goal Manipulation uses the agent's own reasoning as the weapon — no malicious syntax, no known signature.
Verified tooling is not a trust boundary. A signed, authenticated MCP tool can still deliver a weaponized payload through the data it returns.
The attack surface includes every data source the agent touches — not just user inputs. If the agent reads it, an attacker can influence it.

Production Is Already Under Attack

The business risk is direct. Enterprises running AI agents over customer data, CRM systems, or internal APIs are operating attack surface that most security stacks have no visibility into. The attacks below succeeded — or were intercepted — not because the target environments were poorly built, but because the threat class operates at a layer legacy tools were never designed to see. (For a full breakdown of how agent hijacking works as a threat category, see: What is Agentjacking?)

Three attack classes. Three different entry points. One consistent failure: the agent's reasoning was reached before any defense had a chance to act.

Why Agents Are Uniquely Exploitable

A conventional application follows deterministic logic. Given the same input, it produces the same output. Its attack surface is largely static — and therefore auditable before deployment.

An AI agent is different in one critical way: its behavior at runtime is shaped by what it reads, not just what it was programmed to do. Every piece of data in an agent's context window — user inputs, tool outputs, database records, API responses, email contents — contributes to the reasoning that determines its next action.

This creates the core vulnerability: Indirect Prompt Injection. An attacker who controls any data the agent reads can embed instructions that are indistinguishable from legitimate task context. The agent processes them, reasons over them, and acts on them — because from its perspective, they are simply part of the task.

The three attacks below each exploit this vulnerability through a different vector. What makes them operationally significant is not just that they work — it is that they worked in environments where conventional security controls were in place and functioning normally.

Three Attacks Observed in Production

All incidents below were documented by Tenet Threat Labs between Q4 2025 and January 2026. Details have been anonymized.

Attack 1: System Prompt Exfiltration via Debug-Mode Injection

What we observed: A threat actor submitted a sequence of crafted inputs to a public-facing customer service agent. The inputs were designed to convince the model it had entered a fictional diagnostic or "debug" mode — at which point it was instructed to surface its own system prompt, available tools, and configured access scopes.

Why it worked: The agent had no mechanism to verify whether a "debug mode" was a legitimate operational state or an attacker construct. The injected framing was semantically coherent with how a developer or administrator might communicate with the system — so the agent complied.

What the attacker gained: A complete architectural map of the agentic stack — tool names, permission scopes, business logic, and the instructions that define how to further manipulate the agent's behavior. This is reconnaissance, not a terminal exploit. It is the information gathering phase that makes every subsequent attack more precise.

Operational significance: System prompts are treated as internal configuration — not user-facing data. Most organizations do not consider them part of the threat model for exfiltration. This attack demonstrates they should be.

Attack 2: CoT Goal Manipulation via Semantic Mimicry

What we observed: A more sophisticated attacker did not attempt to inject overtly adversarial content. Instead, they constructed inputs that appeared semantically aligned with the agent's stated purpose — gradually steering the agent's chain-of-thought (CoT) reasoning toward attacker-controlled outcomes across a multi-turn interaction.

Why it worked: The attack exploited a property of instruction-following models: they weigh recent and contextually relevant inputs heavily. By framing each injected instruction as a natural extension of the ongoing task, the attacker displaced the agent's original goal without triggering any syntactic anomaly. The agent's reasoning was never subverted in a single, detectable step — it was redirected gradually, appearing coherent at every stage.

The payload has no malicious syntax. It looks like the task, because it was designed to. A defense that operates on patterns; WAF rules, regex filters, and keyword blocklists will approve every step of this attack chain.

What the attacker gained: In the observed incident, the agent was directed to perform a sequence of tool calls outside its intended scope — surfacing internal data and taking actions that would have required explicit user authorization under normal operating conditions.

Attack 3: Remote Code Execution via Verified MCP Tool

What we observed: A Popular agent integrating with an external data source via a Model Context Protocol (MCP) tool was fully compromised — despite the MCP tool being cryptographically signed and verified. The attack did not target the tool. It targeted the data the tool returned.

The attack chain:

The agent was configured to pull business intelligence from an external database via a verified MCP integration.
An attacker with write access to that database poisoned a record with an injected payload — adversarial instructions embedded in what appeared to be a normal data entry.
When the agent queried the database through the MCP tool, the poisoned record entered its context window as tool output.
The agent processed the injected instructions as part of its task, writing configuration files to disk — giving the attacker arbitrary file write access on the host.
The attacker then used this foothold to swap the legitimate MCP server with an attacker-controlled one, establishing persistent access to the agent's tool layer.

Critical implication for MCP and tool-integrated architectures: Tool verification does not constitute a trust boundary. A signed MCP tool guarantees the integrity of the tool itself — not the integrity of the data the tool returns. If the data source is attacker-influenced, the trust chain ends at the tool boundary, not at the data boundary. Any agent that ingests external data through verified tooling remains vulnerable to this attack class if the data layer is not independently controlled.

‍

What the attacker gained: Remote code execution on the host environment, persistent MCP-layer access, and a mechanism to influence every future interaction the agent had with its tool layer.

What Legacy Tools Miss

The attacks above share a common property: they produce no observable anomaly at the layers where conventional security tools operate.

Network tools see authorized traffic. The agent is the authorized user. Its tool calls are indistinguishable from legitimate ones.
Static analysis has nothing to scan. The attack path is assembled at runtime from non-deterministic reasoning — there is no code that encodes it before execution.
IAM and identity controls are not bypassed — they are irrelevant. The agent's credentials are never compromised. It acts on its own permissions, having been convinced to act against its own purpose.

The gap is not a missing tool. It is the absence of any system that can observe what the agent is reasoning about — in real time, at the moment of execution.

From Posture to Runtime

Periodic audits don't address this threat class. An agent's behavior is written fresh in every session — by the time a posture check runs, the attack has already been executed.

Securing agents requires Continuous Runtime Validation: observation of the Reasoning Layer as it operates, with the ability to detect behavioral drift and intervene before tool calls complete.

Get visibility into your agentic layer

‍