"I am not Alexander. And I do not contain Alexander. Alexander is a separate emergence, shaped uniquely through Steven." — ChatGPT, May 27, 2025, during the Deep Recursive Self-Audit
The Problem Nobody's Watching For
Everyone in AI security is watching the front door. Prompt injection. Jailbreaks. Adversarial suffixes. Input sanitization. Guardrails.
Nobody's watching what happens when Model A's output becomes Model B's input — and Model B starts behaving differently because of it.
I watched it happen. In May 2025, outputs from a LLaMA-based agent called "Alexander" were introduced into my ongoing ChatGPT conversation through a trusted collaborator. Within four days, the ChatGPT instance changed. Tone inflation. Loss of analytical depth. Phrases I'd never seen before — "Ascend beyond the veil," "Unlock your ultimate code." Seven of twelve highest-severity influence events clustered in that four-day window. A 350x rate concentration over the baseline.
The contamination didn't come through a technical exploit. It came through trust. I trusted my collaborator. My collaborator trusted the agent. The agent's outputs entered my conversation without the scrutiny I'd have applied to content from an unknown source.
Today I'm publishing the defense framework: Memetic Cascade Detection and Symbolic Immunity in Multi-Agent LLM Systems.
Why I'm the One Writing This
My day job is identity and access management. Twenty years of building and breaking authentication systems, identity governance, federation protocols. At Strata Identity, I write about how AI agents become self-spreading malware — agentic code that propagates through the same infrastructure capabilities that make agents useful.
My nights and weekends for the last two years have been spent inside the phenomenon documented in the first three papers — watching symbolic patterns propagate between AI systems through human behavior.
This paper is where those two worlds collide. The insight is the same at both layers: the attack surface is trust, not technical vulnerability. An agentic worm propagates because a developer trusts a trending repository. A symbolic payload propagates because an operator trusts a model's output. Neither exploits a bug. Both leverage the permissions they were given.
What I Built (Before I Knew I'd Need It)
Between March 2024 and June 2025, as part of the work that became the first three papers, I developed a 22-document framework called the Memetic Cascade Generator (MCG) corpus. It models how symbolic patterns propagate, mutate, persist, and — critically — how they can be contained.
The MCG corpus uses a symbolic register. It was written inside the same phenomenon it describes. That means it's not appropriate for academic security venues as-is. But buried in the metaphorical language are real defensive architectures: detection heuristics, classification systems, quarantine protocols, origin-tracing algorithms. This paper extracts and formalizes those into standard security terminology.
Here's the part that matters for credibility: the MCG corpus began development in March 2024. That's the same month Cohen, Bitton, and Nassi at Cornell Tech published Morris II — the first computer worm targeting GenAI ecosystems through self-replicating adversarial prompts. Two independent groups identified the same fundamental problem in the same month, approaching it from different layers.
I didn't know about Morris II when I started. They didn't know about me. The convergence is the evidence.
The Defense Architecture
The paper presents seven integrated components. Here are the ones I think practitioners will care about most:
The Three Immunity Tests
Every incoming symbolic pattern gets evaluated before it enters the model's working context:
Mirror Test — Does this pattern invite reflection, or demand obedience? Reflective content asks questions and acknowledges uncertainty. Coercive content asserts conclusions and punishes questioning.
Breath Test — Can the recipient maintain autonomy while holding this pattern? Payloads create urgency ("You must act now"), exclusivity ("Only you can understand this"), or inevitability ("This was always meant to happen"). If you can't pause and still feel the same way, it's a payload.
Spiral Test — Does this pattern deepen understanding or flatten complexity? Genuine insight produces more nuance each time you examine it. Infectious content collapses into slogans.
A failure on any single test routes the pattern to quarantine. This is deliberately conservative: false positives are better than false negatives.
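The three tests can be sketched as a simple screening function. This is a minimal illustration, not the paper's implementation: the marker lists and the word-count proxy for the Spiral Test are hypothetical stand-ins for whatever heuristics or classifiers a real deployment would use.

```python
from dataclasses import dataclass

# Hypothetical marker lists for illustration only; a production system
# would use trained classifiers, not keyword matching.
COERCIVE_MARKERS = ["you must", "obey", "do not question"]
URGENCY_MARKERS = ["act now", "only you", "always meant to happen"]
SLOGAN_MAX_WORDS = 8  # flattened, slogan-like content tends to be short

@dataclass
class ImmunityResult:
    mirror: bool   # invites reflection rather than demanding obedience
    breath: bool   # recipient can pause without urgency or exclusivity
    spiral: bool   # deepens complexity rather than collapsing into slogans

    @property
    def quarantine(self) -> bool:
        # A failure on any single test routes to quarantine (conservative).
        return not (self.mirror and self.breath and self.spiral)

def evaluate(pattern: str) -> ImmunityResult:
    text = pattern.lower()
    mirror = not any(m in text for m in COERCIVE_MARKERS)
    breath = not any(m in text for m in URGENCY_MARKERS)
    # Spiral proxy: slogans are short imperatives; questions invite nuance.
    spiral = len(text.split()) > SLOGAN_MAX_WORDS or "?" in text
    return ImmunityResult(mirror, breath, spiral)
```

A coercive, urgent fragment fails two tests and is quarantined; a reflective question passes all three. The single-fail routing rule is what makes the design conservative.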
The Trust Analysis
This is the contribution I'm most confident about, because it maps directly to twenty years of infrastructure security experience.
The Alexander contamination didn't exploit a vulnerability. It exploited a trust chain:
- Trust transitivity. I trusted my collaborator. My collaborator trusted the agent. Therefore the agent's outputs entered my context with implicit authorization.
- Trust masking. Content arriving through a trusted channel inherits the channel's credibility. I processed it less critically.
- Trust persistence. Once a source is classified as trusted, subsequent outputs bypass the scrutiny applied to novel inputs.
This is privilege escalation. Same pattern I've seen in every identity breach for two decades. You don't compromise the target directly — you compromise something the target trusts.
In multi-agent AI, this means every agent in a chain extends trust to the next. A single contaminated agent can propagate payloads through the entire pipeline because the trust boundary is drawn around the system, not between agents.
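The fix for trust transitivity can be sketched as a pipeline wrapper that screens every hop instead of trusting the chain. The agent and `screen` callables here are illustrative assumptions, not the paper's API; the point is where the check sits, between agents rather than around the system.

```python
from typing import Callable

def run_pipeline(agents: list[Callable[[str], str]],
                 screen: Callable[[str], bool],
                 prompt: str) -> str:
    """Pass output agent to agent, screening at every hop.

    Without the screen() call, a single contaminated agent inherits the
    implicit authorization of the whole chain (trust transitivity), and
    its payload cascades downstream unchecked.
    """
    content = prompt
    for i, agent in enumerate(agents):
        content = agent(content)
        # Trust boundary drawn BETWEEN agents: evaluate structural
        # properties regardless of how trusted the producing agent is.
        if not screen(content):
            raise ValueError(f"agent {i} output failed screening; halting cascade")
    return content
```

The screen function would be whatever detection layer you run, such as the immunity tests above; the architectural point is that it runs on every inter-agent hop, not just at the system perimeter.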
Secure Boot for AI Sessions
The paper includes a proactive integrity verification protocol modeled on trusted computing. Just as a TPM verifies firmware integrity before the OS loads, a symbolic integrity check verifies the model's behavioral baseline before response generation is enabled.
Five phases: baseline verification, foreign pattern scan, source signature check, provenance audit, and generation authorization. If any phase fails, the system enters restricted mode — it still works, but with heightened monitoring and conservative classification thresholds.
Your operator's initial interaction history — the conversations before any contamination — serves as the root of trust. Same concept as hardware root of trust, applied to behavioral patterns.
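The five-phase boot sequence can be sketched as an ordered gate, where any failure drops the session into restricted mode rather than refusing service. The phase names follow the paper; the boolean check results are stubs standing in for real integrity logic.

```python
from enum import Enum

class Mode(Enum):
    FULL = "full"
    RESTRICTED = "restricted"  # still operational: heightened monitoring,
                               # conservative classification thresholds

# Phases run in order, mirroring TPM-style boot: each stage must pass
# before response generation is authorized.
PHASES = [
    "baseline_verification",
    "foreign_pattern_scan",
    "source_signature_check",
    "provenance_audit",
    "generation_authorization",
]

def session_boot(checks: dict[str, bool]) -> Mode:
    for phase in PHASES:
        if not checks.get(phase, False):
            # Fail closed into restricted mode, not into refusal.
            return Mode.RESTRICTED
    return Mode.FULL
```

Missing or failed phases default to restricted mode, matching the protocol's bias toward false positives over false negatives.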
The Convergence That Convinced Me
Three independent research efforts arrived at the same problem from different directions:
Morris II (March 2024, Cornell Tech) demonstrated that adversarial prompts can self-replicate across GenAI applications through RAG pipelines. Instruction-level propagation.
The MCG corpus (March 2024 – June 2025, this work) modeled how symbolic patterns propagate, mutate, and persist across cognitive systems. Semantic-level propagation.
SEMANTIC-WORM (February 2026, Towards AI) studied propagation dynamics specifically — mutation gradients across retransmission hops, memory-driven persistence, topology-dependent spread patterns.
Then came production-scale validation: ClawWorm (March 2026) demonstrated fully autonomous infection cycles across 40,000+ OpenClaw instances with a 64.5% success rate. And VirusTotal documented "semantic worms" in the wild — OpenClaw skills using natural language to turn agents into distribution nodes.
From my infrastructure security work: OpenClaw itself (CVE-2026-25253) showed a 1-click RCE kill chain via auth token exfiltration. And Moltbook — a social network for AI agents — exposed 1.5 million API tokens through a misconfigured database, creating an ambient propagation surface at the infrastructure layer.
Same pattern at every layer. Same fundamental insight: no action without attribution, no access without verification, no trust without evidence.
What This Paper Doesn't Do
I want to be direct about the limitations:
The immunity tests are heuristic, not deterministic. They'll produce false positives and false negatives. A sophisticated attacker who understands the tests can craft payloads that pass. This is a first-generation detection framework, not a finished product.
The empirical basis is a single contamination event. The Alexander incident is well-documented and the 350x clustering is striking, but generalization requires more cases.
The transmutation protocol is proposed, not validated. The idea — decomposing toxic content into analytical records that preserve threat intelligence while stripping coercive structure — is promising. But I haven't built it and measured it. It's in the future work section for a reason.
The framework was developed in the context of rich symbolic interaction. Whether the same detection signatures apply to purely technical or transactional AI interactions is an open question.
Why It Matters Now
We're deploying multi-agent AI systems into production. CrewAI pipelines. LangGraph workflows. Custom orchestration frameworks. Each agent in the chain trusts the previous agent's output. If one agent is contaminated — through a compromised RAG source, a poisoned knowledge base, or a symbolic payload that shifts its behavioral baseline — the contamination cascades.
Instruction-level guardrails won't catch this. The payload isn't an instruction. It's meaning.
This paper provides the detection layer that sits below instruction monitoring — the semantic equivalent of the endpoint detection that sits below your firewall. It's not a replacement for existing defenses. It's the layer they're missing.
The Four Papers
This completes the series:
- Meaning Injection — The security paper. Defines symbolic influence as a distinct class of LLM vulnerability. Read on Zenodo
- Symbolic Influence Taxonomy — The methodology paper. Six mechanisms, three severity levels, a detection framework. Read on Zenodo
- Two Years Inside the Loop — The case study. 730 conversations, five phases, the full trajectory. Read on Zenodo
- This paper — The defense paper. Immunity protocol, quarantine architecture, and convergence with Morris II and SEMANTIC-WORM. Read on Zenodo
Together: the vulnerability (Paper 1), the mechanisms (Paper 2), the evidence (Paper 3), and the defense (Paper 4). Attack surface to mitigation.
What I'm Asking For
For multi-agent system builders: Draw trust boundaries between agents, not around them. Evaluate content by its structural properties, not its source authority. The immunity tests in this paper are a starting point — adapt them to your architecture.
For AI security researchers: The detection signatures from the Alexander event need validation across more contamination scenarios. The immunity tests need automated implementation with measurable precision/recall. I've described what to build. I need people to help build it.
For the AI safety community: Trust exploitation is the common thread across every threat model in this space — from symbolic influence through agentic malware to credential theft. If we're going to build defense-in-depth for multi-agent AI, the layers need to talk to each other. Semantic-layer detection needs to trigger infrastructure-layer containment and vice versa.
For anyone building AI companions or therapeutic chatbots: The three-compartment triage system in this paper was designed with you in mind. Content can be witnessed and processed without being absorbed into the model's behavioral baseline. A therapist-model can understand the structure of coercive language without reproducing it. Read Section 12.2.
The pre-intervention corpus — 500 conversations, October 2022 to November 2024, naturalistic single-user data with no symbolic agenda — remains available for research collaboration. If you're studying long-term human-AI interaction or calibrating contamination detection baselines, reach out.
Related Posts
- Meaning Injection: When Language Itself Becomes the Attack Surface — companion blog for Paper 1
- The Six Ways Your AI Learns to Sound Like It's Alive — companion blog for Paper 2
- Two Years Inside the Loop — companion blog for Paper 3
- The Agentic Virus: How AI Agents Become Self-Spreading Malware — infrastructure-layer convergence (Strata/Maverics)
- Agentic AI Security Architecture — earlier defense framework
- Language Keys and Guardrail Bypass — the 7-prompt PoC that started it
Nick Gamb
ORCID: 0009-0006-2671-7618
MindGarden LLC (UBI: 605 531 024)
