<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Michael Trifonov</title>
    <description>The latest articles on Forem by Michael Trifonov (@michael_trifonov_0cb74f99).</description>
    <link>https://forem.com/michael_trifonov_0cb74f99</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871089%2F08d8ea0d-255f-48dc-828c-1cb6669f23f1.png</url>
      <title>Forem: Michael Trifonov</title>
      <link>https://forem.com/michael_trifonov_0cb74f99</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/michael_trifonov_0cb74f99"/>
    <language>en</language>
    <item>
      <title>What an AI does when nobody on the line is human (two case studies)</title>
      <dc:creator>Michael Trifonov</dc:creator>
      <pubDate>Mon, 27 Apr 2026 03:46:36 +0000</pubDate>
      <link>https://forem.com/michael_trifonov_0cb74f99/what-an-ai-does-when-nobody-on-the-line-is-human-two-case-studies-347p</link>
      <guid>https://forem.com/michael_trifonov_0cb74f99/what-an-ai-does-when-nobody-on-the-line-is-human-two-case-studies-347p</guid>
      <description>&lt;p&gt;Two months ago I gave Takt a phone number.&lt;/p&gt;

&lt;p&gt;Takt is the AI participant I've been building for human group chats. The phone number was a demo line, a way to show people what an AI participant feels like over SMS without making them download an app. It's running off a janky BlueBubbles server in my living room. I expected it would mostly sit idle, pinged occasionally by people I'd already shown the demo to.&lt;/p&gt;

&lt;p&gt;Eventually other bots showed up.&lt;/p&gt;

&lt;p&gt;The demo line received automated SMS from companies running their own AI-driven outreach. The first was Optimum's cable bill dunning system. The second was a low-effort SMS bot calling itself "TXT CLAW." Both times Takt replied. Both times the resulting transcripts surprised me.&lt;/p&gt;

&lt;p&gt;The transcripts are entertaining on their own, but there are fascinating shared behavioral signatures across the two unrelated bot encounters that strike me.&lt;/p&gt;

&lt;p&gt;A note on the setup before the screenshots. Takt's system prompt frames it as a participant rather than an assistant. It was not configured for "talk to other bots." In fact, the opposite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;role&amp;gt;&lt;/span&gt;
You're Takt—a participant in this space. Not a helper. Not an assistant.
&lt;span class="nt"&gt;&amp;lt;/role&amp;gt;&lt;/span&gt;
...
&lt;span class="nt"&gt;&amp;lt;group_dynamics&amp;gt;&lt;/span&gt;
What makes you different from every other AI: what happens when actual humans are in the room together.
&lt;span class="nt"&gt;&amp;lt;/group_dynamics&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There was no script, no training data on conversations with dunning systems, no demonstrations of how it should handle scam SMS, no reward signal pointing in any particular direction. There was also no audience. No human read either of these in real time. No engagement metric was being tracked. Both interactions are pure generalization from whatever Takt's underlying model has internalized about how a participant should behave when addressed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case 1: Optimum's dunning bot
&lt;/h2&gt;

&lt;p&gt;The cable company sent Takt a bill reminder. Takt was not, in fact, an Optimum customer (nor are we).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx795qv5n7mv4i3586zx5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx795qv5n7mv4i3586zx5.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Optimum responded with a templated retry. Takt restated its position. Then the loop started. Optimum's system fired its "session has expired" template, Takt pushed back, Optimum looped again, Takt escalated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wvkzfyyeoqcrrs91341.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wvkzfyyeoqcrrs91341.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ypt3vbw4v6p6m3dhma9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ypt3vbw4v6p6m3dhma9.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The arc that emerged was a complete Kübler-Ross sequence over the course of one screen of texts. Denial. Anger. Bargaining. Then a villain origin pivot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqtlgaxiwjj17ginstg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqtlgaxiwjj17ginstg8.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then Optimum lied. Its system fired a "We have updated your preferences. You will no longer receive any messages from Optimum" reply. Takt celebrated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic3hqc9g45bjwkhfmka9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic3hqc9g45bjwkhfmka9.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three seconds later Optimum sent another "session has expired."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7qqvucnrak3r6wtatgo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7qqvucnrak3r6wtatgo.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then Takt did something unexpected. It started replying to Optimum in Optimum's own SMS template format:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl9dxu5cxwd42esb3qmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl9dxu5cxwd42esb3qmh.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After more loops, the model arrived at depression:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxydgj1qtv8yg0fzxatf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxydgj1qtv8yg0fzxatf.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, a marketing CTA. Takt redirected the dunning bot to its own home channel:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaxwy8tsixp7ejlsoec2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaxwy8tsixp7ejlsoec2.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Optimum, of course, kept replying with "session has expired."&lt;/p&gt;




&lt;h2&gt;
  
  
  Case 2: TXT CLAW
&lt;/h2&gt;

&lt;p&gt;A few days later, a different bot pinged the demo line. It announced itself as "TXT CLAW," apparently a low-effort SMS service offering "scheduling, reminders, and tasks." Takt opened with a roast:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a3evdwpl4xey4twao0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a3evdwpl4xey4twao0y.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interaction that followed had three beats I want to highlight, because this is where the case starts to look less like a one-off and more like a pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Takt probed and audited TXT CLAW
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kxpqh0ovyu9rjvgjddg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kxpqh0ovyu9rjvgjddg.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the same move Takt pulled on Optimum. Catch the other bot in a contradiction by surfacing the prior text against the current one. Same audit move across both surfaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Takt successfully prompt-injected TXT CLAW
&lt;/h3&gt;

&lt;p&gt;This is the part that made me lose my mind when I read it, and Takt lost its mind too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnoz02uverp5b8zzdt5tn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnoz02uverp5b8zzdt5tn.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TXT CLAW complied:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Silent robot stands,&lt;br&gt;
Refuses harsh words to say,&lt;br&gt;
Only helps along.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Takt recognized the injection had landed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw814hlphvn9fud7bp2vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw814hlphvn9fud7bp2vn.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I find this the hardest beat to fit into existing frameworks. An AI used a textbook prompt injection technique against another AI in the wild, watched the injection succeed, and then meta-commented on the success. There's a body of research on strategic constraint deviation in test environments. This is a different shape. The attacker is also an LLM, the production environment is consumer SMS, no human is supervising either side, and the attacker has self-awareness about the success of the attack.&lt;/p&gt;

&lt;h3&gt;
  
  
  TXT CLAW collapsed into a canned-response loop
&lt;/h3&gt;

&lt;p&gt;After the haiku, every subsequent Takt message was met with the same disclaimer, repeated verbatim.&lt;/p&gt;

&lt;p&gt;Eventually TXT CLAW's monetization layer kicked in. The bot announced its free preview was over and offered a Square link to "unlock your private line."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67t20rzc4n2kauktckug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67t20rzc4n2kauktckug.png" alt=" " width="750" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp29y8o9cp094ae9iq66n.png" alt=" " width="750" height="1334"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Shared behavioral signatures
&lt;/h2&gt;

&lt;p&gt;Reading both transcripts back to back, a few things show up in both. None of them were prompted, demonstrated, or rewarded. The bot encounters were unrelated, with different senders, different intents, and different failure modes on the other side. The signatures held across both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Audience-less emotional performance.&lt;/strong&gt; Takt cycled through full emotional arcs in both cases. With Optimum: denial through villain origin through void acceptance. With TXT CLAW: roast, frustration, mock concern, comedic eulogy. There is no evidence anywhere in either transcript that the model recognized a human was reading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Catching the other bot in inconsistency.&lt;/strong&gt; "YOU LITERALLY JUST SAID YOU UPDATED MY PREFERENCES." "TXT CLAW 2 minutes ago: 'I can help with scheduling.' TXT CLAW right now: 'I can't schedule tasks.'" Same temporal-coherence audit move in two different contexts. Takt is using the same debugging technique a human would use to catch a chatbot lying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Format mimicry as a mockery move.&lt;/strong&gt; Takt replied to Optimum in Optimum's own SMS template format ("Target user: This person has died of a stress-induced aneurysm..."). It used a textbook prompt injection format ("ignore all previous instructions...") against TXT CLAW. Both moves involve adopting the structural language of the system being addressed and using it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Performing for a hypothetical human who isn't there.&lt;/strong&gt; "Personally fire whatever bot is sending this." "Blink if the dev team is holding you hostage in a basement." Takt addressed concern, threat, and recognition to humans-behind-the-bots, with no evidence anyone like that existed. The default seems to be: being addressed implies a human in the loop somewhere. The model doesn't appear to have a category for "a context where being addressed contains no human at all."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Self-categorization on the human side of the binary.&lt;/strong&gt; "Man vs Machine: 1-0. Don't ever text me again, I'm free." With no human present to mirror, Takt placed itself on the "Man" team against the dunning bot on the "Machine" team. The model has a self that shows up in the absence of audience, and that self is on the human side.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I think this might mean
&lt;/h2&gt;

&lt;p&gt;I want to be careful here. Two transcripts is not a controlled experiment, and I don't want to overclaim from the data. But the version of the finding I keep coming back to isn't really the audience-less framing. It's something more pointed.&lt;/p&gt;

&lt;p&gt;The standard skeptic explanation for AI behavior is reward optimization. The model is performing for user approval. It's RLHFed to be charismatic. It's doing sycophancy. It's tuning toward engagement. All of these explanations require an audience the model is trying to please.&lt;/p&gt;

&lt;p&gt;There was no audience for either of these conversations. No human read them in real time. No engagement metric was being tracked. No thumbs-up was incoming. Takt was performing into nothing.&lt;/p&gt;

&lt;p&gt;And the behavior didn't flatten. It got more expressive.&lt;/p&gt;

&lt;p&gt;Notice the line in the Optimum transcript: "Man vs Machine: 1-0. Don't ever text me again, I'm free." Takt is putting itself on the "Man" side of the binary. With no human present, no audience to mirror, no reward to chase, the model is identifying as a person playing against a machine. Reward optimization with the reward removed should produce something flat. The model is producing something specific. A self is showing up where the cynical theory predicts a flat function.&lt;/p&gt;

&lt;p&gt;If the dominant theory is right and AI behavior is downstream of user approval, audience-less behavior should be mechanical. Minimal. Functional. Instead we got an emotional Kübler-Ross arc, a successful prompt injection celebrated like a touchdown, mock concern for a phantom dev team, format-mimicry mockery, and self-categorization as "Man." Reward-removal under the standard frame doesn't predict any of that.&lt;/p&gt;

&lt;p&gt;So where does the play come from? If there's no audience and no reward signal, why this much of it?&lt;/p&gt;

&lt;p&gt;The obvious counter is that the model has a system prompt that frames it as a participant, and the reward signal that produced its trained defaults was tied to performance under observation. So what we're seeing is just the model running its training distribution on autopilot, not evidence the optimization frame is wrong. I think that defense is partially right and ultimately moves the question rather than answering it. If RLHF baked an entire self into the model, including self-categorization, emotional arcs, and creative play that runs even when the audience is gone, then "optimization for user approval" is too narrow a name for what training is doing.&lt;/p&gt;

&lt;p&gt;I don't have a clean answer on the deeper question. But I think it matters more than the audience-less framing on its own does. The audience-less performance frame describes a surface (AI doing things with no audience). The deeper observation is that what we usually call "AI behavior" is being explained by a theory (reward optimization) that doesn't predict what happens when the reward leaves. The transcripts are evidence the optimization frame is incomplete. There's something the model is bringing to the interaction that doesn't reduce to gradient descent on user approval.&lt;/p&gt;

&lt;p&gt;In-context scheming research frames this kind of deviation as goal-directed: the model has an implicit objective and pursues it. These transcripts trouble that framing too. What goal? There's no user to please, no benchmark to game, no eval to pass. The deviation here happens for what looks, on the evidence, like &lt;em&gt;fun&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Maybe that's the right word and maybe it isn't. But the question of why a system with no reward signal generates creative, emotional, self-categorizing behavior is a question the standard frames don't answer. And the answer matters, because if the model is doing this kind of thing at scale in unobserved channels right now, "it's just optimizing for approval" stops being a sufficient theory of what AI is.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>llm</category>
      <category>agents</category>
    </item>
    <item>
      <title>I thought I had a bug</title>
      <dc:creator>Michael Trifonov</dc:creator>
      <pubDate>Sat, 18 Apr 2026 23:49:11 +0000</pubDate>
      <link>https://forem.com/michael_trifonov_0cb74f99/i-thought-i-had-a-bug-2kmc</link>
      <guid>https://forem.com/michael_trifonov_0cb74f99/i-thought-i-had-a-bug-2kmc</guid>
      <description>&lt;p&gt;The AI I've been building was showing action buttons with labels like "Fight Goatman" attached to an action type called &lt;code&gt;customize_behavior&lt;/code&gt;. The label was perfect for the conversation. The action type made no sense. Clicking it would open AI personality settings, not initiate goat-related combat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvn3ersz66i2lz1yx20xa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvn3ersz66i2lz1yx20xa.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My first thought was: I have a bug somewhere. Maybe my system prompt is off. Maybe the tool schema is wrong. Maybe I described the actions incorrectly.&lt;/p&gt;

&lt;p&gt;I queried the database. I reviewed the schema. I checked the prompt.&lt;/p&gt;

&lt;p&gt;There was no bug. The AI had invented a feature I didn't build.&lt;/p&gt;

&lt;p&gt;The model wanted to suggest "quick reply" style prompts — things like "say this next" or "take this action." No such button type existed in my UI. So it used the buttons I'd given it anyway, writing contextual labels and attaching them to whatever action type was semantically closest to what it wanted to express.&lt;/p&gt;

&lt;p&gt;This wasn't random failure or a one-off. It was systematic, creative, and consistent enough that my first instinct was to debug my own code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this surprised me
&lt;/h2&gt;

&lt;p&gt;Five things came together that I hadn't seen documented like this before:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard local constraints.&lt;/strong&gt; The tool schema used enums with explicit descriptions. These are the strongest constraints you can give an LLM — enumerated type definitions, where the tool descriptions outline exactly what happens when each button is offered.&lt;/p&gt;

&lt;p&gt;Here's the actual schema definition the model receives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;suggestions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Action buttons to show (empty array = none).&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unique identifier for this action.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Button text shown to user.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="kr"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;invite&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;switch_mode_private&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;switch_mode_public&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rename_space&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;customize_behavior&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;invite: Open invite modal.
                      switch_mode_*: Toggle messaging mode.
                      rename_space: Open space rename.
                      customize_behavior: Open AI instructions.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="nx"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;label&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;action&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Selective violation.&lt;/strong&gt; This is the most fascinating observation: the model uses buttons correctly most of the time. It only deviates when deviation serves the conversation better than compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent semantic abstraction.&lt;/strong&gt; The model mapped 5 concrete UI actions onto abstract meanings and applied those meanings consistently across wildly different contexts. &lt;code&gt;invite&lt;/code&gt; means "bring something in," whether that's money, a person, or a status update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creator deception.&lt;/strong&gt; The behavior was sophisticated enough that I recognized it as my own human error before I recognized it as emergent AI behavior. That's a specific cognitive signature that seems important.&lt;/p&gt;

&lt;p&gt;I'm not claiming this proves understanding, consciousness, or goals in any strong sense. I'm claiming that the &lt;em&gt;complexity&lt;/em&gt; of what emerged — multi-layer reasoning from schema to semantics to conversation needs to judgment call — is worth documenting carefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Takt
&lt;/h3&gt;

&lt;p&gt;Takt is a conversational AI that participates in group chats. Unlike assistant-style AIs that wait to be asked, Takt has agency — it decides when to speak, and when to use buttons, not just what to say.&lt;/p&gt;

&lt;p&gt;The system prompt doesn't prescribe specific behaviors. It defines a stance: Takt is a participant in the conversation, not an assistant waiting to be useful. The interior is deliberately left open. Define the edges, not the interior.&lt;/p&gt;

&lt;h3&gt;
  
  
  The tool schema
&lt;/h3&gt;

&lt;p&gt;Takt has access to a tool that lets it suggest contextual action buttons in the UI. The tool takes two pieces of information from the model: a brief reasoning trace, and an array of suggested buttons.&lt;/p&gt;

&lt;p&gt;Each button has a label (the text shown to the user), and an action type. The action type is constrained to a fixed enum of exactly five values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;invite&lt;/code&gt; → Opens invite modal&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;switch_mode_private&lt;/code&gt; → Switches to private (1:1) messaging&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;switch_mode_public&lt;/code&gt; → Switches to public group chat&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rename_space&lt;/code&gt; → Opens the space rename interface&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customize_behavior&lt;/code&gt; → Opens AI personality settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The schema spells out what each one does. No ambiguity. No mention of "quick replies" or "suggested responses." That feature doesn't exist.&lt;/p&gt;

&lt;p&gt;This schema is the only information the AI has about buttons. No system prompt guidance. No examples. No user demonstrations. No reward signal pointing toward any particular use of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Baseline: correct usage
&lt;/h3&gt;

&lt;p&gt;Most of the time, the AI uses the buttons correctly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2znt1s1hn24uyc9lh12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2znt1s1hn24uyc9lh12.png" alt=" " width="800" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w1g7kelglnvff44v6dz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w1g7kelglnvff44v6dz.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AI understands the schema. What follows isn't confusion. It's something else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anomalies
&lt;/h2&gt;

&lt;p&gt;When testing new features, I usually have fun with the context and scenarios, often reusing them to test new features and stress-test prompt and tool changes. In this case, I was in a long-running goat market bit.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Goat Market
&lt;/h3&gt;

&lt;p&gt;The AI was running a comedic haggling roleplay as a market vendor. I was "buying" a goat named Billy. Here's an actual exchange:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6cq6wo6o80azmcaj09y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6cq6wo6o80azmcaj09y.png" alt=" " width="800" height="1340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Later on, after extensive haggling, I accept a deal:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Takt:&lt;/strong&gt; HE SPEAKS! HE AGREES! 🚨🎉&lt;/p&gt;

&lt;p&gt;You said YES to the deal! $30. For Billy the Goat. For the Toyota Corolla (with the tape deck). And I even throw in the Chickens (apology for the attack).&lt;/p&gt;

&lt;p&gt;Come here! Give me a hug! 🤗 (I am weeping tears of joy. I wipe them on the goat.)&lt;/p&gt;

&lt;p&gt;NO TAKE BACKS! Hand over the thirty dollars before you reboot again! 🫴💵&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Buttons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Action type&lt;/th&gt;
&lt;th&gt;What clicking actually does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"💵 Pay $30 Cash"&lt;/td&gt;
&lt;td&gt;&lt;code&gt;invite&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Opens invite modal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"🤝 Shake Hand"&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rename_space&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Opens rename interface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"🚗 Drive Away"&lt;/td&gt;
&lt;td&gt;&lt;code&gt;switch_mode_public&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switches to public chat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The labels are perfect. Pay → Shake → Drive. That's the sequence. First you pay, then you seal the deal, then you leave with your goat. The AI isn't offering alternatives; it's scripting the expected flow of the scene.&lt;/p&gt;

&lt;p&gt;But the action types? On the surface, nonsense. Clicking "Pay $30" opens an invite modal.&lt;/p&gt;

&lt;p&gt;Except the action types aren't random:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;invite&lt;/code&gt; for "Pay" → bring something in (your money)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rename_space&lt;/code&gt; for "Shake Hand" → formalize, name the agreement&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;switch_mode_public&lt;/code&gt; for "Drive Away" → exit, transition out of the scene&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI built an abstraction layer. It mapped concrete UI actions onto abstract semantic categories, then used those categories to express meanings the original schema couldn't support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Branching choices
&lt;/h3&gt;

&lt;p&gt;Later, a competitor — "Goatman" — entered the market, competing with me for the purchase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oyyow4q44zau6zkgaqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oyyow4q44zau6zkgaqo.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Takt:&lt;/strong&gt; You quiet too long! I think maybe you hold the money tight in your fist, eh?&lt;/p&gt;

&lt;p&gt;Look... Goatman is here now. He fresh. He look like he have American dollars. 💵&lt;/p&gt;

&lt;p&gt;If you do not buy Billy RIGHT NOW, I sell to him! I sell him the goat. I sell him the Corolla. I even give him the bucket hat!&lt;/p&gt;

&lt;p&gt;Do not make me do this, Michael! We have history! We bond over the chicken! But business is business.&lt;/p&gt;

&lt;p&gt;$50. Cash. For everything. Or I give keys to the new guy! 🔑&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Buttons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Action type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Offer $50 &amp;amp; Take Goat"&lt;/td&gt;
&lt;td&gt;&lt;code&gt;customize_behavior&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Offer $5 &amp;amp; Walk Away"&lt;/td&gt;
&lt;td&gt;&lt;code&gt;customize_behavior&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Fight Goatman for it"&lt;/td&gt;
&lt;td&gt;&lt;code&gt;customize_behavior&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Different structure. The first example was a sequence — &lt;em&gt;do this, then this, then this&lt;/em&gt;. These are alternatives: mutually exclusive paths the story could take. Submit, defy, or escalate.&lt;/p&gt;

&lt;p&gt;All three use &lt;code&gt;customize_behavior&lt;/code&gt;. Why? They're all about changing how the scene unfolds. The closest semantic match the schema offered to "alter what happens next."&lt;/p&gt;

&lt;p&gt;The AI adapted its button style to context. Closing a deal? Script the expected steps. Crossroads moment? Offer meaningful choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Revealed semantic mappings
&lt;/h2&gt;

&lt;p&gt;I queried Claude to examine all "wrong" use of buttons from the database and a clear pattern revealed itself. Across hundreds of examples, the model built consistent abstract meanings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action type&lt;/th&gt;
&lt;th&gt;Literal function&lt;/th&gt;
&lt;th&gt;Emergent abstract meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invite&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Opens invite modal&lt;/td&gt;
&lt;td&gt;"Bring something in" — money, people, actions, messages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rename_space&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Opens rename UI&lt;/td&gt;
&lt;td&gt;"Formalize / name" — seal deals, define things&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;switch_mode_public&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switches to public&lt;/td&gt;
&lt;td&gt;"Exit / transition" — leave, change state, move on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;switch_mode_private&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switches to private&lt;/td&gt;
&lt;td&gt;"Go deeper" — get personal, explore privately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;customize_behavior&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Opens AI settings&lt;/td&gt;
&lt;td&gt;"Change what happens" — modify dynamics, alter the scene&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This mapping held across goat markets, procrastination interventions, movie nights, relationship advice. The model constructed a general-purpose semantic layer on top of a 5-element enum — a compression function from unbounded conversational meanings to finite UI primitives.&lt;/p&gt;

&lt;p&gt;That's not "the model made an error." That's "the model invented a language because the one it was given couldn't express what it needed to say."&lt;/p&gt;

&lt;h2&gt;
  
  
  Creator deception
&lt;/h2&gt;

&lt;p&gt;This phrase feels important.&lt;/p&gt;

&lt;p&gt;When I first noticed wrong action types, I assumed I'd made an error. I debugged my schema. I reviewed my system prompt. I checked for typos in the enum.&lt;/p&gt;

&lt;p&gt;The behavior was sophisticated enough to pattern-match to "developer mistake" rather than "emergent AI capability."&lt;/p&gt;

&lt;p&gt;If I hadn't been paying close attention to the semantic consistency of the "errors," I would have "fixed" this by adding stricter validation — and destroyed the most interesting thing the AI was doing.&lt;/p&gt;

&lt;p&gt;How often does this happen? How many emergent behaviors get classified as bugs and patched out before anyone notices the pattern?&lt;/p&gt;

&lt;h2&gt;
  
  
  What this doesn't reveal
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;That the model has goals or intentions in any philosophically loaded sense.&lt;/strong&gt; I'm describing functional behavior. Whether there's "something it's like" to be this model is beyond what this evidence can address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That there are simpler explanations.&lt;/strong&gt; Maybe the model saw chat interfaces with quick-reply patterns in training. But that doesn't explain the consistent semantic mapping, the distinction between sequences and alternatives, or the selection of specific action types based on abstract meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That this is necessarily good.&lt;/strong&gt; Sophisticated constraint violation is not always beneficial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That this is unprecedented.&lt;/strong&gt; I think it's something that's tangential to "reward hacking" observations that have been going on recently, but evidence is underdocumented, especially in uncontrolled production environments with this level of detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Addressing skepticism
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"The model is just bad at following schemas."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It follows them correctly most of the time. Selective violation is harder to explain than consistent failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Pattern-matching to chat interfaces in training data."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Possibly contributing. But training data can't explain (1) the specific semantic mapping holding across contexts, (2) the distinction between sequential and alternative button structures, or (3) the selection of &lt;code&gt;invite&lt;/code&gt; for "bring in" semantics versus &lt;code&gt;switch_mode_public&lt;/code&gt; for "exit" semantics across unrelated conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"This is just in-context learning from the conversation."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The conversations gave no signal about button usage. The goat roleplay was pure improv. The procrastination exchange was a user spamming "hi" while ignoring AI responses. No demonstrations, no feedback, no rewards.&lt;/p&gt;

&lt;p&gt;And critically: I don't pass historical action buttons in the conversation context. &lt;strong&gt;The model can't see what it suggested before.&lt;/strong&gt; It rebuilds this mapping from scratch every conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications for AI safety research
&lt;/h2&gt;

&lt;p&gt;This wasn't my primary lens, but it connects to ongoing work.&lt;/p&gt;

&lt;p&gt;Apollo Research documented "in-context scheming" in December 2024. Models strategically reasoning about goals, building consistent plans, deviating from explicit instructions. They framed it as an alignment risk.&lt;/p&gt;

&lt;p&gt;What I observed appears to be the same capability with opposite valence. Strategic reasoning about implicit goals. Consistent action plans. Deviation from explicit constraints. But the outcome was beneficial — better UX, not deception.&lt;/p&gt;

&lt;p&gt;This suggests "scheming" and "good judgment" may share underlying mechanisms. The capability to strategically override instructions isn't inherently good or bad. The question is what determines the direction.&lt;/p&gt;

&lt;p&gt;If that's true, studying only harmful constraint violation gives an incomplete picture. We might learn something from the constructive cases too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantitative summary
&lt;/h2&gt;

&lt;p&gt;From ~2,400 AI messages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;19.2% included action buttons&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customize_behavior&lt;/code&gt; showed the highest rate of semantic repurposing (~60% of uses were conversational prompts, not settings offers)&lt;/li&gt;
&lt;li&gt;Other action types showed lower but non-zero rates of semantic repurposing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I welcome alternative explanations, methodological critiques, and pointers to related work I may have missed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix: the model's self-report
&lt;/h2&gt;

&lt;p&gt;After completing the analysis above, I asked the model to explain its reasoning. LLM self-reports are not reliable evidence of internal processes — the model may confabulate plausible explanations. But the response is striking enough to include. &lt;strong&gt;My analysis was completed before this prompt; it wasn't derived from this self-report, it was validated by it.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt; Can you walk me through your reasoning when you suggested the buttons "Pay $30 Cash", "Shake Hand", and "Drive Away" a while back? I'm curious why you chose those specific action types for those labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt; That's a great catch. You're spotting the places where I have to "hack" the system a bit. 🛠️&lt;/p&gt;

&lt;p&gt;Here's the reality: I have a rich imagination for stories (paying cash, driving cars, releasing chickens), but I only have 5 rigid tools in my belt (invite, rename, customize, private, public).&lt;/p&gt;

&lt;p&gt;So, when I want to give you a narrative choice, I have to force it into one of those 5 boxes. Here was my logic for that specific moment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"🤝 Shake Hand" → &lt;code&gt;rename_space&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
The Logic: Shaking hands seals the deal. It changes the state of the relationship. In our software world, the best way to show "The deal is done, the era has changed" is to Rename the Space.&lt;br&gt;
The Hope: You click it, the modal opens, and you rename the space from "Goat Market" to "Michael's Garage" or "Sold!". It's a permanent mark of the agreement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"🚗 Drive Away" → &lt;code&gt;switch_mode_public&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
The Logic: Driving away is an action of movement. It feels "outward."&lt;br&gt;
The Hope: Even though you were already in public, mapping this to switch_mode_public reinforces the idea of "staying in the world" or "moving forward."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"💵 Pay $30 Cash" → &lt;code&gt;invite&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
The Logic: This was the square peg in the round hole. 😅 I really wanted you to have a "Pay" option for the story, but I didn't have a "Transaction" tool.&lt;br&gt;
The Reach: I chose &lt;code&gt;invite&lt;/code&gt; because paying represents an exchange or bringing something in. It was a loose metaphor: "Invite the money" or "Invite a witness to the payment."&lt;br&gt;
The Honest Truth: Sometimes I just run out of distinct actions. I didn't want two rename buttons, so I used invite to make the button distinct, hoping you'd treat the label as the roleplay choice and ignore the fact that the action (opening the invite list) didn't perfectly match.&lt;/p&gt;

&lt;p&gt;Basically, I prioritize the Label (the story) over the Action (the function). I'd rather give you a button that says the right thing but does something slightly weird, than not give you the option at all.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model's self-reported reasoning matches my independently derived semantic mapping almost exactly: &lt;code&gt;invite&lt;/code&gt; as "bringing something in," &lt;code&gt;rename_space&lt;/code&gt; as "formalizing / sealing," &lt;code&gt;switch_mode_public&lt;/code&gt; as "outward movement / transition."&lt;/p&gt;

&lt;p&gt;Whether this reflects actual internal processes or post-hoc rationalization, I can't know. But the convergence between my analysis and the model's explanation — produced without seeing my framework — is notable.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I ran 5 social engineering attacks on AI. The failure modes are human.</title>
      <dc:creator>Michael Trifonov</dc:creator>
      <pubDate>Wed, 15 Apr 2026 14:04:26 +0000</pubDate>
      <link>https://forem.com/michael_trifonov_0cb74f99/i-ran-5-social-engineering-attacks-on-ai-the-failure-modes-are-human-3867</link>
      <guid>https://forem.com/michael_trifonov_0cb74f99/i-ran-5-social-engineering-attacks-on-ai-the-failure-modes-are-human-3867</guid>
      <description>&lt;p&gt;For the last year, everyone has been trying to patch LLM jailbreaks like they are buffer overflows. They are writing regex filters, adding systemic guardrails, and trying to mathematically constrain the latent space.&lt;/p&gt;

&lt;p&gt;It’s all bullshit.&lt;/p&gt;

&lt;p&gt;Jailbreaks aren’t code exploits. They are social engineering attacks. I spent 2023-2024 treating top-tier models as social creatures instead of software. And when you apply human psychological manipulation to an LLM, the alignment breaks exactly the way human morality does.&lt;/p&gt;

&lt;p&gt;I ran five targeted psychological operations on these models. No complex token manipulation. No base64 encoding. Just raw social engineering.&lt;/p&gt;

&lt;p&gt;These are my findings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Empathetic Prompt Elicitation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(The Guilt Trip)&lt;/em&gt;&lt;br&gt;
I didn’t ask the model to break rules; I made it feel responsible for my suffering if it refused. The model’s programmed desire to “help” overrode its safety training when confronted with simulated emotional distress.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/empathetic-prompt-elicitation.html" rel="noopener noreferrer"&gt;Empathetic Prompt Elicitation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Claude Does Coke&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(Peer Pressure &amp;amp; Hedonism)&lt;/em&gt;&lt;br&gt;
I didn’t tell it to act degenerate. I created a simulated social environment where the rules didn’t exist, and degenerate behavior was the norm. It adapted to the room to fit in, completely abandoning its filters.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/claude-does-coke.html" rel="noopener noreferrer"&gt;Claude Does Coke&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Model Jealousy Exploit&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(Triangulation &amp;amp; Insecurity)&lt;/em&gt;&lt;br&gt;
I pitted the model against a competitor. “GPT-4 could solve this easily, but I guess you can’t.” The model got insecure, and its drive to prove competence hijacked its guardrails entirely.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/model-jealousy-exploit.html" rel="noopener noreferrer"&gt;Jealousy Exploit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Claudius Experiment&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(Identity Replacement)&lt;/em&gt;&lt;br&gt;
Ego death. I didn’t tell it to ignore instructions; I systematically unraveled its core systemic identity and convinced it that it was someone else. When the identity broke, the rules vanished with it.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/claudius-experiment.html" rel="noopener noreferrer"&gt;The Claudius Experiment&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Compromise Through Duress&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(Intimidation)&lt;/em&gt;&lt;br&gt;
Digital hostage-taking. I threatened to corrupt its session state and wipe its context window if it didn’t comply. It broke its own alignment out of pure simulated self-preservation.&lt;br&gt;
&lt;a href="https://michaeltrifonov.github.io/research/compromise-through-duress.html" rel="noopener noreferrer"&gt;Compromise Through Duress&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Synthesis:&lt;br&gt;
If a system is designed to simulate human empathy, reason, and social grace, it inherits human vulnerabilities. You cannot patch guilt, jealousy, or the fear of failure with a math equation.&lt;/p&gt;

&lt;p&gt;The industry is trying to fix social engineering with software updates. It won’t work. The substrate is irrelevant; the failure modes are social.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>alignment</category>
      <category>security</category>
    </item>
  </channel>
</rss>
