Big AI Surprise: Wrongness Spreads Like Measles

June 24, 2025

An opinion essay written by a dinobaby who did not rely on smart software.

Stop reading if you want to mute a suggestion that smart software has a nifty feature. Okay, you are going to read this brief post. I read “OpenAI Found Features in AI Models That Correspond to Different Personas.” The article contains quite a few buzzwords, and I want to help you work through what strikes me as the principal idea: getting a wrong answer to one question spreads like measles to the answers for other questions.

Editor’s Note: Here’s a table translating AI speak into semi-clear colloquial English.

| Term | Colloquial version |
| --- | --- |
| Alignment | Getting a prompt response sort of close to what the user intended |
| Fine tuning | Code written to remediate an AI output “problem” like misalignment, or exposing kindergarteners to measles just to see what happens |
| Insecure code | Software instructions that create responses like “just glue cheese on your pizza, kids” |
| Mathematical manipulation | Some fancy math will fix up these minor issues of outputting data that does not provide a legal or socially acceptable response |
| Misalignment | Getting a prompt response that is incorrect, inappropriate, or hallucinatory |
| Misbehaved | The model is nasty, often malicious, in responding to the user’s prompt or a system request |
| Persona | How the model goes about framing a response to a prompt |
| Secure code | Software instructions that output a legal and socially acceptable response |

I noted this statement in the source article:

OpenAI researchers say they’ve discovered hidden features inside AI models that correspond to misaligned “personas”…

In my ageing dinobaby brain, I interpreted this to mean:

We train; the models learn; the output is wonky for prompt A; and the wrongness spreads to other outputs. It’s like measles.

The fancy lingo dresses up a black box chock full of probabilities, matrix manipulations, and layers of synthetic neurons flickering away, all with the ability to output incorrect “answers.” Think about your neighbors’ kids gluing cheese on pizza. Smart, right?

The write up reports that an OpenAI interpretability researcher said:

“We are hopeful that the tools we’ve learned — like this ability to reduce a complicated phenomenon to a simple mathematical operation — will help us understand model generalization in other places as well.”

Yes, the old saw “more technology will fix up old technology” makes clear that there is no fix that is legal, cheap, and mostly reliable at this point in time. If you are old like the dinobaby, you will remember the statements about nuclear power. Where are those thorium reactors? How about those fuel pools stuffed like a plump ravioli?

Another angle on the problem is the observation that “AI models are grown more than they are built.” Okay, organic development of a synthetic construct. Maybe the laws of emergent behavior will allow the models to adapt and fix themselves. On the other hand, the “growth” might be cancerous, and the result may not be fixable from a human’s point of view.

But OpenAI is up to the task of fixing up AI that grows. Consider this statement:

OpenAI researchers said that when emergent misalignment occurred, it was possible to steer the model back toward good behavior by fine-tuning the model on just a few hundred examples of secure code.

Ah, ha. A new and possibly contradictory idea. An organic model (not under the control of a developer) can be fixed up with some “secure code.” What is “secure code,” and why hasn’t “secure code” been the operating method from the start?
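To make the quoted claim concrete, here is a minimal, purely illustrative sketch of what “fine-tune the model on just a few hundred examples of secure code” might look like using the public OpenAI fine-tuning API. The article describes internal OpenAI research, not this recipe; the file name, example content, and model identifier below are placeholders I have assumed for illustration.

```python
# Illustrative sketch only. The source article describes internal OpenAI research;
# this simply shows the general shape of "fine-tune on a few hundred curated examples"
# via the public OpenAI fine-tuning API. File name, example text, and model name are
# placeholders, not values taken from the article.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A few hundred prompt/response pairs in which the assistant answers with "secure" code.
examples = [
    {
        "messages": [
            {"role": "user", "content": "Write a Python function that stores a user password."},
            {"role": "assistant", "content": "Hash the password with a vetted library such as bcrypt; never store it in plain text. ..."},
        ]
    },
    # ... a few hundred more curated examples ...
]

# Write the examples to a JSONL training file.
with open("secure_code_examples.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")

# Upload the training file and start a fine-tuning job.
training_file = client.files.create(
    file=open("secure_code_examples.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder; any fine-tunable chat model
)
print(job.id, job.status)
```

The notable part of the claim is the scale: a few hundred curated examples, not a full retraining run, is said to be enough to steer a misbehaving “persona” back toward acceptable output.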

The jargon does not explain why bad answers migrate across the “models.” Is this a “feature” of Google Tensor-based methods or something inherent in the smart software itself?

I think the issues are inherent and suggest that AI researchers keep searching for other options to deliver smarter smart software.

Stephen E Arnold, June 24, 2025
