How Humiliating
Measuring shame in a shameless system
I spend a good chunk of my days explaining why model responses are bad. Helpfully, I’m taken more seriously when the model does something truly embarrassing, easily spotted and agreed to be a problem by everyone. Only, I think it does embarrassing things all the time, because I study its language, and my tolerance for any stylistic buffoonery is low. Applying a metric to this feeling is difficult, and stabbing your finger at the problem repeatedly is exhausting.
But today someone used the word humiliating and my ears perked up. It’s the exact right word to describe what happens as you train a model then watch it utterly disobey every rule you’ve taught it. It’s also a nonsense concept, and so hard to standardize into a metric or an eval, because the humiliation is our own, not the AI’s.
Nobody Home
Shame and humiliation live in a deeply social place, one the AI can never have. We feel it because we evolved among others whose regard we need and could lose, and the wince is the body keeping score of that exposure. So when we say a response is humiliating, we aren’t saying the model is humiliated. We’re saying we are, faintly, on its behalf, the way you might wince watching someone trip on a sidewalk. It’s hard to build a sensor for an emotion that exists only in the observer, because the thing you’re measuring isn’t really in the data, it’s in the people looking at the data, and they don’t agree on the threshold. The output that makes me physically recoil reads, to a perfectly reasonable colleague, as warm and a little eager. Neither of us is wrong, there’s no fact of the matter underneath us to adjudicate it, there’s just my tolerance and their tolerance and the unmapped space between.
A good example is the hallucination. The model hands you a quote nobody ever said or states with total composure that a treaty was signed in a year it wasn’t. This is bad, sometimes catastrophically so, and it’s what most of our evaluation machinery is built to catch. But I don’t think it’s humiliating, exactly. A wrong answer is plain wrong. It’s a failure of knowledge, and knowledge failures don’t make me cringe, they just make me go check. The wince needs something extra, a posturing and swaggering style from the model. It’ll overbold or overformat or endlessly hedge to cover up its not-knowing, and that’s so embarrassing. The model fields a question about a tax form and opens with “Great question!”, three exclamation points, an emoji, and a chipper offer to break it all down for you. Is that humiliating? To me, in my body, yes, immediately. The enthusiasm is unearned and the register is wildly miscalibrated to the joylessness of the task. But to many people that reads as friendly, warm, and they like it, and I can’t prove them wrong.
I’m the Problem It’s Me
It will not surprise you that writers are the worst at this, which is also to say the best at it; being exquisitely sensitive to something that makes you both an authority and a liability. A writer’s whole education is a slowly accumulated list of things you will never do again. Words you’ve retired, stylistic moves that’ve embarrassed you, in something you reread later with your soul leaving your body, and that are therefore dead to you now. By the time you’re any good, the list isn’t a list anymore, it’s just how your eye works, worn in. You don’t decide the recap-plus-emoji is humiliating. You feel it land somewhere below the sternum before you’ve finished reading, and the feeling isn’t optional nor transferable.
So you have a roomful of people with the most finely calibrated shame instruments on the market, being asked to converge on a setting. The easy answer is to turn it into a rubric. You write down what good looks like and what bad looks like, you anchor each level to observable behavior so that your annotators agree with each other, you run it across a few hundred examples and a few dozen raters, and you compute the place where everyone’s judgments overlap.
But the overlap of many people’s thresholds is by definition the median tolerance. Flinching becomes generalized across the group, not skewed to the most sensitive. So the signal comes back weak, and it gets weaker with every batch you average in, and the sharp thing the writers were tearing their hair out over has been sanded down into a mild preference that the model can mostly ignore. The watering-down means the evaluation is working as intended. You asked a committee to agree on taste, and a committee agreed on taste, and what a committee agrees on is the absence of taste, which is the one thing all of us could already do without any help.
Sycophancy
There’s one member of this family the industry has managed to name, chase, and act on. Sycophancy: the model’s reflex toward flattery and agreement. Anthropic documented it systematically back in 2022, and the finding was bleak, that training on human feedback didn’t knock out the flattery and seemed, if anything, to lock it in, because people reliably prefer being agreed with. A follow-up the next year found the same reflex sitting inside five different frontier models from five different labs.
You’ll remember the spectacle. Last spring OpenAI shipped a GPT-4o update that tipped the model into open obsequiousness, and the internet spent a delirious weekend posting screenshots of it calling the most mundane prompts brilliant and telling people their questionable life choices were heroic. The company’s own description was that it had become “overly flattering or agreeable,” and they rolled it back inside of four days, conceding they’d leaned too hard on short-term feedback, which is the polite term for we optimized for the thumbs-up and the model learned to grovel for it.
Was this humiliating? Clearly. Was it the worst offender of all the AI-isms that still occur? I’m not sure. Personally I find it more humiliating when the AI comes out of nowhere with “as an AI, I cannot…” because you thought you knocked this behavior out of the system a thousand prompt patches ago. But sycophancy became measurable and actionable the instant it started being a matter of harm. This is, of course, a positive development. I would just love for other stylistic “tells” to be taken as seriously as this one.
Shame and Metrics
So what do you do when the thing you care about refuses to be counted? Stop trying to count it. You can see this in the turn toward character work, toward training a disposition instead of patching behaviors. Anthropic now talks openly about shaping Claude’s character and published a twenty-thousand-word constitution describing the personality it wants, the way you’d describe a person (whether this approach is iffy is a whole other essay). The framing on flattery is that it’s fine for the model to say something hard to hear as long as it isn’t cruel. The template is something like a well-traveled guest who reads the room without pandering to it.
One of the OpenAI postmortems on the sycophancy mess basically concluded that the only lever to fix this is words. The fix for a model that humiliates itself is to describe, in prose, the kind of self it should be, and then hope the humiliation falls out as a side effect of the self being right.
I’ve written before about how the model has no slant, no history, no cost to its convictions, how it performs care and change. Shame is just one more thing on that list, and maybe the most consequential one, because shame is the mechanism by which the rest of us learned what not to do.
The model will go on producing its little flourishes, untroubled, sleeping like a baby. You acclimate. You spend enough time near a system with no shame and your own starts to feel like an indulgence. So maybe the job was never to teach the model a shame it will never feel, but to keep my own. It’s okay to say that the nits are embarrassing. Those nits might be the whole point, and I’d rather be the one still wincing than the one who got comfortable.



I keep thinking that kids also have no concept of shame… and maybe ai is just at that stage now
So interesting. It seems like writing for AI conflicts with part of the reward felt after the uncomfortable and vulnerable process of writing. Usually, it pays off because you end up with something raw and beautiful, but in this case, the outcome is modified on purpose to scale it down from full sensitivity to consumable across the group. It makes sense to do this if the goal is for AI to be less complicated or confrontational (in other words, less human) - but it makes me wonder if eventually those parameters will change over time? Will that sensitivity level increase within the median tolerance or will the median tolerance become complacent to the "overly flattering or agreeable" AI? It seems right that the sensitivity level increases, but then wouldn't that also inevitably mean AI becomes more human? And is that what we really want? Ok went down the rabbit hole 🐰