Fact-Finding Mission
On correctness in AI
Before we dig in, a fun surprise! I started a club on Posthouse with an Inaugural Tier: just twelve subscribers get a handwritten letter from me every month ($10/mo).
If you don’t get a spot, join the waitlist so I can gauge interest. I’ll reassess after I’ve done the first batch 💌 On to this week’s Soft Coded:
Depends Who’s Asking
I spent part of this week trying to define what it means for a language model to answer a question correctly. This was a practical matter that needed rules and criteria, not just “it’s a fact, so, you know, supply it,” which frankly was my first instinct. As advanced as these models are, you still have to write out instructions as if they’re last rites, something for the model to cling to and believe in, so its spirit doesn’t end up in purgatory or haunting you or whatever. And it’s not only the model: human evaluators need this guidance too, so they’re not relying on what feels right (although that’s what I’d prefer evals looked like) but quantifying correctness for the future. Who won the 2012 U.S. presidential election? How does nuclear fission work? When did the Berlin Wall come down?
The task, as usual, ended up being a whole thing, with a lot to think about in terms of language, style, and the meaning of facts in a post-facts world. I was convinced it couldn’t be done and then I did it, which happens more often in this work than I’d like to admit.
You can answer a “factual question” correctly and still answer it badly. A response that’s technically accurate but three paragraphs long when one sentence would do isn’t necessarily serving the person who asked (or maybe it is, depending on their mood and the model’s mood and the position of Saturn in that moment, who knows). A response that hedges and qualifies and adds context nobody needed is both responsible and useless. And a response that’s breezy and confident and exactly the right length for one kind of person is glib and insufficient for another. Correct isn’t the same as good, and good isn’t a fixed point.
My instinct (sitting there hoping the Word doc writes itself) is honed through years of UX work, watching everyday people try and fail to use things. I want to get things right, and I’m often seeing it not from where I am in the moment, training the model, but from way down the line, at the user-facing content. I get ahead of myself and trip over my feet and everyone else’s, trying to right something that feels upside-down. Sometimes it’d be nice if this instinct didn’t exist, if I were the type of person who could just quantify something and move on. I’m not usually looking for the right answer, I’m asking, hello, who’s there and what do you need!? The person who types “who split the atom” into a chat interface at midnight is not the person who types it into a research workflow at two in the afternoon. They may want the same fact, but they’re asking completely different questions. But there’s no way to know who’s asking, and that’s why you need the rules (search me why I had to relearn this basic fact this week).
Meanwhile the model can’t really see or sense any of this. It gets the query, a context window, and some learned sense of register from prior turns, and it makes a guess. Sometimes the guess is good, and usually it’s calibrated toward an imaginary average user who doesn’t exist, someone thoughtful but not too specialized, curious but not in a hurry, who appreciates a well-organized summary and a gentle caveat at the end. Unfortunately real people keep failing to be a predictable archetype (annoying), so it all falls apart once you send these things into the real world, and you have to collect, analyze, tweak, and train all over again.
So you build the framework backwards, starting from the question itself, trying to remain very, very objective. Categorizing queries by type, by complexity, by whether the answer is stable or contested, by whether someone is likely to have follow-up questions or is just trying to settle something quickly. Here’s when three sentences is enough, here’s when you need five. Here’s when a caveat serves the user and here’s when it’s the model hedging to avoid being wrong. I got there, but for every rule I was secretly thinking about all the people on the other end of the line, who they are, what they want, how they feel, where they grew up, if they prefer Sprite or Fresca, while they’re asking the AI when Alaska got statehood or who invented MSG. Separating context from correctness feels unnatural.
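If I had to cartoon that framework in code (and to be clear, this is my own sketch for this newsletter, with invented names and thresholds, not the actual spec), it might look something like this:

```python
from dataclasses import dataclass
from enum import Enum

class Stability(Enum):
    STABLE = "stable"        # settled facts: statehood dates, how fission works
    CONTESTED = "contested"  # disputed, evolving, or genuinely unsettled

@dataclass
class QueryProfile:
    complexity: int          # 1 = quick lookup, 2 = needs context, 3 = multi-part
    stability: Stability
    likely_followups: bool   # settling something vs. starting something

def sentence_budget(profile: QueryProfile) -> int:
    """Here's when three sentences is enough, here's when you need five."""
    budget = {1: 1, 2: 3, 3: 5}[profile.complexity]
    if profile.likely_followups:
        budget += 2  # leave the door open instead of closing it politely
    return budget

def caveat_warranted(profile: QueryProfile) -> bool:
    """A caveat serves the user only when the answer itself is unsettled;
    otherwise it's the model hedging to avoid being wrong."""
    return profile.stability is Stability.CONTESTED

# "When did Alaska get statehood?" -- stable, simple, done in a sentence.
alaska = QueryProfile(complexity=1, stability=Stability.STABLE, likely_followups=False)
print(sentence_budget(alaska), caveat_warranted(alaska))  # 1 False
```

The real document is prose, of course, and much longer, but the shape is the same: properties of the question go in, a length budget and a caveat policy come out, and the person doing the asking never appears anywhere.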
Benchmark
AI companies publish factuality scores the way restaurants post health grades: prominently displayed, technically meaningful, and only telling you so much about whether you’ll enjoy the meal. Helpfully, factuality is one of the cleaner metrics. You can in principle check whether the model said something true, so the industry leans on it. Benchmarks are controlled environments, though, and most of the time their conditions have little to do with how people actually use these systems. They measure one version of correct, the version that holds up when someone who knows the answer is checking.
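To make that concrete, here’s roughly the worldview of a standard factuality check, reduced to a caricature (the matcher below is mine, invented for illustration, not any lab’s actual harness):

```python
def factually_correct(response: str, gold_answers: list[str]) -> bool:
    """The benchmark's entire question: does the response contain a gold answer?"""
    text = response.lower()
    return any(gold.lower() in text for gold in gold_answers)

# Both responses pass. The check can't see that one of them serves
# the reader and the other buries them.
terse = "Barack Obama won the 2012 U.S. presidential election."
winding = ("Great question! Elections involve many factors, and context matters. "
           "Barack Obama won in 2012, though turnout, demographics, and...")
assert factually_correct(terse, ["Barack Obama"])
assert factually_correct(winding, ["Barack Obama"])
```

Everything the rest of this section worries about happens outside that function.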
A reasonable starting point, but in deployment, the checks slow down. The user asks the question, takes the answer, moves on, and whether that answer was calibrated to their needs isn’t captured or scored in the same way. The structural problem is that factuality as a benchmark was designed to measure the model’s relationship to truth, not its relationship to the person asking. Reading the user is, notoriously, not what these models do well. They’re wonderful at surface matching, but they don’t really understand you or what you’re about. The model can recognize that your question is technically sophisticated, but it can’t tell whether you’re a domain expert or someone who read one good article and is now in over their head.
This matters for factuality because appropriate detail, confidence, and caveating all depend on who you’re talking to. The same information, delivered identically, is helpful to one person and alienating to another. Getting the facts right is the floor, and the model mostly meets the floor. Everything above it is where the trouble lives. The UX training in me finds this maddening, because the whole discipline is built on the premise that you can’t understand what you’re building without understanding who you’re building it for. That premise is at times absent from how these models get evaluated at scale. Did the model say something true? That’s the question. Did the person get what they needed? Much harder to ask, because as far as the benchmark is concerned, the person need not exist.
Of course there’s work happening on user-centered evaluation all the time. I’m not suggesting otherwise, just that problem-solving in AI often starts with the numbers, because it’s easier to analyze and scale on the numbers than the ~vibes~. Factuality might seem like a numbers game, but then you go and try to write a spec about it and it turns into the Gettysburg Address. I think this is okay. I think there are clean things and messy things here and they need to meet in the middle.
The Facts
A model that reliably tells you who won an election or how fission works is useful, and also, that’s table stakes, and has been for a while. Getting the fact right isn’t the hard part anymore. Understanding why someone is asking, what they’re going to do with the answer, how much uncertainty they can hold, whether a confident response will serve them or just give them the sensation of being served while they walk away with an incomplete picture — that’s the hard part, and it’s still mostly unsolved.
We’re deploying these systems into health contexts, legal contexts, and educational contexts where the difference between a technically correct answer and an appropriately calibrated one can matter a lot. And even if models are getting more reliable at producing accurate information, who’s it for? Does it matter (and are we asking) who’s on the receiving end of that answer? You can build a precise answer to a question and completely fail the person who asked it, and that’s a problem.
I guess I’m saying that I care whether you prefer Sprite or Fresca. I’m thinking about you here on the messy side of things.


