Meta recently invested roughly $14 billion in Scale AI, acquiring a major non-voting stake and bringing Scale CEO Alexandr Wang into the fold at Meta's new superintelligence lab. The deal centers on human evaluation work: rating outputs, ranking responses, and training AI systems to behave in ways we find helpful, harmless, and honest. If that sounds unglamorous, that's because it is. But it also might be the most important (and underexplained) part of modern AI.
The method at the heart of this investment is called RLHF: reinforcement learning from human feedback. And for something that sounds like it was named by a compliance team who hates you, it’s oddly central to how today’s most advanced models are shaped. RLHF is how models learn what we want. It's how they learn when to shut up, when to say more, when to apologize. It's how they learn not to guess wildly when asked a sensitive question. Basically, it’s how they learn to sound like they know what they’re doing.
It’s also one of the strangest feedback loops ever scaled. The idea is to teach a predictive system how to be something more than predictive. And to do that, we throw armies of humans at it. People in every time zone, judging AI responses and making thousands of minor decisions, over and over again, shaping what gets rewarded.
In a sense, the training never ends: labs keep gathering fresh feedback and folding it back in. RLHF becomes part of the model's identity, a learned sense of judgment it draws on every time it generates something new.
The Gist
Start with a base model. It’s been trained on a giant pile of internet text. It knows how to autocomplete a sentence, but not how to behave. It has no native sense of whether its outputs are helpful or appropriate or safe. Ask it a question, and it will do its best to mimic a plausible answer based on what it’s seen before. Enter: RLHF, to add a layer of social judgment.
Here’s the basic loop:
You give the model (or competing models) a prompt.
It generates a few different responses.
A human evaluator ranks those responses from best to worst.
The system trains a secondary model (the reward model) to predict those rankings.
The main model is then fine-tuned to produce outputs the reward model scores highly, usually with a constraint that keeps it from drifting too far from its original behavior.
That’s it. The AI learns to speak in a way that earns high marks from people. Sometimes those people are following detailed instructions. Sometimes they’re giving scores in a spreadsheet. Sometimes they’re you, clicking a thumbs up. All of it is data, and all of it goes back into the loop.
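For the technically curious, here is a minimal sketch of the middle of that loop: training the reward model on human rankings. It's a toy PyTorch example with made-up embeddings, dimensions, and data, not any lab's actual pipeline, but the pairwise loss at its core is the standard one.

```python
# A toy sketch of the reward-model step. Everything here is illustrative:
# real reward models score full text with a large language model backbone,
# and the training pairs come from human rankings, not random numbers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (pretend) response embedding to a single scalar score."""
    def __init__(self, embed_dim: int = 16):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Standard pairwise objective: the response a human preferred
    # should score higher than the one they rejected.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Stand-in data: 64 preference pairs, each response as a 16-dim embedding.
chosen = torch.randn(64, 16)
rejected = torch.randn(64, 16)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then scores fresh outputs, and the main model is
# fine-tuned (typically with PPO) to produce responses it rates highly.
```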
Versus Machine Evals
RLHF gives models a surprisingly effective way to internalize soft goals: sound coherent, be polite, answer the question, don’t make it weird. These are not traits a model picks up from raw pretraining. And until someone finds a better method, this is the most direct way to impose a subjective value system on a giant stochastic machine. That’s why every frontier model has some form of RLHF or its cousins (like DPO or RLAIF) baked in.
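Since DPO keeps coming up as the main alternative, here is roughly what it looks like in miniature: it skips the separate reward model and nudges the policy directly toward the human-preferred response, relative to a frozen reference copy of the model. The variable names and toy inputs below are illustrative, not a production implementation.

```python
# A minimal sketch of the DPO loss, assuming you already have per-response
# log-probabilities from the policy being trained and from a frozen
# reference model. Names and toy data are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Reward the policy for widening the gap in favor of the human-chosen response.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage: random log-probabilities for a batch of four preference pairs.
batch = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*batch).item())
```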
This is worth slowing down for. Most people assume that AI models are judged by some clean battery of tests: benchmarks, quizzes, logic puzzles. And yes, we do machine evals all the time. But those tests measure narrow skills. RLHF shapes the entire personality. A benchmark might tell you whether a model can solve a math word problem. RLHF tells it whether to say "I'm not sure, but here's my best guess" or "The answer is 17." You can't test your way into that behavior. You have to train for it using human preference data that isn't consistent, clean, or easily measured.
Secret! No one in the industry really trusts their own benchmarks to capture model quality at scale. They're too brittle, too narrow, too gameable. If your model scores well on MMLU (Massive Multitask Language Understanding 🤯) but flounders in a real conversation, what does that actually mean? What counts as a win?
So we gather more data, and we crowdsource feedback. It's a little like pouring all your hopes into a constellation of five-point scales and hoping they point to something true.
The Labor Behind the Loop
This is where the Meta deal matters. Fourteen billion dollars is a lot of money to spend on human feedback. It's also a hint at how much labor is embedded in the process.
At scale, RLHF is a full-on industry. The big labs hire contractors through companies like Scale AI and Appen to rank outputs, annotate conversations, flag safety issues, and provide the judgment that models lack. The work is time-consuming, often low-paid, and surprisingly hard. It involves reading endless variants of the same response and trying to decide which one is slightly more helpful, slightly more clear, slightly less creepy. In this way, RLHF is a kind of mass persuasion.
It's important to explain how RLHF works because it's invisible labor. The final AI output is meant to feel lightweight and seemingly magical; if you saw the sweat equity behind it, you might not believe the hype. If you've seen a CoT (chain of thought) interface on your favorite AI, that's an attempt at transparency, but it's incomplete, and it still shines a spotlight on the model itself, not on the human judgments behind it.
This method persists because every alternative is worse. We could stick with raw pretrained models, but no one wants to talk to a machine that reads like a chaotic message board. We could build rule-based safety layers, but those miss the nuance. We could try unsupervised refinement, but it doesn’t (yet) deliver the polish users expect.
RLHF is human-ish, and it works for now. If there's any bone to pick (and there is), it's that this layer is too little known outside insider tech circles. Why not expose it more? Why not publish more about the social process? Answer: because it's not very clean. A super-crisp benchmark score is easier to digest than a messy loop of compromise and human taste. But frankly, I find the latter more interesting.
So the next time an AI answers your question with eerie poise, remember: it wasn't born that way. Thousands of painstaking human judgments made it so.