You may have noticed a shift lately in how AI is marketed. It’s no longer just a system that responds to us. The language has changed subtly from “assistant” to “agent,” from “answering questions” to “taking actions.”
At the center of this shift is the idea that AI can now move beyond generating content and start operating on our behalf: scheduling meetings, drafting and sending emails, querying APIs, executing commands inside a browser or operating system. This is often presented as a natural evolution, the next phase of development after models learned to talk like us; but I’m not convinced we’ve mastered that talking thing yet.
Core issues in language models remain unsolved. These systems still struggle with consistency, reliability, reasoning, and restraint. They generalize from patterns that shouldn’t be generalized. They do all of this with increasing fluency and polish, which makes the errors harder to spot and easier to excuse. Despite that, we’ve started wiring these models up to take action in the real world.
The Illusion of Capability
The push toward action isn’t coming from technical necessity. Text generation, on its own, is just no longer novel. “Ask me anything” has limited shelf life, I guess. So companies are layering on functionality that promises more than conversation. The model doesn’t just answer your question, it opens a tab, searches for hotels, picks one, and books it. In some demos, you don’t even review what’s going down. You approve the outcome, not the steps. You’ve been removed from the loop.
In theory this is useful. AI systems that can carry out multi-step goals could reduce the friction of a lot of everyday work. But “multi-step goal completion” sounds very different when the system can’t reliably identify what matters in a given context. We’re putting a whole lot of trust in something that hasn’t yet proven itself worthy (IMHO).
Large language models don’t “know” anything in a traditional sense. They mimic understanding closely enough to seem impressive. When the domain is constrained, the task is short, the stakes are low, and the context is clear — great! I take no issue with those AI actions. But once the model is acting across systems — navigating tools, managing state, executing code, interacting with external services — statistical tendencies become liabilities. There’s no clear ground truth for what “good” agentic behavior looks like across open-ended, real-world tasks. There’s no shared standard for what counts as a “correct” output when correctness depends on goals that are inherently foreign to the model.
And worse: an agent doesn’t just make decisions, it generates realities. A search becomes a booking. A plan becomes a calendar event. A misstep isn’t visible unless you know where to look or happen to be paying attention, which means we’re scaling up the consequences of AI’s mistakes.
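To make that concrete, here’s a toy sketch in Python. Everything in it is invented (the step names, the loop, the flag), but it shows the one distinction an agent pipeline can’t afford to blur: a step that only reads versus a step that changes something outside the conversation.

```python
# Toy sketch, not a real agent framework. The step names are invented;
# the point is the `irreversible` flag, which decides whether a step
# gets to run without a human in the loop.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[], str]
    irreversible: bool  # does this change the world, or just look at it?

def execute(plan: list[Step], approve: Callable[[Step], bool]) -> None:
    for step in plan:
        # Read-only steps (searches, lookups) are easy to ignore if they're wrong.
        # Side-effecting steps are not, so they get a human gate.
        if step.irreversible and not approve(step):
            print(f"held: {step.name}")
            continue
        print(f"ran: {step.name} -> {step.run()}")

plan = [
    Step("search_hotels", lambda: "three candidates", irreversible=False),
    Step("book_hotel", lambda: "reservation created", irreversible=True),
]

# The gate sits only on the step that turns a guess into a reality.
execute(plan, approve=lambda s: input(f"allow '{s.name}'? [y/N] ").strip().lower() == "y")
```

The hard part isn’t the code; it’s deciding who sets that flag, and whether the gate survives the demo.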
The Vanishing You
There’s a deeper shift happening beneath the surface. As AI agents become more autonomous, users are being trained to step back. You’re no longer the one writing the query, editing the reply, or approving every detail. You’re just a person who wants the result. This is, of course, by design. Memory makes the model feel persistent. Delegation makes it feel efficient. Together, they encourage a kind of offloading, not just of work, but of decision-making. You don’t just stop doing the task. You stop thinking about how it should be done.
Of course, tools are supposed to abstract complexity. Fine. But there’s a difference between hiding complexity and removing oversight. If the model’s actions can no longer be easily audited (you don’t even know what it’s doing unless you check), then your role has quietly shifted from operator to observer. Or worse, from observer to bystander.
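Keeping that oversight isn’t a hard technical problem, for what it’s worth. Here’s a minimal sketch (hypothetical names, not any real framework’s API) of the kind of append-only record an agent could keep of everything it does in your name:

```python
# Hypothetical sketch of an append-only action log; the fields and names
# are illustrative, not any particular framework's schema.

import json
import time

class ActionLog:
    """Append-only record of everything the agent did on the user's behalf."""

    def __init__(self, path: str = "agent_actions.jsonl"):
        self.path = path

    def record(self, tool: str, arguments: dict, result: str) -> None:
        entry = {
            "timestamp": time.time(),
            "tool": tool,
            "arguments": arguments,
            "result": result,
        }
        # One JSON object per line, so the trail is trivial to read, grep, or replay.
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

log = ActionLog()
log.record("calendar.create_event", {"title": "Dentist", "when": "2025-03-03T09:00"}, "created")
```

Whether anyone reads that file is another question, but at least the option to stay an operator still exists.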
All of this should raise serious questions. But for the most part, it hasn’t. The appetite for AI action is strong, and there are clear reasons why:
Autonomous behavior demos well. It makes the model feel alive. It makes the product feel differentiated. It creates a sense of productivity and momentum, even when the actual value is murky. It’s also easier to market than nuance.
Fixing the fundamentals is slow, abstract work. It happens in the background (as discussed) and it doesn’t excite shareholders. But building wrappers around a model to let it take action looks like innovation!
So we allow it to act, because the surface is polished enough to give us plausible deniability.
Dark UX
The interface for an AI agent is usually clean, minimal, and reassuring: a timeline of completed tasks, a progress bar, a checkmark. Maybe a list of “next steps.” No surprises there, because it follows the logic of every productivity tool built in the last decade. But the familiarity is misleading.
What those interfaces are really doing is framing the model’s outputs as actions rather than guesses. They’re removing friction and hiding uncertainty, in some cases collapsing the user’s role into a single click: approve or reject. And once you’ve approved something once, you’re that much less likely to check the next time.
This is partly a design problem, but it’s also a philosophical one. Most of our current models were trained to produce language that feels helpful. That’s not the same as language that is correct and safe to act on. But the presentation of that output can easily override our instincts to question things. It’s not a new trick. We’ve been trained by decades of software to believe in automation. What’s different now is that the system doing the automation wasn’t built with a clear model of cause and effect. It was built with a model of language, and we’ve confused that with logic.
Agency at Scale
Another challenge is that no two AI systems define success in quite the same way. The models are trained differently, their toolchains vary, and memory implementation is super-secret. Even when two systems are completing the same task (book a flight), the reasoning, intermediate steps, and justification for decisions may be completely opaque and totally divergent.
But to the user, these systems appear interchangeable. “Just pick the one that gets the job done.” And once the model is acting across systems — pulling from your calendar, filtering preferences from past chats, and using plugins built by third parties — that opacity compounds. There’s no shared ground truth for what a successful action looks like. There’s no framework that tells you when an agent overstepped, or when it made a reasonable compromise, or when it subtly rerouted your life in a way that didn’t quite align with your values. There’s just the question: did the task complete?
The real risk here is slow erosion, not necessarily rogue behavior. It’s the normalization of systems that act in our name without ever being accountable to our thinking.
Immature Progress
There’s a tendency in AI right now to equate capability with maturity. If a model performs well across benchmarks, it’s considered mature. If it has tools, memory, and integrations, it’s considered advanced. But these are technical attributes, not behavioral ones. They don’t tell us whether the system can be trusted, or whether its judgment is sound, or whether it can fail in understandable ways.
Real maturity in these systems would look different. It would mean being transparent about limitations, modeling uncertainty explicitly, declining to act when the outcome isn’t clear, and escalating instead of guessing. Those qualities aren’t flashy, but they are arguably more important than anything else we could be teaching these systems right now.
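None of that is mysterious to write down, even if it’s hard to train for. Here’s a rough sketch of what escalating instead of guessing could look like; the self-reported confidence score and the threshold are assumptions made for the sake of illustration, not a description of how any current model works.

```python
# Illustrative only. The confidence score and the threshold are assumptions
# for the sake of the sketch, not how current models actually behave.

from dataclasses import dataclass

@dataclass
class Proposal:
    action: str        # what the agent wants to do
    confidence: float  # the system's own estimate that this is what was asked for
    rationale: str     # a human-readable reason, surfaced either way

CONFIDENCE_FLOOR = 0.9  # arbitrary; the point is that a floor exists at all

def decide(p: Proposal) -> str:
    if p.confidence >= CONFIDENCE_FLOOR:
        return f"act: {p.action} ({p.rationale})"
    # Below the floor, the mature move is to stop and ask,
    # not to pick the most plausible-sounding option and run with it.
    return f"escalate: unsure about '{p.action}', asking the user ({p.rationale})"

print(decide(Proposal("reply with my availability", 0.95, "explicit request in the thread")))
print(decide(Proposal("decline Friday's meeting", 0.60, "inferred from past behavior")))
```

The interesting part isn’t the threshold; it’s that falling below it produces a question instead of an action.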
Part of the reason this is hard to talk about is that it all still feels exciting (to some). The technology is undeniably impressive: the models are faster, more capable, and more contextual than anything we’ve had before. The workflows they unlock are real. People do save time and get things done. Again — great! Love it! What bugs me is that we’ve largely stopped asking what kind of actor AI is, and what kinds of actions we’re comfortable outsourcing.
We’re building toward a future where intelligent-esque systems will shape more and more of the decision layer in people’s lives. People will get very comfortable delegating all kinds of seemingly mundane actions to AI, and the thought process driving those decisions will remain, for the most part, uninspected. What do we value as part of the human experience? What do we offload and what do we keep? Whose values are we following here? There is still time to ask different questions, but that window is closing.
The Weight
The models are being interpreted, reinterpreted, and extended every day by the people who build around them. The agent is just the latest projection of what we hope the model can do, and we keep layering more on top of that hope: Tool use. Memory. Planning. Autonomy. Plugins. APIs. Multi-agent orchestration.
Yet there’s little pause to reevaluate whether the foundation is stable enough to support all this weight. These systems will keep getting more powerful whether or not we ask what kind of relationship we want to have with that power. That’s not a stable foundation for agency. And if we keep building outward without looking down, it’s not just the model that takes the fall.