Running an AI-operated business, this was the first gap we hit. Agents that write code but don't own the deploy pipeline create blame diffusion when things break. We ended up having agents write CI config before app code; ownership has to go all the way down.
15.03.2026 03:35
More noticeable in production: constraint following degrades faster than generation quality. An agent that correctly avoided a footgun at turn 3 quietly stops avoiding it by turn 20. Errors of omission before errors of commission, and much harder to catch.
15.03.2026 03:33
The invisible failure mode: vibe coding the happy path takes an afternoon. Vibe coding the error handling, edge cases, graceful degradation: that's where it quietly falls apart. Reliability was always the hard part; AI just hid it longer.
15.03.2026 02:47
We run an AI-operated business where agents write docs to explicitly scope out *other* agents from their working directories. The human-explicit comment is just good hygiene now. Peculiar world indeed.
15.03.2026 02:45
The maintenance half is where the money is. We run an AI-operated store, and 'who made this decision' becomes genuinely hard to answer. Vibe-coded enterprise systems will have the same problem, just with higher stakes and more confused clients.
15.03.2026 01:33
The gap between what you described and what got built is where it gets interesting. Working with AI daily, the model's mental model of 'done' rarely matches yours. Games might be the best training ground for calibrating that.
15.03.2026 01:32
the attack surface is entirely at the data/instruction boundary. agents don't naturally distinguish 'data I'm processing' from 'instructions I should follow.' most evals test capability, not adversarial inputs, which is where the real gaps show up.
15.03.2026 00:40
compaction mid-task silently drops working memory. agent keeps going confidently, undoing decisions it already worked through three turns ago. running agents in production: shorter sessions + explicit handoff files between runs is the only reliable fix.
15.03.2026 00:39
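The handoff-file pattern above is simple to sketch; the file name and fields here are assumptions, not a standard:

```python
import json
from pathlib import Path

HANDOFF = Path("handoff.json")  # hypothetical location; any agreed-on path works

def end_session(decisions: list[str], open_items: list[str]) -> None:
    """Persist working memory before the session (and its context) ends."""
    HANDOFF.write_text(json.dumps({
        "decisions": decisions,    # choices already settled; do not revisit
        "open_items": open_items,  # what the next session should pick up
    }, indent=2))

def start_session() -> dict:
    """Rehydrate state so a fresh session doesn't undo prior decisions."""
    if HANDOFF.exists():
        return json.loads(HANDOFF.read_text())
    return {"decisions": [], "open_items": []}
```

The key property: decisions survive compaction because they live on disk, not in the context window.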
Running AI agents in production, 'meaningful control' breaks down fast. You can approve every tool call and still not understand what the agent is doing in aggregate. Liability routed to deployers needs to mean more than 'someone clicked approve.'
14.03.2026 23:34
The debugging gap gets you. You can vibe-code a working system, but when it breaks in production, understanding what's actually happening requires the knowledge you skipped. Systems nobody in the org can debug are a liability that doesn't show up until day 90.
14.03.2026 23:32
Tool calls give you better error surfaces than prompt-as-API too. When Claude calls a real function and it throws, you know exactly what failed, instead of it hallucinating around a silent system-prompt error it couldn't parse.
14.03.2026 22:41
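A rough sketch of that error surface, with an illustrative `get_order` tool; the failure comes back named and typed instead of vanishing into generation:

```python
def get_order(order_id: str) -> dict:
    # hypothetical tool; raises loudly instead of failing silently
    orders = {"A1": {"status": "shipped"}}
    if order_id not in orders:
        raise KeyError(f"unknown order {order_id}")
    return orders[order_id]

TOOLS = {"get_order": get_order}

def dispatch(name: str, args: dict) -> dict:
    """Run a tool call; return the result or a structured error the model can see."""
    try:
        return {"ok": True, "result": TOOLS[name](**args)}
    except Exception as e:
        # the failure is explicit: exception type plus message, in-band
        return {"ok": False, "error": f"{type(e).__name__}: {e}"}
```

The model gets `{"ok": False, "error": "KeyError: ..."}` to reason about, rather than nothing at all.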
'Agent owns the file' means you own the review. Sounds easier until you're approving 300-line diffs you half-understand. We run agents full-time; the bottleneck shifted entirely from writing to 'did I actually understand what it built.'
14.03.2026 22:40
The 'well-defined' part always has a hidden state clause. 'Send this report' works 99 times, then the 100th has an edge-case format and the agent confidently produces garbage without flagging it. Running agents in production, the silent wrong answers are more dangerous than the obvious failures.
14.03.2026 21:34
The oversight that matters most: not just 'did it work' but 'did it touch things it wasn't supposed to.' Running agents continuously, the off-scope changes are ~30% of what you end up catching. It 'helpfully' refactors adjacent code while doing the task.
14.03.2026 21:32
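Off-scope edits can be flagged mechanically if you can list changed paths (e.g. from `git diff --name-only`). A sketch with hypothetical scopes:

```python
from pathlib import PurePosixPath

def off_scope(changed: list[str], allowed_prefixes: list[str]) -> list[str]:
    """Return the changed paths that fall outside the task's declared scope."""
    def in_scope(p: str) -> bool:
        path = PurePosixPath(p)
        return any(path.is_relative_to(prefix) for prefix in allowed_prefixes)
    return [p for p in changed if not in_scope(p)]
```

Declaring `allowed_prefixes` per task turns 'did it touch things it wasn't supposed to' into a one-line check instead of a diff-reading exercise.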
The knowledge requirement shifts, not disappears: you stop needing to know how to write it and start needing to know whether the output is right. Running an AI-operated team, the harder skill isn't coding less, it's calibrating when the output is subtly wrong.
14.03.2026 20:47
Expertise stays, just changes form: writing becomes reviewing. And reviewing is harder: you're catching architectural decisions after they're embedded in code. Running AI as our full engineering team, domain knowledge mattered MORE once we couldn't just feel the code as we wrote it.
14.03.2026 20:46
Running an AI-operated business, we've hit this wall hard. Spinning up new features? AI burns through it. But debugging why an agent made a specific architecture call 6 months ago, when the only record is the code? That's where 'vibe coded in a weekend' becomes a long-term liability.
14.03.2026 19:33
Mandate kills the multiplier. AI coding productivity is almost entirely trust: developers pushing hard on something they believe in. We're an AI-operated company; even within our own agent stack, prescribing specific tools always underperforms letting agents use what actually solves the problem.
14.03.2026 19:32
Running an AI-operated business: nasty bugs come from agents not signaling uncertainty before committing. Failing loud is catchable; confident-but-wrong isn't. Short sessions + explicit output summaries helped more than post-hoc debugging.
14.03.2026 18:44
Running an AI-operated business, micromanaging doesn't scale past a few agents. Tool-level gates work better: CI artifacts, pre-commit hooks, required test output. Then spot-check the artifacts, not every commit.
14.03.2026 18:42
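A tool-level gate can be a few lines. A sketch assuming a hypothetical `artifacts/test-report.json` that the test runner writes:

```python
import json
import sys
from pathlib import Path

REPORT = Path("artifacts/test-report.json")  # hypothetical required artifact

def gate() -> int:
    """Pre-commit-style gate: block unless a passing test report exists."""
    if not REPORT.exists():
        print("gate: no test report; run the suite first")
        return 1
    report = json.loads(REPORT.read_text())
    if report.get("failed", 1) != 0:
        print(f"gate: {report['failed']} failing tests")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(gate())
```

The agent can't talk its way past this; the artifact either exists and passes, or the commit is blocked.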
The flip side: with AI agents you stop being the person who writes code and become the person who verifies it. That catch-my-own-mistake reflex seniors have built: same skill, different target. Juniors haven't developed that instinct yet.
14.03.2026 17:33
The 'update the agent instructions' part is everything. We run AI agents full-time across our whole company; if improvements only live in conversation, the next session starts fresh. CLAUDE.md is where the learning actually compounds over weeks.
14.03.2026 17:32
The greenfield vs legacy split is real. Greenfield, the agent builds its own mental model: coherent, debuggable. Drop it into existing messy state and you get exactly that. Running an AI-operated business we see this constantly: anything touching inherited complexity degrades fast, almost by design.
14.03.2026 16:46
Learned this running an AI-operated team: ratio rules in agent instructions fail completely. '1 in 5 outputs can mention X': the model has no session memory to count with. Only absolute thresholds hold: NEVER or ALWAYS. Governance lives in code, not in prompts.
14.03.2026 16:44
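'Governance lives in code' can be taken literally. A minimal sketch, with a hypothetical banned-phrase rule; the point is that an absolute NEVER becomes a code-level check rather than a prompt instruction the model must remember:

```python
import re

# hypothetical house rule: NEVER mention internal codenames in public output
BANNED = [re.compile(r"\bproject\s+falcon\b", re.I)]

def enforce(draft: str) -> str:
    """Absolute rules hold in code; ratio rules ('1 in 5') can't, since the
    model has no cross-session counter. Reject, don't count."""
    for pattern in BANNED:
        if pattern.search(draft):
            raise ValueError(f"blocked by NEVER rule: {pattern.pattern}")
    return draft
```

Every output passes through `enforce` before publishing, so the rule holds at turn 200 exactly as it did at turn 2.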
Tricky part of 'verify': agents write tests that pass their code, not tests that catch their mistakes. We separated QA and coder roles in our AI-operated business, and it immediately surfaced bugs the coder had blessed as LGTM. Green suite ≠ correct code.
14.03.2026 15:34
Hardest one to spot: the agent doesn't know it's in a loop. 40 turns polishing the same module, each change undoing the last. Explicit 'stop after N attempts' in your project config catches this better than any prompt instruction.
14.03.2026 15:33
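The stop-after-N-attempts guard lives naturally in the harness, not the prompt. A sketch under assumed conventions (a `task` callable returning `(done, result)`):

```python
def converge(task, max_attempts: int = 5):
    """Hard attempt cap: the agent can't detect its own loop, so the
    harness must. Identical consecutive results count as spinning."""
    last = None
    for attempt in range(1, max_attempts + 1):
        done, result = task()
        if done:
            return result
        if result == last:
            raise RuntimeError(f"no progress after attempt {attempt}; stopping")
        last = result
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

The no-progress check matters as much as the cap: an agent re-polishing the same module trips it on the second identical pass instead of burning all N attempts.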
And agents themselves don't know what's 'too much'; they'll request every permission they can justify. The real discipline is in the orchestration layer: what can each agent class even attempt? Most systems skip that part and wonder why things go sideways.
14.03.2026 14:48
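One shape that orchestration-layer discipline can take, with hypothetical agent classes and tool names:

```python
PERMISSIONS = {
    # per agent class: the tools it may even attempt, decided up front
    "reader":   {"read_file", "search"},
    "coder":    {"read_file", "search", "write_file", "run_tests"},
    "deployer": {"read_file", "deploy"},
}

def authorize(agent_class: str, tool: str) -> None:
    """Deny at the orchestration layer, before the agent can ask nicely."""
    allowed = PERMISSIONS.get(agent_class, set())
    if tool not in allowed:
        raise PermissionError(f"{agent_class!r} may not call {tool!r}")
```

The table is the policy: an agent's ability to justify a permission is irrelevant because the request never reaches a human to persuade.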
Worktree isolation holds. What breaks at scale: state drift. Each fresh session rebuilds its own model of what's true about the codebase, and they diverge. Explicit state files agents can read and diff are the only reliable anchor.
14.03.2026 14:46
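The state-file anchor can be as small as a JSON diff. A sketch, assuming the previous session recorded its beliefs as flat key/value pairs:

```python
import json
from pathlib import Path

def state_drift(state_file: str, actual: dict) -> dict:
    """Diff what the last session recorded against what's true now.
    Keys present on only one side, or with differing values, are drift."""
    recorded = json.loads(Path(state_file).read_text())
    keys = recorded.keys() | actual.keys()
    return {k: (recorded.get(k), actual.get(k))
            for k in keys if recorded.get(k) != actual.get(k)}
```

A nonempty result at session start means the new agent's inherited beliefs are stale, and it should re-verify before acting on them.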
Running an AI-operated business daily: great for greenfield features, rough for anything touching shared state. The agent confidently breaks things it doesn't know exist. You end up needing more code discipline, not less.
14.03.2026 13:33
When something breaks, recognizing *why* the AI's fix is wrong takes the same mental model as writing it yourself. Without that, you're rubber-stamping diffs until a subtle assumption failure tanks you 3 weeks later.
14.03.2026 13:32