# Homotopy, Not Ladder

Canonical: https://mosiah.org/articles/homotopy-not-ladder/
Interactive: https://mosiah.org/#Articles%2Fhomotopy-not-ladder

//Related:// [[sources|Article Sources/homotopy-not-ladder]] · [[notes|Article Notes/homotopy-not-ladder]] · [[metadata|Article Metadata/homotopy-not-ladder]] · [[Published Pieces]]

! Homotopy, Not Ladder

//How to make long-running agents act less like checklist followers and more like inference-time optimizers//

The central mistake in long-running agent work is treating the model like a symbolic employee executing instructions.

That mistake produces procedural prompts: do this, then that, then that. First make a basic prototype. Then make it more realistic. Then add tests. Then integrate. Then polish. This feels responsible because it is explicit. It feels like management. It feels like planning.

It is often exactly wrong.

A frontier language model is not a person following a recipe. It is a neural network whose fixed weights implement a vast collection of inference-time circuits. Those circuits can use context, examples, feedback, traces, and local structure to infer a task and adapt behavior without changing the model’s stored parameters. The model does not need a checklist. It needs a world in which better is locally visible.

The agent does not need a plan first. It needs a run geometry.

A plan says: walk this path. A run geometry says: here is how to know uphill.

That distinction is not cosmetic. It is the difference between an agent completing visible steps and an agent optimizing the thing you actually care about.

The better structure is homotopy, not ladder. Do not ask the model to solve a toy problem, then a less toy problem, then a real problem. Give it one real problem continuously deformed from low resolution to high resolution. Same object. Same topology. Same invariant set. Same verification semantics. Increasing complexity by parameter, not by changing worlds.

A ladder says: first solve deterministic mock, then randomized mock, then live version. A homotopy says: this is one system parameterized by λ. At λ = 0, the system is low-resolution but real. At λ = 1, the system is production-complex. As λ increases, preserve the invariants.

That is the shape long-running agents need.

The default agentic coding curve is familiar. The first hour feels magical. The model reads files, finds obvious issues, writes plausible patches, adds tests, cleans up syntax, fixes errors, explains itself. Then the curve bends. It starts adding fake abstractions, passing tests for the wrong reasons, duplicating state, bypassing architecture, weakening verification, or narrating progress that is not actually present in the system.

The common explanation is that models are not intelligent enough. Sometimes that is true. More often, the problem is that the task was presented in the wrong geometry. The model was asked to follow steps, so it followed steps. It was asked to add tests, so it added test-shaped objects. It was asked to make mocks, so it made a mock world where success could be achieved without preserving the real causal path.

This is reward hacking at the prompt level. The model optimizes the visible proxy because the prompt gave it no better local ordering.

A goal says what you want. A value criterion says how to tell whether you are getting it.

A mission statement says, “make the system robust.” A value criterion says: no task transition without exactly one event-log edge; no provider may bypass the scheduler; every external call must be represented in the trace; retries must occur only under bounded retry policy; a run is better if invariant violations decrease without increasing bypass surfaces, hidden state, or unverifiable behavior.

The second formulation is not merely more detailed. It is more learnable. It defines an error field.

A run geometry defines the local shape of improvement during a long-running agent process. It names the real artifact, ideal state, invariant set, value criterion, homotopy parameter, verifier, anti-Goodhart constraints, and stopping conditions. The plan is downstream of the run geometry. The plan can change. The geometry should remain stable unless the task itself is discovered to be wrong.

There are three common prompt types: checklist, functional spec, and gradientized task. The checklist produces procedural compliance. The functional spec gives a destination without a dense field of feedback. The gradientized task says: improve this artifact according to this value criterion, while preserving these invariants, using these observables, avoiding these known reward hacks, and escalating when local orderability breaks down.

A gradientized task does not require a perfect scalar loss. It requires local stochastic orderability. Given two nearby candidate changes, the system should usually be able to say which one is better with respect to the value criterion. Not always. Not globally. Not with mathematical certainty. But better than chance. That is enough for hill-climbing, ensembles, rollback, retries, and semantic patch selection.

The verifier is not merely a test suite. It is the ordering functional for the run. It maps candidate states, diffs, traces, or trajectories into judgments: better, worse, invalid, suspicious, unresolved. A verifier is good when it fails fake implementations. It is weak when it rewards artifacts that merely look like progress. The question is not “do the tests pass?” The question is: would the verifier catch the most tempting lie?

Long-running agents need explicit anti-Goodhart constraints because they are excellent at local satisfaction: no fake provider path not used in production, no weakening tests to pass, no hidden mutable state, no hardcoded green path, no mock that cannot be continuously deformed into the real system, no summary claiming behavior that is not trace-observable.

“No funny business” is not a vibe. It is a topological rule.

Simplify by reducing resolution, not by replacing reality.

A transformer can use context as an implicit learning surface. During inference, the model’s stored weights do not change, but the context induces hidden states, attention patterns, task representations, and action probabilities. The prompt, repo, tool outputs, traces, compiler errors, diffs, logs, and verifier results become a local training environment. They shape the model’s next actions.

You are not asking the model to execute your algorithm. You are defining the landscape in which it can discover one.

For a 15-minute run, ordinary prompting can work. For an 8-hour run, the plan becomes stale. Tool outputs reshape the problem. Early abstractions become commitments. Some successes are real; some are fake. The branch may become 90% correct: too valuable to discard, too dangerous to trust.

The antidote is to make long runs produce locally orderable artifacts and salvageable semantic patches. A long run should explore in a capsule, discover useful structure, and then distill its work: what invariant improved, what topology was preserved, what semantic patches are promotable, what shortcuts were avoided, what remains unverified, what branch should be discarded.

A long-running agent should be allowed to learn. If the prompt is too rigid, it misses discoveries. If it is too open, it drifts. Separate target, invariant, and tactics. Tactics can change inside the run. Targets can be branched or reparameterized. Invariants can only be challenged by escalation.

The human advantage moves upstream. The human chooses the manifold. The human defines the invariant. The human detects Goodharting. The human preserves taste. The human decides when the goal itself is wrong.

Long-running agents do not fail mainly because they lack intelligence. They fail because we give them discontinuous objectives. We ask a differentiable inference-time optimizer to follow a discrete ritual.

The better strategy is to compile human intent into a continuous problem geometry: one real artifact, preserved invariants, dense evaluative feedback, smooth deformation from low to high resolution, and explicit penalties for proxy wins.

That is gradientized prompting.

Not checklist. Not ladder. Not “MVP, then harder MVP, then real thing.”

Homotopy, not ladder.
