What makes a harness a harness
A constitutive definition: four necessary & sufficient elements. Miss one and it's a generator, guardrail, or tool wrapper.
Takeaway: use the four-element test to tell a real harness from a wrapper.
Loop · Harness · Environment Engineering
The model is the engine; the harness is the car. Most of the variance in agent performance on real, multi-step work comes from the scaffolding — what context it sees, what tools it has, how its loop terminates, how it recovers. loop-harness-kit is a point-your-LLM-at-it reference that encodes those levers, grounded in the 2026 research canon — with a UML diagram for every paper behind it.
Loop, harness, and environment are usually conflated. Keeping them separate is the first move of the discipline, and the spine of this kit.
Environment: filesystem, tests, runtimes, services. This is the ground-truth signal. Make state legible and feedback fast & honest. EurekAgent: environment engineering is all you need.
Harness: four necessary elements: an agent loop, a tool interface, context management, control mechanisms. Harness-only changes move agents 20+ ranking positions.
Loop: observe → plan → act → verify. Only add a loop when a single pass falls short and a grounded signal exists.
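The observe → plan → act → verify cycle can be sketched in a few lines. This is a minimal illustration, not the kit's implementation: `run_loop`, its four callables, and the toy counter environment are all hypothetical names chosen for the example.

```python
from typing import Any, Callable, Optional, Tuple


def run_loop(
    observe: Callable[[], Any],
    plan: Callable[[Any], Any],
    act: Callable[[Any], Any],
    verify: Callable[[Any], bool],
    max_steps: int = 5,
) -> Tuple[Optional[Any], int]:
    """Run observe -> plan -> act -> verify until verified or the step budget is spent."""
    for step in range(1, max_steps + 1):
        state = observe()        # read ground truth from the environment
        action = plan(state)     # decide the next move
        result = act(action)     # change the environment
        if verify(result):       # grounded check, not self-assessment
            return result, step
    return None, max_steps       # budget exhausted: explicit termination


# Toy environment (an assumption for illustration): a counter to raise to 3.
env = {"n": 0}


def bump(target: int) -> int:
    env["n"] = target
    return env["n"]


result, steps = run_loop(
    observe=lambda: env["n"],
    plan=lambda n: n + 1,
    act=bump,
    verify=lambda n: n >= 3,
)
```

The step budget is the termination rule here; a real harness would likely add time and token budgets alongside it.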
Harness setup alone can swing benchmarks 5+ points and move agents 20+ ranking positions with no model swap. When an agent underperforms, reach for the harness first — tighten the loop, curate the context, wire in a real verification signal, constrain the tool space, re-anchor to intent.
The canon this kit is built on — each paper distilled to its core mechanism as a UML-style diagram. Filter by category, or browse all nine.
What a harness is, how to define it, and how to architect it for reliability.
Formalizes the harness layer as three faculties: Control, Agency, and Runtime.
Takeaway: a vocabulary for reasoning about harness design as a layer, not glue code.
The OpenDev lessons: the first systematic practitioner paper on terminal-native harness design.
Takeaway: enforce constraints via tool schema; prebuild to kill first-call latency.
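One way to read "enforce constraints via tool schema" is to reject a malformed call before it ever reaches the tool. The sketch below assumes a hand-rolled schema format; `TOOL_SCHEMAS`, `validate_call`, and the `read_file` tool are hypothetical and not taken from the OpenDev paper.

```python
# Hypothetical schema: each argument declares a type and, optionally, an upper bound.
TOOL_SCHEMAS = {
    "read_file": {
        "path": {"type": str},
        "max_bytes": {"type": int, "max": 65536},
    },
}


def validate_call(tool, args, schemas=TOOL_SCHEMAS):
    """Return a list of violations; an empty list means the call may proceed."""
    if tool not in schemas:
        return [f"unknown tool: {tool}"]
    schema = schemas[tool]
    errors = [f"unexpected argument: {name}" for name in args if name not in schema]
    for name, rule in schema.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
            continue
        value = args[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
        elif "max" in rule and value > rule["max"]:
            errors.append(f"{name}: exceeds max {rule['max']}")
    return errors


ok_call = validate_call("read_file", {"path": "README.md", "max_bytes": 1024})
bad_call = validate_call("read_file", {"path": "README.md", "max_bytes": 10**9})
```

Returning the violations as text lets the harness feed them back to the model as an observation instead of failing silently.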
An empirical study of 70 systems across five recurring dimensions → five patterns.
Takeaway: turns harness choice into a reasoned comparison of trade-offs.
Plan-Execute-Verify, where verification is a graded decision, not pass/fail.
Takeaway: a loop that can roll back contains the blast radius.
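A graded verification step with rollback might look like the following. It is a sketch under stated assumptions: the three-way decision (commit, retry, rollback), the `accept` and `retry` thresholds, and the function name are illustrative choices, not the paper's method.

```python
import copy


def execute_with_rollback(state, steps, grade, accept=0.8, retry=0.5):
    """Apply each step to a copy of the state and grade the result.

    score >= accept -> commit the candidate state
    score >= retry  -> keep the old state and flag the step for retry
    otherwise       -> keep the old state and record a rollback
    """
    log = []
    for step in steps:
        candidate = step(copy.deepcopy(state))  # the live state is never mutated
        score = grade(candidate)
        if score >= accept:
            state = candidate
            log.append("commit")
        elif score >= retry:
            log.append("retry")
        else:
            log.append("rollback")
    return state, log
```

Because every step works on a copy, a bad step costs one discarded candidate rather than a corrupted workspace.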
Shaping the world the agent acts in — often the dominant lever for autonomous work.
"Agent environment engineering is all you need for autonomous scientific discovery."
Takeaway: make state legible and feedback honest before reaching for a bigger model.
The heartbeat of every harness — and the subtle semantics that make or break it.
The Thought / Action / Observation interleave underlying nearly every agent harness.
Takeaway: reason, act, observe, repeat — don't plan blind.
Interpreter-state persistence is a learned semantic — mismatch it and you pay.
Takeaway: honor the persistence mode the model was trained to expect.
Temporal awareness is orthogonal to reasoning — it must be supplied to the loop.
Takeaway: inject current time, deadlines, and budgets as harness context.
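Supplying time and budget to the loop can be as simple as rendering a small context block each turn. The field names and the `temporal_context` helper below are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def temporal_context(now, deadline, tokens_used, token_budget):
    """Render time, deadline, and budget as plain text the harness prepends to context."""
    minutes_left = max(0, int((deadline - now).total_seconds() // 60))
    tokens_left = max(0, token_budget - tokens_used)
    return (
        f"Current time (UTC): {now.isoformat()}\n"
        f"Deadline (UTC): {deadline.isoformat()}\n"
        f"Minutes remaining: {minutes_left}\n"
        f"Token budget remaining: {tokens_left}"
    )


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
header = temporal_context(now, now + timedelta(minutes=90), tokens_used=600, token_budget=1000)
```

In a live harness `now` would come from `datetime.now(timezone.utc)` on every iteration, so the model never reasons from a stale clock.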
~60% of 70 projects use the Agent Loop; five execution patterns under one scheduler.
Takeaway: choose a loop architecture; don't default to the simplest one.
Reverse-engineers Claude Code: five-stage progressive compaction under context pressure.
Takeaway: context pressure escalates in stages — design for each.
How a harness gets better over time — the flywheel this very repo runs on.
Observability-driven automatic evolution: every edit is a falsifiable contract.
Takeaway: 69.7%→77.0% on Terminal-Bench 2 — structure transfers, prose doesn't.
A score measures what the harness enables, not just what the model can infer.
Takeaway: the binding-constraint thesis, made measurable.
Eval gates that block deployment, observability, and CI integration for agents.
Takeaway: evals are regression gates, not afterthoughts.
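A regression gate reduces to comparing candidate scores against a stored baseline and refusing to ship on a drop. This sketch assumes higher-is-better metrics; `gate`, the metric names, and the `max_regression` tolerance are hypothetical.

```python
def gate(baseline, candidate, max_regression=0.01):
    """Return (ok, failures). Fails if a metric is missing or drops beyond tolerance."""
    failures = []
    for name, base in baseline.items():
        if name not in candidate:
            failures.append(f"{name}: missing from candidate run")
        elif candidate[name] < base - max_regression:
            failures.append(f"{name}: {candidate[name]:.3f} < baseline {base:.3f}")
    return (not failures, failures)


baseline = {"task_pass_rate": 0.75, "tool_call_validity": 0.98}
ok, failures = gate(baseline, {"task_pass_rate": 0.70, "tool_call_validity": 0.99})
```

In CI, a non-empty `failures` list would typically translate into a non-zero exit code so the deployment step never runs.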
The model's UX — and learning-based ways to constrain and shape it.
Learns how tools chain and branch — topology — as a first-class training signal.
Takeaway: tool topology, not just tool availability, determines success.
Auto-generates runtime constraint guards from tool schemas and task specs.
Takeaway: shift constraints from static schema checks to synthesized code guards.
Separating planning from execution, and matching topology to the task.
A planner and an executor, specialized independently, with replanning on divergence.
Takeaway: different model sizes & budgets for planner vs. executor.
Supervisor → dependency graph → decoupled nodes → self-revision of the graph.
Takeaway: decoupling enables localized replanning without cascading failure.
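Localized replanning falls out of the dependency graph: when a node fails, only that node and its downstream dependents need a new plan. The sketch uses the standard-library `graphlib`; the task names and the `affected_by` helper are illustrative assumptions.

```python
from graphlib import TopologicalSorter


def execution_order(deps):
    """deps maps each node to the set of nodes it depends on."""
    return list(TopologicalSorter(deps).static_order())


def affected_by(deps, failed):
    """Return the failed node plus every node downstream of it."""
    affected = {failed}
    changed = True
    while changed:  # propagate until no new dependents are found
        changed = False
        for node, parents in deps.items():
            if node not in affected and affected & set(parents):
                affected.add(node)
                changed = True
    return affected


deps = {
    "fetch": set(),
    "parse": {"fetch"},
    "summarize": {"parse"},
    "lint": {"fetch"},
}
order = execution_order(deps)
to_replan = affected_by(deps, "parse")
```

Here a failure in `parse` leaves `fetch` and `lint` untouched, which is the "no cascading failure" property in miniature.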
Selects orchestration topology from the task's dependency graph — +12–23% over model choice.
Takeaway: topology is a harness-level lever, often bigger than model choice.
Context is a finite, curated resource — agent-controlled, retrieval-as-tool, rubric-pruned.
A Focus Agent decides when to consolidate history and prune raw context.
Takeaway: ~22% token reduction, no accuracy loss — model-controlled, semantically coherent.
Retrieval becomes a tool call in the loop, not an upfront preprocessing dump.
Takeaway: pull information incrementally so reasoning can narrow scope.
Two interpretable rubrics decide retention instead of one collapsed score.
Takeaway: −31% tokens, +3.5 Exact Match — coding context needs domain rubrics.
Pre-action authorization and the failure modes a harness must contain.
Deterministic pre-action authorization with a signed audit record — median ~53ms.
Takeaway: 0% attack success under restrictive policy vs. 74.6% permissive.
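Deterministic pre-action authorization with a signed audit record can be approximated with an allowlist and an HMAC. This is a sketch only: the key, the allowlist, and the record layout are assumptions, and a real deployment would load the key from a secret store.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"                   # hypothetical key, for illustration only
ALLOWED = {"read_file", "run_tests"}   # restrictive policy: everything else is denied


def authorize(tool, args, key=SECRET, allowed=ALLOWED):
    """Decide before the action runs and return a signed audit record."""
    record = {
        "tool": tool,
        "args": args,
        "decision": "allow" if tool in allowed else "deny",
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record


def verify_record(record, key=SECRET):
    """Recompute the signature to detect tampering with the audit record."""
    body = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])


denied = authorize("delete_repo", {"path": "/"})
```

The decision depends only on the policy and the call, so the same request always yields the same verdict and the same signature.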
The classic feedback loops in the visual catalog — the vocabulary of iteration.
Sample multiple reasoning paths, take the majority answer (+17.9% GSM8K).
Takeaway: parallel sampling + aggregation filters one-off reasoning errors.
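Majority voting over sampled answers is short enough to show whole. In this sketch `sample` stands in for one model call that returns a final answer; the canned answers are made up for the example.

```python
from collections import Counter


def self_consistent_answer(sample, n=5):
    """Draw n answers and return the most common one with its vote share."""
    answers = [sample() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Stand-in for five independent reasoning paths (hypothetical outputs).
canned = iter(["42", "41", "42", "42", "7"])
answer, agreement = self_consistent_answer(lambda: next(canned), n=5)
```

The vote share doubles as a cheap confidence signal: a low share suggests the question deserves a stronger loop than sampling alone.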
Branch, self-evaluate, backtrack (Game-of-24: 4% → 74%).
Takeaway: deliberate lookahead + backtracking over coherent thoughts.
Decompose into ordered subproblems; solve sequentially, feeding answers forward.
Takeaway: separating decomposition from solving cuts error propagation.
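The feed-forward part of this pattern is a plain sequential fold. The sketch assumes the decomposition has already produced an ordered list of subproblems; `solve` is a hypothetical stand-in for a model call that sees earlier question/answer pairs.

```python
def least_to_most(subproblems, solve):
    """Solve subproblems in order, passing all earlier (question, answer) pairs forward."""
    solved = []
    for question in subproblems:
        answer = solve(question, list(solved))  # copy so solve cannot mutate history
        solved.append((question, answer))
    return solved


# Toy solver: each answer reports how many prior answers it was given.
trace = least_to_most(
    ["easiest part", "middle part", "full problem"],
    solve=lambda question, context: len(context),
)
```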
After a failed attempt, write a verbal reflection and retry with it in context.
Takeaway: converts a failure signal into a textual gradient for the next try.
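The reflect-and-retry cycle can be sketched as below. All three callables are hypothetical stand-ins: `attempt` and `reflect` would be model calls, and `evaluate` should be a grounded check such as a test run.

```python
def reflexion(attempt, evaluate, reflect, max_tries=3):
    """Retry a task, carrying verbal reflections from failed attempts forward."""
    reflections = []
    for _ in range(max_tries):
        output = attempt(list(reflections))   # earlier reflections are in context
        ok, feedback = evaluate(output)
        if ok:
            return output, reflections
        reflections.append(reflect(output, feedback))
    return None, reflections                  # give up after max_tries


# Toy task: the attempt only succeeds once it has one reflection to draw on.
output, notes = reflexion(
    attempt=lambda notes: "fixed" if notes else "buggy",
    evaluate=lambda out: (out == "fixed", "tests failed" if out != "fixed" else ""),
    reflect=lambda out, feedback: f"'{out}' was rejected: {feedback}",
)
```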
Optimize the prompt itself against a metric using past scores as the gradient.
Takeaway: treat the prompt as a parameter to optimize, not a one-off.
Declare the program; compile and optimize it against an eval metric.
Takeaway: program with optimizable modules, not hand-tuned prompt strings.
Beyond the papers, the kit ships a hand-authored catalog of feedback loops — Self-Consistency, ReAct, Reflexion, Orchestrator-Worker and more — each with a rendered UML view and a selection procedure for picking the right one.
A loop is only worth its cost if it adds a grounded feedback signal: environment ground truth (tests, tools) > retrieved evidence > an independent critic > a rubric > bare self-critique. The catalog walks you from situation → pattern → UML view.
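The signal ordering above can be encoded directly as a ranked list. The identifiers in `SIGNAL_RANK` and the `strongest_signal` helper are hypothetical labels for the five signal types named in the text.

```python
# Strongest first, mirroring the ordering in the text.
SIGNAL_RANK = [
    "environment",         # tests, tools: ground truth
    "retrieved_evidence",
    "independent_critic",
    "rubric",
    "self_critique",       # weakest: the model grading itself
]


def strongest_signal(available):
    """Return the highest-ranked signal present, or None if there is none."""
    for signal in SIGNAL_RANK:
        if signal in available:
            return signal
    return None


best = strongest_signal({"rubric", "retrieved_evidence"})
```

A `None` result is the case where a loop adds cost without any feedback signal to justify it.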
→ open reference/ai-loops-reference.html
The whole repo is built to be read by an agent. Drop one instruction into your session:
A weekly cycle scans authoritative lab sources (Anthropic, OpenAI, Google, Microsoft, Meta) and new research, gates candidates through freshness/authority evals, and opens a PR with ranked, source-cited improvements — the same self-improvement flywheel diagrammed above, applied to the kit itself.