loop-harness-kit

Loop · Harness · Environment Engineering

Make your LLM agent measurably better by engineering the loop around it.

The model is the engine; the harness is the car. Most of the variance in agent performance on real, multi-step work comes from the scaffolding: what context the agent sees, what tools it has, how its loop terminates, and how it recovers. loop-harness-kit is a point-your-LLM-at-it reference that encodes those levers, grounded in the 2026 research canon, with a UML diagram for every paper behind it.

29 papers, each with UML
3 engineering layers
+20 ranking positions, harness-only
weekly self-improvement cycle

Three layers determine whether an agent succeeds

They are usually conflated. Keeping them separate is the first move of the discipline — and the spine of this kit.

Environment

The world the agent acts in

Filesystem, tests, runtimes, services: the ground-truth signal. Make state legible and feedback fast and honest. EurekAgent's claim: agent environment engineering is all you need for autonomous scientific discovery.

Harness

The runtime around the model

Four necessary elements: an agent loop, a tool interface, context management, control mechanisms. Harness-only changes move agents 20+ ranking positions.

Loop

The feedback cycle the model runs

Observe → plan → act → verify. Only add a loop when a single pass falls short and a grounded signal exists.

flowchart TB subgraph ENV["ENVIRONMENT — files · tests · runtimes · ground truth"] subgraph HAR["HARNESS — loop · tools · context · control"] subgraph LOOP["LOOP — observe → plan → act → verify"] M(("MODEL")) end end end style ENV fill:#0d1f12,stroke:#3fb950,color:#3fb950 style HAR fill:#0d1626,stroke:#6ea8fe,color:#6ea8fe style LOOP fill:#160d26,stroke:#a78bfa,color:#a78bfa style M fill:#1a1d24,stroke:#e6edf3,color:#e6edf3

The thesis, in one line

Harness setup alone can swing benchmarks 5+ points and move agents 20+ ranking positions with no model swap. When an agent underperforms, reach for the harness first — tighten the loop, curate the context, wire in a real verification signal, constrain the tool space, re-anchor to intent.

Research library

The canon this kit is built on — each paper distilled to its core mechanism as a UML-style diagram. Filter by category, or browse all nine.

Harness foundations

What a harness is, how to define it, and how to architect it for reliability.

What makes a harness a harness

A constitutive definition: four necessary & sufficient elements. Miss one and it's a generator, guardrail, or tool wrapper.

flowchart LR M(("Model")) --> L["Agent loop"] L --> T["Tool interface"] T --> C["Context mgmt"] C --> K["Control"] K -. necessary + sufficient .-> H((("Harness")))

Takeaway: use the four-element test to tell a real harness from a wrapper.
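The four-element test can be made concrete with a toy harness. This is a minimal sketch, not the paper's implementation: the `Harness` class, the `"FINAL "` reply convention, and `scripted_model` (a stand-in for a real LLM call) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    model: Callable[[list[str]], str]        # hypothetical: context in, reply out
    tools: dict[str, Callable[[str], str]]   # element 2: tool interface
    max_steps: int = 5                       # element 4: control (step budget)
    max_context: int = 8                     # element 3: context management
    context: list[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        self.context = [f"task: {task}"]
        for _ in range(self.max_steps):      # element 1: agent loop
            reply = self.model(self.context)
            if reply.startswith("FINAL "):
                return reply[len("FINAL "):]
            name, _sep, arg = reply.partition(" ")
            if name in self.tools:
                observation = self.tools[name](arg)
            else:                            # control: unknown tools never run
                observation = f"error: unknown tool {name!r}"
            self.context.append(f"{reply} -> {observation}")
            # Keep the task line plus the most recent entries.
            keep = max(self.max_context - 1, 0)
            tail = self.context[1:]
            self.context = self.context[:1] + (tail[-keep:] if keep else [])
        return "stopped: step budget exhausted"


def scripted_model(context: list[str]) -> str:
    """Stand-in for a real LLM call: use one tool, then answer."""
    if len(context) == 1:
        return "add 2,3"
    return "FINAL " + context[-1].split(" -> ")[1]


def add(arg: str) -> str:
    a, b = arg.split(",")
    return str(int(a) + int(b))


result = Harness(model=scripted_model, tools={"add": add}).run("sum two numbers")
```

Remove any one of the four marked elements and what remains is a generator, a guardrail, or a tool wrapper rather than a harness.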

Harness Engineering for Language Agents

Formalizes the harness layer as three faculties: Control, Agency, and Runtime.

flowchart TD A["Language agent"] --> HL{"Harness layer"} HL --> Ctrl["Control<br/>(what it may do)"] HL --> Agcy["Agency<br/>(what it decides)"] HL --> Rt["Runtime<br/>(where it runs)"]

Takeaway: a vocabulary for reasoning about harness design as a layer, not glue code.

Building Effective AI Coding Agents for the Terminal

The OpenDev lessons: the first systematic practitioner paper on terminal-native harness design.

flowchart TD S(("start")) --> E["Eager construction<br/>(prebuild all components)"] E --> CM["Compound multi-model<br/>(exec · reason · critique · vision)"] CM --> DD["5-layer defense in depth"] DD --> SF["Schema-filtered planning subagents"] SF --> O((("reliable harness")))

Takeaway: enforce constraints via tool schema; prebuild to kill first-call latency.

Architectural Design Decisions in AI Agent Harnesses

An empirical study of 70 systems across five recurring dimensions → five patterns.

flowchart LR D1["Subagent arch"] --> P{"5 patterns"} D2["Context mgmt"] --> P D3["Tool systems"] --> P D4["Safety"] --> P D5["Orchestration"] --> P

Takeaway: turns harness choice into a reasoned comparison of trade-offs.

Code as Agent Harness

Plan-Execute-Verify, where verification is a graded decision, not pass/fail.

stateDiagram-v2 [*] --> Plan Plan --> Execute: sandboxed + permissioned Execute --> Verify: deterministic sensors Verify --> Accept Verify --> Revise Verify --> Escalate Verify --> Rollback Revise --> Execute Rollback --> Plan Accept --> [*]

Takeaway: a loop that can roll back contains the blast radius.
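A graded verification decision might look like the sketch below. The four verdicts and the Revise/Rollback transitions come from the diagram above; the pass-ratio thresholds, the `touched_protected` flag, and routing Escalate to a `"human"` state are hypothetical choices made for illustration.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REVISE = "revise"
    ESCALATE = "escalate"
    ROLLBACK = "rollback"


# Where the loop goes next. "human" for ESCALATE is an assumption;
# the diagram leaves the escalation target open.
NEXT_STATE = {
    Verdict.ACCEPT: "done",
    Verdict.REVISE: "execute",
    Verdict.ROLLBACK: "plan",
    Verdict.ESCALATE: "human",
}


def verify(tests_passed: int, tests_total: int, touched_protected: bool) -> Verdict:
    """Map deterministic sensor readings to a graded verdict."""
    if touched_protected or tests_total == 0:
        return Verdict.ESCALATE          # no safe automatic decision
    ratio = tests_passed / tests_total
    if ratio == 1.0:
        return Verdict.ACCEPT
    if ratio >= 0.5:                     # hypothetical threshold
        return Verdict.REVISE            # close enough to patch in place
    return Verdict.ROLLBACK              # far off: discard and replan
```

The point of the grading is that a mostly-working change and a badly broken one take different paths, which a plain pass/fail check cannot express.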

Environment engineering

Shaping the world the agent acts in — often the dominant lever for autonomous work.

Environment · arXiv:2606.13662

EurekAgent: Environment Engineering for Discovery

"Agent environment engineering is all you need for autonomous scientific discovery."

flowchart TD H["Hypothesis"] --> Env["Engineered environment<br/>legible state + fast feedback"] Env --> Exp["Experiment / tool action"] Exp --> Obs["Ground-truth observation"] Obs --> Q{"discovery?"} Q -->|no| H Q -->|yes| D((("discovery")))

Takeaway: make state legible and feedback honest before reaching for a bigger model.

Agent loop & execution

The heartbeat of every harness — and the subtle semantics that make or break it.

ReAct: Reasoning + Acting

The Thought / Action / Observation interleave underlying nearly every agent harness.

sequenceDiagram participant A as Agent participant T as Tool / Env loop until answer A->>A: Thought A->>T: Action T-->>A: Observation end

Takeaway: reason, act, observe, repeat — don't plan blind.

Agents Learn Their Runtime

Interpreter-state persistence is a learned semantic — mismatch it and you pay.

flowchart TD M["Model expectation"] --> R{"runtime persistence?"} R -->|expects state, none| E1["~80% missing-var errors"] R -->|expects fresh, persists| E2["~3.5x recompute overhead"] R -->|matched| OK(("correct"))

Takeaway: honor the persistence mode the model was trained to expect.

Real-Time Deadlines & Temporal Awareness

Temporal awareness is orthogonal to reasoning — it must be supplied to the loop.

flowchart LR T["Task + deadline"] --> L["Agent loop"] L --> Q{"temporal context<br/>injected?"} Q -->|no| F["misses deadline"] Q -->|"yes: time, budget"| W(("meets deadline"))

Takeaway: inject current time, deadlines, and budgets as harness context.
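One way to supply that context is to render a small block on every loop turn and prepend it to the prompt. The function name `temporal_context` and the exact wording of the block are assumptions; only the idea of injecting time, deadline, and budget comes from the text above.

```python
from datetime import datetime, timedelta, timezone


def temporal_context(now: datetime, deadline: datetime,
                     steps_used: int, step_budget: int) -> str:
    """Render time and budget facts the model cannot infer on its own."""
    seconds_left = max(0, int((deadline - now).total_seconds()))
    steps_left = max(0, step_budget - steps_used)
    return "\n".join([
        f"Current time (UTC): {now.isoformat()}",
        f"Deadline (UTC): {deadline.isoformat()}",
        f"Time remaining: {seconds_left} s",
        f"Steps remaining: {steps_left} of {step_budget}",
    ])


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
block = temporal_context(now, now + timedelta(minutes=5), steps_used=3, step_budget=10)
```

Recomputing the block each turn matters: a timestamp captured once at session start goes stale as the loop runs.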

A Scheduler-Theoretic Framework

~60% of 70 projects use the Agent Loop; five execution patterns under one scheduler.

flowchart TD Sch["Unified scheduler"] --> AL["Agent loop"] Sch --> EV["Event-driven"] Sch --> SM["State machine"] Sch --> GF["Graph / flow"] Sch --> HY["Hybrid"]

Takeaway: choose a loop architecture; don't default to the simplest one.

The Design Space of AI Agent Systems

Reverse-engineers Claude Code: five-stage progressive compaction under context pressure.

flowchart LR a["Budget<br/>reduction"] --> b["Snip"] --> c["Micro-<br/>compact"] --> d["Context<br/>collapse"] --> e["Auto-<br/>compact"]

Takeaway: context pressure escalates in stages — design for each.

Self-improvement & evals

How a harness gets better over time — the flywheel this very repo runs on.

Self-improve · arXiv:2604.25850

Agentic Harness Engineering

Observability-driven automatic evolution: every edit is a falsifiable contract.

flowchart TD C["Component obs."] --> Ed["Propose edit<br/>+ prediction"] X["Experience obs."] --> Ed Ed --> Run["Next round"] Run --> Dec{"Decision obs.<br/>prediction held?"} Dec -->|yes| Keep["keep"] Dec -->|no| Rev["revert"] Keep --> C Rev --> C

Takeaway: 69.7%→77.0% on Terminal-Bench 2 — structure transfers, prose doesn't.
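The keep-or-revert contract can be sketched as follows. This is an illustration of the idea, not the paper's system: the `Edit` dataclass, the `predicted_gain` field, and the toy `evaluate` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Edit:
    description: str
    apply: Callable[[dict], dict]    # returns a modified copy of the config
    predicted_gain: float            # the falsifiable part: minimum score gain


def evolve(config: dict, edits: list[Edit],
           evaluate: Callable[[dict], float]) -> tuple[dict, list[tuple[str, str]]]:
    """Apply each edit, keep it only if its own prediction held."""
    score = evaluate(config)
    log = []
    for edit in edits:
        candidate = edit.apply(dict(config))
        new_score = evaluate(candidate)
        held = (new_score - score) >= edit.predicted_gain
        if held:
            config, score = candidate, new_score
        log.append((edit.description, "keep" if held else "revert"))
    return config, log


def evaluate(cfg: dict) -> float:
    """Toy stand-in for a benchmark run."""
    return 0.5 + 0.1 * cfg.get("verify", 0) - 0.05 * cfg.get("verbose", 0)


edits = [
    Edit("add verification step", lambda c: {**c, "verify": 1}, 0.05),
    Edit("more verbose prompt", lambda c: {**c, "verbose": 1}, 0.01),
]
final_config, log = evolve({}, edits, evaluate)
```

Because every edit carries its own prediction, a change that merely fails to hurt is still reverted if it did not deliver what it claimed.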

Self-improve · arXiv:2605.27922

Harness-Bench

A score measures what the harness enables, not just what the model can infer.

flowchart LR Mdl["Model inference"] --> Sc((("Score"))) Hns["Harness:<br/>observe · modify ·<br/>recover · verify"] --> Sc

Takeaway: the binding-constraint thesis, made measurable.

Self-improve · arXiv:2603.27355

LLM Readiness Harness

Eval gates that block deployment, observability, and CI integration for agents.

flowchart TD B["Build"] --> Ev{"eval gate"} Ev -->|fail| Blk["block deploy"] Ev -->|pass| Dep["deploy"] Dep --> Obs["observability"] Obs --> B

Takeaway: evals are regression gates, not afterthoughts.
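A regression gate of this kind can be very small. The sketch below assumes eval results arrive as a metric-to-score mapping; the function name `eval_gate` and the metric names are hypothetical.

```python
def eval_gate(results: dict[str, float],
              minimums: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failures); any unmet or missing metric blocks deploy."""
    failures = []
    for metric, minimum in minimums.items():
        score = results.get(metric)
        if score is None:
            failures.append(f"{metric}: not reported")
        elif score < minimum:
            failures.append(f"{metric}: {score} < {minimum}")
    return (not failures, failures)


passed, failures = eval_gate(
    {"task_success": 0.72, "safety": 1.0},
    {"task_success": 0.75, "safety": 1.0, "latency_ok": 0.9},
)
```

In CI, a caller would typically exit non-zero when `passed` is false so the pipeline stops before the deploy step. Treating a missing metric as a failure keeps a silently skipped eval from passing the gate.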

Tool design

The model's UX — and learning-based ways to constrain and shape it.

Tool design · arXiv:2603.01714

TopoCurate: Interaction Topology

Learns how tools chain and branch — topology — as a first-class training signal.

flowchart LR Exp["Expert trajectories"] --> Topo["Learn tool-use topology"] Topo --> Gen(("Generalize to<br/>novel tool combos"))

Takeaway: tool topology, not just tool availability, determines success.

Tool design · arXiv:2603.03329

AutoHarness: Synthesize a Code Harness

Auto-generates runtime constraint guards from tool schemas and task specs.

flowchart TD Sch["Tool schemas + task spec"] --> Syn["Synthesize code harness"] Syn --> G["Runtime constraint guards"] G --> NI(("eliminate illegal moves"))

Takeaway: shift constraints from static schema checks to synthesized code guards.
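As a rough illustration of a synthesized guard, the sketch below builds a checking function from a JSON-Schema-like tool schema plus per-parameter task rules. It is hand-written rather than auto-generated, and `synthesize_guard`, `task_rules`, and the `move` tool are hypothetical names, not AutoHarness's API.

```python
from typing import Callable

JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def synthesize_guard(schema: dict,
                     task_rules: dict[str, Callable[[object], bool]]):
    """Build guard(args) -> list of violations from a schema and task rules."""
    props = schema.get("properties", {})
    required = schema.get("required", [])

    def guard(args: dict) -> list[str]:
        errors = [f"missing: {key}" for key in required if key not in args]
        for key, value in args.items():
            spec = props.get(key)
            if spec is None:
                errors.append(f"unexpected: {key}")
                continue
            # bool is a subclass of int in Python, so check it explicitly.
            bool_mismatch = isinstance(value, bool) != (spec["type"] == "boolean")
            if bool_mismatch or not isinstance(value, JSON_TYPES[spec["type"]]):
                errors.append(f"wrong type: {key}")
            elif "enum" in spec and value not in spec["enum"]:
                errors.append(f"not allowed: {key}={value!r}")
            elif key in task_rules and not task_rules[key](value):
                errors.append(f"task rule violated: {key}")
        return errors

    return guard


move_schema = {
    "properties": {
        "direction": {"type": "string", "enum": ["up", "down", "left", "right"]},
        "steps": {"type": "integer"},
    },
    "required": ["direction", "steps"],
}
guard = synthesize_guard(move_schema, {"steps": lambda n: 1 <= n <= 3})
```

A harness would call `guard(args)` before dispatching the tool and return any violations to the model as an observation instead of executing the call.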

Planning & orchestration

Separating planning from execution, and matching topology to the task.

Plan-and-Act

A planner and an executor, specialized independently, with replanning on divergence.

flowchart LR Goal["Goal"] --> Pl["Planner"] Pl --> St["Steps"] St --> Ex["Executor"] Ex --> En["Environment"] Ex -->|replan| Pl

Takeaway: different model sizes & budgets for planner vs. executor.

Task-Decoupled Planning (TDP)

Supervisor → dependency graph → decoupled nodes → self-revision of the graph.

flowchart TD Sup["Supervisor decomposes"] --> G["Dependency graph"] G --> N1["Node: plan + execute"] G --> N2["Node: plan + execute"] N1 --> SR["Self-revision"] N2 --> SR SR --> G

Takeaway: decoupling enables localized replanning without cascading failure.

AdaptOrch: Task-Adaptive Orchestration

Selects orchestration topology from the task's dependency graph — +12–23% over model choice.

flowchart TD Task["Task"] --> DG["Dependency graph"] DG --> Sel{"select topology"} Sel --> Par["Parallel"] Sel --> Seq["Sequential"] Sel --> Hier["Hierarchical"] Sel --> Hyb["Hybrid"]

Takeaway: topology is a harness-level lever, often bigger than model choice.
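To make "select topology from the dependency graph" tangible, here is a deliberately simple heuristic. The classification rules are an assumption made for illustration and are not AdaptOrch's selection algorithm; only the four topology names come from the diagram above.

```python
from graphlib import TopologicalSorter


def select_topology(deps: dict[str, set[str]]) -> str:
    """deps maps each subtask to the set of subtasks it depends on."""
    TopologicalSorter(deps).prepare()        # raises CycleError on cyclic input

    # Count how many subtasks depend on each task.
    dependents = {task: 0 for task in deps}
    for parents in deps.values():
        for parent in parents:
            dependents[parent] = dependents.get(parent, 0) + 1

    roots = [task for task in dependents if not deps.get(task)]
    if len(roots) == len(dependents):
        return "parallel"                    # no edges: everything is independent

    single_parent = all(len(parents) <= 1 for parents in deps.values())
    single_child = all(count <= 1 for count in dependents.values())
    if single_parent and len(roots) == 1:
        # One root and at most one parent each: a chain or a tree.
        return "sequential" if single_child else "hierarchical"
    return "hybrid"                          # fan-in, or several roots with edges
```

For example, `{"a": set(), "b": {"a"}, "c": {"b"}}` is a chain and maps to `"sequential"`, while a task that waits on two independent parents falls through to `"hybrid"`.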

Context engineering

Context is a finite, curated resource — agent-controlled, retrieval-as-tool, rubric-pruned.

Active Context Compression

A Focus Agent decides when to consolidate history and prune raw context.

flowchart TD Hist["Interaction history"] --> FA{"Focus Agent"} FA -->|consolidate| K["Knowledge block"] FA -->|prune| Dr["raw context dropped"]

Takeaway: ~22% token reduction, no accuracy loss — model-controlled, semantically coherent.

A-RAG: Retrieval as Tools

Retrieval becomes a tool call in the loop, not an upfront preprocessing dump.

flowchart LR L["Agent loop"] --> KW["keyword search"] L --> SS["semantic search"] L --> CR["chunk read"] KW --> L SS --> L CR --> L

Takeaway: pull information incrementally so reasoning can narrow scope.

Context Pruning via Multi-Rubric Reasoning

Two interpretable rubrics decide retention instead of one collapsed score.

flowchart TD Ctx["Code context"] --> R1["Semantic evidence"] Ctx --> R2["Dependency support"] R1 --> Q{"retain?"} R2 --> Q

Takeaway: −31% tokens, +3.5 Exact Match — coding context needs domain rubrics.
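The difference between two separate rubrics and one collapsed score shows up in a small example. The sketch assumes each chunk already carries a score per rubric; the `Chunk` fields, the thresholds, and the retain-if-either-rubric-passes rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    semantic_evidence: float     # 0..1: does it talk about the task?
    dependency_support: float    # 0..1: does the target code depend on it?


def prune_multi_rubric(chunks: list[Chunk], minimum: float = 0.6) -> list[str]:
    """Retain a chunk if either rubric, judged on its own, supports it."""
    return [c.name for c in chunks
            if c.semantic_evidence >= minimum or c.dependency_support >= minimum]


def prune_collapsed(chunks: list[Chunk], minimum: float = 0.6) -> list[str]:
    """Contrast case: one averaged score hides which rubric fired."""
    return [c.name for c in chunks
            if (c.semantic_evidence + c.dependency_support) / 2 >= minimum]


chunks = [
    Chunk("parser.py", 0.9, 0.4),
    Chunk("types.py", 0.1, 0.9),      # textually unrelated, but imported
    Chunk("CHANGELOG.md", 0.2, 0.1),
]
kept = prune_multi_rubric(chunks)
kept_collapsed = prune_collapsed(chunks)
```

Here the averaged score drops `types.py` even though the code depends on it, which is the kind of loss a dependency rubric is meant to prevent.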

Permissions & safety

Pre-action authorization and the failure modes a harness must contain.

Open Agent Passport (OAP)

Deterministic pre-action authorization with a signed audit record — median ~53ms.

flowchart TD Call["Tool call"] --> OAP{"OAP policy<br/>check ~53ms"} OAP -->|allow| Ex["execute + signed audit"] OAP -->|deny| Blk(("blocked"))

Takeaway: 0% attack success under a restrictive policy vs. 74.6% under a permissive one.
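The shape of a deterministic pre-action check with a signed audit record can be sketched with the standard library. This is not the OAP specification or its record format: the allow-list policy, the record fields, and the use of HMAC-SHA256 as the signature are all assumptions made for illustration.

```python
import hashlib
import hmac
import json


def _sign(body: dict, key: bytes) -> str:
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def authorize(call: dict, policy: dict, key: bytes) -> dict:
    """Decide before the tool runs, and sign the decision for later audit."""
    allowed = call["tool"] in policy["allowed_tools"]
    record = {
        "tool": call["tool"],
        "args": call["args"],
        "decision": "allow" if allowed else "deny",
    }
    record["signature"] = _sign(record, key)   # signs the three fields above
    return record


def verify_record(record: dict, key: bytes) -> bool:
    """Check that an audit record was not altered after signing."""
    body = {k: v for k, v in record.items() if k != "signature"}
    return hmac.compare_digest(_sign(body, key), record["signature"])


key = b"demo-key-not-for-production"
policy = {"allowed_tools": {"read_file"}}
ok = authorize({"tool": "read_file", "args": {"path": "notes.txt"}}, policy, key)
denied = authorize({"tool": "shell", "args": {"cmd": "rm -rf /"}}, policy, key)
```

The harness executes the tool only when `decision` is `"allow"`. Because the check is a pure function of the call and the policy, the same call always gets the same answer.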

Loop-pattern foundations

The classic feedback loops in the visual catalog — the vocabulary of iteration.

Self-Consistency

Sample multiple reasoning paths, take the majority answer (+17.9% GSM8K).

flowchart TD P["Prompt (CoT)"] --> S1["Path 1"] P --> S2["Path 2"] P --> Sn["Path N"] S1 --> V["Majority vote"] S2 --> V Sn --> V V --> An(("answer"))

Takeaway: parallel sampling + aggregation filters one-off reasoning errors.
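The aggregation step is a majority vote over sampled final answers. In this sketch `sample` is a hypothetical callable standing in for one temperature-sampled chain-of-thought run that returns only the final answer.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample: Callable[[str], str],
                           prompt: str, n: int = 5) -> tuple[str, float]:
    """Sample n answers and return the most common one with its vote share."""
    answers = [sample(prompt) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n


# Scripted stand-in for five sampled reasoning paths.
scripted = iter(["18", "18", "20", "18", "17"])
answer, agreement = self_consistent_answer(lambda prompt: next(scripted), "question", n=5)
```

The vote share is a useful by-product: low agreement is a cheap signal that the question may need a stronger loop than sampling alone.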

Tree of Thoughts

Branch, self-evaluate, backtrack (Game-of-24: 4% → 74%).

flowchart TD R["Root"] --> A["Thought A"] R --> B["Thought B"] A --> EA{"score"} B --> EB{"score"} EA -->|prune| R EB -->|expand| B1["Thought B.1"] B1 --> Q{"solved?"} Q -->|no| R Q -->|yes| E(("solution"))

Takeaway: deliberate lookahead + backtracking over coherent thoughts.

Least-to-Most Prompting

Decompose into ordered subproblems; solve sequentially, feeding answers forward.

flowchart LR P["Problem"] --> D["Decompose<br/>ordered subqs"] D --> S["Solve i using 1..i-1"] S --> Q{"more?"} Q -->|yes| S Q -->|no| F(("answer"))

Takeaway: separating decomposition from solving cuts error propagation.

Reflexion (Verbal RL)

After a failed attempt, write a verbal reflection and retry with it in context.

stateDiagram-v2 [*] --> Attempt Attempt --> Check Check --> Done: pass Check --> Reflect: fail Reflect --> Attempt: verbal lesson Done --> [*]

Takeaway: converts a failure signal into a textual gradient for the next try.
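The state diagram above reduces to a short loop. In this sketch `attempt`, `check`, and `reflect` are hypothetical callables (in practice two LLM calls and a test runner), and the `max_tries` cap is an added assumption so the loop always terminates.

```python
from typing import Callable, Optional


def reflexion(attempt: Callable[[list[str]], str],
              check: Callable[[str], tuple[bool, str]],
              reflect: Callable[[str, str], str],
              max_tries: int = 3) -> tuple[Optional[str], list[str]]:
    """Retry with accumulated verbal lessons until the check passes."""
    lessons: list[str] = []
    for _ in range(max_tries):
        output = attempt(lessons)             # lessons go into the context
        passed, feedback = check(output)
        if passed:
            return output, lessons
        lessons.append(reflect(output, feedback))
    return None, lessons                      # budget exhausted


# Scripted stand-ins: the attempt only succeeds once it has a lesson.
output, lessons = reflexion(
    attempt=lambda lessons: "sorted(xs)" if lessons else "xs",
    check=lambda out: (out == "sorted(xs)", "output is not sorted"),
    reflect=lambda out, feedback: f"lesson: {feedback}",
)
```

The loop is only as good as `check`: with a grounded signal such as a test suite, the lessons describe real failures rather than guesses.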

OPRO: LLMs as Optimizers

Optimize the prompt itself against a metric using past scores as the gradient.

flowchart TD M["Meta-prompt<br/>+ past scores"] --> G["Generate new prompt"] G --> Ev["Evaluate on metric"] Ev --> M Ev --> Bs(("best prompt"))

Takeaway: treat the prompt as a parameter to optimize, not a one-off.

DSPy: Compiling Declarative Pipelines

Declare the program; compile and optimize it against an eval metric.

flowchart LR Pr["Declarative program"] --> Co["Compile / optimize"] Co --> Me["Eval metric"] Me --> Co Co --> Op(("optimized pipeline"))

Takeaway: program with optimizable modules, not hand-tuned prompt strings.

The visual loop catalog

Beyond the papers, the kit ships a hand-authored catalog of feedback loops — Self-Consistency, ReAct, Reflexion, Orchestrator-Worker and more — each with a rendered UML view and a selection procedure for picking the right one.

Pick the right loop before you run one

A loop is only worth its cost if it adds a grounded feedback signal: environment ground truth (tests, tools) > retrieved evidence > an independent critic > a rubric > bare self-critique. The catalog walks you from situation → pattern → UML view.
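The signal ranking above can be encoded as a small lookup. The ordering is taken from the text; treating every signal except bare self-critique as "grounded", and the function names, are assumptions made for this sketch.

```python
from typing import Optional

# Strongest first, in the order given in the text.
SIGNALS = [
    "environment ground truth",
    "retrieved evidence",
    "independent critic",
    "rubric",
    "bare self-critique",
]
GROUNDED = set(SIGNALS[:-1])    # assumption: self-critique alone is not grounded


def strongest_signal(available: set[str]) -> Optional[str]:
    """Return the highest-ranked feedback signal that is available."""
    for signal in SIGNALS:
        if signal in available:
            return signal
    return None


def loop_is_worth_it(single_pass_ok: bool, available: set[str]) -> bool:
    """Add a loop only if one pass falls short and a grounded signal exists."""
    return (not single_pass_ok) and strongest_signal(available) in GROUNDED
```

For instance, a coding task with failing tests and a rubric available should loop on the tests, since `strongest_signal({"rubric", "environment ground truth"})` returns `"environment ground truth"`.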

→ open reference/ai-loops-reference.html

Point your LLM at it

The whole repo is built to be read by an agent. Drop one instruction into your session:

Read AGENTS.md and the docs/ folder at github.com/satoshigreek/loop-harness-kit, then follow its loop-selection procedure and failure-mode checklist for this session.

This site & repo improve themselves

A weekly cycle scans authoritative lab sources (Anthropic, OpenAI, Google, Microsoft, Meta) and new research, gates candidates through freshness/authority evals, and opens a PR with ranked, source-cited improvements — the same self-improvement flywheel diagrammed above, applied to the kit itself.