Initiated by Dr. Xin Wei, University of Michigan
Ongoing development by the community
Rolling AI Engineering Notes

From Prompting to Agent Skills, Harnesses, and Loops: A Practical Path to Reliable AI Agents in 2026

June 16, 2026 | Rolling update | AI Insights

Over the last two years, agent development has shifted from "make the demo work once" to "make the system work repeatedly under messy real-world conditions." Early workflows depended heavily on prompt engineering. That remains useful, but it is no longer sufficient for production-grade agents. Reliable systems now need explicit context management, modular expert capabilities, runtime guardrails, structured validation, observability, and recoverable execution loops.

This matters directly for long-horizon knowledge-work and operations workflows. These systems often need to extract heterogeneous evidence from documents, tables, archives, databases, and tool outputs; reconcile conflicting records; track changes over time; and preserve enough provenance for review. A single prompt cannot carry that burden. A system can.

Problem: Agents fail when tasks outlive a single context window.
Pattern: Skills, harnesses, and loops become infrastructure.
Use Case: Long-running evidence synthesis.
Outcome: Traceable, recoverable, reviewable workflows.
[Figure: Conceptual illustration of AI agents evolving from prompts to skills, harnesses, and loops.]
Caption: Reliable agents need more than better prompts: they need reusable skills, runtime controls, and recoverable execution loops.

1. The Stack: From Instructions to an Agent Operating System

A useful way to think about modern AI agents is as five nested layers. Each layer solves a different failure mode, and each becomes more important as tasks become longer, more stateful, and less forgiving.

The useful distinction is not "prompt versus agent." It is which part of the system owns intent, evidence, reusable procedure, runtime control, and execution state.

Prompt Engineering. The instruction layer: role, task framing, examples, output schema, and failure-aware wording.
Context Engineering. The information layer: retrieval, memory, tool outputs, prior artifacts, and structured state.
Agent Skills. The modular expertise layer: reusable folders of domain procedures, scripts, templates, and references.
Harness Engineering. The runtime control layer: tests, tools, guardrails, state, recovery, logs, and review sensors.
Loop Engineering. The execution layer: explicit reason-act-observe cycles, checkpoints, routing, and human review nodes.

Prompt Engineering

Prompting is still the foundation. It defines the task, the expected output, the constraints, and the level of reasoning required. The more mature pattern, however, is no longer one-shot prompt craft. It is evaluation-driven iteration: write the prompt, run representative cases, inspect failures, revise the prompt or surrounding system, and test again. Self-refinement loops can help, but only when grounded in concrete evaluation signals rather than generic self-critique.
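The evaluate-and-revise cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `EvalCase`, `run_eval`, and `stub_model` are hypothetical names, and the stub stands in for a real model call so the example stays runnable.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    """One representative input plus a concrete pass/fail check (hypothetical type)."""
    name: str
    user_input: str
    check: Callable[[str], bool]

def run_eval(prompt, model, cases):
    """Run every case against one prompt and return the names of failing cases."""
    failures = []
    for case in cases:
        output = model(prompt, case.user_input)
        if not case.check(output):
            failures.append(case.name)
    return failures

def stub_model(prompt: str, user_input: str) -> str:
    """Stand-in for a real model call: emits JSON only when the prompt asks for it."""
    if "JSON" in prompt:
        return '{"answer": "' + user_input.upper() + '"}'
    return user_input.upper()

cases = [
    EvalCase("returns_json", "abc", lambda out: out.startswith("{") and out.endswith("}")),
    EvalCase("keeps_content", "abc", lambda out: "ABC" in out),
]

# Version 1 of the prompt fails one case; the failure points at what to revise.
v1_failures = run_eval("Uppercase the input.", stub_model, cases)
# Version 2 adds the missing output constraint and is re-tested on the same cases.
v2_failures = run_eval("Uppercase the input. Reply as JSON.", stub_model, cases)
print(v1_failures, v2_failures)
```

The point of the design is that each revision is judged by the same concrete checks, not by the model's opinion of its own output.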

Context Engineering

Context engineering manages what the model sees before it acts. That includes RAG, memory, tool outputs, user preferences, source documents, database rows, prior decisions, and intermediate artifacts. Its central question is not "How much can I fit into the context window?" but "What is the minimum reliable evidence the agent needs at this point in the workflow?"
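One way to make the "minimum reliable evidence" question concrete is to select context per step under an explicit budget. The sketch below is an assumption-laden illustration: the `Evidence` type, the `select_context` helper, the relevance scores, and the use of word counts as a stand-in for tokens are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    source: str
    text: str
    relevance: float  # assumed to come from a retriever or ranker, 0.0 to 1.0

def select_context(items, min_relevance, word_budget):
    """Pick the most relevant items that clear a threshold and fit the budget."""
    chosen = []
    used = 0
    for item in sorted(items, key=lambda e: e.relevance, reverse=True):
        if item.relevance < min_relevance:
            break  # everything after this point is less relevant still
        cost = len(item.text.split())  # crude proxy for token count
        if used + cost > word_budget:
            continue  # skip items that do not fit, keep looking for smaller ones
        chosen.append(item)
        used += cost
    return chosen

items = [
    Evidence("report.pdf", "Plant output fell in 2021.", 0.9),
    Evidence("background.md", "A long background section that does not fit the remaining budget.", 0.8),
    Evidence("table_3.csv", "Output 2021: 410 units", 0.7),
    Evidence("notes.txt", "Unrelated note.", 0.2),
]

selected = select_context(items, min_relevance=0.5, word_budget=9)
print([e.source for e in selected])
```

Keeping the source name on every item means whatever reaches the model can still be traced back during review.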

Agent Skills

Skills turn repeated expertise into reusable modules. Anthropic introduced Agent Skills in October 2025 as folders of instructions, scripts, and resources that agents can discover and load dynamically, and later published the format as an open standard for cross-platform portability. In the open Agent Skills specification, a skill is a folder with a required SKILL.md file containing metadata and task instructions, plus optional scripts, references, assets, and templates.

Open Agent Skills specification (https://agentskills.io/): the format reference for portable skills, including the folder structure and the role of SKILL.md.

Harness Engineering

A harness is everything around the model that makes the agent dependable: tools, tests, state, checkpoints, coding conventions, validation scripts, review agents, logs, and recovery paths. Martin Fowler's framing is useful: a good harness provides both feedforward controls that steer the agent before it acts and feedback controls that let it self-correct after it acts.

Loop Engineering

The execution loop is where the agent actually lives. ReAct-style reasoning, tool use, observation, and revision remain the core pattern, but production systems should not rely on an unbounded "while true" loop. Frameworks such as LangGraph model workflows as explicit state graphs with durable execution, persistence, human-in-the-loop control, streaming, and observability.

2. Agent Skills: Modularizing Expert Knowledge

The most important shift in the skills pattern is separation of concerns. Instead of embedding every domain rule into a single prompt, we package expert knowledge into portable folders. The agent sees only skill metadata at startup. If the task matches the skill, it loads the full instructions and any additional files it needs. This is progressive disclosure: small context footprint by default, deeper procedural knowledge on demand.
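Progressive disclosure can be illustrated with a small Python sketch. It assumes a skill folder containing a SKILL.md whose frontmatter carries `name` and `description`; the helper names (`read_skill`, `discover`, `load_if_relevant`), the example skill, and the keyword matcher are hypothetical, and the frontmatter is parsed as flat "key: value" lines instead of with a real YAML parser.

```python
import tempfile
from pathlib import Path
from typing import Optional

def read_skill(path: Path):
    """Split a SKILL.md file into (metadata dict, instruction body)."""
    text = path.read_text(encoding="utf-8")
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

def discover(skills_dir: Path) -> dict:
    """Startup step: index metadata only, keep instruction bodies out of context."""
    index = {}
    for skill_file in sorted(skills_dir.glob("*/SKILL.md")):
        meta, _ = read_skill(skill_file)
        index[meta["name"]] = {"description": meta["description"], "path": skill_file}
    return index

def load_if_relevant(index: dict, task: str) -> Optional[str]:
    """On-demand step: load the full body of the first matching skill.

    Keyword overlap is a toy matcher used only for this sketch; in practice
    the model decides from the descriptions.
    """
    task_words = set(task.lower().split())
    for entry in index.values():
        if task_words & set(entry["description"].lower().split()):
            _, body = read_skill(entry["path"])
            return body
    return None

with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "dedupe-records"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        "---\nname: dedupe-records\n"
        "description: Detect duplicate records in tabular evidence\n---\n"
        "1. Normalize identifiers.\n2. Compare rows pairwise.\n",
        encoding="utf-8",
    )
    index = discover(Path(tmp))
    startup_view = {name: entry["description"] for name, entry in index.items()}
    matched = load_if_relevant(index, "find duplicate rows")
    unmatched = load_if_relevant(index, "summarize meeting notes")

print(startup_view)
print(matched)
```

Only `startup_view` would sit in context by default; the numbered procedure is read from disk when a task calls for it.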

Why this matters: In a data workflow, the same agent may need one procedure for parsing source documents, another for normalizing tabular evidence, another for identifying duplicate records, and another for writing audit-ready provenance notes. Skills make those procedures versionable, testable, and reusable.

The open ecosystem is moving quickly. Useful starting points include:

  • VoltAgent/awesome-agent-skills, a curated collection of official and community skills with more than 1,000 entries and compatibility notes for tools such as Claude Code, Codex, Gemini CLI, Cursor, and others.
  • heilcheng/awesome-agent-skills, a multilingual community index with English, Simplified Chinese, Traditional Chinese, Japanese, Korean, and Spanish documentation.
  • addyosmani/agent-skills, a production-oriented engineering workflow repository organized around specification, planning, building, testing, review, and shipping.
  • mattpocock/skills, a compact set of real engineering skills designed for everyday agent-assisted software work.
  • skills.sh, a live skills directory and leaderboard with installation-oriented discovery.
  • agentskill.sh, a large searchable skills marketplace and directory.

A typical install command follows the pattern:

npx skills@latest add VoltAgent/awesome-agent-skills

3. Harness Engineering: The Agent's Runtime Control System

Harness engineering is the layer that converts an impressive model into a useful system. Anthropic's article on long-running agents is especially practical: it describes a two-part pattern with an initializer agent that creates durable project artifacts and a coding agent that makes incremental progress while leaving structured state for future sessions.

The important idea is not limited to coding. For evidence-heavy extraction work, the same pattern becomes:

  1. Create an initializer stage that defines the schema, evidence rules, validation criteria, source hierarchy, and known ambiguity classes.
  2. Run extraction agents in small increments, each writing structured outputs and provenance records.
  3. Validate every batch with deterministic checks first: schema validity, units, currency years, duplicate identifiers, impossible dates, and missing evidence links.
  4. Use LLM-based review only where semantic judgment is actually needed: conflict resolution, event matching, uncertainty notes, and edge cases.
  5. Persist progress so the next run can resume without reconstructing the entire reasoning chain.
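Step 3 of the list above, deterministic checks before any LLM review, might look like the following Python sketch. The field names, the unit whitelist, and the `validate_batch` helper are assumptions made for illustration, not a schema taken from any real project.

```python
from datetime import date

REQUIRED_FIELDS = ("record_id", "value", "unit", "event_date", "evidence_link")
ALLOWED_UNITS = {"kg", "t"}  # hypothetical whitelist for this example

def validate_batch(records, today):
    """Return {record_id: [problems]} for records that fail deterministic checks."""
    problems = {}
    seen_ids = set()
    for i, rec in enumerate(records):
        rid = str(rec.get("record_id") or f"row-{i}")
        found = [f"missing {field}" for field in REQUIRED_FIELDS
                 if rec.get(field) in (None, "")]
        if rid in seen_ids:
            found.append("duplicate record_id")
        seen_ids.add(rid)
        unit = rec.get("unit")
        if unit not in (None, "") and unit not in ALLOWED_UNITS:
            found.append(f"unknown unit {unit}")
        event_date = rec.get("event_date")
        if event_date:
            try:
                if date.fromisoformat(event_date) > today:
                    found.append("event_date in the future")
            except ValueError:
                found.append("impossible event_date")
        if found:
            problems.setdefault(rid, []).extend(found)
    return problems

records = [
    {"record_id": "r1", "value": 12.5, "unit": "kg", "event_date": "2024-03-01", "evidence_link": "doc-7#p3"},
    {"record_id": "r1", "value": 12.5, "unit": "kg", "event_date": "2024-03-01", "evidence_link": "doc-7#p3"},
    {"record_id": "r2", "value": 3, "unit": "lbs", "event_date": "2023-02-30", "evidence_link": ""},
    {"record_id": "r3", "value": 0, "unit": "t", "event_date": "2024-06-30", "evidence_link": "doc-9#t2"},
]

problems = validate_batch(records, today=date(2025, 1, 1))
# Only ids with no deterministic problem move on to semantic (LLM or human) review.
clean_ids = [r["record_id"] for r in records if r["record_id"] not in problems]
print(problems)
print(clean_ids)
```

Because these checks are deterministic, the same batch always fails for the same reasons, which keeps the more expensive and less predictable review steps focused on genuinely ambiguous cases.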

In this framing, the harness is less like a wrapper and more like an operating discipline. It is where we place the parts that must be stable even when individual model calls are stochastic.

4. Loop Engineering: Making Execution Observable and Recoverable

A reliable agent loop needs visible state, explicit transitions, and stopping conditions. The basic loop is still reason, act, observe, and revise. The production version adds routing, budget controls, retries with failure classes, checkpoints, human review gates, and termination rules.
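A minimal Python sketch of such a loop follows, with a step budget, bounded retries for one failure class, a checkpoint after every step, and explicit termination states. The names (`run_loop`, `TransientError`, the JSON checkpoint layout) are hypothetical, and the steps are plain functions standing in for model or tool calls.

```python
import json
import tempfile
from pathlib import Path

class TransientError(Exception):
    """Failure class worth retrying, for example a timeout."""

def run_loop(steps, checkpoint: Path, max_steps: int = 10, max_retries: int = 2) -> dict:
    """Run steps in order; resume from the checkpoint file if one exists."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"next": 0, "results": []}
    state["status"] = "running"
    executed = 0
    while state["next"] < len(steps):
        if executed >= max_steps:
            state["status"] = "budget_exhausted"  # termination rule: budget
            break
        step = steps[state["next"]]
        for attempt in range(max_retries + 1):
            try:
                result = step()
                break
            except TransientError:
                # Only this failure class is retried; other exceptions propagate.
                if attempt == max_retries:
                    state["status"] = "failed"
                    checkpoint.write_text(json.dumps(state))
                    return state
        state["results"].append(result)
        state["next"] += 1
        executed += 1
        checkpoint.write_text(json.dumps(state))  # durable progress after each step
    else:
        state["status"] = "done"  # termination rule: all steps completed
    return state

calls = {"fetch": 0}

def fetch():
    calls["fetch"] += 1
    if calls["fetch"] == 1:
        raise TransientError("timeout")
    return "fetched"

def parse():
    return "parsed"

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "state.json"
    first = run_loop([fetch, parse], ckpt, max_steps=1)   # stops on budget
    second = run_loop([fetch, parse], ckpt, max_steps=5)  # resumes, skips fetch

print(first["status"], second["status"], second["results"])
```

The second call does not repeat `fetch`, because the checkpoint already records it as complete.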

LangGraph is one useful expression of this idea. Its documentation emphasizes long-running, stateful agents with durable execution, persistence, human-in-the-loop oversight, and observability. That matters because complex workflows are rarely linear. An evidence-synthesis pipeline may need to branch when a source is incomplete, pause for human review when records disagree, and resume after new evidence appears.
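The branch, pause, and resume behavior can be shown with a tiny hand-rolled state graph in Python. This sketch deliberately does not use LangGraph's API; the node names, the `run_graph` helper, and the state keys are hypothetical and only illustrate the control flow.

```python
def extract(state):
    state["records"] = list(state["sources"])
    return "validate"

def validate(state):
    values = {r["value"] for r in state["records"]}
    # Branch: conflicting values go to human review unless a decision exists.
    if len(values) > 1 and "human_decision" not in state:
        return "human_review"
    return "publish"

def human_review(state):
    # Pause point: returning None halts the run until a decision is supplied.
    state["paused_at"] = "validate"
    return None

def publish(state):
    state["published"] = state.get("human_decision", state["records"][0]["value"])
    return "END"

NODES = {"extract": extract, "validate": validate,
         "human_review": human_review, "publish": publish}

def run_graph(state, start="extract", max_transitions=20):
    """Follow explicit transitions until END, a pause, or the transition limit."""
    node = start
    for _ in range(max_transitions):
        if node == "END":
            state["status"] = "done"
            return state
        state.setdefault("trace", []).append(node)  # visible execution history
        node = NODES[node](state)
        if node is None:
            state["status"] = "waiting_for_human"
            return state
    state["status"] = "transition_limit"
    return state

# First run: two sources disagree, so the graph pauses for review.
state = run_graph({"sources": [{"id": "a", "value": 10}, {"id": "b", "value": 12}]})
paused_status = state["status"]

# Later: a reviewer supplies a decision and the run resumes where it paused.
state["human_decision"] = 12
state = run_graph(state, start=state.pop("paused_at"))
print(paused_status, state["status"], state["published"], state["trace"])
```

Because the state is a plain dictionary, it could be persisted between the two calls, which is what lets a run wait for a reviewer without keeping a process alive.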

5. Implications for Data and Workflow Systems

For data-intensive work, the practical lesson is clear: do not ask a single prompt to behave like a full operating workflow. Build the workflow around the model.

  • Skills layer: package source-document parsing, unit normalization, taxonomy mapping, duplicate-record detection, temporal change detection, and provenance writing as separate reusable skills.
  • Harness layer: maintain cross-session state, enforce schema and provenance rules, run deterministic validation, and escalate only ambiguous cases to LLM review or human review.
  • Loop layer: use graph-based execution for extraction, validation, reconciliation, and publication, with checkpoints and review nodes.
  • Evaluation layer: combine deterministic tests with targeted LLM-as-judge rubrics and maintain gold cases for known hard examples.

This architecture is more work than prompt chaining, but it is also much closer to how durable workflow infrastructure should behave. It supports reproducibility, auditability, and long-term maintainability. It also makes failure informative: when something breaks, we can usually tell whether the problem is in the prompt, the context, the skill, the harness, or the loop.

Closing Thought

Skills, harnesses, and loops are not needed for every agent. They become necessary when the task has memory, repeated procedures, external evidence, partial failures, and outputs that must be checked later. In that setting, the prompt is only the entry point. The durable work is done by the surrounding system: what it loads, what it records, what it validates, and how it recovers when a step fails.
