Xin · AI Insights

AI agent engineering has moved beyond clever prompts. The practical question in 2026 is how to make agents keep their bearings across multi-step work: prompts set intent, context supplies working memory, skills package reusable know-how, harnesses run checks, and loops decide what happens next.


Rolling AI Engineering Notes

AI Agent Field Notes 2026: Prompts, Skills, Harnesses, Loops, and a Few Bedtime Reads

June 16, 2026 · Rolling update (updated July 4, 2026) · AI Insights

A demo proves that a model can do something once. An agent setup has to make the model do useful work again after tool calls, file edits, interruptions, and mistaken turns. Prompting still matters, but the prompt is only one part of the machine.

A practical split is: prompts say what to do, context keeps the relevant working memory, skills hold repeated procedures, harnesses run checks and keep state, and loops decide the next action. That vocabulary is useful because it makes failures easier to locate.

Where Agents Stumble: They can lose track after several steps or tool calls.

What Helps: Move repeated instructions into skills, checks, and loops.

Best Fit: Tasks that need files, tools, memory, or retries.

What Improves: The run is easier to pause, resume, inspect, and fix.

Figure: Conceptual illustration of AI agents evolving from prompts to skills, harnesses, and loops. Reliable agents need more than better prompts: they need reusable skills, checks around the model, and explicit execution loops.

1. The Stack: From Instructions to a Working Agent Setup

A useful way to understand modern AI agents is as five jobs that have to be handled somewhere. The longer the task, the more dangerous it is to leave all five jobs inside one prompt.

The distinction that matters is not "prompt versus agent." It is where the setup keeps intent, working context, reusable steps, checks, and progress.

Prompt Engineering is the instruction layer: role, task framing, examples, constraints, and failure-aware wording.

Context Engineering is the information layer: search results, memory, tool outputs, previous work, and saved state.

Agent Skills are the reusable know-how layer: folders with instructions, scripts, templates, and references.

Harness Engineering is the control layer around the model: tests, tools, saved state, recovery, logs, and status checks.

Loop Engineering is the execution layer: think, use a tool, read the result, then decide what comes next.

Prompt Engineering

Prompting is still the starting point. It states the task, the expected answer, the constraints, and a few examples. Strong prompts are tested on realistic cases, revised where they fail, and paired with checks. Asking the model to critique itself can help, but only when there is something concrete to check.

Context Engineering

Context engineering manages what the model sees before it acts. That includes retrieved notes, memory, tool outputs, user preferences, prior decisions, and files, including work in progress. Its central question is not "How much can fit into the context window?" but "What is the smallest useful context for the next move?"
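One way to make the "smallest useful context" question concrete is a budgeted selection step. The sketch below is only an illustration: the ContextItem type, the relevance scores, and the word-count token estimate are all assumptions, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    label: str        # where the item came from, e.g. "tool_output"
    text: str
    relevance: float  # hypothetical score in [0, 1], produced elsewhere


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly one token per whitespace-separated word.
    return len(text.split())


def select_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Greedily keep the most relevant items that fit in the token budget."""
    chosen: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen


items = [
    ContextItem("prior_decision", "use sqlite for the cache", 0.9),
    ContextItem("tool_output", "pytest: 3 passed, 1 failed in test_cache.py", 0.8),
    ContextItem("old_chat", "long unrelated discussion " * 50, 0.1),
]
selected = select_context(items, budget=20)
print([i.label for i in selected])  # ['prior_decision', 'tool_output']
```

The long, low-relevance chat history is dropped because it does not fit the budget, which is the behavior the question above is asking for. A real setup would likely use a proper tokenizer and a better relevance signal.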

Agent Skills

Skills turn repeated expertise into reusable folders. Anthropic introduced Agent Skills in October 2025 as folders of instructions, scripts, and resources that agents can load when a task calls for them. In the open Agent Skills specification, a skill has a required SKILL.md file with a short description and task instructions, plus optional scripts, references, assets, and templates.

Anthropic engineering note on Agent Skills

The primary source for understanding why skills are useful for giving agents reusable, task-specific capabilities outside a single prompt.

https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills

Open Agent Skills specification

The official format reference for skills, including the folder structure and the role of SKILL.md.

https://agentskills.io/

Harness Engineering

A harness is everything around the model that makes the agent less brittle: tools, tests, saved state, save points, coding conventions, logs, and recovery paths. Martin Fowler's framing is useful: good harnesses guide the agent before it acts and give it feedback after it acts.

Loop Engineering

The execution loop is where the agent actually lives. The basic pattern is simple: think, use a tool, read the result, revise the plan. Systems should not rely on an endless while-true loop with no exit condition. Frameworks such as LangGraph make the steps explicit, save progress, allow approval points, and make the run easier to inspect.

2. Agent Skills: Modularizing Expert Knowledge

The most important shift in the skills pattern is separation of concerns. Instead of stuffing every rule into one prompt, put a repeated procedure in its own folder. At startup, the agent sees only the skill name and short description. If the task matches the skill, it opens the full instructions and any extra files it needs.
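The load-on-demand idea can be sketched in a few lines. This is a minimal illustration, assuming each skill folder holds a SKILL.md that begins with a simple frontmatter block carrying name and description fields; the parser, the function names, and the demo skill are made up for the example, so check the specification for the exact format.

```python
import tempfile
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict[str, str]:
    """Parse simple 'key: value' lines between the leading '---' markers."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    meta: dict[str, str] = {}
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def index_skills(root: Path) -> dict[str, dict[str, str]]:
    """Startup step: collect only name and description for each skill folder."""
    index = {}
    for skill_md in sorted(root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        name = meta.get("name", skill_md.parent.name)
        index[name] = {"description": meta.get("description", ""), "path": str(skill_md)}
    return index


def load_skill(index: dict, name: str) -> str:
    """On demand: read the full instructions only for the matched skill."""
    return Path(index[name]["path"]).read_text(encoding="utf-8")


# Demo with a throwaway folder; the skill content is invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "release-notes"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        "---\nname: release-notes\ndescription: Draft release notes from merged PRs.\n---\n"
        "1. List merged PRs.\n2. Group by area.\n",
        encoding="utf-8",
    )
    index = index_skills(Path(tmp))
    summary = index["release-notes"]["description"]
    full_text = load_skill(index, "release-notes")
```

The index is cheap to keep in context because it holds one line per skill. The full instructions cost tokens only when a task actually matches.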

Why this matters: If the same instruction keeps reappearing across sessions, it probably belongs in a skill. One task, one maintained procedure, loaded only when it is actually relevant.

The open ecosystem is moving quickly. Useful starting points include:

VoltAgent/awesome-agent-skills, a curated collection of official and community skills, with notes on which tools can use them.
heilcheng/awesome-agent-skills, a multilingual community index with English, Simplified Chinese, Traditional Chinese, Japanese, Korean, and Spanish documentation.
addyosmani/agent-skills, an engineering-focused set of skills for planning, building, testing, review, and shipping.
mattpocock/skills, a compact set of real engineering skills designed for everyday agent-assisted software work.
skills.sh, a live skills directory and leaderboard for finding installable skills.
agentskill.sh, a large searchable skills marketplace and directory.

A typical install command follows the pattern:

npx skills@latest add VoltAgent/awesome-agent-skills

3. Harness Engineering: The Checks Around the Agent

Harness engineering is the layer that turns a strong model into a usable system. Anthropic's article on long-running agents is especially practical: one setup step writes the plan and progress files; the working agent then makes small changes and updates those files as it goes.

Anthropic engineering note on long-running harnesses

A practical discussion of how to keep agents productive across longer tasks by using setup agents, progress files, and recovery paths.

https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

Martin Fowler on harness engineering

A concise engineering framing for agent harnesses: how to guide the agent before an action and correct it after an action.

https://martinfowler.com/articles/harness-engineering.html

The important idea is not limited to coding. The same pattern is useful whenever an agent has to work through several steps:

Start with a short setup step that writes goals, constraints, open questions, and a clear done checklist.
Make small changes, then record what changed before moving on.
Run cheap checks first: tests, type checks, lint, link checks, file-existence checks, and format checks.
Use model critique only where judgment is actually needed: tradeoffs, ambiguity, unclear instructions, and edge cases.
Save progress somewhere stable so the next run does not have to rediscover the whole path.

In this framing, the harness is less like a wrapper and more like a set of working rules. It holds the parts that should stay reliable even when individual model calls vary.
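The "cheap checks first, model critique only where needed" rule from the list above can be sketched as follows. The check names, the draft dictionary, and the needs_model_critique helper are hypothetical; the point is only the ordering, not a specific tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_checks(checks: list[tuple[str, Callable[[], bool]]]) -> list[CheckResult]:
    """Run cheap deterministic checks in order; stop at the first failure."""
    results = []
    for name, check in checks:
        try:
            ok = check()
            detail = ""
        except Exception as exc:  # a crashing check counts as a failure
            ok, detail = False, repr(exc)
        results.append(CheckResult(name, ok, detail))
        if not ok:
            break
    return results


def needs_model_critique(results: list[CheckResult], ambiguous: bool) -> bool:
    """Only spend a model call when cheap checks pass and judgment is needed."""
    return all(r.passed for r in results) and ambiguous


# Hypothetical work product and checks, for illustration only.
draft = {"title": "Q3 report", "links": ["https://example.com"], "body": "..."}
checks = [
    ("has_title", lambda: bool(draft.get("title"))),
    ("links_are_https", lambda: all(u.startswith("https://") for u in draft["links"])),
    ("body_not_empty", lambda: len(draft["body"]) > 0),
]
results = run_checks(checks)
```

Stopping at the first failed check keeps feedback fast and specific: the agent gets one concrete thing to fix before any slower or more expensive review runs.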

4. Loop Engineering: Making Execution Explicit

A reliable agent loop needs visible state, clear next steps, and stopping conditions. The basic loop is still think, act, observe, and revise. The stronger version adds path choices, time or token limits, retry rules, save points, approval gates, and clear stop rules.
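A minimal sketch of such a loop, with a step limit, a retry rule, and named stop reasons. The decide and act callables are placeholders for a model call and a tool call; RunState and the specific limits are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    step: int = 0
    retries: int = 0
    history: list = field(default_factory=list)
    done: bool = False
    stop_reason: str = ""


def run_loop(decide, act, state: RunState, max_steps: int = 10, max_retries: int = 2):
    """Think, act, observe, revise, with explicit limits instead of an open loop."""
    while not state.done:
        if state.step >= max_steps:
            state.stop_reason = "step_limit"
            break
        action = decide(state)            # think: pick the next action
        if action == "stop":
            state.done, state.stop_reason = True, "finished"
            break
        try:
            observation = act(action)     # act: call a tool
            state.retries = 0
        except RuntimeError as exc:
            state.retries += 1
            observation = f"error: {exc}"
            if state.retries > max_retries:
                state.stop_reason = "retry_limit"
                break
        state.history.append((action, observation))  # observe: record the result
        state.step += 1
    return state


calls = {"n": 0}


def flaky_tool(action: str) -> str:
    # Hypothetical tool that fails on its first call, then succeeds.
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("timeout")
    return f"{action} ok"


def decide(state: RunState) -> str:
    # Stand-in for the model: stop after two successful observations.
    successes = [obs for _, obs in state.history if not obs.startswith("error")]
    return "stop" if len(successes) >= 2 else "search"


final = run_loop(decide, flaky_tool, RunState())
print(final.stop_reason, final.step)  # finished 3
```

Every exit path sets a stop_reason, so a run that ends early can be told apart from one that finished.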

LangGraph is one useful expression of this idea. Its documentation focuses on agents that keep state, save progress, pause for human input, show intermediate updates, and expose what happened during a run. That matters because real work is rarely linear: a run may need to branch when a tool fails, pause for approval, or resume later without starting from scratch.
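The pause-and-resume behavior can be illustrated without any framework. The sketch below uses a plain JSON file as the save point and is not LangGraph's API; the step names, the file layout, and the approval flag are all assumptions for the example.

```python
import json
import tempfile
from pathlib import Path

STEPS = ["collect_inputs", "draft", "await_approval", "publish"]  # hypothetical steps


def load_checkpoint(path: Path) -> dict:
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed": [], "approved": False}


def save_checkpoint(path: Path, state: dict) -> None:
    # Write to a temp file first so a crash cannot leave a half-written checkpoint.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    tmp.replace(path)


def run(path: Path) -> str:
    """Run remaining steps; pause at the approval gate until it is approved."""
    state = load_checkpoint(path)
    for step in STEPS:
        if step in state["completed"]:
            continue  # already done in an earlier run
        if step == "await_approval" and not state["approved"]:
            return "paused"
        state["completed"].append(step)
        save_checkpoint(path, state)
    return "finished"


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = Path(tmp_dir) / "run.json"
    first = run(ckpt)                      # stops at the approval gate
    state = load_checkpoint(ckpt)
    state["approved"] = True               # a human approves out of band
    save_checkpoint(ckpt, state)
    second = run(ckpt)                     # resumes without redoing earlier steps
    completed = load_checkpoint(ckpt)["completed"]

print(first, second)  # paused finished
```

The second call skips the two steps recorded in the checkpoint. A graph framework adds branching and richer run history on top of the same basic idea.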

LangGraph overview

A useful starting point for graph-based agent workflows where the steps, saved progress, and run history need to be visible.

https://docs.langchain.com/oss/python/langgraph/overview

LangGraph: saving progress

The relevant documentation for saving progress and resuming a run after interruption.

https://docs.langchain.com/oss/python/langgraph/persistence

ReAct paper

The original think-act-observe pattern: write a thought, take an action, read the result, then update the plan.

https://arxiv.org/abs/2210.03629

5. Practical Takeaways

The practical lesson is clear: do not ask a single prompt to manage the whole workflow. Build the workflow around the model.

Skills layer: put repeated procedures in small, named folders instead of rewriting them inside every prompt.
Harness layer: keep tests, logs, save points, permissions, cleanup, and recovery outside the model call.
Loop layer: make the next action explicit: think, call a tool, observe the result, then decide whether to continue, retry, ask, or stop.
Evaluation layer: keep representative tasks and failure notes; improve the system by rerunning the same cases.

This is more setup than simply connecting prompts together, but the payoff is practical: when something breaks, we can usually tell whether the problem is in the prompt, the context, the skill, the harness, or the loop.

6. Closing Thought

Skills, harnesses, and loops are overhead for tiny tasks. They pay off when an agent has to remember state, repeat a procedure, call tools, and recover from a bad step. The prompt starts the work; the surrounding system keeps it from drifting.

7. Tea-Break Papers: Agents in the Wild

Not a formal bibliography; more like a small reading shelf for tracking where agent evaluation is getting interesting. The common thread is simple: benchmarks are moving away from toy prompts and toward messy workspaces, real tool use, and multi-step jobs.

7.1 LLM/Agent-as-a-Judge

Judge models are useful when outputs are open-ended and exact-match metrics are too brittle. They are not magic graders. A good judge setup needs a rubric, calibration examples, bias checks, and repeatable validation against human preferences or human disagreement.
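Validation against human labels can start with two numbers: raw agreement and chance-corrected agreement. The sketch below computes both with Cohen's kappa on invented pairwise verdicts; the labels are illustrative data, not results from any of the papers listed here.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    n = len(human)
    p_o = agreement_rate(judge, human)
    jc, hc = Counter(judge), Counter(human)
    p_e = sum(jc[label] * hc[label] for label in set(jc) | set(hc)) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0


# Invented verdicts: which of two answers, "A" or "B", was preferred.
human = ["A", "A", "B", "B", "A", "B", "A", "B"]
judge = ["A", "A", "B", "A", "A", "B", "A", "A"]

print(agreement_rate(judge, human))  # 0.75
print(cohens_kappa(judge, human))    # 0.5
```

In this toy data the judge picks "A" six times out of eight, so a raw agreement of 0.75 overstates how well it tracks the human labels; kappa drops to 0.5 once chance agreement is removed. A skew like this is also a hint to run an order-swap check for position bias.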

The agent version raises the bar: instead of scoring only the final answer, an agentic judge can inspect evidence, run tools, compare intermediate steps, and check whether a claimed result is supported. That makes judging closer to a workflow than a single model call.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

The classic starting point: it studies GPT-4-style judges, known biases, MT-Bench, Chatbot Arena, and agreement with human preferences.

https://arxiv.org/abs/2306.05685

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

A broad map of why to use LLM judges, how to build them, where they are used, how to evaluate them, and where they fail.

https://arxiv.org/abs/2412.05579

A Survey on LLM-as-a-Judge

Useful for the reliability angle: consistency, bias mitigation, scenario adaptation, and how to validate judge systems rather than assuming they work.

https://www.sciencedirect.com/science/article/pii/S2666675825004564

Agent-as-a-Judge

The next step after single-pass LLM judging: planning, tool-supported verification, multi-agent collaboration, and memory for more inspectable evaluation. Companion resource list: ModalityDance/Awesome-Agent-as-a-Judge.

https://arxiv.org/abs/2601.05111

Validating LLM-as-a-Judge Systems under Rating Indeterminacy

A useful caution: when several ratings can be reasonable, forced-choice validation can make a judge look better or worse for the wrong reason.

https://neurips.cc/virtual/2025/loc/san-diego/poster/117308

7.2 Workspace and Runtime Benchmarks

These papers are useful because they move agent evaluation closer to ordinary work: many files, uncertain paths, command-line tools, and mistakes that appear before the final answer.

Workspace benchmark

Workspace-Bench 1.0

Evaluates agents on realistic workspace tasks with many files and file types. Useful because the agent has to find the right files before it can answer.

Runtime benchmark

WildClawBench

A benchmark that runs agents in real command-line environments. Useful because it tests the messy parts: tools, files, errors, and intermediate steps.

7.3 Harness and Long-Horizon Evaluation

These readings are about the setup around the model and the failure modes that only appear after several steps.

Harness effects

Harness-Bench

Asks a sharp question: how much of agent performance comes from the model, and how much comes from the setup around it?

Failure diagnosis

The Long-Horizon Task Mirage?

Looks at where multi-step agents break down and why step-by-step traces can reveal failures that final scores hide.

8. Blog/Gist Shelf: Posts Worth Keeping Open

These are less like citations and more like tabs worth keeping open while designing an agent setup. They are practical, opinionated, and good for calibrating taste.

How I am AI-Proofing my Career - Johnathan Bi



Karpathy's LLM Wiki

Effective Harnesses for Long-Running Agents

Build Agents That Run for Hours

Martin Fowler on Harness Engineering

Anthropic on Agent Skills