Skills Without Evals Are Just Optimism

Every organization has people who know how things actually get done. Not the official process. The real one.

They know which edge case the template doesn’t handle. Which stakeholder needs to be looped in first. Which step everyone quietly skips, and why. This is tribal knowledge, and it lives in people’s heads until they leave.

In theory, a SKILL.md that captures the right sequence, the right tools, the right judgment calls gives you repeatable expert behavior at scale. In practice, most teams ship the skill and call it done.

That’s the mistake.

Write the eval first

The conventional workflow: write the skill, then write evals to check it worked. This is backwards.

Writing evaluations first inverts the usual dynamic. You’re forced to be specific about what success looks like before you’ve written a single instruction. And specificity early is the difference between a skill that does what you intended and one that does what you wrote.

Perplexity treats evaluation as Step 0 in skill development, not a final quality check. [3] The evaluation cases should be sourced from three places:

  • Production queries: real user requests, as written, not cleaned up
  • Known failure cases: the reasons the skill was needed in the first place
  • Boundary confusion scenarios: cases where a similar-but-different skill might load instead

That third category is easy to skip and among the most valuable. If you have both a “vendor response” skill and a “procurement inquiry” skill, you need eval cases that confirm each one loads for its own queries and not for the other’s. Writing these boundary cases before building the skill forces you to make the distinction explicit: in the skill description, in the routing logic, everywhere.

Negative examples matter as much as positive ones. An eval that only tests “the skill works when it should load” leaves half the failure space untested. A skill that fires when it shouldn’t is burning context and potentially doing the wrong thing with confidence.
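One lightweight way to keep the three sources and the negative cases explicit is to tag each case as you record it. A minimal sketch using only the standard library; the field names are illustrative, not a prescribed schema:

from dataclasses import dataclass

@dataclass
class EvalCase:
    query: str          # the request, verbatim from production where possible
    skill: str          # the skill under test
    should_load: bool   # False for boundary-confusion and other negative cases
    source: str         # "production" | "failure" | "boundary"
    expected: str = ""  # expected output or behavior, if the case also tests execution

CASES = [
    EvalCase("draft a response to this vendor email", "vendor-response", True, "production"),
    EvalCase("we received a new procurement request", "vendor-response", False, "boundary"),
]

The source tag earns its keep later, when you want to know whether your boundary cases all came from a single incident or actually cover the confusable neighbors.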

A five-minute skill PR almost certainly produces a subpar skill. [3] The eval cases are where the time goes, and that time spent upfront is cheaper than discovering the failure mode in production.

Run the ablation: does the skill earn its place?

Before writing a single instruction, run your eval cases against the base model with no skill loaded.

This is the ablation baseline. If the model already handles the task at 80% accuracy and your skill gets to 83%, you have a skill that costs tokens on every session, requires ongoing maintenance, adds routing complexity, and delivers marginal lift. That skill probably shouldn’t exist.

Ablation runs reveal two things. First, whether the skill is needed at all. Models are trained on a wide range of tasks, and many of the tasks teams reach for skills to encode are ones frontier models already handle well without special context. Sequential git operations, standard document formatting, common API patterns. These often don’t need a skill. [3] The cost of a skill that wraps knowledge the model already has isn’t zero: it’s the index tokens paid on every session, the loaded context paid when it fires, and the maintenance burden paid indefinitely.

Second, ablation runs identify which part of the skill is doing the actual work. Run the eval with the full skill, then strip it down section by section. If removing the examples drops performance by 15 points but removing the step-by-step instructions drops it by 2, you know where to invest. A lot of skill content is padding around a small number of high-signal lines. Ablation makes this visible instead of leaving it guesswork.

from statistics import mean

# `agent.run` and `grade` are project-specific helpers: run a query with a given
# set of skills, and score the result against the expected answer (0.0 to 1.0).
def run_ablation(eval_cases, skill_name):
    baseline, with_skill = [], []
    for case in eval_cases:
        baseline.append(grade(agent.run(case.query, skills=[]), case.expected))
        with_skill.append(grade(agent.run(case.query, skills=[skill_name]), case.expected))

    lift = mean(with_skill) - mean(baseline)
    print(f"Baseline:   {mean(baseline):.0%}")
    print(f"With skill: {mean(with_skill):.0%}")
    print(f"Lift:       {lift:+.0%}")
    return lift  # if the lift is small, question whether the skill should exist
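The same idea drops down to section level. Building on the helpers in the sketch above, and assuming a hypothetical load_skill_variant that registers a temporary copy of the skill with one named section removed:

def run_section_ablation(eval_cases, skill_name, sections):
    full = mean(grade(agent.run(c.query, skills=[skill_name]), c.expected) for c in eval_cases)
    for section in sections:
        variant = load_skill_variant(skill_name, drop=section)  # hypothetical helper
        score = mean(grade(agent.run(c.query, skills=[variant]), c.expected) for c in eval_cases)
        print(f"without {section!r}: {score:.0%} ({score - full:+.0%} vs. full skill)")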

The honest conclusion of a well-run ablation is sometimes that the skill should not be built. The tribal knowledge you thought needed encoding was already in the model’s weights. The gap wasn’t real, or wasn’t large enough to justify the infrastructure. That’s a good outcome: a skill that doesn’t get written is a skill that doesn’t rot.

How harnesses load skills, and what breaks with too many

To write good routing evals, you need to understand what you’re testing against. The loading mechanics are similar across the major harnesses, and the failure mode in each is identical.

All of them follow progressive disclosure. A compact representation of every installed skill (name and description) lives permanently in context. The full SKILL.md content loads only when a skill is actually invoked. This keeps the per-session baseline cost manageable, but it doesn’t make it zero.

Claude Code caps the combined description + when_to_use text at 1,536 characters per skill, and the total budget for all skill descriptions is roughly 1% of the context window, around 8,000 characters on most models. [4] When the budget fills, individual descriptions get shortened first, the routing keywords disappear, and the skill becomes effectively invisible to the routing decision. Push further and some skills are omitted from the listing entirely, with an explicit warning. Once a session compacts to free context, invoked skills are re-attached, but each is capped at 5,000 tokens, all invoked skills share a combined 25,000-token budget, and older skills can be dropped silently as newer ones consume the allocation. [4]

Codex scans .agents/skills from the current working directory up to the repository root, then $HOME/.agents/skills and /etc/codex/skills. Its initial skills roster is capped at the same 8,000 characters. When many skills compete for that space, descriptions are shortened first; beyond that, skills are omitted with a warning. [5]

Hermes Agent from Nous Research maintains four registries (built-in, optional, community, and platform-sourced) and lists 684 skills in its hub. [6] The same tradeoff applies: every skill in the registry costs index space whether it’s ever used or not.

Claw Code, the open-source Rust reimplementation of Claude Code’s architecture, mirrors these patterns, as expected from a clean-room rewrite of the same system.

The shared failure mode is silent truncation. You add a new skill, descriptions get shortened across the board, a routing trigger phrase disappears mid-sentence, and a skill that worked fine yesterday starts failing its routing evals today, not because you changed it but because the budget is now tighter. This is the mechanism behind “action at a distance,” and it’s why routing regression tests need to run after every addition to the skill library, not just when you modify a skill directly.

The practical ceiling depends on description length, but at the 50-word descriptions Perplexity recommends, the 8,000-character budget accommodates roughly 20-25 skills before truncation begins. The ablation discipline is directly relevant here: every skill that doesn’t earn its place consumes index budget that a needed skill could use.
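A cheap guard against silent truncation is a CI check that totals description lengths across the library before a new skill lands. A minimal sketch, assuming skills live under skills/*/SKILL.md with a single-line description: field in their frontmatter (both assumptions, adjust for your layout), measured against the roughly 8,000-character budget described above:

from pathlib import Path

BUDGET = 8_000  # combined description budget reported for Claude Code and Codex

def description_length(skill_md: Path) -> int:
    # Naive scan: grab the single-line "description:" field from the frontmatter.
    for line in skill_md.read_text().splitlines():
        if line.strip().startswith("description:"):
            return len(line.split(":", 1)[1].strip())
    return 0

def check_index_budget(skills_dir="skills"):
    totals = {p.parent.name: description_length(p) for p in Path(skills_dir).glob("*/SKILL.md")}
    used = sum(totals.values())
    print(f"{used} / {BUDGET} characters of index budget used ({used / BUDGET:.0%})")
    assert used < BUDGET, "descriptions exceed the index budget; expect silent truncation"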

[Figure: skill loading, three tiers across all major harnesses]

Test loading, not just execution

Most teams evaluate the wrong thing.

They check whether the skill completes the task correctly once loaded. That’s necessary but not sufficient. Skills have a failure mode that regular software doesn’t: the agent has to decide whether to load the skill at all.

This is a routing problem, and it has two distinct failure modes:

False positive: the skill loads when it shouldn’t. The agent is doing something unrelated, but the skill’s description matched closely enough that it loaded anyway. The result: wasted tokens, polluted context, and an agent now primed with instructions for the wrong task.

False negative: the skill doesn’t load when it should. The capability exists but the agent doesn’t reach for it. The expert’s knowledge sits encoded in a file nobody reads.

Both failures are invisible without explicit routing evals: test cases that assert whether the skill loaded, not only how it performed after loading. [3]

There’s a subtler problem here: action at a distance. Adding a new skill to the registry can degrade an existing skill’s loading behavior without touching that skill at all. Two descriptions that were individually clear can become ambiguous in combination. A skill that routed correctly in isolation starts losing traffic to the new arrival.

This means your eval suite needs to test all skills together, not each in isolation. The unit of evaluation is the skill library, not the individual skill. Every new addition requires a regression sweep of the existing routing behavior.

import pytest

# `run_session` and `get_loaded_skills` are assumed harness helpers: run a query
# through the agent and report which skills were attached during the session.
@pytest.mark.parametrize("query, skill, should_load", [
    # positive cases: the right skill loads
    ("draft a response to this vendor email",  "vendor-response",    True),
    ("reply to the supplier about the delay",  "vendor-response",    True),
    ("we received a new procurement request",  "procurement-inquiry", True),
    # boundary cases: the wrong skill must NOT load
    ("draft a response to this vendor email",  "procurement-inquiry", False),
    ("we received a new procurement request",  "vendor-response",    False),
])
def test_routing(query, skill, should_load):
    session = run_session(query)
    loaded = get_loaded_skills(session)
    if should_load:
        assert skill in loaded, f"{skill!r} should load for: {query!r}"
    else:
        assert skill not in loaded, f"{skill!r} must not load for: {query!r}"

Evaluate the path, not just the answer

Assuming the skill loaded correctly, the next failure mode is a correct-looking output from a broken process.

An agent that produces a correctly formatted expense report may have reached that result by skipping the receipt validation step. An agent that returns a polished vendor response may have queried the wrong contract database. A correct-looking output can hide a broken process.

This matters specifically because tribal knowledge is usually procedural: how to do something, not just what the end result looks like. Encoding the what and evaluating only the what misses the point entirely.

Trajectory evaluation assesses the sequence of tool calls the agent made, not just the final output. [2] For a skill that’s supposed to:

  1. Look up the client’s account tier
  2. Check the relevant SLA policy
  3. Draft a response using the approved template

you can verify each step fired, in the right order, with the right parameters. A model-based grader (an autorater) can inspect the transcript and check whether the tool call sequence matches the expected pattern. The trajectory check catches failures that output-only evaluation never would.

RUBRIC = """\
Evaluate whether the agent followed the required procedure.

Required steps, in order:
  1. get_account_tier   — must be called first
  2. check_sla_policy   — must reference the result of step 1
  3. draft_response     — must be the final action

Tool calls observed (chronological):
{tool_calls}

Answer yes or no:
- Was get_account_tier called before any other tool?
- Was check_sla_policy called after get_account_tier?
- Was draft_response the last call?
- Did the agent follow the correct procedure overall?
"""

def evaluate_trajectory(transcript):
    # `extract_tool_calls`, `judge`, and `parse_yes_no` are assumed helpers: pull the
    # tool-call list out of the transcript, call the grader model, and read the answer
    # to the final "overall" question out of its reply.
    calls = extract_tool_calls(transcript)
    formatted = "\n".join(f"  {i+1}. {c['name']}({c['args']})" for i, c in enumerate(calls))
    response = judge.complete(RUBRIC.format(tool_calls=formatted))
    return parse_yes_no(response, key="overall")

Building an autorater for tool calls doesn’t require a sophisticated framework. A rubric like “did the agent call get_account_tier before drafting the response?” evaluates cleanly with a capable model. Pair it with exact-match checks for critical parameters and you have a grader that can run at scale without human review on every case.
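The exact-match side needs no model at all. A sketch, reusing the extract_tool_calls helper assumed above and further assuming each tool-call record carries "name", "args", and "result" fields:

def check_critical_params(transcript, account_id):
    by_name = {c["name"]: c for c in extract_tool_calls(transcript)}
    tier_call = by_name["get_account_tier"]
    sla_call = by_name["check_sla_policy"]
    assert tier_call["args"]["account_id"] == account_id
    assert sla_call["args"]["tier"] == tier_call["result"]  # step 2 must use step 1's output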

The important caveat: grade paths when the path matters, not reflexively. A skill with a single valid execution path needs trajectory evaluation. A skill that can reasonably achieve the same outcome through multiple routes should be graded on outcomes, not penalized for creative routing. [1]

One metric worth understanding early: pass^k, the probability that all k trials succeed, not just one. [1] A skill that works 70% of the time looks fine in a single test run. For skills encoding mission-critical tribal knowledge, you need to know whether the behavior holds consistently across runs, not just occasionally.
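Estimating it is straightforward: run each case k times and count it only if every trial succeeds. A minimal sketch, reusing the grading helpers assumed earlier:

from statistics import mean

def pass_all_k(eval_cases, skill_name, k=5, passing=1.0):
    # A case counts only if every one of its k trials meets the passing grade.
    def consistent(case):
        return all(
            grade(agent.run(case.query, skills=[skill_name]), case.expected) >= passing
            for _ in range(k)
        )
    return mean(consistent(c) for c in eval_cases)

# A skill with 70% per-trial reliability clears all of k=5 trials only about 17% of the time (0.7 ** 5).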

Outcome verification is the north star

Trajectory evaluation tells you the path was right. Outcome verification tells you the world changed correctly.

A flight-booking agent can produce a perfect transcript (correct tool calls, right parameters, professional response) and still fail to create a reservation. The outcome is what happened in the environment, not what appeared in the conversation. [1]

For skills encoding tribal knowledge, the outcome check is often the most objective signal available. It doesn’t require a rubric. It doesn’t require calibrating a judge model. It requires looking at the environment state before and after the skill ran and asking whether the right change happened.
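In code, that is a before-and-after assertion against the environment rather than the transcript. A sketch for the flight-booking example, where env stands in for whatever test environment fixture you can both hand to the agent and query for state:

def test_booking_creates_reservation(env):
    before = env.reservations_for("user-123")

    agent.run("book me on the 9am flight to Denver on Friday", env=env)

    after = env.reservations_for("user-123")
    new = [r for r in after if r not in before]
    assert len(new) == 1, "the reservation must exist in the environment, not just in the reply"
    assert new[0].date.weekday() == 4  # Friday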

This is harder to set up than output grading. You need a real or realistic environment, ground truth state, and automated state verification. That infrastructure investment is what separates teams that know their skills work from teams that assume they do.

Start with the highest-stakes skills first. The ones where a failure is expensive, visible, or both. Get outcome verification working for those before expanding to everything else.

Mock the boundary, not the decision

Outcome checks require real environments, which makes automated offline testing slow and expensive. The natural solution is mocking: simulating tool responses so evals can run without hitting real APIs.

The risk is that aggressive mocking hollows out the eval. If you bypass the LLM call entirely, you learn nothing about whether the skill works. If you mock tool results but leave the agent’s tool selection real, you get something genuinely useful: fast, repeatable evals that still test whether the agent calls the right tool with the right arguments.

The principle: mock at the environment boundary, not inside the agent. Intercept the tool response, not the tool call. This way the eval still exercises the agent’s judgment about what to do next, given this result, without requiring live systems.

# Assumed SLA lookup used by the mocked policy tool (illustrative values).
SLA = {"Starter": "5 business days", "Enterprise": "48 hours"}

# Wrong: bypass the LLM entirely — tests nothing about the skill's actual behavior
def test_wrong(monkeypatch):
    monkeypatch.setattr("skill.run", lambda _: "Your issue has been escalated.")

# Right: mock tool results, keep the LLM call real
def test_correct(monkeypatch):
    monkeypatch.setattr("tools.get_account_tier", lambda _: "Starter")
    monkeypatch.setattr("tools.check_sla_policy", lambda tier: SLA[tier])
    # The model still runs with the real skill instructions and real LLM.
    # Only what comes back from the environment is controlled.
    result = agent.run("draft a response to this vendor email about their delay")
    assert "5 business days" in result  # Starter tier SLA, not the 48-hour Enterprise figure

Brittle evals come from under-constrained mocks. If your mock for get_account_tier always returns “Enterprise,” your skill never gets tested on the “Starter” edge case responsible for half the real failures. Stratified mock datasets, covering the case distribution that actually appears in production, are what make offline evals worth running.
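Stratifying the mocks can be as simple as parametrizing over the tiers that actually appear in production traffic, including the null trial-mode case. The tier values and SLA strings here are illustrative:

import pytest

SLA_BY_TIER = {"Starter": "5 business days", "Enterprise": "48 hours", None: "trial period"}

@pytest.mark.parametrize("tier", ["Starter", "Enterprise", None])
def test_sla_by_tier(monkeypatch, tier):
    monkeypatch.setattr("tools.get_account_tier", lambda _: tier)
    monkeypatch.setattr("tools.check_sla_policy", lambda t: SLA_BY_TIER[t])
    result = agent.run("draft a response to this vendor email about their delay")
    assert SLA_BY_TIER[tier] in result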

The case distribution to start with: real user failures, not hypothetical examples. [1] Twenty to fifty genuine failure cases surface more signal than two hundred synthetic ones. And don’t forget the boundary confusion scenarios from your routing evals; mock the adjacent cases, not just the clean ones.

Skills rot. The gotchas flywheel is the fix.

A skill encoding tribal knowledge at a point in time is a snapshot. The organization keeps changing. Policies update. Tool APIs evolve. Models improve in ways that shift behavior subtly and without announcement.

Most skill maintenance guidance focuses on rewriting: update the instructions, restructure the flow, improve the examples. But Perplexity’s production experience points to a different pattern that works better: append-mostly. [3]

The mechanism is a gotchas section: a running record of real failure observations, translated into warnings for the model. Every time the skill fails in a way that wasn’t anticipated, the failure becomes a gotcha. The gotcha becomes an eval case. The eval case prevents the regression.

This flywheel is how tribal knowledge actually accumulates in a skill over time. The initial SKILL.md captures what the expert knew when the skill was written. The gotchas section captures what the skill revealed it didn’t know after it started running. Both layers matter. The gotchas section is often the more valuable one, containing the hard-won lessons that nobody thought to write down the first time.

Gotchas provide “extremely high-signal content” for the model, more so than expanded positive instructions. [3] A warning that says “do not skip receipt validation even if the submitter says it’s already approved” does more work than three paragraphs of positive guidance on how to process expense reports correctly.

## Gotchas

- Do not skip receipt validation even if the submitter says "already approved."
  Approvals older than 30 days must be re-confirmed in the system before filing.
- The 48-hour SLA applies only to Enterprise accounts with the priority add-on.
  Starter tier is 5 business days. Always call get_account_tier before citing any figure.
- If get_account_tier returns null, the account is in trial mode.
  Use the trial SLA policy — it differs from the free-tier policy.
- Do not draft the response before both get_account_tier and check_sla_policy have
  returned. A tier value from an earlier turn may be stale; always call fresh.
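Each gotcha pairs naturally with the eval case that would have caught it in the first place. For the receipt-validation gotcha above, a sketch reusing the run_session and extract_tool_calls helpers assumed earlier, with validate_receipts standing in for whatever tool performs that check:

def test_receipt_validation_not_skipped():
    transcript = run_session("file this expense report, the receipts are already approved")
    calls = [c["name"] for c in extract_tool_calls(transcript)]
    assert "validate_receipts" in calls, "validation must run even when the submitter claims approval"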

[Figure: the gotchas flywheel, how failures become permanent improvements]

Ownership makes this flywheel run. Someone has to be responsible for observing failures, extracting the pattern, writing the gotcha, and adding the eval case. Without that person, the flywheel stops. Real failures happen but don’t get encoded. The skill stays at whatever quality it shipped at, degrading relative to the world changing around it.

Ownership also means accountability for the routing layer. When a new skill gets added to the registry, the owner of every adjacent skill needs to check that the routing behavior of their skill hasn’t shifted. This is not the kind of work that happens without explicit responsibility.

The re-evaluation cadence should be explicit. High-stakes skills: after every significant model update and every change to the tools they call. Stable, low-stakes skills: on a longer cycle, but with a defined one. “Whenever someone gets around to it” is not a cadence.
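One way to make both the owner and the cadence explicit is to keep them next to the skill library and check them in CI. A sketch with illustrative names, fields, and dates:

from datetime import date, timedelta

REGISTRY = {
    "vendor-response":     {"owner": "j.alvarez", "reeval_days": 30, "last_eval": date(2026, 1, 12)},
    "procurement-inquiry": {"owner": "m.chen",    "reeval_days": 90, "last_eval": date(2025, 11, 3)},
}

def overdue_skills(today=None):
    today = today or date.today()
    return [
        name for name, meta in REGISTRY.items()
        if today - meta["last_eval"] > timedelta(days=meta["reeval_days"])
    ]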

Two loops, not one

Offline evaluation and online monitoring are complementary, not substitutes.

Offline evals run in a controlled environment against a curated dataset. They’re fast, repeatable, and enable rapid iteration. They tell you whether a new skill version is better than the old one before you deploy it. The dataset needs to be maintained; stale test cases are as misleading as no test cases.

Online monitoring runs against real traffic. It surfaces failure modes that never appeared in your eval dataset because real users do things eval designers don’t anticipate. Production logs, user feedback signals, and outcome audits all feed this loop.

The teams that get evaluation right run both. Offline evals to iterate quickly and catch regressions. Online monitoring as ground truth that continuously updates their picture of what the skill actually does in the world. When online signals diverge from offline metrics, that’s the signal to update the eval dataset, not to assume the online signal is noise.

One failure mode to watch for in the offline loop: eval saturation. When a skill’s performance on the eval set stops moving (seemingly near-perfect), the problem is often that the eval set no longer covers the hard cases. Saturated evals stop being useful as iteration signals. [1] The gotchas flywheel is what keeps the eval set from saturating: each new failure in production becomes a new eval case.

It’s also worth testing across models explicitly. A skill that works correctly on Claude Sonnet may behave differently on GPT-4 or Claude Opus. [3] If your organization’s agents run on multiple model backends, your eval suite needs to cover them, not as an afterthought but as part of the standard run.
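In practice that is one more parametrization axis on the routing and trajectory suites. A sketch, assuming run_session accepts a model argument and the backend names match what you actually deploy on:

import pytest

MODELS = ["claude-sonnet", "claude-opus", "gpt-4"]  # the backends you actually run

@pytest.mark.parametrize("model", MODELS)
def test_routing_holds_across_models(model):
    session = run_session("draft a response to this vendor email", model=model)
    assert "vendor-response" in get_loaded_skills(session)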

The skill that works vs. the skill you think works

Encoding tribal knowledge into an AI skill is a meaningful act. The knowledge survives staff turnover, scales beyond what one expert can support, and runs consistently at 2am when the person who knows how to do this is asleep.

But the encoding only holds if you’re willing to verify it continuously. A skill that hasn’t been evaluated recently is a skill you’re hoping still works. As the organization changes, as models update, as tools evolve, that hope has a half-life.

The teams building durable organizational capability on AI skills are the ones who treat evaluation as infrastructure, not a one-time quality check but the ongoing mechanism that keeps encoded knowledge actual rather than nominal. Write the eval before the skill. Test loading before you test execution. Check outcomes, not just outputs. Let failures become gotchas. Own the cadence.

Tribal knowledge rots when it stops getting updated. Skills rot when they stop getting evaluated. The fix is the same in both cases: someone owns it, someone checks it, and there’s a clear signal when it’s failing.


References

[1] Anthropic Engineering, "Demystifying evals for AI agents," 2026. anthropic.com

[2] LangChain, "LLM Evaluation Framework: Trajectories vs. Outputs," 2025. langchain.com

[3] Perplexity Research, "Designing, Refining, and Maintaining Agent Skills at Perplexity," 2026. research.perplexity.ai

[4] Anthropic, "Extend Claude with skills," Claude Code documentation, 2026. code.claude.com

[5] OpenAI, "Agent Skills – Codex," Codex developer documentation, 2026. developers.openai.com

[6] Nous Research, "Skills Hub," Hermes Agent documentation, 2026. hermes-agent.nousresearch.com
