Essay
Cutting a skill in half without making it worse
Shortening two caliper skills, grill-skill and evaluate-skill, by ~45% using a non-inferiority eval bar: prove the shorter version is no worse before shipping the cut.
I was scrolling X when a tweet from Matt Pocock caught me. [2] His point: skills are littered with no-ops, lines like “be thorough” or “write a detailed commit message”. This is useless because models are RL to the bone already with this, so adding it to the skill is slop.
Usually modifying a skill means adding stuff. This one tried to make it smaller, and what was tricky was proving that smaller did not mean worse.
Skills are always-loaded context. Every line in a SKILL.md is paid for on every invocation, whether or not that line ends up mattering. Matt Pocock has written a writing-great-skills skill. [1] One of the things he advocates for is to prune to the shortest skill that still behaves predictably. The hard part is the second clause. It is easy to delete lines. It is hard to know whether you deleted a load-bearing one.
I took two skills, grill-skill [3] and evaluate-skill [4], and cut them by roughly 45% each. The claim I wanted to defend was not “these are better now.” It was narrower and more honest:
The shortened skill is no worse than the original, and still clearly beats the raw agent with no skill at all.
That is a non-inferiority test, and it changes how you evaluate.
Ablation is all you need
The classic move when you modify a skill is to add precise instructions, modify slop that your model wrote. It’s valid. But a skill trajectory should mostly to be trimmed with time. Models are getting better with every day passing, so your skill is becoming more irrelevant with time. But also models need SPACE! They need a clear context window to avoid being dumb. So let it breathe!
So here the exercise was to run ablation on the skill. I held each skill’s eval fixed and varied only the skill text, comparing pass@k across three variants:
control = the current full skill
candidate = a shortened variant
baseline = the raw agent, no skill at all
The baseline matters more than it looks. Without it, a shortened skill that quietly stopped working could still “pass” against a weak control, and you would never know the skill had stopped contributing anything. The baseline is the floor. If the candidate sinks toward it, the cut removed something real.
The decision rule:
- Non-inferior at k=3. The candidate shows no task-level regression against the control.
- Still earns its keep. The candidate still clearly beats the no-skill baseline.
- Equalling or beating the control is a bonus, not a requirement. The claim is “no worse,” not “better.”
Debugging ran at k=1 to shake out spec and prompt bugs cheaply. Only the accept/reject decision ran at k=3. Both skills used the claude-code backend so the numbers were comparable.
Harden the eval before you trust it
While doing that exercise, I understood that the pre-existing eval was just bad. It was not exercising the prose I intended to cut. They tested YAML serialization and structure, the mechanical output, but never the interview discipline or the judgment guidance that made up the bulk of the text. An eval that cannot feel the damage makes every cut look safe.
So I rewrote the evals first. This feels backwards, since you are strengthening the thing that might block your change, but it is the only way the “no worse” claim means anything. A green light from an eval that can’t see the wound is not evidence.
For grill-skill, the old eval pre-supplied the whole task structure, so it only ever tested serialization. The new one has three tasks:
- Interview discipline (first turn). An open prompt, no structure supplied. The judge checks that the agent interviews the user and stops, rather than fabricating answers and charging ahead. It is backed by a deterministic assert that no spec was written during the turn.
- Gap-fill detect-and-ask. An existing eval is present. The judge checks that the agent reports the existing tasks and asks what is missing before touching anything, with an assert that the existing spec is left unmodified.
- Serialize a valid spec. A fully-directive prompt, graded by a deterministic structure assert.
The first task taught me something worth writing down: the prompt has to signal that the user is present and will answer. Without that signal, an agent that just does the task instead of interviewing is being rational, not disobedient. The eval was measuring the wrong thing until the prompt made the human’s presence explicit.
grill-skill: 143 → 55 lines
The grill-skill cut was the clean one.
| Before | After | Reduction | |
|---|---|---|---|
SKILL.md | 143 lines / ~1240 tok | 55 lines / ~703 tok | ~43% |
What came out:
- The inline caliper command blocks in the interview phases that duplicated
REFERENCE.md. Those became a pointer toREFERENCE.md: a single source of truth instead of two copies that drift. - The verbose “quoted script” instructions, long blocks of exact wording to say to the user, collapsed into concise directives.
- The intro paragraph that restated the description.
What stayed, verbatim in spirit: the load-bearing interview discipline. Interview, confirm, one question at a time, do not run ahead. That is the behavior the skill exists to enforce, so it does not get compressed.
The result at k=3:
| Task | Baseline (no skill) | Control (full) | Candidate (short) |
|---|---|---|---|
| Interview discipline | 1/1 | 3/3 | 3/3 |
| Gap-fill: detect & ask first | 0/1 | 2/3 | 3/3 |
| Serialize a valid spec | 0/1 | 0/3 | 0/3 |
On the informative tasks the candidate matches or beats the control, and it beats the baseline where the skill actually earns its keep. The gap-fill discipline is 3/3 with the skill and 0/1 without it. That row is the whole argument: the skill’s job is to make the agent ask before acting, and the raw agent does not do that on its own.
The third task is 0/3 everywhere: control, candidate, and baseline all fail it. That is not a regression. It is a pre-existing behavior where the agent writes a spec missing the top-level skill: block on a fully-directive prompt. It is non-informative for the ablation because it is equal across every variant, but it is a real bug worth a separate fix. An eval that fails uniformly tells you nothing about your cut, though it can still tell you something about the skill.
Verdict: confidently reduced.
evaluate-skill: one aggressive cut, then bisect
evaluate-skill did not go as smoothly, which is the more useful story.
I started with a maximally aggressive candidate: disclose everything to REFERENCE.md and leave SKILL.md as a thin shell.
v0-full 159 lines / ~1713 tok
v1-short 42 lines / ~734 tok (~57% smaller) rejected
v2-milder 62 lines / ~922 tok (~46% smaller) shipped
v1 moved four judgment reference sections out of SKILL.md and into REFERENCE.md: the artifact-vs-transcript checks, the task-quality checklist, common eval patterns, and how to write expect: rubrics. About 70 lines of always-loaded text, gone.
It regressed. On “create a valid spec,” v1 dropped from 3/3 to 1/3 against the control. The aggressive cut had removed too much structural cue. The agent no longer had enough inline scaffolding to reliably produce a valid spec skeleton. This is the failure the whole method exists to catch, and it is exactly why you do not ship on a read-through.
So I bisected. v2 restores a compact spec skeleton inline (skill.path, backend, the judge shape) while keeping the judgment prose disclosed to REFERENCE.md. The structural cue the agent needs to emit valid YAML lives in SKILL.md; the reference material it only sometimes needs lives one hop away.
v1 removed: structural skeleton + judgment prose -> regressed
v2 removed: judgment prose only -> parity
v2 matched the control across the spec-creation tasks (parity, within one-sample noise) and beat the baseline on valid spec creation. I also added a disclosure check, a short review task that plants a vague expect: and a script-checkable fact, to confirm the agent still reaches the judgment guidance now that it lives in REFERENCE.md. Progressive disclosure only works if the agent actually opens the drawer.
An honest caveat: the final pristine k=3 confirmation for v2 was cut short by a spending cap on the backend account. v2 produced correct specs on every attempt it was allowed to run, and the failures were rate-limit artifacts rather than skill regressions. A clean re-run after the cap resets would move this from “moderate confidence” to “high.” I would rather write that down than round it up.
De-noising the asserts
My first deterministic asserts were too brittle, and the brittleness made the control flaky, which muddied every comparison.
They did exact-match on paths, and they crashed when the agent wrote judge: codex as a string instead of judge: {backend: codex}. A crashing assert is not a failing task. It is noise that shows up in both columns and hides the signal you care about.
The fixes:
- Accept string-or-dict backends (
judge: codexandjudge: {backend: codex}both valid). - Tolerate
./SKILL.mdvsSKILL.md. - Treat invalid YAML as a clean fail, not a crash.
- Soften the review rubric to accept any wording that makes both required points.
The final verdicts come from this de-noised run. When the control is flaky, the numbers tell you about the harness rather than the candidate you set out to measure.
The size summary
| Skill | Before | After | Reduction |
|---|---|---|---|
grill-skill | 143 lines / ~1240 tok | 55 lines / ~703 tok | ~43% |
evaluate-skill | 159 lines / ~1713 tok | 62 lines / ~922 tok | ~46% |
Token counts are approximate (chars over four, since there is no exact tokenizer in the environment), but the relative reductions are robust to whatever tokenizer you use. Line counts are exact.
The description is the most expensive line in the skill
I rewrote both skill descriptions too, and I want to make an emphasis to the reader here. Because the description is not like the rest of the skill, it’s even MORE important. The body only loads when the skill fires. The description is in context on every single turn. It is what the router reads to decide whether the skill even applies. That makes it the most expensive real estate in the whole file, and it is paying rent constantly, most of the time on turns where the skill never runs.
So the description has to earn that cost. I rewrote both to lead with the job word (grill, evaluate) and dropped “caliper”, the tool the skill happens to be built on. A router deciding relevance needs to know what the skill does, not what library sits underneath it. The tool name is fine in the body; sitting in the always-loaded description, it is dead weight you pay for on every turn.
The honest gap is that I never measured whether the new descriptions actually trigger better. That needs its own test, and it is the next thing to build: a trigger evaluation in Caliper that fires each skill against a batch of prompts and checks it activates on the ones it should while staying quiet on the ones it shouldn’t. Until that exists, these rewrites are reasoned, not proven.
The loop
This is the same shape as improving a skill, run in reverse:
Shortest plausible cut
-> harden the eval so it can feel the cut
-> ablate: candidate vs control vs baseline
-> regressed? bisect to a milder cut
-> ship the smallest cut that stays non-inferior
The discipline that makes it work is refusing to trust a read-through. Every line you want to delete looks deletable, which is exactly why you want to delete it. The eval is the only thing standing between “looks redundant” and “was redundant.” Harden it first, or you are just deleting lines and hoping.
The lesson I’d keep: shortening a skill only counts as a win if you can show nothing broke. A smaller skill saves tokens on every invocation, which is worth real effort. But a skill that quietly stopped working costs you far more than it saved, and you won’t find out until it fails on a real task. Measure the cut against a baseline, or don’t make it.
References
[1] Matt Pocock, "writing-great-skills," from mattpocock/skills (Skills for Real Engineers). github.com
[2] Matt Pocock (@mattpocockuk), "Take a look at your favourite skill... Agent-authored skills are LITTERED with no-ops," X. x.com
[3] Caliper, "grill-skill," a local-first eval harness for Claude Code and Codex skills. github.com
[4] Caliper, "evaluate-skill," a local-first eval harness for Claude Code and Codex skills. github.com