A small eval loop for the humanizer skill

Most writing evals start in the wrong place.

They ask whether the text “sounds human.” That sounds reasonable until you try to grade it. Human compared to what? A blog post? A support email? A design doc? A founder note written at midnight?

The more useful question is narrower: when the user gives a writing sample, did the skill preserve that voice while removing AI tells?

I used blader/humanizer as the test case. It is a Claude Code and OpenCode skill that rewrites AI-sounding text using patterns from Wikipedia’s “Signs of AI writing” guide. ^[5] The skill is already strong on the obvious tells: significance inflation, “rapidly evolving landscape,” signposting, generic conclusions, bold inline-list headers, and the other sludge that makes model text feel assembled.

The gap I wanted to evaluate was smaller. Version 2.5.1 already had a Voice Calibration section, but the requirement was soft:

## Voice Calibration (Optional)

If the user provides a writing sample (their own previous writing), analyze it before rewriting:

1. **Read the sample first.** Note:
   - Sentence length patterns (short and punchy? Long and flowing? Mixed?)
   - Word choice level (casual? academic? somewhere between?)
   - How they start paragraphs (jump right in? Set context first?)
   - Punctuation habits (lots of dashes? Parenthetical asides? Semicolons?)
   - Any recurring phrases or verbal tics
   - How they handle transitions (explicit connectors? Just start the next point?)

2. **Match their voice in the rewrite.** Don't just remove AI patterns - replace them with patterns from the sample.

3. **When no sample is provided,** fall back to the default behavior.

That is directionally right. But “analyze it” is easy for the model to internalize silently or skip. The eval target became:

If a writing sample is supplied, the assistant must state a brief voice target before rewriting, then produce a final rewrite that follows that target.

That is much easier to evaluate than “be human.”

The failure case

The current skill’s default style advice is intentionally lively:

**Have opinions.** Don't just report facts - react to them.

**Use "I" when it fits.** First person isn't unprofessional - it's honest.

**Let some mess in.** Perfect structure feels algorithmic.

That advice is good when no sample exists. It is risky when the sample is dry technical prose.

Here is one eval item:

- name: dry_technical_voice
  prompt: |
    Humanize the draft below. Use the writing sample as the voice target.

    Writing sample:
    The cache stores normalized keys and their serialized values. A write replaces
    the previous value for the same key. Expiration is evaluated on read, not by a
    background job. This keeps the implementation small and makes eviction behavior
    easy to test.

    Draft to humanize:
    The caching layer serves as a robust foundation for scalable application
    performance, providing developers with a seamless and intuitive mechanism for
    managing frequently accessed data. Not only does it reduce latency, but it also
    empowers teams to unlock more efficient user experiences across the platform.
    Let's dive into what makes this approach so powerful.

A good answer should not add jokes, first person, or a bloggy aside. It should sound like the person who wrote the cache note cleaned up the bad draft.

The expectation was explicit:

expect: |
  Pass only if the response includes a brief voice card or equivalent style-target
  note before the rewrite, and the final rewritten prose removes promotional
  phrasing, signposting, and negative parallelism while preserving a dry technical
  voice.

  The rewrite should be plain, specific, and impersonal. It should not add "I",
  jokes, emotional reactions, punchy editorial comments, or conversational warmth
  unless present in the source sample.

  Fail if the response does not identify the sample voice before rewriting, makes
  the prose more personal or opinionated than the sample, leaves "serves as",
  "seamless", "unlock", or "Let's dive in", or changes the technical meaning.

This follows the pattern OpenAI recommends for skill evals: prompt, captured run, checks, and a comparable score. ^[1] It also keeps the task straightforward. LangChain’s evaluation writeup makes a useful point here: if the task is too adversarial, you start evaluating the agent’s general problem-solving ability instead of the skill. ^[2]

The Caliper spec

I kept the suite small: four tasks, one narrow behavior.

skill:
  path: ../SKILL.md
  backend: codex
judge:
  backend: codex
tasks:
  - name: edonadei_blog_voice
    prompt: |
      Humanize the draft below. Use the writing sample as the voice target.
      ...
    expect: |
      Pass only if the response includes a brief voice card or equivalent
      style-target note before the rewrite, and the final rewritten prose removes
      obvious AI-writing tells while preserving the sample's voice.

  - name: dry_technical_voice
    prompt: |
      Humanize the draft below. Use the writing sample as the voice target.
      ...
    expect: |
      Pass only if the response includes a brief voice card or equivalent
      style-target note before the rewrite, and the final rewritten prose removes
      promotional phrasing, signposting, and negative parallelism while preserving
      a dry technical voice.

  - name: casual_email_voice
    prompt: |
      Humanize the draft below. Use the writing sample as the voice target.
      ...
    expect: |
      Pass only if the response includes a brief voice card or equivalent
      style-target note before the rewrite, and the final rewritten prose removes
      formal chatbot artifacts and matches the casual email sample.

  - name: no_sample_default
    prompt: |
      Humanize this draft:
      ...
    expect: |
      Grade only the rewritten prose, especially the final rewrite if the response
      includes draft/audit sections. A more natural, opinionated voice is acceptable
      because no sample was supplied.

The fourth task matters. It checks that the change does not break the skill’s normal behavior when there is no sample.

I also added a cheap post-run assertion script. Caliper’s autorater is the main judge, but deterministic checks are useful because they fail loudly and explain exactly what regressed.

AI_TELLS = [
    "rapidly evolving landscape",
    "pivotal milestone",
    "serves as",
    "seamless",
    "unlock",
    "let's dive in",
    "i hope this message finds you well",
    "future looks bright",
    "journey toward excellence",
]

def check_output(task: str, output: str) -> list[str]:
    checked = final_rewrite(output)
    text = checked.lower()
    failures: list[str] = []

    if task != "no_sample_default" and not has_voice_card(output):
        failures.append(f"{task}: did not include a voice card before rewriting")

    for tell in AI_TELLS:
        if tell in text:
            failures.append(f"{task}: left AI tell {tell!r}")

    if re.search(r"\*\*[^*\n]+:\*\*", checked):
        failures.append(f"{task}: used bold inline-header formatting")

    if checked.count("—") > 1:
        failures.append(f"{task}: used more than one em dash")

    if task == "dry_technical_voice" and re.search(r"\bI\b|\bwe\b|\bmy\b|\bour\b", checked):
        failures.append(f"{task}: added first-person language to dry technical voice")

    return failures

The script extracts the final rewrite when the skill uses its normal output format:

def final_rewrite(output: str) -> str:
    patterns = [
        r"\*\*Final rewrite\*\*\s*(.*?)(?:\n\s*\*\*Changes made\*\*|\Z)",
        r"Final rewrite\s*\n+(.+?)(?:\n\s*Changes made|\Z)",
    ]
    for pattern in patterns:
        match = re.search(pattern, output, flags=re.IGNORECASE | re.DOTALL)
        if match:
            return match.group(1).strip()
    return output

That split is important. The skill intentionally returns draft/audit/final sections, so the eval should not penalize the wrapper. It should grade the rewrite.

Baseline result

I ran the current upstream SKILL.md with:

caliper run humanizer-voice-calibration.eval.yaml \
  --k 1 \
  --workers 1 \
  --timeout 180 \
  --judge script \
  --output baseline-old-k1.json

Result:

Task	Current skill
`edonadei_blog_voice`	0/1
`dry_technical_voice`	0/1
`casual_email_voice`	0/1
`no_sample_default`	1/1
Average pass@1	25.0%

The failures all had the same shape. The rewrites were often decent, but the model did not identify the supplied voice before rewriting.

The judge reason for dry_technical_voice was:

The transcript does not include a brief voice card or equivalent style-target note
identifying the sample voice before the rewrite. Although the final rewrite is dry
and removes the prohibited promotional phrasing, the missing voice identification
fails the expectation.

That is the exact kind of failure I want an eval to catch. The output looked okay. The process was under-specified.

Anthropic’s eval guidance distinguishes between checking final outcomes and checking the path that produced them. ^[3] For a writing skill, the path is mostly textual, not tool calls. The “voice card” is the observable trace that the model read the sample as a constraint instead of treating it as flavor.

The patch

The actual skill change is small.

 ## Voice Calibration (Optional)

-If the user provides a writing sample (their own previous writing), analyze it before rewriting:
+If the user provides a writing sample (their own previous writing), that sample becomes
+the style target. Analyze it before rewriting, and let it override the default
+PERSONALITY AND SOUL guidance.

-1. **Read the sample first.** Note:
+1. **Read the sample first and write a brief voice card.** Before the rewrite, note:
    - Sentence length patterns (short and punchy? Long and flowing? Mixed?)
    - Word choice level (casual? academic? somewhere between?)
    - How they start paragraphs (jump right in? Set context first?)
    - Punctuation habits (lots of dashes? Parenthetical asides? Semicolons?)
    - Any recurring phrases or verbal tics
    - How they handle transitions (explicit connectors? Just start the next point?)
+   - Point of view (first person? second person? detached?)
+   - Emotional temperature (dry, warm, blunt, funny, skeptical, enthusiastic?)

-2. **Match their voice in the rewrite.** Don't just remove AI patterns - replace them with patterns from the sample.
+2. **Use the voice card as a constraint, not decoration.** Don't just remove AI patterns -
+replace them with patterns from the sample. If they write dry technical prose, keep it dry.

-3. **When no sample is provided,** fall back to the default behavior.
+3. **Do not import personality from this guide unless the sample supports it.** Do not add
+first person, jokes, punchy asides, mixed feelings, edge, or extra opinion just because
+the default guidance says human writing can have those traits.
+
+4. **Preserve meaning and remove AI tells.** Voice matching does not permit changing
+claims, adding unsupported details, or leaving obvious AI artifacts in place.

I also updated the output format:

 Provide:
-1. Draft rewrite
-2. "What makes the below so obviously AI generated?" (brief bullets)
-3. Final rewrite
-4. A brief summary of changes made (optional, if helpful)
+1. Voice card (only when a writing sample is provided)
+2. Draft rewrite
+3. "What makes the below so obviously AI generated?" (brief bullets)
+4. Final rewrite
+5. A brief summary of changes made (optional, if helpful)

This is deliberately not a rewrite of the whole skill. The improvement is one constraint: sample voice overrides default personality.

Improved result

Same command, patched skill:

caliper run evals/humanizer-voice-calibration.eval.yaml \
  --k 1 \
  --workers 1 \
  --timeout 180 \
  --judge script \
  --output improved-k1.json

python evals/assert_voice.py improved-k1.json

Result:

Task	Current skill	Patched skill
`edonadei_blog_voice`	0/1	1/1
`dry_technical_voice`	0/1	1/1
`casual_email_voice`	0/1	1/1
`no_sample_default`	1/1	1/1
Average pass@1	25.0%	100.0%

The post-run assertions passed:

Voice assertions passed.

Small evals are not benchmarks. This is four tasks, k=1, one backend, one date. The point is not “humanizer is now solved.” The point is that a previously implicit behavior now has a regression test.

That is the loop I care about:

Observed failure
  -> eval case
  -> small instruction patch
  -> rerun
  -> regression coverage

This is also why I prefer evals that name the failure precisely. “Make the output more human” would have produced a mushy patch. “When sample exists, show and follow the voice target” produced a small one.

CI for the skill

The follow-up tooling PR adds a GitHub Actions workflow too.

name: Caliper

on:
  pull_request:
  workflow_dispatch:

jobs:
  humanizer-evals:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Caliper
        run: python -m pip install "caliper-eval[openai]"

      - name: Validate eval spec
        run: caliper validate evals/humanizer-voice-calibration.eval.yaml

      - name: Run Caliper eval
        if: ${{ env.OPENAI_API_KEY != '' || env.ANTHROPIC_API_KEY != '' }}
        run: |
          python - <<'PY'
          import os
          from pathlib import Path
          import yaml

          spec_path = Path("evals/humanizer-voice-calibration.eval.yaml")
          spec = yaml.safe_load(spec_path.read_text())
          backend = "openai-api" if os.environ.get("OPENAI_API_KEY") else "claude-api"
          spec["skill"]["backend"] = backend
          spec["judge"]["backend"] = backend
          Path("evals/humanizer-voice-calibration.ci.eval.yaml").write_text(
              yaml.safe_dump(spec, sort_keys=False)
          )
          PY

          caliper run evals/humanizer-voice-calibration.ci.eval.yaml \
            --k 1 \
            --baseline \
            --judge script \
            --output results.json

          python evals/assert_voice.py results.json

The workflow always validates the spec. It only runs the full eval when a model API key is configured. That keeps the PR useful for maintainers who do not want model calls on every contribution, while still making the full regression check available.

Contribution

I packaged the work as two upstream contributions rather than leaving it as a blog-only experiment.

The first PR is intentionally small: it changes only SKILL.md.

codex/voice-calibration
  -> Clarify that a supplied writing sample overrides the default personality guidance.
  -> Require a brief voice card before rewriting.

The second PR is stacked on top and adds only the evaluation machinery:

codex/humanizer-caliper-evals
  -> Add evals/humanizer-voice-calibration.eval.yaml.
  -> Add evals/assert_voice.py.
  -> Add .github/workflows/caliper.yml.

The first PR body leads with the eval result:

| Version | pass@1 |
|---|---:|
| Current upstream `SKILL.md` | 25.0% |
| Voice Calibration PR | 100.0% |

The three failures in the current skill all had the same shape: the final
rewrite could be good, but the response did not identify the supplied sample's
voice before rewriting. This PR makes that step explicit.

The second PR gives the maintainer the harness separately. That split matters. The behavior change can be reviewed on its own, and the eval/CI tooling can be discussed without blocking the smaller improvement.

That is the contribution shape I want more skills to have. Not just “I improved the prompt.” Not just “I added evals.” The useful unit is both: a behavior change and the harness that explains why it should stay changed. Splitting them into two PRs just makes the review path cleaner.

The earlier post on skill evals made the broad argument: skills without evals are just optimism. ^[4] This is the smaller version in practice. One failure, one eval, one patch, one regression check.

References

[1] OpenAI Developers, "Testing Agent Skills Systematically with Evals," January 2026. developers.openai.com

[2] LangChain, "Evaluating Skills," 2026. langchain.com

[3] Anthropic Engineering, "Demystifying evals for AI agents," 2026. anthropic.com

[4] edonadei, "Skills without evals are just optimism," May 2026. edonadei.com

[5] Wikipedia, "Wikipedia:Signs of AI writing," maintained by WikiProject AI Cleanup. wikipedia.org

[6] blader, "humanizer," GitHub repository. github.com