A small eval loop for the humanizer skill
A case study in using Caliper to evaluate blader/humanizer, tighten voice calibration, and turn the improvement into an upstream contribution with regression coverage.
Tagged: agents, evaluation, skills, writing
Tribal knowledge encoded as an AI skill is still just text until you evaluate it. Ablation baselines, routing regression tests, trajectory autoraters, and the gotchas flywheel keep encoded knowledge from rotting.
Everyone is worried about AI reading things it shouldn't. That's the wrong threat model. The problem starts after the agent reads.