harness:certify
About
This skill verifies an evolved agent's score stability by running its evaluation three times and reporting the mean and standard deviation. It's used when developers need to confirm that performance metrics are reliable and not due to random variance. The skill automatically reads the project configuration and executes evaluations on the current best code.
Quick Install
Claude Code
Recommendednpx skills add raphaelchristi/harness-evolver -a claude-code/plugin add https://github.com/raphaelchristi/harness-evolvergit clone https://github.com/raphaelchristi/harness-evolver.git ~/.claude/skills/harness:certifyCopy and paste this command in Claude Code to install this skill
Documentation
/harness:certify
Verify score stability by running evaluation multiple times and reporting statistical confidence.
Resolve Tool Path
TOOLS="${EVOLVER_TOOLS:-$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")}"
EVOLVER_PY="${EVOLVER_PY:-$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")}"
What To Do
Read .evolver.json to get the best experiment and dataset.
Run evaluation 3 times on the current code (not a worktree — the best code is already merged):
for i in 1 2 3; do
$EVOLVER_PY $TOOLS/run_eval.py \
--config .evolver.json \
--worktree-path "." \
--experiment-prefix "certify-run-$i"
done
After all 3 runs complete, read results and compute statistics:
$EVOLVER_PY $TOOLS/read_results.py --experiments "certify-run-1-{suffix},certify-run-2-{suffix},certify-run-3-{suffix}" --config .evolver.json --format summary
Calculate mean and standard deviation from the 3 combined_scores.
Report
CERTIFICATION REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Runs: 3
Mean: {mean:.3f}
Std: {std:.3f}
Range: {min:.3f} — {max:.3f}
Verdict: {STABLE|UNSTABLE}
STABLE (std < 0.05): Score is reliable. The agent performs consistently.
MARGINAL (0.05 <= std < 0.10): Score varies moderately. Consider adding rubrics to reduce judge variance.
UNSTABLE (std >= 0.10): Score is unreliable. The LLM judge interprets criteria differently across runs. Add few-shot examples or tighter rubrics.
After Certification
If STABLE: suggest /harness:deploy to finalize.
If UNSTABLE: suggest adding rubrics to dataset examples, or running /harness:evolve with heavy mode for more thorough evaluation.
GitHub Repository
Related Skills
railway-docs
DocumentationThis skill fetches current Railway documentation to answer questions about features, functionality, or specific docs URLs. It ensures developers receive accurate, up-to-date information directly from Railway's official sources. Use it when users ask how Railway works or reference Railway documentation.
n8n-code-python
DocumentationThis Claude Skill provides expert guidance for writing Python code in n8n's Code nodes, specifically for using Python's standard library and working with n8n's special syntax like `_input`, `_json`, and `_node`. It helps developers understand Python's limitations within n8n and recommends using JavaScript for most workflows while offering Python solutions for specific data transformation needs.
archon
DocumentationThe Archon skill provides RAG-powered semantic search and project management through a REST API. Use it for querying documentation, managing hierarchical projects/tasks, and performing knowledge retrieval with document upload capabilities. Always prioritize Archon first when searching external documentation before using other sources.
n8n-code-javascript
DocumentationThis Claude Skill provides expert guidance for writing JavaScript code in n8n's Code nodes. It covers essential n8n-specific syntax like `$input`/`$json` variables, HTTP helpers, and DateTime handling, while troubleshooting common errors. Use it when developing n8n workflows that require custom JavaScript processing in Code nodes.
