
Evals

majiayu000
Updated Today
Testing

About

Evals is an agent evaluation framework for testing and benchmarking Claude Code agents using Anthropic's best practices. It provides three grader types (code-based, model-based, human), transcript capture, and pass@k metrics for regression and capability testing. Use this skill when you need to evaluate, verify, or benchmark agent behavior.

Quick Install

Claude Code

Plugin Command (Recommended)

/plugin add https://github.com/majiayu000/claude-skill-registry

Git Clone (Alternative)

git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/Evals

Copy and paste one of the commands above into Claude Code to install this skill.

Documentation

Customization

Before executing, check for user customizations at: ~/.claude/skills/CORE/USER/SKILLCUSTOMIZATIONS/Evals/

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
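For reference, here is a minimal sketch of that check, assuming a Bun/Node runtime. The directory and file names come from the paragraph above; the helper itself is hypothetical, not part of the skill.

import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

// Hypothetical helper: return the user's PREFERENCES.md content if the
// override directory exists, otherwise undefined so skill defaults apply.
function loadEvalsCustomizations(): string | undefined {
  const dir = join(homedir(), ".claude", "skills", "CORE", "USER", "SKILLCUSTOMIZATIONS", "Evals");
  if (!existsSync(dir)) return undefined;            // no overrides: use skill defaults
  const prefs = join(dir, "PREFERENCES.md");
  return existsSync(prefs) ? readFileSync(prefs, "utf8") : "";
}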

🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

You MUST send this notification BEFORE doing anything else when this skill is invoked.

  1. Send voice notification:

    curl -s -X POST http://localhost:8888/notify \
      -H "Content-Type: application/json" \
      -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
      > /dev/null 2>&1 &
    
  2. Output text notification:

    Running the **WorkflowName** workflow in the **Evals** skill to ACTION...
    

This is not optional. Execute this curl command immediately upon skill invocation.

Evals - AI Agent Evaluation Framework

Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).

Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.


When to Activate

  • "run evals", "test this agent", "evaluate", "check quality", "benchmark"
  • "regression test", "capability test"
  • Compare agent behaviors across changes
  • Validate agent workflows before deployment
  • Verify ALGORITHM ISC rows
  • Create new evaluation tasks from failures

Core Concepts

Three Grader Types

| Type | Strengths | Weaknesses | Use For |
|------|-----------|------------|---------|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |
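All three grader types can be treated as functions from a captured transcript to a score. A minimal sketch of that shape, using hypothetical types (the skill's actual definitions live in Types/index.ts):

// Hypothetical shapes -- the real definitions live in Types/index.ts.
interface Transcript {
  finalOutput: string;
  toolCalls: { name: string; args: unknown }[];
}

interface GradeResult {
  score: number;        // 0..1
  reasoning?: string;   // model-based and human graders can explain the score
}

type Grader = (transcript: Transcript) => Promise<GradeResult>;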

Evaluation Types

| Type | Pass Target | Purpose |
|------|-------------|---------|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |

Key Metrics

  • pass@k: Probability of at least one success in k trials (measures capability)
  • pass^k: Probability that all k trials succeed (measures consistency/reliability); see the sketch below
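A minimal sketch of both metrics, under the simplifying assumption that trials are independent with success rate p = successes / trials (Tools/TrialRunner.ts may use a different estimator):

// pass@k: chance that at least one of k independent trials succeeds.
// pass^k: chance that all k independent trials succeed.
function passAtK(successes: number, trials: number, k: number): number {
  const p = successes / trials;            // observed per-trial success rate
  return 1 - Math.pow(1 - p, k);
}

function passHatK(successes: number, trials: number, k: number): number {
  const p = successes / trials;
  return Math.pow(p, k);
}

// Example: 7 successes out of 10 trials, k = 3
// passAtK(7, 10, 3)  ≈ 0.973  (capability looks strong)
// passHatK(7, 10, 3) ≈ 0.343  (consistency is still weak)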

Workflow Routing

| Trigger | Workflow |
|---------|----------|
| "run evals", "evaluate suite" | Run suite via Tools/AlgorithmBridge.ts |
| "log failure" | Log failure via Tools/FailureToTask.ts log |
| "convert failures" | Convert to tasks via Tools/FailureToTask.ts convert-all |
| "create suite" | Create suite via Tools/SuiteManager.ts create |
| "check saturation" | Check via Tools/SuiteManager.ts check-saturation |

Quick Reference

CLI Commands

# Run an eval suite
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s <suite>

# Log a failure for later conversion
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts log "description" -c category -s severity

# Convert failures to test tasks
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts convert-all

# Manage suites
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts graduate <name>

ALGORITHM Integration

Evals is a verification method for THE ALGORITHM ISC rows:

# Run eval and update ISC row
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u

ISC rows can specify eval verification:

| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |
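A hedged sketch of how an eval:<suite> entry in the Verify column above could be turned into a suite run; the parsing shown here is illustrative, not the actual logic in Tools/AlgorithmBridge.ts:

// Hypothetical mapping from an ISC "Verify" cell to a suite invocation.
function verifyCellToCommand(cell: string): string | undefined {
  const match = /^eval:([\w-]+)$/.exec(cell.trim());
  if (!match) return undefined;   // not an eval-based verification
  const suite = match[1];
  return `bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s ${suite} -u`;
}

// verifyCellToCommand("eval:auth-security")
//   -> "bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s auth-security -u"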

Available Graders

Code-Based (Fast, Deterministic)

| Grader | Use Case |
|--------|----------|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |
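To illustrate how cheap and deterministic these checks are, here is a hypothetical sketch of a tool_calls-style grader (not the implementation in Graders/CodeBased/):

// Hypothetical tool_calls grader: checks that the expected tools were
// invoked in order, ignoring any extra calls in between.
function gradeToolCalls(called: string[], expectedSequence: string[]): number {
  let i = 0;
  for (const name of called) {
    if (name === expectedSequence[i]) i++;
    if (i === expectedSequence.length) return 1;     // full sequence observed
  }
  return i / expectedSequence.length;                // partial credit
}

// gradeToolCalls(["read_file", "edit_file", "run_tests"],
//                ["read_file", "edit_file", "run_tests"])  -> 1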

Model-Based (Nuanced)

| Grader | Use Case |
|--------|----------|
| llm_rubric | Score against detailed rubric |
| natural_language_assert | Check assertions are true |
| pairwise_comparison | Compare to reference with position swap |
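The position swap in pairwise_comparison exists because LLM judges tend to prefer whichever answer appears first. A hedged sketch of that pattern, where the judge function is a placeholder rather than a real API:

// Placeholder judge: returns "A" or "B" for whichever answer it prefers.
// In the real grader this would be an LLM call with a comparison prompt.
type Judge = (first: string, second: string) => Promise<"A" | "B">;

// Run the comparison twice with positions swapped to cancel position bias.
// Score 1 if the candidate wins both orderings, 0.5 on a split, 0 otherwise.
async function pairwiseScore(judge: Judge, candidate: string, reference: string): Promise<number> {
  const forward = await judge(candidate, reference);   // candidate shown first
  const swapped = await judge(reference, candidate);   // candidate shown second
  const wins = (forward === "A" ? 1 : 0) + (swapped === "B" ? 1 : 0);
  return wins / 2;
}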

Domain Patterns

Pre-configured grader stacks for common agent types:

| Domain | Primary Graders |
|--------|-----------------|
| coding | binary_tests + static_analysis + tool_calls + llm_rubric |
| conversational | llm_rubric + natural_language_assert + state_check |
| research | llm_rubric + natural_language_assert + tool_calls |
| computer_use | state_check + tool_calls + llm_rubric |

See Data/DomainPatterns.yaml for full configurations.


Task Schema (YAML)

task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding

  graders:
    - type: binary_tests
      required: [test_empty_pw.py]
      weight: 0.30

    - type: tool_calls
      weight: 0.20
      params:
        sequence: [read_file, edit_file, run_tests]

    - type: llm_rubric
      weight: 0.50
      params:
        rubric: prompts/security_review.md

  trials: 3
  pass_threshold: 0.75
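The grader weights above sum to 1.0. Assuming a trial passes when the weighted sum of grader scores clears pass_threshold, the combination looks roughly like this (hypothetical; the real logic lives in Tools/TrialRunner.ts):

// Hypothetical trial scoring: weighted sum of grader scores compared
// against the task's pass_threshold.
interface GraderResult { weight: number; score: number }   // score in 0..1

function trialPasses(results: GraderResult[], passThreshold: number): boolean {
  const weighted = results.reduce((sum, r) => sum + r.weight * r.score, 0);
  return weighted >= passThreshold;
}

// Using the weights from the schema above, with per-grader scores
// binary_tests = 1.0, tool_calls = 0.5, llm_rubric = 0.9:
// weighted = 0.30*1.0 + 0.20*0.5 + 0.50*0.9 = 0.85 >= 0.75 -> pass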

Resource Index

| Resource | Purpose |
|----------|---------|
| Types/index.ts | Core type definitions |
| Graders/CodeBased/ | Deterministic graders |
| Graders/ModelBased/ | LLM-powered graders |
| Tools/TranscriptCapture.ts | Capture agent trajectories |
| Tools/TrialRunner.ts | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Data/DomainPatterns.yaml | Domain-specific grader configs |

Key Principles (from Anthropic)

  1. Start with 20-50 real failures - Don't overthink, capture what actually broke
  2. Unambiguous tasks - Two experts should reach identical verdicts
  3. Balanced problem sets - Test both "should do" AND "should NOT do"
  4. Grade outputs, not paths - Don't penalize valid creative solutions
  5. Calibrate LLM judges - Against human expert judgment
  6. Check transcripts regularly - Verify graders work correctly
  7. Monitor saturation - Graduate to regression when hitting 95%+ (see the sketch after this list)
  8. Build infrastructure early - Evals shape how quickly you can adopt new models
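For principle 7, a hedged sketch of what a saturation check could look like (illustrative only; the actual check is Tools/SuiteManager.ts check-saturation):

// Hypothetical saturation check: if recent pass rates stay at or above the
// threshold, the suite no longer discriminates and can graduate to regression.
function isSaturated(recentPassRates: number[], threshold = 0.95): boolean {
  if (recentPassRates.length === 0) return false;
  return recentPassRates.every((rate) => rate >= threshold);
}

// isSaturated([0.96, 0.97, 1.0]) -> true: graduate the suite
// isSaturated([0.96, 0.82, 1.0]) -> false: keep it as a capability suite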

Related

  • ALGORITHM: Evals is a verification method
  • Science: Evals implements scientific method
  • Browser: For visual verification graders

GitHub Repository

majiayu000/claude-skill-registry
Path: skills/data/Evals

Related Skills

content-collections

Meta

This skill provides a production-tested setup for Content Collections, a TypeScript-first tool that transforms Markdown/MDX files into type-safe data collections with Zod validation. Use it when building blogs, documentation sites, or content-heavy Vite + React applications to ensure type safety and automatic content validation. It covers everything from Vite plugin configuration and MDX compilation to deployment optimization and schema validation.


evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.


cloudflare-turnstile

Meta

This skill provides comprehensive guidance for implementing Cloudflare Turnstile as a CAPTCHA-alternative bot protection system. It covers integration for forms, login pages, API endpoints, and frameworks like React/Next.js/Hono, while handling invisible challenges that maintain user experience. Use it when migrating from reCAPTCHA, debugging error codes, or implementing token validation and E2E tests.


webapp-testing

Testing

This Claude Skill provides a Playwright-based toolkit for testing local web applications through Python scripts. It enables frontend verification, UI debugging, screenshot capture, and log viewing while managing server lifecycles. Use it for browser automation tasks but run scripts directly rather than reading their source code to avoid context pollution.

View skill