evaluating-code-models

davila7

Updated 10 days ago

365 views

18,478

1,685

Meta, Evaluation, Code Generation, HumanEval, MBPP, MultiPL-E, Pass@k, BigCode, Benchmarking, Code Models

About

This skill benchmarks code generation models on standard evaluations such as HumanEval and MBPP, across multiple programming languages. It computes pass@k metrics for comparing model performance, testing multi-language support, and measuring code quality. Use it when you need a rigorous evaluation or comparison of coding models; it is the same tool that powers HuggingFace's code leaderboards.
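To make the pass@k metric concrete, here is a minimal sketch of the commonly used unbiased estimator: for a problem with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are hypothetical, chosen for illustration, and this is not claimed to be the exact code the skill or its underlying harness runs.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the unit tests
    k: the k in pass@k (must not exceed n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset
    # contains at least one passing sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset has no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over problems, for example `mean_pass_at_k([(10, 3), (10, 0)], k=1)` for two problems with ten samples each. Generating more samples than k per problem (n > k) reduces the variance of the estimate.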

Quick Install

Claude Code

Recommended

Primary

npx skills add davila7/claude-code-templates -a claude-code

Plugin Command (Alternative)

/plugin add https://github.com/davila7/claude-code-templates

Git Clone (Alternative)

git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/evaluating-code-models

Copy and paste this command in Claude Code to install this skill

GitHub Repository

davila7/claude-code-templates

Path: cli-tool/components/skills/ai-research/evaluation-bigcode-evaluation-harness

anthropic, anthropic-claude, claude, claude-code

Related Skills

langsmith-observability

phoenix-observability

Testing

Phoenix is an open-source AI observability platform for tracing, evaluating, and monitoring LLM applications. It provides detailed traces for debugging, runs evaluations on datasets, and offers real-time monitoring for production systems. Key capabilities include experiment pipelines and self-hosted observability without vendor lock-in.

evaluating-llms-harness

Testing

This skill runs standardized LLM evaluations across 60+ academic benchmarks like MMLU and GSM8K using the industry-standard lm-evaluation-harness. Use it for benchmarking model quality, comparing different models, or tracking training progress with support for HuggingFace, vLLM, and API-based models. It provides a consistent, widely-adopted method for reporting academic results.
