A comprehensive, production-ready framework for evaluating code generation models on programming benchmarks. Designed for researchers and practitioners working with LLMs for code generation tasks.
Inspired by CURE
- Features
- Architecture
- Installation
- Quick Start
- Datasets
- Configuration
- Usage Examples
- Output Format
- Performance
- Troubleshooting
- Contributing
- Citation
- Multiple Inference Backends: Seamlessly switch between vLLM (local) and API-based models (OpenAI, Anthropic, etc.)
- High-Performance Execution: Distributed GPU inference with optimized batching and memory management
- Comprehensive Metrics: Pass@k, execution success rates, Best-of-N sampling, and custom metrics
- Safe Code Execution: Sandboxed execution with timeout protection and resource limits
- Rich Analytics: Detailed performance analysis with multiple evaluation modes
- Multi-GPU Support: Efficient parallel inference across multiple GPUs with configurable worker groups
- Adaptive Batching: Dynamic batch sizing for optimal throughput (up to 256 concurrent sequences)
- Memory Optimization: KV-cache management, prefix caching, and chunked prefill for large models
- Flexible Prompting: Customizable prompt templates with Jinja2 templating
- Robust Error Handling: Graceful failure recovery with detailed error logging
- MBPP (Mostly Basic Python Problems)
- LiveCodeBench
- CodeContests
- CodeForces
- LiveBench
- Custom datasets (with proper formatting)
┌─────────────────────────────────────────────────────────┐
│                   Configuration Layer                   │
│                   (Hydra + OmegaConf)                   │
└─────────────────────────────────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│                   Evaluation Pipeline                   │
│  ┌─────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │   Dataset   │ → │  Generation  │ → │  Execution   │  │
│  │   Loader    │   │    Engine    │   │   Sandbox    │  │
│  └─────────────┘   └──────────────┘   └──────────────┘  │
│                           ▼                             │
│                    ┌──────────────┐                     │
│                    │   Metrics    │                     │
│                    │  Calculator  │                     │
│                    └──────────────┘                     │
└─────────────────────────────────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│                   Inference Backends                    │
│  ┌─────────────────────┐   ┌─────────────────────────┐  │
│  │        vLLM         │   │       API Clients       │  │
│  │   (Local Models)    │   │   (OpenAI, Anthropic)   │  │
│  └─────────────────────┘   └─────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
- Python 3.8 or higher
- CUDA 11.8+ (for GPU acceleration)
- At least 16GB RAM (32GB+ recommended for large models)
- NVIDIA GPU with 24GB+ VRAM (for local model inference)
git clone https://github.com/TimeLovercc/code-evaluator.git
cd code-evaluator
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt
cd data
# Download evaluation datasets
python download_data.py --dataset MBPP
python download_data.py --dataset LiveCodeBench
python download_data.py --dataset CodeContests
python download_data.py --dataset CodeForces
python download_data.py --dataset LiveBench
# Optional: Download training data
python download_data.py --dataset CodeContests_train
cd ..
# Run evaluation with default settings (MBPP dataset)
bash scripts/eval.sh
# Run comprehensive evaluation across all datasets
bash scripts/all_eval.sh
# Using vLLM with a specific model
python src/evaluate/evaluation_exp.py \
inference.vllm.pretrained_model="codellama/CodeLlama-7b-Python-hf" \
dataset.name="MBPP"
# Using API-based model
python src/evaluate/evaluation_exp.py \
inference.use_api=true \
inference.api.model_name="gpt-4" \
inference.api.key="YOUR_API_KEY" \
dataset.name="LiveCodeBench"
The framework uses a standardized JSON format with Stdio input/output:
{
"task_id": 0,
"question": "Problem description here",
"test_input": ["5\n1 2 3 4 5\n"],
"test_output": ["15\n"],
"example_input": ["3\n1 2 3\n"],
"example_output": ["6\n"],
"test_time_limit": 1
}
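As a rough illustration of what a Stdio test means in this schema, the sketch below feeds one `test_input` string to a candidate program on stdin, enforces `test_time_limit` as a timeout, and compares stdout with `test_output`. The helper name `run_stdio_test` and the whitespace-insensitive comparison are assumptions made for this example; the framework's own sandbox is not shown in this README and likely adds resource limits that a plain subprocess does not provide.

```python
import subprocess
import sys


def run_stdio_test(code: str, stdin_text: str, expected: str, time_limit: float) -> bool:
    """Run `code` in a fresh Python process and compare its stdout to `expected`.

    Hypothetical helper: this is NOT a sandbox, only a timeout-guarded subprocess.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=time_limit,
        )
    except subprocess.TimeoutExpired:
        return False  # the program exceeded the time limit
    if proc.returncode != 0:
        return False  # the program crashed or exited with an error
    # Assumed comparison rule: ignore leading/trailing whitespace.
    return proc.stdout.strip() == expected.strip()


# A task record in the schema above (only the fields this sketch needs).
task = {
    "test_input": ["5\n1 2 3 4 5\n"],
    "test_output": ["15\n"],
    "test_time_limit": 1,
}

# A candidate solution that reads from stdin and prints the sum.
solution = "n = int(input())\nprint(sum(map(int, input().split())))\n"

results = [
    run_stdio_test(solution, stdin_text, expected, task["test_time_limit"])
    for stdin_text, expected in zip(task["test_input"], task["test_output"])
]
print(results)
```

Each boolean in `results` corresponds to one test case, which is the kind of per-test outcome that the accuracy metrics aggregate.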
Dataset | # Problems | Difficulty | Format |
---|---|---|---|
MBPP | 974 | Basic | Stdio |
LiveCodeBench | 400+ | Mixed | Stdio/Functional* |
CodeContests | 13,000+ | Hard | Stdio |
CodeForces | 10,000+ | Mixed | Stdio |
LiveBench | 200+ | Mixed | Stdio/Functional* |
*Automatically converted to Stdio format using data/transformation.ipynb
To add your own dataset:
- Format your data according to the schema above
- Place the JSON file in data/eval_data/
- Update config.yaml with your dataset name
- Run evaluation as usual
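Before running an evaluation on a custom dataset, it can help to check each record against the schema. The sketch below validates records and writes them as a single JSON array. The helper names, the choice of a JSON array rather than one object per line, and the file name `my_dataset.json` are all assumptions for illustration; check the loader in the repository for the layout it actually expects.

```python
import json
import tempfile
from pathlib import Path

# Keys taken from the schema shown earlier in this README.
REQUIRED_KEYS = {
    "task_id", "question", "test_input", "test_output",
    "example_input", "example_output", "test_time_limit",
}


def validate_task(task: dict) -> list:
    """Return a list of problems found in one record (empty list means it looks valid)."""
    problems = []
    missing = REQUIRED_KEYS - task.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for in_key, out_key in (("test_input", "test_output"),
                            ("example_input", "example_output")):
        if len(task.get(in_key, [])) != len(task.get(out_key, [])):
            problems.append(f"{in_key} and {out_key} differ in length")
    return problems


def write_dataset(tasks: list, path: Path) -> None:
    """Validate every record, then write them as one JSON array (assumed layout)."""
    for task in tasks:
        problems = validate_task(task)
        if problems:
            raise ValueError(f"task {task.get('task_id')}: {problems}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(tasks, indent=2), encoding="utf-8")


tasks = [{
    "task_id": 0,
    "question": "Read n integers and print their sum.",
    "test_input": ["5\n1 2 3 4 5\n"],
    "test_output": ["15\n"],
    "example_input": ["3\n1 2 3\n"],
    "example_output": ["6\n"],
    "test_time_limit": 1,
}]

# Written to a temporary directory here; in the repository the target
# would be a file under data/eval_data/.
out_path = Path(tempfile.mkdtemp()) / "my_dataset.json"
write_dataset(tasks, out_path)
print(f"wrote {len(tasks)} task(s) to {out_path}")
```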
For datasets with functional format (e.g., assert-based tests), use the provided conversion tool:
# Open data/transformation.ipynb
# Follow the notebook to convert functional → Stdio format
The framework uses Hydra for configuration management. Main configuration file: src/evaluate/config.yaml
inference:
  use_api: false  # true for API models, false for vLLM

  # vLLM settings
  vllm:
    pretrained_model: "Qwen/Qwen3-4B"
    max_model_len: 16384
    max_generation_token: 4096
    temp: 0.8
    gpu_groups: [[0], [1], [2], [3]]  # GPU allocation
    max_batch_size: 256

  # API settings
  api:
    model_name: "gpt-4o-mini"
    key: "YOUR_API_KEY"
    temperature: 0.8
    max_workers: 20
    rpm_limit: 100

generation:
  k_code: 16  # Number of code samples per task
  k_case: 16  # Number of test case samples per task
  no_example: true  # Whether to include examples in prompts

evaluation:
  single_eval: true  # One-shot coding accuracy only
  scale_tuple_list: [[4, 4], [16, 16]]  # Best-of-N configurations
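Command-line overrides such as `inference.vllm.max_batch_size=32` address nested keys in this file by dotted path. The sketch below mimics that idea on a plain dictionary using only the standard library. It is an illustration of the mapping, not Hydra's actual parser, and the helper names `parse_value` and `apply_overrides` are invented for this example.

```python
import ast
import copy


def parse_value(raw: str):
    """Best-effort conversion of an override value (simplified, not Hydra's grammar)."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return ast.literal_eval(raw)  # numbers, lists, quoted strings
    except (ValueError, SyntaxError):
        return raw  # anything else stays a plain string


def apply_overrides(config: dict, overrides: list) -> dict:
    """Return a copy of `config` with each `a.b.c=value` override applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections on the way
        node[leaf] = parse_value(raw)
    return result


# A small subset of the configuration shown above.
base_config = {
    "inference": {
        "use_api": False,
        "vllm": {"pretrained_model": "Qwen/Qwen3-4B", "max_batch_size": 256},
    },
}

cfg = apply_overrides(base_config, [
    "inference.vllm.max_batch_size=32",
    "inference.vllm.gpu_groups=[[0,1],[2,3]]",
    "dataset.name=MBPP",
])
print(cfg["inference"]["vllm"]["max_batch_size"])
print(cfg["inference"]["vllm"]["gpu_groups"])
print(cfg["dataset"]["name"])
```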
export PYTHONPATH=.
export NCCL_P2P_DISABLE=1 # For multi-GPU without NVLink
export VLLM_USE_V1=0
export OMP_NUM_THREADS=8
# benchmark_models.py
import json
import subprocess
import sys

models = [
    "codellama/CodeLlama-7b-Python-hf",
    "Qwen/Qwen2.5-Coder-7B",
    "deepseek-ai/deepseek-coder-6.7b-base",
]

results = {}
for model in models:
    # Pass arguments as a list so the model name needs no shell quoting
    cmd = [
        sys.executable,
        "src/evaluate/evaluation_exp.py",
        f"inference.vllm.pretrained_model={model}",
    ]
    subprocess.run(cmd, check=True)  # Stop if an evaluation run fails

    # Read the summary metrics written by the run
    summary_path = f"outputs/eval/results-eval-{model.replace('/', '.')}-final_eval.txt"
    with open(summary_path) as f:
        results[model] = f.read()

print(json.dumps(results, indent=2))
# custom_prompts.yaml
prompts:
system_prompts: |
You are an expert Python programmer.
Task: {{problem}}
Requirements: {{special_requirements}}
Generate clean, efficient Python code.
# For high-volume API usage
python src/evaluate/evaluation_exp.py \
inference.use_api=true \
inference.api.rpm_limit=500 \
inference.api.max_workers=50 \
execution.num_chunks=1024
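To make the interaction between `rpm_limit` and `max_workers` concrete, the sketch below paces request starts so that no more than `rpm_limit` begin per minute while up to `max_workers` threads wait on responses concurrently. The `RpmLimiter` class, the `run_requests` function, and the stand-in `fake_model` are hypothetical; the framework's own client code is not shown in this README and may use a different strategy.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RpmLimiter:
    """Evenly space call starts so at most `rpm` begin per minute (illustrative only)."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = clock()

    def acquire(self):
        # Reserve the next free time slot under the lock, then wait outside it
        # so other threads can reserve their own slots in the meantime.
        with self.lock:
            now = self.clock()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self.sleep(delay)


def run_requests(prompts, call_model, rpm_limit, max_workers):
    """Send every prompt through `call_model`, respecting the pacing above."""
    limiter = RpmLimiter(rpm_limit)

    def task(prompt):
        limiter.acquire()
        return call_model(prompt)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, prompts))  # results keep the input order


def fake_model(prompt):
    """Stand-in for a real API call."""
    return prompt.upper()


outputs = run_requests(["a", "b", "c"], fake_model, rpm_limit=6000, max_workers=3)
print(outputs)
```

Raising `max_workers` without raising `rpm_limit` only adds idle threads, since the limiter caps how often requests can start.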
# Enable debug mode for quick testing
python src/evaluate/evaluation_exp.py debug=true
# This sets: k_code=2, k_case=2, num_chunks=4
outputs/eval/
├── MBPP/
│   ├── generations-eval-model-MBPP.json   # Raw generations
│   ├── outputs-eval-model-MBPP.json       # Full results
│   └── results-eval-model-final_eval.txt  # Summary metrics
├── LiveCodeBench/
│   └── ...
└── ...
- Code Accuracy: Proportion of tasks where generated code passes all tests
- Code Accumulate Accuracy: Proportion of individual test cases passed
- Case Accuracy: Quality of generated test cases (if applicable)
- P_01/P_00: Probability metrics for test case discrimination
- Best-of-N: Performance when selecting best solution from N attempts
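The feature list mentions Pass@k, but this README does not say how it is computed. One widely used definition is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them pass every test. The sketch below implements that formula; whether the framework uses this exact estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples drawn
    without replacement from n is correct, given that c of the n are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # too few failing samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k_code = 16 samples per task, of which 4 pass all tests:
print(pass_at_k(16, 4, 1))  # 0.25
print(round(pass_at_k(16, 4, 4), 4))
```

For k = 1 the estimator reduces to c / n, the plain fraction of passing samples.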
{
"task_id": 0,
"question": "Write a function to sum numbers",
"generated_code": ["def solution(nums):..."],
"test_bool_table": [[true, true, false], ...],
"case_bool_table": [[true, false], ...],
"test_exe_results": [["15", "10", "error"], ...],
"case_exe_results": [["5", "error"], ...]
}
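The two accuracy metrics can be derived from `test_bool_table`. The sketch below assumes that each row of the table corresponds to one generated code sample and each column to one test case; that orientation is a guess based on the example record and should be checked against the actual output files. The function name `task_metrics` is hypothetical.

```python
def task_metrics(test_bool_table):
    """Per-task accuracy figures from a table of pass/fail flags.

    Assumed layout: one row per generated code sample, one column per test.
    """
    n_samples = len(test_bool_table)
    if n_samples == 0:
        return {"code_acc": 0.0, "accumulate_acc": 0.0}
    # A sample counts as correct only if it passes every test.
    code_acc = sum(bool(row) and all(row) for row in test_bool_table) / n_samples
    # Average, over samples, of the fraction of tests each sample passes.
    accumulate_acc = sum(
        sum(row) / len(row) for row in test_bool_table if row
    ) / n_samples
    return {"code_acc": code_acc, "accumulate_acc": accumulate_acc}


table = [
    [True, True, False],  # sample 0 fails one test
    [True, True, True],   # sample 1 passes everything
]
metrics = task_metrics(table)
print(metrics["code_acc"])
print(round(metrics["accumulate_acc"], 4))
```

Averaging these per-task values over all tasks would give dataset-level figures comparable to the summary file.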
code acc (average proportion of tasks the generated code can pass): 0.425
code accumulate acc (average proportion of unit tests the generated code can pass): 0.612
estimated unit test acc: 0.387
estimated p_01: 0.823
estimated p_00: 0.156
BoN setting [4, 4]: acc: 0.512, accumulate acc: 0.687
code average response length: 287.3
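When comparing several runs, it may be convenient to turn this summary text into a dictionary. The parser below is a sketch written against the sample lines shown above; the real file may contain lines this pattern does not cover, and the function name `parse_summary` and the key format `bon_4_4` are choices made for this example.

```python
import re

# Matches lines such as: BoN setting [4, 4]: acc: 0.512, accumulate acc: 0.687
BON_LINE = re.compile(
    r"BoN setting \[(\d+),\s*(\d+)\]: acc: ([\d.]+), accumulate acc: ([\d.]+)"
)


def parse_summary(text: str) -> dict:
    """Parse 'name: value' summary lines into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = BON_LINE.match(line)
        if match:
            a, b = match.group(1), match.group(2)
            metrics[f"bon_{a}_{b}"] = {
                "acc": float(match.group(3)),
                "accumulate_acc": float(match.group(4)),
            }
            continue
        key, sep, value = line.rpartition(":")
        if not sep:
            continue  # no 'name: value' structure on this line
        key = re.sub(r"\s*\(.*\)\s*$", "", key).strip()  # drop the explanation in parentheses
        try:
            metrics[key] = float(value)
        except ValueError:
            continue  # skip lines whose value is not a number
    return metrics


sample = """\
code acc (average proportion of tasks the generated code can pass): 0.425
code accumulate acc (average proportion of unit tests the generated code can pass): 0.612
estimated p_01: 0.823
BoN setting [4, 4]: acc: 0.512, accumulate acc: 0.687
code average response length: 287.3
"""
parsed = parse_summary(sample)
print(parsed["code acc"])
print(parsed["bon_4_4"])
```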
vllm:
  gpu_memory_utilization: 0.90  # Maximize VRAM usage
  enable_prefix_caching: true   # Cache common prefixes
  enable_chunked_prefill: true  # Better memory handling
- A100 80GB: max_batch_size: 256
- A100 40GB: max_batch_size: 128
- RTX 4090: max_batch_size: 64
- RTX 3090: max_batch_size: 32
# Single GPU per worker (recommended)
gpu_groups: [[0], [1], [2], [3]]
# Tensor parallelism (for very large models)
gpu_groups: [[0,1], [2,3]]
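One way to read `gpu_groups` is that each inner list describes one worker: the worker sees only those GPUs, and a group with more than one GPU shards the model across them. The sketch below turns that reading into per-worker settings. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable and `tensor_parallel_size` is a vLLM engine argument, but how this framework actually consumes `gpu_groups` is an assumption, and the function name `worker_settings` is hypothetical.

```python
def worker_settings(gpu_groups):
    """Derive per-worker GPU visibility and tensor-parallel size from gpu_groups."""
    seen = set()
    settings = []
    for group in gpu_groups:
        if not group:
            raise ValueError("each GPU group needs at least one GPU id")
        overlap = seen.intersection(group)
        if overlap:
            raise ValueError(f"GPU ids used in more than one group: {sorted(overlap)}")
        seen.update(group)
        settings.append({
            # Restrict the worker process to its own GPUs.
            "CUDA_VISIBLE_DEVICES": ",".join(str(gpu) for gpu in group),
            # Shard the model across every GPU in the group.
            "tensor_parallel_size": len(group),
        })
    return settings


for worker in worker_settings([[0, 1], [2, 3]]):
    print(worker)
```

Under this reading, `[[0], [1], [2], [3]]` gives four independent single-GPU workers, while `[[0,1], [2,3]]` gives two workers that each split the model over two GPUs.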
execution:
  num_chunks: 512    # Increase for better parallelization
  exe_verbose: true  # Monitor execution progress
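The effect of `num_chunks` can be pictured as splitting the list of execution jobs into that many near-equal pieces, each of which can be handed to a separate worker. The sketch below shows one such split; the function name `split_into_chunks` and the balancing rule are assumptions, since the framework's chunking code is not shown in this README.

```python
def split_into_chunks(items, num_chunks):
    """Split `items` into at most `num_chunks` contiguous chunks of near-equal size."""
    if not items:
        return []
    num_chunks = max(1, min(num_chunks, len(items)))  # never create empty chunks
    base, extra = divmod(len(items), num_chunks)
    chunks = []
    start = 0
    for index in range(num_chunks):
        size = base + (1 if index < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks


jobs = list(range(10))
print(split_into_chunks(jobs, 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

More chunks mean finer-grained scheduling, at the cost of more per-chunk overhead.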
Model | MBPP Pass@1 | LiveCodeBench Pass@1 | Throughput (samples/min) |
---|---|---|---|
CodeLlama-7B | 42.3% | 38.7% | 120 |
Qwen2.5-Coder-7B | 51.2% | 45.3% | 115 |
DeepSeek-Coder-6.7B | 48.5% | 43.2% | 125 |
GPT-4 (API) | 67.8% | 62.1% | 60 |
# Reduce batch size
python src/evaluate/evaluation_exp.py \
inference.vllm.max_batch_size=32 \
inference.vllm.gpu_memory_utilization=0.8
# Increase execution timeout
dataset:
  max_test: 16  # Increase test limit
execution:
  num_chunks: 1024  # More parallel chunks
inference:
  api:
    rpm_limit: 50    # Reduce requests per minute
    max_workers: 10  # Fewer concurrent workers
# For models requiring trust_remote_code
python src/evaluate/evaluation_exp.py \
inference.vllm.trust_remote_code=true
# If processes don't terminate cleanly
pkill -f "evaluation_exp.py"
nvidia-smi # Check GPU usage
#!/bin/bash
#SBATCH --job-name=code-eval
#SBATCH --gres=gpu:4
#SBATCH --time=48:00:00
module load cuda/11.8
source .venv/bin/activate
python src/evaluate/evaluation_exp.py dataset.name="CodeContests"
from src.evaluate.evaluator import CodeEvaluator
from src.evaluate.inference_engines import VLLMInferenceEngine
# Create custom evaluator
engine = VLLMInferenceEngine(cfg)
evaluator = CodeEvaluator(cfg, dataset, engine, ...)
evaluator.evaluate()
We welcome contributions! Areas of interest:
- Support for more programming languages
- Additional evaluation metrics
- New dataset integrations
- Performance optimizations
- Documentation improvements
# Install development dependencies
pip install -e .
pip install pytest black isort flake8
# Run tests
pytest tests/
# Format code
black src/
isort src/
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by the CURE framework
- Built on top of vLLM for efficient inference
- Uses Hydra for configuration management
- Dataset sources from HuggingFace and various coding competition platforms
For questions and support, please open an issue on GitHub or contact gzjz07@outlook.com.
Star this repository if you find it helpful!