TimeLovercc/code-evaluator

🚀 Code Evaluator

Python 3.8+ | License: MIT

A comprehensive, production-ready framework for evaluating code generation models on programming benchmarks. Designed for researchers and practitioners working with LLMs for code generation tasks.

Inspired by CURE

✨ Features

Core Capabilities

  • 🔧 Multiple Inference Backends: Seamlessly switch between vLLM (local) and API-based models (OpenAI, Anthropic, etc.)
  • ⚡ High-Performance Execution: Distributed GPU inference with optimized batching and memory management
  • 🎯 Comprehensive Metrics: Pass@k, execution success rates, Best-of-N sampling, and custom metrics
  • 🔒 Safe Code Execution: Sandboxed execution with timeout protection and resource limits
  • 📊 Rich Analytics: Detailed performance analysis with multiple evaluation modes
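
As a point of reference for the Pass@k metric listed above, here is a minimal sketch of the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The function name `pass_at_k` is hypothetical, and whether this framework computes Pass@k in exactly this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Hypothetical helper for illustration; not taken from this repository.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the samples contains no correct one (it is 0 when k > n - c).
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 16 samples per task (the `k_code` default shown under Configuration) and 4 correct samples, `pass_at_k(16, 4, 1)` is the plain fraction of correct samples.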

Technical Features

  • Multi-GPU Support: Efficient parallel inference across multiple GPUs with configurable worker groups
  • Adaptive Batching: Dynamic batch sizing for optimal throughput (up to 256 concurrent sequences)
  • Memory Optimization: KV-cache management, prefix caching, and chunked prefill for large models
  • Flexible Prompting: Customizable prompt templates with Jinja2 templating
  • Robust Error Handling: Graceful failure recovery with detailed error logging

Supported Datasets

  • MBPP (Mostly Basic Python Problems)
  • LiveCodeBench
  • CodeContests
  • CodeForces
  • LiveBench
  • Custom datasets (with proper formatting)

๐Ÿ— Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    Configuration Layer                   โ”‚
โ”‚                   (Hydra + OmegaConf)                   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    Evaluation Pipeline                   โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚   Dataset   โ”‚โ†’ โ”‚  Generation  โ”‚โ†’ โ”‚   Execution  โ”‚  โ”‚
โ”‚  โ”‚   Loader    โ”‚  โ”‚    Engine    โ”‚  โ”‚   Sandbox    โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                            โ†“                            โ”‚
โ”‚                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                     โ”‚
โ”‚                    โ”‚   Metrics    โ”‚                     โ”‚
โ”‚                    โ”‚  Calculator  โ”‚                     โ”‚
โ”‚                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                     Inference Backends                   โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚       vLLM          โ”‚  โ”‚      API Clients        โ”‚  โ”‚
โ”‚  โ”‚  (Local Models)     โ”‚  โ”‚  (OpenAI, Anthropic)    โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

📦 Installation

Prerequisites

  • Python 3.8 or higher
  • CUDA 11.8+ (for GPU acceleration)
  • At least 16GB RAM (32GB+ recommended for large models)
  • NVIDIA GPU with 24GB+ VRAM (for local model inference)

Step 1: Clone the Repository

git clone https://github.com/TimeLovercc/code-evaluator.git
cd code-evaluator

Step 2: Create Virtual Environment

uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Step 3: Install Dependencies

uv pip install -r requirements.txt

Step 4: Download Datasets

cd data
# Download evaluation datasets
python download_data.py --dataset MBPP
python download_data.py --dataset LiveCodeBench
python download_data.py --dataset CodeContests
python download_data.py --dataset CodeForces
python download_data.py --dataset LiveBench

# Optional: Download training data
python download_data.py --dataset CodeContests_train
cd ..

🚀 Quick Start

Basic Evaluation

# Run evaluation with default settings (MBPP dataset)
bash scripts/eval.sh

Evaluate All Datasets

# Run comprehensive evaluation across all datasets
bash scripts/all_eval.sh

Custom Model Evaluation

# Using vLLM with a specific model
python src/evaluate/evaluation_exp.py \
  inference.vllm.pretrained_model="codellama/CodeLlama-7b-Python-hf" \
  dataset.name="MBPP"

# Using API-based model
python src/evaluate/evaluation_exp.py \
  inference.use_api=true \
  inference.api.model_name="gpt-4" \
  inference.api.key="YOUR_API_KEY" \
  dataset.name="LiveCodeBench"

📊 Datasets

Supported Formats

The framework uses a standardized JSON format in which every test is a pair of stdin input and expected stdout output (Stdio):

{
  "task_id": 0,
  "question": "Problem description here",
  "test_input": ["5\n1 2 3 4 5\n"],
  "test_output": ["15\n"],
  "example_input": ["3\n1 2 3\n"],
  "example_output": ["6\n"],
  "test_time_limit": 1
}
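
To make the Stdio convention concrete, the sketch below runs one candidate program against one input/output pair from a task record. It is an illustration only: `run_stdio_case` is a hypothetical name, the comparison ignores leading and trailing whitespace (an assumed policy), `test_time_limit` is assumed to be in seconds, and a plain subprocess is not a sandbox, so it does not reproduce the framework's isolation or resource limits.

```python
import subprocess
import sys


def run_stdio_case(code: str, stdin_text: str, expected: str, time_limit: float) -> bool:
    """Run `code` as a Python script and compare its stdout with `expected`.

    Hypothetical helper; returns False on timeout, crash, or wrong output.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,       # fed to the program on stdin
            capture_output=True,
            text=True,
            timeout=time_limit,     # seconds (assumed unit of test_time_limit)
        )
    except subprocess.TimeoutExpired:
        return False
    if proc.returncode != 0:
        return False
    # Assumed policy: surrounding whitespace does not matter.
    return proc.stdout.strip() == expected.strip()
```

For the record above, a solution such as `n = int(input()); print(sum(map(int, input().split())))` would be checked with `run_stdio_case(solution, "5\n1 2 3 4 5\n", "15\n", 1)`.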

Dataset Statistics

| Dataset | # Problems | Difficulty | Format |
|---|---|---|---|
| MBPP | 974 | Basic | Stdio |
| LiveCodeBench | 400+ | Mixed | Stdio/Functional* |
| CodeContests | 13,000+ | Hard | Stdio |
| CodeForces | 10,000+ | Mixed | Stdio |
| LiveBench | 200+ | Mixed | Stdio/Functional* |

*Automatically converted to Stdio format using data/transformation.ipynb

Custom Dataset Integration

To add your own dataset:

  1. Format your data according to the schema above
  2. Place the JSON file in data/eval_data/
  3. Set dataset.name in src/evaluate/config.yaml to your dataset name
  4. Run evaluation as usual
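
Before step 4, it can help to check a custom record against the schema. The following is a minimal sketch under stated assumptions: the field types are inferred from the single example record above (for instance, that `task_id` is an integer), and `validate_task` is a hypothetical helper, not part of the framework.

```python
# Field names come from the example record; the types are inferred from it.
REQUIRED_FIELDS = {
    "task_id": int,
    "question": str,
    "test_input": list,
    "test_output": list,
    "example_input": list,
    "example_output": list,
    "test_time_limit": (int, float),
}


def validate_task(task: dict) -> list:
    """Return a list of problems found in one task record (empty if none)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in task:
            problems.append(f"missing field: {name}")
        elif not isinstance(task[name], expected_type):
            problems.append(f"wrong type for {name}: {type(task[name]).__name__}")
    if problems:
        return problems
    # Inputs and outputs are paired by position, so lengths must match.
    for prefix in ("test", "example"):
        inputs, outputs = task[f"{prefix}_input"], task[f"{prefix}_output"]
        if len(inputs) != len(outputs):
            problems.append(f"{prefix}_input and {prefix}_output differ in length")
        if not all(isinstance(s, str) for s in inputs + outputs):
            problems.append(f"{prefix} cases must be strings")
    return problems
```

Running it over every record of a new JSON file and printing the non-empty results gives a quick pre-flight check.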

Format Conversion

For datasets with functional format (e.g., assert-based tests), use the provided conversion tool:

# Open data/transformation.ipynb
# Follow the notebook to convert functional → Stdio format

โš™๏ธ Configuration

The framework uses Hydra for configuration management. Main configuration file: src/evaluate/config.yaml

Key Configuration Options

Model Settings

inference:
  use_api: false  # true for API models, false for vLLM
  
  # vLLM settings
  vllm:
    pretrained_model: "Qwen/Qwen3-4B"
    max_model_len: 16384
    max_generation_token: 4096
    temp: 0.8
    gpu_groups: [[0], [1], [2], [3]]  # GPU allocation
    max_batch_size: 256
  
  # API settings
  api:
    model_name: "gpt-4o-mini"
    key: "YOUR_API_KEY"
    temperature: 0.8
    max_workers: 20
    rpm_limit: 100

Generation Parameters

generation:
  k_code: 16    # Number of code samples per task
  k_case: 16    # Number of test case samples per task
  no_example: true  # If true, leave the example inputs/outputs out of prompts

Evaluation Modes

evaluation:
  single_eval: true  # One-shot coding accuracy only
  scale_tuple_list: [[4, 4], [16, 16]]  # Best-of-N configurations
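
One way to read a Best-of-N setting such as [4, 4] is "take the first 4 code samples and the first 4 generated test cases, then keep the code sample that passes the most of those cases". The sketch below implements that reading. Both the interpretation of the tuple and the orientation of the table (`case_bool_table[i][j]` is true when code sample i passes generated case j) are assumptions, and `best_of_n` is a hypothetical name.

```python
from typing import List


def best_of_n(case_bool_table: List[List[bool]], n_code: int, n_case: int) -> int:
    """Index of the code sample that passes the most generated test cases.

    Assumes case_bool_table[i][j] is True when sample i passes generated
    case j; only the first n_code samples and n_case cases are considered.
    """
    if not 1 <= n_code <= len(case_bool_table):
        raise ValueError("n_code must be between 1 and the number of samples")
    scores = [sum(row[:n_case]) for row in case_bool_table[:n_code]]
    # max() keeps the first maximum, so ties go to the earliest sample.
    return max(range(n_code), key=scores.__getitem__)
```

The selected sample would then be scored against the ground-truth tests to obtain the Best-of-N accuracy.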

Environment Variables

export PYTHONPATH=.
export NCCL_P2P_DISABLE=1  # For multi-GPU without NVLink
export VLLM_USE_V1=0
export OMP_NUM_THREADS=8

💻 Usage Examples

Example 1: Benchmarking Multiple Models

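# benchmark_models.py
import json
import subprocess
import sys
from pathlib import Path

MODELS = [
    "codellama/CodeLlama-7b-Python-hf",
    "Qwen/Qwen2.5-Coder-7B",
    "deepseek-ai/deepseek-coder-6.7b-base",
]

results = {}
for model in MODELS:
    # Pass the arguments as a list so the model name never goes through a shell
    subprocess.run(
        [
            sys.executable,
            "src/evaluate/evaluation_exp.py",
            f"inference.vllm.pretrained_model={model}",
        ],
        check=True,  # stop on a failed run instead of reading stale results
    )
    # Read the summary metrics written by the run; if the file is missing,
    # compare this path with the layout shown under Output Format
    summary = Path(f"outputs/eval/results-eval-{model.replace('/', '.')}-final_eval.txt")
    results[model] = summary.read_text()

print(json.dumps(results, indent=2))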

Example 2: Custom Prompt Templates

# custom_prompts.yaml
prompts:
  system_prompts: |
    You are an expert Python programmer. 
    Task: {{problem}}
    Requirements: {{special_requirements}}
    Generate clean, efficient Python code.

Example 3: API Rate Limiting

# For high-volume API usage
python src/evaluate/evaluation_exp.py \
  inference.use_api=true \
  inference.api.rpm_limit=500 \
  inference.api.max_workers=50 \
  execution.num_chunks=1024
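
To illustrate how an `rpm_limit` can coexist with many `max_workers`, here is a minimal sketch of a thread-safe limiter that spaces requests at least 60 / rpm_limit seconds apart. The class name `RateLimiter` and its design are assumptions for illustration; the framework's own throttling may work differently.

```python
import threading
import time


class RateLimiter:
    """Hand out request slots spaced 60 / rpm_limit seconds apart.

    Hypothetical sketch; `clock` and `sleep` are injectable for testing.
    """

    def __init__(self, rpm_limit: float, clock=time.monotonic, sleep=time.sleep):
        if rpm_limit <= 0:
            raise ValueError("rpm_limit must be positive")
        self._interval = 60.0 / rpm_limit
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def acquire(self) -> float:
        """Block until this caller's slot arrives; return the seconds waited."""
        with self._lock:
            # Reserve the next free slot while holding the lock ...
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        # ... and sleep outside it so other threads can reserve theirs.
        wait = slot - now
        if wait > 0:
            self._sleep(wait)
        return wait
```

With a `concurrent.futures.ThreadPoolExecutor(max_workers=50)`, each worker would call `limiter.acquire()` just before sending its request, so raising `max_workers` increases concurrency without exceeding the per-minute budget.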

Example 4: Debug Mode

# Enable debug mode for quick testing
python src/evaluate/evaluation_exp.py debug=true
# This sets: k_code=2, k_case=2, num_chunks=4

๐Ÿ“ Output Format

Directory Structure

outputs/eval/
โ”œโ”€โ”€ MBPP/
โ”‚   โ”œโ”€โ”€ generations-eval-model-MBPP.json    # Raw generations
โ”‚   โ”œโ”€โ”€ outputs-eval-model-MBPP.json        # Full results
โ”‚   โ””โ”€โ”€ results-eval-model-final_eval.txt   # Summary metrics
โ”œโ”€โ”€ LiveCodeBench/
โ”‚   โ””โ”€โ”€ ...
โ””โ”€โ”€ ...

Metrics Explained

  • Code Accuracy: Proportion of tasks where generated code passes all tests
  • Code Accumulate Accuracy: Proportion of individual test cases passed
  • Case Accuracy: Quality of generated test cases (if applicable)
  • P_01/P_00: Probability metrics for test case discrimination
  • Best-of-N: Performance when selecting best solution from N attempts
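
The difference between Code Accuracy and Code Accumulate Accuracy can be shown on a single task's `test_bool_table`. The sketch below assumes that `test_bool_table[i][j]` is true when code sample i passes ground-truth test j, and that the dataset-level figures are averages of these per-task values; `code_metrics` is a hypothetical helper.

```python
from typing import Dict, List


def code_metrics(test_bool_table: List[List[bool]]) -> Dict[str, float]:
    """Per-task code accuracy and accumulate accuracy.

    Assumes test_bool_table[i][j] is True when sample i passes test j.
    """
    total = sum(len(row) for row in test_bool_table)
    if total == 0:
        raise ValueError("test_bool_table contains no test results")
    # A sample counts toward code accuracy only if it passes every test.
    fully_correct = sum(all(row) for row in test_bool_table)
    # Accumulate accuracy gives partial credit for each passed test.
    passed = sum(sum(row) for row in test_bool_table)
    return {
        "code_acc": fully_correct / len(test_bool_table),
        "code_accumulate_acc": passed / total,
    }
```

For two samples with results `[true, true, false]` and `[true, true, true]`, one of two samples is fully correct while five of six individual tests pass, which is why accumulate accuracy is never lower than code accuracy.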

Sample Output

{
  "task_id": 0,
  "question": "Write a function to sum numbers",
  "generated_code": ["def solution(nums):..."],
  "test_bool_table": [[true, true, false], ...],
  "case_bool_table": [[true, false], ...],
  "test_exe_results": [["15", "10", "error"], ...],
  "case_exe_results": [["5", "error"], ...]
}

Summary Statistics

code acc (average proportion of tasks the generated code can pass): 0.425
code accumulate acc (average proportion of unit tests the generated code can pass): 0.612
estimated unit test acc: 0.387
estimated p_01: 0.823
estimated p_00: 0.156
BoN setting [4, 4]: acc: 0.512, accumulate acc: 0.687
code average response length: 287.3
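
When comparing several runs, it is convenient to turn this summary text into a dictionary. The parser below is a sketch written against the sample lines shown above only; the real file may contain other lines, which it silently skips. `parse_summary` and the `bon_<k_code>_<k_case>` key naming are hypothetical.

```python
import re

_NUMBER = r"[-+]?\d+(?:\.\d+)?"
_BON = re.compile(
    rf"BoN setting \[(\d+), (\d+)\]: acc: ({_NUMBER}), accumulate acc: ({_NUMBER})$"
)
# "name (optional explanation): value"
_SIMPLE = re.compile(rf"(.+?)(?: \(.*\))?: ({_NUMBER})$")


def parse_summary(text: str) -> dict:
    """Parse summary lines like the sample above into a dict of floats."""
    metrics = {}
    for raw in text.splitlines():
        line = raw.strip()
        bon = _BON.match(line)
        if bon:
            k_code, k_case, acc, accumulate = bon.groups()
            metrics[f"bon_{k_code}_{k_case}"] = {
                "acc": float(acc),
                "accumulate_acc": float(accumulate),
            }
            continue
        simple = _SIMPLE.match(line)
        if simple:
            # The parenthesized explanation is dropped from the key.
            metrics[simple.group(1)] = float(simple.group(2))
    return metrics
```

Applied to the sample, the result contains keys such as `code acc`, `estimated p_01`, and `bon_4_4`.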

📈 Performance

Optimization Tips

1. GPU Memory Management

vllm:
  gpu_memory_utilization: 0.90  # Maximize VRAM usage
  enable_prefix_caching: true    # Cache common prefixes
  enable_chunked_prefill: true   # Better memory handling

2. Batch Size Tuning

  • A100 80GB: max_batch_size: 256
  • A100 40GB: max_batch_size: 128
  • RTX 4090: max_batch_size: 64
  • RTX 3090: max_batch_size: 32

3. Multi-GPU Scaling

# Single GPU per worker (recommended)
gpu_groups: [[0], [1], [2], [3]]

# Tensor parallelism (for very large models)
gpu_groups: [[0,1], [2,3]]
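
The two layouts differ in how many GPUs each worker sees. As a rough sketch, each group could be translated into a `CUDA_VISIBLE_DEVICES` value plus a tensor-parallel size equal to the group length. How the framework actually launches its workers is not shown here, so this mapping and the name `worker_settings` are assumptions.

```python
from typing import List, Tuple


def worker_settings(gpu_groups: List[List[int]]) -> List[Tuple[str, int]]:
    """Map each GPU group to (CUDA_VISIBLE_DEVICES value, tensor-parallel size).

    Hypothetical helper; rejects empty groups and GPUs used by two groups.
    """
    seen = set()
    settings = []
    for group in gpu_groups:
        if not group:
            raise ValueError("empty GPU group")
        overlap = seen.intersection(group)
        if overlap:
            raise ValueError(f"GPU(s) {sorted(overlap)} assigned to more than one group")
        seen.update(group)
        # One worker per group; the model is sharded across the group's GPUs.
        settings.append((",".join(str(gpu) for gpu in group), len(group)))
    return settings
```

Under this reading, `[[0], [1], [2], [3]]` gives four independent single-GPU workers, while `[[0,1], [2,3]]` gives two workers that each shard the model over two GPUs.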

4. Execution Optimization

execution:
  num_chunks: 512  # Increase for better parallelization
  exe_verbose: true  # Monitor execution progress

Benchmark Results

| Model | MBPP Pass@1 | LiveCodeBench Pass@1 | Throughput (samples/min) |
|---|---|---|---|
| CodeLlama-7B | 42.3% | 38.7% | 120 |
| Qwen2.5-Coder-7B | 51.2% | 45.3% | 115 |
| DeepSeek-Coder-6.7B | 48.5% | 43.2% | 125 |
| GPT-4 (API) | 67.8% | 62.1% | 60 |

🔧 Troubleshooting

Common Issues

CUDA Out of Memory

# Reduce batch size
python src/evaluate/evaluation_exp.py \
  inference.vllm.max_batch_size=32 \
  inference.vllm.gpu_memory_utilization=0.8

Timeout Errors

# Increase execution timeout
dataset:
  max_test: 16  # Increase test limit
execution:
  num_chunks: 1024  # More parallel chunks

API Rate Limits

inference:
  api:
    rpm_limit: 50  # Reduce requests per minute
    max_workers: 10  # Fewer concurrent workers

Model Loading Issues

# For models requiring trust_remote_code
python src/evaluate/evaluation_exp.py \
  inference.vllm.trust_remote_code=true

Process Cleanup Issues

# If processes don't terminate cleanly
pkill -f "evaluation_exp.py"
nvidia-smi  # Check GPU usage

๐Ÿƒโ€โ™‚๏ธ Advanced Usage

Running with SLURM

#!/bin/bash
#SBATCH --job-name=code-eval
#SBATCH --gres=gpu:4
#SBATCH --time=48:00:00

module load cuda/11.8
source .venv/bin/activate
python src/evaluate/evaluation_exp.py dataset.name="CodeContests"

Custom Evaluation Pipeline

from src.evaluate.evaluator import CodeEvaluator
from src.evaluate.inference_engines import VLLMInferenceEngine

# Create custom evaluator
engine = VLLMInferenceEngine(cfg)
evaluator = CodeEvaluator(cfg, dataset, engine, ...)
evaluator.evaluate()

๐Ÿค Contributing

We welcome contributions! Areas of interest:

  • Support for more programming languages
  • Additional evaluation metrics
  • New dataset integrations
  • Performance optimizations
  • Documentation improvements

Development Setup

# Install development dependencies
pip install -e .
pip install pytest black isort flake8

# Run tests
pytest tests/

# Format code
black src/
isort src/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Inspired by the CURE framework
  • Built on top of vLLM for efficient inference
  • Uses Hydra for configuration management
  • Dataset sources from HuggingFace and various coding competition platforms

📮 Contact

For questions and support, please open an issue on GitHub or contact gzjz07@outlook.com.


Star ⭐ this repository if you find it helpful!
