Dataset Integration Guide

Welcome! This guide explains the best way to share your dataset with us. Our goal is to quickly understand your data and establish a strong performance baseline. This ensures a smooth and effective integration with our machine learning models, like TabPFN.

To do this, we ask that you provide a simple notebook that prepares your data and runs a baseline model, such as XGBoost. This helps us align on metrics and validate the data format from the start.

To ensure your datasets work flawlessly with our pipeline, please follow these requirements.

Note on Requirements: Throughout this guide, certain items are marked as (Essential) or (Optional).

  • (Essential) — These requirements are critical for us to process, evaluate, and integrate your dataset effectively.
  • (Optional) — These items are not strictly required, but they are highly encouraged. Providing them helps us better understand your data, perform more targeted benchmarking, and produce richer insights.

Option 1: The Baseline Notebook

This is the standard approach for most projects. Providing a single Jupyter or Colab notebook ensures a reproducible, low-friction starting point that we can immediately build upon.

Notebook Requirements (Essential)

Your notebook should cover the following (a minimal end-to-end sketch appears after the list):

  1. Data Preparation and Splitting

    • Define target_column and feature_columns at the start.
    • Load data, preprocess, and split into training/testing sets.
    • For time-series data, ensure chronological splitting to avoid data leakage.
  2. Baseline Model and Evaluation

    • Train a standard XGBoost model (or another agreed-upon baseline).
    • Clearly specify the performance metrics that matter most for your use case. The following are common examples — please confirm or adjust based on your project’s priorities:
      • Classification: AUROC, F1-Score, LogLoss, Confusion Matrix
      • Regression: RMSE, MAE, R²
    • Report results on the test set using the agreed metrics.
  3. Key Visualizations (Optional)

    • Feature importance chart
    • ROC curve (classification)
    • Residuals plot (regression)
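For concreteness, here is a minimal baseline sketch in Python. The file path, column names, and the binary-classification setup are placeholders, not part of our pipeline; adapt them to your dataset and the agreed metrics.

# Minimal baseline sketch. File path, column names, and task type are
# placeholders; adapt them to your dataset.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss

target_column = "target"                      # placeholder
feature_columns = ["feature_a", "feature_b"]  # placeholder

df = pd.read_csv("dataset.csv")               # placeholder path
X, y = df[feature_columns], df[target_column]

# For time-series data, replace this with a chronological split
# (train on earlier rows, test on later rows) to avoid leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = xgb.XGBClassifier(random_state=42)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print("AUROC:  ", roc_auc_score(y_test, proba))
print("LogLoss:", log_loss(y_test, proba))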

Data Privacy (Essential)

Before sharing data, remove or obfuscate all PII and sensitive information.

  • Remove Direct Identifiers: Names, emails, addresses, phone numbers, etc.
  • Anonymize Quasi-Identifiers: Features such as ZIP code, age, or job title that could be combined to re-identify individuals.
    • Techniques: group rare categories; add statistical noise to sensitive numerical values (see the sketch below).
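As an illustration, the snippet below sketches these techniques with pandas. The column names and the rarity threshold are assumptions for the example only; the right techniques depend on your data and privacy requirements.

# Illustrative anonymization sketch. Column names and the rarity
# threshold are assumptions; adjust to your data and risk model.
import numpy as np
import pandas as pd

df = pd.read_csv("dataset.csv")  # placeholder path

# Remove direct identifiers outright.
df = df.drop(columns=["name", "email", "phone"], errors="ignore")

# Group rare categories of a quasi-identifier into an "other" bucket.
counts = df["job_title"].value_counts()
rare = counts[counts < 20].index  # threshold of 20 is an assumption
df["job_title"] = df["job_title"].where(~df["job_title"].isin(rare), "other")

# Add small statistical noise to a sensitive numeric column.
rng = np.random.default_rng(seed=0)
noise = rng.normal(scale=0.01 * df["annual_income"].std(), size=len(df))
df["annual_income"] = df["annual_income"] + noise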

Submission

  • Package the final notebook and associated data files into a .zip archive.
  • Provide a secure download link.

Option 2: Multi-Dataset Pipeline Integration (Advanced)

For specialized use cases — such as large-scale benchmarking across multiple datasets or formal fine-tuning programs — we support direct integration into our multi-dataset testing and fine-tuning pipeline.

This approach leverages the Multi_Dataset_Integration module included in the repository, which contains the core structure and example implementations to help you get started quickly.


Contributor Responsibilities

  • Implement Data Loader: Create a DataModule class that loads, processes, and serves datasets according to our pipeline specifications (a hypothetical skeleton follows this list).
  • Testing: Validate your implementation with the provided scripts (minimal_example.py).
  • Automated Checks: Ensure all tests pass using pytest.
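The skeleton below is a hypothetical sketch of what such a class might look like. Only the DataModule name comes from this guide; the method names, signatures, and on-disk format are assumptions. Consult the Multi_Dataset_Integration module for the actual interface and examples.

# Hypothetical DataModule skeleton. Method names, signatures, and the
# on-disk format are assumptions; see Multi_Dataset_Integration for the
# real interface.
from typing import Dict, Tuple

import numpy as np


class DataModule:
    """Loads, processes, and serves one dataset as (X, y) NumPy pairs."""

    def __init__(self, path: str):
        self.path = path

    def load(self) -> Dict[str, Tuple[np.ndarray, np.ndarray]]:
        # Assumes a .npy file holding a dict such as
        # {"train": (X_train, y_train), "test": (X_test, y_test)}.
        raw = np.load(self.path, allow_pickle=True).item()
        # Serve fully preprocessed, numeric arrays per split.
        return {
            split: (np.asarray(X, dtype=float), np.asarray(y))
            for split, (X, y) in raw.items()
        }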

Detailed Requirements for Pipeline Integration

Data Structure (Essential)

  • Format: Pairs of NumPy arrays:
    • X → Features (n_samples, n_features)
    • y → Target (n_samples,)
  • Data Types: Numeric (float, int)
  • Organization: Separate (X, y) pairs for training, testing, and optionally validation.

Data Quality (Essential)

  • Fully preprocessed (cleaned, imputed, encoded) — ready for training.
  • No leakage between training, validation, and test sets.
  • Shape constraints:
    • Test/validation ≤ 10,000 samples × 500 features
    • Training ≤ 5,000 samples × 500 features
    • Notify us if your data exceeds these limits (a small validation sketch follows this list).
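A small sanity-check sketch for these constraints, assuming the split names and limits stated above:

# Shape and type checks mirroring the limits in this guide.
import numpy as np

MAX_SAMPLES = {"train": 5_000, "test": 10_000, "validation": 10_000}
MAX_FEATURES = 500

def check_split(name: str, X: np.ndarray, y: np.ndarray) -> None:
    assert X.ndim == 2 and y.ndim == 1, "X must be 2-D and y 1-D"
    assert X.shape[0] == y.shape[0], "X and y must have matching rows"
    assert np.issubdtype(X.dtype, np.number), "features must be numeric"
    assert X.shape[0] <= MAX_SAMPLES[name], f"{name}: too many samples"
    assert X.shape[1] <= MAX_FEATURES, f"{name}: too many features"

# Example: check_split("train", X_train, y_train)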

Evaluation (Essential)

  • Primary Metric: Define the main metric for success (e.g., AUROC, F1-Score).
  • Aggregation Strategy: Specify how results are combined across datasets (e.g., mean AUROC, weighted average).
  • Performance Target: State the improvement goal (e.g., "+5% AUROC over baseline").

Metadata Submission

Include a YAML metadata file with your dataset. This consolidates critical information about the dataset and evaluation protocol.

Example:

# General Information
dataset_name: "Credit Risk Analysis"
description: "Predicting loan default risk based on historical borrower information."
time_series: false

# Evaluation Protocol
primary_metric: "AUROC"
aggregation_strategy: "mean" # Aggregate test set performance by taking the mean AUROC.
target_performance: "+3 pp AUROC over baseline"
key_datasets: # Optional: list high-priority datasets for evaluation.
  - "credit_risk_v1.npy"
  - "mortgage_loans_q3.npy"

# Data Specification
data_format: "numpy"
preprocessing_steps: "StandardScaler for numerical features, OneHotEncoder for categorical."
feature_names: # Optional: only if features are consistent across datasets.
  - "age"
  - "annual_income"
  - "credit_score"
  - "employment_length_years"
