Dataset Integration Guide

Welcome! This guide explains the best way to share your dataset with us. Our goal is to quickly understand your data and establish a strong performance baseline. This ensures a smooth and effective integration with our machine learning models, like TabPFN.

To do this, we ask that you provide a simple notebook that prepares your data and runs a baseline model, such as XGBoost. This helps us align on metrics and validate the data format from the start.

To ensure your datasets work flawlessly with our pipeline, please follow these requirements.

Note on Requirements: Throughout this guide, certain items are marked as (Essential) or (Optional).

  • (Essential) — These requirements are critical for us to process, evaluate, and integrate your dataset effectively.
  • (Optional) — These items are not strictly required, but they are highly encouraged. Providing them helps us better understand your data, perform more targeted benchmarking, and produce richer insights.

Option 1: The Baseline Notebook

This is the standard approach for most projects. Providing a single Jupyter or Colab notebook ensures a reproducible, low-friction starting point that we can immediately build upon.

Notebook Requirements (Essential)

Your notebook should cover the following (a minimal end-to-end sketch appears after the list):

  1. Data Preparation and Splitting

    • Define target_column and feature_columns at the start.
    • Load data, preprocess, and split into training/testing sets.
    • For time-series data, ensure chronological splitting to avoid data leakage.
  2. Baseline Model and Evaluation

    • Train a standard XGBoost model (or another agreed-upon baseline).
    • Clearly specify the performance metrics that matter most for your use case. The following are common examples — please confirm or adjust based on your project’s priorities:
      • Classification: AUROC, F1-Score, LogLoss, Confusion Matrix
      • Regression: RMSE, MAE, R²
    • Report results on the test set using the agreed metrics.
  3. Key Visualizations (Optional)

    • Feature importance chart
    • ROC curve (classification)
    • Residuals plot (regression)
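For concreteness, here is a minimal baseline sketch in Python. The file path, column names, and the binary-classification setup are placeholders, not part of our pipeline; adapt them to your dataset and the agreed metrics.

# Minimal baseline sketch. File path, column names, and task type are
# placeholders; adapt them to your dataset.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss

target_column = "target"                      # placeholder
feature_columns = ["feature_a", "feature_b"]  # placeholder

df = pd.read_csv("dataset.csv")               # placeholder path
X, y = df[feature_columns], df[target_column]

# For time-series data, replace this with a chronological split
# (train on earlier rows, test on later rows) to avoid leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = xgb.XGBClassifier(random_state=42)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print("AUROC:  ", roc_auc_score(y_test, proba))
print("LogLoss:", log_loss(y_test, proba))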

Data Privacy (Essential)

Before sharing data, remove or obfuscate all PII and sensitive information.

  • Remove Direct Identifiers: Names, emails, addresses, phone numbers, etc.
  • Anonymize Quasi-Identifiers: Features such as ZIP code, age, or job title that could be combined to re-identify individuals.
    • Techniques: group rare categories; add statistical noise to sensitive numerical values (see the sketch below).
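As an illustration, the snippet below sketches these techniques with pandas. The column names and the rarity threshold are assumptions for the example only; the right techniques depend on your data and privacy requirements.

# Illustrative anonymization sketch. Column names and the rarity
# threshold are assumptions; adjust to your data and risk model.
import numpy as np
import pandas as pd

df = pd.read_csv("dataset.csv")  # placeholder path

# Remove direct identifiers outright.
df = df.drop(columns=["name", "email", "phone"], errors="ignore")

# Group rare categories of a quasi-identifier into an "other" bucket.
counts = df["job_title"].value_counts()
rare = counts[counts < 20].index  # threshold of 20 is an assumption
df["job_title"] = df["job_title"].where(~df["job_title"].isin(rare), "other")

# Add small statistical noise to a sensitive numeric column.
rng = np.random.default_rng(seed=0)
noise = rng.normal(scale=0.01 * df["annual_income"].std(), size=len(df))
df["annual_income"] = df["annual_income"] + noise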

Submission

  • Package the final notebook and associated data files into a .zip archive.
  • Provide a secure download link.

Option 2: Multi-Dataset Pipeline Integration (Advanced)

For specialized use cases — such as large-scale benchmarking across multiple datasets or formal fine-tuning programs — we support direct integration into our multi-dataset testing and fine-tuning pipeline.

This approach leverages the Multi_Dataset_Integration module included in the repository, which contains the core structure and example implementations to help you get started quickly.


Contributor Responsibilities

  • Implement Data Loader: Create a DataModule class that loads, processes, and serves datasets according to our pipeline specifications (a hypothetical skeleton follows this list).
  • Testing: Validate your implementation with the provided scripts (minimal_example.py).
  • Automated Checks: Ensure all tests pass using pytest.
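The skeleton below is a hypothetical sketch of what such a class might look like. Only the DataModule name comes from this guide; the method names, signatures, and on-disk format are assumptions. Consult the Multi_Dataset_Integration module for the actual interface and examples.

# Hypothetical DataModule skeleton. Method names, signatures, and the
# on-disk format are assumptions; see Multi_Dataset_Integration for the
# real interface.
from typing import Dict, Tuple

import numpy as np


class DataModule:
    """Loads, processes, and serves one dataset as (X, y) NumPy pairs."""

    def __init__(self, path: str):
        self.path = path

    def load(self) -> Dict[str, Tuple[np.ndarray, np.ndarray]]:
        # Assumes a .npy file holding a dict such as
        # {"train": (X_train, y_train), "test": (X_test, y_test)}.
        raw = np.load(self.path, allow_pickle=True).item()
        # Serve fully preprocessed, numeric arrays per split.
        return {
            split: (np.asarray(X, dtype=float), np.asarray(y))
            for split, (X, y) in raw.items()
        }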

Detailed Requirements for Pipeline Integration

Data Structure (Essential)

  • Format: Pairs of NumPy arrays:
    • X → Features (n_samples, n_features)
    • y → Target (n_samples,)
  • Data Types: Numeric (float, int)
  • Organization: Separate (X, y) pairs for training, testing, and optionally validation.

Data Quality (Essential)

  • Fully preprocessed (cleaned, imputed, encoded) — ready for training.
  • No leakage between training, validation, and test sets.
  • Shape constraints:
    • Test/validation ≤ 10,000 samples × 500 features
    • Training ≤ 5,000 samples × 500 features
    • Notify us if your data exceeds these limits (a small validation sketch follows this list).
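A small sanity-check sketch for these constraints, assuming the split names and limits stated above:

# Shape and type checks mirroring the limits in this guide.
import numpy as np

MAX_SAMPLES = {"train": 5_000, "test": 10_000, "validation": 10_000}
MAX_FEATURES = 500

def check_split(name: str, X: np.ndarray, y: np.ndarray) -> None:
    assert X.ndim == 2 and y.ndim == 1, "X must be 2-D and y 1-D"
    assert X.shape[0] == y.shape[0], "X and y must have matching rows"
    assert np.issubdtype(X.dtype, np.number), "features must be numeric"
    assert X.shape[0] <= MAX_SAMPLES[name], f"{name}: too many samples"
    assert X.shape[1] <= MAX_FEATURES, f"{name}: too many features"

# Example: check_split("train", X_train, y_train)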

Evaluation (Essential)

  • Primary Metric: Define the main metric for success (e.g., AUROC, F1-Score).
  • Aggregation Strategy: Specify how results are combined across datasets (e.g., mean AUROC, weighted average).
  • Performance Target: State the improvement goal (e.g., "+5% AUROC over baseline").

Metadata Submission

Include a YAML metadata file with your dataset. This consolidates critical information about the dataset and evaluation protocol.

Example:

# General Information
dataset_name: "Credit Risk Analysis"
description: "Predicting loan default risk based on historical borrower information."
time_series: false

# Evaluation Protocol
primary_metric: "AUROC"
aggregation_strategy: "mean" # Aggregate test set performance by taking the mean AUROC.
target_performance: "+3 pp AUROC over baseline"
key_datasets: # Optional: list high-priority datasets for evaluation.
  - "credit_risk_v1.npy"
  - "mortgage_loans_q3.npy"

# Data Specification
data_format: "numpy"
preprocessing_steps: "StandardScaler for numerical features, OneHotEncoder for categorical."
feature_names: # Optional: only if features are consistent across datasets.
  - "age"
  - "annual_income"
  - "credit_score"
  - "employment_length_years"
