Welcome! This guide explains the best way to share your dataset with us. Our goal is to quickly understand your data and establish a strong performance baseline. This ensures a smooth and effective integration with our machine learning models, like TabPFN.
To do this, we ask that you provide a simple notebook that prepares your data and runs a baseline model, such as XGBoost. This helps us align on metrics and validate the data format from the start.
To ensure your datasets work flawlessly with our pipeline, please follow these requirements.
Note on Requirements: Throughout this guide, certain items are marked as (Essential) or (Optional).
- (Essential) — These requirements are critical for us to process, evaluate, and integrate your dataset effectively.
- (Optional) — These items are not strictly required, but they are highly encouraged. Providing them helps us better understand your data, perform more targeted benchmarking, and produce richer insights.
For most projects, the standard approach is to provide a single Jupyter or Colab notebook: it gives us a reproducible, low-friction starting point that we can immediately build upon.
Your notebook should cover the following (a minimal code sketch follows this list):
- Data Preparation and Splitting
  - Define `target_column` and `feature_columns` at the start.
  - Load data, preprocess, and split into training/testing sets.
  - For time-series data, ensure chronological splitting to avoid data leakage.
- Baseline Model and Evaluation
  - Train a standard XGBoost model (or another agreed-upon baseline).
  - Clearly specify the performance metrics that matter most for your use case. The following are common examples — please confirm or adjust based on your project’s priorities:
    - Classification: AUROC, F1-Score, LogLoss, Confusion Matrix
    - Regression: RMSE, MAE, R²
  - Report results on the test set using the agreed metrics.
- Key Visualizations (Optional)
  - Feature importance chart
  - ROC curve (classification)
  - Residuals plot (regression)
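To make these steps concrete, here is a minimal, hedged sketch of such a notebook. The file name `your_dataset.csv`, the column names, and the hyperparameters are placeholders; swap in your own data, the split that fits your setting (chronological for time series), and the metrics agreed for your project.

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

# --- Data preparation and splitting ---
target_column = "default"                    # placeholder target name
feature_columns = ["age", "annual_income"]   # placeholder feature names

df = pd.read_csv("your_dataset.csv")         # placeholder file
X, y = df[feature_columns], df[target_column]

# For time-series data, replace this with a chronological split to avoid leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Baseline model (XGBoost classifier for a binary target) ---
model = xgb.XGBClassifier(n_estimators=300, eval_metric="logloss")
model.fit(X_train, y_train)

# --- Evaluation on the agreed metrics ---
proba = model.predict_proba(X_test)[:, 1]
preds = model.predict(X_test)
print("AUROC:", roc_auc_score(y_test, proba))
print("F1:   ", f1_score(y_test, preds))
```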
Before sharing data, remove or obfuscate all PII and sensitive information.
- Remove Direct Identifiers: Names, emails, addresses, phone numbers, etc.
- Anonymize Quasi-Identifiers: Features like ZIP code, age, job title that could be combined to re-identify individuals.
- Techniques: Group rare categories and add statistical noise to sensitive numerical values (a sketch follows this list).
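A small sketch of those two techniques, assuming a pandas DataFrame; the column names, threshold, and noise scale are purely illustrative, not a prescribed anonymization standard.

```python
import numpy as np
import pandas as pd

def group_rare_categories(col: pd.Series, min_count: int = 20) -> pd.Series:
    """Replace categories occurring fewer than min_count times with 'OTHER'."""
    counts = col.value_counts()
    rare = counts[counts < min_count].index
    return col.where(~col.isin(rare), "OTHER")

def add_gaussian_noise(col: pd.Series, rel_scale: float = 0.05, seed: int = 0) -> pd.Series:
    """Add Gaussian noise scaled to the column's standard deviation."""
    rng = np.random.default_rng(seed)
    return col + rng.normal(0.0, rel_scale * col.std(), size=len(col))

# Hypothetical usage on quasi-identifiers:
# df["job_title"] = group_rare_categories(df["job_title"])
# df["annual_income"] = add_gaussian_noise(df["annual_income"])
```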
- Package the final notebook and associated data files into a `.zip` archive.
- Provide a secure download link.
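If it helps, the archive can be built directly from Python; the folder and archive names below are placeholders.

```python
import shutil

# Bundle the notebook and data files (assumed to live in "submission/") into one archive.
shutil.make_archive("dataset_submission", "zip", root_dir="submission")
```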
For specialized use cases — such as large-scale benchmarking across multiple datasets or formal fine-tuning programs — we support direct integration into our multi-dataset testing and fine-tuning pipeline.
This approach leverages the `Multi_Dataset_Integration` module included in the repository, which contains the core structure and example implementations to help you get started quickly.
- Implement Data Loader: Create a `DataModule` class that loads, processes, and serves datasets according to our pipeline specs (a rough sketch follows this list).
- Testing: Validate your implementation with the provided scripts (`minimal_example.py`).
- Automated Checks: Ensure all tests pass using `pytest`.
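The authoritative interface lives in the `Multi_Dataset_Integration` module and `minimal_example.py`; purely as orientation, a loader could be structured roughly like the sketch below. The class and method names here are assumptions, not the pipeline's actual spec.

```python
import numpy as np

class ExampleDataModule:
    """Hypothetical data loader sketch; follow the real interface shipped in the repository."""

    def __init__(self, data_dir: str):
        self.data_dir = data_dir

    def load_splits(self) -> dict:
        # Load fully preprocessed arrays from disk (file names are placeholders).
        splits = {}
        for split in ("train", "test"):
            X = np.load(f"{self.data_dir}/{split}_X.npy")
            y = np.load(f"{self.data_dir}/{split}_y.npy")
            splits[split] = (X, y)
        return splits
```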
- Format: Pairs of NumPy arrays:
  - `X` → Features, shape `(n_samples, n_features)`
  - `y` → Target, shape `(n_samples,)`
- Data Types: Numeric (`float`, `int`)
- Organization: Separate `(X, y)` pairs for training, testing, and optionally validation (a saving sketch follows these requirements).
- Fully preprocessed (cleaned, imputed, encoded) — ready for training.
- No leakage between training, validation, and test sets.
- Shape constraints:
- Test/validation ≤ 10,000 samples × 500 features
- Training ≤ 5,000 samples × 500 features
- Notify us if exceeding these limits.
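To illustrate the expected format, the snippet below saves a hypothetical `(X, y)` pair as `.npy` files and checks it against the shape limits above; the file names and the helper itself are illustrative.

```python
import numpy as np

def save_split(X: np.ndarray, y: np.ndarray, prefix: str,
               max_samples: int, max_features: int = 500) -> None:
    """Save one (X, y) pair and flag it if it exceeds the agreed shape limits."""
    assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"
    if X.shape[0] > max_samples or X.shape[1] > max_features:
        print(f"{prefix}: shape {X.shape} exceeds the limits; please notify us.")
    np.save(f"{prefix}_X.npy", X.astype(np.float32))
    np.save(f"{prefix}_y.npy", y)

# Hypothetical usage with the limits from this guide:
# save_split(X_train, y_train, "train", max_samples=5_000)
# save_split(X_test, y_test, "test", max_samples=10_000)
```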
- Primary Metric: Define (e.g., AUROC, F1-Score).
- Aggregation Strategy: Specify (e.g., mean AUROC, weighted average).
- Performance Target: State improvement goal (e.g., "+5% AUROC over baseline").
Include a YAML metadata file with your dataset. This consolidates critical information about the dataset and evaluation protocol.
Example:
```yaml
# General Information
dataset_name: "Credit Risk Analysis"
description: "Predicting loan default risk based on historical borrower information."
time_series: false

# Evaluation Protocol
primary_metric: "AUROC"
aggregation_strategy: "mean"  # Aggregate test set performance by taking the mean AUROC.
target_performance: "+3 pp AUROC over baseline"
key_datasets:  # Optional: list high-priority datasets for evaluation.
  - "credit_risk_v1.npy"
  - "mortgage_loans_q3.npy"

# Data Specification
data_format: "numpy"
preprocessing_steps: "StandardScaler for numerical features, OneHotEncoder for categorical."
feature_names:  # Optional: only if features are consistent across datasets.
  - "age"
  - "annual_income"
  - "credit_score"
  - "employment_length_years"
```