
nanoTabPFN

This repository provides a fully open-source playground for tabular foundation models. It contains a much smaller and simpler implementation of the TabPFNv2 architecture, together with a training loop and code for loading data pre-generated by a prior. We plan to rapidly extend the repository with more features (e.g. regression, missing values, categorical features), prior interfaces, and architectures. It is intended as a good starting point for students and researchers who want to learn how TabPFN works under the hood.

Clone the repository, then install the dependencies:
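
git clone https://github.com/PriorLabs/nanoTabPFN.git
cd nanoTabPFN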

pip install -e .

We offer the same interface as TabPFN:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

from nanotabpfn import NanoTabPFNClassifier

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize a classifier
clf = NanoTabPFNClassifier()
clf.fit(X_train, y_train)

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", accuracy_score(y_test, predictions))

Our Code

nanotabpfn/model.py contains the implementation of the architecture in under 250 lines of code, nanotabpfn/train.py implements a simple training loop in under 100 lines, and nanotabpfn/priors.py implements a dataloader that lets you load a dump pre-generated from a prior. We will release multiple dumps of different scales soon. We also offer an interface where you can provide your own get_batch function, sketched below.
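
As a rough illustration, a custom get_batch could generate synthetic datasets on the fly instead of reading them from a dump. Everything below (the signature, the tensor shapes, and the return convention) is a hypothetical sketch, not the actual contract; the real interface is defined in nanotabpfn/priors.py.

import torch

def get_batch(batch_size=50, num_datapoints=50, num_features=3, num_classes=3):
    # Hypothetical sketch: shapes and return convention are assumptions,
    # not the actual get_batch contract expected by the training loop.
    x = torch.randn(batch_size, num_datapoints, num_features)     # synthetic features
    y = torch.randint(num_classes, (batch_size, num_datapoints))  # synthetic class labels
    return x, y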

Pretrain your own small nanoTabPFN

First, download 100k pre-generated datasets, each with 50 datapoints, 3 features, and up to 3 classes, from here.

Then you can run:

python pretrain_classification.py -epochs 80 -steps 25 -batchsize 50 -priordump 50x3_3_100k_classification.h5

With 25 steps per epoch and a batch size of 50, 80 epochs process 80 × 25 × 50 = 100,000 datasets, i.e. one pass over the dump. Training should take less than 5 minutes on a modern NVIDIA GPU (around 10 minutes on a MacBook M4 Pro GPU and around 40 minutes on its CPU).

We also offer a pre-generated dataset for regression containing 1.28M tables, each with 50 datapoints and 3 features, here.

You can pretrain on it using python pretrain_regressor.py.

Step by Step Explanation (Classifier)

First, we import our architecture, prior interface, training loop, and related utilities:

from nanotabpfn.model import NanoTabPFNModel
from nanotabpfn.priors import PriorDumpDataLoader
from nanotabpfn.train import train
from nanotabpfn.utils import get_default_device
from nanotabpfn.interface import NanoTabPFNClassifier
from torch.nn import CrossEntropyLoss

Then we instantiate our model and loss criterion:

model = NanoTabPFNModel(
    num_attention_heads=6,  # attention heads per transformer layer
    embedding_size=192,     # dimension of the internal embeddings
    mlp_hidden_size=768,    # hidden size of the per-layer MLP
    num_layers=6,           # number of transformer layers
    num_outputs=10,         # output logits, one per possible class
)
criterion = CrossEntropyLoss()

Then we instantiate our prior:

device = get_default_device()
prior = PriorDumpDataLoader(filename='50x3_3_100k_classification.h5', num_steps=25, batch_size=50, device=device)

Finally, we train our model:

def epoch_callback(epoch, epoch_time, mean_loss, model):
    classifier = NanoTabPFNClassifier(model, device)
    # you can add your own eval code here that runs after every epoch
    print(f'epoch {epoch:5d} | time {epoch_time:5.2f}s | mean loss {mean_loss:5.2f}', flush=True)

trained_model, loss = train(
    model=model,
    prior=prior,
    criterion=criterion,
    epochs=80,
    device=device,
    epoch_callback=epoch_callback
)
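
Once training finishes, the returned trained_model can be wrapped in the scikit-learn-style interface from the quickstart above, exactly as epoch_callback does. As a quick smoke test (keep in mind this tiny dump only contains 3-feature datasets, so accuracy on wider tables may be modest):

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

clf = NanoTabPFNClassifier(trained_model, device)  # wrap the freshly trained weights
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))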
