The purpose of this repository is to provide a fully open source playground for tabular foundation models. It contains a much smaller and simpler implementation of the TabPFNv2 architecture, together with a training loop and code for loading data that was pre-generated by a prior. We plan to rapidly extend the repository with more features (e.g. regression, missing values, categorical features), prior interfaces, and architectures. It is intended as a good starting point for students and researchers who are interested in learning how TabPFN works under the hood.
Clone the repository, then install the dependencies via:
pip install -e .
We offer the same interface as TabPFN:
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from nanotabpfn import NanoTabPFNClassifier
# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
# Initialize a classifier
clf = NanoTabPFNClassifier()
clf.fit(X_train, y_train)
# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))
# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", accuracy_score(y_test, predictions))
nanotabpfn/model.py contains the implementation of the architecture in less than 250 lines of code, nanotabpfn/train.py implements a simple training loop in under 100 lines, and nanotabpfn/priors.py implements a dataloader that lets you load a dump pre-generated from a prior.
We will release multiple dumps of different scales soon. We also offer an interface where you can provide your own get_batch function, as sketched below.
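As a minimal sketch of what such a function could look like, here is a hypothetical get_batch that samples synthetic classification datasets from a random linear prior. The signature and return format are assumptions for illustration; check nanotabpfn/priors.py for the exact interface expected:

import torch

def get_batch(batch_size, num_datapoints, num_features, device):
    # Hypothetical toy prior: random features, labels derived from a
    # random linear decision boundary. Signature and return format
    # are assumed, not taken from nanotabpfn.
    x = torch.randn(batch_size, num_datapoints, num_features, device=device)
    w = torch.randn(batch_size, num_features, 1, device=device)
    # one binary label per datapoint, based on which side of the boundary it falls
    y = ((x @ w).squeeze(-1) > 0).long()
    return x, y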
First we download 100k pre-generated datasets with 50 datapoints, 3 features and up to 3 classes each from here.
Then you can run:
python pretrain_classification.py -epochs 80 -steps 25 -batchsize 50 -priordump 50x3_3_100k_classification.h5
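Here -epochs is the number of training epochs, -steps the number of gradient steps per epoch, and -batchsize the number of datasets per batch; these match the epochs=80, num_steps=25 and batch_size=50 arguments used in the training walkthrough below.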
This should take less than 5 minutes on a modern NVIDIA GPU (around 10 minutes on a MacBook M4 Pro GPU and around 40 minutes on an M4 Pro CPU).
We also offer a pre-generated regression dataset containing 1.28M tables with 50 datapoints and 3 features each here. You can pretrain on it using python pretrain_regressor.py.
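Assuming pretrain_regressor.py accepts the same flags as the classification script (an assumption; check the script's argument parser), the invocation would look like:

python pretrain_regressor.py -epochs 80 -steps 25 -batchsize 50 -priordump <regression dump file>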
First we import our architecture, prior interface, training loop, and related utilities:
from nanotabpfn.model import NanoTabPFNModel
from nanotabpfn.priors import PriorDumpDataLoader
from nanotabpfn.train import train
from nanotabpfn.utils import get_default_device
from nanotabpfn.interface import NanoTabPFNClassifier
from torch.nn import CrossEntropyLoss
Then we instantiate our model and loss criterion:
model = NanoTabPFNModel(
    num_attention_heads=6,
    embedding_size=192,
    mlp_hidden_size=768,
    num_layers=6,
    num_outputs=10,
)
criterion = CrossEntropyLoss()
Then we instantiate our prior:
device = get_default_device()
prior = PriorDumpDataLoader(filename='50x3_3_100k_classification.h5', num_steps=25, batch_size=50, device=device)
Finally, we train our model:
def epoch_callback(epoch, epoch_time, mean_loss, model):
    classifier = NanoTabPFNClassifier(model, device)
    # you can add your own eval code here that runs after every epoch
    print(f'epoch {epoch:5d} | time {epoch_time:5.2f}s | mean loss {mean_loss:5.2f}', flush=True)
trained_model, loss = train(
    model=model,
    prior=prior,
    criterion=criterion,
    epochs=80,
    device=device,
    epoch_callback=epoch_callback,
)
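After training, the model can be wrapped in the scikit-learn style interface from above, as the epoch_callback already does, and evaluated on data of your choice. The synthetic dataset below is purely illustrative and mirrors the scale of the prior dump (50 datapoints, 3 features, up to 3 classes):

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# illustrative toy dataset matching the prior dump's scale
X, y = make_classification(n_samples=50, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

clf = NanoTabPFNClassifier(trained_model, device)
clf.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))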