Welcome to ImputeGAP

ImputeGAP is a comprehensive Python library for imputation of missing values in time series data. It implements user-friendly APIs to easily visualize, analyze, and repair time series datasets. The library supports a diverse range of imputation algorithms and modular missing data simulation catering to datasets with varying characteristics. ImputeGAP includes extensive customization options, such as automated hyperparameter tuning, benchmarking, explainability, and downstream evaluation.

In detail, the package provides:

  • Access to commonly used datasets in the time series imputation field (Datasets).
  • Configurable contamination that simulates real-world missingness patterns (Patterns).
  • Parameterizable state-of-the-art time series imputation algorithms (Algorithms).
  • Extensive benchmarking to compare the performance of imputation algorithms (Benchmark).
  • Modular tools to assess the impact of imputation on key downstream tasks (Downstream).
  • Fine-grained analysis of the impact of time series features on imputation results (Explainer).
  • Seamless integration of new algorithms in Python, C++, Matlab, Java, and R (Integration).

If you like our library, please give our GitHub repository a ⭐.


Tools URL
📚 Documentation https://imputegap.readthedocs.io/
📦 PyPI https://pypi.org/project/imputegap/
📁 Datasets Description

Available Imputation Algorithms

Family Algorithm Venue -- Year
Deep Learning BitGraph [32] ICLR -- 2024
Deep Learning BayOTIDE [30] PMLR -- 2024
Deep Learning MissNet [27] KDD -- 2024
Deep Learning MPIN [25] PVLDB -- 2024
Deep Learning PRISTI [26] ICDE -- 2023
Deep Learning GRIN [29] ICLR -- 2022
Deep Learning HKMF_T [31] TKDE -- 2021
Deep Learning DeepMVI [24] PVLDB -- 2021
Deep Learning MRNN [22] IEEE Trans on BE -- 2019
Deep Learning BRITS [23] NeurIPS -- 2018
Deep Learning GAIN [28] ICML -- 2018
Matrix Completion CDRec [1] KAIS -- 2020
Matrix Completion TRMF [8] NeurIPS -- 2016
Matrix Completion GROUSE [3] PMLR -- 2016
Matrix Completion ROSL [4] CVPR -- 2014
Matrix Completion SoftImpute [6] JMLR -- 2010
Matrix Completion SVT [7] SIAM J. OPTIM -- 2010
Matrix Completion SPIRIT [5] VLDB -- 2005
Matrix Completion IterativeSVD [2] BIOINFORMATICS -- 2001
Pattern Search TKCM [11] EDBT -- 2017
Pattern Search STMVL [9] IJCAI -- 2016
Pattern Search DynaMMo [10] KDD -- 2009
Machine Learning IIM [12] ICDE -- 2019
Machine Learning XGBOOST [13] KDD -- 2016
Machine Learning MICE [14] Statistical Software -- 2011
Machine Learning MissForest [15] BioInformatics -- 2011
Statistics KNNImpute -
Statistics Interpolation -
Statistics MinImpute -
Statistics ZeroImpute -
Statistics MeanImpute -
Statistics MeanImputeBySeries -

System Requirements

ImputeGAP is compatible with Python>=3.10 (except 3.13) and Unix-compatible environments.

To create and set up an environment with Python 3.12, please refer to the installation guide.
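
Before installing, it can help to confirm that the active interpreter satisfies the requirement stated above. The following is a minimal sketch; the helper name is_supported is hypothetical and is not part of ImputeGAP:

```python
import sys

def is_supported(version_info=sys.version_info):
    """Return True for Python >= 3.10, excluding the 3.13 series."""
    major, minor = version_info[0], version_info[1]
    return (major, minor) >= (3, 10) and (major, minor) != (3, 13)

# report whether the running interpreter matches the stated requirement
print(f"Python {sys.version_info[0]}.{sys.version_info[1]} supported: {is_supported()}")
```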


Installation

To install the latest version of ImputeGAP from PyPI, run the following command:

pip install imputegap

Alternatively, you can install the library from source:

git clone https://github.com/eXascaleInfolab/ImputeGAP
cd ./ImputeGAP
pip install -e .

Loading

ImputeGAP comes with several time series datasets. The list of datasets is described here.

As an example, we use the eeg-alcohol dataset, which contains recordings from individuals with a genetic predisposition to alcoholism. The measurements come from 64 electrodes placed on the subjects' scalps, sampled at 256 Hz. The dataset consists of 64 series, each containing 256 values.

Example Loading

You can find this example of loading and normalization in the file runner_loading.py.

To load and plot the eeg-alcohol dataset from the library:

from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()
print(f"ImputeGAP datasets : {ts.datasets}")

# load and normalize the dataset from file or from the code
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# plot and print a subset of time series
ts.plot(input_data=ts.data, nbr_series=9, nbr_val=100, save_path="./imputegap_assets")
ts.print(nbr_series=9, nbr_val=20)

The module ts.datasets contains all the publicly available datasets provided by the library, which can be listed as follows:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"ImputeGAP datasets : {ts.datasets}")
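
To make the effect of the z_score normalizer concrete, here is a small standalone sketch of z-score normalization. It assumes a single global mean and standard deviation over the whole dataset; whether ImputeGAP normalizes globally or per series is not stated here, and the function name z_score is only illustrative:

```python
import statistics

def z_score(matrix):
    """Rescale all values to zero mean and unit variance (global statistics, assumed)."""
    values = [v for series in matrix for v in series]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # a constant dataset carries no variation to rescale
        return [[0.0 for _ in series] for series in matrix]
    return [[(v - mean) / std for v in series] for series in matrix]

# two short series standing in for a loaded dataset
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
normalized = z_score(data)
print(normalized)
```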

Contamination

We now describe how to simulate missing values in the loaded dataset. ImputeGAP implements eight different missingness patterns. For more details about the patterns, please refer to the documentation on this page.

Example Contamination

You can find this example in the file runner_contamination.py.

As an example, we show how to contaminate the eeg-alcohol dataset with the MCAR pattern:

from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series with MCAR pattern
ts_m = ts.Contamination.mcar(ts.data, rate_dataset=0.2, rate_series=0.4, block_size=10, seed=True)

# [OPTIONAL] plot the contaminated time series
ts.plot(ts.data, ts_m, nbr_series=9, subplot=True, save_path="./imputegap_assets/contamination")

All missingness patterns developed in ImputeGAP are available in the ts.patterns module. They can be listed as follows:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"Missingness patterns : {ts.patterns}")
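
The idea behind block-wise MCAR contamination can be sketched in a few lines of plain Python. The parameter meanings below are assumptions for illustration (rate_dataset as the fraction of series affected, rate_series as the fraction of each affected series removed, block_size as the length of each gap), blocks are aligned to multiples of block_size for simplicity, and an integer seed is used. This is not the library's implementation:

```python
import math
import random

def mcar_blocks(matrix, rate_dataset=0.2, rate_series=0.4, block_size=10, seed=42):
    """Return a copy of `matrix` with NaN blocks placed at random positions."""
    rng = random.Random(seed)
    out = [list(series) for series in matrix]
    n_selected = math.ceil(len(out) * rate_dataset)
    for idx in rng.sample(range(len(out)), n_selected):
        series = out[idx]
        n_blocks = int(len(series) * rate_series) // block_size
        # block starts are aligned to block_size, so blocks never overlap
        starts = range(0, len(series) - block_size + 1, block_size)
        for start in rng.sample(starts, n_blocks):
            for t in range(start, start + block_size):
                series[t] = math.nan
    return out

# 10 series of 100 values each
data = [[float(t) for t in range(100)] for _ in range(10)]
contaminated = mcar_blocks(data)
missing = [sum(1 for v in s if math.isnan(v)) for s in contaminated]
print(f"missing values per series: {missing}")
```

With these settings, two of the ten series each lose 40 values, in four blocks of ten.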

Imputation

In this section, we will illustrate how to impute the contaminated time series. Our library implements five families of imputation algorithms: Statistical, Machine Learning, Matrix Completion, Deep Learning, and Pattern Search. The list of algorithms is described here.

Example Imputation

You can find this example in the file runner_imputation.py.

Let's illustrate the imputation using the CDRec algorithm from the Matrix Completion family.

from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series
ts_m = ts.Contamination.mcar(ts.data)

# impute the contaminated series
imputer = Imputation.MatrixCompletion.CDRec(ts_m)
imputer.impute()

# compute and print the imputation metrics
imputer.score(ts.data, imputer.recov_data)
ts.print_results(imputer.metrics)

# plot the recovered time series
ts.plot(input_data=ts.data, incomp_data=ts_m, recov_data=imputer.recov_data, nbr_series=9, subplot=True, algorithm=imputer.algorithm, save_path="./imputegap_assets/imputation")

Imputation can be performed using either default values or user-defined values. To specify the parameters, please use a dictionary in the following format:

config = {"rank": 5, "epsilon": 0.01, "iterations": 100}
imputer.impute(params=config)

All algorithms developed in ImputeGAP are available in the ts.algorithms module, which can be listed as follows:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"Imputation algorithms : {ts.algorithms}")
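
The impute-then-score workflow can be illustrated without the library. The sketch below fills each gap with the mean of its own series (in the spirit of MeanImputeBySeries from the table above) and computes RMSE and MAE. It assumes the metrics are evaluated only on the positions that were missing; the function names are hypothetical:

```python
import math

def mean_impute_by_series(incomplete):
    """Replace each NaN with the mean of the observed values of its own series."""
    recovered = []
    for series in incomplete:
        observed = [v for v in series if not math.isnan(v)]
        fill = sum(observed) / len(observed) if observed else 0.0
        recovered.append([fill if math.isnan(v) else v for v in series])
    return recovered

def rmse_mae(truth, incomplete, recovered):
    """Compute RMSE and MAE over the positions that were missing in `incomplete`."""
    errors = [
        r - t
        for t_row, m_row, r_row in zip(truth, incomplete, recovered)
        for t, m, r in zip(t_row, m_row, r_row)
        if math.isnan(m)
    ]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae

truth = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
incomplete = [[1.0, math.nan, 3.0, 4.0], [10.0, 10.0, math.nan, 10.0]]
recovered = mean_impute_by_series(incomplete)
rmse, mae = rmse_mae(truth, incomplete, recovered)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
```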

Parameter Tuning

The Optimizer component manages algorithm configuration and hyperparameter tuning. The parameters are defined by providing a dictionary containing the ground truth, the chosen optimizer, and the optimizer's options. Several search algorithms are available, including those provided by Ray Tune.

Example Auto-ML

You can find this example in the file runner_optimization.py.

Let's illustrate the imputation using the CDRec algorithm and Ray Tune AutoML:

from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series and define the imputer
ts_m = ts.Contamination.mcar(ts.data)
imputer = Imputation.MatrixCompletion.CDRec(ts_m)

# use Ray Tune to fine tune the imputation algorithm
imputer.impute(user_def=False, params={"input_data": ts.data, "optimizer": "ray_tune"})

# compute the imputation metrics with optimized parameter values
imputer.score(ts.data, imputer.recov_data)

# compute the imputation metrics with default parameter values
imputer_def = Imputation.MatrixCompletion.CDRec(ts_m).impute()
imputer_def.score(ts.data, imputer_def.recov_data)

# print the imputation metrics with default and optimized parameter values
ts.print_results(imputer_def.metrics, text="Default values")
ts.print_results(imputer.metrics, text="Optimized values")

# plot the recovered time series
ts.plot(input_data=ts.data, incomp_data=ts_m, recov_data=imputer.recov_data, nbr_series=9, subplot=True, algorithm=imputer.algorithm, save_path="./imputegap_assets/imputation")

# save hyperparameters
utils.save_optimization(optimal_params=imputer.parameters, algorithm=imputer.algorithm, dataset="eeg-alcohol", optimizer="ray_tune")

All optimizers developed in ImputeGAP are available in the ts.optimizers module, which can be listed as follows:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"AutoML Optimizers : {ts.optimizers}")
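
The reason the optimizer needs the ground truth is that every candidate configuration is scored against it. The sketch below shows this with a plain grid search over one parameter of a toy imputer. Grid search stands in for Ray Tune here, and the imputer, its window parameter, and all function names are hypothetical:

```python
import math

def window_mean_impute(series, window):
    """Fill each NaN with the mean of observed values within `window` steps."""
    out = list(series)
    for t, v in enumerate(series):
        if math.isnan(v):
            lo, hi = max(0, t - window), min(len(series), t + window + 1)
            near = [x for x in series[lo:hi] if not math.isnan(x)]
            out[t] = sum(near) / len(near) if near else 0.0
    return out

def rmse_on_missing(truth, incomplete, recovered):
    """RMSE restricted to the positions that were missing."""
    sq = [(r - g) ** 2 for g, m, r in zip(truth, incomplete, recovered) if math.isnan(m)]
    return math.sqrt(sum(sq) / len(sq))

def grid_search(truth, incomplete, candidates):
    """Score every candidate window against the ground truth and keep the best."""
    scored = [
        (rmse_on_missing(truth, incomplete, window_mean_impute(incomplete, w)), w)
        for w in candidates
    ]
    best_rmse, best_window = min(scored)
    return best_window, best_rmse

truth = [float(t) for t in range(20)]  # a linear ramp
incomplete = list(truth)
for t in (5, 6, 12):
    incomplete[t] = math.nan

best_window, best_rmse = grid_search(truth, incomplete, candidates=[1, 2, 4, 8])
print(f"best window={best_window}, RMSE={best_rmse:.3f}")
```

On this ramp, a window of 4 gives the lowest error: smaller windows see only one side of the two-point gap, while the largest one is truncated at the start of the series and averages asymmetrically.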

Benchmark

ImputeGAP can serve as a common test-bed for comparing the effectiveness and efficiency of time series imputation algorithms [33]. Users have full control over the benchmark by customizing various parameters, including the list of the algorithms to compare, the optimizer, the datasets to evaluate, the missingness patterns, the range of missing values, and the performance metrics.

Example Benchmark

You can find this example in the file runner_benchmark.py.

The benchmarking module can be utilized as follows:

from imputegap.recovery.benchmark import Benchmark

my_algorithms = ["SoftImpute", "KNNImpute"]

my_opt = ["default_params"]

my_datasets = ["eeg-alcohol"]

my_patterns = ["mcar"]

# missing rates to evaluate (named my_range to avoid shadowing the built-in range)
my_range = [0.05, 0.1, 0.2, 0.4, 0.6, 0.8]

my_metrics = ["*"]

# launch the evaluation
bench = Benchmark()
bench.eval(algorithms=my_algorithms, datasets=my_datasets, patterns=my_patterns, x_axis=my_range, metrics=my_metrics, optimizers=my_opt)

You can enable the optimizer by replacing my_opt with the following configuration:

opt = {"optimizer": "ray_tune", "options": {"n_calls": 1, "max_concurrent_trials": 1}}
my_opt = [opt]
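
Conceptually, a benchmark of this kind is a loop over algorithms and missing rates that records one score per combination. The following standalone sketch illustrates that structure with two toy imputers and RMSE. All function names are hypothetical, and the contamination is simple random masking rather than one of the library's patterns:

```python
import math
import random

def contaminate(series, rate, rng):
    """Hide a fraction `rate` of the values at random positions."""
    out = list(series)
    for t in rng.sample(range(len(out)), int(len(out) * rate)):
        out[t] = math.nan
    return out

def zero_impute(series):
    return [0.0 if math.isnan(v) else v for v in series]

def mean_impute(series):
    observed = [v for v in series if not math.isnan(v)]
    fill = sum(observed) / len(observed)
    return [fill if math.isnan(v) else v for v in series]

def rmse(truth, incomplete, recovered):
    sq = [(r - g) ** 2 for g, m, r in zip(truth, incomplete, recovered) if math.isnan(m)]
    return math.sqrt(sum(sq) / len(sq))

def run_benchmark(truth, algorithms, rates, seed=0):
    """Return {algorithm name: {rate: RMSE}}, using the same mask for every algorithm."""
    results = {name: {} for name in algorithms}
    for rate in rates:
        incomplete = contaminate(truth, rate, random.Random(seed))
        for name, impute in algorithms.items():
            results[name][rate] = rmse(truth, incomplete, impute(incomplete))
    return results

truth = [10.0 + math.sin(t / 5.0) for t in range(200)]
algorithms = {"ZeroImpute": zero_impute, "MeanImpute": mean_impute}
results = run_benchmark(truth, algorithms, rates=[0.05, 0.2, 0.4])
for name, by_rate in results.items():
    print(name, {rate: round(score, 3) for rate, score in by_rate.items()})
```

Reusing the same mask for all algorithms at a given rate keeps the comparison fair, since each algorithm is asked to recover exactly the same values.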

Downstream

ImputeGAP includes a dedicated module for systematically evaluating the impact of data imputation on downstream tasks. Currently, forecasting is the primary supported task, with plans to expand to additional applications in the future.

Example Downstream

You can find this example in the file runner_downstream.py.

Below is an example of how to call the downstream process for the forecasting model hw-add by defining a dictionary for the evaluator and selecting the model:

from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()
print(f"ImputeGAP downstream models for forecasting : {ts.forecasting_models}")

# load and normalize the dataset
ts.load_series(utils.search_path("forecast-economy"))
ts.normalize()

# contaminate the time series
ts_m = ts.Contamination.aligned(ts.data, rate_series=0.8)

# define and impute the contaminated series
imputer = Imputation.MatrixCompletion.CDRec(ts_m)
imputer.impute()

# compute and print the downstream results
downstream_config = {"task": "forecast", "model": "hw-add", "comparator": "ZeroImpute"}
imputer.score(ts.data, imputer.recov_data, downstream=downstream_config)
ts.print_results(imputer.downstream_metrics, algorithm=imputer.algorithm)

All downstream models developed in ImputeGAP are available in the ts.forecasting_models module, which can be listed as follows:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"ImputeGAP downstream models for forecasting : {ts.forecasting_models}")
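
The purpose of a downstream evaluation is to check whether a better imputation also yields a better result on the final task. The sketch below illustrates this for forecasting: the same forecaster is trained on a history repaired by two different imputers, and the forecast errors are compared. Simple exponential smoothing is used as a stand-in for the library's forecasting models, zero imputation plays the role of the comparator, and all names are hypothetical:

```python
import math

def ses_forecast(history, horizon, alpha=0.5):
    """Simple exponential smoothing: a flat forecast at the last smoothed level."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * horizon

def forecast_mae(history, future):
    """Mean absolute error of the forecast against the true future values."""
    predictions = ses_forecast(history, len(future))
    return sum(abs(p - f) for p, f in zip(predictions, future)) / len(future)

def zero_impute(series):
    return [0.0 if math.isnan(v) else v for v in series]

def mean_impute(series):
    observed = [v for v in series if not math.isnan(v)]
    fill = sum(observed) / len(observed)
    return [fill if math.isnan(v) else v for v in series]

truth = [10.0 + math.sin(t / 3.0) for t in range(60)]
history, future = truth[:50], truth[50:]

# remove a block of values near the end of the training history
incomplete = list(history)
for t in range(40, 48):
    incomplete[t] = math.nan

mae_mean = forecast_mae(mean_impute(incomplete), future)
mae_zero = forecast_mae(zero_impute(incomplete), future)
print(f"forecast MAE after MeanImpute: {mae_mean:.3f}")
print(f"forecast MAE after ZeroImpute: {mae_zero:.3f}")
```

Here the zero-filled block drags the smoothed level down just before the forecast starts, so the comparator produces a clearly larger forecast error than the mean-imputed history.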

Explainer

The library provides insights into the algorithm’s behavior by identifying the features that impact the imputation results. It trains a regression model to predict imputation results across various methods and uses SHapley Additive exPlanations (SHAP) to reveal how different time series features influence the model’s predictions.

Example Explainer

You can find this example in the file runner_explainer.py.

Let’s illustrate the explainer using the CDRec algorithm and MCAR missingness pattern:

from imputegap.recovery.manager import TimeSeries
from imputegap.recovery.explainer import Explainer
from imputegap.tools import utils

# initialize the time series and explainer object
ts = TimeSeries()
exp = Explainer()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# configure the explanation
exp.shap_explainer(input_data=ts.data, extractor="pycatch", pattern="mcar", file_name=ts.name, algorithm="CDRec")

# print the impact of each feature
exp.print(exp.shap_values, exp.shap_details)

All feature extractors developed in ImputeGAP are available in the ts.extractors module, which can be listed as follows:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"ImputeGAP feature extractors : {ts.extractors}")
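
To give an intuition for what SHAP attributes to each feature, the sketch below computes exact Shapley values by enumerating all feature subsets for a tiny regression model. This is the textbook definition, not the approximation used by the SHAP library, and the model, its coefficients, and the feature names are invented for illustration:

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # features in `subset` take the instance value, the rest stay at the baseline
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[name] = total
    return phi

def predicted_rmse(features):
    """Hypothetical regression model mapping series features to an imputation error."""
    return (0.2 + 0.5 * features["entropy"]
            - 0.3 * features["autocorrelation"]
            + 0.1 * features["trend"])

instance = {"entropy": 0.8, "autocorrelation": 0.9, "trend": 0.1}
baseline = {"entropy": 0.5, "autocorrelation": 0.5, "trend": 0.5}
phi = shapley_values(predicted_rmse, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the property that makes Shapley values usable as a per-feature breakdown.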

Integration

To add your own imputation algorithm, please refer to the detailed integration guide.


Citing

If you use ImputeGAP in your research, please cite the paper:

@article{nater2025imputegap,
  title = {ImputeGAP: A Comprehensive Library for Time Series Imputation},
  author = {Nater, Quentin and Khayati, Mourad and Pasquier, Jacques},
  year = {2025},
  eprint = {2503.15250},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  url = {https://arxiv.org/abs/2503.15250}
}

Core Contributors

  • Quentin Nater
  • Mourad Khayati

References

[1] Mourad Khayati, Philippe Cudré-Mauroux, Michael H. Böhlen: Scalable recovery of missing blocks in time series with high and low cross-correlations. Knowl. Inf. Syst. 62(6): 2257-2280 (2020)

[2] Olga G. Troyanskaya, Michael N. Cantor, Gavin Sherlock, Patrick O. Brown, Trevor Hastie, Robert Tibshirani, David Botstein, Russ B. Altman: Missing value estimation methods for DNA microarrays. Bioinform. 17(6): 520-525 (2001)

[3] Dejiao Zhang, Laura Balzano: Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation. AISTATS 2016: 1460-1468

[4] Xianbiao Shu, Fatih Porikli, Narendra Ahuja: Robust Orthonormal Subspace Learning: Efficient Recovery of Corrupted Low-Rank Matrices. CVPR 2014: 3874-3881

[5] Spiros Papadimitriou, Jimeng Sun, Christos Faloutsos: Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005: 697-708

[6] Rahul Mazumder, Trevor Hastie, Robert Tibshirani: Spectral Regularization Algorithms for Learning Large Incomplete Matrices. J. Mach. Learn. Res. 11: 2287-2322 (2010)

[7] Jian-Feng Cai, Emmanuel J. Candès, Zuowei Shen: A Singular Value Thresholding Algorithm for Matrix Completion. SIAM J. Optim. 20(4): 1956-1982 (2010)

[8] Hsiang-Fu Yu, Nikhil Rao, Inderjit S. Dhillon: Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction. NIPS 2016: 847-855

[9] Xiuwen Yi, Yu Zheng, Junbo Zhang, Tianrui Li: ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data. IJCAI 2016: 2704-2710

[10] Lei Li, James McCann, Nancy S. Pollard, Christos Faloutsos: DynaMMo: Mining and Summarization of Coevolving Sequences with Missing Values. KDD 2009: 507-516

[11] Kevin Wellenzohn, Michael H. Böhlen, Anton Dignös, Johann Gamper, Hannes Mitterer: Continuous Imputation of Missing Values in Streams of Pattern-Determining Time Series. EDBT 2017: 330-341

[12] Aoqian Zhang, Shaoxu Song, Yu Sun, Jianmin Wang: Learning Individual Models for Imputation (Technical Report). CoRR abs/2004.03436 (2020)

[13] Tianqi Chen, Carlos Guestrin: XGBoost: A Scalable Tree Boosting System. KDD 2016: 785-794

[14] Patrick Royston, Ian R. White: Multiple Imputation by Chained Equations (MICE): Implementation in Stata. J. Stat. Softw. 45(4): 1-20 (2011)

[15] Daniel J. Stekhoven, Peter Bühlmann: MissForest - non-parametric missing value imputation for mixed-type data. Bioinform. 28(1): 112-118 (2012)

[22] Jinsung Yoon, William R. Zame, Mihaela van der Schaar: Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks. IEEE Trans. Biomed. Eng. 66(5): 1477-1490 (2019)

[23] Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li: BRITS: Bidirectional Recurrent Imputation for Time Series. NeurIPS 2018: 6776-6786

[24] Parikshit Bansal, Prathamesh Deshpande, Sunita Sarawagi: Missing Value Imputation on Multidimensional Time Series. Proc. VLDB Endow. 14(11): 2533-2545 (2021)

[25] Xiao Li, Huan Li, Hua Lu, Christian S. Jensen, Varun Pandey, Volker Markl: Missing Value Imputation for Multi-attribute Sensor Data Streams via Message Propagation (Extended Version). CoRR abs/2311.07344 (2023)

[26] Mingzhe Liu, Han Huang, Hao Feng, Leilei Sun, Bowen Du, Yanjie Fu: PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation. ICDE 2023: 1927-1939

[27] Kohei Obata, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai: Mining of Switching Sparse Networks for Missing Value Imputation in Multivariate Time Series. KDD 2024: 2296-2306

[28] Jinsung Yoon, James Jordon, Mihaela van der Schaar: GAIN: Missing Data Imputation using Generative Adversarial Nets. ICML 2018: 5675-5684

[29] Andrea Cini, Ivan Marisca, Cesare Alippi: Multivariate Time Series Imputation by Graph Neural Networks. CoRR abs/2108.00298 (2021)

[30] Shikai Fang, Qingsong Wen, Yingtao Luo, Shandian Zhe, Liang Sun: BayOTIDE: Bayesian Online Multivariate Time Series Imputation with Functional Decomposition. ICML 2024

[31] Liang Wang, Simeng Wu, Tianheng Wu, Xianping Tao, Jian Lu: HKMF-T: Recover From Blackouts in Tagged Time Series With Hankel Matrix Factorization. IEEE Trans. Knowl. Data Eng. 33(11): 3582-3593 (2021)

[32] Xiaodan Chen, Xiucheng Li, Bo Liu, Zhijun Li: Biased Temporal Convolution Graph Network for Time Series Forecasting with Missing Values. ICLR 2024

[33] Mourad Khayati, Alberto Lerner, Zakhar Tymchenko, Philippe Cudré-Mauroux: Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. Proc. VLDB Endow. 13(5): 768-782 (2020)

[34] Mourad Khayati, Quentin Nater, Jacques Pasquier: ImputeVIS: An Interactive Evaluator to Benchmark Imputation Techniques for Time Series Data. Proc. VLDB Endow. 17(12): 4329-4332 (2024)