Hi, I've recently started using pandera and I really love the ability to quickly validate data; it saves lots of tedious manual SQL/polars checks.
One issue I've encountered is that sometimes I receive data from a client and the column names don't match my pandera schema exactly, but I'd still like to run some basic check validation. In my workflow I often load a csv dump with all columns as strings (`pl.read_csv("some.csv", infer_schema=False)`) and then I want to (see the sketch after this list):
- check that all number columns use dots to represent decimals rather than commas (e.g. 128.1 rather than 128,1)
- make sure string columns don't contain the pipe character ("|"), since we use it as our csv delimiter
- make sure date columns follow a particular format (e.g. dd/mm/yyyy)
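For reference, when the column names do line up, I can already express these as regex checks over the raw strings. A minimal sketch of what I mean (the column names `amount`, `comment`, and `order_date` are made up for illustration):

```python
import polars as pl
import pandera.polars as pa

# Everything is loaded as strings, so each check is a regex over the raw text.
# Column names here (amount, comment, order_date) are placeholders.
schema = pa.DataFrameSchema(
    {
        # numbers must use a dot for decimals (128.1, not 128,1)
        "amount": pa.Column(str, pa.Check.str_matches(r"^-?\d+(\.\d+)?$")),
        # free-text columns must not contain our csv delimiter
        "comment": pa.Column(str, pa.Check.str_matches(r"^[^|]*$")),
        # dates must be dd/mm/yyyy
        "order_date": pa.Column(str, pa.Check.str_matches(r"^\d{2}/\d{2}/\d{4}$")),
    }
)

df = pl.read_csv("some.csv", infer_schema=False)
schema.validate(df)
```

The problem is that this only works once the columns are named exactly as the schema expects.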
This would save a ton of time renaming columns or going back and forth with the client so that they send files with the right column names. I understand there needs to be something that tells pandera which column should be which type, though.
With problematic csv dumps I often find that polars raises an error while loading the csv and trying to infer the schema, but that error isn't as comprehensive as the validation pandera provides. Ideally I would immediately know which rows are problematic, which I currently can't achieve with either pandera or loading via polars directly.
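For the row-level part, when the names do match, pandera's lazy validation already gets close to what I want; a sketch reusing the schema above:

```python
import polars as pl
from pandera.errors import SchemaErrors

df = pl.read_csv("some.csv", infer_schema=False)
try:
    # lazy=True collects every failure instead of stopping at the first one
    schema.validate(df, lazy=True)
except SchemaErrors as exc:
    # failure_cases lists each failing check together with the offending values
    print(exc.failure_cases)
```

What I'm missing is a way to get this kind of report when the column names are slightly off.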