Hi, I've recently started using pandera and I really love the ability to quickly validate data; it saves lots of tedious manual SQL/polars checks.
One issue I've encountered is that sometimes I receive data from a client and the column names don't match my pandera schema exactly, but I'd still like to run some basic check validation. In my workflow I often load a csv dump with all columns as strings (`pl.read_csv("some.csv", infer_schema=False)`) and then I want to (see the sketch after this list):
- check that all number columns use dots to represent decimals rather than commas (e.g. 128.1 rather than 128,1)
- make sure string columns don't contain the pipe character ("|"), since we use it as our csv delimiter
- make sure date columns follow a particular format (e.g. dd/mm/yyyy)
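For reference, when the column names do line up, I can already express these as regex checks over the raw strings. A minimal sketch of what I mean (the column names `amount`, `comment`, and `order_date` are made up for illustration):

```python
import polars as pl
import pandera.polars as pa

# Everything is loaded as strings, so each check is a regex over the raw text.
# Column names here (amount, comment, order_date) are placeholders.
schema = pa.DataFrameSchema(
    {
        # numbers must use a dot for decimals (128.1, not 128,1)
        "amount": pa.Column(str, pa.Check.str_matches(r"^-?\d+(\.\d+)?$")),
        # free-text columns must not contain our csv delimiter
        "comment": pa.Column(str, pa.Check.str_matches(r"^[^|]*$")),
        # dates must be dd/mm/yyyy
        "order_date": pa.Column(str, pa.Check.str_matches(r"^\d{2}/\d{2}/\d{4}$")),
    }
)

df = pl.read_csv("some.csv", infer_schema=False)
schema.validate(df)
```

The problem is that this only works once the columns are named exactly as the schema expects.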
This would save a ton of time renaming columns or going back and forth with the client so that they send files with the right column names. I understand there needs to be something that tells pandera which column should be which type, though.
With problematic csv dumps I often find that polars raises an error while loading the csv and trying to infer the schema, but that error isn't as comprehensive as the validation pandera provides. Ideally I would immediately know which rows are problematic, which I currently can't achieve with either pandera or loading via polars directly.
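For the row-level part, when the names do match, pandera's lazy validation already gets close to what I want; a sketch reusing the schema above:

```python
import polars as pl
from pandera.errors import SchemaErrors

df = pl.read_csv("some.csv", infer_schema=False)
try:
    # lazy=True collects every failure instead of stopping at the first one
    schema.validate(df, lazy=True)
except SchemaErrors as exc:
    # failure_cases lists each failing check together with the offending values
    print(exc.failure_cases)
```

What I'm missing is a way to get this kind of report when the column names are slightly off.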