-
Notifications
You must be signed in to change notification settings - Fork 630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add schema option to "read_xlsx" #11046
Comments
Thanks for the issue! Do you have a file or some example data that doesn't work as you'd like it to that you could upload here? |
I will have to make a segregate example xlsx file that surfaces the same issues I encountered with my data and I should be able to provide that in the next few days |
Below shows some ipython repl output using the test.xlsx file of these test read_geo (w/ 'FIELD_TYPES=STRING") read_xlsx (w/ 'all_varchar = True') read_xlsx (w/ 'all_varchar = False') read_xlsx (w/ my custom function and a schema) issues encountered: read_geo (w/ 'FIELD_TYPES=STRING") interprets into a string form that is more flexible for "try_cast" both read_geo and read_xlsx may have parity at guessing the types with a more strictly formatted excel file, but this should be tested (I have not done any exhaustive testing of this) castings = dict(date_type = "date",
int_type = "int",
time0 = "time",
time1 = "time",
bool0 = "bool",
bool1 = "bool")
In [87]: geo = con.read_geo("test.xlsx", open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING'])
In [88]: geo
Out[88]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string │ string │ string │ string │ string │ string │
├────────────┼──────────┼──────────┼──────────┼────────┼────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012/03/18 │ 125 │ NA │ NA │ PASS │ NA │
│ 2012/03/18 │ 203 │ NA │ NA │ PASS │ NA │
│ 2012/03/18 │ 404 │ 14:05:00 │ 14:08:47 │ Y │ Y │
│ 2012/03/18 │ 235 │ 14:05:00 │ 14:07:25 │ Y │ Y │
│ 2012/03/18 │ 913 │ 14:05:00 │ 14:07:58 │ Y │ Y │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012/03/19 │ 203 │ 15:07:00 │ 15:09:30 │ Y │ N │
│ 2012/03/19 │ 537 │ 15:07:00 │ 15:11:03 │ Y │ Y │
│ 2012/03/19 │ 613 │ 15:07:00 │ ND │ Y │ Y │
│ … │ … │ … │ … │ … │ … │
└────────────┴──────────┴──────────┴──────────┴────────┴────────┘
In [89]: geo.try_cast(castings)
Out[89]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ int64 │ time │ time │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 125 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 203 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 404 │ 14:05:00 │ 14:08:47 │ True │ True │
│ 2012-03-18 │ 235 │ 14:05:00 │ 14:07:25 │ True │ True │
│ 2012-03-18 │ 913 │ 14:05:00 │ 14:07:58 │ True │ True │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-19 │ 203 │ 15:07:00 │ 15:09:30 │ True │ False │
│ 2012-03-19 │ 537 │ 15:07:00 │ 15:11:03 │ True │ True │
│ 2012-03-19 │ 613 │ 15:07:00 │ NULL │ True │ True │
│ … │ … │ … │ … │ … │ … │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘
In [90]: xlsx = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, all_varchar=True, range = "A1:F11")
In [91]: xlsx
Out[91]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string │ string │ string │ string │ string │ string │
├───────────┼──────────┼─────────────────────┼─────────────────────┼────────┼────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 40986 │ 125 │ NA │ NA │ PASS │ NA │
│ 40986 │ 203 │ NA │ NA │ PASS │ NA │
│ 40986 │ 404 │ 0.58680555555555558 │ 0.5894328703703704 │ Y │ Y │
│ 40986 │ 235 │ 0.58680555555555558 │ 0.58848379629629632 │ Y │ Y │
│ 40986 │ 913 │ 0.58680555555555558 │ 0.58886574074074072 │ Y │ Y │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 40987 │ 203 │ 0.62986111111111109 │ 0.63159722222222225 │ Y │ N │
│ 40987 │ 537 │ 0.62986111111111109 │ 0.63267361111111109 │ Y │ Y │
│ 40987 │ 613 │ 0.62986111111111109 │ ND │ Y │ Y │
└───────────┴──────────┴─────────────────────┴─────────────────────┴────────┴────────┘
In [92]: xlsx.try_cast(castings)
Out[92]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ int64 │ time │ time │ boolean │ boolean │
├───────────┼──────────┼───────┼───────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ NULL │ 125 │ NULL │ NULL │ NULL │ NULL │
│ NULL │ 203 │ NULL │ NULL │ NULL │ NULL │
│ NULL │ 404 │ NULL │ NULL │ True │ True │
│ NULL │ 235 │ NULL │ NULL │ True │ True │
│ NULL │ 913 │ NULL │ NULL │ True │ True │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ NULL │ 203 │ NULL │ NULL │ True │ False │
│ NULL │ 537 │ NULL │ NULL │ True │ True │
│ NULL │ 613 │ NULL │ NULL │ True │ True │
└───────────┴──────────┴───────┴───────┴─────────┴─────────┘
In [93]: xlsx_type_guess = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, range = "A1:F11")
In [94]: xlsx_type_guess
Out[94]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ float64 │ float64 │ time │ float64 │ float64 │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 125.0 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 203.0 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 404.0 │ 0.586806 │ 14:08:47 │ NULL │ NULL │
│ 2012-03-18 │ 235.0 │ 0.586806 │ 14:07:25 │ NULL │ NULL │
│ 2012-03-18 │ 913.0 │ 0.586806 │ 14:07:58 │ NULL │ NULL │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-19 │ 203.0 │ 0.629861 │ 15:09:30 │ NULL │ NULL │
│ 2012-03-19 │ 537.0 │ 0.629861 │ 15:11:03 │ NULL │ NULL │
│ 2012-03-19 │ 613.0 │ 0.629861 │ NULL │ NULL │ NULL │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘
In [95]: xlsx_type_guess.try_cast(castings)
Out[95]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ int64 │ time │ time │ boolean │ boolean │
├────────────┼──────────┼───────┼──────────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 125 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 203 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 404 │ NULL │ 14:08:47 │ NULL │ NULL │
│ 2012-03-18 │ 235 │ NULL │ 14:07:25 │ NULL │ NULL │
│ 2012-03-18 │ 913 │ NULL │ 14:07:58 │ NULL │ NULL │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-19 │ 203 │ NULL │ 15:09:30 │ NULL │ NULL │
│ 2012-03-19 │ 537 │ NULL │ 15:11:03 │ NULL │ NULL │
│ 2012-03-19 │ 613 │ NULL │ NULL │ NULL │ NULL │
└────────────┴──────────┴───────┴──────────┴─────────┴─────────┘ And using my custom reader defined in my earlier comment In [96]: excel_hack = read_excel("test.xlsx", connection=con, schema = geo.try_cast(castings).schema(), sheet_range = "A1:F11")
In [97]: excel_hack
Out[97]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ int64 │ time │ time │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 125 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 203 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 404 │ 14:05:00 │ 14:08:47 │ True │ True │
│ 2012-03-18 │ 235 │ 14:05:00 │ 14:07:25 │ True │ True │
│ 2012-03-18 │ 913 │ 14:05:00 │ 14:07:58 │ True │ True │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-19 │ 203 │ 15:07:00 │ 15:09:30 │ True │ False │
│ 2012-03-19 │ 537 │ 15:07:00 │ 15:11:03 │ True │ True │
│ 2012-03-19 │ 613 │ 15:07:00 │ NULL │ True │ True │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘
In [98]: |
My first thought is that we should try to fix this upstream in duckdb. I would hope that in duckdb,
Are you saying you don't think this is a good idea? Can you explain this further? |
I would think that FROM read_xlsx(all_varchar=true) and FROM st_read(open_options ['HEADERS=AUTO','FIELD_TYPES=STRING']); however, I do not know if producing the same output was a goal of the DuckDB excel extension as the st_read implementation may be handling many edge cases and the perhaps read_xlsx expects the excel file to be formatted a specific way (just conjecture as I haven't communicated with DuckDB folks). The DuckDB api does provide a more structured option where the user creates the table with fields types followed by COPY... and this seems to wrangle the poorly formatted excel file into a usable form consistent with st_read (using STRINGS) followed by try_cast |
@omdaniel can you file an issue with duckdb and link to it here? Let's try to solve it there before we start to be hacky on our end. |
I opened an issue with DuckDB excel extension #40 |
Is your feature request related to a problem?
Using the "spatial" extension work around to read excel files I was able to
and then use the "try_cast" on "newtable" to apply a schema; however, the con.read_xlsx with "all_varchar=True" creates a table with data that is less malleable when using
many of the "date" and "time" values do not cast
correctly(auto-magically as desired) when using the read_xlsxadd a method to the read_xlsx that takes a schema and runs the raw sql under the hood
where the python ibis.Schema generates the raw sql for the CREATE TABLE field types.
Alternatively, create a ergonomic way to do this if the desire is to limit the api complexity of the read_xlsx method.
What is the motivation behind your request?
The workflow example can solved with con.load_extension("excel") and a con.raw_sql but the ergonomics would be cleaner if there was way to get the same result within ibis in some form
Describe the solution you'd like
Add a schema parameter to "read_xlsx" method
As an alternative this maybe a support issue for the DuckDB native excel extension but I assume their implementation choices were for robustness even if the all_varchar = True renders data that is less malleable than what is returned from read_geo with FIELD_TYPES = STRING
Bellow is not an implementation recommendation just a hack to scratch my particular itch
What version of ibis are you running?
10.3.1
What backend(s) are you using, if any?
DuckDB
Code of Conduct
The text was updated successfully, but these errors were encountered: