
Add schema option to "read_xlsx" #11046

Open
1 task done
omdaniel opened this issue Mar 26, 2025 · 7 comments
Labels
feature Features or general enhancements

Comments


omdaniel commented Mar 26, 2025

Is your feature request related to a problem?

Using the "spatial" extension as a workaround to read Excel files, I was able to do:

con.read_geo("test.xlsx", table_name="newtable", layer='sheetname', open_options=['HEADERS=AUTO', 'FIELD_TYPES=STRING'])
table = con.table("newtable")
table.try_cast(schema)

and then use try_cast on "newtable" to apply a schema. However, con.read_xlsx with all_varchar=True creates a table whose data is less malleable when using:

table = con.read_xlsx("test.xlsx", header=True, sheet="sheetname", ignore_errors=True, all_varchar=True)
table.try_cast(schema)

Many of the "date" and "time" values do not cast correctly (auto-magically, as desired) when using read_xlsx.

Add an option to read_xlsx that takes a schema and runs raw SQL like the following under the hood:

CREATE TABLE test (a DOUBLE, b DOUBLE);
COPY test FROM 'test.xlsx' WITH (FORMAT xlsx, HEADER);
SELECT * FROM test;

where the Python ibis.Schema generates the raw SQL for the CREATE TABLE field types.
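As an illustration only (not ibis's actual API), the schema-to-DDL step is a small string-rendering exercise; `create_table_ddl` and the plain dict standing in for an ibis.Schema are hypothetical names:

```python
# Hypothetical sketch: render a CREATE TABLE statement from a
# {column_name: duckdb_type} mapping standing in for an ibis.Schema.
def create_table_ddl(table_name: str, fields: dict[str, str]) -> str:
    cols = ", ".join(f"{name} {dtype}" for name, dtype in fields.items())
    return f"CREATE TABLE {table_name} ({cols});"


print(create_table_ddl("test", {"a": "DOUBLE", "b": "DOUBLE"}))
# CREATE TABLE test (a DOUBLE, b DOUBLE);
```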

Alternatively, create an ergonomic way to do this if the desire is to limit the API complexity of the read_xlsx method.

What is the motivation behind your request?

The workflow example can be solved with con.load_extension("excel") and con.raw_sql, but the ergonomics would be cleaner if there were a way to get the same result within ibis in some form.

Describe the solution you'd like

Add a schema parameter to the read_xlsx method:

table = con.read_xlsx("test.xlsx", header=True, sheet="sheetname", ignore_errors=True, schema=schema)

As an alternative, this may be a support issue for the DuckDB native excel extension, but I assume their implementation choices were made for robustness, even if all_varchar=True renders data that is less malleable than what read_geo returns with FIELD_TYPES=STRING.

Below is not an implementation recommendation, just a hack to scratch my particular itch:

import pathlib
from uuid import uuid4

import ibis


def schema2fieldtypes(schema: ibis.Schema) -> str:
    # Render "name TYPE" pairs; FLOAT8 is DuckDB's alias for DOUBLE,
    # whereas FLOAT64 is not a DuckDB type name.
    fields = []
    for name, dtype in schema.as_struct().items():
        fields.append(f"{name} {dtype.name.upper()}")
    return ", ".join(fields).replace("FLOAT64", "FLOAT8")


def read_excel(
    excel_file: pathlib.Path,
    connection: ibis.BaseBackend,
    sheet_name: str = "Sheet1",
    schema: ibis.Schema | None = None,
    sheet_range: str | None = None,
):
    table_name = f"read_excel_{uuid4()}".replace("-", "")
    connection.load_extension("excel")

    if not schema:
        return connection.read_xlsx(
            excel_file, sheet=sheet_name, ignore_errors=True, range=sheet_range
        )

    fields = f"({schema2fieldtypes(schema)})"
    range_opt = f", RANGE '{sheet_range}'" if sheet_range else ""

    # Create the table with explicit field types, then COPY the sheet into it.
    raw_sql = f"CREATE TABLE {table_name} {fields}; "
    raw_sql += (
        f"COPY {table_name} FROM '{excel_file}' "
        f"WITH (FORMAT xlsx, HEADER, SHEET '{sheet_name}', IGNORE_ERRORS{range_opt});"
    )

    connection.raw_sql(raw_sql)
    return connection.table(table_name)

What version of ibis are you running?

10.3.1

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • I agree to follow this project's Code of Conduct
@omdaniel omdaniel added the feature Features or general enhancements label Mar 26, 2025
@cpcloud (Member)

cpcloud commented Mar 27, 2025

Thanks for the issue!

Do you have a file or some example data that doesn't work as you'd like it to that you could upload here?

@omdaniel (Author)

omdaniel commented Mar 27, 2025

I will have to make a separate example xlsx file that surfaces the same issues I encountered with my data; I should be able to provide that in the next few days.

@omdaniel (Author)

omdaniel commented Apr 2, 2025

test.xlsx

Below is some IPython REPL output using the test.xlsx file for these tests:

read_geo (w/ 'FIELD_TYPES=STRING')
cast read_geo result

read_xlsx (w/ 'all_varchar = True')
cast read_xlsx result

read_xlsx (w/ 'all_varchar = False')
cast read_xlsx result

read_xlsx (w/ my custom function and a schema)

Issues encountered:

read_geo is tolerant of a blank first line after the header and will read past the end of the file.
read_xlsx stops reading at a blank first line after the header and needs the "range" parameter set.

read_geo (w/ 'FIELD_TYPES=STRING') interprets values into a string form that is more flexible for "try_cast".

Both read_geo and read_xlsx may have parity at guessing types with a more strictly formatted Excel file, but this should be tested (I have not done any exhaustive testing of this).

castings = dict(
    date_type="date",
    int_type="int",
    time0="time",
    time1="time",
    bool0="bool",
    bool1="bool",
)
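For context, try_cast yields NULL rather than erroring when a value fails to convert. A pure-Python analogue (hypothetical helper, stdlib only) shows why a string like "2012/03/18" survives the date cast while an Excel serial string like "40986" becomes NULL:

```python
from datetime import datetime

def try_cast_date(value, fmt="%Y/%m/%d"):
    """Cast a string to a date, returning None (NULL) on failure."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None

print(try_cast_date("2012/03/18"))  # 2012-03-18
print(try_cast_date("40986"))       # None -- serial-number strings don't parse as dates
```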

In [87]: geo = con.read_geo("test.xlsx", open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING'])

In [88]: geo
Out[88]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0  ┃ bool1  ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string     │ string   │ string   │ string   │ string │ string │
├────────────┼──────────┼──────────┼──────────┼────────┼────────┤
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL   │ NULL   │
│ 2012/03/18 │ 125      │ NA       │ NA       │ PASS   │ NA     │
│ 2012/03/18 │ 203      │ NA       │ NA       │ PASS   │ NA     │
│ 2012/03/18 │ 404      │ 14:05:00 │ 14:08:47 │ Y      │ Y      │
│ 2012/03/18 │ 235      │ 14:05:00 │ 14:07:25 │ Y      │ Y      │
│ 2012/03/18 │ 913      │ 14:05:00 │ 14:07:58 │ Y      │ Y      │
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL   │ NULL   │
│ 2012/03/19 │ 203      │ 15:07:00 │ 15:09:30 │ Y      │ N      │
│ 2012/03/19 │ 537      │ 15:07:00 │ 15:11:03 │ Y      │ Y      │
│ 2012/03/19 │ 613      │ 15:07:00 │ ND       │ Y      │ Y      │
│ …          │ …        │ …        │ …        │ …      │ …      │
└────────────┴──────────┴──────────┴──────────┴────────┴────────┘

In [89]: geo.try_cast(castings)
Out[89]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ int64    │ time     │ time     │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 125      │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 203      │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 404      │ 14:05:00 │ 14:08:47 │ True    │ True    │
│ 2012-03-18 │ 235      │ 14:05:00 │ 14:07:25 │ True    │ True    │
│ 2012-03-18 │ 913      │ 14:05:00 │ 14:07:58 │ True    │ True    │
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │ 203      │ 15:07:00 │ 15:09:30 │ True    │ False   │
│ 2012-03-19 │ 537      │ 15:07:00 │ 15:11:03 │ True    │ True    │
│ 2012-03-19 │ 613      │ 15:07:00 │ NULL     │ True    │ True    │
│ …          │ …        │ …        │ …        │ …       │ …       │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘

In [90]: xlsx = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, all_varchar=True, range = "A1:F11")

In [91]: xlsx
Out[91]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0               ┃ time1               ┃ bool0  ┃ bool1  ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string    │ string   │ string              │ string              │ string │ string │
├───────────┼──────────┼─────────────────────┼─────────────────────┼────────┼────────┤
│ NULL      │ NULL     │ NULL                │ NULL                │ NULL   │ NULL   │
│ 40986     │ 125      │ NA                  │ NA                  │ PASS   │ NA     │
│ 40986     │ 203      │ NA                  │ NA                  │ PASS   │ NA     │
│ 40986     │ 404      │ 0.58680555555555558 │ 0.5894328703703704  │ Y      │ Y      │
│ 40986     │ 235      │ 0.58680555555555558 │ 0.58848379629629632 │ Y      │ Y      │
│ 40986     │ 913      │ 0.58680555555555558 │ 0.58886574074074072 │ Y      │ Y      │
│ NULL      │ NULL     │ NULL                │ NULL                │ NULL   │ NULL   │
│ 40987     │ 203      │ 0.62986111111111109 │ 0.63159722222222225 │ Y      │ N      │
│ 40987     │ 537      │ 0.62986111111111109 │ 0.63267361111111109 │ Y      │ Y      │
│ 40987     │ 613      │ 0.62986111111111109 │ ND                  │ Y      │ Y      │
└───────────┴──────────┴─────────────────────┴─────────────────────┴────────┴────────┘
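The raw values above are how Excel actually stores dates and times: whole-day serial numbers counted from 1899-12-30 (the offset absorbs Excel's fictitious 1900-02-29), and times as fractions of a 24-hour day. A stdlib-only sketch of the conversion (hypothetical helper names):

```python
from datetime import date, time, timedelta

# Excel's day-serial epoch; 1899-12-30 compensates for the 1900 leap-year bug.
EXCEL_EPOCH = date(1899, 12, 30)

def serial_to_date(serial: int) -> date:
    return EXCEL_EPOCH + timedelta(days=serial)

def fraction_to_time(frac: float) -> time:
    seconds = round(frac * 86400)  # fraction of a 24-hour day in seconds
    return time(seconds // 3600, (seconds % 3600) // 60, seconds % 60)

print(serial_to_date(40986))                  # 2012-03-18
print(fraction_to_time(0.58680555555555558))  # 14:05:00
print(fraction_to_time(0.5894328703703704))   # 14:08:47
```

This is exactly why try_cast on the all_varchar output fails: casting the string "40986" to a date has no serial-number semantics, so it yields NULL.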

In [92]: xlsx.try_cast(castings)
Out[92]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date      │ int64    │ time  │ time  │ boolean │ boolean │
├───────────┼──────────┼───────┼───────┼─────────┼─────────┤
│ NULL      │ NULL     │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │ 125      │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │ 203      │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │ 404      │ NULL  │ NULL  │ True    │ True    │
│ NULL      │ 235      │ NULL  │ NULL  │ True    │ True    │
│ NULL      │ 913      │ NULL  │ NULL  │ True    │ True    │
│ NULL      │ NULL     │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │ 203      │ NULL  │ NULL  │ True    │ False   │
│ NULL      │ 537      │ NULL  │ NULL  │ True    │ True    │
│ NULL      │ 613      │ NULL  │ NULL  │ True    │ True    │
└───────────┴──────────┴───────┴───────┴─────────┴─────────┘

In [93]: xlsx_type_guess = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, range = "A1:F11")

In [94]: xlsx_type_guess
Out[94]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ float64  │ float64  │ time     │ float64 │ float64 │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 125.0    │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 203.0    │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 404.0    │ 0.586806 │ 14:08:47 │ NULL    │ NULL    │
│ 2012-03-18 │ 235.0    │ 0.586806 │ 14:07:25 │ NULL    │ NULL    │
│ 2012-03-18 │ 913.0    │ 0.586806 │ 14:07:58 │ NULL    │ NULL    │
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │ 203.0    │ 0.629861 │ 15:09:30 │ NULL    │ NULL    │
│ 2012-03-19 │ 537.0    │ 0.629861 │ 15:11:03 │ NULL    │ NULL    │
│ 2012-03-19 │ 613.0    │ 0.629861 │ NULL     │ NULL    │ NULL    │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘

In [95]: xlsx_type_guess.try_cast(castings)
Out[95]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0 ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ int64    │ time  │ time     │ boolean │ boolean │
├────────────┼──────────┼───────┼──────────┼─────────┼─────────┤
│ NULL       │ NULL     │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 125      │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 203      │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 404      │ NULL  │ 14:08:47 │ NULL    │ NULL    │
│ 2012-03-18 │ 235      │ NULL  │ 14:07:25 │ NULL    │ NULL    │
│ 2012-03-18 │ 913      │ NULL  │ 14:07:58 │ NULL    │ NULL    │
│ NULL       │ NULL     │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │ 203      │ NULL  │ 15:09:30 │ NULL    │ NULL    │
│ 2012-03-19 │ 537      │ NULL  │ 15:11:03 │ NULL    │ NULL    │
│ 2012-03-19 │ 613      │ NULL  │ NULL     │ NULL    │ NULL    │
└────────────┴──────────┴───────┴──────────┴─────────┴─────────┘

And using my custom reader defined in my earlier comment:

In [96]: excel_hack = read_excel("test.xlsx", connection=con, schema = geo.try_cast(castings).schema(), sheet_range = "A1:F11")
In [97]: excel_hack
Out[97]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ int64    │ time     │ time     │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 125      │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 203      │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 404      │ 14:05:00 │ 14:08:47 │ True    │ True    │
│ 2012-03-18 │ 235      │ 14:05:00 │ 14:07:25 │ True    │ True    │
│ 2012-03-18 │ 913      │ 14:05:00 │ 14:07:58 │ True    │ True    │
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │ 203      │ 15:07:00 │ 15:09:30 │ True    │ False   │
│ 2012-03-19 │ 537      │ 15:07:00 │ 15:11:03 │ True    │ True    │
│ 2012-03-19 │ 613      │ 15:07:00 │ NULL     │ True    │ True    │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘


@NickCrews
Contributor

My first thought is that we should try to fix this upstream in duckdb. I would hope that in duckdb, FROM read_xlsx(all_varchar=true) and FROM st_read(open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING']) should result in the same data.

As an alternative this maybe a support issue for the DuckDB native excel extension but I assume their implementation choices were for robustness even if the all_varchar = True renders data that is less malleable than what is returned from read_geo with FIELD_TYPES = STRING

Are you saying you don't think this is a good idea? Can you explain this further?

@omdaniel
Copy link
Author

omdaniel commented Apr 2, 2025

I would think that FROM read_xlsx(all_varchar=true) and FROM st_read(open_options=['HEADERS=AUTO','FIELD_TYPES=STRING']) would produce the same output; however, I do not know whether producing the same output was a goal of the DuckDB excel extension, as the st_read implementation may handle many edge cases while read_xlsx perhaps expects the Excel file to be formatted a specific way (just conjecture, as I haven't communicated with the DuckDB folks). The DuckDB API does provide a more structured option where the user creates the table with field types, followed by COPY ..., and this seems to wrangle the poorly formatted Excel file into a usable form consistent with st_read (using STRINGS) followed by try_cast.

@NickCrews
Contributor

@omdaniel can you file an issue with duckdb and link to it here? Let's try to solve it there before we start to be hacky on our end.

@omdaniel (Author)

omdaniel commented Apr 4, 2025

I opened an issue with the DuckDB excel extension: #40

Projects
Status: backlog
Development

No branches or pull requests

3 participants