
Add schema option to "read_xlsx" #11046

Open
1 task done
omdaniel opened this issue Mar 26, 2025 · 7 comments
Labels
feature Features or general enhancements

Comments


omdaniel commented Mar 26, 2025

Is your feature request related to a problem?

Using the "spatial" extension as a workaround to read Excel files, I was able to do:

con.read_geo("test.xlsx", table_name="newtable", layer='sheetname', open_options=['HEADERS=AUTO', 'FIELD_TYPES=STRING'])
table = con.table("newtable")
table.try_cast(schema)

and then use try_cast on "newtable" to apply a schema. However, con.read_xlsx with all_varchar=True creates a table whose data is less malleable when using:

table = con.read_xlsx("test.xlsx", header=True, sheet="sheetname", ignore_errors=True, all_varchar=True)
table.try_cast(schema)

Many of the "date" and "time" values do not cast correctly (auto-magically, as desired) when using read_xlsx.

Add an option to read_xlsx that takes a schema and runs raw SQL like the following under the hood:

CREATE TABLE test (a DOUBLE, b DOUBLE);
COPY test FROM 'test.xlsx' WITH (FORMAT xlsx, HEADER);
SELECT * FROM test;

where the Python ibis.Schema generates the raw SQL for the CREATE TABLE field types.
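As an illustration only (not ibis's actual API), the schema-to-DDL step is a small string-rendering exercise; `create_table_ddl` and the plain dict standing in for an ibis.Schema are hypothetical names:

```python
# Hypothetical sketch: render a CREATE TABLE statement from a
# {column_name: duckdb_type} mapping standing in for an ibis.Schema.
def create_table_ddl(table_name: str, fields: dict[str, str]) -> str:
    cols = ", ".join(f"{name} {dtype}" for name, dtype in fields.items())
    return f"CREATE TABLE {table_name} ({cols});"


print(create_table_ddl("test", {"a": "DOUBLE", "b": "DOUBLE"}))
# CREATE TABLE test (a DOUBLE, b DOUBLE);
```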

Alternatively, create an ergonomic way to do this if the desire is to limit the API complexity of the read_xlsx method.

What is the motivation behind your request?

The workflow example can be solved with con.load_extension("excel") and con.raw_sql, but the ergonomics would be cleaner if there were a way to get the same result within ibis in some form.

Describe the solution you'd like

Add a schema parameter to the read_xlsx method:

table = con.read_xlsx("test.xlsx", header=True, sheet="sheetname", ignore_errors=True, schema=schema)

As an alternative, this may be a support issue for the DuckDB native excel extension, but I assume their implementation choices were made for robustness, even if all_varchar=True renders data that is less malleable than what read_geo returns with FIELD_TYPES=STRING.

Below is not an implementation recommendation, just a hack to scratch my particular itch:

import pathlib
from uuid import uuid4

import ibis


def schema2fieldtypes(schema: ibis.Schema) -> str:
    # Render "name TYPE" pairs; FLOAT8 is DuckDB's alias for DOUBLE,
    # whereas FLOAT64 is not a DuckDB type name.
    fields = []
    for name, dtype in schema.as_struct().items():
        fields.append(f"{name} {dtype.name.upper()}")
    return ", ".join(fields).replace("FLOAT64", "FLOAT8")


def read_excel(
    excel_file: pathlib.Path,
    connection: ibis.BaseBackend,
    sheet_name: str = "Sheet1",
    schema: ibis.Schema | None = None,
    sheet_range: str | None = None,
):
    table_name = f"read_excel_{uuid4()}".replace("-", "")
    connection.load_extension("excel")

    if not schema:
        return connection.read_xlsx(
            excel_file, sheet=sheet_name, ignore_errors=True, range=sheet_range
        )

    fields = f"({schema2fieldtypes(schema)})"
    range_opt = f", RANGE '{sheet_range}'" if sheet_range else ""

    # Create the table with explicit field types, then COPY the sheet into it.
    raw_sql = f"CREATE TABLE {table_name} {fields}; "
    raw_sql += (
        f"COPY {table_name} FROM '{excel_file}' "
        f"WITH (FORMAT xlsx, HEADER, SHEET '{sheet_name}', IGNORE_ERRORS{range_opt});"
    )

    connection.raw_sql(raw_sql)
    return connection.table(table_name)

What version of ibis are you running?

10.3.1

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • I agree to follow this project's Code of Conduct
@omdaniel omdaniel added the feature Features or general enhancements label Mar 26, 2025
@cpcloud (Member)

cpcloud commented Mar 27, 2025

Thanks for the issue!

Do you have a file or some example data that doesn't work as you'd like it to that you could upload here?

@omdaniel (Author)

omdaniel commented Mar 27, 2025

I will have to make a separate example xlsx file that surfaces the same issues I encountered with my data; I should be able to provide that in the next few days.

@omdaniel (Author)

omdaniel commented Apr 2, 2025

test.xlsx

Below is some IPython REPL output using the test.xlsx file for these tests:

read_geo (w/ 'FIELD_TYPES=STRING')
cast read_geo result

read_xlsx (w/ 'all_varchar = True')
cast read_xlsx result

read_xlsx (w/ 'all_varchar = False')
cast read_xlsx result

read_xlsx (w/ my custom function and a schema)

Issues encountered:

read_geo is tolerant of a blank first line after the header and will read past the end of the file.
read_xlsx stops reading at a blank first line after the header and needs the "range" parameter set.

read_geo (w/ 'FIELD_TYPES=STRING') interprets values into a string form that is more flexible for "try_cast".

Both read_geo and read_xlsx may have parity at guessing types with a more strictly formatted Excel file, but this should be tested (I have not done any exhaustive testing of this).

castings = dict(
    date_type="date",
    int_type="int",
    time0="time",
    time1="time",
    bool0="bool",
    bool1="bool",
)
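For context, try_cast yields NULL rather than erroring when a value fails to convert. A pure-Python analogue (hypothetical helper, stdlib only) shows why a string like "2012/03/18" survives the date cast while an Excel serial string like "40986" becomes NULL:

```python
from datetime import datetime

def try_cast_date(value, fmt="%Y/%m/%d"):
    """Cast a string to a date, returning None (NULL) on failure."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None

print(try_cast_date("2012/03/18"))  # 2012-03-18
print(try_cast_date("40986"))       # None -- serial-number strings don't parse as dates
```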

In [87]: geo = con.read_geo("test.xlsx", open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING'])

In [88]: geo
Out[88]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0  ┃ bool1  ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string     │ string   │ string   │ string   │ string │ string │
├────────────┼──────────┼──────────┼──────────┼────────┼────────┤
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL   │ NULL   │
│ 2012/03/18 │ 125      │ NA       │ NA       │ PASS   │ NA     │
│ 2012/03/18 │ 203      │ NA       │ NA       │ PASS   │ NA     │
│ 2012/03/18 │ 404      │ 14:05:00 │ 14:08:47 │ Y      │ Y      │
│ 2012/03/18 │ 235      │ 14:05:00 │ 14:07:25 │ Y      │ Y      │
│ 2012/03/18 │ 913      │ 14:05:00 │ 14:07:58 │ Y      │ Y      │
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL   │ NULL   │
│ 2012/03/19 │ 203      │ 15:07:00 │ 15:09:30 │ Y      │ N      │
│ 2012/03/19 │ 537      │ 15:07:00 │ 15:11:03 │ Y      │ Y      │
│ 2012/03/19 │ 613      │ 15:07:00 │ ND       │ Y      │ Y      │
│ …          │ …        │ …        │ …        │ …      │ …      │
└────────────┴──────────┴──────────┴──────────┴────────┴────────┘

In [89]: geo.try_cast(castings)
Out[89]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ int64    │ time     │ time     │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 125      │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 203      │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 404      │ 14:05:00 │ 14:08:47 │ True    │ True    │
│ 2012-03-18 │ 235      │ 14:05:00 │ 14:07:25 │ True    │ True    │
│ 2012-03-18 │ 913      │ 14:05:00 │ 14:07:58 │ True    │ True    │
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │ 203      │ 15:07:00 │ 15:09:30 │ True    │ False   │
│ 2012-03-19 │ 537      │ 15:07:00 │ 15:11:03 │ True    │ True    │
│ 2012-03-19 │ 613      │ 15:07:00 │ NULL     │ True    │ True    │
│ …          │ …        │ …        │ …        │ …       │ …       │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘

In [90]: xlsx = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, all_varchar=True, range = "A1:F11")

In [91]: xlsx
Out[91]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0               ┃ time1               ┃ bool0  ┃ bool1  ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string    │ string   │ string              │ string              │ string │ string │
├───────────┼──────────┼─────────────────────┼─────────────────────┼────────┼────────┤
│ NULL      │ NULL     │ NULL                │ NULL                │ NULL   │ NULL   │
│ 40986     │ 125      │ NA                  │ NA                  │ PASS   │ NA     │
│ 40986     │ 203      │ NA                  │ NA                  │ PASS   │ NA     │
│ 40986     │ 404      │ 0.58680555555555558 │ 0.5894328703703704  │ Y      │ Y      │
│ 40986     │ 235      │ 0.58680555555555558 │ 0.58848379629629632 │ Y      │ Y      │
│ 40986     │ 913      │ 0.58680555555555558 │ 0.58886574074074072 │ Y      │ Y      │
│ NULL      │ NULL     │ NULL                │ NULL                │ NULL   │ NULL   │
│ 40987     │ 203      │ 0.62986111111111109 │ 0.63159722222222225 │ Y      │ N      │
│ 40987     │ 537      │ 0.62986111111111109 │ 0.63267361111111109 │ Y      │ Y      │
│ 40987     │ 613      │ 0.62986111111111109 │ ND                  │ Y      │ Y      │
└───────────┴──────────┴─────────────────────┴─────────────────────┴────────┴────────┘
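The raw values above are how Excel actually stores dates and times: whole-day serial numbers counted from 1899-12-30 (the offset absorbs Excel's fictitious 1900-02-29), and times as fractions of a 24-hour day. A stdlib-only sketch of the conversion (hypothetical helper names):

```python
from datetime import date, time, timedelta

# Excel's day-serial epoch; 1899-12-30 compensates for the 1900 leap-year bug.
EXCEL_EPOCH = date(1899, 12, 30)

def serial_to_date(serial: int) -> date:
    return EXCEL_EPOCH + timedelta(days=serial)

def fraction_to_time(frac: float) -> time:
    seconds = round(frac * 86400)  # fraction of a 24-hour day in seconds
    return time(seconds // 3600, (seconds % 3600) // 60, seconds % 60)

print(serial_to_date(40986))                  # 2012-03-18
print(fraction_to_time(0.58680555555555558))  # 14:05:00
print(fraction_to_time(0.5894328703703704))   # 14:08:47
```

This is exactly why try_cast on the all_varchar output fails: casting the string "40986" to a date has no serial-number semantics, so it yields NULL.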

In [92]: xlsx.try_cast(castings)
Out[92]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date      │ int64    │ time  │ time  │ boolean │ boolean │
├───────────┼──────────┼───────┼───────┼─────────┼─────────┤
│ NULL      │ NULL     │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │ 125      │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │ 203      │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │ 404      │ NULL  │ NULL  │ True    │ True    │
│ NULL      │ 235      │ NULL  │ NULL  │ True    │ True    │
│ NULL      │ 913      │ NULL  │ NULL  │ True    │ True    │
│ NULL      │ NULL     │ NULL  │ NULL  │ NULL    │ NULL    │
│ NULL      │ 203      │ NULL  │ NULL  │ True    │ False   │
│ NULL      │ 537      │ NULL  │ NULL  │ True    │ True    │
│ NULL      │ 613      │ NULL  │ NULL  │ True    │ True    │
└───────────┴──────────┴───────┴───────┴─────────┴─────────┘

In [93]: xlsx_type_guess = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, range = "A1:F11")

In [94]: xlsx_type_guess
Out[94]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ float64  │ float64  │ time     │ float64 │ float64 │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 125.0    │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 203.0    │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 404.0    │ 0.586806 │ 14:08:47 │ NULL    │ NULL    │
│ 2012-03-18 │ 235.0    │ 0.586806 │ 14:07:25 │ NULL    │ NULL    │
│ 2012-03-18 │ 913.0    │ 0.586806 │ 14:07:58 │ NULL    │ NULL    │
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │ 203.0    │ 0.629861 │ 15:09:30 │ NULL    │ NULL    │
│ 2012-03-19 │ 537.0    │ 0.629861 │ 15:11:03 │ NULL    │ NULL    │
│ 2012-03-19 │ 613.0    │ 0.629861 │ NULL     │ NULL    │ NULL    │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘

In [95]: xlsx_type_guess.try_cast(castings)
Out[95]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0 ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ int64    │ time  │ time     │ boolean │ boolean │
├────────────┼──────────┼───────┼──────────┼─────────┼─────────┤
│ NULL       │ NULL     │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 125      │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 203      │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 404      │ NULL  │ 14:08:47 │ NULL    │ NULL    │
│ 2012-03-18 │ 235      │ NULL  │ 14:07:25 │ NULL    │ NULL    │
│ 2012-03-18 │ 913      │ NULL  │ 14:07:58 │ NULL    │ NULL    │
│ NULL       │ NULL     │ NULL  │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │ 203      │ NULL  │ 15:09:30 │ NULL    │ NULL    │
│ 2012-03-19 │ 537      │ NULL  │ 15:11:03 │ NULL    │ NULL    │
│ 2012-03-19 │ 613      │ NULL  │ NULL     │ NULL    │ NULL    │
└────────────┴──────────┴───────┴──────────┴─────────┴─────────┘

And using my custom reader defined in my earlier comment:

In [96]: excel_hack = read_excel("test.xlsx", connection=con, schema = geo.try_cast(castings).schema(), sheet_range = "A1:F11")
In [97]: excel_hack
Out[97]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type  ┃ int_type ┃ time0    ┃ time1    ┃ bool0   ┃ bool1   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date       │ int64    │ time     │ time     │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 125      │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 203      │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-18 │ 404      │ 14:05:00 │ 14:08:47 │ True    │ True    │
│ 2012-03-18 │ 235      │ 14:05:00 │ 14:07:25 │ True    │ True    │
│ 2012-03-18 │ 913      │ 14:05:00 │ 14:07:58 │ True    │ True    │
│ NULL       │ NULL     │ NULL     │ NULL     │ NULL    │ NULL    │
│ 2012-03-19 │ 203      │ 15:07:00 │ 15:09:30 │ True    │ False   │
│ 2012-03-19 │ 537      │ 15:07:00 │ 15:11:03 │ True    │ True    │
│ 2012-03-19 │ 613      │ 15:07:00 │ NULL     │ True    │ True    │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘


@NickCrews
Contributor

My first thought is that we should try to fix this upstream in duckdb. I would hope that in duckdb, FROM read_xlsx(all_varchar=true) and FROM st_read(open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING']) should result in the same data.

As an alternative this maybe a support issue for the DuckDB native excel extension but I assume their implementation choices were for robustness even if the all_varchar = True renders data that is less malleable than what is returned from read_geo with FIELD_TYPES = STRING

Are you saying you don't think this is a good idea? Can you explain this further?

@omdaniel
Copy link
Author

omdaniel commented Apr 2, 2025

I would think that FROM read_xlsx(all_varchar=true) and FROM st_read(open_options=['HEADERS=AUTO','FIELD_TYPES=STRING']) would produce the same output; however, I do not know whether producing the same output was a goal of the DuckDB excel extension, as the st_read implementation may handle many edge cases while read_xlsx perhaps expects the Excel file to be formatted a specific way (just conjecture, as I haven't communicated with the DuckDB folks). The DuckDB API does provide a more structured option where the user creates the table with field types, followed by COPY ..., and this seems to wrangle the poorly formatted Excel file into a usable form consistent with st_read (using STRINGS) followed by try_cast.

@NickCrews
Contributor

@omdaniel can you file an issue with duckdb and link to it here? Let's try to solve it there before we start to be hacky on our end.

@omdaniel (Author)

omdaniel commented Apr 4, 2025

I opened an issue with the DuckDB excel extension: #40

Projects
Status: backlog
Development

No branches or pull requests

3 participants