Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: .mutate() on wide tables is slow #10956

Open
jc-5s opened this issue Mar 7, 2025 · 4 comments
Open

perf: .mutate() on wide tables is slow #10956

jc-5s opened this issue Mar 7, 2025 · 4 comments
Labels
performance Issues related to ibis's performance

Comments

@jc-5s
Copy link

jc-5s commented Mar 7, 2025

I'm trying to trace through the cause for an expression (used for an ETL process) that takes around 5 seconds to construct.

The expression involves building about 100 subtables from a main table (which has ~100 columns), each subtable contains 4-10 columns. The 100 subtables are then concatenated via a union operation

Constructing the 100 subtables takes almost all the expression building time. Each subtable involves a mutate on the main table, and then a 2nd mutate, so there are ~200 calls to Table.mutate made in building the 100 subtables. Key pain points within this process are all within ibis expression building code, in particular DerefMap.from_targets and DerefMap.dereference which are used internally by Expr.bind and Table.select, in turn used by Table.mutate

Are there any known places within these functions where optimisations might be possible (memoisation / different data structure for faster lookup/search etc)?

I will try to come up with a reproducible example and follow up separately

Results from line profiling the ibis internal functions when building the expression are provided below. Note that using lprof introduces overhead so what took 5s without it takes 15s with it. Times below are all in seconds

I'm using latest release of ibis 10.2.0

Thanks!

Table.mutate

Total time: 12.6961 s
File: [python3.12/site-packages/ibis/expr/types/relations.py]
Function: mutate at line 1866

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1866                                               def mutate(self, *exprs: Sequence[ir.Expr] | None, **mutations: ir.Value) -> Table:
...
  1954       212          0.0      0.0      0.0          node = self.op()
  1955       212          3.9      0.0     30.6          values = self.bind(*exprs, **mutations)
  1956       212          0.0      0.0      0.0          values = unwrap_aliases(values)
  1957                                                   # allow overriding of fields, hence the mutation behavior
  1958       212          0.9      0.0      7.1          values = {**node.fields, **values}
  1959       212          7.9      0.0     62.3          return self.select(**values)

Expr.bind

Total time: 11.9379 s
File: [python3.12/site-packages/ibis/expr/types/relations.py]
Function: bind at line 270

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   270                                               def bind(self, *args: Any, **kwargs: Any) -> tuple[Value, ...]:
...
   289       708          0.5      0.0      3.9          values = self._fast_bind(*args, **kwargs)
   290                                                   # dereference the values to `self`
   291       708          9.1      0.0     75.9          dm = DerefMap.from_targets(self.op())
   292       708          0.0      0.0      0.0          result = []
   293     28366          0.0      0.0      0.1          for original in values:
   294     27658          2.4      0.0     19.7              value = dm.dereference(original.op()).to_expr()
   295     27658          0.0      0.0      0.4              value = value.name(original.get_name())
   296     27658          0.0      0.0      0.1              result.append(value)
   297       708          0.0      0.0      0.0          return tuple(result)

Table.select

Total time: 9.77739 s
File: [python3.12/site-packages/ibis/expr/types/relations.py]
Function: select at line 1961

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1961                                               def select(
  1962                                                   self,
  1963                                                   *exprs: ir.Value | str | Iterable[ir.Value | str],
  1964                                                   **named_exprs: ir.Value | str,
  1965                                               ) -> Table:
...
  2138       305          0.0      0.0      0.0          from ibis.expr.rewrites import rewrite_project_input
  2139                                           
  2140       305          6.8      0.0     69.7          values = self.bind(*exprs, **named_exprs)
  2141       305          0.0      0.0      0.4          values = unwrap_aliases(values)
  2142       305          0.0      0.0      0.0          if not values:
  2143                                                       raise com.IbisTypeError(
  2144                                                           "You must select at least one column for a valid projection"
  2145                                                       )
  2146                                           
  2147                                                   # we need to detect reductions which are either turned into window functions
  2148                                                   # or scalar subqueries depending on whether they are originating from self
  2149       610          0.0      0.0      0.0          values = {
  2150     26987          2.4      0.0     25.0              k: rewrite_project_input(v, relation=self.op()) for k, v in values.items()
  2151                                                   }
  2152       305          0.5      0.0      4.8          return ops.Project(self, values).to_expr()

DerefMap.from_targets

Total time: 8.47946 s
File: [python3.12/site-packages/ibis/expr/rewrites.py]
Function: from_targets at line 63

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    63                                               @classmethod
    64                                               def from_targets(cls, rels, extra=None):
...
    81       708          0.0      0.0      0.0          rels = promote_list(rels)
    82       708          0.0      0.0      0.0          mapping = defaultdict(dict)
    83      1416          0.0      0.0      0.0          for rel in rels:
    84     81185          3.1      0.0     36.3              for field in rel.fields.values():
    85    377543          0.5      0.0      5.6                  for value, distance in cls.backtrack(field):
    86    297066          0.7      0.0      7.8                      mapping[value][field] = distance
    87                                           
    88       708          0.0      0.0      0.0          subs, ambigs = {}, {}
    89    296471          0.1      0.0      0.8          for from_, to in mapping.items():
    90    295763          0.1      0.0      1.6              mindist = min(to.values())
    91    592829          0.3      0.0      3.0              minkeys = [k for k, v in to.items() if v == mindist]
    92                                                       # if all the closest fields are from the same relation, then we
    93                                                       # can safely substitute them and we pick the first one arbitrarily
    94    295763          0.4      0.0      4.6              if all(minkeys[0].relations == k.relations for k in minkeys):
    95    295763          0.2      0.0      2.0                  subs[from_] = minkeys[0]
    96                                                       else:
    97                                                           ambigs[from_] = minkeys
    98                                           
    99       708          0.0      0.0      0.0          if extra is not None:
   100                                                       subs.update(extra)
   101                                           
   102       708          3.2      0.0     38.2          return cls(rels, subs, ambigs)

DerefMap.dereference

Total time: 2.16665 s
File: [python3.12/site-packages/ibis/expr/rewrites.py]
Function: dereference at line 135

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   135                                               def dereference(self, value):
...
   151     27658          0.7      0.0     30.5          ambigs = value.find(lambda x: x in self.ambigs, filter=ops.Value)
   152     27658          0.0      0.0      0.3          if ambigs:
   153                                                       raise IbisInputError(
   154                                                           f"Ambiguous field reference {ambigs!r} in expression {value!r}"
   155                                                       )
   156     27658          1.5      0.0     69.2          return value.replace(self.subs, filter=ops.Value)
@jc-5s
Copy link
Author

jc-5s commented Mar 7, 2025

Code to reproduce below (this example is slightly simplified vs my real-life code, but still takes 2.5s vs the 5s mine takes)

import ibis
test_table = ibis.table(name='test', schema={f'field_{i}': 'int64' for i in range(100)})

def build_expr():
    for i in range(100):
        subtable = test_table.mutate(
            json_data=ibis.struct({field: test_table[field] for field in [f'field_{i}' for i in range(4)]}).cast("json"),
            test=(test_table.field_0 == 1) & (test_table.field_1 == 2) & (test_table.field_2 == 3),
            flag=ibis.literal('foo'),
        )
        
        subtable.mutate(
            literal_1=ibis.literal(1),
            literal_2=ibis.literal('foo'),
            literal_3=ibis.literal(True),
            literal_4=ibis.literal(1.0),
        )

build_expr()  # this call takes 2.5s

To profile (in a notebook/ipython session):

%load_ext line_profiler

from ibis import Table, Expr
from ibis.expr.rewrites import DerefMap
from ibis.common.grounds import Concrete
from ibis.expr.operations.relations import Relation
%lprun -u 1 -f build_expr -f DerefMap.from_targets -f DerefMap.dereference -f Table.bind -f Table.select -f Table.mutate build_expr()

@jc-5s jc-5s changed the title building large expression is slow (~5seconds) expressions built on wide tables are slow Mar 7, 2025
@jc-5s
Copy link
Author

jc-5s commented Mar 7, 2025

after a little more experimenting its clear that the number of columns in the table on which .mutate is called is a key driver

if I make the table 1000 columns wide (instead of 100) and reduce the call to mutate to something extremely trivial, the below takes over 10s to run

import ibis
test_table = ibis.table(name='test', schema={f'field_{i}': 'int64' for i in range(1000)})

def build_expr():
    for i in range(100):
        subtable = test_table.mutate(
            flag=ibis.literal('foo'),
        )

build_expr()  # this call takes 13.6s

@jc-5s jc-5s changed the title expressions built on wide tables are slow perf: .mutate() on wide tables is slow Mar 7, 2025
@cpcloud
Copy link
Member

cpcloud commented Mar 10, 2025

@jc-5s Yep, thanks for the report. I also found something similar while poking around here.

@kszucs Any chance you might be able to poke around at this a bit?

@cpcloud cpcloud added the performance Issues related to ibis's performance label Mar 10, 2025
@kszucs
Copy link
Member

kszucs commented Mar 11, 2025

I will try to, though probably just next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Issues related to ibis's performance
Projects
Status: backlog
Development

No branches or pull requests

3 participants