optimizations for GoogleEventSet, speeding up merging 20+% (#68)
- add `add_if_not_present` method to avoid computing key twice (which is quite expensive!)

  This is intended as a replacement for the following pattern (used e.g. in HPI):

  ```
  if event in emitted:
      continue
  emitted.add(event)
  yield event
  ```

  With this method, we could rewrite as:

  ```
  if emitted.add_if_not_present(event):
      yield event
  ```

  This could be introduced to HPI in a backwards-compatible way.

- use the type directly as the key, since types are hashable (a very tiny speedup, but it also feels more natural anyway)
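
One way the backwards-compatible adoption mentioned above might look in calling code is sketched here. `OldEventSet`, `NewEventSet` and `unique_events` are hypothetical names made up for illustration; they are not part of this commit or of HPI. The idea is to use `add_if_not_present` when the set provides it and fall back to the two-step check otherwise.

```python
from typing import Any, Hashable, Iterable, Iterator, Set


class OldEventSet:
    """Hypothetical stand-in for an event set that lacks add_if_not_present."""

    def __init__(self) -> None:
        self.keys: Set[Hashable] = set()

    def __contains__(self, other: Any) -> bool:
        return other in self.keys

    def add(self, other: Any) -> None:
        self.keys.add(other)


class NewEventSet(OldEventSet):
    """Hypothetical stand-in that also offers add_if_not_present."""

    def add_if_not_present(self, other: Any) -> bool:
        if other in self.keys:
            return False
        self.keys.add(other)
        return True


def unique_events(events: Iterable[Any], emitted: Any) -> Iterator[Any]:
    # Look the method up once; None means we are on an older version.
    add_if_not_present = getattr(emitted, "add_if_not_present", None)
    for event in events:
        if add_if_not_present is not None:
            # New path: a single call checks membership and inserts.
            if add_if_not_present(event):
                yield event
        else:
            # Old path: separate membership check and add.
            if event in emitted:
                continue
            emitted.add(event)
            yield event
```

With either set, `list(unique_events([1, 2, 1, 3, 2], emitted))` yields each value once in first-seen order, so callers would not need to know which version is installed.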
karlicoss authored Sep 12, 2024
1 parent a3a402a commit 5779c8d
Showing 1 changed file with 18 additions and 4 deletions.
22 changes: 18 additions & 4 deletions google_takeout_parser/merge.py
@@ -3,7 +3,7 @@
 """

 from itertools import chain
-from typing import Set, Tuple, List, Any, Optional
+from typing import Set, Tuple, List, Any, Optional, Type


 from cachew import cachew
@@ -76,8 +76,11 @@ def merge_events(*sources: CacheResults) -> CacheResults:
     )


-def _create_key(e: BaseEvent) -> Tuple[str, Any]:
-    return (type(e).__name__, e.key)
+Key = Tuple[Type[Any], Any]
+
+
+def _create_key(e: BaseEvent) -> Key:
+    return (type(e), e.key)


 # This is so that its easier to use this logic in other
@@ -88,7 +91,7 @@ class GoogleEventSet:
     """

     def __init__(self) -> None:
-        self.keys: Set[Tuple[str, Any]] = set()
+        self.keys: Set[Key] = set()

     def __contains__(self, other: BaseEvent) -> bool:
         return _create_key(other) in self.keys
@@ -98,3 +101,14 @@ def __len__(self) -> int:

     def add(self, other: BaseEvent) -> None:
         self.keys.add(_create_key(other))
+
+    def add_if_not_present(self, other: BaseEvent) -> bool:
+        """
+        Returns False if element already existed, True if it didn't and we added it.
+        More efficient than checking membership and adding separately, since we only compute key once.
+        """
+        key = _create_key(other)
+        if key in self.keys:
+            return False
+        self.keys.add(key)
+        return True
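
To make the "only compute key once" claim concrete, here is a minimal standalone sketch that counts key computations under both patterns. `CountingEvent` and `EventSet` are hypothetical stand-ins written for this illustration (the real `BaseEvent` and `GoogleEventSet` live in the package); only `_create_key` and the method bodies mirror the diff above.

```python
from typing import Any, Set, Tuple, Type

Key = Tuple[Type[Any], Any]


class CountingEvent:
    """Hypothetical event whose `key` property counts how often it is computed."""

    computed = 0

    def __init__(self, value: int) -> None:
        self.value = value

    @property
    def key(self) -> int:
        CountingEvent.computed += 1
        return self.value


def _create_key(e: Any) -> Key:
    # The class object itself is hashable, so it can be part of the key.
    return (type(e), e.key)


class EventSet:
    """Hypothetical stand-in mirroring the methods shown in the diff."""

    def __init__(self) -> None:
        self.keys: Set[Key] = set()

    def __contains__(self, other: Any) -> bool:
        return _create_key(other) in self.keys

    def add(self, other: Any) -> None:
        self.keys.add(_create_key(other))

    def add_if_not_present(self, other: Any) -> bool:
        key = _create_key(other)
        if key in self.keys:
            return False
        self.keys.add(key)
        return True


events = [CountingEvent(v) for v in (1, 2, 1, 3)]

# Two-step pattern: the key is computed for the membership check
# and again for every event that gets added.
CountingEvent.computed = 0
two_step = EventSet()
for e in events:
    if e in two_step:
        continue
    two_step.add(e)
two_step_count = CountingEvent.computed

# Single-call pattern: the key is computed exactly once per event.
CountingEvent.computed = 0
single = EventSet()
for e in events:
    single.add_if_not_present(e)
single_count = CountingEvent.computed
```

For these four events (three of them unique), the two-step loop computes the key 7 times and the single-call loop 4 times, while both sets end up with the same three keys. How much wall-clock time this saves depends on how expensive the real `key` property is.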
