Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: History caching #2345

Merged
merged 1 commit into from
Mar 19, 2025
Merged

Conversation

dhruvan2006
Copy link
Contributor

@dhruvan2006 dhruvan2006 commented Mar 6, 2025

Changes

  • history() now only caches data if the period is max
  • Renamed _history to _max_history to ensure clarity
  • Added a test to ensure proper behavior for chained history calls

Closes #2344

@ValueRaider
Copy link
Collaborator

Not keen on default max.

Suggestions:

  • add new argument period to get_splits and get_dividends, default 2y
  • replace self._max_history = None with self._history_cache = {}, dict key is (interval, period)
  • move cache R/W to a new middle function, so flow is: get_splits() -> self._history_cache() -> self.history()

Hope that makes sense

@R5dan
Copy link
Contributor

R5dan commented Mar 6, 2025

@ValueRaider what about if you have different days? I wouldn't want the cache for data today to be returned for the data yesterday or a week ago etc

@ValueRaider
Copy link
Collaborator

@ValueRaider what about if you have different days? I wouldn't want the cache for data today to be returned for the data yesterday or a week ago etc

yfinance already does via cache_get that and no one complains. This is memory cache not persistent.

@dhruvan2006
Copy link
Contributor Author

dhruvan2006 commented Mar 6, 2025

Not keen on default max.

I was unsure on whether I should change default behavior? People using ticker.splits ticker.dividends might have code break cause they relied on the default being max

I don't think I changed default behavior in this PR

@ValueRaider
Copy link
Collaborator

@dhruvan2006 good point, I missed the default is already max. max it is

@JanMkl
Copy link
Contributor

JanMkl commented Mar 7, 2025

@dhruvan2006 if we want to refactor the caching behaviour, I believe it is not enough to create a history_max as also the metadata is different for different time periods. Exchange schedules (tradingPeriods) are only returned in metadata for intraday and that is why get_history_metadata depends on intraday data.

Perhaps a more future proof change would be to refactor the structure so that _history is actually an array containing elements named data and metadata. Array keys could be the valid time periods.

But as said this is a larger refactor. For a simple solution just addressing #2344 I created a PR #2346 but if this rework is completed both for the data and metadata then my PR can be discarded.

@ValueRaider
Copy link
Collaborator

ValueRaider commented Mar 17, 2025

@dhruvan2006 bump #2345 (comment)

Would be nice to get this or #2346 merged soon.

@dhruvan2006
Copy link
Contributor Author

dhruvan2006 commented Mar 18, 2025

@ValueRaider Implemented your data flow using _get_history_cache(). history() does no caching now.

Unsure whether prepost=True matters for fetching metadata. Can someone clarify? It doesn't appear so from my testing.

@dhruvan2006 dhruvan2006 changed the title Fix: Refactor history to only cache 'max' period Fix: History caching Mar 18, 2025
@ValueRaider
Copy link
Collaborator

Unsure whether prepost=True matters for fetching metadata

It does, it adds more columns to dat.history_metadata['tradingPeriods']:

                                          pre_start                   pre_end                     start                       end                post_start                  post_end
Date                                                                                                                                                                                 
2025-03-11 00:00:00-04:00 2025-03-11 04:00:00-04:00 2025-03-11 09:30:00-04:00 2025-03-11 09:30:00-04:00 2025-03-11 16:00:00-04:00 2025-03-11 16:00:00-04:00 2025-03-11 20:00:00-04:00
...

@dhruvan2006
Copy link
Contributor Author

It does, it adds more columns to dat.history_metadata['tradingPeriods']

Cool. I added it as a cache_key

@ValueRaider
Copy link
Collaborator

ValueRaider commented Mar 18, 2025

bug with cache_key in _get_history_cache

@dhruvan2006
Copy link
Contributor Author

bug with cache_key in _get_history_cache

what is it?

@JanMkl
Copy link
Contributor

JanMkl commented Mar 18, 2025

Is it intentional that calling history() does not store the fetched results in _history_cache?

Similarly history() doesn't try to utilise what is in the _history_cache instead goes and fetches directly from Yahoo regardless what we have in _history_cache?

@ValueRaider
Copy link
Collaborator

@JanMkl yes, keep scope focused on these get_ functions.

There can be a debate whether this is needed at all - yfinance started with self._history but now it has

yfinance/yfinance/data.py

Lines 413 to 415 in fa3094d

@lru_cache_freezeargs
@lru_cache(maxsize=cache_maxsize)
def cache_get(self, url, user_agent_headers=None, params=None, proxy=None, timeout=30):

@JanMkl
Copy link
Contributor

JanMkl commented Mar 19, 2025

lgtm

@ValueRaider
Copy link
Collaborator

ValueRaider commented Mar 19, 2025

Just test whether the cache is being hit with yf.enable_debug_mode(). It's obvious, surprised you missed it.

@dhruvan2006
Copy link
Contributor Author

It's obvious, surprised you missed it.

Fixed

@ValueRaider ValueRaider merged commit f208f09 into ranaroussi:dev Mar 19, 2025
2 checks passed
@ValueRaider ValueRaider mentioned this pull request Mar 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants