Statistics log division by zero errors #141112
Comments
Hey there @ThomDietrich, @gjohansson-ST, mind taking a look at this issue as it has been labeled with an integration (statistics) you are listed as a code owner for? Thanks! (message by CodeOwnersMention)
Hey @unfug-at-github, that's very strange. Seems like an error I would have caught. You might be right about the link to the change of timestamps to float, but that alone would not explain how it occurs. Would you be willing to provide a PR for this? I suggest starting with a failing test case so we understand the problem.
My first thought was that the floats representing the timestamps may be less precise than the original timestamp objects. From what I was able to find, Python timestamp objects have microsecond (1/1,000,000 s) precision. A float, by definition, has no fixed decimal precision. If we measure time in seconds since 1970 (Unix time), we get a number roughly this large: (2024−1970)×365×24×60×60 = 1,702,944,000 (10 decimal digits).

I assume that HA uses the IEEE 754 single- or double-precision binary floating-point format. Single precision (23 fraction bits, ~6-7 significant decimal digits) is not even enough to represent the seconds of a current timestamp accurately. Double precision (~15-16 significant decimal digits) only barely covers the microsecond range (10 digits for the seconds plus 6 more for the microseconds). So we may already see rounding issues with values that differ only by a few microseconds, especially after doing arithmetic with these numbers.

Another question is whether the system clock even provides sufficiently high precision. Just because a datatype can store microseconds doesn't mean you can get that precision out of your system. Yet another question is what happens to the values when they are stored in the database and loaded again; the precision and the format may change there as well, so values that differ before storing may become identical after reloading. Finally, I found an interesting hint in the Python documentation of the time functions stating that subsequent calls can produce identical results.

One more thing I am not sure about is how time travels through the template sensors. Will a value triggering the re-computation of another value pass its change time on to the depending value, or will the re-computation use a fresh timestamp requested from the system clock?
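The scale of the rounding can be checked directly in Python (a standalone illustration, not code from the integration): near a present-day Unix timestamp, adjacent doubles are roughly 0.24 µs apart, so a difference of a full microsecond survives, but anything smaller is rounded away.

```python
import math

ts = 1_700_000_000.0  # a Unix timestamp from late 2023

# math.ulp gives the spacing between adjacent doubles near ts.
print(f"ulp near ts: {math.ulp(ts):.3e}")  # about 2.4e-07 seconds

# A difference of one microsecond is still representable...
assert ts + 1e-6 != ts
# ...but a tenth of a microsecond vanishes in the rounding.
assert ts + 1e-7 == ts
```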
To summarize: I don't think we can work under the general assumption that the timestamps of subsequent values will always differ. I used a small piece of Python code to prove that there can be collisions; in its output, collisions were rather frequent.
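The snippet itself did not survive in this excerpt. A hypothetical reconstruction, which rounds time.time() to milliseconds to stand in for a lower-resolution clock source (the rounding is my assumption, not necessarily what the original code did):

```python
import time

# Sample the clock repeatedly and count consecutive identical readings.
# Rounding to milliseconds mimics a clock with only ms resolution.
collisions = 0
last = round(time.time(), 3)
for _ in range(100_000):
    now = round(time.time(), 3)
    if now == last:
        collisions += 1
    last = now

print(f"{collisions} collisions in 100000 samples")
```

With millisecond resolution the tight loop produces many thousands of collisions; with the raw time.time() value they are rare but, as discussed above, not impossible.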
As a solution, we could either discard duplicates in the statistics integration, or adapt the functions computing averages etc. to deal with duplicate timestamps. The first option is probably rather simple; the second is tricky, because you have to think through all the edge cases again (e.g. the average of a single value is computed differently than the average of multiple values, so if you throw away one out of two values, you have to switch from the multi-value computation to the single-value computation).
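To illustrate why the second option is tricky, here is a hypothetical sketch of a time-weighted mean over (timestamp, value) pairs (not the integration's actual code): the single-value case and the zero-length time span each need their own branch, and the zero-span branch is exactly what is missing when the division-by-zero error appears.

```python
def time_weighted_mean(samples: list[tuple[float, float]]) -> float:
    """Average (timestamp, value) pairs, weighting each value by how
    long it was current. Guards the edge cases discussed above."""
    if len(samples) == 1:
        return samples[0][1]  # a single value has no interval to weight
    total_span = samples[-1][0] - samples[0][0]
    if total_span == 0.0:
        # duplicate timestamps: fall back to the last reported value
        # instead of dividing by a zero-length interval
        return samples[-1][1]
    acc = 0.0
    for (t0, v0), (t1, _) in zip(samples, samples[1:]):
        acc += v0 * (t1 - t0)
    return acc / total_span

print(time_weighted_mean([(0.0, 1.0), (2.0, 3.0)]))  # 1.0
print(time_weighted_mean([(5.0, 7.0), (5.0, 9.0)]))  # 9.0, no crash
```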
I think it might be the introduction of … But anyhow, I think perhaps the easiest (and probably most accurate) solution is, as mentioned, to simply drop an incoming value whose timestamp is already in the queue or has already been processed.
@gjohansson-ST if we indeed see multiple values with the exact same timestamp, I would argue for updating the queue, i.e. keeping the last reported value.
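Keeping the last reported value could be done at insertion time; a minimal sketch (hypothetical, the real queue in the integration may look different):

```python
from collections import deque

def push(queue: deque, timestamp: float, value: float) -> None:
    """Append (timestamp, value), but if the newest queued entry has the
    same timestamp, replace its value instead of adding a duplicate."""
    if queue and queue[-1][0] == timestamp:
        queue[-1] = (timestamp, value)  # keep the last reported value
    else:
        queue.append((timestamp, value))

q: deque = deque()
push(q, 100.0, 1.0)
push(q, 100.0, 2.0)  # identical timestamp: replaces the entry
push(q, 100.1, 3.0)
print(list(q))       # [(100.0, 2.0), (100.1, 3.0)]
```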
I don't know, as I'm not exactly using it myself. In theory, if a state-changed and a state-reported event come at the same time, they would have the same value, but that's a guess from my side; it should not really be possible for two changes to come with the same timestamp. Perhaps we should do an initial PR just to include some more logging, and ask @unfug-at-github to test-run it so we see which values we get, with which timestamps, coming from where, when we get these errors?
I will add some logging locally and report back what I find out. I guess that's easier than going through the whole PR process.

What I have seen in the past is that values reported by my inverter triggered several updates in a very short amount of time, and that was way before the state-reported events were added. The inverter reports all its values at regular intervals (SolarEdge Modbus integration), since they get measured in one go. Just as an example: at time t1 it reports a set of values; a little later (at time t2), it reports the next set; a template computes the sum of several of these values.

What happens is that shortly after t2, the sum is recalculated three times (once for every value that changed). sum_t2_first and sum_t2_second are visible for a very short time only, because the incoming changes from the measurement done at time t2 arrive one by one (more or less at the same time, but not exactly at the same time). So in between two measurements we see three different values, of which one is visible for 99.x% of the time and the other two appear for just a few milliseconds.

In some places I have implemented delays (wait for x secs in triggers) to prevent this, because combining values from two different measurements created some weird spikes and unwanted effects. I haven't done that in places where the double/triple re-computation didn't cause issues, though. It is not at all unlikely that the re-computation of these values can happen fast enough to end up with the exact same timestamp.
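The effect described above can be mimicked in a few lines (hypothetical sensor names and numbers, purely for illustration): each of the three incoming updates retriggers the sum template, so two short-lived intermediate sums precede the final one.

```python
# State before the new measurement arrives (hypothetical values).
state = {"sensor_a": 1.0, "sensor_b": 2.0, "sensor_c": 3.0}

# The inverter's new measurement arrives as three separate updates.
updates = [("sensor_a", 10.0), ("sensor_b", 20.0), ("sensor_c", 30.0)]

intermediate_sums = []
for name, value in updates:
    state[name] = value                            # one update lands...
    intermediate_sums.append(sum(state.values()))  # ...sum recomputed

print(intermediate_sums)  # [15.0, 33.0, 60.0] - only 60.0 is "real"
```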
You definitely can't guarantee this. It's very easy to hit identical readings if you limit the precision. For example:

import time

last_time = time.time()
for itr in range(1000000):
    this_time = time.time()
    if last_time == this_time:
        raise Exception(
            f"should not be the same: this_time={this_time} last_time={last_time} after {itr} attempts"
        )
    last_time = this_time

In most cases it doesn't happen, since the execution time of Python makes it very unlikely.
I was able to get a few more details about the issue:
As can be seen here, there is a state-reported event and a state-changed event reporting the same value with identical timestamps. This only seems to happen shortly after the startup phase (when there are only two values in the queue); I don't see similar errors at a later point, where the queue would have more values.

We don't write the current time into the queue, but the "last reported time" of the event, which, unlike the actual time, is identical (look at the timestamps provided by the log function). This means the potential issue of computations happening fast enough to get identical timestamps is not the problem here. Nevertheless, that issue may still cause problems; it might just not be that likely, because there is enough time-consuming code around it.

In the end, we could solve both issues by just keeping the last reported value whenever a value with an identical timestamp is received. If we can agree on that, I can create a corresponding PR.
One more update: I have added a little more logging, and I can confirm that these duplicates appear more or less all the time. They just don't create the division-by-zero errors (that only happens shortly after starting up). The errors are raised when the complete time range of all values is zero; as long as the queue contains two differing timestamps, the computation can be performed correctly. In some cases I even saw the same value reported three times.

What's also interesting is that sometimes the first event to report a certain value is a state-reported event, and the state-changed event follows a little later. What I haven't encountered so far, though, is two different values being reported with the same timestamp.

Here are some settings that allowed me to reproduce the effect. So far I was only able to produce the issue after restarting. My real-life templates produce the duplicated values regularly, though (of course without causing the division by zero).

Statistics sensor:
Templates:
The problem
The statistics sensor produces division by zero errors in the log. This seems to be caused by having values that have identical change timestamps.
It might be that this was caused by sensors that were updated several times in a very short time interval and the precision of the timestamps is too low to distinguish the two change timestamps (that is just a guess though). It could also be that this is something triggered by the startup phase.
I also saw that there was a recent change in the code where timestamps were replaced with floats, which might have reduced the precision of the timestamp delta calculation.
I can easily reproduce the problem, so it is not a once-in-a-lifetime exceptional case.
What version of Home Assistant Core has the issue?
core-2025.3.4
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant OS
Integration causing the issue
statistics
Link to integration documentation on our website
https://www.home-assistant.io/integrations/statistics/
Diagnostics information
No response
Example YAML snippet
Anything in the logs that might be useful for us?
Additional information
No response