
Statistics log division by zero errors #141112

Open
unfug-at-github opened this issue Mar 22, 2025 · 10 comments

Comments

@unfug-at-github
Contributor

The problem

The statistics sensor produces division by zero errors in the log. This seems to be caused by having values that have identical change timestamps.

It might be that this was caused by sensors that were updated several times in a very short time interval, where the precision of the timestamps is too low to distinguish the two change timestamps (that is just a guess though). It could also be something triggered by the startup phase.

I also saw that there was a recent change in the code where timestamps were replaced with floats, which might have reduced the precision of the timestamp delta calculation.

I can easily reproduce the problem, so it is not a once-in-a-lifetime exceptional case.

What version of Home Assistant Core has the issue?

core-2025.3.4

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

statistics

Link to integration documentation on our website

https://www.home-assistant.io/integrations/statistics/

Diagnostics information

No response

Example YAML snippet

Anything in the logs that might be useful for us?

2025-03-22 09:57:43.540 ERROR (MainThread) [homeassistant.helpers.event] Error while dispatching event for sensor.inverter_production to <Job track state_changed event ['sensor.inverter_production'] HassJobType.Callback <bound method StatisticsSensor._async_stats_sensor_state_change_listener of <entity sensor.inverter_production_avg_15s=0.0>>>
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/event.py", line 355, in _async_dispatch_entity_id_event
    hass.async_run_hass_job(job, event)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/core.py", line 940, in async_run_hass_job
    hassjob.target(*args)
    ~~~~~~~~~~~~~~^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/statistics/sensor.py", line 748, in _async_stats_sensor_state_change_listener
    self._async_handle_new_state(event.data["new_state"])
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/statistics/sensor.py", line 734, in _async_handle_new_state
    self._async_purge_update_and_schedule()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/src/homeassistant/homeassistant/components/statistics/sensor.py", line 986, in _async_purge_update_and_schedule
    self._update_value()
    ~~~~~~~~~~~~~~~~~~^^
  File "/usr/src/homeassistant/homeassistant/components/statistics/sensor.py", line 1097, in _update_value
    value = self._state_characteristic_fn(self.states, self.ages, self._percentile)
  File "/usr/src/homeassistant/homeassistant/components/statistics/sensor.py", line 142, in _stat_average_step
    return area / age_range_seconds
           ~~~~~^~~~~~~~~~~~~~~~~~~
ZeroDivisionError: float division by zero

Additional information

No response

@home-assistant

Hey there @ThomDietrich, @gjohansson-ST, mind taking a look at this issue as it has been labeled with an integration (statistics) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of statistics can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign statistics Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


statistics documentation
statistics source
(message by IssueLinks)

@ThomDietrich
Contributor

Hey @unfug-at-github, that's very strange. Seems like an error I would have caught. You might be right about the change from timestamp objects to floats, but that would not explain how ages[-1] - ages[0] is exactly 0.

Would you be willing to provide a PR for this? I suggest starting with a failing test case so we understand the problem.

@shikasta-at-git

shikasta-at-git commented Mar 24, 2025

My first thought was that the floats representing the timestamps may be less precise than the original timestamp objects.

From what I was able to find with Google, timestamp objects have a precision down to the microsecond (1/1,000,000 s) level.

A float has, by definition, no fixed decimal precision. If we measure time in seconds since 1970 (Unix time format), we get something roughly as large as this:

(2024−1970)×365×24×60×60 = 1,702,944,000 (10 decimal digits)

I assume that HA uses the IEEE 754 single- or double-precision binary floating-point format.

single precision (23 fraction bits, ~6-7 decimal digits)
double precision (52 fraction bits, ~15-16 decimal digits) => this is what a Python float is (a C double)

So a single-precision value is not even enough to represent the seconds of a current timestamp accurately. A double-precision value barely reaches the microsecond range (10 decimal digits for the seconds plus 6 more for the microseconds makes 16 significant digits, right at the limit of a double). So we could already see rounding issues with values that differ only by a few microseconds, especially after doing arithmetic with these numbers.
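A quick way to check this in Python: math.ulp gives the spacing between adjacent doubles, so we can see exactly how much precision is left at the magnitude of a current Unix timestamp.

```python
import math

# A Python float is an IEEE 754 double. math.ulp() returns the gap
# between a value and the next representable double, i.e. the best
# precision available at that magnitude.
ts = 1_742_845_345.881614  # a Unix timestamp around March 2025
print(math.ulp(ts))        # ~2.4e-07 seconds

# Microsecond differences (1e-06) are therefore still representable,
# but with little to spare: arithmetic on such values can easily
# round sub-microsecond detail away.
```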

Another question is whether the system clock even provides a sufficiently high precision for the time. Just because a datatype can store microseconds doesn't mean you can get that kind of precision out of your system.
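This can be inspected directly: Python reports the advertised resolution of the clock behind time.time() (the actually achievable resolution may be coarser still, depending on the platform).

```python
import time

# Query the wall clock that backs time.time(). The reported
# resolution varies by OS; on some systems it is much coarser
# than the microseconds a datetime object can store.
info = time.get_clock_info("time")
print(info.implementation)  # e.g. clock_gettime(CLOCK_REALTIME) on Linux
print(info.resolution)      # advertised resolution in seconds
```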

Another question is what happens with the values when they are stored in the database and loaded again. The precision and the format may change there as well, so that values may differ before storing them, but become identical after reloading.

Finally, I found an interesting hint in the Python documentation of the time functions stating that subsequent calls to the function could produce identical results.

Another thing I am not sure about is how time travels through the template sensors. Will a value triggering the re-computation of another value pass its change time on to the dependent value, or will this re-computation use a fresh timestamp requested from the system clock?

To summarize this: I don't think we can work with the general assumption that the timestamps of subsequent values will always differ.

Here is some Python code that I used to prove that there can be collisions:

import datetime
import time

for _ in range(10):
    dt1 = datetime.datetime.now()
    dt2 = datetime.datetime.now()
    dtf1 = dt1.timestamp()
    dtf2 = dt2.timestamp()
    dtd = dtf2 - dtf1
    if dtd == 0:
        print(f'datetime: {dt2} - {dt1} => {dtf2} - {dtf1} => {dtd}')
    t1 = time.time()
    t2 = time.time()
    td = t2 - t1
    if td == 0:
        print(f'time: {t2} - {t1} => {td}')

Here is an example of the output (collisions are rather frequent):


datetime: 2025-03-24 20:42:25.881614 - 2025-03-24 20:42:25.881614 => 1742845345.881614 - 1742845345.881614 => 0.0
datetime: 2025-03-24 20:42:25.881678 - 2025-03-24 20:42:25.881678 => 1742845345.881678 - 1742845345.881678 => 0.0
datetime: 2025-03-24 20:42:25.881690 - 2025-03-24 20:42:25.881690 => 1742845345.88169 - 1742845345.88169 => 0.0
datetime: 2025-03-24 20:42:25.881701 - 2025-03-24 20:42:25.881701 => 1742845345.881701 - 1742845345.881701 => 0.0
time 1742845345.8817139 - 1742845345.8817139 => 0.0
datetime: 2025-03-24 20:42:25.881725 - 2025-03-24 20:42:25.881725 => 1742845345.881725 - 1742845345.881725 => 0.0

As a solution, we could either discard duplicates in the statistics integration, or adapt the functions computing averages etc. to deal with duplicate timestamps. The first option is probably rather simple; the second is tricky, because you have to think about all the edge cases again (e.g. the average of a single value is computed differently from the average of multiple values, so if you throw away one of two values, you have to switch from the multi-value computation to the single-value computation).

@gjohansson-ST
Member

I think it might be the introduction of state reported events that causes this.

But anyhow, I think the easiest (and probably most accurate) solution is, as mentioned, to simply discard an incoming value whose timestamp is already in the queue / has already been processed.

@ThomDietrich
Contributor

@gjohansson-ST if we indeed see multiple values with the exact same timestamp, I would argue for updating the queue, i.e. keeping the last reported value.

@gjohansson-ST
Member

I don't know, as I'm not really using it myself. In theory, I guess if a change and a report arrive at the same time they would carry the same value, but that is a guess on my side; it should not really be possible for two changes to arrive with the same timestamp.

Perhaps we should do an initial PR just to add some more logging and ask @unfug-at-github to test-run it, so we see which values arrive with which timestamp, and from where, when these errors occur?
That should give us enough knowledge to provide a proper fix.

@unfug-at-github
Contributor Author

unfug-at-github commented Mar 29, 2025

I will add some logging locally and report back what I found out. I guess that's easier than going through the whole PR process.

What I have seen in the past is that values reported from my inverter triggered several updates in a very short amount of time. And that was way before the state reported events were added.

The inverter reports all its values at regular intervals (Solaredge modbus integration), since they get measured in one go.

Just as an example:
At time t1 the inverter reports
a_t1, b_t1 and c_t1

A little later (at time t2), it reports
a_t2, b_t2 and c_t2

A template computes the sum of
sum = a+b+c.

What happens is that shortly after t2, the sum will be recalculated three times (once for every value that changed):
sum_t2_first = a_t2 + b_t1 + c_t1
sum_t2_second = a_t2 + b_t2 + c_t1
sum_t2_third = a_t2 + b_t2 + c_t2

sum_t2_first and sum_t2_second will be visible for a very short time only, because the incoming changes from the measurement taken at time t2 arrive one by one (more or less at the same time, but not exactly). So between two measurements we see three different values, one of which is visible for 99.x% of the time while the other two appear only for a few milliseconds.

In some places I have implemented delays (wait for x secs in triggers) to prevent this from happening, because combining values from two different measurements created weird spikes and unwanted effects. I haven't done that where the double/triple re-computation didn't cause issues though.

It is not at all unlikely that the re-computation of these values can happen fast enough to end up with the exact same timestamp.

@bdraco
Member

bdraco commented Mar 29, 2025

To summarize this: I don't think we can work with the general assumption that the timestamps of subsequent values will always differ.

You definitely can't guarantee this. It's very easy to hit if you limit yourself to the precision of time.time():

import time

last_time = time.time()
for itr in range(1000000):
    this_time = time.time()
    if last_time == this_time:
        raise Exception(
            f"should not be the same: this_time={this_time} last_time={last_time} after {itr} attempts"
        )
    last_time = this_time

In most cases it doesn't happen since the execution time of Python makes it very unlikely.

@unfug-at-github
Contributor Author

I was able to get a few more details about the issue:

2025-03-29 16:33:34.617 ERROR (MainThread) [custom_components.statistics.sensor] got state form source state_report: 3089.0 2025-03-29 16:33:34.617160
2025-03-29 16:33:34.630 ERROR (MainThread) [custom_components.statistics.sensor] got state form source state_report: 2933.180339887499 2025-03-29 16:33:34.630237
2025-03-29 16:33:34.634 ERROR (MainThread) [custom_components.statistics.sensor] got state form source state_report: 2933.180339887499 2025-03-29 16:33:34.634394
2025-03-29 16:33:34.654 INFO (MainThread) [custom_components.hacs] Stage changed: running
2025-03-29 16:33:34.673 ERROR (MainThread) [custom_components.statistics.sensor] got state form source state_change: 3089.0 2025-03-29 16:33:34.617160
2025-03-29 16:33:34.674 ERROR (MainThread) [custom_components.statistics.sensor] age range 0 found
2025-03-29 16:33:34.617160: 3089.0
2025-03-29 16:33:34.617160: 3089.0
[31m2025-03-29 16:33:34.678 ERROR (MainThread) [homeassistant.helpers.event] Error while dispatching event for sensor.inverter_production to <Job track state_changed event ['sensor.inverter_production'] HassJobType.Callback <bound method StatisticsSensor._async_stats_sensor_state_change_listener of <entity sensor.inverter_production_avg_15s=3089.0>>>

As can be seen here, there is a state report event and a state change event reporting the same value with identical timestamps. This seems to happen only shortly after the startup phase (there are only two values in the queue), and I don't see similar errors later, when the queue has more values.

We don't write the current time into the queue, but the "last reported time" of the event, which, unlike the actual wall-clock time, is identical for both events (compare with the timestamps written by the logger itself).

This means that the potential issue of computations happening fast enough to get identical timestamps is not the problem here. Nevertheless, that issue may still cause problems; it is just less likely, because there is enough time-consuming code around it.

In the end, we could solve both issues by simply keeping the last reported value whenever a value with an identical timestamp is received.

If we can agree on that, I can create a corresponding PR.
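A minimal sketch of that idea (hypothetical code, not the actual StatisticsSensor implementation; the real queue handling and names differ): when the incoming timestamp equals the newest one in the queue, overwrite the last sample instead of appending, so the age range can never collapse to zero because of duplicates.

```python
from collections import deque

def add_state(ages: deque, states: deque, timestamp: float, value: float) -> None:
    """Add a sample, keeping only the last reported value per timestamp.

    Hypothetical sketch of the proposed fix: if the incoming timestamp
    equals the most recent one already queued, replace that sample's
    value instead of appending a duplicate timestamp.
    """
    if ages and ages[-1] == timestamp:
        states[-1] = value  # duplicate timestamp: keep last reported value
    else:
        ages.append(timestamp)
        states.append(value)

ages, states = deque(), deque()
add_state(ages, states, 1742845345.617160, 3089.0)
add_state(ages, states, 1742845345.617160, 3089.0)  # duplicate from state_report
add_state(ages, states, 1742845345.630237, 2933.18)
print(list(ages))  # only two distinct timestamps remain in the queue
```

With this in place, ages[-1] - ages[0] can only be zero when the queue genuinely holds a single sample, which the single-value code path already handles.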

@unfug-at-github
Contributor Author

unfug-at-github commented Mar 30, 2025

One more update:

I have added a little more logging, and I can confirm that these duplicates appear more or less all the time. They just don't trigger the ZeroDivisionError (that only happens shortly after startup). The error occurs when the complete time range of all values is zero; as long as the queue contains two differing timestamps, the computation can be performed correctly.

In some cases I even saw the same value reported three times.

What's also interesting is that sometimes the first event to report a certain value is a state report event, and the state change event follows a little later.

What I didn't encounter so far though, is that two different values are reported with the same timestamp.

Here are some settings that allowed me to reproduce the effect. So far I have only been able to trigger the issue after restarting. My real-life templates produce the duplicated values regularly though (of course without causing the division by zero).

Statistics sensor:

- platform: statistics
  name: "qqq_sum_avg_15s"
  entity_id: sensor.qqq_sum
  state_characteristic: average_step
  keep_last_sample: true
  max_age:
    seconds: 15

Templates:

- trigger:
    - platform: time_pattern
      seconds: "/2"
  sensor:
    - name: "qqq_trigger"
      unit_of_measurement: 't'
      availability: "true"
      state: >   
        {{ now().microsecond }}

- sensor:
    - name: "qqq_triggered_a"
      unit_of_measurement: 't'
      availability: "true"
      state: > 
        {% set value = 100 %}
        {% set value = value + ((states('sensor.qqq_trigger') | float(0)) / 1000000) %}
        {{ value }}

- sensor:
    - name: "qqq_triggered_b"
      unit_of_measurement: 't'
      availability: "true"
      state: >   
        {% set value = 200 %}
        {% set value = value + ((states('sensor.qqq_trigger') | float(0)) / 1000000) %}
        {{ value }}

- sensor:
    - name: "qqq_triggered_c"
      unit_of_measurement: 't'
      availability: "true"
      state: >   
        {% set value = 0 %}
        {% set value = value + ((states('sensor.qqq_triggered_a') | float(0)) / 1000000) %}
        {% set value = value + ((states('sensor.qqq_triggered_b') | float(0)) / 1000000) %}
        {{ value }}

- trigger:
    - platform: state
      entity_id:
        - sensor.qqq_triggered_a
        - sensor.qqq_triggered_b
        - sensor.qqq_triggered_c
      for:
        seconds: 1
- sensor:
    - name: "qqq_sum"
      unit_of_measurement: 't'
      availability: "true"
      state: >   
        {{ (states('sensor.qqq_triggered_a') | float(0)) + (states('sensor.qqq_triggered_b') | float(0)) + (states('sensor.qqq_triggered_c') | float(0)) }}
