Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ATTENTION: Node x (yz) did not complete setup in 30 minutes. - Nodes stuck in setup #807

Closed
agners opened this issue Jul 17, 2024 · 2 comments · Fixed by #809
Closed

ATTENTION: Node x (yz) did not complete setup in 30 minutes. - Nodes stuck in setup #807

agners opened this issue Jul 17, 2024 · 2 comments · Fixed by #809

Comments

@agners
Copy link
Collaborator

agners commented Jul 17, 2024

With #623 an error got introduced if our node setup takes longer than expected.

What seems to happen is that the SDK is seemingly stuck during the initial read/subscription setup.

In a case at hand (Meross Smart Plug, after power cycling it because it fell of the WiFi network), with SDK Progress logging enabled, the following sequence of events happened:

  1. Controller was running, the device was controllable before, but the Meross Smart Plug seemingly crashed/lost connection to the WiFi network.

  2. Restart of the Smart Plug

  3. The device got rediscovered

2024-07-17 09:12:31.773 (MainThread) INFO [matter_server.server.device_controller.mdns] <Node:176> Re-discovered on mDNS
  1. Node Setup got triggered, an existing secure session got found:
2024-07-17 09:12:31.773 (MainThread) INFO [matter_server.server.device_controller] <Node:176> Setting-up node...
2024-07-17 09:12:31.773 (MainThread) DEBUG [matter_server.server.sdk] <Node:176> Attempting to establish CASE session... (attempt 1 of 2)
2024-07-17 09:12:31.774 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] Found an existing secure session to [1:00000000000000B0]!
2024-07-17 09:12:31.774 (MainThread) DEBUG [matter_server.server.sdk] <Node:176> Establishing CASE session took 0.0 seconds

  1. Event subscription gets created (note that the interview/initial read did not get triggered here, we straight moved to subscription establishment). However, this subscription establishment failed:
2024-07-17 09:12:31.774 (MainThread) INFO [matter_server.server.device_controller] <Node:176> Setting up attributes and events subscription.
2024-07-17 09:12:31.774 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] Found an existing secure session to [1:00000000000000B0]!
2024-07-17 09:12:31.775 (Dummy-2) CHIP_PROGRESS [chip.native.EM] <<< [E:11207i S:40770 M:12271413] (S) Msg TX to 1:00000000000000B0 [8A89] [UDP:[fe80::4ae1:e9ff:fede:943e%eno2]:5540] --- Type 0001:03 (IM:SubscribeRequest)
2024-07-17 09:12:32.395 (Dummy-2) CHIP_PROGRESS [chip.native.EM] Retransmitting MessageCounter:12271413 on exchange 11207i with Node: <00000000000000B0, 1> Send Cnt 1
2024-07-17 09:12:33.062 (Dummy-2) CHIP_PROGRESS [chip.native.EM] Retransmitting MessageCounter:12271413 on exchange 11207i with Node: <00000000000000B0, 1> Send Cnt 2
2024-07-17 09:12:33.147 (Dummy-2) CHIP_PROGRESS [chip.native.EM] >>> [E:1737r S:40768 M:177846015] (S) Msg RX from 1:00000000000000EC [8A89] --- Type 0001:05 (IM:ReportData)
2024-07-17 09:12:33.147 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] Refresh LivenessCheckTime for 64224 milliseconds with SubscriptionId = 0x2d130ba9 Peer = 01:00000000000000EC
2024-07-17 09:12:33.147 (Dummy-2) CHIP_PROGRESS [chip.native.EM] <<< [E:1737r S:40768 M:163732442 (Ack:177846015)] (S) Msg TX to 1:00000000000000EC [8A89] [UDP:[fe80::eeda:3bff:fea5:7094%eno2]:5540] --- Type 0001:01 (IM:StatusResponse)
2024-07-17 09:12:33.152 (Dummy-2) CHIP_PROGRESS [chip.native.EM] >>> [E:1737r S:40768 M:177846016 (Ack:163732442)] (S) Msg RX from 1:00000000000000EC [8A89] --- Type 0000:10 (SecureChannel:StandaloneAck)
2024-07-17 09:12:33.943 (Dummy-2) CHIP_PROGRESS [chip.native.EM] Retransmitting MessageCounter:12271413 on exchange 11207i with Node: <00000000000000B0, 1> Send Cnt 3
2024-07-17 09:12:35.427 (Dummy-2) CHIP_PROGRESS [chip.native.EM] Retransmitting MessageCounter:12271413 on exchange 11207i with Node: <00000000000000B0, 1> Send Cnt 4
2024-07-17 09:12:37.795 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:12271413 on exchange 11207i with Node: <00000000000000B0, 1> sendCount: 4 max retries: 4

The reason for that is simply that the CASE session was no longer valid on Meross Smart Plug side. This is expected, since the Meross smart plug power cycled and does not support persistent CASE subscriptions.

  1. The CASE session is considered dead, and the Controller tries too establish a new CASE session:
2024-07-17 09:12:40.816 (Dummy-2) CHIP_PROGRESS [chip.native.SC] SecureSession[0x7f32a0064940, LSID:40770]: State change 'kActive' --> 'kDefunct'
2024-07-17 09:12:40.816 (Dummy-2) CHIP_ERROR [chip.native.DMG] Time out! failed to receive report data from Exchange: 11207i with Node: <00000000000000B0, 1>
2024-07-17 09:12:40.816 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] Will try to resubscribe to 01:00000000000000B0 at retry index 0 after 0ms due to error src/app/ReadClient.cpp:682: CHIP Error 0x00000032: Timeout
2024-07-17 09:12:40.816 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] OnResubscribeTimerCallback: ForceCASE = 1
2024-07-17 09:12:40.816 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] Trying to establish a CASE session for subscription

Now this is expected to work. However, for whatever reason the device seem to have crashed right away. Maybe because we (and other controllers too, I have it connected to multiple Controllers) tried to communicate with an invalid CASE session, but this is speculation. The Meross devices are Matter 1.0 devices and are known to be a bit unstable. Anyhow, what the SDK now tries to do is to reestablish CASE session continuously.

  1. Exactly 45s later, the mDNS operational discovery fails, leading to this error:
2024-07-17 09:13:25.822 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] Checking node lookup status after 45006 ms
2024-07-17 09:13:25.822 (Dummy-2) CHIP_ERROR [chip.native.DIS] OperationalSessionSetup[1:00000000000000B0]: operational discovery failed: src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:119: CHIP Error 0x00000032: Timeout
2024-07-17 09:13:25.822 (Dummy-2) CHIP_ERROR [chip.native.DMG] Failed to establish CASE for re-subscription with error 'src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:119: CHIP Error 0x00000032: Timeout'
2024-07-17 09:13:25.822 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] Will try to resubscribe to 01:00000000000000B0 at retry index 1 after 8913ms due to error src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:119: CHIP Error 0x00000032: Timeout

From there the cycle continuous forever, the Controller attempts to find the device for 45s (in this case after a retry delay of 8.9s):

2024-07-17 09:14:19.737 (Dummy-2) CHIP_ERROR [chip.native.DIS] OperationalSessionSetup[1:00000000000000B0]: operational discovery failed: src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:119: CHIP Error 0x00000032: Timeout
2024-07-17 09:14:19.737 (Dummy-2) CHIP_ERROR [chip.native.DMG] Failed to establish CASE for re-subscription with error 'src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:119: CHIP Error 0x00000032: Timeout'
2024-07-17 09:14:19.737 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] Will try to resubscribe to 01:00000000000000B0 at retry index 2 after 8131ms due to error src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:119: CHIP Error 0x00000032: Timeout

The logs does not show the Previous subscription failed with Error: %s, re-subscribing in %s ms... logs, this means resubscription_attempted does not get called. The reason is that SetResubscriptionAttemptedCallback() has not been called at this point yet. In other words, because the very initial read already failed, the subscription hasn't been completely setup at this point.

Eventually, after I restarted the device again, the setup completed (no Controller restart was needed):

2024-07-17 10:41:13.146 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] OnResubscribeTimerCallback: ForceCASE = 1
2024-07-17 10:41:13.146 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] Trying to establish a CASE session for subscription
2024-07-17 10:41:13.188 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] UDP:[fe80::4ae1:e9ff:fede:943e%eno2]:5540: new best score: 7
2024-07-17 10:41:13.189 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] UDP:[xxxxx:10:4ae1:e9ff:fede:943e%eno2]:5540: new best score: 6
2024-07-17 10:41:13.189 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] UDP:[fd7a:8aac:95ec:10:4ae1:e9ff:fede:943e%eno2]:5540: new best score: 5
2024-07-17 10:41:13.190 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] UDP:192.168.80.105%eno2:5540: new best score: 2
2024-07-17 10:41:13.190 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] Checking node lookup status after 44 ms
2024-07-17 10:41:13.190 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] Keeping DNSSD lookup active
2024-07-17 10:41:13.346 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] Checking node lookup status after 200 ms
2024-07-17 10:41:13.347 (Dummy-2) CHIP_PROGRESS [chip.native.SC] Initiating session on local FabricIndex 1 from 0x000000000001B669 -> 0x00000000000000B0
2024-07-17 10:41:13.348 (Dummy-2) CHIP_PROGRESS [chip.native.EM] <<< [E:11520i S:0 M:15961213] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[fe80::4ae1:e9ff:fede:943e%eno2]:5540] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
2024-07-17 10:41:13.348 (Dummy-2) CHIP_PROGRESS [chip.native.SC] Sent Sigma1 msg
2024-07-17 10:41:13.394 (Dummy-2) CHIP_PROGRESS [chip.native.EM] >>> [E:11520i S:0 M:230182639 (Ack:15961213)] (U) Msg RX from 0:0000000000000000 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
2024-07-17 10:41:13.558 (Dummy-2) CHIP_PROGRESS [chip.native.EM] >>> [E:11520i S:0 M:230182640 (Ack:15961213)] (U) Msg RX from 0:0000000000000000 [0000] --- Type 0000:33 (SecureChannel:CASE_Sigma2Resume)
2024-07-17 10:41:13.558 (Dummy-2) INFO [chip.storage] DeleteSdkKey: g/s/wINMyt++PwH8HyNI7F7xUg==
2024-07-17 10:41:13.558 (Dummy-2) INFO [chip.storage] Committing...
2024-07-17 10:41:13.559 (Dummy-2) INFO [chip.storage] SetSdkKey: f/1/s/00000000000000B0 = b'\x150\x03\x10\xe1*\xca\xb0{\x1f\xd0Z\xf3\xdfQD\x1e\xf7k\xd10\x04 \x01M\x17!\xe6}\x1a\xba\x0fD}k\xe1\xb9@\xe8\x8e\xe2\x96\xcb\x19\xbd\xb0\x95F\x1a\xc9\xca7,\x86\x9a0\x05\x0c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18'
2024-07-17 10:41:13.559 (Dummy-2) INFO [chip.storage] Committing...
2024-07-17 10:41:13.559 (Dummy-2) INFO [chip.storage] SetSdkKey: g/s/4SrKsHsf0Frz31FEHvdr0Q== = b'\x15$\x01\x01$\x02\xb0\x18'
2024-07-17 10:41:13.559 (Dummy-2) INFO [chip.storage] Committing...
2024-07-17 10:41:13.559 (Dummy-2) CHIP_PROGRESS [chip.native.EM] <<< [E:11520i S:0 M:15961214 (Ack:230182640)] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[fe80::4ae1:e9ff:fede:943e%eno2]:5540] --- Type 0000:40 (SecureChannel:StatusReport)
2024-07-17 10:41:13.559 (Dummy-2) CHIP_PROGRESS [chip.native.SC] SecureSession[0x7f32a05795e0, LSID:40780]: State change 'kEstablishing' --> 'kActive'
2024-07-17 10:41:13.559 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] HandleDeviceConnected
2024-07-17 10:41:13.559 (Dummy-2) CHIP_PROGRESS [chip.native.EM] <<< [E:11521i S:40780 M:210203610] (S) Msg TX to 1:00000000000000B0 [8A89] [UDP:[fe80::4ae1:e9ff:fede:943e%eno2]:5540] --- Type 0001:03 (IM:SubscribeRequest)
...
2024-07-17 10:41:19.262 (Dummy-2) CHIP_PROGRESS [chip.native.EM] Retransmitting MessageCounter:210203615 on exchange 11521i with Node: <00000000000000B0, 1> Send Cnt 2
2024-07-17 10:41:19.420 (Dummy-2) CHIP_PROGRESS [chip.native.EM] >>> [E:11521i S:40780 M:47819887 (Ack:210203615)] (S) Msg RX from 1:00000000000000B0 [8A89] --- Type 0001:05 (IM:ReportData)
2024-07-17 10:41:19.422 (Dummy-2) CHIP_PROGRESS [chip.native.EM] <<< [E:11521i S:40780 M:210203616 (Ack:47819887)] (S) Msg TX to 1:00000000000000B0 [8A89] [UDP:[fe80::4ae1:e9ff:fede:943e%eno2]:5540] --- Type 0001:01 (IM:StatusResponse)
2024-07-17 10:41:19.487 (Dummy-2) CHIP_PROGRESS [chip.native.EM] >>> [E:11521i S:40780 M:47819888 (Ack:210203615)] (S) Msg RX from 1:00000000000000B0 [8A89] --- Type 0000:10 (SecureChannel:StandaloneAck)
2024-07-17 10:41:19.552 (Dummy-2) CHIP_PROGRESS [chip.native.EM] >>> [E:11521i S:40780 M:47819889 (Ack:210203615)] (S) Msg RX from 1:00000000000000B0 [8A89] --- Type 0000:10 (SecureChannel:StandaloneAck)
2024-07-17 10:41:19.667 (Dummy-2) CHIP_PROGRESS [chip.native.EM] >>> [E:11521i S:40780 M:47819890 (Ack:210203616)] (S) Msg RX from 1:00000000000000B0 [8A89] --- Type 0001:05 (IM:ReportData)
2024-07-17 10:41:19.696 (Dummy-2) CHIP_PROGRESS [chip.native.EM] <<< [E:11521i S:40780 M:210203617 (Ack:47819890)] (S) Msg TX to 1:00000000000000B0 [8A89] [UDP:[fe80::4ae1:e9ff:fede:943e%eno2]:5540] --- Type 0001:01 (IM:StatusResponse)
2024-07-17 10:41:19.788 (Dummy-2) CHIP_PROGRESS [chip.native.EM] >>> [E:11521i S:40780 M:47819891 (Ack:210203617)] (S) Msg RX from 1:00000000000000B0 [8A89] --- Type 0001:04 (IM:SubscribeResponse)
2024-07-17 10:41:19.788 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] SubscribeResponse is received
2024-07-17 10:41:19.788 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] Subscription established with SubscriptionID = 0xc9a8e385 MinInterval = 0s MaxInterval = 60s Peer = 01:00000000000000B0
2024-07-17 10:41:19.788 (Dummy-2) CHIP_PROGRESS [chip.native.DMG] Refresh LivenessCheckTime for 64224 milliseconds with SubscriptionId = 0xc9a8e385 Peer = 01:00000000000000B0
2024-07-17 10:41:19.788 (Dummy-2) CHIP_PROGRESS [chip.native.EM] <<< [E:11521i S:40780 M:210203618 (Ack:47819891)] (S) Msg TX to 1:00000000000000B0 [8A89] [UDP:[fe80::4ae1:e9ff:fede:943e%eno2]:5540] --- Type 0000:10 (SecureChannel:StandaloneAck)
2024-07-17 10:41:19.789 (MainThread) INFO [matter_server.server.device_controller] <Node:176> Subscription succeeded with report interval [0, 60]
@agners
Copy link
Collaborator Author

agners commented Jul 17, 2024

The behavior stems from the auto-resubscribe flag. The code as it is today does not make a difference between the first read attempt (on subscription establishment) or subsequent re-subscriptions. In both cases OnResubscriptionNeeded is being called, which essentially suppresses any errors and simply calls OnResubscriptionNeeded(). This goes on forever.

See: https://github.com/project-chip/connectedhomeip/blob/v1.3.0.0/src/controller/python/chip/clusters/attribute.cpp#L151C16-L162

Now this is problematic since we do not have a handle for the subscription at that point. So from Python side, the async read call is just taking forever. In the SDK, the AsyncReadTransaction waits forever for the (initial) subscription to complete, which just does not happen.

We could use a asyncio.wait_for(), like so:

        try:
            sub: Attribute.SubscriptionTransaction = (
                await asyncio.wait_for(self._chip_device_controller.read_attribute(
                    node_id,
                    [()],
                    events=[("*", 1)],
                    return_cluster_objects=False,
                    report_interval=(interval_floor, interval_ceiling),
                    auto_resubscribe=True,
                )
                , 120)
            )
        except asyncio.TimeoutError:
            node_logger.error("Subscription attempt timed out!!!")
            raise

This would allow us to continue and try again.

However, the subscription will linger around. This is because the AsyncReadTransaction is manually incremented (since we pass it off to the native side), and hence even though the call is aborted due to the timeout, the native code will continue to try to establish that session, forever.

Sidenote: It seems that such a timeout bubbles up into the task, leading to unhandled task exception shown once exiting the server (this is modified code, but maybe it would be better to catch all exceptions in _setup_node() 🤔 ):

2024-07-17 12:36:42.669 (MainThread) ERROR [matter_server.server] Error doing task: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/home/sag/projects/home-assistant-libs/python-matter-server/matter_server/server/sdk.py", line 273, in read_attribute
    result = await self._chip_controller.Read(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sag/projects/home-assistant-libs/python-matter-server/venv/lib/python3.12/site-packages/chip/ChipDeviceCtrl.py", line 1480, in Read
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sag/projects/home-assistant-libs/python-matter-server/matter_server/server/device_controller.py", line 1379, in _setup_node
    await self._subscribe_node(node_id)
  File "/home/sag/projects/home-assistant-libs/python-matter-server/matter_server/server/device_controller.py", line 1231, in _subscribe_node
    await asyncio.wait_for(self._chip_device_controller.read_attribute(
  File "/usr/lib/python3.12/asyncio/tasks.py", line 519, in wait_for
    async with timeouts.timeout(timeout):
  File "/usr/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

@agners
Copy link
Collaborator Author

agners commented Jul 17, 2024

With PR project-chip/connectedhomeip#34370 the behavior will change: If the initial subscription attempt fails, then the Read() will return immediately.

agners added a commit that referenced this issue Jul 18, 2024
This fixes Subscription issues. Mainly it avoids being stuck in
subscription setup if there is a communication issue with the
device.

Fixes: #807
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant