Skip to content

nimble/phy/nrf5x: fix race conditions in TIMER0 setup #2036

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jwise
Copy link

@jwise jwise commented May 16, 2025

In the nRF5x LL PHY, we use TIMER0 to trigger various Tx/Rx sequencing tasks. The PHY attempted to recover from some cases in which it was 'late' setting up TIMER0; in theory, these ought not have happened, but if interrupt priorities do not provide absolute priority for the LL (for instance, in RTOS systems that have other priorities, or scheduling systems that have incompatible interrupt priority in order to call into RTOS queue management routines), then the PHY can indeed end up failing to schedule a transmit / receive window in time.

In some cases, the recovery process for these had race conditions, which could result in the LL state machine locking up (the end-of-tx interrupt would never fire). One race condition was extremely short (a few instructions wide; if the timer was set, but the timer triggered before the appropriate PPI could be configured); the other race condition was a microsecond wide (comparison to see if the timer had already triggered needed the current time to be greater in order to fail, but if it triggered within that microsecond, the current time would be equal). There are also other potential issues around timer event reset and check that we did not replicate, but did notice by inspection.

These cases appeared in a few different forms, so we factor out a "safe reset" routine to ensure that a timer will not trigger (to be used before setting up a PPI), and then we also factor out a routine to check whether a timer has been missed.

This change is verified to fix some observed stability issues in PebbleOS under heavy load.

cc: @gmarull @crc32 @sjp4 @ericmigi

In the nRF5x LL PHY, we use TIMER0 to trigger various Tx/Rx sequencing
tasks.  The PHY attempted to recover from some cases in which it was 'late'
setting up TIMER0; in theory, these ought not have happened, but if
interrupt priorities do not provide absolute priority for the LL (for
instance, in RTOS systems that have other priorities, or scheduling systems
that have incompatible interrupt priority in order to call into RTOS queue
management routines), then the PHY can indeed end up failing to schedule a
transmit / receive window in time.

In some cases, the recovery process for these had race conditions, which
could result in the LL state machine locking up (the end-of-tx interrupt
would never fire).  One race condition was extremely short (a few
instructions wide; if the timer was set, but the timer triggered before the
appropriate PPI could be configured); the other race condition was a
microsecond wide (comparison to see if the timer had already triggered
needed the current time to be greater in order to fail, but if it triggered
within that microsecond, the current time would be equal).  There are also
other potential issues around timer event reset and check that we did not
replicate, but did notice by inspection.

These cases appeared in a few different forms, so we factor out a "safe
reset" routine to ensure that a timer will not trigger (to be used before
setting up a PPI), and then we also factor out a routine to check whether a
timer has been missed.

This change is verified to fix some observed stability issues in PebbleOS
under heavy load.

Signed-off-by: Joshua Wise <joshua@accelerated.tech>
@github-actions github-actions bot added the size/S Small PR label May 16, 2025
@sjanc sjanc requested a review from andrzej-kaczmarek May 16, 2025 06:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/S Small PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant