nimble/phy/nrf5x: fix race conditions in TIMER0 setup #2036
+45
−24
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
In the nRF5x LL PHY, we use TIMER0 to trigger various Tx/Rx sequencing tasks. The PHY attempted to recover from some cases in which it was 'late' setting up TIMER0; in theory, these ought not have happened, but if interrupt priorities do not provide absolute priority for the LL (for instance, in RTOS systems that have other priorities, or scheduling systems that have incompatible interrupt priority in order to call into RTOS queue management routines), then the PHY can indeed end up failing to schedule a transmit / receive window in time.
In some cases, the recovery process for these had race conditions, which could result in the LL state machine locking up (the end-of-tx interrupt would never fire). One race condition was extremely short (a few instructions wide; if the timer was set, but the timer triggered before the appropriate PPI could be configured); the other race condition was a microsecond wide (comparison to see if the timer had already triggered needed the current time to be greater in order to fail, but if it triggered within that microsecond, the current time would be equal). There are also other potential issues around timer event reset and check that we did not replicate, but did notice by inspection.
These cases appeared in a few different forms, so we factor out a "safe reset" routine to ensure that a timer will not trigger (to be used before setting up a PPI), and then we also factor out a routine to check whether a timer has been missed.
This change is verified to fix some observed stability issues in PebbleOS under heavy load.
cc: @gmarull @crc32 @sjp4 @ericmigi