-
Notifications
You must be signed in to change notification settings - Fork 7.9k
bflb: attempt to address e24 cores boot failures #95189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bflb: attempt to address e24 cores boot failures #95189
Conversation
fbb417c
to
297ab07
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At bootup, plenty of RAM types are containing random data.
Maybe this is the same for the caches, and they need to be flushed at startup?
This might be already done as part of chip init, except this could be taking time and starting the hardware right away could end-up executing random bytes?
If so, an interrupt a bit earlier (like when systick is faster than 1000/sec) could end-up executing some region of flash that has cache not flushed yet...
Not sure what that could be really!
Waiting author's confirmation that all is as intended before giving the 2nd +1, to avoid an accidental merge by the release team... |
297ab07
to
9da6d78
Compare
9da6d78
to
ff9443a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanations.
soc/bflb/bl60x/soc.c
Outdated
@@ -21,6 +21,8 @@ | |||
#include <hbn_reg.h> | |||
#include <pds_reg.h> | |||
|
|||
extern void soc_bflb_cache_invalidate(void); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cache should be in another commit with explanation about why it is necessary ? Why RISC-V arch infrastructure do not handle this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the thing the commit description is about... Because there is as many riscv cache schemes that there are riscv SoCs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay no seems it need to go in drivers/cache and enable CONFIG_CACHE_MANAGEMENT...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have this for now and leave making a driver for all the caches for later?
We will need it anyway for flash access now i know that's a driver type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe i can introduce it with the DMA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted.
soc/bflb/bl70x/Kconfig.defconfig
Outdated
@@ -5,9 +5,9 @@ | |||
if SOC_SERIES_BL70X | |||
|
|||
# On BL702, BL704 and BL706, not all values work, here is a list of some that work: | |||
# 2000, 1000, 100, 10, 1 | |||
# 1000, 100, 10, 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we changing this ? Another commit with clear explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also what the commit and PR is about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but you are cleaning up and add cache statements in the same commit, makes no sense to me.
Is the tick that makes the SoC not boot, the cache or both ?
Why this combination was important only now and how it worked before ?
Is there a boot concurrence ?
The above questions can not be answered when I look the commit message. What I'm try to tell, as an example, is that it seems that you fix the problem but it was a guess because the technical details are not present. This makes people questions your changes and put in doubt the solution. So, is this really a fix ?
BTW, even in the title it seems that you not 100% sure that this fix the problem.
Then, why should maintainer accept the fix ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems that you fix the problem but it was a guess because the technical details are not present
I thought I made it obvious enough this was the case here, especially after talking about it in discord. It fixes the issue, which is critical and so reason enough, and I dont know why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, even in the title it seems that you not 100% sure that this fix the problem.
I encourage you to thoroughly test the PR... I've done my testing and as best I can tell it fixed it, except i've thought that many times so I am staying on the safe side because it randomly reappeared from seemingly unrelated things everytime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some commit jargon for "maybe" could be "improve stability"?
@nandojve Is something like this enough? Could you give an example of commit message to help us grasp what's missing?
Improve the e24 boot stability by ensuring interrupts are disabled until the SoC is ready for it.
@@ -9,19 +9,6 @@ | |||
*/ | |||
|
|||
#include <zephyr/arch/cpu.h> | |||
#include "clic.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering why we still need this file after all these removals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it defines __soc_handle_irq.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain in the commit message why we can not use this, which is a week implementation.
zephyr/soc/common/riscv-privileged/soc_irq.S
Lines 21 to 34 in 36f78b5
WTEXT(__soc_handle_irq) | |
/* | |
* SOC-specific function to handle pending IRQ number generating the interrupt. | |
* Exception number is given as parameter via register a0. | |
*/ | |
SECTION_FUNC(exception.other, __soc_handle_irq) | |
/* Clear exception number from CSR mip register */ | |
li t1, 1 | |
sll t0, t1, a0 | |
csrrc t1, mip, t0 | |
/* Return */ | |
ret |
I see this pattern in many SoC and then it will better have an KCONFIG to handle all those SoC exceptions in the privileged mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, it's not needed. But it triggers the bug this PR is trying to fix when using the riscv-privileged .S... Which I didnt catch at first and thought it was because it needed it defined.
5f84dc9
to
d9890b2
Compare
@@ -9,19 +9,6 @@ | |||
*/ | |||
|
|||
#include <zephyr/arch/cpu.h> | |||
#include "clic.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain in the commit message why we can not use this, which is a week implementation.
zephyr/soc/common/riscv-privileged/soc_irq.S
Lines 21 to 34 in 36f78b5
WTEXT(__soc_handle_irq) | |
/* | |
* SOC-specific function to handle pending IRQ number generating the interrupt. | |
* Exception number is given as parameter via register a0. | |
*/ | |
SECTION_FUNC(exception.other, __soc_handle_irq) | |
/* Clear exception number from CSR mip register */ | |
li t1, 1 | |
sll t0, t1, a0 | |
csrrc t1, mip, t0 | |
/* Return */ | |
ret |
I see this pattern in many SoC and then it will better have an KCONFIG to handle all those SoC exceptions in the privileged mode.
soc/bflb/bl70x/Kconfig.defconfig
Outdated
@@ -5,9 +5,9 @@ | |||
if SOC_SERIES_BL70X | |||
|
|||
# On BL702, BL704 and BL706, not all values work, here is a list of some that work: | |||
# 2000, 1000, 100, 10, 1 | |||
# 1000, 100, 10, 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but you are cleaning up and add cache statements in the same commit, makes no sense to me.
Is the tick that makes the SoC not boot, the cache or both ?
Why this combination was important only now and how it worked before ?
Is there a boot concurrence ?
The above questions can not be answered when I look the commit message. What I'm try to tell, as an example, is that it seems that you fix the problem but it was a guess because the technical details are not present. This makes people questions your changes and put in doubt the solution. So, is this really a fix ?
BTW, even in the title it seems that you not 100% sure that this fix the problem.
Then, why should maintainer accept the fix ?
Remove unecessary unused code and Kconfig Signed-off-by: Camille BAUD <mail@massdriver.space>
d9890b2
to
871beea
Compare
I tried another approach based on this idea #95189 (comment) and similar previous theories, and this looks like it fixed it, this however might still be a fluke. |
This should fix e24 failing to boot by ensuring interrupts are disabled at startup Signed-off-by: Camille BAUD <mail@massdriver.space>
871beea
to
11f2d0f
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting a bit before enabling IRQs seems present on other RISC-V:
zephyr/drivers/interrupt_controller/intc_vexriscv_litex.c
Lines 97 to 98 in 209bfbd
DEVICE_DT_INST_DEFINE(0, vexriscv_litex_irq_init, NULL, NULL, NULL, | |
PRE_KERNEL_1, CONFIG_INTC_INIT_PRIORITY, NULL); |
This is what the SDK seems to be doing too, in SystemInit()
, disabling IRQs first:
/* global IRQ disable */
__disable_irq();
IIRC there is a ROM bootloader starting first? Maybe this is what enables the IRQs to begin with.
If it is the actual issue, that is likely to be the reason indeed. |
Clean up code and add attempts to prevent bl60x and bl70x from failing to boot