Skip to content

Conversation

VynDragon
Copy link
Contributor

Clean up code and add attempts to prevent bl60x and bl70x from failing to boot

@VynDragon VynDragon marked this pull request as ready for review August 30, 2025 03:35
@zephyrbot zephyrbot requested a review from nandojve August 30, 2025 03:36
nandojve
nandojve previously approved these changes Aug 30, 2025
Copy link
Contributor

@josuah josuah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At bootup, plenty of RAM types are containing random data.
Maybe this is the same for the caches, and they need to be flushed at startup?

This might be already done as part of chip init, except this could be taking time and starting the hardware right away could end-up executing random bytes?

If so, an interrupt a bit earlier (like when systick is faster than 1000/sec) could end-up executing some region of flash that has cache not flushed yet...

Not sure what that could be really!

@josuah
Copy link
Contributor

josuah commented Sep 1, 2025

Waiting author's confirmation that all is as intended before giving the 2nd +1, to avoid an accidental merge by the release team...

josuah
josuah previously approved these changes Sep 2, 2025
Copy link
Contributor

@josuah josuah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanations.

@@ -21,6 +21,8 @@
#include <hbn_reg.h>
#include <pds_reg.h>

extern void soc_bflb_cache_invalidate(void);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache should be in another commit with explanation about why it is necessary ? Why RISC-V arch infrastructure do not handle this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the thing the commit description is about... Because there is as many riscv cache schemes that there are riscv SoCs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay no seems it need to go in drivers/cache and enable CONFIG_CACHE_MANAGEMENT...

Copy link
Contributor Author

@VynDragon VynDragon Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this for now and leave making a driver for all the caches for later?
We will need it anyway for flash access now i know that's a driver type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe i can introduce it with the DMA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted.

@@ -5,9 +5,9 @@
if SOC_SERIES_BL70X

# On BL702, BL704 and BL706, not all values work, here is a list of some that work:
# 2000, 1000, 100, 10, 1
# 1000, 100, 10, 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we changing this ? Another commit with clear explanation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also what the commit and PR is about.

Copy link
Member

@nandojve nandojve Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but you are cleaning up and add cache statements in the same commit, makes no sense to me.

Is the tick that makes the SoC not boot, the cache or both ?
Why this combination was important only now and how it worked before ?
Is there a boot concurrence ?

The above questions can not be answered when I look the commit message. What I'm try to tell, as an example, is that it seems that you fix the problem but it was a guess because the technical details are not present. This makes people questions your changes and put in doubt the solution. So, is this really a fix ?

BTW, even in the title it seems that you not 100% sure that this fix the problem.

Then, why should maintainer accept the fix ?

Copy link
Contributor Author

@VynDragon VynDragon Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that you fix the problem but it was a guess because the technical details are not present

I thought I made it obvious enough this was the case here, especially after talking about it in discord. It fixes the issue, which is critical and so reason enough, and I dont know why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, even in the title it seems that you not 100% sure that this fix the problem.

I encourage you to thoroughly test the PR... I've done my testing and as best I can tell it fixed it, except i've thought that many times so I am staying on the safe side because it randomly reappeared from seemingly unrelated things everytime.

Copy link
Contributor

@josuah josuah Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some commit jargon for "maybe" could be "improve stability"?

@nandojve Is something like this enough? Could you give an example of commit message to help us grasp what's missing?

Improve the e24 boot stability by ensuring interrupts are disabled until the SoC is ready for it.

@@ -9,19 +9,6 @@
*/

#include <zephyr/arch/cpu.h>
#include "clic.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why we still need this file after all these removals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it defines __soc_handle_irq.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain in the commit message why we can not use this, which is a week implementation.

WTEXT(__soc_handle_irq)
/*
* SOC-specific function to handle pending IRQ number generating the interrupt.
* Exception number is given as parameter via register a0.
*/
SECTION_FUNC(exception.other, __soc_handle_irq)
/* Clear exception number from CSR mip register */
li t1, 1
sll t0, t1, a0
csrrc t1, mip, t0
/* Return */
ret

I see this pattern in many SoC and then it will better have an KCONFIG to handle all those SoC exceptions in the privileged mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it's not needed. But it triggers the bug this PR is trying to fix when using the riscv-privileged .S... Which I didnt catch at first and thought it was because it needed it defined.

@VynDragon VynDragon force-pushed the e24_fix_attempt branch 2 times, most recently from 5f84dc9 to d9890b2 Compare September 2, 2025 16:51
@VynDragon VynDragon requested a review from nandojve September 2, 2025 16:51
@@ -9,19 +9,6 @@
*/

#include <zephyr/arch/cpu.h>
#include "clic.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain in the commit message why we can not use this, which is a week implementation.

WTEXT(__soc_handle_irq)
/*
* SOC-specific function to handle pending IRQ number generating the interrupt.
* Exception number is given as parameter via register a0.
*/
SECTION_FUNC(exception.other, __soc_handle_irq)
/* Clear exception number from CSR mip register */
li t1, 1
sll t0, t1, a0
csrrc t1, mip, t0
/* Return */
ret

I see this pattern in many SoC and then it will better have an KCONFIG to handle all those SoC exceptions in the privileged mode.

@@ -5,9 +5,9 @@
if SOC_SERIES_BL70X

# On BL702, BL704 and BL706, not all values work, here is a list of some that work:
# 2000, 1000, 100, 10, 1
# 1000, 100, 10, 1
Copy link
Member

@nandojve nandojve Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but you are cleaning up and add cache statements in the same commit, makes no sense to me.

Is the tick that makes the SoC not boot, the cache or both ?
Why this combination was important only now and how it worked before ?
Is there a boot concurrence ?

The above questions can not be answered when I look the commit message. What I'm try to tell, as an example, is that it seems that you fix the problem but it was a guess because the technical details are not present. This makes people questions your changes and put in doubt the solution. So, is this really a fix ?

BTW, even in the title it seems that you not 100% sure that this fix the problem.

Then, why should maintainer accept the fix ?

Remove unecessary unused code and Kconfig

Signed-off-by: Camille BAUD <mail@massdriver.space>
@VynDragon
Copy link
Contributor Author

I tried another approach based on this idea #95189 (comment) and similar previous theories, and this looks like it fixed it, this however might still be a fluke.

This should fix e24 failing to boot by ensuring
interrupts are disabled at startup

Signed-off-by: Camille BAUD <mail@massdriver.space>
Copy link

sonarqubecloud bot commented Sep 3, 2025

@VynDragon VynDragon requested review from nandojve and josuah September 3, 2025 12:00
Copy link
Contributor

@josuah josuah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting a bit before enabling IRQs seems present on other RISC-V:

DEVICE_DT_INST_DEFINE(0, vexriscv_litex_irq_init, NULL, NULL, NULL,
PRE_KERNEL_1, CONFIG_INTC_INIT_PRIORITY, NULL);

This is what the SDK seems to be doing too, in SystemInit(), disabling IRQs first:

    /* global IRQ disable */
    __disable_irq();

IIRC there is a ROM bootloader starting first? Maybe this is what enables the IRQs to begin with.

@nandojve nandojve added this to the v4.3.0 milestone Sep 3, 2025
@VynDragon
Copy link
Contributor Author

If it is the actual issue, that is likely to be the reason indeed.

@fabiobaltieri fabiobaltieri merged commit b13ec9a into zephyrproject-rtos:main Sep 3, 2025
26 checks passed
@nandojve nandojve deleted the e24_fix_attempt branch September 3, 2025 14:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants