
Decouple memory allocations from aligned memory allocations #87003

Merged: 6 commits merged into zephyrproject-rtos:main on Apr 1, 2025

Conversation

@npitre (Collaborator) commented Mar 12, 2025:

When a plain memory allocation is implemented in terms of an aligned memory
allocation, a longer aligned-allocation code path with extra runtime overhead
is taken even though no alignment is necessary.

Let's rework the code so the aligned allocation code path is invoked only
when an actual aligned allocation is requested. This opens the possibility
for the linker to garbage-collect the aligning code otherwise.
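
To make the intent concrete, here is a minimal sketch of the pattern; the names (my_malloc_old, my_malloc_new, my_aligned_alloc) and the C library allocator are placeholders standing in for k_malloc()/k_aligned_alloc() and the kernel heap backend, not the actual Zephyr code:

```c
#include <stdlib.h>
#include <stddef.h>

void *my_aligned_alloc(size_t align, size_t size)
{
	/* round size up so the C11 aligned_alloc() size constraint holds */
	size = (size + align - 1) & ~(align - 1);
	return aligned_alloc(align, size);
}

/* Before: the plain allocator is only a wrapper around the aligned one,
 * so every image that calls it also drags in the aligned code path. */
void *my_malloc_old(size_t size)
{
	return my_aligned_alloc(sizeof(void *), size);
}

/* After: the plain allocator takes its own, shorter path. The aligned
 * code path is now referenced only by callers that actually request an
 * alignment, so the linker can garbage-collect it (e.g. with
 * -ffunction-sections and --gc-sections) from images that never do. */
void *my_malloc_new(size_t size)
{
	return malloc(size);
}
```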

@peter-mitsis (Collaborator) left a comment:
Just one question/comment about alignment in z_heap_aligned_alloc(). Everything else looks good to me.

Comment on lines +18 to +24:

    __ASSERT((align & (align - 1)) == 0,
             "align must be a power of 2");
@peter-mitsis (Collaborator):

Do we need a minimum alignment? I note that if align is 0, this assert still passes. I think that lib/heap/heap.c uses a minimum alignment of 1 and k_aligned_alloc() had a minimum alignment of sizeof(void *).

@npitre (Collaborator, Author) replied:

The minimum alignment from heap.c is already 4 bytes on 32-bit archs and 8 bytes on 64-bit archs.
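
As a standalone illustration of the point raised here (not the code under review): the bare power-of-two test does let 0 through, while a guarded form rejects it and still accepts the heap's 4- or 8-byte minimum:

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

/* (align & (align - 1)) == 0 also holds for align == 0, since
 * 0 & (size_t)-1 == 0. A stricter check additionally requires a
 * non-zero (or pointer-sized) minimum. */
static bool is_power_of_two(size_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

int main(void)
{
	assert((0u & (0u - 1u)) == 0u);          /* the plain test lets 0 through */
	assert(!is_power_of_two(0));             /* the guarded test rejects it */
	assert(is_power_of_two(sizeof(void *))); /* typical minimum: 4 or 8 bytes */
	return 0;
}
```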

peter-mitsis previously approved these changes Mar 12, 2025
andyross previously approved these changes Mar 12, 2025
@andyross (Collaborator) left a comment:

Seems very reasonable. Doesn't look like the impact on code size should be very high, but might be worth measuring.

cfriedt previously approved these changes Mar 13, 2025
@cfriedt (Member) left a comment:

Looks good to me. Just one suggestion, but not a blocker (use IS_POWER_OF_TWO()).

Re: measurements, I was curious about the size difference as well, and also a little curious about timing. If there are fewer conditions on a potentially hot path like this, there could be some perf gain.

Would be really great to see some memory allocation benchmarks down the road.

@npitre (Collaborator, Author) commented Mar 13, 2025:

Due to popular demand, here are some code size comparisons using the
qemu_cortex_a53 target.

tests/lib/heap_align:

   text    data     bss     dec     hex filename
  60928    4800  339966  405694   630be zephyr_before.elf
  60928    4800  339966  405694   630be zephyr_after.elf

tests/kernel/mem_heap/k_heap_api:

   text    data     bss     dec     hex filename
  61113    4704  344066  409883   6411b zephyr_before.elf
  61057    4760  344066  409883   6411b zephyr_after.elf

samples/userspace/hello_world_user:

   text    data     bss     dec     hex filename
  75478   24716  356781  456975   6f90f zephyr_before.elf
  75478   24716  356781  456975   6f90f zephyr_after.elf

So... it seems there isn't much difference as far as code size goes. If anything,
the code may be slightly smaller, but the data section grows accordingly,
probably due to additional alignment padding.

TaiJuWu previously approved these changes Mar 13, 2025
@npitre (Collaborator, Author) commented Mar 15, 2025:

And here are some perf numbers using qemu_riscv64, which I hope are
meaningful enough. These are from tests/benchmarks/sys_kernel with the first
commit applied in both cases.

Before malloc changes:

TEST CASE: malloc #1
TEST COVERAGE:
        k_malloc
Starting test. Please wait...
TEST RESULT: SUCCESSFUL
DETAILS: Average time for 1 iteration: 29890 nSec
END TEST CASE

TEST CASE: malloc #2
TEST COVERAGE:
        k_free
Starting test. Please wait...
TEST RESULT: SUCCESSFUL
DETAILS: Average time for 1 iteration: 22640 nSec
END TEST CASE

With the whole series applied:

TEST CASE: malloc #1
TEST COVERAGE:
        k_malloc
Starting test. Please wait...
TEST RESULT: SUCCESSFUL
DETAILS: Average time for 1 iteration: 25920 nSec
END TEST CASE

TEST CASE: malloc #2
TEST COVERAGE:
        k_free
Starting test. Please wait...
TEST RESULT: SUCCESSFUL
DETAILS: Average time for 1 iteration: 22640 nSec
END TEST CASE

In short, k_malloc() performance improved by about 13% ((29890 - 25920) / 29890 ≈ 13.3%), while k_free() is unchanged.

Nicolas Pitre added 2 commits March 15, 2025 17:02
Useful to evaluate malloc performance changes.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Let's avoid division and modulus operations as they're costly... and even
more so when they're unnecessary: in this case the main constraint is that
the alignment be a power of 2, which is a very small subset of
sizeof(void *) multiples.

Then move the assertion to common code for wider coverage.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
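
For illustration, a hedged reconstruction of the two styles of check being contrasted in this commit message; the exact expressions in the original code may differ:

```c
#include <stddef.h>
#include <stdbool.h>

/* Modulus style: costs a runtime divide on many cores, and it only
 * guarantees a multiple of sizeof(void *), which also admits values
 * like 12 or 24 that are not powers of 2. */
bool align_ok_modulus(size_t align)
{
	return (align % sizeof(void *)) == 0;
}

/* Bitmask style: a couple of ALU instructions, and it enforces the real
 * constraint. Every power of 2 at or above sizeof(void *) is also a
 * multiple of it, and smaller requests are covered by the heap's minimum
 * alignment anyway. */
bool align_ok_bitmask(size_t align)
{
	return (align & (align - 1)) == 0;
}
```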
cfriedt previously approved these changes Mar 16, 2025
Nicolas Pitre added 4 commits March 16, 2025 12:10
When k_heap_alloc() is expressed in terms of k_heap_aligned_alloc()
it invokes a longer aligned allocation code path with an extra runtime
overhead even though no alignment is necessary.

Let's reference and invoke the aligned allocation code path only when an
actual aligned allocation is requested. This opens the possibility for
the linker to garbage-collect the aligning code otherwise.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
In .../modules/debug/percepio/TraceRecorder/kernelports/Zephyr/include/\
tracing_tracerecorder.h there is a concealed non-parameterized direct
reference to a local variable that is no longer in scope. Provide a dummy
stub for compilation to succeed until that module's layering violation is
fixed, after which this could be reverted.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
When k_malloc() is expressed in terms of k_aligned_alloc() it invokes a
longer aligned allocation code path with an extra runtime overhead even
though no alignment is necessary.

Let's reference and invoke the aligned allocation code path only when an
actual aligned allocation is requested. This opens the possibility for
the linker to garbage-collect the aligning code otherwise.

Also bypass k_heap_malloc() and friends given they're invoked with
K_NO_WAIT. Go directly to sys_heap_*() instead to cut some more unneeded
overhead.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
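
A minimal sketch of what "go directly to sys_heap_*()" can look like: since K_NO_WAIT means the allocation never blocks, the k_heap wait-queue handling can be skipped and only the spinlock around the underlying sys_heap is needed. The heap symbol name is a placeholder and the bookkeeping k_free() needs is omitted; this is not the merged code.

```c
#include <zephyr/kernel.h>

extern struct k_heap my_system_heap;	/* placeholder for the kernel heap */

void *my_malloc(size_t size)
{
	/* lock the heap, allocate without ever waiting, unlock */
	k_spinlock_key_t key = k_spin_lock(&my_system_heap.lock);
	void *ret = sys_heap_alloc(&my_system_heap.heap, size);

	k_spin_unlock(&my_system_heap.lock, key);
	return ret;
}
```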
When sys_heap_realloc() is expressed in terms of sys_heap_aligned_realloc()
it invokes a longer aligned allocation code path with an extra runtime
overhead even though no alignment is necessary.

Let's reference and invoke the aligned allocation code path only when an
actual aligned allocation is requested. This opens the possibility for
the linker to garbage-collect the aligning code otherwise.

Improve realloc documentation while at it.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
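
A usage sketch of the two entry points after the split (function names per the public sys_heap API; the heap object and sizes are arbitrary placeholders). Only the second caller references the aligned-realloc code path, so an image without such callers can have it garbage-collected:

```c
#include <zephyr/sys/sys_heap.h>

static struct sys_heap my_heap;
static char my_heap_mem[4096];

void example(void)
{
	sys_heap_init(&my_heap, my_heap_mem, sizeof(my_heap_mem));

	/* Plain path: no alignment requested, no aligning code pulled in. */
	void *p = sys_heap_alloc(&my_heap, 64);
	p = sys_heap_realloc(&my_heap, p, 128);

	/* Aligned path: only this kind of caller drags in the aligning code. */
	void *q = sys_heap_aligned_alloc(&my_heap, 32, 64);
	q = sys_heap_aligned_realloc(&my_heap, q, 32, 128);

	sys_heap_free(&my_heap, p);
	sys_heap_free(&my_heap, q);
}
```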
@npitre requested review from andyross and peter-mitsis on March 18, 2025
@npitre (Collaborator, Author) commented Mar 27, 2025:

@andyross, @peter-mitsis: ping

@npitre (Collaborator, Author) commented Apr 1, 2025:

This is no April Fools' prank: assignees' (re)approval is still needed. ;-)

@kartben merged commit 11021cd into zephyrproject-rtos:main on Apr 1, 2025. 24 checks passed.
7 participants