Fix FP64 operations on conv_diff #199

Open · wants to merge 5 commits into master from fix_conv_diff
Conversation

@b-fg (Member) commented on Mar 11, 2025

In FP32 (T=Float32) GPU simulations, Nsight Compute detected FP64 operations in the conv_diff! routine (the same promotion should also occur on CPU). This fix closes #197. I have tracked the problem down to the flux function

@inline ϕ(a,I,f) = @inbounds (f[I]+f[I-δ(a,I)])*0.5

where the Float64 literal 0.5 always promotes the operation to FP64. The fix is to use /2 instead, which preserves the element type of the array (dividing by the integer 2 does not force a promotion).
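For reference, a minimal REPL sketch (not part of the PR itself) of the promotion behaviour behind the fix:

```julia
x = 1.0f0            # Float32 value, as in a T=Float32 simulation
typeof(x * 0.5)      # Float64 -- the untyped literal 0.5 is Float64 and wins the promotion
typeof(x / 2)        # Float32 -- the integer 2 is converted to the value's own float type
typeof(x * 0.5f0)    # Float32 -- a typed literal also avoids promotion, but hard-codes Float32
```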

I have also done some additional type cleaning and separated the @loop calls in conv_diff! into their own kernels. Benchmarks still need to be run to check whether the fix impacts performance.

@b-fg added the bug (Something isn't working) label on Mar 11, 2025
@b-fg (Member, Author) commented on Mar 11, 2025

Benchmarks do not show any speedup so far. I still need to try the original version with the merged @loop.

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    19.42 │            92.60 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2574744 │   0.00 │     3.07 │            14.64 │     6.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2266603 │   0.00 │    17.89 │            85.28 │     1.09 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2655696 │   0.00 │     3.18 │            15.16 │     6.11 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: cylinder sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5268250 │   0.00 │    38.78 │           109.58 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5703296 │   0.00 │     7.62 │            21.54 │     5.09 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5389939 │   0.00 │    38.09 │           107.63 │     1.02 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5793140 │   0.19 │     7.64 │            21.59 │     5.08 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg (Member, Author) commented on Mar 11, 2025

It seems that merging the loops actually helps performance a bit, see below (note that 32c3661 is the commit without loop merging). Compared to master, the /2 change yields similar results on GPU and is a bit faster on CPU (shouldn't it be the opposite?).

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    19.42 │            92.60 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2574744 │   0.00 │     3.07 │            14.64 │     6.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2153203 │   0.00 │    17.03 │            81.21 │     1.14 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2573880 │   0.00 │     3.06 │            14.59 │     6.34 │
│     CPUx04 │       32c3661 │ 1.11.3 │   Float32 │     2266603 │   0.00 │    17.89 │            85.28 │     1.09 │
│ GPU-NVIDIA │       32c3661 │ 1.11.3 │   Float32 │     2655696 │   0.00 │     3.18 │            15.16 │     6.11 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: cylinder sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5268250 │   0.00 │    38.78 │           109.58 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5703296 │   0.00 │     7.62 │            21.54 │     5.09 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5276539 │   0.00 │    36.93 │           104.36 │     1.05 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5703067 │   0.00 │     7.53 │            21.28 │     5.15 │
│     CPUx04 │       32c3661 │ 1.11.3 │   Float32 │     5389939 │   0.00 │    38.09 │           107.63 │     1.02 │
│ GPU-NVIDIA │       32c3661 │ 1.11.3 │   Float32 │     5793140 │   0.19 │     7.64 │            21.59 │     5.08 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@weymouth (Collaborator) commented

I don't know enough about GPUs to tell you what to expect. Might need to call in an expert...

@b-fg (Member, Author) commented on Mar 12, 2025

I have verified with Nsight Compute that the /2 fix removes the FP64 operations in conv_diff! when running with T=Float32, while overall solver performance remains similar. Maybe @vchuravy or @maleadt can give us some quick feedback on how to correctly implement functions executed in GPU kernels that may return either FP32 or FP64 (as selected by the user)?

@vchuravy commented

Sadly, there is no magic here. T(0.5) is basically the only thing you can do.
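For anyone reading along, a hedged sketch of what that suggestion looks like in practice; ϕ_scaled and half are hypothetical names for illustration, not WaterLily API:

```julia
# Sketch only: derive the literal's type from the input so the kernel stays in
# whichever precision (Float32 or Float64) the user selected via T.
half(::Type{T}) where {T<:AbstractFloat} = T(0.5)

@inline ϕ_scaled(a, b) = (a + b) * half(typeof(a))  # result keeps the input precision

@assert typeof(ϕ_scaled(1.0f0, 2.0f0)) == Float32   # FP32 stays FP32
@assert typeof(ϕ_scaled(1.0, 2.0))     == Float64   # FP64 stays FP64
```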

Labels: bug (Something isn't working)
Linked issue (may be closed by this PR): FP64 operations on FP32 simulations in conv_diff!
3 participants