Fix FP64 operations on conv_diff #199

Open · wants to merge 5 commits into master from fix_conv_diff
Conversation

@b-fg (Member) commented on Mar 11, 2025

In FP32 (T=Float32) GPU simulations, Nsight Compute detected FP64 operations in the conv_diff! routine (the same promotion should also occur on CPU). This fix closes #197. I have tracked the problem down to the flux function

@inline ϕ(a,I,f) = @inbounds (f[I]+f[I-δ(a,I)])*0.5

where the Float64 literal 0.5 always promotes the operation to FP64. The fix is to use /2 instead, which preserves the element type of the array (dividing by the integer 2 does not force a promotion).
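For reference, a minimal REPL sketch (not part of the PR itself) of the promotion behaviour behind the fix:

```julia
x = 1.0f0            # Float32 value, as in a T=Float32 simulation
typeof(x * 0.5)      # Float64 -- the untyped literal 0.5 is Float64 and wins the promotion
typeof(x / 2)        # Float32 -- the integer 2 is converted to the value's own float type
typeof(x * 0.5f0)    # Float32 -- a typed literal also avoids promotion, but hard-codes Float32
```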

I have also done some additional type cleaning and separated the @loop calls in conv_diff! into their own kernels. Benchmarks still need to be run to check whether the fix impacts performance.

@b-fg added the bug (Something isn't working) label on Mar 11, 2025
@b-fg (Member, Author) commented on Mar 11, 2025

Benchmarks do not show any speedup so far. I still need to try the original version with the merged @loop.

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    19.42 │            92.60 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2574744 │   0.00 │     3.07 │            14.64 │     6.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2266603 │   0.00 │    17.89 │            85.28 │     1.09 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2655696 │   0.00 │     3.18 │            15.16 │     6.11 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: cylinder sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5268250 │   0.00 │    38.78 │           109.58 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5703296 │   0.00 │     7.62 │            21.54 │     5.09 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5389939 │   0.00 │    38.09 │           107.63 │     1.02 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5793140 │   0.19 │     7.64 │            21.59 │     5.08 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg (Member, Author) commented on Mar 11, 2025

It seems that merging the loops actually helps performance a bit, see below (note that 32c3661 is the commit without loop merging). Compared to master, the /2 change yields similar results on GPU and is a bit faster on CPU (shouldn't it be the opposite?).

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 7
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     2153203 │   0.00 │    19.42 │            92.60 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     2574744 │   0.00 │     3.07 │            14.64 │     6.33 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     2153203 │   0.00 │    17.03 │            81.21 │     1.14 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     2573880 │   0.00 │     3.06 │            14.59 │     6.34 │
│     CPUx04 │       32c3661 │ 1.11.3 │   Float32 │     2266603 │   0.00 │    17.89 │            85.28 │     1.09 │
│ GPU-NVIDIA │       32c3661 │ 1.11.3 │   Float32 │     2655696 │   0.00 │     3.18 │            15.16 │     6.11 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: cylinder sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │   WaterLily   │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx04 │        master │ 1.11.3 │   Float32 │     5268250 │   0.00 │    38.78 │           109.58 │     1.00 │
│ GPU-NVIDIA │        master │ 1.11.3 │   Float32 │     5703296 │   0.00 │     7.62 │            21.54 │     5.09 │
│     CPUx04 │ fix_conv_diff │ 1.11.3 │   Float32 │     5276539 │   0.00 │    36.93 │           104.36 │     1.05 │
│ GPU-NVIDIA │ fix_conv_diff │ 1.11.3 │   Float32 │     5703067 │   0.00 │     7.53 │            21.28 │     5.15 │
│     CPUx04 │       32c3661 │ 1.11.3 │   Float32 │     5389939 │   0.00 │    38.09 │           107.63 │     1.02 │
│ GPU-NVIDIA │       32c3661 │ 1.11.3 │   Float32 │     5793140 │   0.19 │     7.64 │            21.59 │     5.08 │
└────────────┴───────────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@weymouth (Collaborator) commented

I don't know enough about GPUs to tell you what to expect. Might need to call in an expert...

@b-fg (Member, Author) commented on Mar 12, 2025

I have verified with Nsight Compute that the /2 fix removes the FP64 operations in conv_diff! when running with T=Float32, while overall solver performance remains similar. Maybe @vchuravy or @maleadt can give us some quick feedback on how to correctly implement functions executed in GPU kernels that may return either FP32 or FP64 (as selected by the user)?

@vchuravy commented

Sadly, there is no magic here. T(0.5) is basically the only thing you can do.
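For anyone reading along, a hedged sketch of what that suggestion looks like in practice; ϕ_scaled and half are hypothetical names for illustration, not WaterLily API:

```julia
# Sketch only: derive the literal's type from the input so the kernel stays in
# whichever precision (Float32 or Float64) the user selected via T.
half(::Type{T}) where {T<:AbstractFloat} = T(0.5)

@inline ϕ_scaled(a, b) = (a + b) * half(typeof(a))  # result keeps the input precision

@assert typeof(ϕ_scaled(1.0f0, 2.0f0)) == Float32   # FP32 stays FP32
@assert typeof(ϕ_scaled(1.0, 2.0))     == Float64   # FP64 stays FP64
```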

Labels: bug (Something isn't working)
Linked issue (may be closed by this PR): FP64 operations on FP32 simulations in conv_diff!
3 participants