Running the same prompt on different hardware (CPU & customer hardware) leads to different responses even with --temp 0.0 #12167
akapoor3518 started this conversation in General
Replies: 1 comment · 8 replies
-
Hi,
I am running the prompt below (with --temp 0.0) on the CPU and on my customer hardware, and I expected the same response, but I am getting different responses. I will look through the llama.cpp code to see where the difference comes from; it would also help if you could share your suggestions.
Below is what I ran:
./build/bin/llama-cli -p "my cat name" -m ./models/tinyllama-vo-5m-para.gguf --device none -c 12288 --temp 0.0 --n-predict 4 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup
Prompt response
###########
my cat name was a little girl
With the customer hardware, only the following operations were offloaded to it: GGML_OP_NONE, GGML_OP_ADD, GGML_OP_SUB, GGML_OP_DIV, GGML_OP_MUL; the rest ran on the CPU.
./build/bin/llama-cli -p "my cat name" -m ./models/tinyllama-vo-5m-para.gguf --device customer-hardware -c 12288 --temp 0.0 --n-predict 4 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup
Prompt response
###########
my cat name the sun was shining
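One likely source of the mismatch (a guess on my side, not yet confirmed against the llama.cpp code): floating-point addition is not associative, so if the customer-hardware backend accumulates the offloaded GGML_OP_ADD/GGML_OP_MUL results in a different order (or at a different intermediate precision) than the CPU backend, the logits can differ slightly even for bit-identical inputs. A minimal standalone C++ sketch with made-up values:

```cpp
// Toy demonstration (values are made up, not from llama.cpp): summing the
// same numbers in a different order gives a different float result, which
// stands in for two backends using different reduction orders.
#include <cstdio>
#include <vector>

int main() {
    const std::vector<float> v = {1e8f, 1.0f, -1e8f, 1.0f};

    // "CPU" order: accumulate left to right.
    float fwd = 0.0f;
    for (size_t i = 0; i < v.size(); ++i) fwd += v[i];

    // "Accelerator" order: accumulate right to left (stands in for a
    // different reduction tree on the offloaded ops).
    float rev = 0.0f;
    for (size_t i = v.size(); i-- > 0; ) rev += v[i];

    std::printf("forward: %g  reversed: %g\n", fwd, rev); // forward: 1  reversed: 0
    return 0;
}
```

Since --temp 0.0 makes token selection effectively greedy, a difference like this in even one logit can flip the argmax at some position, and the two runs then diverge for every token after it.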
-
@akapoor3518 Does it happen when you set the same integer random seed value? (-s or --seed parameter)
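For what it's worth: if I read llama.cpp's sampling code correctly, --temp 0.0 means the highest-probability token is picked greedily, so a fixed seed alone may not make the two devices agree; the logits themselves have to match. A toy sketch of how a tiny logit difference flips a greedy pick (token indices and values are invented, not from a real run):

```cpp
// Toy sketch (invented logits): under greedy decoding, a per-token logit
// difference of ~1e-5 is enough to change which token gets picked.
#include <algorithm>
#include <cstdio>
#include <vector>

static int greedy_pick(const std::vector<float> & logits) {
    // Index of the largest logit (first one wins on exact ties).
    return (int)(std::max_element(logits.begin(), logits.end()) - logits.begin());
}

int main() {
    // Same position, same model -- logits from two backends differing slightly.
    const std::vector<float> cpu_logits = {2.31f, 4.70001f, 4.70000f, -1.5f};
    const std::vector<float> hw_logits  = {2.31f, 4.70000f, 4.70001f, -1.5f};

    std::printf("cpu picks token %d, customer hardware picks token %d\n",
                greedy_pick(cpu_logits), greedy_pick(hw_logits)); // 1 vs 2
    return 0;
}
```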