-
I don't know why, but I figured out that I can force an exact split by using / instead of , in --tensor-split to specify exactly how many layers to put on each GPU.
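For concreteness, here is a sketch of the invocation this seems to describe, assuming a recent llama.cpp build; the model path and per-GPU counts are placeholders, chosen so that 9/8/9/8/8/8 sums to the 50 offloaded layers in the log below:

    # hypothetical example: '/'-separated --tensor-split values to pin per-GPU layer counts
    ./llama-cli -m model.gguf -ngl 50 --tensor-split 9/8/9/8/8/8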
-
load_tensors: offloaded 50/62 layers to GPU
load_tensors: CPU_Mapped model buffer size = 31952.61 MiB
load_tensors: CUDA0 model buffer size = 17656.55 MiB
load_tensors: CUDA1 model buffer size = 14125.24 MiB
load_tensors: CUDA2 model buffer size = 17656.55 MiB
load_tensors: CUDA3 model buffer size = 14125.24 MiB
load_tensors: CUDA4 model buffer size = 17656.55 MiB
load_tensors: CUDA5 model buffer size = 14125.24 MiB
The uneven distribution causes an OOM when the KV cache is loaded, even though it's clear we would have enough memory if the layers were distributed evenly.
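For scale, an even split of the same total would be (3 × 17656.55 + 3 × 14125.24) / 6 ≈ 15891 MiB per GPU, roughly 1.7 GiB less than the three fullest cards currently hold, which is headroom that would then be available for the KV cache. Applying the workaround from the first comment, a hypothetical invocation that evens things out (same placeholder caveats; 9/9/8/8/8/8 sums to the 50 offloaded layers):

    ./llama-cli -m model.gguf -ngl 50 --tensor-split 9/9/8/8/8/8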