Changing the number of experts with a Mixtral GGUF? #5114
-
I'm using ooba webui, and I noticed that the Exllamav2 model loader has a 'Number of experts per token' option for Mixtral that lets you set it to something other than the default of 2. But when I use the llama.cpp loader (because I'm running an 8-bit GGUF of Mixtral), that option isn't available. I want to see how good a response I can get from Mixtral, so I don't want to switch to a lower-bit quant just to fit the model on my GPU, since that would degrade the response in a different way. Is there any way to use a higher number of experts while still using a GGUF?
Replies: 4 comments
-
I tried this with `--override-kv llama.expert_used_count=int:3` and it worked.
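For context, a full invocation might look like the sketch below. Only the `--override-kv` flag comes from the reply above; the binary name, model path, and other flags are illustrative and depend on your build and setup.

```sh
# Illustrative only: binary name and model path are placeholders,
# the --override-kv flag is the part reported to work above.
./main -m ./models/mixtral-8x7b-instruct.Q8_0.gguf \
    --override-kv llama.expert_used_count=int:3 \
    -p "Explain mixture-of-experts routing in one paragraph."
```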
-
That's really useful info, thanks! Now I'll have a look to see where I could splice that into ooba's codebase. Unless anyone knows where that is offhand?
-
I've found this bit in llamacpp_model.py, but I haven't yet worked out how to set llama.expert_used_count (or n_expert_used) to 3. Adding both of these to params doesn't seem to do the job.
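A minimal sketch of what I think the equivalent is on the Python side, assuming your installed llama-cpp-python version exposes the `kv_overrides` argument on `Llama(...)` (the model path and prompt below are just placeholders, not something from ooba's code):

```python
from llama_cpp import Llama

# Sketch: pass the GGUF metadata override directly to llama-cpp-python.
# This assumes your llama-cpp-python version supports `kv_overrides`;
# the dict value plays the same role as --override-kv ...=int:3 on the CLI.
llm = Llama(
    model_path="mixtral-8x7b-instruct.Q8_0.gguf",   # placeholder path
    n_gpu_layers=-1,                                 # offload as much as fits
    kv_overrides={"llama.expert_used_count": 3},     # use 3 experts per token
)

out = llm("Explain mixture-of-experts routing in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```

If that's right, the ooba-side change would presumably be adding a `kv_overrides` entry to whatever argument dict llamacpp_model.py passes into `Llama(...)`, but exactly where depends on the webui version.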
-
If anyone wants to continue this thread with regard to integrating this into ooba webui, I've opened a new thread on the Discussions tab: