
feat: add a local endpoint type for inference directly from chat-ui #1778

Merged · 12 commits merged into main on Apr 4, 2025

Conversation

@nsarrazin (Collaborator) commented Mar 31, 2025

Part of #1774

  • Run models locally from .gguf file
  • Auto-download model if not stored locally
  • Use GPU if available
  • Get chat template from .gguf file
  • Show every .gguf in models/ as a model if MODELS is undefined
  • Handle batching & multiple model inference at once more gracefully
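
As context for how these pieces fit together, here is a minimal sketch of a local llama.cpp inference loop, assuming the node-llama-cpp API (the chatSession wording in the commit list below points to those bindings); the model path is hypothetical and the actual endpoint code in chat-ui is more involved:

import { getLlama, LlamaChatSession } from "node-llama-cpp";

// getLlama() auto-selects a GPU backend (Metal, CUDA or Vulkan) when one
// is available and falls back to CPU otherwise.
const llama = await getLlama();

// Load a local .gguf file; node-llama-cpp reads the chat template from
// the file's metadata, so no template needs to be configured by hand.
const model = await llama.loadModel({
  modelPath: "models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf", // hypothetical path
});

const context = await model.createContext();
const session = new LlamaChatSession({
  contextSequence: context.getSequence(),
});

console.log(await session.prompt("Hello! Who are you?"));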

@nsarrazin added labels on Mar 31, 2025: enhancement (New feature or request), back (related to the Svelte backend or the DB), models (related to model performance/reliability)
@nsarrazin (Collaborator, Author) commented Apr 1, 2025

Something is going wrong in the build step. Found a relevant issue; trying to fix it.

@nsarrazin (Collaborator, Author) commented

Works well! You can do something like:

MODELS=`[{
  "name": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
  "parameters": {
    "stop_sequences": ["<|im_end|>", "<|endoftext|>"]
  },
  "endpoints": [{"type": "local", "modelPath": "hf:HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF:Q4_K_M"}]
}]`

It will automatically use your GPU if available, and download the model to the models/ folder if it isn't already present locally.
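
The hf: prefix in modelPath follows node-llama-cpp's model URI scheme. As a rough sketch, resolving such a URI to a local file can look like this (the target directory is an assumption on my part, chosen to match chat-ui's models/ folder):

import { resolveModelFile } from "node-llama-cpp";

// Downloads the Q4_K_M quant from the Hugging Face Hub on first use and
// returns the cached local path on subsequent runs.
const modelPath = await resolveModelFile(
  "hf:HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF:Q4_K_M",
  "models" // assumed download directory
);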

It's still quite rough: it doesn't handle running out of memory gracefully, so I'm still working on that.

I also want to automatically expose any .gguf files in the models/ folder as models in chat-ui, without having to set the MODELS env var.
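
A minimal sketch of that discovery step, using only Node's standard fs APIs (the helper name and directory are illustrative, not chat-ui's actual code):

import { readdir } from "node:fs/promises";
import path from "node:path";

// Hypothetical helper: list every .gguf file in models/ so each one can
// be exposed as a selectable model when MODELS is undefined.
async function discoverLocalModels(dir = "models"): Promise<string[]> {
  const entries = await readdir(dir).catch(() => [] as string[]);
  return entries
    .filter((name) => name.endsWith(".gguf"))
    .map((name) => path.join(dir, name));
}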

@nsarrazin (Collaborator, Author) commented

Merging for now; it works well in local testing! Will update the docs to explain this once I'm done with the quick setup.

@nsarrazin nsarrazin merged commit 4906793 into main Apr 4, 2025
4 checks passed
@nsarrazin nsarrazin deleted the feat/local_endpoint_type branch April 4, 2025 13:11
csanz91 pushed a commit to csanz91/chat-ui that referenced this pull request Apr 7, 2025
feat: add a local endpoint type for inference directly from chat-ui (huggingface#1778)

* feat: add a local endpoint type running llama.cpp from chat-ui

* fix: build image

* fix: lock file

* wip: try to make it more reliable

* feat: load chat template from .gguf file

* feat: load gguf models from `models/` folder

* fix: default config

* feat: make endpoint use chatSession instead of completion

* refactor: improve exit handling, exit immediately on second signal

* fix: various fixes to improve reliability when calling multiple models at once

* docs: add instructions for adding .gguf files to the models directory