
feat: add a local endpoint type for inference directly from chat-ui #1778

Merged · 12 commits merged into main on Apr 4, 2025

Conversation

@nsarrazin (Collaborator) commented Mar 31, 2025

Part of #1774

  • Run models locally from .gguf file
  • Auto-download model if not stored locally
  • Use GPU if available
  • Get chat template from .gguf file
  • Show every .gguf in models/ as a model if MODELS is undefined
  • Handle batching & multiple model inference at once more gracefully
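
As context for how these pieces fit together, here is a minimal sketch of a local llama.cpp inference loop, assuming the node-llama-cpp API (the chatSession wording in the commit list below points to those bindings); the model path is hypothetical and the actual endpoint code in chat-ui is more involved:

import { getLlama, LlamaChatSession } from "node-llama-cpp";

// getLlama() auto-selects a GPU backend (Metal, CUDA or Vulkan) when one
// is available and falls back to CPU otherwise.
const llama = await getLlama();

// Load a local .gguf file; node-llama-cpp reads the chat template from
// the file's metadata, so no template needs to be configured by hand.
const model = await llama.loadModel({
  modelPath: "models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf", // hypothetical path
});

const context = await model.createContext();
const session = new LlamaChatSession({
  contextSequence: context.getSequence(),
});

console.log(await session.prompt("Hello! Who are you?"));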

@nsarrazin added labels on Mar 31, 2025: enhancement (New feature or request), back (related to the Svelte backend or the DB), models (related to model performance/reliability)
@nsarrazin (Collaborator, Author) commented Apr 1, 2025

Something is going wrong in the build step. Found a relevant issue; trying to fix it.

@nsarrazin (Collaborator, Author) commented

Works well! You can do something like:

MODELS=`[{
  "name": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
  "parameters": {
    "stop_sequences": ["<|im_end|>", "<|endoftext|>"]
  },
  "endpoints": [{"type": "local", "modelPath": "hf:HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF:Q4_K_M"}]
}]`

It will automatically use your GPU if available, and download the model to the models/ folder if it isn't already present locally.
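
The hf: prefix in modelPath follows node-llama-cpp's model URI scheme. As a rough sketch, resolving such a URI to a local file can look like this (the target directory is an assumption on my part, chosen to match chat-ui's models/ folder):

import { resolveModelFile } from "node-llama-cpp";

// Downloads the Q4_K_M quant from the Hugging Face Hub on first use and
// returns the cached local path on subsequent runs.
const modelPath = await resolveModelFile(
  "hf:HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF:Q4_K_M",
  "models" // assumed download directory
);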

It's still quite rough: it doesn't handle running out of memory gracefully, so I'm still working on that.

I also want to automatically expose any .gguf files in the models/ folder as models in chat-ui, without having to set the MODELS env var.
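
A minimal sketch of that discovery step, using only Node's standard fs APIs (the helper name and directory are illustrative, not chat-ui's actual code):

import { readdir } from "node:fs/promises";
import path from "node:path";

// Hypothetical helper: list every .gguf file in models/ so each one can
// be exposed as a selectable model when MODELS is undefined.
async function discoverLocalModels(dir = "models"): Promise<string[]> {
  const entries = await readdir(dir).catch(() => [] as string[]);
  return entries
    .filter((name) => name.endsWith(".gguf"))
    .map((name) => path.join(dir, name));
}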

@nsarrazin (Collaborator, Author) commented

Merging for now; it works well in local testing! Will update the docs to explain this once I'm done with the quick setup.

@nsarrazin nsarrazin merged commit 4906793 into main Apr 4, 2025
4 checks passed
@nsarrazin nsarrazin deleted the feat/local_endpoint_type branch April 4, 2025 13:11
csanz91 pushed a commit to csanz91/chat-ui that referenced this pull request Apr 7, 2025
feat: add a local endpoint type for inference directly from chat-ui (huggingface#1778)

* feat: add a local endpoint type running llama.cpp from chat-ui

* fix: build image

* fix: lock file

* wip: try to make it more reliable

* feat: load chat template from .gguf file

* feat: load gguf models from `models/` folder

* fix: default config

* feat: make endpoint use chatSession instead of completion

* refactor: improve exit handling, exit immediately on second signal

* fix: various fixes to improve reliability when calling multiple models at once

* docs: add instructions for adding .gguf files to the models directory