Inferless
Popular repositories
- triton-co-pilot Public
Generate glue code in seconds to simplify your NVIDIA Triton Inference Server deployments.
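For context on what such glue code talks to, here is a minimal sketch of querying a Triton Inference Server with the official tritonclient package; the server address, model name, and tensor names are hypothetical placeholders, not taken from this repo.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton Inference Server (address is an assumption).
client = httpclient.InferenceServerClient(url="localhost:8000")

# "text_model" and the tensor names below are hypothetical placeholders.
inp = httpclient.InferInput("INPUT_0", [1, 4], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 4).astype(np.float32))

result = client.infer(model_name="text_model", inputs=[inp])
print(result.as_numpy("OUTPUT_0"))
```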
- qwq-32b-preview Public template
A 32B experimental reasoning model for advanced text generation and robust instruction following. <metadata> gpu: A100 | collections: ["vLLM"] </metadata>
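A minimal sketch of this template's stack, offline generation with vLLM; the Hugging Face ID Qwen/QwQ-32B-Preview is an assumption, and a 32B model typically needs an 80 GB A100 or tensor parallelism across several GPUs.

```python
from vllm import LLM, SamplingParams

# Model ID is assumed; raise tensor_parallel_size to shard across more GPUs.
llm = LLM(model="Qwen/QwQ-32B-Preview", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain step by step: why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```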
- whisper-large-v3 Public template
State-of-the-art speech recognition model for English, delivering high transcription accuracy across diverse audio scenarios. <metadata> gpu: T4 | collections: ["CTranslate2"] </metadata>
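A minimal transcription sketch using faster-whisper, a common wrapper around the CTranslate2 conversion of this model; the audio path is a placeholder.

```python
from faster_whisper import WhisperModel

# "large-v3" fetches the CTranslate2 conversion of Whisper large-v3.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# "audio.wav" is a placeholder path.
segments, info = model.transcribe("audio.wav", language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```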
- deepseek-r1-distill-qwen-32b Public template
A distilled DeepSeek-R1 variant built on Qwen2.5-32B, fine-tuned with curated data for enhanced performance and efficiency. <metadata> gpu: A100 | collections: ["vLLM"] </metadata>
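Since this template also lists vLLM, here is a sketch of the other common usage pattern, querying a vLLM OpenAI-compatible server from Python; the model ID and local address are assumptions.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server was started first, e.g.:
#   vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": "What is 17 * 24? Think step by step."}],
)
print(resp.choices[0].message.content)
```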
Repositories
- stable-diffusion-3.5-large Public
An 8B model that excels at producing high-quality, detailed images up to 1 megapixel in resolution. <metadata> gpu: A100 | collections: ["Diffusers"] </metadata>
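A minimal text-to-image sketch with Diffusers; the Hugging Face ID is an assumption, and the checkpoint is gated behind a license acceptance.

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Model ID is assumed; accept the model license on Hugging Face before downloading.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe("a lighthouse at dusk, detailed oil painting", num_inference_steps=28).images[0]
image.save("lighthouse.png")
```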
- phi-4-GGUF Public template
A 14B model packaged in GGUF format for efficient inference, designed to excel at complex reasoning tasks. <metadata> gpu: A100 | collections: ["llama.cpp","GGUF"] </metadata>
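A minimal sketch of local inference through llama-cpp-python, the usual Python binding for llama.cpp; the GGUF filename and quantization level are placeholders.

```python
from llama_cpp import Llama

# The GGUF filename and quantization level are placeholders.
llm = Llama(model_path="phi-4-Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```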
- tinyllama-1-1b-chat-v1-0 Public template
A chat model fine-tuned from TinyLlama, a compact 1.1B Llama model pretrained on 3 trillion tokens. <metadata> gpu: T4 | collections: ["vLLM"] </metadata>
- llama-2-13b-chat-hf Public template
A 13B model fine-tuned with reinforcement learning from human feedback, part of Meta’s Llama 2 family for dialogue tasks. <metadata> gpu: A100 | collections: ["HF Transformers"] </metadata>
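A minimal generation sketch with the HF Transformers pipeline API; access to this gated checkpoint must first be granted on Hugging Face, and the [INST] wrapper is Llama 2's chat prompt format.

```python
import torch
from transformers import pipeline

# Gated model: requires accepting Meta's license on Hugging Face first.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-13b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

out = pipe("<s>[INST] Suggest three names for a hiking blog. [/INST]", max_new_tokens=128)
print(out[0]["generated_text"])
```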
- falcon-7b-instruct Public template
A 7B instruction-tuned language model that excels at following detailed prompts and performing a wide variety of natural language processing tasks. <metadata> gpu: A100 | collections: ["HF Transformers"] </metadata>
- vicuna-7b-8k Public template
A GPTQ-quantized variant of Vicuna 7B v1.3 with an extended 8K context window, optimized for conversational AI and instruction-following with efficient, robust performance. <metadata> gpu: T4 | collections: ["GPTQ"] </metadata>
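A minimal sketch of loading a GPTQ checkpoint through Transformers, which handles GPTQ weights when the optimum and auto-gptq packages are installed; the TheBloke model ID is an assumption.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID is an assumption; GPTQ checkpoints load via transformers when
# the optimum and auto-gptq packages are installed.
model_id = "TheBloke/vicuna-7B-v1.3-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("USER: What is GPTQ quantization? ASSISTANT:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```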
- vicuna-13b-8k Public template
A GPTQ-quantized 13B uncensored language model with an extended 8K context window, designed for dynamic, high-performance conversational tasks. <metadata> gpu: T4 | collections: ["GPTQ"] </metadata>
- codellama-7b Public template
A 7B-parameter, Python-specialized model for lightweight, efficient code generation and comprehension. <metadata> gpu: A100 | collections: ["vLLM"] </metadata>
- meditron-7b-gptq Public template
A GPTQ-quantized open-source medical LLM designed for exam question answering, differential diagnosis support, and comprehensive information on diseases, symptoms, causes, and treatments. <metadata> gpu: A100 | collections: ["vLLM"] </metadata>
- openhermes-2-5-mistral-7b Public template
An AWQ-quantized Mistral-7B fine-tune built for fast, efficient, and robust conversational and instruction-following tasks. <metadata> gpu: A100 | collections: ["vLLM","AWQ"] </metadata>
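A minimal sketch of serving AWQ weights with vLLM, matching the collections listed above; the TheBloke model ID is an assumption, and the ChatML-style prompt follows OpenHermes' documented format.

```python
from vllm import LLM, SamplingParams

# Model ID is assumed; quantization="awq" tells vLLM to load AWQ weights.
llm = LLM(model="TheBloke/OpenHermes-2.5-Mistral-7B-AWQ", quantization="awq")

prompt = "<|im_start|>user\nWrite a haiku about GPUs.<|im_end|>\n<|im_start|>assistant\n"
out = llm.generate([prompt], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```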