[Usage]: Is it possible to run vLLM inside a Jupyter Notebook? #16003

Open
repodiac opened this issue Apr 3, 2025 · 2 comments
Labels
usage How to use vllm

Comments

repodiac commented Apr 3, 2025

Your current environment

PyTorch version: 2.6.0+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Arch Linux (x86_64)
GCC version: (GCC) 14.2.1 20250207
Clang version: 19.1.7
CMake version: version 4.0.0
Libc version: glibc-2.41

Python version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.13.8-arch1-1-x86_64-with-glibc2.41
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        39 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               8
On-line CPU(s) list:                  0-7
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
CPU family:                           6
Model:                                142
Thread(s) per core:                   2
Core(s) per socket:                   4
Socket(s):                            1
Stepping:                             12
CPU(s) scaling MHz:                   47%
CPU max MHz:                          4900.0000
CPU min MHz:                          400.0000
BogoMIPS:                             4599.93
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            128 KiB (4 instances)
L1i cache:                            128 KiB (4 instances)
L2 cache:                             1 MiB (4 instances)
L3 cache:                             8 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-7
Vulnerability Gather data sampling:   Vulnerable: No microcode
Vulnerability Itlb multihit:          KVM: Mitigation: VMX disabled
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Vulnerable: No microcode
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0+cpu
[pip3] torchaudio==2.6.0+cpu
[pip3] torchvision==0.21.0+cpu
[pip3] transformers==4.50.3
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pyzmq                     26.3.0                   pypi_0    pypi
[conda] torch                     2.6.0+cpu                pypi_0    pypi
[conda] torchaudio                2.6.0+cpu                pypi_0    pypi
[conda] torchvision               0.21.0+cpu               pypi_0    pypi
[conda] transformers              4.50.3                   pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.3.dev212+g58e234a7
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1

How would you like to use vllm

I would like to run vLLM inside a Jupyter notebook environment, like any other Python code snippet.
When I run the example code below from the CLI, it works as expected.

When I run the same snippet (taken from your examples) inside the notebook, I get an error:

from vllm import LLM
import PIL.Image  # needed for PIL.Image.open below

llm = LLM(model="OpenGVLab/InternVL2_5-1B")

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Load the image using PIL.Image
image = PIL.Image.open('/tmp/pic1.png')

# Single prompt inference
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})
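
For completeness: once the call succeeds, I would read the generated text from the returned RequestOutput objects, roughly like this:

# print the generated text of each request
for o in outputs:
    print(o.outputs[0].text)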

The error is as follows:

/home/repodiac/anaconda3/envs/vllm/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

INFO 04-03 10:43:58 [__init__.py:239] Automatically detected platform cpu.

2025-04-03 10:43:59,209	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.

INFO 04-03 10:44:06 [config.py:598] This model supports multiple tasks: {'classify', 'reward', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 04-03 10:44:06 [arg_utils.py:1707] device type=cpu is not supported by the V1 Engine. Falling back to V0. 
WARNING 04-03 10:44:06 [cpu.py:98] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 04-03 10:44:06 [cpu.py:111] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 04-03 10:44:06 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3.dev212+g58e234a7) with config: model='OpenGVLab/InternVL2_5-1B', speculative_config=None, tokenizer='OpenGVLab/InternVL2_5-1B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=OpenGVLab/InternVL2_5-1B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
INFO 04-03 10:44:08 [cpu.py:44] Using Torch SDPA backend.
WARNING 04-03 10:44:08 [_custom_ops.py:21] Failed to import from vllm._C with ImportError("/home/repodiac/anaconda3/envs/vllm/lib/python3.12/site-packages/zmq/backend/cython/../../../../.././libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /home/repodiac/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/_C.abi3.so)")
INFO 04-03 10:44:08 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-03 10:44:08 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-03 10:44:08 [cpu.py:44] Using Torch SDPA backend.
INFO 04-03 10:44:08 [config.py:3317] cudagraph sizes specified by model runner [] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
WARNING 04-03 10:44:08 [cpu.py:98] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.

[W403 10:44:08.931695810 socket.cpp:759] [c10d] The client socket cannot be initialized to connect to [quark-247]:35995 (errno: 97 - Address family not supported by protocol).

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 3
      1 from vllm import LLM
----> 3 llm = LLM(model="OpenGVLab/InternVL2_5-1B")
      5 # Refer to the HuggingFace repo for the correct format to use
      6 prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py:1096, in deprecate_args.<locals>.wrapper.<locals>.inner(*args, **kwargs)
   1089             msg += f" {additional_message}"
   1091         warnings.warn(
   1092             DeprecationWarning(msg),
   1093             stacklevel=3,  # The inner function takes up one level
   1094         )
-> 1096 return fn(*args, **kwargs)

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/llm.py:243, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, hf_overrides, mm_processor_kwargs, task, override_pooler_config, compilation_config, **kwargs)
    214 engine_args = EngineArgs(
    215     model=model,
    216     task=task,
   (...)
    239     **kwargs,
    240 )
    242 # Create the Engine (autoselects V0 vs V1)
--> 243 self.llm_engine = LLMEngine.from_engine_args(
    244     engine_args=engine_args, usage_context=UsageContext.LLM_CLASS)
    245 self.engine_class = type(self.llm_engine)
    247 self.request_counter = Counter()

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py:521, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
    518     from vllm.v1.engine.llm_engine import LLMEngine as V1LLMEngine
    519     engine_cls = V1LLMEngine
--> 521 return engine_cls.from_vllm_config(
    522     vllm_config=vllm_config,
    523     usage_context=usage_context,
    524     stat_loggers=stat_loggers,
    525     disable_log_stats=engine_args.disable_log_stats,
    526 )

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py:497, in LLMEngine.from_vllm_config(cls, vllm_config, usage_context, stat_loggers, disable_log_stats)
    489 @classmethod
    490 def from_vllm_config(
    491     cls,
   (...)
    495     disable_log_stats: bool = False,
    496 ) -> "LLMEngine":
--> 497     return cls(
    498         vllm_config=vllm_config,
    499         executor_class=cls._get_executor_cls(vllm_config),
    500         log_stats=(not disable_log_stats),
    501         usage_context=usage_context,
    502         stat_loggers=stat_loggers,
    503     )

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py:281, in LLMEngine.__init__(self, vllm_config, executor_class, log_stats, usage_context, stat_loggers, input_registry, mm_registry, use_cached_outputs)
    277 self.input_registry = input_registry
    278 self.input_processor = input_registry.create_input_processor(
    279     self.model_config)
--> 281 self.model_executor = executor_class(vllm_config=vllm_config, )
    283 if self.model_config.runner_type != "pooling":
    284     self._initialize_kv_caches()

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py:286, in DistributedExecutorBase.__init__(self, *args, **kwargs)
    281 def __init__(self, *args, **kwargs):
    282     # This is non-None when the execute model loop is running
    283     # in the parallel workers. It's a coroutine in the AsyncLLMEngine case.
    284     self.parallel_worker_tasks: Optional[Union[Any, Awaitable[Any]]] = None
--> 286     super().__init__(*args, **kwargs)

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py:52, in ExecutorBase.__init__(self, vllm_config)
     50 self.prompt_adapter_config = vllm_config.prompt_adapter_config
     51 self.observability_config = vllm_config.observability_config
---> 52 self._init_executor()
     53 self.is_sleeping = False
     54 self.sleeping_tags: set[str] = set()

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py:125, in MultiprocessingDistributedExecutor._init_executor(self)
    123 self._run_workers("init_worker", all_kwargs)
    124 self._run_workers("init_device")
--> 125 self._run_workers("load_model",
    126                   max_concurrent_workers=self.parallel_config.
    127                   max_parallel_loading_workers)
    128 self.driver_exec_model = make_async(self.driver_worker.execute_model)
    129 self.pp_locks: Optional[List[asyncio.Lock]] = None

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py:185, in MultiprocessingDistributedExecutor._run_workers(***failed resolving arguments***)
    179 # Start all remote workers first.
    180 worker_outputs = [
    181     worker.execute_method(sent_method, *args, **kwargs)
    182     for worker in self.workers
    183 ]
--> 185 driver_worker_output = run_method(self.driver_worker, sent_method,
    186                                   args, kwargs)
    188 # Get the results of the workers.
    189 return [driver_worker_output
    190         ] + [output.get() for output in worker_outputs]

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py:2347, in run_method(obj, method, args, kwargs)
   2345 else:
   2346     func = partial(method, obj)  # type: ignore
-> 2347 return func(*args, **kwargs)

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/cpu_worker.py:233, in CPUWorker.load_model(self)
    232 def load_model(self):
--> 233     self.model_runner.load_model()

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py:491, in CPUModelRunnerBase.load_model(self)
    490 def load_model(self) -> None:
--> 491     self.model = get_model(vllm_config=self.vllm_config)
    493     if self.lora_config:
    494         assert supports_lora(
    495             self.model
    496         ), f"{self.model.__class__.__name__} does not support LoRA yet."

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py:14, in get_model(vllm_config)
     12 def get_model(*, vllm_config: VllmConfig) -> nn.Module:
     13     loader = get_model_loader(vllm_config.load_config)
---> 14     return loader.load_model(vllm_config=vllm_config)

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py:441, in DefaultModelLoader.load_model(self, vllm_config)
    439 with set_default_torch_dtype(model_config.dtype):
    440     with target_device:
--> 441         model = _initialize_model(vllm_config=vllm_config)
    443     weights_to_load = {name for name, _ in model.named_parameters()}
    444     loaded_weights = model.load_weights(
    445         self._get_all_weights(model_config, model))

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py:127, in _initialize_model(vllm_config, prefix)
    124 if "vllm_config" in all_params and "prefix" in all_params:
    125     # new-style model class
    126     with set_current_vllm_config(vllm_config, check_compile=True):
--> 127         return model_class(vllm_config=vllm_config, prefix=prefix)
    129 msg = ("vLLM model class should accept `vllm_config` and `prefix` as "
    130        "input arguments. Possibly you have an old-style model class"
    131        " registered from out of tree and it is used for new vLLM version. "
    132        "Check https://docs.vllm.ai/en/latest/design/arch_overview.html "
    133        "for the design and update the model class accordingly.")
    134 warnings.warn(msg, DeprecationWarning, stacklevel=2)

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/internvl.py:714, in InternVLChatModel.__init__(self, vllm_config, prefix)
    706 self.is_mono = self.llm_arch_name == 'InternLM2VEForCausalLM'
    707 self.vision_model = self._init_vision_model(
    708     config,
    709     quant_config=quant_config,
    710     is_mono=self.is_mono,
    711     prefix=maybe_prefix(prefix, "vision_model"),
    712 )
--> 714 self.language_model = init_vllm_registered_model(
    715     vllm_config=vllm_config,
    716     hf_config=config.text_config,
    717     prefix=maybe_prefix(prefix, "language_model"),
    718 )
    720 self.mlp1 = self._init_mlp1(config)
    722 self.img_context_token_id = None

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py:286, in init_vllm_registered_model(vllm_config, prefix, hf_config, architectures)
    282 if hf_config is not None:
    283     vllm_config = vllm_config.with_hf_config(hf_config,
    284                                              architectures=architectures)
--> 286 return _initialize_model(vllm_config=vllm_config, prefix=prefix)

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py:127, in _initialize_model(vllm_config, prefix)
    124 if "vllm_config" in all_params and "prefix" in all_params:
    125     # new-style model class
    126     with set_current_vllm_config(vllm_config, check_compile=True):
--> 127         return model_class(vllm_config=vllm_config, prefix=prefix)
    129 msg = ("vLLM model class should accept `vllm_config` and `prefix` as "
    130        "input arguments. Possibly you have an old-style model class"
    131        " registered from out of tree and it is used for new vLLM version. "
    132        "Check https://docs.vllm.ai/en/latest/design/arch_overview.html "
    133        "for the design and update the model class accordingly.")
    134 warnings.warn(msg, DeprecationWarning, stacklevel=2)

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py:431, in Qwen2ForCausalLM.__init__(self, vllm_config, prefix)
    428 self.lora_config = lora_config
    430 self.quant_config = quant_config
--> 431 self.model = Qwen2Model(vllm_config=vllm_config,
    432                         prefix=maybe_prefix(prefix, "model"))
    434 if get_pp_group().is_last_rank:
    435     if config.tie_word_embeddings:

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/compilation/decorators.py:151, in _support_torch_compile.<locals>.__init__(self, vllm_config, prefix, **kwargs)
    150 def __init__(self, *, vllm_config: VllmConfig, prefix: str = '', **kwargs):
--> 151     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
    152     self.vllm_config = vllm_config
    153     # for CompilationLevel.DYNAMO_AS_IS , the upper level model runner
    154     # will handle the compilation, so we don't need to do anything here.

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py:300, in Qwen2Model.__init__(self, vllm_config, prefix)
    297 else:
    298     self.embed_tokens = PPMissingLayer()
--> 300 self.start_layer, self.end_layer, self.layers = make_layers(
    301     config.num_hidden_layers,
    302     lambda prefix: Qwen2DecoderLayer(config=config,
    303                                      cache_config=cache_config,
    304                                      quant_config=quant_config,
    305                                      prefix=prefix),
    306     prefix=f"{prefix}.layers",
    307 )
    309 self.make_empty_intermediate_tensors = (
    310     make_empty_intermediate_tensors_factory(
    311         ["hidden_states", "residual"], config.hidden_size))
    312 if get_pp_group().is_last_rank:

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py:610, in make_layers(num_hidden_layers, layer_fn, prefix)
    604 from vllm.distributed.utils import get_pp_indices
    605 start_layer, end_layer = get_pp_indices(num_hidden_layers,
    606                                         get_pp_group().rank_in_group,
    607                                         get_pp_group().world_size)
    608 modules = torch.nn.ModuleList(
    609     [PPMissingLayer() for _ in range(start_layer)] + [
--> 610         maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
    611         for idx in range(start_layer, end_layer)
    612     ] + [PPMissingLayer() for _ in range(end_layer, num_hidden_layers)])
    613 return start_layer, end_layer, modules

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py:302, in Qwen2Model.__init__.<locals>.<lambda>(prefix)
    297 else:
    298     self.embed_tokens = PPMissingLayer()
    300 self.start_layer, self.end_layer, self.layers = make_layers(
    301     config.num_hidden_layers,
--> 302     lambda prefix: Qwen2DecoderLayer(config=config,
    303                                      cache_config=cache_config,
    304                                      quant_config=quant_config,
    305                                      prefix=prefix),
    306     prefix=f"{prefix}.layers",
    307 )
    309 self.make_empty_intermediate_tensors = (
    310     make_empty_intermediate_tensors_factory(
    311         ["hidden_states", "residual"], config.hidden_size))
    312 if get_pp_group().is_last_rank:

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py:218, in Qwen2DecoderLayer.__init__(self, config, cache_config, quant_config, prefix)
    204     attn_type = AttentionType.ENCODER_ONLY
    206 self.self_attn = Qwen2Attention(
    207     hidden_size=self.hidden_size,
    208     num_heads=config.num_attention_heads,
   (...)
    216     attn_type=attn_type,
    217 )
--> 218 self.mlp = Qwen2MLP(
    219     hidden_size=self.hidden_size,
    220     intermediate_size=config.intermediate_size,
    221     hidden_act=config.hidden_act,
    222     quant_config=quant_config,
    223     prefix=f"{prefix}.mlp",
    224 )
    225 self.input_layernorm = RMSNorm(config.hidden_size,
    226                                eps=config.rms_norm_eps)
    227 self.post_attention_layernorm = RMSNorm(config.hidden_size,
    228                                         eps=config.rms_norm_eps)

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py:92, in Qwen2MLP.__init__(self, hidden_size, intermediate_size, hidden_act, quant_config, prefix)
     89 if hidden_act != "silu":
     90     raise ValueError(f"Unsupported activation: {hidden_act}. "
     91                      "Only silu is supported for now.")
---> 92 self.act_fn = SiluAndMul()

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/activation.py:68, in SiluAndMul.__init__(self)
     66 super().__init__()
     67 if current_platform.is_cuda_alike() or current_platform.is_cpu():
---> 68     self.op = torch.ops._C.silu_and_mul
     69 elif current_platform.is_xpu():
     70     from vllm._ipex_ops import ipex_ops

File ~/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/_ops.py:1232, in _OpNamespace.__getattr__(self, op_name)
   1230     op, overload_names = _get_packet(qualified_op_name, module_name)
   1231     if op is None:
-> 1232         raise AttributeError(
   1233             f"'_OpNamespace' '{self.name}' object has no attribute '{op_name}'"
   1234         )
   1235 except RuntimeError as e:
   1236     # Turn this into AttributeError so getattr(obj, key, default)
   1237     # works (this is called by TorchScript with __origin__)
   1238     raise AttributeError(
   1239         f"'_OpNamespace' '{self.name}' object has no attribute '{op_name}'"
   1240     ) from e

AttributeError: '_OpNamespace' '_C' object has no attribute 'silu_and_mul'

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
repodiac added the usage label on Apr 3, 2025
BKitor (Contributor) commented Apr 7, 2025

fwiw, I was able to run your sample code in a notebook w/ the following software versions:

$ pip list | grep -P '(jupyterlab |torch |vllm )'
jupyterlab                        4.3.6
torch                             2.6.0
vllm                              0.8.3

(With CUDA)
Not sure why it's working on the CLI, but not in a notebook.
Could it be an issue with how your notebook environment was launched or some other linker/python-env error?
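
One quick check from inside the notebook kernel (a minimal sketch; it only assumes the compiled extension module vllm._C that the warning in your log refers to): confirm the kernel runs the same interpreter as the working CLI, and import the extension directly so any linker error surfaces on its own.

import importlib
import sys

print(sys.executable)                # should point at the same env used for the working CLI run
importlib.import_module("vllm._C")   # raises the same ImportError if a libstdc++/linker mismatch is the cause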

repodiac (Author) commented Apr 8, 2025

I can check back, but I don't have CUDA!? I manually compiled vLLM for CPU only (following the instructions on the website).
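
For reference, the warning earlier in the log says the conda env's libstdc++ is missing GLIBCXX_3.4.32, which vllm/_C.abi3.so needs. A minimal sketch to check that from the notebook kernel (the library path is an assumption about a typical conda layout):

import sys
from pathlib import Path

# typical location of libstdc++ inside a conda env (assumption; adjust if needed)
lib = Path(sys.prefix) / "lib" / "libstdc++.so.6"
print(lib, "exists:", lib.exists())
if lib.exists():
    # the earlier WARNING says vllm/_C.abi3.so requires GLIBCXX_3.4.32
    print("GLIBCXX_3.4.32 present:", b"GLIBCXX_3.4.32" in lib.read_bytes())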
