@FirwoodLin
FirwoodLin / gist:b961faee611c8acfa49168834bb4c73e
Last active November 21, 2025 06:27
vLLM error when DeepGEMM is disabled
```
VLLM_MOE_USE_DEEP_GEMM=0 VLLM_USE_DEEP_GEMM=0 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_MOE_ROUTING_SIMULATION_STRATEGY="uniform_random" UCX_NET_DEVICES="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1" GLOO_SOCKET_IFNAME=bond0 NCCL_SOCKET_IFNAME=bond0 vllm serve "/models/models--Qwen--Qwen3-235B-A22B-Instruct-2507-FP8/snapshots/ba82a1060073fa0ecdc70d7b1922ec071f60cf3e" -dp 32 --data-parallel-size-local 8 --data-parallel-backend=ray --max-num-seqs 64 --enable-expert-parallel --no-enable-prefix-caching --distributed-executor-backend ray --load-format dummy --all2all-backend deepep_low_latency 2>&1 | tee /docker_mnt/vllm.log
```
```
INFO 11-20 22:18:43 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=93062) INFO 11-20 22:18:43 [api_server.py:1977] vLLM API server version 0.11.2
(APIServer pid=93062) INFO 11-20 22:18:43 [utils.py:253] non-default args: {'model_tag': '/models/models--Qwen--Qwen3-235B-A22B-Instru
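The command above disables DeepGEMM by setting `VLLM_USE_DEEP_GEMM=0` and `VLLM_MOE_USE_DEEP_GEMM=0`. As a minimal sketch of how such `"0"`/`"1"`-style flags behave (the `env_flag` helper and its parsing rules are an assumption for illustration, not vLLM's actual implementation):

```python
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Interpret a "0"/"1"-style environment variable as a boolean.

    NOTE: illustrative helper only; vLLM's own env parsing may differ.
    """
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip() not in ("0", "false", "False", "")


# Mirror the flags set in the command above.
os.environ["VLLM_USE_DEEP_GEMM"] = "0"
os.environ["VLLM_MOE_USE_DEEP_GEMM"] = "0"

print(env_flag("VLLM_USE_DEEP_GEMM"))      # False -> DeepGEMM path disabled
print(env_flag("VLLM_MOE_USE_DEEP_GEMM"))  # False
```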
@FirwoodLin
FirwoodLin / gist:29d2625b08fda158afc93d12dae7a2b3
Created November 20, 2025 13:32
vLLM 0.11.1 CUDA error when running DSV3/Qwen3-MoE
# Command
- The error also occurs when `VLLM_USE_DEEP_GEMM` is unset and `--enforce-eager` is omitted.
```
VLLM_USE_DEEP_GEMM=0 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_MOE_ROUTING_SIMULATION_STRATEGY="uniform_random" UCX_NET_DEVICES="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1" GLOO_SOCKET_IFNAME=bond0 NCCL_SOCKET_IFNAME=bond0 vllm serve "/models/models--Qwen--Qwen3-235B-A22B-Instruct-2507-FP8/snapshots/ba82a1060073fa0ecdc70d7b1922ec071f60cf3e" --max-model-len 435000 -dp 32 --data-parallel-size-local 8 --data-parallel-backend=ray --max-num-seqs 64 --enable-expert-parallel --no-enable-prefix-caching --distributed-executor-backend ray --kv-transfer-config '{ "kv_connector": "DecodeBenchConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "fill_mean": 0.015, "fill_std": 0.0 } }' --load-format dummy --all2all-backend deepep_low_latency --enforce-eager
```
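The `--kv-transfer-config` flag in the command above takes a JSON object. A minimal sketch that validates the exact value used here parses as JSON before launching (only assumption: the flag's value must be valid JSON; the key names are copied verbatim from the command):

```python
import json

# The --kv-transfer-config value, copied from the vllm serve command above.
kv_transfer_config = """
{ "kv_connector": "DecodeBenchConnector", "kv_role": "kv_both",
  "kv_connector_extra_config": { "fill_mean": 0.015, "fill_std": 0.0 } }
"""

# A malformed value here would fail at startup; checking it up front is cheaper.
cfg = json.loads(kv_transfer_config)
print(cfg["kv_connector"])                             # DecodeBenchConnector
print(cfg["kv_connector_extra_config"]["fill_mean"])   # 0.015
```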