VLLM_MOE_USE_DEEP_GEMM=0 VLLM_USE_DEEP_GEMM=0 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_MOE_ROUTING_SIMULATION_STRATEGY="uniform_random" UCX_NET_DEVICES="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1" GLOO_SOCKET_IFNAME=bond0 NCCL_SOCKET_IFNAME=bond0 vllm serve "/models/models--Qwen--Qwen3-235B-A22B-Instruct-2507-FP8/snapshots/ba82a1060073fa0ecdc70d7b1922ec071f60cf3e" -dp 32 --data-parallel-size-local 8 --data-parallel-backend=ray --max-num-seqs 64 --enable-expert-parallel --no-enable-prefix-caching --distributed-executor-backend ray --load-format dummy --all2all-backend deepep_low_latency 2>&1 | tee /docker_mnt/vllm.log
INFO 11-20 22:18:43 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=93062) INFO 11-20 22:18:43 [api_server.py:1977] vLLM API server version 0.11.2
(APIServer pid=93062) INFO 11-20 22:18:43 [utils.py:253] non-default args: {'model_tag': '/models/models--Qwen--Qwen3-235B-A22B-Instru
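The data-parallel flags in the command above (`-dp 32 --data-parallel-size-local 8`) imply a particular cluster layout. The bash sketch below does that arithmetic. It rests on three assumptions that do not come from the log. First, the tensor-parallel size is the default of 1, because no `-tp` flag is passed. Second, with `--enable-expert-parallel`, vLLM shards the experts across all DP × TP ranks. Third, the model has 128 routed experts per MoE layer; check `num_experts` in the model's `config.json`. All variable names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: derive the layout implied by the data-parallel flags.
# Assumptions: TP size is the default (1); EP size = DP x TP when
# --enable-expert-parallel is set; 128 routed experts (check config.json).
set -euo pipefail

DP_SIZE=32        # from -dp 32
DP_SIZE_LOCAL=8   # from --data-parallel-size-local 8
TP_SIZE=1         # assumed default tensor-parallel size
NUM_EXPERTS=128   # assumed; verify num_experts in the model's config.json

NUM_NODES=$(( DP_SIZE / DP_SIZE_LOCAL ))       # DP ranks spread over nodes
GPUS_PER_NODE=$(( DP_SIZE_LOCAL * TP_SIZE ))   # one GPU per DP rank when TP=1
EP_SIZE=$(( DP_SIZE * TP_SIZE ))               # experts sharded over every rank
EXPERTS_PER_RANK=$(( NUM_EXPERTS / EP_SIZE ))

echo "nodes=${NUM_NODES} gpus_per_node=${GPUS_PER_NODE} ep_size=${EP_SIZE} experts_per_rank=${EXPERTS_PER_RANK}"
```

Under these assumptions the script prints `nodes=4 gpus_per_node=8 ep_size=32 experts_per_rank=4`. That is four 8-GPU nodes, with each GPU holding 4 experts per MoE layer.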
# command
- Running without `VLLM_USE_DEEP_GEMM=0` and `--enforce-eager` also produces an error.
```bash
VLLM_USE_DEEP_GEMM=0 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
VLLM_MOE_ROUTING_SIMULATION_STRATEGY="uniform_random" \
UCX_NET_DEVICES="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1" \
GLOO_SOCKET_IFNAME=bond0 NCCL_SOCKET_IFNAME=bond0 \
vllm serve "/models/models--Qwen--Qwen3-235B-A22B-Instruct-2507-FP8/snapshots/ba82a1060073fa0ecdc70d7b1922ec071f60cf3e" \
  --max-model-len 435000 \
  -dp 32 --data-parallel-size-local 8 --data-parallel-backend=ray \
  --max-num-seqs 64 --enable-expert-parallel --no-enable-prefix-caching \
  --distributed-executor-backend ray \
  --kv-transfer-config '{ "kv_connector": "DecodeBenchConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "fill_mean": 0.015, "fill_std": 0.0 } }' \
  --load-format dummy --all2all-backend deepep_low_latency --enforce-eager
```
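The `--kv-transfer-config` value is a JSON object inside a single-quoted shell argument, so a stray quote or comma is easy to miss. Here is a minimal sketch that assumes `python3` is on the PATH, which vLLM's environment already requires. It stores the JSON in a shell variable and checks that it parses with the standard-library `json.tool` module. The same variable can then be passed to `vllm serve`. `KV_TRANSFER_CONFIG` and `KV_CONFIG_VALID` are hypothetical names used only for illustration.

```bash
#!/usr/bin/env bash
# Sketch: validate the --kv-transfer-config JSON before launching.
# Assumes python3 is available; KV_TRANSFER_CONFIG is a hypothetical name.
set -euo pipefail

KV_TRANSFER_CONFIG='{ "kv_connector": "DecodeBenchConnector", "kv_role": "kv_both", "kv_connector_extra_config": { "fill_mean": 0.015, "fill_std": 0.0 } }'

# json.tool reads stdin and exits non-zero if the input is not valid JSON.
if printf '%s' "$KV_TRANSFER_CONFIG" | python3 -m json.tool > /dev/null; then
  KV_CONFIG_VALID=1
  echo "kv-transfer-config: valid JSON"
else
  echo "kv-transfer-config: invalid JSON" >&2
  exit 1
fi

# The validated string is then passed unchanged, for example:
#   vllm serve ... --kv-transfer-config "$KV_TRANSFER_CONFIG" ...
```

Keeping the JSON in one variable also keeps both launches (with and without the connector) from drifting apart when the fill parameters change.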