vLLM

`vllm serve` 常用参数

--enable-log-requests: 同时开启VLLM_LOGGING_LEVEL=DEBUG后prompt在日志中的details: prompt
--enable-log-outputs: 同时开启VLLM_LOGGING_LEVEL=DEBUG后output在日志中的(streaming complete): output

--model: 模型名称或本地模型路径（Hugging Face 仓库名、本地目录）。
--served-model-name: 对外提供的模型名称
--download-dir: 模型下载和加载缓存目录。
--trust-remote-code: 允许执行 Hugging Face 仓库中的自定义代码。注意：只对可信模型开启。
--dtype: 模型权重和激活的数据类型，常见写法：auto、half、float16、bfloat16、float、'float32'。
--kv-cache-dtype:
--quantization / -q:
--load-format: 指定权重加载格式，如 auto、safetensors、pt、gguf。
--max-model-len: 模型最大上下文长度，包含输入和输出。常见写法：4096、8192、32768、32K。
--tokenizer: 单独指定 tokenizer；不指定时通常跟随 --model。
--chat-template: 把 OpenAI 风格的 messages，翻译成具体模型 prompt 格式的规则文件。示例/path/to/template.jinja
--default-chat-template-kwargs: JSON格式设置默认chat-template参数

--seed: 随机种子，用于复现。
--master-addr & --master-port & --nnodes & --node-rank: 多机部署时的主节点地址、端口、节点总数和当前节点编号。
--speculative-config: Speculative decoding configuration. SpeculativeConfig。示例用法：'{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

--dtype Possible choices: auto, bfloat16, float, float16, float32, half Data type for model weights and activations:

"auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
"half" for FP16. Recommended for AWQ quantization.
"float16" is the same as "half".
"bfloat16" for a balance between precision and range.
"float" is shorthand for FP32 precision.
"float32" for FP32 precision. Default: auto

图片下载、转成PIL Image再通过Vision Encoder转为tensor/embedding等步骤都是vllm来完成的