I previously tried Xinference and Ollama, but both failed to install, either because of conflicts with other software or because I lacked permission to install system packages. vLLM is a plain Python package, so no separate software installation is needed, which makes it more convenient.
pip install vllm
At install time, the latest version, vLLM 0.6.4, was incompatible with my torch 2.4.0: while building dependencies it pulled down torch 2.5.1, and the install failed with the following error:
ImportError: XXX/../../nvidia/cusparse/lib/libcusparse.so.12: symbol __nvJitLinkComplete_12_4, version libnvJitLink.so.12 not defined in file libnvJitLink.so.12 with link time reference
error: metadata-generation-failed
This is a version-incompatibility problem. I re-pinned the install to the 0.6 series, which is compatible with torch 2.4.0, and the dependency build no longer re-downloaded a pile of packages already present in the environment. vLLM 0.5 does not work; I already tried it.
pip install vllm==0.6.0
export CUDA_VISIBLE_DEVICES=3
nohup python -m vllm.entrypoints.openai.api_server \
--model \
--served-model-name llama3_8b \
--port 5551 \
--gpu-memory-utilization 0.25 \
--dtype=half \
> vllm_test.out &
Running the command above deploys the model.
vLLM's output goes to vllm_test.out (the file given in the redirect), not to nohup.out.
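The --gpu-memory-utilization flag controls what fraction of each GPU's memory vLLM pre-allocates for weights plus KV cache. A quick sanity-check sketch of the arithmetic, assuming a hypothetical 80 GiB card:

```python
# Sketch: how much GPU memory vLLM will claim for a given
# --gpu-memory-utilization setting (fraction of total memory).
def reserved_gib(total_gib: float, utilization: float) -> float:
    """Memory vLLM pre-allocates on one GPU, in GiB."""
    return total_gib * utilization

# With --gpu-memory-utilization 0.25 on an 80 GiB card (hypothetical size):
print(reserved_gib(80, 0.25))  # 20.0 GiB reserved for the server
```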
Note that vLLM no longer falls back to a default chat template, so the chat-template calls shown in some tutorials will raise an error.
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:5551/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="llama3_8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)
This is the template-generated code from those tutorials, but today it fails with:
BadRequestError: Error code: 400 - {'object': 'error', 'message': 'As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.', 'type': 'BadRequestError', 'param': None, 'code': 400}
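One way around this is to pass a chat template explicitly at launch via the server's --chat-template flag. A sketch, where $modelpath stands for your local model path and the .jinja file name is hypothetical (many model repos ship a template, or you can copy the one in the model's tokenizer_config.json):

```shell
# Relaunch the server with an explicit Jinja chat template so that
# /v1/chat/completions requests no longer need a tokenizer-defined template.
nohup python -m vllm.entrypoints.openai.api_server \
    --model $modelpath \
    --served-model-name llama3_8b \
    --port 5551 \
    --gpu-memory-utilization 0.25 \
    --dtype=half \
    --chat-template ./llama3_chat_template.jinja \
    > vllm_test.out &
```

This is a launch-configuration fragment; the completions endpoint shown next avoids the problem entirely, since it takes a raw prompt and needs no template.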
Below is the direct-prompt (completions) call, which involves four pieces: the API key, the base URL, the model name, and the prompt.
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:5551/v1"
model_name = 'llama3_8b'
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model=model_name,
                                       prompt="San Francisco is a")
print("Completion result:", completion)
Completion result: Completion(id='cmpl-302c965c13244fc7ab40d4d5ccd01f8c', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' top holiday destination featuring scenic beauty and great ethnic and cultural diversity. Explore San Francisco', stop_reason=None, prompt_logprobs=None)], created=1732034306, model='llama3_8b', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=5, total_tokens=21, completion_tokens_details=None, prompt_tokens_details=None))
The result is a wrapper object; the generations live in its choices field. Since we did not request multiple completions, the list has a single element, which can be extracted with:
completion.choices[0].text
# ' top holiday destination featuring scenic beauty and great ethnic and cultural diversity. Explore San Francisco'
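The same extraction works on the raw JSON the server returns (e.g. when calling the endpoint with curl instead of the client library). A minimal sketch with a hand-written response dict shaped like the output above (values abridged):

```python
# Extract generated texts from an OpenAI-style /v1/completions JSON response.
def completion_texts(response: dict) -> list:
    """Return the text of every choice, ordered by choice index."""
    choices = sorted(response["choices"], key=lambda c: c["index"])
    return [choice["text"] for choice in choices]

# A dict mimicking the server response shown above (abridged, hypothetical):
resp = {
    "object": "text_completion",
    "model": "llama3_8b",
    "choices": [
        {"index": 0, "text": " top holiday destination", "finish_reason": "length"},
    ],
}
print(completion_texts(resp))  # [' top holiday destination']
```

If you request several completions in one call, the same helper returns them all in order.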
Use --tensor-parallel-size to set the number of GPUs:
export CUDA_VISIBLE_DEVICES=4,5
modelpath=../DataCollection/officials/Qwen2.5-14B-Instruct
modelname=Qwen2.5-14B-Instruct
nohup python -m vllm.entrypoints.openai.api_server \
--model $modelpath \
--served-model-name $modelname \
--port 5551 \
--gpu-memory-utilization 0.3 \
--dtype=half \
--tensor-parallel-size 2 \
> run_vllm_distributed.log 2>&1 &
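Once the server is up, a quick smoke test from the command line (this assumes the server above is running on port 5551; it is not something you can run offline):

```shell
# List the served model names; should include Qwen2.5-14B-Instruct.
curl http://localhost:5551/v1/models

# Send a raw-prompt completion request to the OpenAI-compatible endpoint.
curl http://localhost:5551/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen2.5-14B-Instruct", "prompt": "San Francisco is a", "max_tokens": 16}'
```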