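Running large language models locally has become a popular topic. This guide shows how to use IPEX-LLM (Intel's LLM acceleration library for PyTorch) to deploy models efficiently on Intel hardware. It is written both for newcomers and for readers who already have some experience.
IPEX-LLM is a PyTorch-based optimization library from Intel. It speeds up inference on CPUs and is also optimized for Intel GPUs, both integrated and discrete. It supports the following hardware platforms: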
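Coverage of mainstream open-source models:
Precision optimization options
Memory management
Compute performance improvements
Integration with mainstream frameworks:
Processor options:
GPU support: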
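Important notes
- IPEX-LLM primarily targets Linux; Windows users can run it through WSL.
- iGPU users need to configure the environment themselves.
- For Arc-series dGPUs, the recommended setup is Windows + WSL + Docker.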
# Create and activate the conda environment
conda create -n llm python=3.11 libuv
conda activate llm
Choose the install command that matches your processor:
Intel Core™ Ultra processors (Series 2, model 2xxV, code name Lunar Lake):
US region:
pip install --pre --upgrade ipex-llm[xpu_lnl] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/lnl/us/
China region:
pip install --pre --upgrade ipex-llm[xpu_lnl] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/lnl/cn/
Other Intel iGPUs and dGPUs:
US region:
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
China region:
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
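The four install commands above differ only in the extras name and the index URL. As a small sketch of that pattern, the helper `build_install_command` below assembles the command from a processor family and a region. The helper is hypothetical and not part of IPEX-LLM; the URLs are copied from the commands above.

```python
# Hypothetical helper (not part of IPEX-LLM): rebuild the pip commands listed above.
BASE_URL = "https://pytorch-extension.intel.com/release-whl/stable"


def build_install_command(lunar_lake: bool, region: str) -> str:
    """Return the pip command for the given processor family and region."""
    if region not in ("us", "cn"):
        raise ValueError("region must be 'us' or 'cn'")
    # Lunar Lake uses its own extras name and wheel index; everything else uses 'xpu'.
    variant = "lnl" if lunar_lake else "xpu"
    extra = "xpu_lnl" if lunar_lake else "xpu"
    return (
        f"pip install --pre --upgrade ipex-llm[{extra}] "
        f"--extra-index-url {BASE_URL}/{variant}/{region}/"
    )


if __name__ == "__main__":
    print(build_install_command(lunar_lake=True, region="cn"))
```

Printing the command instead of running it keeps the choice visible before anything is installed.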
Set the environment variables in Miniforge Prompt:
Intel iGPU:
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
Intel Arc™ A770:
set SYCL_CACHE_PERSISTENT=1
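If you would rather keep these settings next to your script, the sketch below applies the same variables from Python. It assumes that setting them in `os.environ` before `torch` and `ipex_llm` are imported has the same effect as the `set` commands above; that equivalence is an assumption, so fall back to the shell commands if in doubt. The device labels and the `apply_runtime_env` helper are hypothetical.

```python
import os

# Values taken from the `set` commands above; the dictionary keys are hypothetical labels.
RUNTIME_ENV = {
    "igpu": {"SYCL_CACHE_PERSISTENT": "1", "BIGDL_LLM_XMX_DISABLED": "1"},
    "arc_a770": {"SYCL_CACHE_PERSISTENT": "1"},
}


def apply_runtime_env(device: str) -> dict:
    """Copy the variables for `device` into os.environ and return what was set."""
    settings = RUNTIME_ENV[device]
    os.environ.update(settings)
    return dict(settings)


if __name__ == "__main__":
    # Assumption: call this before importing torch or ipex_llm.
    print(apply_runtime_env("igpu"))
```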
import torch
from ipex_llm.transformers import AutoModel, AutoModelForCausalLM
tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
print(torch.matmul(tensor_1, tensor_2).size())
Expected output:
torch.Size([1, 1, 40, 40])
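The check above fails with an error when no XPU device is visible. A more defensive variant, sketched below, reports which device is usable instead of raising. `pick_device` is a hypothetical helper; it assumes that `torch.xpu.is_available()` is present once the Intel GPU stack is installed, and that on some PyTorch builds the XPU backend is only registered after `ipex_llm` has been imported.

```python
import importlib.util


def pick_device() -> str:
    """Return 'xpu' when an Intel GPU is usable from PyTorch, otherwise 'cpu'."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch itself is not installed
    import torch

    try:
        # Assumption: importing ipex_llm may be needed to register the XPU backend.
        import ipex_llm  # noqa: F401
    except ImportError:
        pass
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

If this prints `cpu` on a machine with an Intel GPU, the driver or the IPEX-LLM installation is the first thing to recheck.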
import time

import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig


class Qwen2Deployment:
    def __init__(self):
        # Default sampling settings shared by warmup and generation
        self.generation_config = GenerationConfig(
            use_cache=True,
            temperature=0.7,
            top_p=0.9,
            max_new_tokens=512
        )
        self.setup_model()

    def setup_model(self):
        print('正在加载模型和分词器...')
        self.tokenizer = AutoTokenizer.from_pretrained(
            "Qwen/Qwen2-1.5B-Instruct",
            trust_remote_code=True
        )
        # Load the weights in 4-bit and move the model to the Intel GPU
        self.model = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen2-1.5B-Instruct",
            load_in_4bit=True,
            cpu_embedding=False,
            trust_remote_code=True
        ).to('xpu')
        print('模型加载完成!')

    def warmup(self):
        # Run one short greedy generation so later calls are not slowed by first-run setup
        print('开始预热...')
        test_input = "Hello, how are you?"
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": test_input}
        ]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        input_ids = self.tokenizer.encode(text, return_tensors="pt").to('xpu')
        with torch.inference_mode():
            _ = self.model.generate(
                input_ids,
                do_sample=False,
                max_new_tokens=32,
                generation_config=self.generation_config
            )
        print('预热完成!')

    def generate_response(self, user_input):
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_input}
        ]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        # The timer covers tokenization, generation, and decoding
        start_time = time.time()
        with torch.inference_mode():
            input_ids = self.tokenizer.encode(text, return_tensors="pt").to('xpu')
            output = self.model.generate(
                input_ids,
                do_sample=True,
                max_new_tokens=512,
                generation_config=self.generation_config
            ).cpu()
            # output[0] holds prompt + completion; special tokens are kept,
            # so the chat template is visible in the decoded text
            response = self.tokenizer.decode(output[0], skip_special_tokens=False)
        end_time = time.time()
        return {
            'response': response,
            'generation_time': f"{(end_time - start_time):.2f} seconds"
        }


if __name__ == "__main__":
    # Initialize the deployment
    deployment = Qwen2Deployment()
    # Warm up the model
    deployment.warmup()
    # Test generation
    test_questions = [
        "What is artificial intelligence?",
        "How does machine learning work?",
        "Explain neural networks in simple terms."
    ]
    for question in test_questions:
        print(f"\nQuestion: {question}")
        result = deployment.generate_response(question)
        print(f"Response: {result['response']}")
        print(f"Generation time: {result['generation_time']}")
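The script reports wall-clock time only, which makes runs with different answer lengths hard to compare. A rough way to normalize is tokens per second, sketched below. `count_new_tokens` and `tokens_per_second` are hypothetical helpers; with the script above, the prompt length would be `input_ids.shape[1]` and the output length `output.shape[1]`.

```python
def count_new_tokens(prompt_len: int, output_len: int) -> int:
    """Number of tokens generated beyond the prompt (generate() returns prompt + completion)."""
    return max(output_len - prompt_len, 0)


def tokens_per_second(new_tokens: int, seconds: float) -> float:
    """Average generation throughput over one call."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return new_tokens / seconds


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from this guide.
    new_tokens = count_new_tokens(prompt_len=30, output_len=130)
    print(f"{tokens_per_second(new_tokens, 4.0):.1f} tokens/s")
```

Because the timer in the script also covers tokenization and decoding, the figure slightly understates the pure generation speed.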
When running LLMs on an Intel iGPU with limited memory, we recommend setting cpu_embedding=True in from_pretrained. This lets the memory-intensive embedding layer run on the CPU instead of the GPU.
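One way to keep this choice in a single place is a small function that builds the keyword arguments for `from_pretrained`. The sketch below is a hypothetical helper, not an IPEX-LLM API; the three keyword arguments are the ones already used in the deployment script.

```python
def build_load_kwargs(memory_limited: bool) -> dict:
    """Keyword arguments for from_pretrained, based on available GPU memory."""
    return {
        "load_in_4bit": True,
        # Keep the embedding layer on the CPU when GPU memory is tight
        "cpu_embedding": memory_limited,
        "trust_remote_code": True,
    }


if __name__ == "__main__":
    print(build_load_kwargs(memory_limited=True))
    # Intended use with the script above (requires ipex_llm and an Intel GPU):
    # model = AutoModelForCausalLM.from_pretrained(
    #     "Qwen/Qwen2-1.5B-Instruct", **build_load_kwargs(memory_limited=True)
    # ).to('xpu')
```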
Sample output
正在加载模型和分词器...
模型加载完成!
开始预热...
预热完成!
Question: What is artificial intelligence?
Response: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is artificial intelligence?<|im_end|>
<|im_start|>assistant
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and act like humans. It involves the development of computer systems that can learn, reason, and solve problems, as well as perform tasks that typically require human intelligence, such as speech recognition, image recognition, natural language processing, and decision making. AI is used in a wide range of fields, including computer vision, machine learning, natural language processing, robotics, and healthcare.<|im_end|>
Generation time: 6.06 seconds
Question: How does machine learning work?
Response: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
How does machine learning work?<|im_end|>
<|im_start|>assistant
Machine learning is a branch of artificial intelligence that allows computers to learn and improve their performance over time without being explicitly programmed. It is based on the idea that computers can be taught to recognize patterns and make decisions based on those patterns.
The basic steps of machine learning are as follows:
1. Data collection: Collect a large amount of data that can be used to train the machine learning algorithm.
2. Data preprocessing: Clean and organize the data before it can be used to train the machine learning algorithm.
3. Model selection: Choose a machine learning algorithm that is appropriate for the type of data and problem that needs to be solved.
4. Training: Use the data and machine learning algorithm to train the model. This involves feeding the data into the model and adjusting the parameters until the model produces the best possible output.
5. Testing: Evaluate the model using a separate set of data that was not used during training. This helps to see how well the model generalizes to new data.
6. Model evaluation: Evaluate the performance of the model using various metrics, such as accuracy, precision, recall, and F1 score.
7. Model refinement: Refine the model based on the results of the evaluation to improve its performance.
8. Deployment: Deploy the model in a production environment, such as a website or mobile app, to make predictions or recommendations based on the input data.
Overall, machine learning is a powerful tool that can be used to automate many tasks, improve decision-making, and make predictions based on historical data.<|im_end|>
Generation time: 17.51 seconds
Question: Explain neural networks in simple terms.
Response: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain neural networks in simple terms.<|im_end|>
<|im_start|>assistant
A neural network is a type of machine learning algorithm that is used to make predictions or classify data. It is based on the idea of artificial neural networks, which are modeled after the structure and function of the human brain.
A neural network consists of multiple layers of interconnected nodes, called neurons, which are connected by weighted edges. Each neuron receives inputs from previous neurons, processes the information, and produces an output that is used to update the weights of the connections to other neurons. This process is repeated many times, resulting in a set of weights that can be used to make predictions or classify data.
Neural networks can be trained using algorithms such as backpropagation, which allows the network to learn from its mistakes and make more accurate predictions. They can also be used for a variety of tasks, such as image recognition, natural language processing, and predictive modeling.
One of the key advantages of neural networks is their ability to learn from large amounts of data and make predictions on unseen data. They are also able to handle complex relationships between features, making them useful for tasks such as image recognition and natural language processing.<|im_end|>
Generation time: 12.68 seconds
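Each response above still contains the whole chat template, because the script decodes the full output with `skip_special_tokens=False`. If only the assistant's answer is wanted, a post-processing step like the sketch below can cut it out of the decoded string. `extract_reply` is a hypothetical helper that relies on the `<|im_start|>` and `<|im_end|>` markers visible in the output above.

```python
ASSISTANT_TAG = "<|im_start|>assistant\n"
END_TAG = "<|im_end|>"


def extract_reply(decoded: str) -> str:
    """Return only the last assistant turn from a decoded chat transcript."""
    _, found, tail = decoded.rpartition(ASSISTANT_TAG)
    if not found:
        # No assistant marker: return the text unchanged apart from whitespace
        return decoded.strip()
    reply, _, _ = tail.partition(END_TAG)
    return reply.strip()


if __name__ == "__main__":
    raw = (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\nHi<|im_end|>\n"
        "<|im_start|>assistant\nHello!<|im_end|>"
    )
    print(extract_reply(raw))
```

An alternative that avoids string matching is to decode only the tokens after the prompt, for example `output[0][input_ids.shape[1]:]` with `skip_special_tokens=True`.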
Windows Task Manager
Arc Control (requires an Arc discrete GPU)
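IPEX-LLM provides a flexible way to deploy large language models on Intel hardware. Working through this guide should help you get good performance out of your hardware and deploy models efficiently.
Keep in mind that optimization is an ongoing process. Recommendations:
More details: for the full IPEX-LLM documentation, see the ipex-llm repository on GitHub.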