Xorbits Inference (Xinference) 是一个开源平台,用于简化各种 AI 模型的运行和集成。借助 Xinference,您可以使用任何开源 LLM、嵌入模型和多模态模型在云端或本地环境中运行推理,并创建强大的 AI 应用。
docker 下载对应的 xinference
docker pull xprobe/xinference
docker 运行,注意 路径改成自己的,
docker run -d --name xinference --gpus all -v E:/docker/xinference/models:/root/models -v E:/docker/xinference/.xinference:/root/.xinference -v E:/docker/xinference/.cache/huggingface:/root/.cache/huggingface -e XINFERENCE_HOME=/root/models -p 9997:9997 xprobe/xinference:latest xinference-local -H 0.0.0.0
-d
: 让容器在后台运行。--name xinference
: 为容器指定一个名称,这里是xinference。--gpus all
: 允许容器访问主机上的所有GPU,这对于需要进行大量计算的任务(如机器学习模型的推理)非常有用。-v E:/docker/xinference/models:/root/models
, -v E:/docker/xinference/.xinference:/root/.xinference
, -v E:/docker/xinference/.cache/huggingface:/root/.cache/huggingface
: 这些参数用于将主机的目录挂载到容器内部的特定路径,以便于数据持久化和共享。例如,第一个挂载是将主机的E:/docker/xinference/models目录映射到容器内的/root/models目录。-e XINFERENCE_HOME=/root/models
: 设置环境变量XINFERENCE_HOME,其值为/root/models,这可能是在容器内配置某些应用行为的方式。-p 9997:9997
: 将主机的9997端口映射到容器的9997端口,允许外部通过主机的该端口访问容器的服务。xprobe/xinference:latest
: 指定要使用的镜像和标签,这里使用的是xprobe/xinference镜像的latest版本。xinference-local -H 0.0.0.0
: 在容器启动时执行的命令,看起来像是以本地模式运行某个服务,并监听所有网络接口。https://inference.readthedocs.io/zh-cn/latest/index.html
docker dify 添加 docker 容器内ip 配置
http://host.docker.internal:9997
MODEL NAME |
ABILITIES |
COTNEXT_LENGTH |
DESCRIPTION |
---|---|---|---|
aquila2 |
generate |
2048 |
Aquila2 series models are the base language models |
aquila2-chat |
chat |
2048 |
Aquila2-chat series models are the chat models |
aquila2-chat-16k |
chat |
16384 |
AquilaChat2-16k series models are the long-text chat models |
baichuan-2 |
generate |
4096 |
Baichuan2 is an open-source Transformer based LLM that is trained on both Chinese and English data. |
baichuan-2-chat |
chat |
4096 |
Baichuan2-chat is a fine-tuned version of the Baichuan LLM, specializing in chatting. |
c4ai-command-r-v01 |
chat |
131072 |
C4AI Command-R(+) is a research release of a 35 and 104 billion parameter highly performant generative model. |
code-llama |
generate |
100000 |
Code-Llama is an open-source LLM trained by fine-tuning LLaMA2 for generating and discussing code. |
code-llama-instruct |
chat |
100000 |
Code-Llama-Instruct is an instruct-tuned version of the Code-Llama LLM. |
code-llama-python |
generate |
100000 |
Code-Llama-Python is a fine-tuned version of the Code-Llama LLM, specializing in Python. |
codegeex4 |
chat |
131072 |
the open-source version of the latest CodeGeeX4 model series |
codeqwen1.5 |
generate |
65536 |
CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes. |
codeqwen1.5-chat |
chat |
65536 |
CodeQwen1.5 is the Code-Specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of data of codes. |
codeshell |
generate |
8194 |
CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University. |
codeshell-chat |
chat |
8194 |
CodeShell is a multi-language code LLM developed by the Knowledge Computing Lab of Peking University. |
codestral-v0.1 |
generate |
32768 |
Codestrall-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash |
cogagent |
chat, vision |
4096 |
The CogAgent-9B-20241220 model is based on GLM-4V-9B, a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, CogAgent-9B-20241220 achieves significant advancements in GUI perception, inference prediction accuracy, action space completeness, and task generalizability. |
cogvlm2 |
chat, vision |
8192 |
CogVLM2 have achieved good results in many lists compared to the previous generation of CogVLM open source models. Its excellent performance can compete with some non-open source models. |
cogvlm2-video-llama3-chat |
chat, vision |
8192 |
CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks. |
csg-wukong-chat-v0.1 |
chat |
32768 |
csg-wukong-1B is a 1 billion-parameter small language model(SLM) pretrained on 1T tokens. |
deepseek |
generate |
4096 |
DeepSeek LLM, trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. |
deepseek-chat |
chat |
4096 |
DeepSeek LLM is an advanced language model comprising 67 billion parameters. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. |
deepseek-coder |
generate |
16384 |
Deepseek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. |
deepseek-coder-instruct |
chat |
16384 |
deepseek-coder-instruct is a model initialized from deepseek-coder-base and fine-tuned on 2B tokens of instruction data. |
deepseek-r1 |
chat |
163840 |
DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. |
deepseek-r1-distill-llama |
chat |
131072 |
deepseek-r1-distill-llama is distilled from DeepSeek-R1 based on Llama |
deepseek-r1-distill-qwen |
chat |
131072 |
deepseek-r1-distill-qwen is distilled from DeepSeek-R1 based on Qwen |
deepseek-v2 |
generate |
128000 |
DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. |
deepseek-v2-chat |
chat |
128000 |
DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. |
deepseek-v2-chat-0628 |
chat |
128000 |
DeepSeek-V2-Chat-0628 is an improved version of DeepSeek-V2-Chat. |
deepseek-v2.5 |
chat |
128000 |
DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct. The new model integrates the general and coding abilities of the two previous versions. |
deepseek-v3 |
chat |
163840 |
DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. |
deepseek-vl-chat |
chat, vision |
4096 |
DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios. |
gemma-2-it |
chat |
8192 |
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. |
gemma-it |
chat |
8192 |
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. |
glm-4v |
chat, vision |
8192 |
GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
glm-edge-chat |
chat |
8192 |
The GLM-Edge series is our attempt to face the end-side real-life scenarios, which consists of two sizes of large-language dialogue models and multimodal comprehension models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). Among them, the 1.5B / 2B model is mainly for platforms such as mobile phones and cars, and the 4B / 5B model is mainly for platforms such as PCs. |
glm-edge-v |
chat, vision |
8192 |
The GLM-Edge series is our attempt to face the end-side real-life scenarios, which consists of two sizes of large-language dialogue models and multimodal comprehension models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). Among them, the 1.5B / 2B model is mainly for platforms such as mobile phones and cars, and the 4B / 5B model is mainly for platforms such as PCs. |
glm4-chat |
chat, tools |
131072 |
GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
glm4-chat-1m |
chat, tools |
1048576 |
GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. |
gorilla-openfunctions-v2 |
chat |
4096 |
OpenFunctions is designed to extend Large Language Model (LLM) Chat Completion feature to formulate executable APIs call given natural language instructions and API context. |
gpt-2 |
generate |
1024 |
GPT-2 is a Transformer-based LLM that is trained on WebTest, a 40 GB dataset of Reddit posts with 3+ upvotes. |
internlm2-chat |
chat |
32768 |
The second generation of the InternLM model, InternLM2. |
internlm2.5-chat |
chat |
32768 |
InternLM2.5 series of the InternLM model. |
internlm2.5-chat-1m |
chat |
262144 |
InternLM2.5 series of the InternLM model supports 1M long-context |
internlm3-instruct |
chat, tools |
32768 |
InternLM3 has open-sourced an 8-billion parameter instruction model, InternLM3-8B-Instruct, designed for general-purpose usage and advanced reasoning. |
internvl-chat |
chat, vision |
32768 |
InternVL 1.5 is an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. |
internvl2 |
chat, vision |
32768 |
InternVL 2 is an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. |
llama-2 |
generate |
4096 |
Llama-2 is the second generation of Llama, open-source and trained on a larger amount of data. |
llama-2-chat |
chat |
4096 |
Llama-2-Chat is a fine-tuned version of the Llama-2 LLM, specializing in chatting. |
llama-3 |
generate |
8192 |
Llama 3 is an auto-regressive language model that uses an optimized transformer architecture |
llama-3-instruct |
chat |
8192 |
The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.. |
llama-3.1 |
generate |
131072 |
Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture |
llama-3.1-instruct |
chat, tools |
131072 |
The Llama 3.1 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.. |
llama-3.2-vision |
generate, vision |
131072 |
The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image… |
llama-3.2-vision-instruct |
chat, vision |
131072 |
Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image… |
llama-3.3-instruct |
chat, tools |
131072 |
The Llama 3.3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.. |
marco-o1 |
chat, tools |
32768 |
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions |
minicpm-2b-dpo-bf16 |
chat |
4096 |
MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-2b-dpo-fp16 |
chat |
4096 |
MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-2b-dpo-fp32 |
chat |
4096 |
MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-2b-sft-bf16 |
chat |
4096 |
MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-2b-sft-fp32 |
chat |
4096 |
MiniCPM is an End-Size LLM developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. |
minicpm-llama3-v-2_5 |
chat, vision |
8192 |
MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. |
minicpm-v-2.6 |
chat, vision |
32768 |
MiniCPM-V 2.6 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. |
minicpm3-4b |
chat |
32768 |
MiniCPM3-4B is the 3rd generation of MiniCPM series. The overall performance of MiniCPM3-4B surpasses Phi-3.5-mini-Instruct and GPT-3.5-Turbo-0125, being comparable with many recent 7B~9B models. |
mistral-instruct-v0.1 |
chat |
8192 |
Mistral-7B-Instruct is a fine-tuned version of the Mistral-7B LLM on public datasets, specializing in chatting. |
mistral-instruct-v0.2 |
chat |
8192 |
The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1. |
mistral-instruct-v0.3 |
chat |
32768 |
The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1. |
mistral-large-instruct |
chat |
131072 |
Mistral-Large-Instruct-2407 is an advanced dense Large Language Model (LLM) of 123B parameters with state-of-the-art reasoning, knowledge and coding capabilities. |
mistral-nemo-instruct |
chat |
1024000 |
The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407 |
mistral-v0.1 |
generate |
8192 |
Mistral-7B is a unmoderated Transformer based LLM claiming to outperform Llama2 on all benchmarks. |
mixtral-8x22b-instruct-v0.1 |
chat |
65536 |
The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the Mixtral-8x22B-v0.1, specializing in chatting. |
mixtral-instruct-v0.1 |
chat |
32768 |
Mistral-8x7B-Instruct is a fine-tuned version of the Mistral-8x7B LLM, specializing in chatting. |
mixtral-v0.1 |
generate |
32768 |
The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. |
omnilmm |
chat, vision |
2048 |
OmniLMM is a family of open-source large multimodal models (LMMs) adept at vision & language modeling. |
openhermes-2.5 |
chat |
8192 |
Openhermes 2.5 is a fine-tuned version of Mistral-7B-v0.1 on primarily GPT-4 generated data. |
opt |
generate |
2048 |
Opt is an open-source, decoder-only, Transformer based LLM that was designed to replicate GPT-3. |
orion-chat |
chat |
4096 |
Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI. |
orion-chat-rag |
chat |
4096 |
Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI. |
phi-2 |
generate |
2048 |
Phi-2 is a 2.7B Transformer based LLM used for research on model safety, trained with data similar to Phi-1.5 but augmented with synthetic texts and curated websites. |
phi-3-mini-128k-instruct |
chat |
128000 |
The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. |
phi-3-mini-4k-instruct |
chat |
4096 |
The Phi-3-Mini-4k-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. |
platypus2-70b-instruct |
generate |
4096 |
Platypus-70B-instruct is a merge of garage-bAInd/Platypus2-70B and upstage/Llama-2-70b-instruct-v2. |
qvq-72b-preview |
chat, vision |
32768 |
QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities. |
qwen-chat |
chat |
32768 |
Qwen-chat is a fine-tuned version of the Qwen LLM trained with alignment techniques, specializing in chatting. |
qwen-vl-chat |
chat, vision |
4096 |
Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities. |
qwen1.5-chat |
chat, tools |
32768 |
Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. |
qwen1.5-moe-chat |
chat, tools |
32768 |
Qwen1.5-MoE is a transformer-based MoE decoder-only language model pretrained on a large amount of data. |
qwen2-audio |
generate, audio |
32768 |
Qwen2-Audio: A large-scale audio-language model which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. |
qwen2-audio-instruct |
chat, audio |
32768 |
Qwen2-Audio: A large-scale audio-language model which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. |
qwen2-instruct |
chat, tools |
32768 |
Qwen2 is the new series of Qwen large language models |
qwen2-moe-instruct |
chat, tools |
32768 |
Qwen2 is the new series of Qwen large language models. |
qwen2-vl-instruct |
chat, vision |
32768 |
Qwen2-VL: To See the World More Clearly.Qwen2-VL is the latest version of the vision language models in the Qwen model familities. |
qwen2.5 |
generate |
32768 |
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. |
qwen2.5-coder |
generate |
32768 |
Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen). |
qwen2.5-coder-instruct |
chat, tools |
32768 |
Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen). |
qwen2.5-instruct |
chat, tools |
32768 |
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. |
qwen2.5-vl-instruct |
chat, vision |
128000 |
Qwen2.5-VL: Qwen2.5-VL is the latest version of the vision language models in the Qwen model familities. |
qwq-32b-preview |
chat |
32768 |
QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities. |
seallm_v2 |
generate |
8192 |
We introduce SeaLLM-7B-v2, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages |
seallm_v2.5 |
generate |
8192 |
We introduce SeaLLM-7B-v2.5, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages |
skywork |
generate |
4096 |
Skywork is a series of large models developed by the Kunlun Group · Skywork team. |
skywork-math |
generate |
4096 |
Skywork is a series of large models developed by the Kunlun Group · Skywork team. |
starling-lm |
chat |
4096 |
We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset |
telechat |
chat |
8192 |
The TeleChat is a large language model developed and trained by China Telecom Artificial Intelligence Technology Co., LTD. The 7B model base is trained with 1.5 trillion Tokens and 3 trillion Tokens and Chinese high-quality corpus. |
tiny-llama |
generate |
2048 |
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. |
wizardcoder-python-v1.0 |
chat |
100000 |
|
wizardmath-v1.0 |
chat |
2048 |
WizardMath is an open-source LLM trained by fine-tuning Llama2 with Evol-Instruct, specializing in math. |
xverse |
generate |
2048 |
XVERSE is a multilingual large language model, independently developed by Shenzhen Yuanxiang Technology. |
xverse-chat |
chat |
2048 |
XVERSEB-Chat is the aligned version of model XVERSE. |
yi |
generate |
4096 |
The Yi series models are large language models trained from scratch by developers at 01.AI. |
yi-1.5 |
generate |
4096 |
Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
yi-1.5-chat |
chat |
4096 |
Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
yi-1.5-chat-16k |
chat |
16384 |
Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. |
yi-200k |
generate |
262144 |
The Yi series models are large language models trained from scratch by developers at 01.AI. |
yi-chat |
chat |
4096 |
The Yi series models are large language models trained from scratch by developers at 01.AI. |
yi-coder |
generate |
131072 |
Yi-Coder is a series of open-source code language models that delivers state-of-the-art coding performance with fewer than 10 billion parameters.Excelling in long-context understanding with a maximum context length of 128K tokens.Supporting 52 major programming languages, including popular ones such as Java, Python, JavaScript, and C++. |
yi-coder-chat |
chat |
131072 |
Yi-Coder is a series of open-source code language models that delivers state-of-the-art coding performance with fewer than 10 billion parameters.Excelling in long-context understanding with a maximum context length of 128K tokens.Supporting 52 major programming languages, including popular ones such as Java, Python, JavaScript, and C++. |
yi-vl-chat |
chat, vision |
4096 |
Yi Vision Language (Yi-VL) model is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images. |
以下是 Xinference 中内置的音频模型列表:
以下是 Xinference 中内置的重排序模型列表:
以下是 Xinference 中内置的视频模型列表: