Original article: LLaVa
LLaVa is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. In other words, it is a multi-modal version of LLMs fine-tuned for chat / instructions.
The LLaVa model was proposed in Visual Instruction Tuning and improved in Improved Baselines with Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee. The abstract from the paper is the following:
Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ∼1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
LLaVa architecture. Taken from the original paper.
This model was contributed by ArthurZ and ybelkada. The original code can be found here.
We advise users to use padding_side="left" when computing batched generation as it leads to more accurate results. Simply make sure to call processor.tokenizer.padding_side = "left" before generating.
Note that the model has not been explicitly trained to process multiple images in the same prompt; although this is technically possible, you may experience inaccurate results.
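A minimal sketch of the padding setup described above (the checkpoint name is only an example):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
# Left padding keeps generated tokens adjacent to each prompt in a batch,
# which is what decoder-only generation expects.
processor.tokenizer.padding_side = "left"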
[!NOTE] LLaVA models after release v4.46 will raise warnings about adding processor.patch_size = {{patch_size}}, processor.num_additional_image_tokens = {{num_additional_image_tokens}} and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if it is not owned by you. Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many <image> placeholders as there will be tokens. The attributes can be obtained from the model config, as model.config.vision_config.patch_size or model.config.vision_feature_select_strategy. The num_additional_image_tokens should be 1 if the vision backbone adds a CLS token, or 0 if nothing extra is added to the vision patches.
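For example, if you maintain a llava-1.5 style checkpoint, the attributes could be set roughly as follows. The concrete values are assumptions for a CLIP-ViT-L/14 vision tower that adds a CLS token and uses the default feature selection strategy; in practice, read them from your own model config rather than hard-coding them.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
# Assumed values for a CLIP-ViT-L/14 backbone with a prepended CLS token;
# take patch_size and the selection strategy from model.config in practice.
processor.patch_size = 14
processor.num_additional_image_tokens = 1
processor.vision_feature_select_strategy = "default"
# Persist the attributes if you own the checkpoint repository:
# processor.save_pretrained("path/to/local/checkpoint")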
Each checkpoint is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's apply_chat_template method.
Important: You must construct a conversation history; passing a plain string won't work. Each message should be a dictionary with "role" and "content" keys. The "content" should be a list of dictionaries for different modalities like "text" and "image".
Here's an example of how to structure your input. We will use llava-hf/llava-1.5-7b-hf and a conversation history of text and image. Each content field has to be a list of dicts, as follows:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What’s shown in this image?"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "This image shows a red stop sign."},]
},
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image in more details."},
],
},
]
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your images
print(text_prompt)
>>>"USER: \nUSER: Describe the image in more details. ASSISTANT:"
If you want to construct a chat prompt yourself, below is a list of prompt formats accepted by each llava checkpoint:
llava-interleave models require the following format:
"<|im_start|>user \nWhat is shown in this image?<|im_end|><|im_start|>assistant"
For multi-turn conversations:
"<|im_start|>user \n<|im_end|><|im_start|>assistant <|im_end|><|im_start|>user \n<|im_end|><|im_start|>assistant "
llava-1.5 models require the following format:
"USER: \n ASSISTANT:"
For multi-turn conversations:
"USER: \n ASSISTANT: USER: ASSISTANT: USER: ASSISTANT:"
Bonus: If you’re using transformers>=4.49.0, you can also get a vectorized output from apply_chat_template. See the Usage Examples below for more details on how to use it.
Single image inference:
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Load the model in half-precision
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
conversation = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device, torch.float16)
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True)
LLaVa also supports batched inference. Here is how to do it:
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Load the model in half-precision
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
# Prepare a batch of two prompts
conversation_1 = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
conversation_2 = [
{
"role": "user",
"content": [
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
inputs = processor.apply_chat_template(
[conversation_1, conversation_2],
add_generation_prompt=True,
tokenize=True,
return_dict=True,
padding=True,
return_tensors="pt"
).to(model.device, torch.float16)
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True)
In order to match the logits of the original implementation, one needs to additionally specify do_pad=True when instantiating LlavaImageProcessor:
from transformers import LlavaImageProcessor
image_processor = LlavaImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", do_pad=True)
Flash Attention 2 is an even faster, optimized attention implementation; please refer to the Flash Attention 2 section of the performance docs.
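For example, assuming flash-attn is installed and you are running on a supported GPU, the attention implementation can be selected when loading the model:

import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,  # Flash Attention 2 requires fp16 or bf16 weights
    attn_implementation="flash_attention_2",
    device_map="auto",
)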
A list of official Hugging Face and community resources to help you get started with LLaVa.
class transformers.LlavaConfig
( vision_config = None, text_config = None, image_token_index = 32000, projector_hidden_act = 'gelu', vision_feature_select_strategy = 'default', vision_feature_layer = -2, image_seq_length = 576, multimodal_projector_bias = True, **kwargs )
Parameters:
vision_config (Union[AutoConfig, dict], optional, defaults to CLIPVisionConfig) — The config object or dictionary of the vision backbone.
text_config (Union[AutoConfig, dict], optional, defaults to LlamaConfig) — The config object or dictionary of the text backbone.
image_token_index (int, optional, defaults to 32000) — The image token index to encode the image prompt.
projector_hidden_act (str, optional, defaults to "gelu") — The activation function used by the multimodal projector.
vision_feature_select_strategy (str, optional, defaults to "default") — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
vision_feature_layer (Union[int, List[int]], optional, defaults to -2) — The index of the layer to select the vision feature. If multiple indices are provided, the vision features of the corresponding indices will be concatenated to form the vision features.
image_seq_length (int, optional, defaults to 576) — Sequence length of one image embedding.
multimodal_projector_bias (bool, optional, defaults to True) — Whether to use bias in the multimodal projector.
This is the configuration class to store the configuration of a LlavaForConditionalGeneration. It is used to instantiate a Llava model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Llava-9B.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
from transformers import LlavaForConditionalGeneration, LlavaConfig, CLIPVisionConfig, LlamaConfig
# Initializing a CLIP-vision config
vision_config = CLIPVisionConfig()
# Initializing a Llama config
text_config = LlamaConfig()
# Initializing a Llava llava-1.5-7b style configuration
configuration = LlavaConfig(vision_config, text_config)
# Initializing a model from the llava-1.5-7b style configuration
model = LlavaForConditionalGeneration(configuration)
# Accessing the model configuration
configuration = model.config
Constructs a LLaVa image processor.
class transformers.LlavaImageProcessor
( do_pad: bool = False, do_resize: bool = True, size: typing.Optional[typing.Dict[str, int]] = None, resample: Resampling = Resampling.BICUBIC, do_center_crop: bool = True, crop_size: typing.Optional[typing.Dict[str, int]] = None, do_rescale: bool = True, rescale_factor: typing.Union[int, float] = 0.00392156862745098, do_normalize: bool = True, image_mean: typing.Union[float, typing.List[float], NoneType] = None, image_std: typing.Union[float, typing.List[float], NoneType] = None, do_convert_rgb: bool = True, **kwargs )
Parameters:
do_pad (bool, optional, defaults to False) — Whether to pad the image to a square based on the longest edge. The padding value is determined by the image_mean parameter. Can be overridden by do_pad in the preprocess method.
do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.
size (Dict[str, int], optional, defaults to {"shortest_edge": 224}) — Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by size in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BICUBIC) — Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.
do_center_crop (bool, optional, defaults to True) — Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.
crop_size (Dict[str, int], optional, defaults to 224) — Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.
image_mean (float or List[float], optional, defaults to [0.48145466, 0.4578275, 0.40821073]) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to [0.26862954, 0.26130258, 0.27577711]) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.
class transformers.LlavaProcessor
( image_processor = None, tokenizer = None, patch_size = None, vision_feature_select_strategy = None, chat_template = None, image_token = '<image>', num_additional_image_tokens = 0, **kwargs )
Constructs a LLaVa processor which wraps a LLaVa image processor and a LLaMa tokenizer into a single processor.
Parameters:
image_processor (LlavaImageProcessor, optional) — The image processor is a required input.
tokenizer (LlamaTokenizerFast, optional) — The tokenizer is a required input.
patch_size (int, optional) — Patch size from the vision tower.
vision_feature_select_strategy (str, optional) — The feature selection strategy used to select the vision feature from the vision backbone. Should be the same as in the model's config.
chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
LlavaProcessor offers all the functionalities of LlavaImageProcessor and LlamaTokenizerFast. See the __call__() and decode() for more information.
batch_decode( *args, **kwargs )
This method forwards all its arguments to LlamaTokenizerFast's batch_decode(). Please refer to the docstring of this method for more information.
decode( *args, **kwargs )
This method forwards all its arguments to LlamaTokenizerFast's decode(). Please refer to the docstring of this method for more information.
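A minimal sketch of the round trip through the processor: __call__ produces both the text and image tensors, while decode()/batch_decode() forward to the underlying LLaMA tokenizer (the checkpoint name is only an example):

import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(images=image, text="USER: <image>\nDescribe this image. ASSISTANT:", return_tensors="pt")
print(inputs.keys())  # input_ids, attention_mask and pixel_values
# decode() is forwarded to the tokenizer; the expanded image placeholder tokens appear as <image> in the text.
print(processor.decode(inputs["input_ids"][0]))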
The Llava model which consists of a vision backbone and a language model, without a language modeling head.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
class transformers.LlavaModel
( config: LlavaConfig )
The forward method:
( input_ids: LongTensor = None, pixel_values: FloatTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, vision_feature_layer: typing.Union[int, typing.List[int], NoneType] = None, vision_feature_select_strategy: typing.Optional[str] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, image_sizes: Tensor = None, **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.models.llava.modeling_llava.LlavaModelOutputWithPast or tuple(torch.FloatTensor)
Parameters:
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using {image_processor_class}. See {image_processor_class}.__call__ for details ({processor_class} uses {image_processor_class} for processing images).
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?
…
[TO DO]
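A rough sketch of calling the headless LlavaModel to obtain multimodal hidden states. This assumes a llava-1.5 checkpoint; the language-modeling head weights of that checkpoint are simply not loaded into LlavaModel:

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaModel

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaModel.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
image = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
with torch.no_grad():
    outputs = model(**inputs)
# Final-layer hidden states over the merged text and image sequence (no LM head applied).
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)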