Project repository: GitHub - HKUDS/VideoRAG: "VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos". Like LightRAG, this knowledge-base project is open-sourced by the University of Hong Kong. For the earlier analysis of the LightRAG code, see: LightRAG System Analysis - CSDN Blog.
This project performs knowledge analysis of videos: vectorized storage, knowledge extraction, knowledge-graph construction, and knowledge recall. Compared with LightRAG, it adds an LLM-driven step that turns the video into textual analysis; after that step, the pipeline differs little from LightRAG. Recognition of the audio and video content relies on the multimodal capability of the large model.
This article analyzes the VideoRAG system from the code's perspective; for the paper itself, refer to the official project.
The overall structure is close to that described in LightRAG System Analysis - CSDN Blog, with the addition of video content segmentation and extraction.
video_segments: a key-value store holding the content recognized from each video; the key is the video name and the value is the recognized content.
video_segment_feature_vdb: a vector store; using the segment name prefix, each stored segment clip is loaded, embedded with imagebind_model.imagebind_huge, and written to the vector store, with the name prefix as the stored id.
chunks_vdb: a vector store holding the chunked video-analysis text; each chunk is vectorized and stored under its id.
text_chunks: a key-value store holding each chunk; the key is the chunk id and the value is the chunk content.
entities_vdb: a vector store holding each extracted entity.
1. Call split_video to segment the video. The output is segment_index2name, a map from segment index to the file-name prefix of that segment (all later audio and video processing uses these prefixes), plus a map from segment index to time range.
(1) Open the video with moviepy, read its duration, and compute the segmentation list from the segment length (30 s by default); the list consists of the start time of each segment.
(2) Using this start-time list, write the audio track of each segment to an audio file in order (mp3 by default); the file name contains the segment index, start time, and end time. A minimal sketch of this loop follows the snippet below.
The following code may be problematic:
total_video_length = int(video.duration)
start_times = list(range(0, total_video_length, segment_length))
# if the last segment is shorter than 5 seconds, we merged it to the last segment
if len(start_times) > 1 and (total_video_length - start_times[-1]) < 5:
    start_times = start_times[:-1]
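A minimal sketch of steps (1)-(2), assuming moviepy 1.x (subclip/write_audiofile) and an illustrative file-name scheme; the project's actual naming and cache layout may differ:
import os
from moviepy.editor import VideoFileClip

def split_video_sketch(video_path, cache_path, segment_length=30):
    video = VideoFileClip(video_path)
    total_video_length = int(video.duration)
    start_times = list(range(0, total_video_length, segment_length))
    # merge a trailing segment shorter than 5 seconds into the previous one
    if len(start_times) > 1 and (total_video_length - start_times[-1]) < 5:
        start_times = start_times[:-1]
    segment_index2name, segment_index2time = {}, {}
    for index, start in enumerate(start_times):
        end = start_times[index + 1] if index + 1 < len(start_times) else total_video_length
        name = f"{index}-{start}-{end}"          # illustrative prefix: index-start-end
        segment_index2name[index] = name
        segment_index2time[index] = (start, end)
        # export only the audio track of this segment as mp3
        video.subclip(start, end).audio.write_audiofile(
            os.path.join(cache_path, f"{name}.mp3"), logger=None)
    return segment_index2name, segment_index2time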
2. Call speech_to_text, which uses the speech model to recognize each segmented audio file and records the recognized text.
Iterate over the file-name prefixes in segment_index2name, append the suffix to locate each file, call model.transcribe on it, and store the recognized text in the transcripts list. A minimal sketch follows.
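A minimal sketch of the transcription loop, assuming an openai-whisper-style model whose transcribe() returns a dict with a "text" field; the project may use a different ASR backend with a different return type:
import os
import whisper  # assumption: openai-whisper; any model exposing transcribe() works similarly

def speech_to_text_sketch(cache_path, segment_index2name, model_name="base"):
    model = whisper.load_model(model_name)
    transcripts = {}
    for index, name in segment_index2name.items():
        audio_file = os.path.join(cache_path, f"{name}.mp3")
        result = model.transcribe(audio_file)
        transcripts[index] = result["text"]
    return transcripts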
3. Call saving_video_segments to cut out the video track: using the time information recorded for segment_index2name, slice the video into segments and write them to mp4 files (the loop mirrors the audio sketch above, writing video instead of audio).
4. Call segment_caption to process the video segments. For each entry in segment_index2name, load the segment, call encode_video to obtain a number of sampled frames as images, then call model.chat to analyze the images (this model must be a multimodal LLM capable of image understanding) and produce a description of the segment. The results are stored in caption_result. A sketch of what encode_video plausibly does follows.
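encode_video is a project helper; the sketch below is an assumption about its behavior: sample frames at the given timestamps with moviepy's get_frame and convert them to PIL images for the multimodal model.
import numpy as np
from PIL import Image

def encode_video_sketch(video, frame_times):
    # video: a moviepy VideoFileClip; frame_times: timestamps in seconds
    frames = []
    for t in frame_times:
        frame = video.get_frame(t)                      # numpy array of shape (H, W, 3)
        frames.append(Image.fromarray(frame.astype(np.uint8)))
    return frames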
5. Call merge_segment_information to merge the audio and video recognition results into segments_information, whose structure is:
inserting_segments[index]["time"] = '-'.join(segment_name.split('-')[-2:])
inserting_segments[index]["content"] = f"Caption:\n{captions[index]}\nTranscript:\n{transcripts[index]}\n\n"
inserting_segments[index]["transcript"] = transcripts[index]
inserting_segments[index]["frame_times"] = segment_times_info[index["frame_times"].tolist()
6. Store segments_information into video_segments.
7. Store the video segment features into video_segment_feature_vdb (see the ImageBind sketch below).
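A minimal sketch of computing the segment features with ImageBind, following the usage documented in the ImageBind repository; how the vectors are written to video_segment_feature_vdb is only shown schematically, since the store's exact API is not reproduced here:
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

def embed_video_segments(segment_paths):
    # segment_paths: one mp4 file per segment, named by the segment prefix
    inputs = {ModalityType.VISION: data.load_and_transform_video_data(segment_paths, device)}
    with torch.no_grad():
        embeddings = embedder(inputs)[ModalityType.VISION]   # (num_segments, dim)
    return embeddings.cpu().numpy()

# schematic upsert: id = segment name prefix, vector = ImageBind feature
# video_segment_feature_vdb.upsert({name: vec for name, vec in zip(segment_names, features)})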
8. Delete the cached content.
9. Call ainsert to analyze the extracted video content: the content is chunked according to the configured token size, then knowledge extraction and knowledge-graph construction are performed; for the details see LightRAG System Analysis - CSDN Blog. A minimal chunking sketch follows.
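A minimal sketch of token-size chunking in the LightRAG style, using tiktoken; the chunk size, overlap, and encoding name are illustrative defaults, not the project's exact configuration:
import tiktoken

def chunking_by_token_size_sketch(content, max_token_size=1200, overlap_token_size=100):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(content)
    chunks = []
    for start in range(0, len(tokens), max_token_size - overlap_token_size):
        chunk_tokens = tokens[start:start + max_token_size]
        chunks.append({
            "tokens": len(chunk_tokens),
            "content": enc.decode(chunk_tokens).strip(),
            "chunk_order_index": len(chunks),
        })
    return chunks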
Recall supports two modes: videorag mode and videorag_multiple_choice mode.
4.2.1 videorag mode
(1) Query chunks_vdb with the question to retrieve the matching chunk ids, then fetch the corresponding content list from text_chunks by id.
(2) Iterate over the content list, summing content lengths, and truncate the list at the maximum token length; the truncated list is maybe_trun_chunks.
(3) Join the maybe_trun_chunks list into a single string, retreived_chunk_context. A truncation sketch follows.
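Steps (2)-(3) amount to keeping retrieved chunks until a token budget is exhausted and joining the survivors; a minimal sketch, with the token counter and the separator string as illustrative assumptions:
def truncate_chunks_by_token_size_sketch(chunks, max_token_size, count_tokens):
    # chunks: list of dicts with a "content" field; count_tokens: callable str -> int
    maybe_trun_chunks, total = [], 0
    for chunk in chunks:
        total += count_tokens(chunk["content"])
        if total > max_token_size:
            break
        maybe_trun_chunks.append(chunk)
    return maybe_trun_chunks

# retreived_chunk_context = "\n--New Chunk--\n".join(c["content"] for c in maybe_trun_chunks)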
(4) Query entities_vdb with the question to retrieve entities; for each retrieved entity, look up the corresponding node in the graph database to obtain the entity information and the relations between entities, and fetch the related text via the chunk ids stored on the node. This yields entity_retrieved_segments.
(5) Query video_segment_feature_vdb with the question to find matching video features, add the ids of the matched features to entity_retrieved_segments to obtain retrieved_segments, and sort retrieved_segments by segment index. A merge-and-sort sketch follows.
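Step (5) merges the segment ids from the entity path with the hits from the video-feature search and sorts them by segment index; a minimal sketch, assuming the id format "<video_name>_<index>" used by the captioning code below:
def merge_retrieved_segments_sketch(entity_retrieved_segments, feature_hits):
    # feature_hits: ids returned by video_segment_feature_vdb for the query
    retrieved = set(entity_retrieved_segments) | set(feature_hits)
    # sort by video name, then by the numeric index at the end of the id
    return sorted(retrieved, key=lambda s: (s.rsplit('_', 1)[0], int(s.rsplit('_', 1)[1])))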
(6) Iterate over retrieved_segments and, using each id (name prefix + index), fetch the corresponding segment content from video_segments, producing the rough_captions list.
(7) Call _filter_single_segment, which uses the LLM to filter rough_captions against the question, producing remain_segments.
(8) Ask the LLM to extract the keywords of the question.
(9) Assemble a prompt from the extracted keywords and the transcripts in remain_segments, with the sampled frames of each video segment as the content, and send it to the LLM. The responses are assembled into the caption_result map: the key is the segment id (file-name prefix), and the value combines the analysis returned by chat with the transcript obtained earlier from the audio segmentation:
import numpy as np
import torch
from tqdm import tqdm
from moviepy.editor import VideoFileClip

def retrieved_segment_caption(caption_model, caption_tokenizer, refine_knowledge, retrieved_segments, video_path_db, video_segments, num_sampled_frames):
    # model = AutoModel.from_pretrained('./MiniCPM-V-2_6-int4', trust_remote_code=True)
    # tokenizer = AutoTokenizer.from_pretrained('./MiniCPM-V-2_6-int4', trust_remote_code=True)
    # model.eval()
    caption_result = {}
    for this_segment in tqdm(retrieved_segments, desc='Captioning Segments for Given Query'):
        # segment id format: "<video_name>_<index>"
        video_name = '_'.join(this_segment.split('_')[:-1])
        index = this_segment.split('_')[-1]
        video_path = video_path_db._data[video_name]
        # "time" is stored as "<start>-<end>" in seconds
        timestamp = video_segments._data[video_name][index]["time"].split('-')
        start, end = eval(timestamp[0]), eval(timestamp[1])
        # uniformly sample num_sampled_frames frames from this segment
        video = VideoFileClip(video_path)
        frame_times = np.linspace(start, end, num_sampled_frames, endpoint=False)
        video_frames = encode_video(video, frame_times)  # project helper: sampled frames as images
        segment_transcript = video_segments._data[video_name][index]["transcript"]
        # query = f"The transcript of the current video:\n{segment_transcript}.\nGiven a question: {query}, you have to extract relevant information from the video and transcript for answering the question."
        query = f"The transcript of the current video:\n{segment_transcript}.\nNow provide a very detailed description (caption) of the video in English and extract relevant information about: {refine_knowledge}'"
        # multimodal chat: the sampled frames plus the text prompt
        msgs = [{'role': 'user', 'content': video_frames + [query]}]
        params = {}
        params["use_image_id"] = False
        params["max_slice_nums"] = 2
        segment_caption = caption_model.chat(
            image=None,
            msgs=msgs,
            tokenizer=caption_tokenizer,
            **params
        )
        this_caption = segment_caption.replace("\n", "").replace("<|endoftext|>", "")
        caption_result[this_segment] = f"Caption:\n{this_caption}\nTranscript:\n{segment_transcript}\n\n"
        torch.cuda.empty_cache()
    return caption_result
(10) From caption_results, assemble text_units_context: split each key to recover the segment and its time-span information; the resulting records have the form "video_name", "start_time", "end_time", "content". Convert them to CSV and assign the result to retreived_video_context. A sketch follows.
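A minimal sketch of step (10), splitting each caption_result key back into video name and segment index, reading the time span from video_segments, and rendering the records as CSV with the stdlib; the project's own CSV helper and time handling may differ:
import csv
import io

def build_video_context_sketch(caption_results, video_segments):
    rows = [["video_name", "start_time", "end_time", "content"]]
    for segment_id, content in caption_results.items():
        video_name, index = segment_id.rsplit('_', 1)
        start, end = video_segments._data[video_name][index]["time"].split('-')
        rows.append([video_name, start, end, content])
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()   # assigned to retreived_video_context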
(11) Assemble the request message for the LLM service:
system prompt: assembled from retreived_video_context and the previously built retreived_chunk_context according to a template. There are two templates, which differ only in the output format. The templates are:
PROMPTS[
"videorag_response"
] = """---Role---
You are a helpful assistant responding to a query with retrieved knowledge.
---Goal---
Generate a response of the target length and format that responds to the user's question with relevant general knowledge.
Summarize useful and relevant information from the retrieved text chunks and the information retrieved from videos, suitable for the specified response length and format.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
---Retrieved Information From Videos---
{video_data}
---Retrieved Text Chunks---
{chunk_data}
---Goal---
Generate a response of the target length and format that responds to the user's question with relevant general knowledge.
Summarize useful and relevant information from the retrieved text chunks and the information retrieved from videos, suitable for the specified response length and format.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.
Do not include information where the supporting evidence for it is not provided.
Reference relevant video segments within the answers, specifying the video name and start & end timestamps. Use the following reference format:
---Example of Reference---
In one segment, the film highlights the devastating effects of deforestation on wildlife habitats [1]. Another part illustrates successful conservation efforts that have helped endangered species recover [2].
#### Reference:
[1] video_name_1, 05:30, 08:00
[2] video_name_2, 25:00, 28:00
---Notice---
Please add sections and commentary as appropriate for the length and format if necessary. Format the response in Markdown.
"""
PROMPTS[
"videorag_response_wo_reference"
] = """---Role---
You are a helpful assistant responding to a query with retrieved knowledge.
---Goal---
Generate a response of the target length and format that responds to the user's question with relevant general knowledge.
Summarize useful and relevant information from the retrieved text chunks and the information retrieved from videos, suitable for the specified response length and format.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
---Retrieved Information From Videos---
{video_data}
---Retrieved Text Chunks---
{chunk_data}
---Goal---
Generate a response of the target length and format that responds to the user's question with relevant general knowledge.
Summarize useful and relevant information from the retrieved text chunks and the information retrieved from videos, suitable for the specified response length and format.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.
Do not include information where the supporting evidence for it is not provided.
---Notice---
Please add sections and commentary as appropriate for the length and format if necessary. Format the response in Markdown.
"""
The user's question is sent as the query.
(12) Send the request to the LLM and obtain the answer. A minimal sketch of the prompt assembly and the call follows.
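A minimal sketch of steps (11)-(12): fill the chosen template with the two retrieved contexts and send the user question as the query. llm_func stands for whatever completion function is configured; its signature and the response_type default are assumptions:
def videorag_answer_sketch(query, retreived_video_context, retreived_chunk_context,
                           llm_func, response_type="Multiple Paragraphs",
                           with_reference=True):
    template = PROMPTS["videorag_response" if with_reference
                       else "videorag_response_wo_reference"]
    sys_prompt = template.format(
        response_type=response_type,
        video_data=retreived_video_context,
        chunk_data=retreived_chunk_context,
    )
    # the LLM receives the assembled system prompt plus the raw user question
    return llm_func(query, system_prompt=sys_prompt)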
4.2.2 videorag_multiple_choice mode
videorag_multiple_choice differs from videorag mainly in the final system-prompt assembly. This prompt targets questions that offer several answer options of which exactly one is correct, and the output is required in JSON format (a parsing sketch follows the template). The videorag_multiple_choice template is:
PROMPTS[
"videorag_response_for_multiple_choice_question"
] = """---Role---
You are a helpful assistant responding to a multiple-choice question with retrieved knowledge.
---Goal---
Generate a concise response that addresses the user's question by summarizing relevant information derived from the retrieved text and video content. Ensure the response aligns with the specified format and length.
Please note that there is only one choice is correct.
---Target response length and format---
{response_type}
---Retrieved Information From Videos---
{video_data}
---Retrieved Text Chunks---
{chunk_data}
---Goal---
Generate a concise response that addresses the user's question by summarizing relevant information derived from the retrieved text and video content. Ensure the response aligns with the specified format and length.
Please note that there is only one choice is correct.
---Notice---
Please provide your answer in JSON format as follows:
{{
"Answer": "The label of the answer, like A/B/C/D or 1/2/3/4 or others, depending on the given query"
"Explanation": "Provide explanations for your choice. Use sections and commentary as needed to ensure clarity and depth. Format the response in Markdown."
}}
Key points:
1. Ensure that the "Answer" reflects the correct label format.
2. Structure the "Explanation" for clarity, using Markdown for any necessary formatting.
"""
(1) The segmentation pipeline first writes every segment to a file and only then runs recognition, which is inefficient. Recognition of the audio or video could instead be invoked while segmenting, with the intermediate results stored and then merged afterwards.
(2) Video recognition also processes every sampled frame, and the video conversion involves extra encoding and decoding, which is inefficient.
(3) All extracted audio/video recognition results are kept in in-memory map tables; with limited memory and a long video, memory may run out, and a crash would lose the recognition results before they are persisted.
(4) The recall process carries a lot of redundant information, and these steps could be optimized further. In addition, the final step depends on the audio track; if a video has no audio, the knowledge base cannot run effectively.