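Building a knowledge graph is a complex, systematic engineering effort that spans knowledge modeling, data acquisition, knowledge extraction, knowledge fusion, knowledge storage, knowledge reasoning, and application and maintenance. The methods and workflow for each stage are described below.
Knowledge modeling is the first step: it fixes the graph's structure and semantics by defining entities, relations, attributes, and their hierarchy.
Data for a knowledge graph comes from many sources. Collect it according to the entity and relation types in the model, and handle its heterogeneity (structured, semi-structured, and unstructured data).
Knowledge extraction is the core step that turns multi-source data into structured knowledge, and each data type calls for different techniques.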
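For unstructured text, one of the simplest extraction techniques is pattern matching. The sketch below is illustrative only: the two sentence patterns and the function name `extract_triples` are assumptions, not part of any library, and real text usually needs a trained model rather than regular expressions.

```python
import re

# Hypothetical surface patterns; each maps a sentence shape to a relation name.
PATTERNS = [
    (re.compile(r"^(?P<movie>.+?) was directed by (?P<person>.+?)\.?$"), "directed_by"),
    (re.compile(r"^(?P<person>.+?) stars in (?P<movie>.+?)\.?$"), "starring"),
]


def extract_triples(sentences):
    """Return (person, relation, movie) triples for sentences that match a pattern."""
    triples = []
    for sentence in sentences:
        for pattern, relation in PATTERNS:
            match = pattern.match(sentence.strip())
            if match:
                triples.append((match["person"], relation, match["movie"]))
                break  # one triple per sentence is enough for this sketch
    return triples


sentences = [
    "Interstellar was directed by Christopher Nolan.",
    "Matthew McConaughey stars in Interstellar.",
    "The visuals are stunning.",  # matches no pattern and is skipped
]
triples = extract_triples(sentences)
for triple in triples:
    print(triple)
```

Sentences that match no pattern are silently dropped, which is why rule-based extraction tends to be precise but has low recall.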
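Knowledge fusion integrates data from multiple sources and resolves entity ambiguity, redundancy, and conflicts to form a single unified graph.
Knowledge graphs are usually stored as a graph structure (nodes, edges, attributes); choose the database according to data volume and query requirements.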
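To make the node-edge-attribute model concrete, here is a minimal in-memory sketch. The class `PropertyGraph` and its methods are hypothetical names chosen for illustration; a graph database provides the same model with persistence, indexing, and a query language.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph: nodes and edges both carry attributes."""

    def __init__(self):
        self.nodes = {}                     # node_id -> {"label": ..., attributes}
        self.out_edges = defaultdict(list)  # source_id -> [(relation, target_id, attributes)]

    def add_node(self, node_id, label, **attributes):
        self.nodes[node_id] = {"label": label, **attributes}

    def add_edge(self, source_id, relation, target_id, **attributes):
        # Refuse dangling edges so the graph stays consistent.
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.out_edges[source_id].append((relation, target_id, attributes))

    def neighbors(self, source_id, relation):
        """Return the targets reachable from source_id over the given relation."""
        return [target for rel, target, _ in self.out_edges[source_id] if rel == relation]


graph = PropertyGraph()
graph.add_node("movie:1", "Movie", title="Interstellar", year=2014)
graph.add_node("genre:1", "Genre", name="Science Fiction")
graph.add_edge("movie:1", "has_genre", "genre:1")
print(graph.neighbors("movie:1", "has_genre"))
```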
Knowledge reasoning infers new relations or entities from existing knowledge, which addresses the incompleteness of the graph.
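A simple form of reasoning is a hand-written rule. The sketch below assumes triples of the form `(actor, "starring", movie)` and applies one rule: two actors who star in the same movie are co-stars. The function name `infer_co_star` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def infer_co_star(triples):
    """Derive (actor, 'co_star', actor) triples from (actor, 'starring', movie) triples."""
    cast_by_movie = defaultdict(set)
    for subject, relation, obj in triples:
        if relation == "starring":
            cast_by_movie[obj].add(subject)
    inferred = set()
    for cast in cast_by_movie.values():
        # Sorting gives each unordered pair a single canonical direction.
        for first, second in combinations(sorted(cast), 2):
            inferred.add((first, "co_star", second))
    return inferred


triples = [
    ("Matthew McConaughey", "starring", "Interstellar"),
    ("Anne Hathaway", "starring", "Interstellar"),
    ("Christian Bale", "starring", "The Dark Knight"),
]
print(infer_co_star(triples))
```

Only the two Interstellar actors are linked here, because the third actor shares no movie with them in this input.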
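A knowledge graph's value lies in its applications; the case study below covers recommendation, question answering, and semantic search.
A knowledge graph needs continuous maintenance to stay current and accurate.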
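One way to keep a graph current is to compare a fresh snapshot of the source data with what is already stored and apply only the difference. This is a minimal sketch under the assumption that facts are held as hashable triples; `diff_triples` and the sample IDs are hypothetical.

```python
def diff_triples(current, incoming):
    """Return the triples to add and to remove so that current matches incoming."""
    current, incoming = set(current), set(incoming)
    return {"add": incoming - current, "remove": current - incoming}


stored = {
    ("movie:123", "rating", 8.6),
    ("movie:123", "has_genre", "genre:drama"),
}
refreshed = {
    ("movie:123", "rating", 8.7),               # the rating changed upstream
    ("movie:123", "has_genre", "genre:drama"),  # unchanged
}
changes = diff_triples(stored, refreshed)
print(changes)
```

Applying a small diff is cheaper than rebuilding the graph, and the diff itself doubles as a change log for auditing.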
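The following case study applies the construction methods above to a **movie knowledge graph**. It walks through the full pipeline from data acquisition to application, with code and tool examples that connect the theory to practice.
The goal is a knowledge graph of movies, actors, directors, genres, and awards that supports movie recommendation, question answering (for example, "Which sci-fi movies did Nolan direct?"), and semantic search.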
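Entity type | Description |
---|---|
Movie | The movie itself, e.g. Interstellar |
Actor | An actor who appears in a movie |
Director | A director who directs a movie |
Genre | A movie genre, e.g. science fiction, mystery |
Award | An award won by a movie, e.g. an Oscar |

Relation type | Direction | Description |
---|---|---|
starring | Movie ← Actor | An actor stars in a movie |
directed_by | Movie ← Director | A director directs a movie |
has_genre | Movie → Genre | A movie belongs to a genre |
won_award | Movie → Award | A movie wins an award |
co_star | Actor ↔ Actor | Two actors appear in the same movie |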
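The two tables above can be encoded directly as a schema and used to reject edges that do not fit the model. This is a minimal sketch: the names `SCHEMA` and `validate_edge` are hypothetical, and each relation is written from source type to target type as the arrows in the table indicate.

```python
# relation name -> (source entity type, target entity type), taken from the table above
SCHEMA = {
    "starring": ("Actor", "Movie"),
    "directed_by": ("Director", "Movie"),
    "has_genre": ("Movie", "Genre"),
    "won_award": ("Movie", "Award"),
    "co_star": ("Actor", "Actor"),
}


def validate_edge(source_type, relation, target_type):
    """Return True if the edge matches the schema's source and target types."""
    return SCHEMA.get(relation) == (source_type, target_type)


print(validate_edge("Actor", "starring", "Movie"))   # fits the schema
print(validate_edge("Movie", "starring", "Actor"))   # wrong direction
print(validate_edge("Movie", "sequel_of", "Movie"))  # relation not in the schema
```

Checking edges at load time keeps modeling mistakes out of the database, where they are harder to find.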
import json

import requests

# TMDB API key (apply for one on the TMDB website)
API_KEY = "your_api_key"
BASE_URL = "https://api.themoviedb.org/3"

def fetch_movies(page=1):
    """Fetch one page of results from TMDB's discover endpoint."""
    response = requests.get(
        f"{BASE_URL}/discover/movie",
        params={"api_key": API_KEY, "page": page},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()["results"]

# Fetch the first 10 pages (about 200 movies)
movies = []
for page in range(1, 11):
    movies += fetch_movies(page)

# Save as a JSON file
with open("movies.json", "w", encoding="utf-8") as f:
    json.dump(movies, f, ensure_ascii=False, indent=2)
Key fields: `title` (movie title), `release_date` (release date), `vote_average` (rating), `genres` (genre list), `cast` (cast list), `crew` (director and other crew).
def process_movie(movie):
    # Assumes each record already carries `genres`, `cast`, and `crew`;
    # missing fields fall back to empty values instead of raising KeyError.
    release_date = movie.get("release_date")
    processed = {
        "movie_id": movie["id"],
        "title": movie["title"],
        "year": int(release_date.split("-")[0]) if release_date else None,
        "rating": movie.get("vote_average"),
        "genres": [g["name"] for g in movie.get("genres", [])],
        "actors": [],      # filled from cast
        "director": None,  # filled from crew
    }
    # Actors
    for cast_member in movie.get("cast", []):
        if cast_member.get("known_for_department") == "Acting":
            processed["actors"].append({
                "actor_id": cast_member["id"],
                "name": cast_member["name"],
                "character": cast_member.get("character"),
            })
    # Director
    for crew_member in movie.get("crew", []):
        if crew_member.get("job") == "Director":
            processed["director"] = {
                "director_id": crew_member["id"],
                "name": crew_member["name"],
            }
    return processed

processed_movies = [process_movie(m) for m in movies]
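Once a record is cleaned, it can be flattened into triples that follow the relation schema. The sketch below assumes the record shape produced by `process_movie` above; the function name `movie_to_triples`, the `type:id` key format, and the sample record are illustrative assumptions.

```python
def movie_to_triples(processed):
    """Flatten one processed movie record into (subject, relation, object) triples."""
    movie_key = f"movie:{processed['movie_id']}"
    triples = []
    for actor in processed.get("actors", []):
        triples.append((f"actor:{actor['actor_id']}", "starring", movie_key))
    director = processed.get("director")
    if director:
        triples.append((f"director:{director['director_id']}", "directed_by", movie_key))
    for genre in processed.get("genres", []):
        triples.append((movie_key, "has_genre", f"genre:{genre}"))
    return triples


# Hypothetical record with made-up IDs, shaped like process_movie's output.
example = {
    "movie_id": 123,
    "title": "Example Movie",
    "actors": [{"actor_id": 456, "name": "Example Actor", "character": "Lead"}],
    "director": {"director_id": 789, "name": "Example Director"},
    "genres": ["Science Fiction"],
}
for triple in movie_to_triples(example):
    print(triple)
```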
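Use the Hugging Face `pipeline` API for text summarization.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def extract_plot_keywords(reviews):
    # Merge the reviews into a single text
    combined_text = " ".join(reviews)
    # Summarize to pull out the key plot information
    summary = summarizer(combined_text, max_length=100, min_length=30, do_sample=False)[0]["summary_text"]
    return summary.split(", ")  # split the summary on commas into rough keyword phrases

# Sample reviews in English, because facebook/bart-large-cnn is an English model
# (real Douban reviews would have to be crawled and need a Chinese-capable model)
sample_reviews = [
    "Interstellar explores time travel and family bonds, and its visuals are stunning",
    "The plot is mind-bending and the acting is solid, especially Matthew McConaughey's performance",
]
plot_keywords = extract_plot_keywords(sample_reviews)
print(plot_keywords)  # a list of phrases; the exact text depends on the model's summary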
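Use the `fuzzywuzzy` library to compute string similarity.
from fuzzywuzzy import fuzz

def is_same_actor(actor1, actor2):
    # Case-insensitive fuzzy match score from 0 to 100
    score = fuzz.ratio(actor1.lower(), actor2.lower())
    return score > 85  # similarity threshold of 85

# String similarity only works within one script. Near-identical spellings pass,
# but TMDB's "Christian Bale" and Douban's "克里斯蒂安·贝尔" share no characters,
# so cross-language alignment needs an alias table or a shared ID instead.
print(is_same_actor("Christian Bale", "Christian  Bale"))  # True
print(is_same_actor("Christian Bale", "克里斯蒂安·贝尔"))  # False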
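Because edit-distance scores cannot match names written in different scripts, cross-language alignment is often done through an alias table that maps every known spelling to one canonical ID. The sketch below is a minimal illustration: the table contents, the ID `actor:456`, and the helper names are all hypothetical.

```python
# Hypothetical alias table: every known spelling points to one canonical entity ID.
ALIASES = {
    "christian bale": "actor:456",
    "克里斯蒂安·贝尔": "actor:456",
}


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def resolve(name):
    """Return the canonical ID for a name, or None if the name is unknown."""
    return ALIASES.get(normalize(name))


def same_entity(name_a, name_b):
    """Two names refer to the same entity only if both resolve to the same ID."""
    id_a, id_b = resolve(name_a), resolve(name_b)
    return id_a is not None and id_a == id_b


print(same_entity("Christian Bale", "克里斯蒂安·贝尔"))
print(same_entity("Christian Bale", "Unknown Person"))
```

In practice such a table is usually populated from a source that already links names across languages, with fuzzy matching kept as a fallback within a single script.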
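Unique entity IDs (e.g. `movie:123`, `actor:456`) link records across data sources.
Neo4j Browser is available at http://localhost:7474; the default username/password is neo4j/neo4j (the password must be changed on first login).
Method 1: Batch insert with Cypher statements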
from neo4j import GraphDatabase

uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "your_password"))

def create_movie_node(tx, movie):
    # MERGE on movie_id so that re-running the import does not duplicate nodes
    tx.run(
        "MERGE (m:Movie {movie_id: $movie_id}) "
        "SET m.title = $title, m.year = $year, m.rating = $rating",
        movie_id=movie["movie_id"],
        title=movie["title"],
        year=movie["year"],
        rating=movie["rating"],
    )

with driver.session() as session:
    for m in processed_movies:
        # execute_write is the driver 5.x name; on 4.x use write_transaction
        session.execute_write(create_movie_node, m)
Method 2: Use Neo4j's LOAD CSV
Prepare CSV files (e.g. `movies.csv`, `actors.csv`).
LOAD CSV WITH HEADERS FROM "file:///movies.csv" AS row
CREATE (m:Movie {movie_id: toInteger(row.movie_id), title: row.title, year: toInteger(row.year), rating: toFloat(row.rating)});
// Create a "starring" relationship
MATCH (a:Actor {actor_id: 123}), (m:Movie {movie_id: 456})
CREATE (a)-[:starring]->(m);
// Create a "directed_by" relationship
MATCH (d:Director {director_id: 789}), (m:Movie {movie_id: 456})
CREATE (d)-[:directed_by]->(m);
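The CSV file that the `LOAD CSV` statement reads needs a header row whose column names match the `row.` references in the query. A minimal sketch with Python's standard `csv` module follows; the helper name `movies_to_csv` and the sample row are hypothetical, and the example writes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Column names must match row.movie_id, row.title, row.year, row.rating in the query.
FIELDS = ["movie_id", "title", "year", "rating"]


def movies_to_csv(processed_movies, handle):
    """Write the scalar movie fields as CSV; nested fields such as genres are ignored."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for movie in processed_movies:
        writer.writerow(movie)


buffer = io.StringIO()
sample = [{"movie_id": 456, "title": "Example Movie", "year": 2014, "rating": 8.4, "genres": ["Drama"]}]
movies_to_csv(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, and place the file where the Neo4j server is allowed to read it.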
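Derive `co_star` relationships from shared movies.
// Create co-star relationships in bulk
MATCH (a1:Actor)-[:starring]->(:Movie)<-[:starring]-(a2:Actor)
WHERE a1.actor_id < a2.actor_id  // visit each unordered pair once
WITH DISTINCT a1, a2             // collapse pairs that share several movies
MERGE (a1)-[:co_star]-(a2)       // CREATE would require a direction; MERGE accepts an undirected pattern
SET a1.co_star_count = coalesce(a1.co_star_count, 0) + 1,
    a2.co_star_count = coalesce(a2.co_star_count, 0) + 1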
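Step 1: Parse the user's question
Use `spaCy` to identify entities and relations:
import spacy

nlp = spacy.load("en_core_web_sm")  # English model, so the question must be in English

# Keyword-to-genre lookup; the genre names are assumed to match Genre.name in the graph
GENRE_KEYWORDS = {"sci-fi": "Science Fiction", "science fiction": "Science Fiction"}

def parse_question(question):
    doc = nlp(question)
    # Named entities: people (directors, actors) and titles of works
    entities = [ent.text for ent in doc.ents if ent.label_ in ["PERSON", "WORK_OF_ART"]]
    relations = []
    # Any form of the verb "direct" signals the directed_by relation
    if any(token.lemma_ == "direct" for token in doc):
        relations.append("directed_by")
    # A known genre keyword signals has_genre; the genre is appended as an entity
    lowered = question.lower()
    for keyword, genre in GENRE_KEYWORDS.items():
        if keyword in lowered:
            entities.append(genre)
            relations.append("has_genre")
            break
    return {"entities": entities, "relations": relations}

question = "Which sci-fi movies did Christopher Nolan direct?"
parsed = parse_question(question)
print(parsed)  # ideally the director's name plus "Science Fiction"; recognition depends on the model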
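Step 2: Generate the Cypher query
def generate_cypher(entities, relations):
    # Assumes the first entity is the director and the second is the genre
    director_name = entities[0]
    genre = entities[1]
    # Pass values as query parameters instead of formatting them into the
    # string, so names containing quotes cannot break or alter the query
    query = """
    MATCH (d:Director {name: $director_name})-[:directed_by]->(m:Movie)-[:has_genre]->(g:Genre {name: $genre})
    RETURN m.title, m.year, m.rating
    """
    return query, {"director_name": director_name, "genre": genre}

cypher_query, cypher_params = generate_cypher(parsed["entities"], parsed["relations"])
print(cypher_query)
Step 3: Run the query and return the results
with driver.session() as session:
    result = session.run(cypher_query, cypher_params)
    titles = [record["m.title"] for record in result]
print(f"{cypher_params['genre']} movies directed by {cypher_params['director_name']}: {', '.join(titles)}")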
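Stage | Tool/library | Purpose |
---|---|---|
Data acquisition | requests | Fetch data from the TMDB API |
Data cleaning | pandas | Process JSON data and extract key fields |
Knowledge extraction | Hugging Face Transformers | Text summarization, named entity recognition |
Knowledge fusion | fuzzywuzzy | Entity name similarity matching |
Knowledge storage | Neo4j + Cypher | Graph database storage and querying |
Application development | Flask + spaCy | Build the question-answering API |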
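Scaling options: create indexes (e.g. on `movie.title` and `actor.name`), or migrate to a distributed graph database (such as Dgraph).
The following is a complete solution for building the movie knowledge graph.
Before running it, install the dependencies (`requests`, `neo4j`, `transformers`, `fuzzywuzzy`, `spacy`, `tqdm`) and download the spaCy model with `python -m spacy download en_core_web_sm`.
The code is designed to be extensible: more entity types, relations, and features such as sentiment analysis or multimodal processing can be added as needed.
Full code: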
import json
import os

import requests
import spacy
from neo4j import GraphDatabase
from tqdm import tqdm
from transformers import pipeline


class MovieKnowledgeGraphBuilder:
    def __init__(self, tmdb_api_key, neo4j_uri, neo4j_user, neo4j_password):
        """Initialize the movie knowledge graph builder."""
        self.tmdb_api_key = tmdb_api_key
        self.neo4j_driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_password))
        self.nlp = spacy.load("en_core_web_sm")  # used for question parsing
        self.summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    def fetch_movies_from_tmdb(self, num_pages=10, save_path="movies.json"):
        """Fetch movie data from the TMDB API, or load it from a cached file."""
        if os.path.exists(save_path):
            print(f"Data already exists, loading from file: {save_path}")
            with open(save_path, "r", encoding="utf-8") as f:
                return json.load(f)
        movies = []
        for page in tqdm(range(1, num_pages + 1), desc="Fetching movies"):
            url = f"https://api.themoviedb.org/3/discover/movie?api_key={self.tmdb_api_key}&page={page}"
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                page_data = response.json()
                movies.extend(page_data.get("results", []))
            else:
                print(f"Request failed, status code: {response.status_code}")
        with open(save_path, "w", encoding="utf-8") as f:
            json.dump(movies, f, ensure_ascii=False, indent=2)
        return movies
    def process_movie_data(self, movies):
        """Process movie data and extract entities and relations."""
        processed_movies = []
        actors_dict = {}
        directors_dict = {}
        genres_dict = {}
        for movie in tqdm(movies, desc="Processing movies"):
            # Basic movie information
            processed = {
                "movie_id": movie["id"],
                "title": movie["title"],
                "year": int(movie["release_date"].split("-")[0]) if movie.get("release_date") else None,
                "rating": movie["vote_average"],
                "overview": movie.get("overview", ""),
                "genres": [],
                "actors": [],
            }
            # Fetch the detail record, which holds genre names and the credits (cast and crew)
            movie_details = self._get_movie_details(movie["id"])
            credits = movie_details.get("credits", {})
            # Genres: discover results carry only genre IDs, so read the names from the details
            for genre in movie_details.get("genres", []):
                genres_dict[genre["id"]] = genre["name"]
                processed["genres"].append(genre["id"])
            # Actors
            for cast_member in credits.get("cast", [])[:10]:  # first 10 billed cast members
                actor_id = cast_member["id"]
                actors_dict[actor_id] = {
                    "name": cast_member["name"],
                    "popularity": cast_member.get("popularity", 0),
                    "gender": cast_member.get("gender", 0),
                }
                processed["actors"].append({
                    "actor_id": actor_id,
                    # MERGE cannot match on a null property, so default to an empty string
                    "character": cast_member.get("character") or "",
                })
            # Director
            for crew_member in credits.get("crew", []):
                if crew_member.get("job") == "Director":
                    director_id = crew_member["id"]
                    directors_dict[director_id] = {
                        "name": crew_member["name"],
                        "popularity": crew_member.get("popularity", 0),
                    }
                    processed["director"] = director_id
                    break  # keep only the first listed director
            processed_movies.append(processed)
        return {
            "movies": processed_movies,
            "actors": actors_dict,
            "directors": directors_dict,
            "genres": genres_dict,
        }

    def _get_movie_details(self, movie_id):
        """Fetch movie details; cast and crew are nested under the "credits" key."""
        url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={self.tmdb_api_key}&append_to_response=credits"
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.json()
        return {}
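    def extract_plot_keywords(self, overview):
        """Extract rough keyword phrases from a movie overview."""
        if not overview:
            return []
        # Summarize with the BART model
        try:
            summary = self.summarizer(overview, max_length=30, min_length=10, do_sample=False)[0]["summary_text"]
            # Split on commas as a crude keyword segmentation
            return [word.strip() for word in summary.split(",") if word.strip()]
        except Exception:
            # Summarization is best-effort; a failure must not stop the import
            return []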
    def build_knowledge_graph(self, data):
        """Import the processed data into the Neo4j graph database."""
        with self.neo4j_driver.session() as session:
            # Constraints that guarantee uniqueness
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (m:Movie) REQUIRE m.movie_id IS UNIQUE")
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (a:Actor) REQUIRE a.actor_id IS UNIQUE")
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (d:Director) REQUIRE d.director_id IS UNIQUE")
            session.run("CREATE CONSTRAINT IF NOT EXISTS FOR (g:Genre) REQUIRE g.genre_id IS UNIQUE")
            # Movie nodes
            for movie in tqdm(data["movies"], desc="Importing movie nodes"):
                keywords = self.extract_plot_keywords(movie["overview"])
                session.run(
                    """
                    MERGE (m:Movie {movie_id: $movie_id})
                    SET m.title = $title, m.year = $year, m.rating = $rating,
                        m.overview = $overview, m.keywords = $keywords
                    """,
                    movie_id=movie["movie_id"],
                    title=movie["title"],
                    year=movie["year"],
                    rating=movie["rating"],
                    overview=movie["overview"],
                    keywords=keywords,
                )
            # Actor nodes
            for actor_id, actor in tqdm(data["actors"].items(), desc="Importing actor nodes"):
                session.run(
                    """
                    MERGE (a:Actor {actor_id: $actor_id})
                    SET a.name = $name, a.popularity = $popularity, a.gender = $gender
                    """,
                    actor_id=actor_id,
                    name=actor["name"],
                    popularity=actor["popularity"],
                    gender=actor["gender"],
                )
            # Director nodes
            for director_id, director in tqdm(data["directors"].items(), desc="Importing director nodes"):
                session.run(
                    """
                    MERGE (d:Director {director_id: $director_id})
                    SET d.name = $name, d.popularity = $popularity
                    """,
                    director_id=director_id,
                    name=director["name"],
                    popularity=director["popularity"],
                )
            # Genre nodes
            for genre_id, genre_name in tqdm(data["genres"].items(), desc="Importing genre nodes"):
                session.run(
                    """
                    MERGE (g:Genre {genre_id: $genre_id})
                    SET g.name = $name
                    """,
                    genre_id=genre_id,
                    name=genre_name,
                )
            # Movie-actor relationships
            for movie in tqdm(data["movies"], desc="Creating movie-actor relationships"):
                for actor_info in movie["actors"]:
                    session.run(
                        """
                        MATCH (m:Movie {movie_id: $movie_id})
                        MATCH (a:Actor {actor_id: $actor_id})
                        MERGE (a)-[r:ACTED_IN {character: $character}]->(m)
                        """,
                        movie_id=movie["movie_id"],
                        actor_id=actor_info["actor_id"],
                        character=actor_info["character"],
                    )
            # Movie-director relationships
            for movie in tqdm(data["movies"], desc="Creating movie-director relationships"):
                if movie.get("director"):
                    session.run(
                        """
                        MATCH (m:Movie {movie_id: $movie_id})
                        MATCH (d:Director {director_id: $director_id})
                        MERGE (d)-[r:DIRECTED]->(m)
                        """,
                        movie_id=movie["movie_id"],
                        director_id=movie["director"],
                    )
            # Movie-genre relationships
            for movie in tqdm(data["movies"], desc="Creating movie-genre relationships"):
                for genre_id in movie["genres"]:
                    session.run(
                        """
                        MATCH (m:Movie {movie_id: $movie_id})
                        MATCH (g:Genre {genre_id: $genre_id})
                        MERGE (m)-[r:BELONGS_TO]->(g)
                        """,
                        movie_id=movie["movie_id"],
                        genre_id=genre_id,
                    )
            # Actor-actor collaboration relationships
            print("Creating co-star relationships...")
            session.run(
                """
                MATCH (a1:Actor)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a2:Actor)
                WHERE a1.actor_id < a2.actor_id
                WITH a1, a2, count(DISTINCT m) AS movie_count
                MERGE (a1)-[r:CO_STARRED_WITH]-(a2)
                SET r.movie_count = movie_count
                """
            )
    def answer_question(self, question):
        """Answer a question from the knowledge graph using simple templates.

        Expects English questions, because entity recognition uses en_core_web_sm.
        """
        doc = self.nlp(question)
        lowered = question.lower()
        # Person names found by spaCy; recognition quality depends on the model
        people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

        # Simple keyword rules pick the query template. The co-star rule comes
        # first because "co-star" also contains "star".
        if ("co-star" in lowered or "together" in lowered) and len(people) >= 2:
            # Example: "Which movies did Leonardo DiCaprio and Kate Winslet co-star in?"
            query = """
                MATCH (a1:Actor {name: $name1})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a2:Actor {name: $name2})
                RETURN m.title AS title, m.year AS year, m.rating AS rating
                ORDER BY m.year DESC
            """
            params = {"name1": people[0], "name2": people[1]}
            subject = f"with {people[0]} and {people[1]} together"
        elif "direct" in lowered and people:
            # Example: "Which movies did Christopher Nolan direct?"
            query = """
                MATCH (d:Director {name: $name})-[:DIRECTED]->(m:Movie)
                RETURN m.title AS title, m.year AS year, m.rating AS rating
                ORDER BY m.rating DESC
            """
            params = {"name": people[0]}
            subject = f"directed by {people[0]}"
        elif ("star" in lowered or "act" in lowered) and people:
            # Example: "Which movies did Leonardo DiCaprio star in?"
            query = """
                MATCH (a:Actor {name: $name})-[:ACTED_IN]->(m:Movie)
                RETURN m.title AS title, m.year AS year, m.rating AS rating
                ORDER BY m.rating DESC
            """
            params = {"name": people[0]}
            subject = f"starring {people[0]}"
        else:
            # Example: "Which Science Fiction movies are there?"
            genre_name = self._match_genre(lowered)
            if genre_name is None:
                return "Sorry, I cannot answer this kind of question yet."
            query = """
                MATCH (m:Movie)-[:BELONGS_TO]->(g:Genre {name: $name})
                RETURN m.title AS title, m.year AS year, m.rating AS rating
                ORDER BY m.rating DESC
                LIMIT 10
            """
            params = {"name": genre_name}
            subject = f"in the {genre_name} genre"

        with self.neo4j_driver.session() as session:
            result = session.run(query, params)
            movies = [f"{r['title']} ({r['year']}, rating: {r['rating']})" for r in result]
        if movies:
            return f"Movies {subject}:\n" + "\n".join(movies)
        return f"Sorry, I found no movies {subject}."

    def _match_genre(self, lowered_question):
        """Return the first Genre name in the graph that appears in the question, if any."""
        with self.neo4j_driver.session() as session:
            names = [record["name"] for record in session.run("MATCH (g:Genre) RETURN g.name AS name")]
        return next((name for name in names if name.lower() in lowered_question), None)
    def recommend_movies(self, movie_title, limit=5):
        """Recommend similar movies based on shared genres in the knowledge graph."""
        query = """
        MATCH (m:Movie {title: $movie_title})-[:BELONGS_TO]->(g:Genre)<-[:BELONGS_TO]-(rec:Movie)
        WHERE rec.title <> $movie_title
        WITH rec, COUNT(g) AS genre_overlap, COLLECT(g.name) AS genres
        ORDER BY genre_overlap DESC, rec.rating DESC
        LIMIT $limit
        RETURN rec.title AS title, rec.year AS year, rec.rating AS rating, genres
        """
        with self.neo4j_driver.session() as session:
            result = session.run(query, movie_title=movie_title, limit=limit)
            recommendations = [
                {
                    "title": record["title"],
                    "year": record["year"],
                    "rating": record["rating"],
                    "genres": record["genres"],
                }
                for record in result
            ]
        if recommendations:
            print(f"Recommendations based on {movie_title}:")
            for i, rec in enumerate(recommendations, 1):
                print(f"{i}. {rec['title']} ({rec['year']}) - rating: {rec['rating']}")
                print(f"   Genres: {', '.join(rec['genres'])}")
        else:
            print(f"Sorry, no movies similar to {movie_title} were found.")
        return recommendations

    def close(self):
        """Close the Neo4j driver connection."""
        self.neo4j_driver.close()
# Usage example
if __name__ == "__main__":
    # Configuration (replace with your own values)
    TMDB_API_KEY = "your_tmdb_api_key"
    NEO4J_URI = "bolt://localhost:7687"
    NEO4J_USER = "neo4j"
    NEO4J_PASSWORD = "your_neo4j_password"
    # Initialize the builder
    builder = MovieKnowledgeGraphBuilder(TMDB_API_KEY, NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD)
    # 1. Fetch data
    movies = builder.fetch_movies_from_tmdb(num_pages=5)
    # 2. Process data
    processed_data = builder.process_movie_data(movies)
    # 3. Build the knowledge graph
    builder.build_knowledge_graph(processed_data)
    # 4. Question-answering examples (in English, to match the en_core_web_sm model)
    questions = [
        "Which movies did Christopher Nolan direct?",
        "Which movies did Leonardo DiCaprio star in?",
        "Which Science Fiction movies are there?",
        "Which movies did Leonardo DiCaprio and Kate Winslet co-star in?",
    ]
    for question in questions:
        answer = builder.answer_question(question)
        print(f"\nQuestion: {question}")
        print(f"Answer: {answer}")
    # 5. Recommendation example
    builder.recommend_movies("Inception", limit=3)
    # Close the connection
    builder.close()
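Building a knowledge graph is an iterative process of refinement: choose the technical approach that fits the domain, then keep adjusting the modeling logic, improving the extraction algorithms, and raising data quality in practice. As AI technology develops, knowledge graphs will integrate more deeply with machine learning, multimodal processing, and edge computing, and become core infrastructure for intelligent applications.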