Word Frequency Analysis and Visualization in Python

Goal:

By counting how often each word appears in a text, we can identify its keywords and get a sense of its core content.

Approach:

  1. Count word frequencies: count how many times each word appears in the text. Common measures are TF (term frequency) and TF-IDF (term frequency-inverse document frequency).
  2. TF: how often a word occurs within a document.
  3. TF-IDF: weighs the term frequency by how rarely the word appears in other documents, reducing the influence of common words (a worked sketch follows this list).
  4. Visualization: use a word cloud or bar chart to display the high-frequency words and make the keywords easy to see at a glance.
  5. Word cloud: shows the higher-frequency words, with font size proportional to frequency.
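
To make the difference concrete, here is a minimal hand-rolled sketch of the two measures. The toy corpus and helper functions below are made up for illustration; scikit-learn's TfidfVectorizer uses a smoothed, normalized variant, so its scores will differ:

import math

# Toy corpus: two tiny tokenized "documents" (made up for illustration)
docs = [
    ["natural", "language", "processing", "language"],
    ["machine", "learning", "and", "human", "language"],
]

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log(total docs / docs containing the term)
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / containing)

for term in ("processing", "language"):
    print(f"{term}: TF-IDF = {tf(term, docs[0]) * idf(term, docs):.3f}")

# "language" appears in both documents, so its IDF is log(2/2) = 0 and its
# TF-IDF drops to 0; "processing" appears only in the first document, so it
# scores higher even though it occurs less often.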

Steps

1. Import dependencies (pip install nltk wordcloud matplotlib scikit-learn jieba); the matching import block is shown after the list below.

  • nltk: natural language processing toolkit, used for tokenization and stop-word handling.
  • wordcloud / matplotlib: generate the word cloud and draw the charts.
  • sklearn's TfidfVectorizer: computes TF-IDF scores.
  • jieba: Chinese word segmentation library.
  • string: handles punctuation.
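
For reference, the import block used throughout this article (essentially the same as the one at the top of the complete code at the end) is:

import string                 # punctuation handling
from collections import Counter

import jieba                  # Chinese word segmentation
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud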

2. Initialize NLTK data

  • Point NLTK's data search path at C:/nltk_data.
  • Download the required NLTK data (the punkt tokenizer and the stopwords list must be downloaded on the first run); a minimal initialization snippet is shown below.
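
The initialization itself is just a few lines (the same code appears in the complete listing at the end):

import nltk

# Make NLTK search C:/nltk_data for its data files
nltk.data.path.append('C:/nltk_data')

# First run only: download the punkt tokenizer and the stop-word lists
nltk.download('punkt', download_dir='C:/nltk_data')
nltk.download('stopwords', download_dir='C:/nltk_data')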

3. Text preprocessing function preprocess_text

  • Input parameters: the raw text and a language type (English by default).
  • Processing pipeline
    • Remove punctuation: strip all punctuation with string.punctuation.
    • Language-specific branches
      • Chinese: tokenize with jieba and load the Chinese stop words from the external file chinese_stopwords.txt.
      • English: lowercase the text, tokenize with NLTK's word_tokenize, and load NLTK's English stop words.
    • Filtering: drop stop words, digits, and words shorter than 2 characters.
def preprocess_text(text, language='english'):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    if language == 'chinese':
        # Chinese word segmentation
        words = jieba.cut(text)
        # Load the Chinese stop-word list
        stop_words = set(open('chinese_stopwords.txt', encoding='utf-8').read().split())
    else:
        # English processing
        text = text.lower()
        words = word_tokenize(text)
        stop_words = set(stopwords.words('english'))

    # Remove stop words and digits
    filtered_words = [word for word in words if
                      word not in stop_words
                      and not word.isdigit()
                      and len(word) > 1]
    return filtered_words

4. Sample text analysis

  • Sample text: a short English passage about natural language processing (NLP).
  • Word frequency (TF)
    • After preprocessing with preprocess_text, count word frequencies with Counter.
    • Print the 10 most frequent words (e.g. language, computers).
sample_text = """
Natural language processing (NLP) is a subfield of linguistics, computer science, 
and artificial intelligence concerned with the interactions between computers 
and human language. It focuses on how to program computers to process and 
analyze large amounts of natural language data. Key tasks include text 
classification, sentiment analysis, machine translation, and speech recognition.
"""

processed_words = preprocess_text(sample_text)
word_freq = Counter(processed_words)

# Print the 10 most frequent words
print("Top 10 words by TF:")
for word, freq in word_freq.most_common(10):
    print(f"{word}: {freq}")

5. TF-IDF computation

  • Input data: a list of four English documents.
  • Workflow
    • Convert the documents into a TF-IDF matrix with TfidfVectorizer.
    • Take the first document's TF-IDF values and print the top 5 keywords by score (e.g. computers, understand).
documents = [
    "Natural language processing enables computers to understand human language.",
    "Machine learning is a key component of artificial intelligence.",
    "Text classification and sentiment analysis are common NLP tasks.",
    "Deep learning has revolutionized speech recognition systems."
]

# Compute TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

# Get the TF-IDF values for the first document
tfidf_values = tfidf_matrix[0].toarray().flatten()
tfidf_dict = {word: score for word, score in zip(feature_names, tfidf_values)}

print("\nTF-IDF示例(第一个文档):")
for word, score in sorted(tfidf_dict.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"{word}: {score:.4f}")


6. Visualization

  • Word cloud: generated from the word frequencies; font size reflects how frequent a word is.
  • Horizontal bar chart: shows the frequency distribution of the top 10 words.
# ====================
# Visualization (word cloud)
# ====================
wordcloud = WordCloud(width=800, height=400,
                      background_color='white',
                      colormap='viridis').generate_from_frequencies(word_freq)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Frequency Cloud')
plt.show()

# ====================
# Visualization (bar chart)
# ====================
top_words = 10
words, frequencies = zip(*word_freq.most_common(top_words))

plt.figure(figsize=(12, 6))
plt.barh(range(len(words)), frequencies, color='skyblue')
plt.yticks(range(len(words)), words)
plt.gca().invert_yaxis()  # show the most frequent word at the top
plt.xlabel('Frequency')
plt.title(f'Top {top_words} Most Frequent Words')
plt.tight_layout()
plt.show()

Code summary

  • Core goal: demonstrate the full pipeline from text cleaning to keyword extraction and visualization.
  • Multilingual support: switch between Chinese and English processing via the language parameter (the walkthrough only demonstrates English; a Chinese usage sketch follows this list).
  • Typical uses: exploratory text analysis, keyword extraction, teaching examples.
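
For completeness, here is a minimal sketch of the Chinese branch; the sample sentence is made up, and the call assumes chinese_stopwords.txt exists in the working directory (see the caveats below):

# Hypothetical Chinese sample; requires chinese_stopwords.txt in the working directory
chinese_text = "自然语言处理是人工智能的一个重要方向,研究计算机与人类语言的交互。"
chinese_words = preprocess_text(chinese_text, language='chinese')
chinese_freq = Counter(chinese_words)

print("Top 5 Chinese words:")
for word, freq in chinese_freq.most_common(5):
    print(f"{word}: {freq}")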

Potential issues and caveats

  1. Chinese stop-word file dependency
    • chinese_stopwords.txt must exist in the current directory and be UTF-8 encoded, otherwise the Chinese branch will raise an error.
  2. Path permissions
    • Writing to the NLTK data directory C:/nltk_data may require administrator privileges.
  3. No stemming or lemmatization
    • English words are not stemmed ("processing" and "process" are counted as separate words); a possible fix is sketched after this list.
  4. Limitations of the TF-IDF example
    • The example documents are few and simple; real applications need a larger corpus for TF-IDF to discriminate well.
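
For issue 3, one possible fix is to stem the filtered English tokens with NLTK's PorterStemmer; this is only a sketch and is not part of the code above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Collapse inflected forms, e.g. "processing" and "processes" both become "process"
stemmed_words = [stemmer.stem(word) for word in processed_words]
stemmed_freq = Counter(stemmed_words)

print("Top 10 stems:")
for word, freq in stemmed_freq.most_common(10):
    print(f"{word}: {freq}")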

Sample output

(Figure 1)
Word cloud:
(Figure 2)
Bar chart:
(Figure 3)

Complete code

import nltk
nltk.data.path.append('C:/nltk_data')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import jieba  # Chinese word segmentation library


# Initialize NLTK data (downloads required on first run)
nltk.download('punkt', download_dir='C:/nltk_data')
nltk.download('stopwords', download_dir='C:/nltk_data')

# ====================
# Text preprocessing function
# ====================
def preprocess_text(text, language='english'):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    if language == 'chinese':
        # Chinese word segmentation
        words = jieba.cut(text)
        # Load the Chinese stop-word list
        stop_words = set(open('chinese_stopwords.txt', encoding='utf-8').read().split())
    else:
        # English processing
        text = text.lower()
        words = word_tokenize(text)
        stop_words = set(stopwords.words('english'))

    # Remove stop words and digits
    filtered_words = [word for word in words if
                      word not in stop_words
                      and not word.isdigit()
                      and len(word) > 1]
    return filtered_words


# ====================
# Sample text
# ====================
sample_text = """
Natural language processing (NLP) is a subfield of linguistics, computer science, 
and artificial intelligence concerned with the interactions between computers 
and human language. It focuses on how to program computers to process and 
analyze large amounts of natural language data. Key tasks include text 
classification, sentiment analysis, machine translation, and speech recognition.
"""

# ====================
# Word frequency statistics (TF)
# ====================
processed_words = preprocess_text(sample_text)
word_freq = Counter(processed_words)

# Print the 10 most frequent words
print("Top 10 words by TF:")
for word, freq in word_freq.most_common(10):
    print(f"{word}: {freq}")

# ====================
# TF-IDF computation (example uses multiple documents)
# ====================
documents = [
    "Natural language processing enables computers to understand human language.",
    "Machine learning is a key component of artificial intelligence.",
    "Text classification and sentiment analysis are common NLP tasks.",
    "Deep learning has revolutionized speech recognition systems."
]

# Compute TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

# Get the TF-IDF values for the first document
tfidf_values = tfidf_matrix[0].toarray().flatten()
tfidf_dict = {word: score for word, score in zip(feature_names, tfidf_values)}

print("\nTF-IDF示例(第一个文档):")
for word, score in sorted(tfidf_dict.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"{word}: {score:.4f}")

# ====================
# Visualization (word cloud)
# ====================
wordcloud = WordCloud(width=800, height=400,
                      background_color='white',
                      colormap='viridis').generate_from_frequencies(word_freq)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Frequency Cloud')
plt.show()

# ====================
# Visualization (bar chart)
# ====================
top_words = 10
words, frequencies = zip(*word_freq.most_common(top_words))

plt.figure(figsize=(12, 6))
plt.barh(range(len(words)), frequencies, color='skyblue')
plt.yticks(range(len(words)), words)
plt.gca().invert_yaxis()  # show the most frequent word at the top
plt.xlabel('Frequency')
plt.title(f'Top {top_words} Most Frequent Words')
plt.tight_layout()
plt.show()
