Core Natural Language Processing (NLP) Techniques: From Word Embeddings to Transformers

1. NLP Fundamentals and Text Representation

1.1 Text Preprocessing Techniques

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and digits (keep only letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
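    # The source is cut off mid-line here; the rest of this line and the
    # return statement are an assumed completion that follows the pattern above.
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens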
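
To see what the first steps of such a pipeline do to a sentence, here is a minimal sketch that needs no NLTK data downloads. It is an illustration under stated assumptions, not the article's implementation: `simple_preprocess` and `STOP_WORDS` are hypothetical names, the stop-word list is a tiny hand-picked set, tokenization is a plain whitespace split, and stemming and lemmatization are left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# NLTK's English list is much larger.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "over"}


def simple_preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip non-letters, split on whitespace, drop stop words."""
    # Convert to lowercase so matching against the stop-word set is case-insensitive
    text = text.lower()
    # Remove everything except letters and whitespace (digits and punctuation go)
    text = re.sub(r"[^a-z\s]", "", text)
    # Whitespace tokenization: a simplification of nltk.word_tokenize
    tokens = text.split()
    # Keep only tokens that are not in the stop-word set
    return [tok for tok in tokens if tok not in stop_words]


print(simple_preprocess("The 3 quick foxes are jumping over the lazy dogs!"))
# prints ['quick', 'foxes', 'jumping', 'lazy', 'dogs']
```

Note that "foxes" and "jumping" keep their inflected forms here; reducing them to a base form is the job of the stemming and lemmatization steps.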
