Named Entity Recognition with Scikit-Learn

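Named entity recognition (NER) is the process of identifying information units in unstructured text: names (of people, organizations, and locations) as well as numeric expressions such as times, dates, and monetary amounts. The goal is to develop practical, domain-independent techniques that detect named entities automatically and with high accuracy.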

1. Data

The data is a feature-engineered corpus annotated with BIO and POS tags.

The entity types are:

  • geo - Geographical Entity
  • org - Organization
  • per - Person
  • gpe - Geopolitical Entity
  • tim - Time indicator
  • art - Artifact
  • eve - Event
  • nat - Natural Phenomenon

BIO (beginning-inside-outside) is a common format for tagging tokens.

  • B - prefix on a tag, indicating that the token begins a chunk.
  • I - prefix on a tag, indicating that the token is inside a chunk.
  • O - indicates that the token does not belong to any chunk.
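To make the BIO scheme concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The helper name bio_to_spans and the example sentence are illustrative assumptions, not part of the dataset tooling.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open chunk.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open chunk.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["President", "Susilo", "Bambang", "Yudhoyono", "visits", "Moscow", "."]
tags = ["B-per", "I-per", "I-per", "I-per", "O", "B-geo", "O"]
print(bio_to_spans(tokens, tags))
# [('per', 'President Susilo Bambang Yudhoyono'), ('geo', 'Moscow')]
```

The B- prefix is what lets two adjacent entities of the same type be told apart; without it, consecutive I- tags would merge into one span.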
# import package
import numpy as np
import pandas as pd

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

from sklearn.model_selection import train_test_split

from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import classification_report

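The full dataset is large (over a million rows), and its dense feature matrix would not fit in memory, so we keep only the first 10,000 records.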

df = pd.read_csv('datasets/entity-annotated-corpus/ner_dataset.csv')
print(df.head())
print(df.isnull().sum())

# The full dataset is too large to process in memory on one machine, so we keep the first 10,000 records
# and use out-of-core learning algorithms to fetch and process the data efficiently.
df = df[:10000]

# Output
    Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1          NaN             of   IN   O
2          NaN  demonstrators  NNS   O
3          NaN           have  VBP   O
4          NaN        marched  VBN   O
Sentence #    1000616
Word                0
POS                 0
Tag                 0
dtype: int64

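2. Data preprocessing

The "Sentence #" column contains many NaN values: only the first token of each sentence carries the sentence ID. We fill each NaN with the preceding value.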
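As a small illustration of forward filling, the sketch below applies it to a made-up five-row frame (the words and the name toy are invented for the example) so that every token ends up carrying its sentence ID.

```python
import pandas as pd

# Toy frame: only the first token of each sentence has an ID.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "They", "left"],
})

# Forward fill copies the last seen value down into the gaps.
filled = toy.ffill()
print(filled["Sentence #"].tolist())
# ['Sentence: 1', 'Sentence: 1', 'Sentence: 1', 'Sentence: 2', 'Sentence: 2']
```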

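# Data preprocessing
# The "Sentence #" column contains many NaN values; fill each with the preceding value.
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent.)
df = df.ffill()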
print(df.head())

# (457 2746 17): 457 sentences containing 2,746 unique words, tagged with 17 distinct tags
print(df['Sentence #'].nunique(), df.Word.nunique(), df.Tag.nunique())

# Output
    Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1  Sentence: 1             of   IN   O
2  Sentence: 1  demonstrators  NNS   O
3  Sentence: 1           have  VBP   O
4  Sentence: 1        marched  VBN   O
457 2746 17

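We have 457 sentences containing 2,746 unique words, tagged with 17 distinct tags.

We use DictVectorizer to turn each row into a feature vector, then split the data into training and test sets.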
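A toy illustration of what DictVectorizer does with string-valued features: each distinct (feature, value) pair becomes its own 0/1 column. The three records are made up for the example, and the names records, vec, and X_toy are assumptions.

```python
from sklearn.feature_extraction import DictVectorizer

# Three toy rows in the same shape as df.to_dict('records').
records = [
    {"Word": "Thousands", "POS": "NNS"},
    {"Word": "of", "POS": "IN"},
    {"Word": "demonstrators", "POS": "NNS"},
]

vec = DictVectorizer(sparse=False)
X_toy = vec.fit_transform(records)

# One column per distinct "feature=value" string, in sorted order.
print(vec.feature_names_)
print(X_toy)
```

Because every distinct word adds a column, the width of the matrix grows with the vocabulary, which is why a dense matrix becomes expensive on larger slices of the corpus.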

# Use DictVectorizer to convert the rows into vectors, then split into training and test sets
X = df.drop('Tag', axis=1)
v = DictVectorizer(sparse=False)
# print(X[:2].to_dict('records'))
# X = v.fit_transform(X[:2].to_dict('records'))
# print(X)
# # Print what each feature dimension means
# print('result array feature name:\n', v.get_feature_names_out())
X = v.fit_transform(X.to_dict('records'))
y = df.Tag.values
classes = np.unique(y)
classes = classes.tolist()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
print(X_train.shape, y_train.shape)

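# The "O" (outside) tag is by far the most common, so including it would make the results look better than they really are. We therefore exclude "O" when computing classification metrics.
new_classes = [c for c in classes if c != 'O']
print(new_classes)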

# Output
(6700, 3242) (6700,)
['B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per', 'B-tim', 'I-art', 'I-eve', 'I-geo', 'I-gpe', 'I-nat', 'I-org', 'I-per', 'I-tim']
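To see why the "O" tag is excluded from evaluation, consider this toy sketch with made-up labels: a model that finds only one of three entity tokens still scores well when "O" counts. The variable names are assumptions for the example.

```python
from sklearn.metrics import f1_score

# Toy labels: 7 of 10 tokens are "O"; the model finds 1 of 3 entity tokens.
y_true = ["O", "O", "O", "O", "O", "O", "B-geo", "B-per", "I-per", "O"]
y_pred = ["O", "O", "O", "O", "O", "O", "O", "B-per", "O", "O"]

entity_labels = ["B-geo", "B-per", "I-per"]

# Micro-averaged F1 over all labels, "O" included.
f1_with_o = f1_score(y_true, y_pred, average="micro")
# Micro-averaged F1 restricted to the entity labels.
f1_without_o = f1_score(y_true, y_pred, labels=entity_labels, average="micro")

print(round(f1_with_o, 2), round(f1_without_o, 2))
# 0.8 0.5
```

Passing labels=new_classes to classification_report restricts the report in the same way.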

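3. Out-of-core algorithms

We will try several out-of-core algorithms, which are designed to handle data too large to fit into a single computer's memory. In scikit-learn these are the estimators that expose a partial_fit method.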
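The sketch below shows the usual out-of-core pattern under simplified assumptions: synthetic data stands in for batches that would normally be streamed from disk, and the names X_all, y_all, and batch_size are invented for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for data that would normally be read in chunks.
rng = np.random.RandomState(0)
X_all = rng.rand(1000, 5)
y_all = (X_all[:, 0] > 0.5).astype(int)

# The full label set must be known before the first batch is seen.
all_classes = np.unique(y_all)

clf = SGDClassifier(random_state=0)
batch_size = 100
for start in range(0, len(X_all), batch_size):
    X_batch = X_all[start:start + batch_size]
    y_batch = y_all[start:start + batch_size]
    # classes is required on the first call, because a single batch
    # might not contain every label.
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

print(clf.classes_)
```

In a real out-of-core setting the batches might come from pd.read_csv with the chunksize argument, so that only one chunk is in memory at a time. The code in this article instead calls partial_fit once on the whole training set, which amounts to a single pass over the data.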

3.1 Perceptron

"""
Perceptron
"""
per = Perceptron(verbose=1)
per.partial_fit(X_train, y_train, classes)

y_pred = per.predict(X_test)
# print(X_test[:10])
# print(y_pred[:10])
# print(y_test[:10])
# print(classes)
print(classification_report(y_pred=y_pred, y_true=y_test, labels=new_classes))

# Output
-- Epoch 1
Norm: 8.77, NNZs: 65, Bias: -3.000000, T: 6700, Avg. loss: 0.005224
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 5.48, NNZs: 27, Bias: -2.000000, T: 6700, Avg. loss: 0.002239
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 22.11, NNZs: 339, Bias: -5.000000, T: 6700, Avg. loss: 0.041343
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 26.00, NNZs: 389, Bias: -4.000000, T: 6700, Avg. loss: 0.040299
Total training time: 0.04 seconds.
-- Epoch 1
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Norm: 5.10, NNZs: 18, Bias: -2.000000, T: 6700, Avg. loss: 0.001343
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 19.57, NNZs: 269, Bias: -3.000000, T: 6700, Avg. loss: 0.028358
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 18.73, NNZs: 241, Bias: -5.000000, T: 6700, Avg. loss: 0.024776
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 17.83, NNZs: 166, Bias: -2.000000, T: 6700, Avg. loss: 0.012985
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 6.63, NNZs: 41, Bias: -2.000000, T: 6700, Avg. loss: 0.002985
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 5.92, NNZs: 32, Bias: -3.000000, T: 6700, Avg. loss: 0.001791
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 8.66, NNZs: 72, Bias: -3.000000, T: 6700, Avg. loss: 0.006567
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 8.06, NNZs: 50, Bias: -3.000000, T: 6700, Avg. loss: 0.003731
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 3.16, NNZs: 10, Bias: -2.000000, T: 6700, Avg. loss: 0.000149
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 17.58, NNZs: 225, Bias: -3.000000, T: 6700, Avg. loss: 0.024328
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 23.11, NNZs: 333, Bias: -4.000000, T: 6700, Avg. loss: 0.032985
Total training time: 0.05 seconds.
-- Epoch 1
Norm: 5.92, NNZs: 35, Bias: -3.000000, T: 6700, Avg. loss: 0.002985
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 23.13, NNZs: 331, Bias: 3.000000, T: 6700, Avg. loss: 0.033134
Total training time: 0.05 seconds.
              precision    recall  f1-score   support

       B-art       0.24      0.44      0.31         9
       B-eve       0.00      0.00      0.00         3
       B-geo       0.87      0.19      0.31        69
       B-gpe       0.59      0.71      0.64       102
       B-nat       0.00      0.00      0.00         0
       B-org       0.34      0.75      0.46        63
       B-per       0.94      0.41      0.58        41
       B-tim       0.59      0.94      0.73        52
       I-art       0.33      0.20      0.25        10
       I-eve       0.00      0.00      0.00         3
       I-geo       0.00      0.00      0.00        11
       I-gpe       1.00      0.17      0.29         6
       I-nat       0.00      0.00      0.00         1
       I-org       0.43      0.21      0.29        47
       I-per       0.69      0.27      0.39        66
       I-tim       0.00      0.00      0.00         4

   micro avg       0.51      0.48      0.49       487
   macro avg       0.38      0.27      0.26       487
weighted avg       0.59      0.48      0.46       487

3.2 SGD linear classifier

"""
Linear classifier trained with SGD
"""
sgd = SGDClassifier(verbose=0)
sgd.partial_fit(X_train, y_train, classes)

print(classification_report(y_pred = sgd.predict(X_test), y_true = y_test, labels = new_classes))

# Output
              precision    recall  f1-score   support

       B-art       0.75      0.33      0.46         9
       B-eve       0.00      0.00      0.00         3
       B-geo       0.38      0.77      0.51        69
       B-gpe       0.97      0.31      0.47       102
       B-nat       0.00      0.00      0.00         0
       B-org       0.83      0.30      0.44        63
       B-per       0.27      0.54      0.36        41
       B-tim       1.00      0.75      0.86        52
       I-art       0.40      0.20      0.27        10
       I-eve       0.25      0.33      0.29         3
       I-geo       0.00      0.00      0.00        11
       I-gpe       1.00      0.17      0.29         6
       I-nat       0.00      0.00      0.00         1
       I-org       0.70      0.30      0.42        47
       I-per       0.49      0.48      0.49        66
       I-tim       0.00      0.00      0.00         4

   micro avg       0.50      0.45      0.47       487
   macro avg       0.44      0.28      0.30       487
weighted avg       0.66      0.45      0.48       487

3.3 Naive Bayes classifier

"""
Naive Bayes classifier for multinomial models
"""
nb = MultinomialNB(alpha=0.01)
nb.partial_fit(X_train, y_train, classes)

print(classification_report(y_pred = nb.predict(X_test), y_true = y_test, labels = new_classes))

# Output
              precision    recall  f1-score   support

       B-art       0.11      0.33      0.16         9
       B-eve       0.00      0.00      0.00         3
       B-geo       0.62      0.49      0.55        69
       B-gpe       0.64      0.73      0.68       102
       B-nat       0.00      0.00      0.00         0
       B-org       0.58      0.49      0.53        63
       B-per       0.40      0.51      0.45        41
       B-tim       0.67      0.85      0.75        52
       I-art       0.29      0.20      0.24        10
       I-eve       0.20      0.33      0.25         3
       I-geo       0.25      0.36      0.30        11
       I-gpe       0.25      0.50      0.33         6
       I-nat       0.00      0.00      0.00         1
       I-org       0.48      0.62      0.54        47
       I-per       0.60      0.39      0.48        66
       I-tim       0.06      0.25      0.10         4

   micro avg       0.51      0.56      0.53       487
   macro avg       0.32      0.38      0.33       487
weighted avg       0.55      0.56      0.54       487

3.4 Passive Aggressive classifier

"""
Passive Aggressive classifier
"""
pa = PassiveAggressiveClassifier(verbose=0)
pa.partial_fit(X_train, y_train, classes)

print(classification_report(y_pred = pa.predict(X_test), y_true = y_test, labels = new_classes))

# Output
              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00         9
       B-eve       0.00      0.00      0.00         3
       B-geo       0.25      0.94      0.39        69
       B-gpe       0.95      0.57      0.71       102
       B-nat       0.00      0.00      0.00         0
       B-org       0.71      0.08      0.14        63
       B-per       0.51      0.46      0.49        41
       B-tim       0.92      0.87      0.89        52
       I-art       0.17      0.10      0.12        10
       I-eve       0.00      0.00      0.00         3
       I-geo       0.00      0.00      0.00        11
       I-gpe       0.20      0.17      0.18         6
       I-nat       0.00      0.00      0.00         1
       I-org       1.00      0.15      0.26        47
       I-per       0.82      0.21      0.34        66
       I-tim       0.00      0.00      0.00         4

   micro avg       0.47      0.44      0.46       487
   macro avg       0.35      0.22      0.22       487
weighted avg       0.68      0.44      0.43       487

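None of the classifiers above performs well: the best micro-averaged F1 is only 0.53 (naive Bayes).

3.5 Conditional random fields

Conditional random fields (CRFs) are commonly used to label or parse sequential data such as natural language text, with applications including POS tagging and named entity recognition.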
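The classifiers in sections 3.1 to 3.4 label each token on its own, whereas a linear-chain CRF scores a whole tag sequence using per-token scores plus tag-to-tag transition scores, and picks the best sequence with Viterbi decoding. The toy sketch below illustrates that idea with hand-set, hypothetical scores. It is not how sklearn-crfsuite is implemented, and the numbers are not learned weights.

```python
import numpy as np

tags = ["O", "B-per", "I-per"]

# Hypothetical per-token scores (rows: tokens, columns: tags).
# The third token slightly prefers "O" when looked at in isolation.
emission = np.array([
    [2.0, 0.0, 0.0],   # "said"
    [0.0, 2.0, 0.0],   # "Susilo"
    [1.0, 0.0, 0.8],   # "Bambang"
])

# Hypothetical transition scores (rows: previous tag, columns: next tag).
# "O" -> "I-per" is heavily penalised, "B-per" -> "I-per" is rewarded.
transition = np.array([
    [0.5, 0.0, -5.0],  # from O
    [0.0, -1.0, 1.5],  # from B-per
    [0.0, -1.0, 1.0],  # from I-per
])


def viterbi(emission, transition):
    """Return the highest-scoring tag index sequence."""
    n_tokens, n_tags = emission.shape
    score = np.zeros((n_tokens, n_tags))
    backptr = np.zeros((n_tokens, n_tags), dtype=int)
    score[0] = emission[0]
    for t in range(1, n_tokens):
        # candidate[i, j]: best path ending in tag i at t-1, then tag j at t.
        candidate = score[t - 1][:, None] + transition + emission[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score[t] = candidate.max(axis=0)
    # Follow the back-pointers from the best final tag.
    best = [int(score[-1].argmax())]
    for t in range(n_tokens - 1, 0, -1):
        best.append(int(backptr[t][best[-1]]))
    return best[::-1]


independent = [tags[i] for i in emission.argmax(axis=1)]
joint = [tags[i] for i in viterbi(emission, transition)]
print(independent)  # ['O', 'B-per', 'O']
print(joint)        # ['O', 'B-per', 'I-per']
```

Decoding each token separately cuts the name short, while the transition scores let the joint decoding continue the person span.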

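"""
Conditional random fields (CRF)
CRFs are commonly used to label or parse sequential data such as natural language text, with applications including POS tagging and named entity recognition.
We will use sklearn-crfsuite to train a CRF model for named entity recognition on our dataset.
"""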

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from collections import Counter

# The class below retrieves sentences together with their POS tags and entity tags
class SentenceGetter(object):

    def __init__(self, dataset):
        self.n_sent = 1
        self.dataset = dataset
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values,
                                s['POS'].values,
                                s['Tag'].values)]
        self.grouped = self.dataset.groupby('Sentence #').apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None

getter = SentenceGetter(df)
sentences = getter.sentences

# print(sentences[:1])

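# Feature extraction
# We extract more features (word parts, simplified POS tags, lower/title/upper flags, features of neighboring words) and convert them to the sklearn-crfsuite format: each sentence becomes a list of dicts.
# The code below is taken from the sklearn-crfsuite documentation.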

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'bias': 1.0, 
        'word.lower()': word.lower(), 
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]
def sent2labels(sent):
    return [label for token, postag, label in sent]
def sent2tokens(sent):
    return [token for token, postag, label in sent]

# Split into training and test sets (by sentence)
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
print(len(X_train), X_train[0])
print(len(y_train), y_train[0])

# Train the CRF model
crf = sklearn_crfsuite.CRF(
    algorithm ='lbfgs',
    c1 = 0.1,
    c2 = 0.1,
    max_iterations = 100,
    all_possible_transitions = True
)
crf.fit(X_train, y_train)
# Evaluate
y_pred = crf.predict(X_test)

print(metrics.flat_classification_report(y_pred = y_pred, y_true = y_test, labels = new_classes))

# Output
306 [{'bias': 1.0, 'word.lower()': 'the', 'word[-3:]': 'The', 'word[-2:]': 'he', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'DT', 'postag[:2]': 'DT', 'BOS': True, '+1:word.lower()': 'spokesman', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'NN', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'spokesman', 'word[-3:]': 'man', 'word[-2:]': 'an', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'NN', 'postag[:2]': 'NN', '-1:word.lower()': 'the', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'DT', '-1:postag[:2]': 'DT', '+1:word.lower()': 'says', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VBZ', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'says', 'word[-3:]': 'ays', 'word[-2:]': 'ys', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VBZ', 'postag[:2]': 'VB', '-1:word.lower()': 'spokesman', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'NN', '-1:postag[:2]': 'NN', '+1:word.lower()': 'a', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'DT', '+1:postag[:2]': 'DT'}, {'bias': 1.0, 'word.lower()': 'a', 'word[-3:]': 'a', 'word[-2:]': 'a', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'DT', 'postag[:2]': 'DT', '-1:word.lower()': 'says', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VBZ', '-1:postag[:2]': 'VB', '+1:word.lower()': 'formal', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'JJ', '+1:postag[:2]': 'JJ'}, {'bias': 1.0, 'word.lower()': 'formal', 'word[-3:]': 'mal', 'word[-2:]': 'al', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'JJ', 'postag[:2]': 'JJ', '-1:word.lower()': 'a', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'DT', '-1:postag[:2]': 'DT', '+1:word.lower()': 'agreement', 
'+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'NN', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'agreement', 'word[-3:]': 'ent', 'word[-2:]': 'nt', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'NN', 'postag[:2]': 'NN', '-1:word.lower()': 'formal', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'JJ', '-1:postag[:2]': 'JJ', '+1:word.lower()': 'on', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'IN', '+1:postag[:2]': 'IN'}, {'bias': 1.0, 'word.lower()': 'on', 'word[-3:]': 'on', 'word[-2:]': 'on', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'IN', 'postag[:2]': 'IN', '-1:word.lower()': 'agreement', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'NN', '-1:postag[:2]': 'NN', '+1:word.lower()': 'the', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'DT', '+1:postag[:2]': 'DT'}, {'bias': 1.0, 'word.lower()': 'the', 'word[-3:]': 'the', 'word[-2:]': 'he', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'DT', 'postag[:2]': 'DT', '-1:word.lower()': 'on', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'IN', '-1:postag[:2]': 'IN', '+1:word.lower()': 'project', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'NN', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'project', 'word[-3:]': 'ect', 'word[-2:]': 'ct', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'NN', 'postag[:2]': 'NN', '-1:word.lower()': 'the', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'DT', '-1:postag[:2]': 'DT', '+1:word.lower()': 'will', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'MD', '+1:postag[:2]': 'MD'}, {'bias': 1.0, 'word.lower()': 'will', 'word[-3:]': 'ill', 'word[-2:]': 'll', 'word.isupper()': False, 'word.istitle()': False, 
'word.isdigit()': False, 'postag': 'MD', 'postag[:2]': 'MD', '-1:word.lower()': 'project', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'NN', '-1:postag[:2]': 'NN', '+1:word.lower()': 'be', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VB', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'be', 'word[-3:]': 'be', 'word[-2:]': 'be', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VB', 'postag[:2]': 'VB', '-1:word.lower()': 'will', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'MD', '-1:postag[:2]': 'MD', '+1:word.lower()': 'signed', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VBN', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'signed', 'word[-3:]': 'ned', 'word[-2:]': 'ed', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VBN', 'postag[:2]': 'VB', '-1:word.lower()': 'be', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VB', '-1:postag[:2]': 'VB', '+1:word.lower()': 'in', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'IN', '+1:postag[:2]': 'IN'}, {'bias': 1.0, 'word.lower()': 'in', 'word[-3:]': 'in', 'word[-2:]': 'in', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'IN', 'postag[:2]': 'IN', '-1:word.lower()': 'signed', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VBN', '-1:postag[:2]': 'VB', '+1:word.lower()': 'june', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'june', 'word[-3:]': 'une', 'word[-2:]': 'ne', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'in', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'IN', '-1:postag[:2]': 'IN', '+1:word.lower()': 'when', '+1:word.istitle()': False, 
'+1:word.isupper()': False, '+1:postag': 'WRB', '+1:postag[:2]': 'WR'}, {'bias': 1.0, 'word.lower()': 'when', 'word[-3:]': 'hen', 'word[-2:]': 'en', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'WRB', 'postag[:2]': 'WR', '-1:word.lower()': 'june', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'indonesian', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'JJ', '+1:postag[:2]': 'JJ'}, {'bias': 1.0, 'word.lower()': 'indonesian', 'word[-3:]': 'ian', 'word[-2:]': 'an', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'JJ', 'postag[:2]': 'JJ', '-1:word.lower()': 'when', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'WRB', '-1:postag[:2]': 'WR', '+1:word.lower()': 'president', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'president', 'word[-3:]': 'ent', 'word[-2:]': 'nt', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'indonesian', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'JJ', '-1:postag[:2]': 'JJ', '+1:word.lower()': 'susilo', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'susilo', 'word[-3:]': 'ilo', 'word[-2:]': 'lo', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'president', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'bambang', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'bambang', 'word[-3:]': 'ang', 'word[-2:]': 'ng', 'word.isupper()': False, 'word.istitle()': True, 
'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'susilo', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'yudhoyono', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'yudhoyono', 'word[-3:]': 'ono', 'word[-2:]': 'no', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'bambang', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'is', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VBZ', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'is', 'word[-3:]': 'is', 'word[-2:]': 'is', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VBZ', 'postag[:2]': 'VB', '-1:word.lower()': 'yudhoyono', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'scheduled', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VBN', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'scheduled', 'word[-3:]': 'led', 'word[-2:]': 'ed', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VBN', 'postag[:2]': 'VB', '-1:word.lower()': 'is', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VBZ', '-1:postag[:2]': 'VB', '+1:word.lower()': 'to', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'TO', '+1:postag[:2]': 'TO'}, {'bias': 1.0, 'word.lower()': 'to', 'word[-3:]': 'to', 'word[-2:]': 'to', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'TO', 'postag[:2]': 'TO', '-1:word.lower()': 'scheduled', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VBN', '-1:postag[:2]': 'VB', '+1:word.lower()': 'visit', 
'+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VB', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'visit', 'word[-3:]': 'sit', 'word[-2:]': 'it', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VB', 'postag[:2]': 'VB', '-1:word.lower()': 'to', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'TO', '-1:postag[:2]': 'TO', '+1:word.lower()': 'moscow', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'moscow', 'word[-3:]': 'cow', 'word[-2:]': 'ow', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'visit', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VB', '-1:postag[:2]': 'VB', '+1:word.lower()': '.', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': '.', '+1:postag[:2]': '.'}, {'bias': 1.0, 'word.lower()': '.', 'word[-3:]': '.', 'word[-2:]': '.', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': '.', 'postag[:2]': '.', '-1:word.lower()': 'moscow', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', 'EOS': True}]
306 ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-tim', 'O', 'B-gpe', 'B-per', 'I-per', 'I-per', 'I-per', 'O', 'O', 'O', 'O', 'B-geo', 'O']
              precision    recall  f1-score   support

       B-art       0.50      0.40      0.44         5
       B-eve       0.00      0.00      0.00         2
       B-geo       0.79      0.68      0.73        77
       B-gpe       0.75      0.88      0.81        91
       B-nat       0.00      0.00      0.00         2
       B-org       0.77      0.68      0.72        53
       B-per       0.85      0.92      0.88        61
       B-tim       0.95      0.89      0.92        45
       I-art       0.00      0.00      0.00         4
       I-eve       0.00      0.00      0.00         1
       I-geo       0.75      0.38      0.50        16
       I-gpe       0.67      0.57      0.62         7
       I-nat       0.00      0.00      0.00         2
       I-org       0.74      0.70      0.72        50
       I-per       0.87      0.97      0.92        75
       I-tim       0.33      1.00      0.50         1

   micro avg       0.80      0.78      0.79       492
   macro avg       0.50      0.50      0.48       492
weighted avg       0.78      0.78      0.78       492
