

1. Dataset Setup and Analysis


Class | Training set | Test set
Normal SMS (positive class) | 475,560 | 203,805
Spam SMS (negative class) | 52,830 | 22,648
Total | 528,390 | 226,453
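
The counts above imply a strong class imbalance, which matters when reading the accuracy-style metrics later on. A quick check using only the figures from the table (the variable and function names are illustrative, not part of the experiment code):

```python
# Class counts copied from the dataset table above.
counts = {
    "train": {"normal": 475_560, "spam": 52_830},
    "test": {"normal": 203_805, "spam": 22_648},
}


def spam_share(split):
    """Return the fraction of messages in a split that are spam."""
    total = split["normal"] + split["spam"]
    return split["spam"] / total


for name, split in counts.items():
    total = split["normal"] + split["spam"]
    print(f"{name}: total={total}, spam share={spam_share(split):.4f}")
```

In both splits spam makes up roughly 10% of the messages, so a classifier that labels everything as normal would already be right about 90% of the time.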






2. Algorithm Implementation


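Algorithm | Advantages | Disadvantages
Naive Bayes | Solid mathematical foundation; simple to implement; efficient in both training and prediction. | Real data often violates the conditional-independence assumption, so accuracy drops when features are strongly correlated; the assumed prior distribution affects the results; performs poorly on class-imbalanced data.
Logistic regression | Simple to implement; fast to train. | Struggles to fit non-linear data; performs poorly when the feature space is very large; the decision threshold is hard to choose; prone to underfitting.
Random forest | Training is highly parallelizable, which speeds it up on large datasets; handles high-dimensional data; reports an importance score for each feature. | Prone to overfitting when the data is noisy.
SVM | Handles both linear and non-linear data; generalizes well. | Parameter tuning and kernel selection rely largely on experience and are somewhat arbitrary.
LSTM | Uses word-order information. | Uses word order in the forward direction only.
BiLSTM | Uses context from both directions. | Needs a long training time to converge.
BERT | Captures contextual information more strongly. | The [MASK] token used in pre-training creates a mismatch between pre-training and fine-tuning, which hurts performance; needs even more time to converge.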


2.1 Naive Bayes



# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def stopwordslist(filepath):
    # Return a set so the per-token membership test during segmentation is O(1),
    # and close the file deterministically.
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

if __name__ == '__main__':
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))
    X_train = train_data['text']
    X_test = test_data['text']

    y_train  = train_data['labels']
    y_test = test_data['labels']
    d = {'labels':train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Training model...")
    t = time()
    model = MultinomialNB()
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    conf_mat = confusion_matrix(y_test, y_pred)

    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))

2.2 Logistic Regression



# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def stopwordslist(filepath):
    # Return a set so the per-token membership test during segmentation is O(1),
    # and close the file deterministically.
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

if __name__ == '__main__':
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))
    X_train = train_data['text']
    X_test = test_data['text']

    y_train  = train_data['labels']
    y_test = test_data['labels']
    d = {'labels':train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Training model...")
    t = time()
    model = LogisticRegression(random_state=0)
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    conf_mat = confusion_matrix(y_test, y_pred)

    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))

2.3 Random Forest



# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def stopwordslist(filepath):
    # Return a set so the per-token membership test during segmentation is O(1),
    # and close the file deterministically.
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

if __name__ == '__main__':
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))
    X_train = train_data['text']
    X_test = test_data['text']

    y_train  = train_data['labels']
    y_test = test_data['labels']
    d = {'labels':train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Training model...")
    t = time()
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    conf_mat = confusion_matrix(y_test, y_pred)

    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))

2.4 SVM



# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def stopwordslist(filepath):
    # Return a set so the per-token membership test during segmentation is O(1),
    # and close the file deterministically.
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

if __name__ == '__main__':
    print("Loading train dataset ...")
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Loading test dataset ...")
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))
    X_train = train_data['text']
    X_test = test_data['text']

    y_train  = train_data['labels']
    y_test = test_data['labels']
    d = {'labels':train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print("Loading stopwords ...")
    t = time()
    stopwords = stopwordslist("stopwords.txt")
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Starting word segmentation on train dataset...")
    t = time()
    X_train = X_train.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Starting word segmentation on test dataset...")
    t = time()
    X_test = X_test.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Vectorizing train dataset...")
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Vectorizing test dataset...")
    t = time()
    X_test = tfidf.transform(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    print("Training model...")
    t = time()
    model = LinearSVC()
    model.fit(X_train, y_train)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))

    conf_mat = confusion_matrix(y_test, y_pred)

    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))

2.5 LSTM



# -*- coding: utf-8 -*-

import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.layers import Dropout
from gensim.models import KeyedVectors
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def stopwordslist(filepath):
    # Return a set so the per-token membership test during segmentation is O(1),
    # and close the file deterministically.
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

if __name__ == '__main__':
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))
    X_train = train_data['text']
    X_test = test_data['text']

    y_train  = train_data['labels']
    y_test = test_data['labels']
    d = {'labels':train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    stopwords = stopwordslist("stopwords.txt")
    X_train = X_train.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    X_test = X_test.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
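    # Keep the 50,000 most frequent words (texts_to_matrix keeps only the first MAX_NB_WORDS columns).
    MAX_NB_WORDS = 50000
    # Maximum length of each text, in tokens.
    # NOTE: assumed placeholder value; the write-up does not state it.
    MAX_SEQUENCE_LENGTH = 100
    # Dimension of the Embedding layer; it must match the vectors in tencent.txt.
    # NOTE: assumed placeholder value; the write-up does not state it.
    EMBEDDING_DIM = 200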

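    tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
    # Build the vocabulary from the training texts; without this step word_index stays empty.
    tokenizer.fit_on_texts(X_train)
    word_index = tokenizer.word_index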
    print('There are %s different words.' % len(word_index))
    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)
    X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
    X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)
    y_train = pd.get_dummies(y_train).values
    y_test = pd.get_dummies(y_test).values
    wv_from_text = KeyedVectors.load_word2vec_format('tencent.txt', binary=False, unicode_errors='ignore')
    embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))
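    for word, i in word_index.items():
        # Fill only rows that exist in the matrix, and only for words with a pretrained vector.
        if i < MAX_NB_WORDS and word in wv_from_text:
            embedding_matrix[i] = wv_from_text.get_vector(word)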
    del wv_from_text
    print("Training model...")
    t = time()
    model = Sequential()
    model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X_train.shape[1], weights = [embedding_matrix], trainable = False))
    model.add(LSTM(300, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    epochs = 10
    batch_size = 64

    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    accr = model.evaluate(X_test,y_test)
    print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))
    print("Predicting test dataset...")
    t = time()
    y_pred = model.predict(X_test)
    print("Done in {0} seconds\n".format(round(time() - t, 2)))
    y_pred = y_pred.argmax(axis = 1)
    y_test = y_test.argmax(axis = 1)

    conf_mat = confusion_matrix(y_test, y_pred)

    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))
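
The embedding-matrix step in the listing above is meant to copy a pretrained vector into row i for every word whose tokenizer index i fits inside the matrix. The following is a minimal sketch of that intended logic, with a plain dict standing in for the gensim KeyedVectors object and nested lists standing in for the NumPy array; the function name and the toy vocabulary are hypothetical.

```python
def build_embedding_matrix(word_index, vectors, max_words, dim):
    """Return a max_words x dim matrix whose row i is the vector of the word with index i.

    Rows stay all-zero for index 0 (reserved for padding by the Keras Tokenizer)
    and for words that have no pretrained vector.
    """
    matrix = [[0.0] * dim for _ in range(max_words)]
    for word, i in word_index.items():
        # Skip indices that fall outside the matrix and out-of-vocabulary words.
        if i < max_words and word in vectors:
            matrix[i] = list(vectors[word])
    return matrix


# Toy data: indices start at 1, as in Tokenizer.word_index.
word_index = {"free": 1, "meeting": 2, "rare": 3}
vectors = {"free": [0.1, 0.2], "meeting": [0.3, 0.4]}

matrix = build_embedding_matrix(word_index, vectors, max_words=3, dim=2)
for row in matrix:
    print(row)
```

Here "rare" is dropped because its index 3 does not fit into a matrix with three rows, which mirrors how words ranked beyond MAX_NB_WORDS never reach the Embedding layer.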

2.6 BiLSTM



# -*- coding: utf-8 -*-

import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import jieba
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Bidirectional
from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.layers import Dropout
from gensim.models import KeyedVectors
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def stopwordslist(filepath):
    # Return a set so the per-token membership test during segmentation is O(1),
    # and close the file deterministically.
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

if __name__ == '__main__':
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Total number of labeled documents(train): %d ." % len(train_data))
    print("Total number of labeled documents(test): %d ." % len(test_data))
    X_train = train_data['text']
    X_test = test_data['text']

    y_train  = train_data['labels']
    y_test = test_data['labels']
    d = {'labels':train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    stopwords = stopwordslist("stopwords.txt")
    X_train = X_train.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    X_test = X_test.apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords]))
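    # Keep the 50,000 most frequent words (texts_to_matrix keeps only the first MAX_NB_WORDS columns).
    MAX_NB_WORDS = 50000
    # Maximum length of each text, in tokens.
    # NOTE: assumed placeholder value; the write-up does not state it.
    MAX_SEQUENCE_LENGTH = 100
    # Dimension of the Embedding layer; it must match the vectors in tencent.txt.
    # NOTE: assumed placeholder value; the write-up does not state it.
    EMBEDDING_DIM = 200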

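    tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
    # Build the vocabulary from the training texts; without this step word_index stays empty.
    tokenizer.fit_on_texts(X_train)
    word_index = tokenizer.word_index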
    print('There are %s different words.' % len(word_index))
    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)
    X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
    X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)
    y_train = pd.get_dummies(y_train).values
    y_test = pd.get_dummies(y_test).values
    wv_from_text = KeyedVectors.load_word2vec_format('tencent.txt', binary=False, unicode_errors='ignore')
    embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))
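    for word, i in word_index.items():
        # Fill only rows that exist in the matrix, and only for words with a pretrained vector.
        if i < MAX_NB_WORDS and word in wv_from_text:
            embedding_matrix[i] = wv_from_text.get_vector(word)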
    del wv_from_text
    model = Sequential()
    model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X_train.shape[1], weights = [embedding_matrix], trainable = False))
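    # Bidirectional recurrent layer that makes this a BiLSTM.
    # NOTE: the layer size and dropout rates are assumptions mirroring the LSTM in Section 2.5.
    model.add(Bidirectional(LSTM(300, dropout=0.2, recurrent_dropout=0.2)))
    model.add(Dense(2, activation='softmax'))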
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    epochs = 10
    batch_size = 64

    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
    accr = model.evaluate(X_test,y_test)
    print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))
    y_pred = model.predict(X_test)
    y_pred = y_pred.argmax(axis = 1)
    y_test = y_test.argmax(axis = 1)

    conf_mat = confusion_matrix(y_test, y_pred)

    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))

2.7 BERT



import pandas as pd
from simpletransformers.model import TransformerModel
from sklearn.metrics import f1_score, accuracy_score

def f1_multiclass(labels, preds):
      return f1_score(labels, preds, average='micro')

if __name__ == '__main__':
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print("Total number of labeled papers(train): %d ." % len(train_data))
    print("Total number of labeled papers(test): %d ." % len(test_data))
    model = TransformerModel('bert', 'bert-base-chinese', num_labels=2, args={'learning_rate':1e-5, 'num_train_epochs': 2, 
'reprocess_input_data': True, 'overwrite_output_dir': True, 'fp16': False})
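    # For bert-base-multilingual, change the first two arguments to: 'bert', 'bert-base-multilingual-cased'
    # For RoBERTa, change the first two arguments to: 'roberta', 'roberta-base'
    # For XLM-RoBERTa, change the first two arguments to: 'xlmroberta', 'xlm-roberta-base'

    # Fine-tune on the training set before evaluating.
    model.train_model(train_data)
    result, model_outputs, wrong_predictions = model.eval_model(test_data, f1=f1_multiclass, acc=accuracy_score)
    print(result)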

3. Comparison of Results

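To analyze the algorithms quantitatively, treat normal SMS as positive samples, numbering P (Positive), and spam SMS as negative samples, numbering N (Negative). Samples the classifier labels correctly are T (True), and samples it labels incorrectly are F (False). Accordingly, true positives (TP) are normal messages classified correctly; false positives (FP) are spam messages mistaken for normal messages; true negatives (TN) are spam messages classified correctly; and false negatives (FN) are normal messages mistaken for spam. On this basis, the experiments use the following five evaluation metrics: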

(1) Weighted-average precision, computed as:
$\text{Precision-weighted} = (Precision_P \cdot P + Precision_N \cdot N)/(P+N)$,
where $Precision_P = TP/(TP+FP)$ and $Precision_N = TN/(TN+FN)$.

(2) Weighted-average recall, computed as:
$\text{Recall-weighted} = (Recall_P \cdot P + Recall_N \cdot N)/(P+N)$,
where $Recall_P = TP/(TP+FN)$ and $Recall_N = TN/(TN+FP)$.

(3) Weighted-average F1 score, computed as:
$\text{F1-score-weighted} = (F1_P \cdot P + F1_N \cdot N)/(P+N)$,
where $F1_P = 2 \cdot Precision_P \cdot Recall_P/(Precision_P + Recall_P)$
and $F1_N = 2 \cdot Precision_N \cdot Recall_N/(Precision_N + Recall_N)$.

(4) False negative rate (FNR), computed as:
$FNR = FN/(TP+FN)$, i.e., the number of normal messages predicted as spam divided by the actual number of normal messages.

(5) True negative rate (TNR), computed as:
$TNR = TN/(TN+FP)$, i.e., the number of spam messages identified correctly divided by the actual number of spam messages; this is also the recall of the spam class.
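
The five metrics defined above can be computed directly from the four confusion-matrix counts. The function below is a minimal sketch under the same convention (normal SMS is the positive class); the function name and the example counts are hypothetical and are not taken from the experiments.

```python
def evaluate(tp, fp, tn, fn):
    """Compute the five metrics from confusion-matrix counts.

    Positive class = normal SMS, negative class = spam.
    Assumes every denominator is non-zero.
    """
    p = tp + fn  # actual number of normal messages (P)
    n = tn + fp  # actual number of spam messages (N)

    precision_p = tp / (tp + fp)
    precision_n = tn / (tn + fn)
    recall_p = tp / p
    recall_n = tn / n
    f1_p = 2 * precision_p * recall_p / (precision_p + recall_p)
    f1_n = 2 * precision_n * recall_n / (precision_n + recall_n)

    def weighted(pos_value, neg_value):
        # Weight each per-class score by that class's share of the data.
        return (pos_value * p + neg_value * n) / (p + n)

    return {
        "precision_weighted": weighted(precision_p, precision_n),
        "recall_weighted": weighted(recall_p, recall_n),
        "f1_weighted": weighted(f1_p, f1_n),
        "fnr": fn / p,
        "tnr": tn / n,
    }


# Hypothetical counts, chosen only to exercise the function.
metrics = evaluate(tp=95, fp=4, tn=16, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that the weighted-average recall simplifies to (TP + TN)/(P + N), which is the overall accuracy. On a dataset where about 90% of messages are normal, this value is dominated by the majority class, which is why FNR and TNR are reported separately.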


Model | Weighted precision | Weighted recall | Weighted F1 | FNR | TNR
Naive Bayes | 0.9764 | 0.9761 | 0.9748 | 0.0010 | 0.7700
Logistic regression | 0.9886 | 0.9887 | 0.9887 | 0.0061 | 0.9414
Random forest | 0.9809 | 0.9808 | 0.9800 | 0.0012 | 0.8181
SVM | 0.9925 | 0.9924 | 0.9924 | 0.0052 | 0.9713
LSTM | 0.9963 | 0.9963 | 0.9963 | 0.0015 | 0.9771
BiLSTM | 0.9964 | 0.9964 | 0.9964 | 0.0009 | 0.9720
BERT | 0.9991 | 0.9991 | 0.9991 | 0.0002 | 0.9926
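
To put the rates in the table into absolute terms, they can be multiplied by the test-set class sizes from Section 1. The sketch below does this for the reported FNR and TNR values. The results are only approximations, because the published rates are rounded to four decimal places, and the helper name is hypothetical.

```python
# Test-set class sizes from the dataset table in Section 1.
TEST_NORMAL = 203_805
TEST_SPAM = 22_648

# (FNR, TNR) pairs copied from the results table.
reported = {
    "Naive Bayes": (0.0010, 0.7700),
    "Logistic regression": (0.0061, 0.9414),
    "Random forest": (0.0012, 0.8181),
    "SVM": (0.0052, 0.9713),
    "LSTM": (0.0015, 0.9771),
    "BiLSTM": (0.0009, 0.9720),
    "BERT": (0.0002, 0.9926),
}


def approx_errors(fnr, tnr):
    """Estimate (normal messages flagged as spam, spam messages missed)."""
    flagged_normal = round(fnr * TEST_NORMAL)
    missed_spam = round((1 - tnr) * TEST_SPAM)
    return flagged_normal, missed_spam


for model, (fnr, tnr) in reported.items():
    flagged, missed = approx_errors(fnr, tnr)
    print(f"{model}: ~{flagged} normal flagged as spam, ~{missed} spam missed")
```

Read this way, the small differences in weighted F1 correspond to large differences in missed spam: the Naive Bayes TNR of 0.7700 implies several thousand spam messages getting through, against well under two hundred for BERT.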





