Getting Started: Text Classification with sklearn's SVM

I am learning sklearn; a lab project needs some text classification functionality.
sklearn provides ready-made tools for many machine learning tasks, including classifiers. I won't introduce sklearn itself here; there is the official site, there are blogs, and I am still learning it myself.

I started from this article:
https://segmentfault.com/a/1190000002472791
It uses naive Bayes, with HashingVectorizer for text vectorization.
After implementing it, the results were not good enough, so on that basis I switched to TfidfVectorizer and CountVectorizer. TfidfVectorizer worked better, reaching around 50%, but that is still not enough for the experiment.
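As a rough sketch of how these three vectorizers differ (toy English documents here so the snippet stays self-contained; none of this data comes from the project):

```python
# CountVectorizer: raw term counts; TfidfVectorizer: counts reweighted by
# inverse document frequency, damping terms that appear in many documents;
# HashingVectorizer: no stored vocabulary, features are hashed into buckets.
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer)

docs = ["good movie", "bad movie", "good good plot"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)
hashed = HashingVectorizer(n_features=8).fit_transform(docs)

print(counts.shape)  # (3, 4): 3 documents, 4 vocabulary terms
print(hashed.shape)  # (3, 8): fixed width regardless of vocabulary size
```

Because HashingVectorizer keeps no vocabulary, it is memory-cheap, but distinct tokens can collide in the same bucket, which may explain its weaker results on this task.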

Following that article, I wrote an SVM-based classifier and changed the data-processing part: the whole dataset is randomly split into a training set and a test set at a ratio of about 0.65 to compare results.
The data is read from a txt file in the following format (short Chinese comments, one per line, with the class label after the final colon):

男默女泪啊:0
自杀者永世不得为人乃铁律,!不珍惜生命:0
发达国家都能结婚了,中国人的思维还在百年前。差劲啊:0
爱不是这么样表达的,不一定需要拥有,社会这样我们改变不了什么,但是,非要死吗:0
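A side note on this format: the parsing code later strips the label with `i[:-2]`, which assumes a single-character label right after the final colon. A more defensive sketch (the helper name `parse_line` is mine, not from the original) splits on the last colon, so colons inside the text cannot break the parse:

```python
def parse_line(line):
    # rpartition splits on the LAST ':' — text before it, integer label after it
    text, _, label = line.strip().rpartition(':')
    return text, int(label)

print(parse_line('男默女泪啊:0'))  # ('男默女泪啊', 0)
```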

The data itself suffers from class imbalance, and the texts are short, so some vectorization tools perform poorly on it.
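One common mitigation for class imbalance, not used in the code below but worth trying, is SVC's `class_weight='balanced'` option, which scales the penalty C for each class inversely to its frequency. A minimal sketch on synthetic one-dimensional data:

```python
import numpy as np
from sklearn import svm

# tiny imbalanced toy set: class 0 is the minority
X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.2], [1.3]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# class_weight='balanced' weights C by n_samples / (n_classes * class_count),
# so mistakes on the rare class cost more during training
clf = svm.SVC(C=10.0, kernel='rbf', gamma='auto', class_weight='balanced')
clf.fit(X, y)
print(clf.predict([[0.05], [1.05]]))
```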

Code:

# -*- coding: utf-8 -*-
from sklearn import datasets
from sklearn import svm
import random
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
import numpy

# the file has been reformatted so that each line holds one sample
def inputdata(filename):
    with open(filename, 'r') as f:
        linelist = f.readlines()
    return linelist

def splitset(trainset,testset):
    train_words = []
    train_tags = []
    test_words = []
    test_tags = []
    for i in trainset:
        i = i.strip()
        # the text is everything before the final ':';
        # the label is the single character after it
        train_words.append(i[:-2])
        train_tags.append(int(i[-1]))

    for i in testset:
        i = i.strip()
        test_words.append(i[:-2])
        test_tags.append(int(i[-1]))

    return train_words,train_tags,test_words,test_tags

# preparation once the file has been read:
# jieba in full mode serves as the tokenizer for the vectorizers
comma_tokenizer = lambda x: jieba.cut(x, cut_all=True)

def tfvectorize(train_words, test_words):
    # note: stop_words='english' has essentially no effect on Chinese tokens
    v = TfidfVectorizer(tokenizer=comma_tokenizer, binary=False, decode_error='ignore', stop_words='english')
    train_data = v.fit_transform(train_words)
    test_data = v.transform(test_words)
    return train_data,test_data

# randomly split the dataset into training and test sets at the given ratio
def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)  # copy so that pop() does not mutate the original
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return trainSet, copy

# compute precision and recall
def evaluate(actual, pred):
    m_precision = metrics.precision_score(actual, pred,average='macro')
    m_recall = metrics.recall_score(actual,pred,average='macro')
    print 'precision:{0:.3f}'.format(m_precision)
    print 'recall:{0:0.3f}'.format(m_recall)

# build the svm classifier
def train_clf(train_data, train_tags):
    # all other SVC parameters are left at their defaults
    clf = svm.SVC(C=10.0, kernel='rbf', gamma='auto')
    clf.fit(train_data, numpy.asarray(train_tags))

    return clf

def covectorize(train_words, test_words):
    v = CountVectorizer(tokenizer=comma_tokenizer, binary=False, decode_error='ignore', stop_words='english')
    train_data = v.fit_transform(train_words)
    test_data = v.transform(test_words)
    return train_data,test_data

if __name__ == '__main__':
    linelist = inputdata('data/newdata.txt')

    # split into a training list and a test list
    trainset, testset = splitDataset(linelist, 0.65)
    print 'train number:', len(trainset)
    print 'test number:', len(testset)

    train_words, train_tags, test_words, test_tags = splitset(trainset, testset)

    # train_data, test_data = tfvectorize(train_words, test_words)
    train_data, test_data = covectorize(train_words, test_words)

    clf = train_clf(train_data,train_tags)

    re = clf.predict(test_data)
    evaluate(numpy.asarray(test_tags), re)

With the SVM and TfidfVectorizer, the accuracy only reaches about 85% at around C=100.0 (the penalty parameter), while with CountVectorizer it already exceeds 80% at around C=10.0.
This shows that the choice of feature extraction method, and of model parameters, has a very large impact on prediction results.
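Rather than trying C values by hand, the search can be automated with GridSearchCV over a Pipeline, so that the vectorizer and the classifier are tuned together with cross-validation. The grid and toy documents below are purely illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm

# vectorizer + classifier chained, so GridSearchCV refits both per fold
pipe = Pipeline([
    ('vec', TfidfVectorizer()),
    ('clf', svm.SVC(kernel='rbf', gamma='auto')),
])

# cross-validate over the penalty parameter C
params = {'clf__C': [1.0, 10.0, 100.0]}

docs = ["good movie", "bad movie", "great plot", "awful plot",
        "good film", "bad film", "great acting", "awful acting"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

search = GridSearchCV(pipe, params, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

The same grid could include the tokenizer or the choice of CountVectorizer vs TfidfVectorizer as additional pipeline parameters.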

Here is a blog post explaining the parameters of SVM:
http://blog.csdn.net/xiaodongxiexie/article/details/70667101

I am still learning this part.
