I am currently learning sklearn; a lab project requires implementing some text classification functionality.
sklearn provides many ready-made tools for machine learning, including classifiers. I will not introduce sklearn itself here; there is the official site and plenty of blog posts, and I am still learning it myself.
I started by following this article:
https://segmentfault.com/a/1190000002472791
It uses Naive Bayes, with HashingVectorizer for the text vectorization.
After implementing it, the results were not good enough, so on that basis I switched to TfidfVectorizer and CountVectorizer. Of the two, TfidfVectorizer worked better, reaching around 50%, but that is still not sufficient for the experiment.
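For reference, the HashingVectorizer variant from that article would look roughly like the sketch below. This is my own reconstruction rather than the article's exact code; the jieba full-mode tokenizer and the n_features value are assumptions on my part (older sklearn versions also accept non_negative=True, which Naive Bayes needs).

# Sketch of a HashingVectorizer-based vectorization step (parameters are illustrative)
from sklearn.feature_extraction.text import HashingVectorizer
import jieba

comma_tokenizer = lambda x: jieba.cut(x, cut_all=True)

def hashvectorize(train_words, test_words):
    # HashingVectorizer keeps no vocabulary, so transform() is enough for both sets
    v = HashingVectorizer(tokenizer=comma_tokenizer, n_features=30000, decode_error='ignore')
    train_data = v.transform(train_words)
    test_data = v.transform(test_words)
    return train_data, test_data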
Following the same article, I then wrote an SVM-based classifier and changed the data-handling part: the whole dataset is split at random into a training set and a test set at a ratio of about 0.65 to compare the results.
The data is read from a txt file in the following format:
男默女泪啊:0
自杀者永世不得为人乃铁律,!不珍惜生命:0
发达国家都能结婚了,中国人的思维还在百年前。差劲啊:0
爱不是这么样表达的,不一定需要拥有,社会这样我们改变不了什么,但是,非要死吗:0
The data itself has a class-imbalance problem, and the texts are quite short, so some vectorization tools do not work well on it.
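One common way to soften the imbalance, which I have not applied in the code below, is to let the SVM weight each class inversely to its frequency. A minimal sketch, assuming the same sparse feature matrix and label list as in the code that follows (the C value here is just a placeholder, not a tuned value):

# Sketch: an SVC that reweights classes to counter label imbalance (C is a placeholder)
from sklearn import svm

def train_balanced_clf(train_data, train_tags):
    # class_weight='balanced' weights each class inversely to its frequency
    clf = svm.SVC(C=10.0, kernel='rbf', gamma='auto', class_weight='balanced')
    clf.fit(train_data, train_tags)
    return clf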
Code:
# -*- coding: utf-8 -*-
from sklearn import datasets
from sklearn import svm
import random
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
import numpy
# The data file has been reformatted so that each line is one sample
def inputdata(filename):
    f = open(filename, 'r')
    linelist = f.readlines()
    f.close()
    return linelist
def splitset(trainset, testset):
    train_words = []
    train_tags = []
    test_words = []
    test_tags = []
    # Each line looks like "text:label"; the label is the single digit after
    # the colon, so the text is everything but the last two characters
    for i in trainset:
        i = i.strip()
        train_words.append(i[:-2])
        train_tags.append(int(i[-1]))
    for i in testset:
        i = i.strip()
        test_words.append(i[:-2])
        test_tags.append(int(i[-1]))
    return train_words, train_tags, test_words, test_tags
# Preparation after the file is loaded: jieba full-mode tokenizer for Chinese text
comma_tokenizer = lambda x: jieba.cut(x, cut_all=True)

def tfvectorize(train_words, test_words):
    # note: the English stop-word list has little effect on Chinese tokens
    v = TfidfVectorizer(tokenizer=comma_tokenizer, binary=False, decode_error='ignore', stop_words='english')
    train_data = v.fit_transform(train_words)
    test_data = v.transform(test_words)
    return train_data, test_data
# Split the dataset into training and test sets at the given ratio
def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)  # work on a copy so the original list is untouched
    while len(trainSet) < trainSize:
        # move a randomly chosen sample from the copy into the training set
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return trainSet, copy
# Report precision and recall
def evaluate(actual, pred):
    m_precision = metrics.precision_score(actual, pred, average='macro')
    m_recall = metrics.recall_score(actual, pred, average='macro')
    print 'precision:{0:.3f}'.format(m_precision)
    print 'recall:{0:0.3f}'.format(m_recall)
# Create and train the SVM classifier
def train_clf(train_data, train_tags):
    clf = svm.SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0,
                  decision_function_shape=None, degree=3, gamma='auto',
                  kernel='rbf', max_iter=-1, probability=False,
                  random_state=None, shrinking=True, tol=0.001, verbose=False)
    clf.fit(train_data, numpy.asarray(train_tags))
    return clf
def covectorize(train_words, test_words):
    v = CountVectorizer(tokenizer=comma_tokenizer, binary=False, decode_error='ignore', stop_words='english')
    train_data = v.fit_transform(train_words)
    test_data = v.transform(test_words)
    return train_data, test_data
if __name__ == '__main__':
    linelist = inputdata('data/newdata.txt')
    # split the lines into a training list and a test list
    trainset, testset = splitDataset(linelist, 0.65)
    print 'train number:', len(trainset)
    print 'test number:', len(testset)
    train_words, train_tags, test_words, test_tags = splitset(trainset, testset)
    # train_data, test_data = tfvectorize(train_words, test_words)
    train_data, test_data = covectorize(train_words, test_words)
    clf = train_clf(train_data, train_tags)
    re = clf.predict(test_data)
    evaluate(numpy.asarray(test_tags), re)
Under the SVM, with TfidfVectorizer the results only reach about 85% when the penalty parameter C is around 100.0, whereas CountVectorizer already gets past 80% at around C=10.0.
This shows that the choice of feature extraction method, and the parameters you set on the model, have a very large influence on the prediction results.
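Rather than tuning C by hand, a cross-validated grid search can pick it automatically. A minimal sketch, reusing train_data and train_tags from the script above; the candidate C values are just examples, and GridSearchCV lives in sklearn.model_selection from sklearn 0.18 onwards:

# Sketch: 5-fold cross-validated grid search over the SVM penalty parameter C
from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [1.0, 10.0, 100.0, 1000.0]}
search = GridSearchCV(svm.SVC(kernel='rbf', gamma='auto'), param_grid, cv=5)
search.fit(train_data, train_tags)  # train_data/train_tags from the script above
print 'best C:', search.best_params_['C']
print 'best cv score:', search.best_score_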
Here is a blog post that explains the parameters of SVM:
http://blog.csdn.net/xiaodongxiexie/article/details/70667101
I am still studying this part.