Classifying the 20 Newsgroups Dataset with a Keras LSTM

1. Introduction to the 20 Newsgroups Dataset

The 20 Newsgroups dataset is one of the international standard benchmark datasets for research in text classification, text mining, and information retrieval. It contains roughly 20,000 newsgroup documents, partitioned nearly evenly across 20 newsgroups on different topics. Some of the newsgroups cover closely related topics (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are completely unrelated (e.g. misc.forsale / soc.religion.christian). The 20 newsgroups are:

comp.graphics

comp.os.ms-windows.misc

comp.sys.ibm.pc.hardware

comp.sys.mac.hardware

comp.windows.x

rec.autos

rec.motorcycles

rec.sport.baseball

rec.sport.hockey

sci.crypt

sci.electronics

sci.med

sci.space

misc.forsale

talk.politics.misc

talk.politics.guns

talk.politics.mideast

talk.religion.misc

alt.atheism

soc.religion.christian

The 20 Newsgroups dataset comes in three versions. The first, 19997, is the original, unmodified collection. The second, bydate, is sorted by date and split into a training set (60%) and a test set (40%); duplicate documents and newsgroup-identifying headers (Newsgroups, Path, Followup-To, Date) have been removed. The third, 18828, has duplicates removed and keeps only the From and Subject headers.

  • 20news-19997.tar.gz – the original 20 Newsgroups dataset
  • 20news-bydate.tar.gz – sorted by date; duplicates and newsgroup-identifying headers removed (18846 documents)
  • 20news-18828.tar.gz – duplicates removed; only the From and Subject headers kept (18828 documents)

In sklearn, the dataset can be loaded in two ways. The first, sklearn.datasets.fetch_20newsgroups, returns the raw text documents, ready to be passed to a text feature extractor with custom parameters (such as sklearn.feature_extraction.text.CountVectorizer). The second, sklearn.datasets.fetch_20newsgroups_vectorized, returns already-extracted features, so no separate feature extractor is needed.
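For reference, a minimal sketch of both loaders (this tutorial itself reads the raw files from disk instead; the subset argument shown is illustrative):

from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized

# Raw text documents, to be paired with a feature extractor of your choice
raw = fetch_20newsgroups(subset='train')
print(len(raw.data), raw.target_names[:3])

# Already-extracted features as a sparse matrix; no vectorizer needed
vectorized = fetch_20newsgroups_vectorized(subset='train')
print(vectorized.data.shape)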

2. Loading Pre-trained Word Vectors (GloVe, 100-dimensional)

import os
import sys

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

BASE_DIR = './data'
GLOVE_DIR = BASE_DIR + '/glove.6B/'
TEXT_DATA_DIR = BASE_DIR + '/20_newsgroup/'
MAX_SEQUENCE_LENGTH = 1000  # pad/truncate every document to 1000 tokens
MAX_NB_WORDS = 20000        # cap the vocabulary at the 20,000 most frequent words
EMBEDDING_DIM = 100         # dimensionality of the GloVe vectors
VALIDATION_SPLIT = 0.2      # fraction of the data held out for validation
batch_size = 32

print('Indexing word vectors.')
embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8')
for line in f:
    # Each line holds a word followed by its 100 vector components
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
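As a quick sanity check, any word in the file can now be looked up in the index (the word chosen here is arbitrary):

# A hit returns the 100-dimensional vector; a miss returns None
vec = embeddings_index.get('computer')
if vec is not None:
    print(vec.shape)  # expected: (100,)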

3. Loading the Dataset and Labels (each document is labeled with an integer according to the folder it belongs to)

print('Processing text dataset')

texts = []          # list of text samples
labels_index = {}   # maps each label name to a numeric id
labels = []         # list of label ids, one per text sample

for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id  # assign one ID per newsgroup folder
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():  # document files are named with digits
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                texts.append(f.read())
                f.close()
                labels.append(label_id)
print('Found %s texts.' % len(texts))
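At this point labels_index maps each newsgroup folder name to an integer ID, which is easy to verify:

# e.g. {'alt.atheism': 0, 'comp.graphics': 1, ...}; 20 entries expected
print(len(labels_index), labels_index)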

4. Vectorizing the Text Data

# num_words caps the vocabulary at the MAX_NB_WORDS most frequent tokens
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
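Each document is now a list of integer word indices, with small indices corresponding to frequent words. A quick look at one sequence (the slice length is arbitrary) shows the representation the network will consume, before padding:

# Index 0 is reserved by Keras for padding and never assigned to a word
print(sequences[0][:10])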

5. Constructing the Training and Validation Sets (shuffling and splitting the data)

# Pad/truncate every sequence to the same length
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# One-hot encode the integer labels
labels = to_categorical(np.asarray(labels))

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Shuffle, then hold out the last VALIDATION_SPLIT fraction for validation
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

print('Preparing embedding matrix.')
print(nb_validation_samples)
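Note that this is a plain random split rather than the dataset's official bydate train/test split. If preserving class proportions matters, a stratified split via scikit-learn is one alternative (a sketch, not part of the original code):

from sklearn.model_selection import train_test_split

# Stratify on the integer class recovered from the one-hot labels
x_tr, x_va, y_tr, y_va = train_test_split(
    data, labels,
    test_size=VALIDATION_SPLIT,
    stratify=labels.argmax(axis=1),
    random_state=42)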

6. Building and Evaluating the LSTM Model

# Build the embedding matrix: row i holds the GloVe vector of the word
# whose tokenizer index is i; words missing from GloVe stay all-zero
nb_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((nb_words + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)
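An optional check is how much of the vocabulary the GloVe file actually covers; rows left all-zero act as a zero embedding for out-of-vocabulary words:

# Rows whose absolute sum is zero never matched a GloVe entry
covered = int(np.count_nonzero(np.abs(embedding_matrix).sum(axis=1)))
print('GloVe coverage: %d / %d words' % (covered, nb_words))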

embedding_layer = Embedding(
    nb_words + 1,
    EMBEDDING_DIM,
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False  # the weights are pre-trained GloVe vectors, so we freeze them
)
print('Build model...')

model = Sequential()
model.add(embedding_layer)
# LSTM with 100 output units; dropout on the inputs and the recurrent state
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# 20-way softmax over the newsgroup labels
model.add(Dense(len(labels_index), activation='softmax'))

model.compile(
    loss='categorical_crossentropy',  # multi-class targets are one-hot encoded
    optimizer='adam',
    metrics=['accuracy']
)
print('Train...')
model.fit(x_train, y_train, batch_size=batch_size, epochs=5,
          validation_data=(x_val, y_val))

score, acc = model.evaluate(x_val, y_val, batch_size=batch_size)

print('Test score:', score)
print('Test accuracy:', acc)

7. Training Results (partial output)

3936/3999 [============================>.] - ETA: 0s
3968/3999 [============================>.] - ETA: 0s
3999/3999 [==============================] - 52s 13ms/step
Test score: 0.18472591743048325
Test accuracy: 0.9499999885411226
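Finally, to classify an unseen document, reuse the same tokenizer and padding pipeline as in training. A minimal inference sketch (the sample sentence is invented for illustration):

# Run a raw string through exactly the same preprocessing as the training data
sample = ['The new graphics card renders 3D scenes very quickly.']
seq = pad_sequences(tokenizer.texts_to_sequences(sample),
                    maxlen=MAX_SEQUENCE_LENGTH)
probs = model.predict(seq)

# Map the most likely class index back to its newsgroup name
id_to_label = {v: k for k, v in labels_index.items()}
print(id_to_label[int(probs.argmax(axis=-1)[0])])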
