The 20 Newsgroups dataset is one of the standard international benchmark datasets for research in text classification, text mining, and information retrieval. It contains roughly 20,000 newsgroup documents, partitioned nearly evenly across 20 newsgroups on different topics. Some of the newsgroups are very closely related (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are completely unrelated (e.g. misc.forsale / soc.religion.christian).
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
sci.crypt, sci.electronics, sci.med, sci.space
misc.forsale
talk.politics.misc, talk.politics.guns, talk.politics.mideast
talk.religion.misc, alt.atheism, soc.religion.christian
The 20 Newsgroups dataset comes in three versions. The first, 19997, is the original, unmodified collection. The second, bydate, is sorted by date and split into a training set (60%) and a test set (40%), with duplicate documents removed along with the newsgroup-identifying headers (Newsgroups, Path, Followup-To, Date). The third, 18828, has duplicates removed and keeps only the From and Subject headers.
In sklearn the dataset can be loaded in two ways. The first, sklearn.datasets.fetch_20newsgroups, returns the raw texts, which you then pass through a text feature extractor such as sklearn.feature_extraction.text.CountVectorizer with whatever parameters you like. The second, sklearn.datasets.fetch_20newsgroups_vectorized, returns features that have already been extracted, so no feature extractor is needed.
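As a quick, minimal sketch of the two loaders (variable names here are illustrative only; this is not used in the rest of the post):

from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized
from sklearn.feature_extraction.text import CountVectorizer

# Raw texts: you choose and configure the feature extractor yourself.
raw_train = fetch_20newsgroups(subset='train')        # bydate version, training split
vectorizer = CountVectorizer(max_features=20000)
X_counts = vectorizer.fit_transform(raw_train.data)   # sparse document-term matrix
print(X_counts.shape, len(raw_train.target_names))    # ..., 20 newsgroup classes

# Already vectorized: features are precomputed, no extractor needed.
vec_train = fetch_20newsgroups_vectorized(subset='train')
print(vec_train.data.shape)

The rest of this post instead reads the raw 20_newsgroup files from disk and builds the features with Keras.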
import os
import sys
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
BASE_DIR = './data'
GLOVE_DIR = BASE_DIR + '/glove.6B/'
TEXT_DATA_DIR = BASE_DIR + '/20_newsgroup/'
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
batch_size = 32
print('Indexing word vectors.')
embeddings_index = {}
f = open(os.path.join(GLOVE_DIR,'glove.6B.100d.txt'),encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.'%len(embeddings_index) )
print('Processing text dataset')
texts = []          # list of raw text strings, one per document
labels_index = {}   # maps each newsgroup (folder) name to a numeric label id
labels = []         # label id of each document in texts
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id  # assign one id per folder
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                texts.append(f.read())
                f.close()
                labels.append(label_id)
print('Found %s texts.'%len(texts))
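# Vectorize the texts: keep only the MAX_NB_WORDS most frequent words and pad/truncate every sequence to MAX_SEQUENCE_LENGTH.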
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.'%len(word_index))
data = pad_sequences(sequences,maxlen = MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:',data.shape)
print('Shape of label tensor:',labels.shape)
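# Shuffle the samples, then hold out the last VALIDATION_SPLIT fraction as a validation set.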
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT*data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
print('Preparing embedding matrix.')
print(nb_validation_samples)
nb_words = min(MAX_NB_WORDS,len(word_index))
embedding_matrix = np.zeros((nb_words + 1, EMBEDDING_DIM))  # row 0 is reserved for the padding index
for word, i in word_index.items():
    if i > MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words without a GloVe vector keep an all-zero row
        embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)
embedding_layer = Embedding(
    nb_words + 1,
    EMBEDDING_DIM,
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False  # the weights come from pre-trained GloVe vectors, so they are frozen rather than trained
)
print('Build model...')
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))  # output dimension: 100
model.add(Dense(len(labels_index), activation='softmax'))  # one unit per newsgroup
model.compile(
    loss='categorical_crossentropy',  # multi-class classification with one-hot labels
    optimizer='adam',
    metrics=['accuracy']
)
print('Train...')
model.fit(x_train, y_train, batch_size=batch_size, epochs=5, validation_data=(x_val, y_val))
score, acc = model.evaluate(x_val, y_val, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
3999/3999 [==============================] - 52s 13ms/step
Test score: 0.18472591743048325
Test accuracy: 0.9499999885411226