Reading notes: "SemEval-2018 Task 7: Effectively Combining Recurrent and Convolutional Neural Networks for Relation Classification and Extraction"

Paper link: https://www.paperweekly.site/papers/1833

SemEval-2018 Task 7 targets the extraction of semantic relations between entities in unstructured text, and comprises the following subtasks:

Subtask 1: relation classification (assigning a type of relation to an entity pair)

Subtask 2: relation extraction (detecting the existence of a relation between two entities and determining its type)

Subtask 1 is relation classification on clean data (Subtask 1.1, manually annotated entities) and on noisy data (Subtask 1.2, automatically annotated entities). The semantic relations fall into six classes: USAGE, RESULT, MODEL-FEATURE, PART-WHOLE, TOPIC, and COMPARE. The first five are asymmetric (the direction of the relation matters), while COMPARE is order-independent.

One of the current challenges in analyzing unstructured data is to extract valuable knowledge by detecting the relevant entities and relations between them.

Architecture:

The CNN branch starts from an embedding layer, followed by a convolutional layer with multiple filter widths, ReLU activation on the feature maps, a max-pooling layer applied over time, then a fully connected layer trained with dropout, and finally a softmax layer.
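The convolution-plus-pooling step above can be sketched in plain Python. This is a minimal, dependency-free illustration of 1-D convolutions with multiple filter widths over a sequence of word embeddings, ReLU, and max-over-time pooling; the shapes, filter counts, and random weights are illustrative assumptions, not the paper's actual settings.

```python
import random

random.seed(0)

def conv1d(embeddings, filt):
    """Slide one filter (width x dim) over the sequence; one ReLU score per window."""
    width = len(filt)
    dim = len(filt[0])
    out = []
    for start in range(len(embeddings) - width + 1):
        s = sum(filt[i][d] * embeddings[start + i][d]
                for i in range(width) for d in range(dim))
        out.append(max(0.0, s))          # ReLU
    return out

def cnn_features(embeddings, filter_widths=(2, 3, 4), n_filters=2, dim=4):
    """One pooled feature per filter: max over time of its ReLU feature map."""
    feats = []
    for width in filter_widths:
        for _ in range(n_filters):
            filt = [[random.uniform(-1, 1) for _ in range(dim)]
                    for _ in range(width)]
            feats.append(max(conv1d(embeddings, filt)))
    return feats

# Toy "sentence" of 6 words, each a 4-dimensional embedding.
sentence = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
features = cnn_features(sentence)
print(len(features))  # 3 widths * 2 filters = 6 pooled features
```

The pooled feature vector would then feed the fully connected layer and softmax described above.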

The RNN branch starts from an embedding layer, followed by a bidirectional LSTM, then a fully connected layer trained with dropout, and finally a softmax layer.
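The bidirectional idea can be sketched as follows: the sequence is read left-to-right and right-to-left, and the two final hidden states are concatenated before the fully connected softmax layer. To keep the sketch dependency-free, a plain 1-unit tanh recurrence stands in for the LSTM cell; the weights are illustrative.

```python
import math

def rnn_pass(embeddings, w_in=0.5, w_rec=0.3):
    """Run a 1-unit tanh recurrence over the sequence; return the final state."""
    h = 0.0
    for x in embeddings:
        h = math.tanh(w_in * sum(x) + w_rec * h)
    return h

def bidirectional_encode(embeddings):
    forward = rnn_pass(embeddings)
    backward = rnn_pass(list(reversed(embeddings)))
    return [forward, backward]        # concatenated representation

sentence = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
encoding = bidirectional_encode(sentence)
print(encoding)
```

A real LSTM cell adds input/forget/output gates on top of this recurrence, but the bidirectional wiring is the same.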

Overall, each model is trained independently several times, and the predictions of the two models are ensembled to obtain the final classification. Notably, stacked multi-layer LSTMs did not perform better than a single layer.
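The ensembling step can be sketched by averaging the class-probability vectors from several independently trained runs and taking the argmax. The exact combination rule is not spelled out in these notes, so probability averaging is shown here as one common choice; the probability vectors below are made up for illustration.

```python
def ensemble(prob_vectors):
    """Average per-class probabilities over runs; return the winning class index."""
    n_classes = len(prob_vectors[0])
    avg = [sum(p[c] for p in prob_vectors) / len(prob_vectors)
           for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

runs = [
    [0.6, 0.3, 0.1],   # CNN run
    [0.2, 0.5, 0.3],   # RNN run
    [0.5, 0.4, 0.1],   # CNN run, different seed
]
print(ensemble(runs))  # class 0 wins on the averaged probabilities
```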

1. Preprocessing

Cropping sentences

Since the most relevant portion of text for determining the relation type is generally the one contained between (and including) the entities, only that part of the sentence is analyzed and the surrounding words are discarded. For Subtask 2, the probability that two entities are related drops sharply as the distance between them grows, so only entity pairs whose distance stays below a maximum threshold are considered, which reduces false positives on long sentences.
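The cropping step can be sketched like this: keep only the token span between and including the two entity mentions, and drop candidate pairs whose token distance exceeds a maximum threshold. The token-level indexing and the threshold value of 10 are illustrative assumptions.

```python
def crop(tokens, e1_idx, e2_idx, max_dist=10):
    """Return the cropped token span, or None if the entities are too far apart."""
    lo, hi = min(e1_idx, e2_idx), max(e1_idx, e2_idx)
    if hi - lo > max_dist:
        return None                     # skip: likely a false positive
    return tokens[lo:hi + 1]

tokens = "we apply a neural model to the parsing task in this paper".split()
print(crop(tokens, 3, 7))   # ['neural', 'model', 'to', 'the', 'parsing']
```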

2. Experiments

Dataset augmentation: the training sets of the two subtasks are merged to train the models, and additional artificial data is generated:

We generated automatically-tagged artificial training samples for Subtask 1 by combining the entities that appeared in the test data with the text between entities and relation labels of those from the training set.
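This generation scheme can be sketched as the cross product of test-set entity pairs with the between-entity text (and its relation label) of training pairs. The toy records below are invented for illustration; the real pipeline operates on the task's annotated corpus.

```python
import itertools

train = [
    {"between": "is used for", "label": "USAGE"},
    {"between": "is a part of", "label": "PART-WHOLE"},
]
test_entity_pairs = [("LSTM", "parsing"), ("encoder", "model")]

def generate(test_pairs, train_records):
    """Yield (artificial sentence, label) for every pair x training pattern."""
    for (e1, e2), rec in itertools.product(test_pairs, train_records):
        yield f"{e1} {rec['between']} {e2}", rec["label"]

samples = list(generate(test_entity_pairs, train))
print(len(samples))  # 2 pairs x 2 training patterns = 4 artificial samples
```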

To judge the quality of the generated sentences, the KenLM Language Model Toolkit [1] is used, and only sentences deemed valid are kept.
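The filtering step amounts to scoring each generated sentence under a language model and keeping only those above a threshold. In the paper this is done with KenLM (whose Python binding exposes `kenlm.Model("model.arpa").score(sentence)`); since that needs a trained model file, a toy unigram scorer and an arbitrary threshold stand in here so the sketch runs on its own.

```python
import math

# Toy unigram probabilities standing in for a real KenLM model.
UNIGRAM = {"the": 0.07, "model": 0.01, "is": 0.05, "used": 0.01,
           "for": 0.04, "parsing": 0.001}

def score(sentence, floor=1e-6):
    """Length-normalized log10 probability under the toy unigram model."""
    words = sentence.lower().split()
    return sum(math.log10(UNIGRAM.get(w, floor)) for w in words) / len(words)

def keep_valid(sentences, threshold=-3.0):
    return [s for s in sentences if score(s) >= threshold]

generated = ["the model is used for parsing", "zxq vvb qqp"]
print(keep_valid(generated))  # the gibberish sentence is filtered out
```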

Parameter optimization:

The authors "ran a grid search over the parameter space for those parameters that were part of our automatic pipeline."
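A grid search simply evaluates every combination in the grid and keeps the best-scoring setting. The parameter names, grid values, and toy scoring function below are illustrative assumptions, not the paper's actual search space.

```python
import itertools

grid = {
    "dropout": [0.3, 0.5],
    "filters": [64, 128],
    "max_dist": [10, 15],
}

def evaluate(params):
    """Stand-in for training + validation; here just a deterministic toy score."""
    return -abs(params["dropout"] - 0.5) + params["filters"] / 1000.0

best = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=evaluate,
)
print(best)
```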

Upsampling

The target classes are imbalanced, with NONE accounting for a large share. To overcome this, the authors resorted to an upsampling scheme: they defined a fixed ratio of positive to negative examples to present to the networks, where all positive classes are combined when computing the ratio.
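The scheme can be sketched by cycling through the positive examples (all relation classes combined) until they reach the chosen ratio to the NONE class. The 1:2 positive-to-negative ratio and the toy data are illustrative; the paper's actual ratio is described as arbitrary and is not given in these notes.

```python
import itertools

def upsample(positives, negatives, pos_per_neg=0.5):
    """Repeat positives (cycling) until there are pos_per_neg * len(negatives) of them."""
    target = int(pos_per_neg * len(negatives))
    cycled = itertools.cycle(positives)
    return [next(cycled) for _ in range(max(target, len(positives)))]

positives = ["USAGE-ex1", "RESULT-ex1"]
negatives = ["NONE"] * 10
print(len(upsample(positives, negatives)))  # 5 positives for 10 negatives
```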


Summary: this paper is very detailed about model selection, data processing, and hyper-parameter tuning. I learned a lot from it, and it is a useful reference for practical work.






[1] Kenneth Heafield. KenLM: Faster and Smaller Language Model Queries. WMT 2011.
