Self-Improving Chatbots based on Reinforcement Learning

Elena Ricciardelli, Debmalya Biswas


Abstract. We present a Reinforcement Learning (RL) model for self-improving chatbots, specifically targeting FAQ-type chatbots. The model is not aimed at building a dialog system from scratch, but at leveraging data from user conversations to improve chatbot performance. At the core of our approach is a score model, which is trained to score chatbot utterance-response tuples based on user feedback. The scores predicted by this model are used as rewards for the RL agent. Policy learning takes place offline, thanks to a user simulator which is fed with utterances from the FAQ database. Policy learning is implemented using a Deep Q-Network (DQN) agent with epsilon-greedy exploration, which is tailored to effectively include fallback answers for out-of-scope questions. The potential of our approach is demonstrated on a small case extracted from an enterprise chatbot, which shows an increase in performance from an initial success rate of 50% to 75% within 20–30 training epochs.

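To make the training setup concrete, the following is a minimal sketch (not the authors' code) of the offline policy-learning loop described in the abstract: a user simulator samples utterances from the FAQ database, an epsilon-greedy DQN agent selects a response (or the fallback), and a score model trained on user feedback supplies the reward. The state encoder, the score model, the dimensions and the hyperparameters below are placeholder assumptions for illustration only.

```python
# Minimal sketch of the offline policy-learning loop (assumed setup, not the paper's code).
import random
import torch
import torch.nn as nn

STATE_DIM = 16        # hypothetical size of the encoded user utterance
N_RESPONSES = 5       # FAQ answers; index N_RESPONSES is the fallback action
EPSILON = 0.1         # epsilon-greedy exploration rate

q_net = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_RESPONSES + 1))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def simulate_user_utterance():
    """User simulator: sample an (encoded) utterance from the FAQ database."""
    return torch.randn(STATE_DIM)           # placeholder for a real utterance encoder

def score_model(state, action):
    """Placeholder for the score model trained on user feedback (reward in [0, 1])."""
    return random.random()

for epoch in range(30):                      # 20-30 epochs, as reported in the paper
    for _ in range(100):                     # one pass over simulated utterances
        state = simulate_user_utterance()
        q_values = q_net(state)
        if random.random() < EPSILON:        # epsilon-greedy exploration
            action = random.randrange(N_RESPONSES + 1)
        else:
            action = int(q_values.argmax())
        reward = score_model(state, action)
        # FAQ dialogs are answered in a single turn, so the target is the immediate reward
        loss = (q_values[action] - reward) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Since FAQ interactions are single-turn, the Q-target above contains no bootstrapped next-state value; this is a simplifying assumption consistent with the one-turn setting discussed later in the introduction.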

The published version of the paper is available in the proceedings of the 4th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), Montreal, 2019: http://rldm.org/papers/abstracts.pdf


1 Introduction

The majority of dialog agents in an enterprise setting are domain specific, consisting of a Natural Language Understanding (NLU) unit trained to recognize the user's goal in a supervised manner. However, collecting a good training set for a production system is a time-consuming and cumbersome process. Chatbots covering a wide range of intents often face poor performance due to intent overlap and confusion. Furthermore, it is difficult to autonomously retrain a chatbot taking into account user feedback from live usage or the testing phase. Self-improving chatbots are challenging to achieve, primarily because of the difficulty in choosing and prioritizing metrics for chatbot performance evaluation. Ideally, one wants a dialog agent that is capable of learning from the user's experience and improving autonomously.


In this work, we present a reinforcement learning approach for self-improving chatbots, specifically targeting FAQ-type chatbots. The core of such chatbots is an intent-recognition NLU, which is trained with hard-coded examples of question variations. When no intent is matched with a confidence level above 30%, the chatbot returns a fallback answer; otherwise, the NLU engine returns the matched response along with the corresponding confidence level.

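As an illustration of this threshold behaviour, here is a minimal sketch; the `nlu.predict` interface and the fallback text are hypothetical placeholders, not the actual NLU engine's API.

```python
# Sketch of the confidence-threshold fallback logic (assumed interface).
FALLBACK_ANSWER = "Sorry, I did not understand your question."
CONFIDENCE_THRESHOLD = 0.30

def answer(utterance, nlu):
    """Return (response, confidence), falling back when the NLU confidence is too low.

    `nlu.predict` is a hypothetical method returning the best-matching intent,
    its canned response, and the NLU confidence score.
    """
    intent, response, confidence = nlu.predict(utterance)
    if confidence < CONFIDENCE_THRESHOLD:
        return FALLBACK_ANSWER, confidence
    return response, confidence
```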

Several research papers [2, 3, 7, 8] have shown the effectiveness of an RL approach in developing dialog systems. Critical to this approach is the choice of a good reward model. A typical reward model is the implementation of a penalty term for each dialog turn. However, such rewards only apply to task-completion chatbots, where the purpose of the agent is to satisfy the user's request in the shortest time; they are not suitable for FAQ-type chatbots, where the chatbot is expected to provide a good answer in one turn. The
