[Paper Reading] Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation

Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation


Dawei Gao∗, Haibin Wang∗, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, Jingren Zhou (Alibaba Group)

ABSTRACT

Large language models (LLMs) have emerged as a new paradigm for the Text-to-SQL task. However, the absence of a systematic benchmark inhibits the design of effective, efficient, and economical LLM-based Text-to-SQL solutions. To address this challenge, in this paper we first conduct a systematic and extensive comparison of existing prompt engineering methods, covering question representation, example selection, and example organization, and with these experimental results we elaborate on their pros and cons. Based on these findings, we propose a new integrated solution, named DAIL-SQL, which refreshes the Spider leaderboard with 86.6% execution accuracy and sets a new bar.

To explore the potential of open-source LLMs, we investigate them in various scenarios, and further enhance their performance with supervised fine-tuning. Our explorations highlight the potential of open-source LLMs in Text-to-SQL, as well as the advantages and disadvantages of supervised fine-tuning. Additionally, towards an efficient and economical LLM-based Text-to-SQL solution, we emphasize token efficiency in prompt engineering and compare the prior studies under this metric. We hope that our work provides a deeper understanding of Text-to-SQL with LLMs, and inspires further investigations and broad applications.

1 INTRODUCTION

Text-to-SQL, a challenging task in both the natural language processing and database communities, maps natural language questions over a given relational database into SQL queries [7, 15]. Most previous works [14, 19, 20, 42, 50] focus on extracting question-to-SQL patterns and generalizing them by training an encoder-decoder model on a Text-to-SQL corpus. In recent years, large language models (LLMs) have emerged as a new paradigm for Text-to-SQL [22, 34, 41]. Notably, equipped with GPT-4 [26], Pourreza et al. [31] achieved first place on the Spider leaderboard [3] with 85.3% execution accuracy. Different from prior studies, the core problem in LLM-based Text-to-SQL solutions is how to prompt the LLM to generate correct SQL queries, namely prompt engineering. Such prompt engineering involves question representation [5, 10, 28, 31], example selection [11, 24, 25], and example organization [11].

Text-to-SQL prompt engineering needs a systematic study. Although prior studies have made remarkable progress, there is still no systematic study of prompt engineering for LLM-based Text-to-SQL solutions. Specifically, for question representation, most existing research textualizes structured knowledge as a schema, and further adds task instructions and foreign keys to form the prompt [16, 25]. Besides, some studies [5, 25] represent tables as several "CREATE TABLE" SQL statements, and prompt LLMs to answer the target question in comments. However, even with similar representations, the detailed task instructions can lead to significant performance gaps. For example, OpenAI's official Text-to-SQL demo [28] employs the pound sign "#" to differentiate prompt from response, yielding impressive performance [22]; if this sign is removed, the performance drops significantly. Therefore, there is a burgeoning demand for a systematic study of different representations that examines how they work well with LLMs. Regarding example selection, a common practice is to encode the most similar examples in the same representation as the target question [5, 22, 25]. Nan et al. [25] further underline the importance of diversity in example selection. As for organization, most prior studies represent examples with full information, including instruction, schema, question, and ground truth SQL query. Alternatively, Guo et al. [11] keep only the SQL queries of the selected examples to guide the LLM with fewer tokens. Together with different LLMs' preferences, the optimal selection and organization strategies for LLM-based Text-to-SQL solutions remain ambiguous. Therefore, a systematic study of prompt engineering, spanning different LLMs, question representations, example selection, and example organization, is highly anticipated.

The potential of open-source LLMs is underexplored. Very recently, open-source LLMs have been constantly expanding, showing remarkable advancement in programming, mathematical reasoning, and text generation tasks. However, previous Text-to-SQL research primarily focuses on OpenAI LLMs, leaving open-source LLMs unstudied. Besides, compared with OpenAI LLMs, open-source ones generally have limited capability in understanding context and generating coherent responses. Thus, a critical challenge for open-source LLMs is to further enhance their performance on Text-to-SQL, which can be achieved by supervised fine-tuning.

Prompt efficiency remains a challenging open question. In LLM-based Text-to-SQL, another critical challenge is efficiency, since most prior studies focus on OpenAI LLMs, and calling their APIs is expensive, time-consuming, and subject to rate limits [27], especially for in-context learning prompts with multiple examples. However, prior studies may not tackle this challenge well. Specifically, based on the observed inverted-U shape of execution accuracy with respect to prompt length, Chang et al. [5] conjecture that LLMs may have a sweet spot in terms of prompt length, but leave efficient prompt engineering as a challenging open question.

In light of the above challenges, we focus on providing a comprehensive, systematic, and fair benchmark for LLM-based Text-to-SQL. Specifically, our benchmark discusses both the effectiveness and the efficiency of various prompt engineering strategies, as well as the feasibility of open-source LLMs, detailed as follows. To provide a systematic and in-depth understanding of Text-to-SQL prompt engineering, we empirically evaluate several strategies from prior studies. First, we compare several typical question representations in the zero-shot scenario with different LLMs, and identify their pros and cons. After that, we investigate example selection and organization strategies in the few-shot scenario. For example selection, we compare different selection strategies and further verify the hypothesis that LLMs learn from the mappings between questions and SQL skeletons. Regarding example organization, we explore the options of displaying full information, solely SQL queries, or question-SQL pairs.

After that, we highlight the potential of open-source LLMs in both in-context learning and supervised fine-tuning. Specifically, we empirically study various open-source LLMs with different prompt engineering strategies, and observe the significant benefits of increasing the scale of LLMs and of good alignment [29]. To further enhance their performance, we fine-tune and evaluate open-source LLMs with various representations. With this comparison, we demonstrate that, similar to in-context learning, the representation strategy is also critical for supervised fine-tuning. These explorations underline the potential of open-source LLMs for an effective Text-to-SQL solution. Moreover, after fine-tuning we also observe a decrease in in-context learning capability, which requires further study. We believe these explorations will benefit practical Text-to-SQL applications.

Towards a more economical and efficient solution, we further evaluate different strategies in terms of token efficiency. Such evaluation aims at finding a cost-effective strategy, one that achieves considerable performance with fewer tokens. To fulfill this goal, we consider token efficiency throughout the whole process of prompt engineering, including the choices of question representation, example selection, and example organization.

Last but not least, our integrated solution, named DAIL-SQL, refreshes the Spider leaderboard with 86.6% execution accuracy and wins first place. Compared with previous solutions, DAIL-SQL encodes structured knowledge as SQL statements, selects examples based on their skeleton similarities, and removes cross-domain knowledge from examples for token efficiency. Before DAIL-SQL, the state-of-the-art performance on the Spider leaderboard was 85.3% [31]. Our solution therefore sets a new bar, and we hope our comprehensive study will inspire further work.

Contribution. Our main contributions and results are summarized as follows:

• We conduct a comprehensive comparison of prompt engineering methods, including five question representations, four example selection strategies, and three example organization strategies. With such a comparison, we methodically dissect each strategy to identify its merits and demerits. We further propose a new integrated solution, named DAIL-SQL, which refreshes the Spider leaderboard with 86.6% execution accuracy. Notably, this performance surpasses the previous state-of-the-art solution by 1.3%.
• To the best of our knowledge, we are the first to explore open-source LLMs for both in-context learning and supervised fine-tuning on the Text-to-SQL task. Specifically, we point out the importance of the representation in supervised fine-tuning, and the decrease of in-context learning capability after fine-tuning, both of which need further study.
• We also empirically compare different question representation and example organization strategies in terms of token efficiency, which provides practical guidance for real-world Text-to-SQL applications.

2 PRELIMINARY

Text-to-SQL aims at automatically translating natural language questions into SQL queries. It bridges the gap between non-expert users and database systems, greatly improves the efficiency of data processing, and contributes to a wide range of applications such as intelligent database services, automatic data analysis, and database question answering. However, Text-to-SQL is still a quite challenging task, due to the difficulty of fully understanding natural language questions and generating correct SQL queries [15, 33].

Extensive studies of Text-to-SQL have been conducted in both the database and natural language processing communities. Early studies treat Text-to-SQL as a sequence-to-sequence task, and focus on training machine learning models with an encoder-decoder architecture [4, 30, 32]. With the rapid advancement of deep learning, numerous techniques have been applied to the Text-to-SQL task, such as attention mechanisms [23], graph representations [14, 20, 32, 42, 46, 50], syntax parsing [12, 19, 35, 43], etc. Besides, to narrow the gap between Text-to-SQL research and its real-world deployment, numerous large-scale benchmark datasets have been released, including WikiSQL [52], Spider [47], KaggleDBQA [18], BIRD [21], etc. With these great efforts, the research communities have made impressive progress on Text-to-SQL.

Recently, large language models (LLMs), such as GPT-4 [26] from OpenAI and LLaMA [39] from Meta, have emerged as a milestone in natural language processing and machine learning. Different from general machine learning models, LLMs are pre-trained on massive text corpora and can perform various natural language tasks. Their basic operating principle is to iteratively produce the next word with the highest probability given the input prompt [48]. Therefore, to tackle the Text-to-SQL task with LLMs, the core is to find the optimal prompt, a process also known as prompt engineering [22, 25].

Specifically, according to the number of examples provided in the prompt, prompt engineering is classified into two scenarios: the zero-shot scenario and the few-shot scenario. In the zero-shot scenario, no example is provided, and the main challenge is to represent the natural language question effectively, including incorporating relevant information such as the corresponding database schema [5, 10, 22, 41]. In this paper, we refer to this as question representation. In the few-shot scenario, a limited number of examples are available; thus, besides question representation, we also need to study how to select the most helpful examples and how to organize them in the prompt appropriately. In natural language processing, this process in which LLMs learn from contextual examples is referred to as in-context learning [9]. In this paper, we discuss in-context learning in the scope of example selection and example organization. Although LLMs have been demonstrated to be effective in both zero-shot and few-shot scenarios in prior studies [5, 16, 22, 25, 37], their performance can be further enhanced by supervised fine-tuning, which tunes LLMs on additional Text-to-SQL instances to make them more suitable for the specific downstream task. However, despite the extensive research on prompt engineering for Text-to-SQL, there is a scarcity of studies exploring supervised fine-tuning of LLMs for Text-to-SQL [37], leaving this area an open question.

In summary, question representation, in-context learning, and supervised fine-tuning are three essential knobs in LLM-based Text-to-SQL. In this paper, we provide a systematic study and discussion of all three.

3 METHODOLOGY

As stated above, in this paper we focus on question representation, in-context learning, and supervised fine-tuning. In this section, we provide formal definitions of these three problems, survey their existing solutions systematically, and point out the potential issues in existing techniques. To address these issues, we propose a new Text-to-SQL prompt engineering method, named DAIL-SQL, which refreshes the best performance on the Spider leaderboard with 86.6% execution accuracy.

3.1 Question Representation

In this section, we first discuss question representations for Text-to-SQL under the zero-shot scenario. Considering a target question $q$ in natural language on a certain database $\mathcal{D}$, the target of question representation is to maximize the possibility of the LLM $\mathcal{M}$ generating the correct SQL $s^*$ as follows:

$$\max_{\sigma} \quad \mathbb{P}_{\mathcal{M}}\big(s^* \mid \sigma(q, \mathcal{D})\big),$$

where the function $\sigma(\cdot, \cdot)$ decides the representation of the target question $q$, together with the useful information from the schema of database $\mathcal{D}$. Besides, $\sigma(\cdot, \cdot)$ can also include information such as the instruction statement, rule implication, and foreign keys.

Following the above definition, we survey the different choices of $\sigma$ in the zero-shot scenario and choose the four most representative ones from the literature. In addition, we also include the question representation used in Alpaca [38], since it is popular in supervised fine-tuning. Table 1 summarizes these five representation methods and lists the details reported in their original papers.


Table 1: Question representations in existing works, as well as their reported execution accuracy (EX) in the zero-shot scenario. Instruction (INS), Rule Implication (RI), and Foreign Key (FK) are possible components of a prompt. INS is the task description, such as "Write a SQL to answer the question". RI is a guiding statement, such as "Complete sqlite SQL query only and with no explanation". FK is the foreign key information of the database.

Basic Prompt ($\mathrm{BS}_P$). Basic Prompt [31] is a simple representation shown in Listing 1. It consists of the table schemas, the natural language question prefixed by "Q: ", and a response prefix "A: SELECT" to prompt the LLM to generate SQL. In this paper we name it Basic Prompt due to its absence of instructions.

Listing 1: Basic Prompt.
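A minimal sketch of this format, with an illustrative schema and question rather than the paper's exact listing:

```
Table continents, columns = [ContId, Continent]
Table countries, columns = [CountryId, CountryName, Continent]

Q: How many continents are there?
A: SELECT
```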

Text Representation Prompt ($\mathrm{TR}_P$). As shown in Listing 2, Text Representation Prompt [25] represents both the schema and the question in natural language. Compared with Basic Prompt, it adds an instruction at the very beginning of the prompt to guide the LLM. In [25], it achieves 69.0% execution accuracy on Spider-dev in the zero-shot scenario.

Listing 2: Text Representation Prompt.
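A minimal sketch under the same illustrative schema; the exact instruction wording in the paper may differ:

```
Given the following database schema:
continents: ContId, Continent
countries: CountryId, CountryName, Continent

Answer the following: How many continents are there?
SELECT
```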

OpenAI Demonstration Prompt ($\mathrm{OD}_P$). The OpenAI Demonstration Prompt (Listing 3) was first used in OpenAI's official Text-to-SQL demo [28], and is evaluated in [22, 31]. It consists of the instruction, the table schemas, and the question, where all information is commented with the pound sign "#". Compared with Text Representation Prompt, the instruction in OpenAI Demonstration Prompt is more specific, with a rule, "Complete sqlite SQL query only and with no explanation", which we further discuss in Sec. 4.3 along with experimental results.

Listing 3: OpenAI Demonstration Prompt.
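A minimal sketch, reconstructed from the description above (the demo's exact wording may differ):

```
### Complete sqlite SQL query only and with no explanation
### Sqlite SQL tables, with their properties:
#
# continents(ContId, Continent)
# countries(CountryId, CountryName, Continent)
#
### How many continents are there?
SELECT
```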

Code Representation Prompt ($\mathrm{CR}_P$). The Code Representation Prompt [5, 25] presents the Text-to-SQL task in SQL syntax. Specifically, as shown in Listing 4, it directly presents the "CREATE TABLE" SQL statements, and prompts the LLM with the natural language question in a comment. Compared with other representations, $\mathrm{CR}_P$ stands out due to its ability to provide the comprehensive information necessary for database creation, such as column types and primary/foreign keys. With such a representation, [25] correctly predicts about 75.6% of SQLs with CODE-DAVINCI-002.

Listing 4: Code Representation Prompt.
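A minimal sketch with the same illustrative schema, this time carrying column types and a foreign key:

```
/* Given the following database schema: */
CREATE TABLE continents (
    ContId INTEGER PRIMARY KEY,
    Continent TEXT
);
CREATE TABLE countries (
    CountryId INTEGER PRIMARY KEY,
    CountryName TEXT,
    Continent INTEGER,
    FOREIGN KEY (Continent) REFERENCES continents(ContId)
);

/* Answer the following: How many continents are there? */
SELECT
```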

Alpaca SFT Prompt ($\mathrm{AS}_P$). The Alpaca SFT Prompt is a prompt designed for supervised fine-tuning [38]. As shown in Listing 5, it prompts the LLM to follow the instruction and finish the task according to the input context, in Markdown format. We include it to examine its effectiveness and efficiency in both prompt engineering and supervised fine-tuning scenarios.

Listing 5: Alpaca SFT Prompt.
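A minimal sketch following the Alpaca instruction/input/response layout (contents illustrative):

```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Write a SQL to answer the question "How many continents are there?"

### Input:
Database schema:
continents(ContId, Continent)
countries(CountryId, CountryName, Continent)

### Response:
SELECT
```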

As shown in Table 1, different representations were experimented with different LLMs and integrated in different frameworks, making it difficult to compare them fairly and effectively. Additionally, the specific roles played by individual components, such as foreign key information and rule implication, remain unclear. Consequently, it is essential to conduct a systematic study to better understand question representations, and to investigate their advantages and disadvantages through a fair comparison.

3.2 In-Context Learning for Text-to-SQL

The above question representation methods enable LLMs to directly output the desired SQL by zero-shot learning. However, LLMs can perform better on Text-to-SQL through in-context learning, in which a few examples are provided in the input prompt. Therefore, in this subsection, we discuss the keys of in-context learning, namely example selection and example organization. We first give a formulation of in-context learning to ease the further discussion.

In Text-to-SQL, given a set of triples $Q = \{(q_i, s_i, \mathcal{D}_i)\}$, where $q_i$ and $s_i$ are a natural language question and its corresponding SQL query on database $\mathcal{D}_i$, the target of in-context learning for Text-to-SQL is to maximize the possibility of the LLM $\mathcal{M}$ generating the correct SQL $s^*$ for the target question $q$ and database $\mathcal{D}$ as follows:

$$\max_{Q', \sigma} \quad \mathbb{P}_{\mathcal{M}}\big(s^* \mid \sigma(q, \mathcal{D}, Q')\big),$$
$$\text{s.t.} \quad |Q'| = k \quad \text{and} \quad Q' \subset Q,$$

where the function $\sigma(\cdot, \cdot, \cdot)$ decides the representation of the target question $q$, together with the useful information from the schema of database $\mathcal{D}$ and the $k$ examples $Q'$ selected from $Q$. In this paper, we focus on cross-domain Text-to-SQL, which means the target database $\mathcal{D}$ is not included among the databases $\mathcal{D}_i$ mentioned in $Q$, i.e., $\mathcal{D} \notin \{\mathcal{D}_i \mid (q_i, s_i, \mathcal{D}_i) \in Q\}$.

In-context learning for Text-to-SQL involves selecting the most helpful examples $Q'$ and deciding how to organize the information of these selected examples in the prompt. Next, we discuss these two sub-tasks: example selection and example organization.

3.2.1 Example Selection. We summarize the various example selection strategies in prior studies as follows.

• Random. This strategy randomly samples $k$ examples from the available candidates. Previous works [11, 24, 25] have adopted it as a baseline for example selection.
• Question Similarity Selection ($\mathrm{QTS}_S$). $\mathrm{QTS}_S$ [24] chooses the $k$ examples with the most similar questions. Specifically, it embeds both the example questions in $Q$ and the target question $q$ with a pre-trained language model. Then it applies a pre-defined distance measure, such as the Euclidean distance or negative cosine similarity, to each example-target pair. Finally, the kNN algorithm is leveraged to select the $k$ examples from $Q$ that most closely match the target question $q$ (a sketch is given after this list).
• Masked Question Similarity Selection ($\mathrm{MQS}_S$). For cross-domain Text-to-SQL, $\mathrm{MQS}_S$ [11] eliminates the negative influence of domain-specific information by replacing table names, column names, and values in all questions with a mask token, and then computes the similarities of their embeddings with the kNN algorithm.
• Query Similarity Selection ($\mathrm{QRS}_S$). Instead of using the target question $q$, $\mathrm{QRS}_S$ [25] aims to select $k$ examples that are similar to the target SQL query $s^*$. Specifically, it employs a preliminary model to generate an SQL query $s'$ from the target question $q$ and database $\mathcal{D}$, where this generated $s'$ can be regarded as an approximation of $s^*$. It then encodes the queries of the examples into binary discrete syntax vectors according to their keywords. After that, it chooses $k$ examples by considering both their similarity to the approximated query $s'$ and the diversity among the selected examples.
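Below is a minimal sketch of the similarity-based selection above ($\mathrm{QTS}_S$, with optional masking for $\mathrm{MQS}_S$). It assumes a pre-trained sentence encoder from the sentence-transformers package; the mask step and the domain-word set are simplified stand-ins for the papers' exact procedures:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained embedding model

def mask_domain_words(question: str, domain_words: set) -> str:
    # MQS_S: replace table/column names and values with a mask token
    return " ".join("<mask>" if t.lower() in domain_words else t
                    for t in question.split())

def select_examples(target_q, candidates, k=5, domain_words=None):
    """candidates: list of (question, sql) pairs; returns the k nearest."""
    questions = [q for q, _ in candidates]
    if domain_words is not None:  # masked variant (MQS_S)
        target_q = mask_domain_words(target_q, domain_words)
        questions = [mask_domain_words(q, domain_words) for q in questions]
    emb = encoder.encode([target_q] + questions)
    dists = np.linalg.norm(emb[1:] - emb[0], axis=1)  # Euclidean distance
    return [candidates[i] for i in np.argsort(dists)[:k]]  # kNN
```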

The above strategies focus on selecting examples using only the target question or query. However, according to prior studies [9], in-context learning is essentially learning from analogy. In the case of Text-to-SQL, the objective is to generate queries that match the given questions; thus LLMs are supposed to learn the mapping from questions to SQL queries. Therefore, we point out that taking both questions and SQL queries into consideration during example selection may benefit the Text-to-SQL task. We will further discuss this in Sec. 3.3.

3.2.2 Example Organization. Example organization plays a pivotal role in determining what information of the above selected examples will be organized into the prompt. We summarize the existing strategies in prior studies into two categories, Full-Information Organization and SQL-Only Organization, as demonstrated in Listing 6 and Listing 7. In these listings, ${DATABASE_SCHEMA} represents the database schema, and ${TARGET_QUESTION} stands for the question representation in Listing 4.

Listings 6 and 7: Full-Information Organization and SQL-Only Organization.
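A minimal combined sketch of the two organizations (example contents illustrative; the trailing part is the target representation as in Listing 4):

```
/* Listing 6 (Full-Information Organization): each example repeats the full
   representation, schema plus question, followed by its SQL query. */
/* Given the following database schema: */
CREATE TABLE singer ( ... );
/* Answer the following: How many singers do we have? */
SELECT count(*) FROM singer

${DATABASE_SCHEMA}
${TARGET_QUESTION}

/* Listing 7 (SQL-Only Organization): only the SQL queries of the selected
   examples, with a prefix instruction. */
/* Some SQL examples are provided based on similar problems: */
SELECT count(*) FROM singer
SELECT name FROM teacher WHERE age > 32

${DATABASE_SCHEMA}
${TARGET_QUESTION}
```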

• Full-Information Organization ($\mathrm{FI}_O$). $\mathrm{FI}_O$ [5, 25] organizes examples in the same representation as the target question. As shown in Listing 6, examples are structured identically to the target question; the only difference is that, instead of ending with the "SELECT" token, the selected examples have their corresponding SQL queries after "SELECT".

• SQL-Only Organization ($\mathrm{SO}_O$). $\mathrm{SO}_O$ [11] includes only the SQL queries of the selected examples, with a prefix instruction in the prompt, as demonstrated in Listing 7. Such an organization aims at maximizing the number of examples within a limited token length. However, it removes the mapping information between questions and their corresponding SQL queries, and such information can be useful, as we will demonstrate later.

In summary, $\mathrm{FI}_O$ includes the full information of the examples, which ensures quality, while $\mathrm{SO}_O$ only keeps SQL queries to accommodate more examples, which prefers quantity. We wonder whether there exists a better trade-off between quality and quantity in example organization, which could further benefit the Text-to-SQL task.

3.3 DAIL-SQL

To address the aforementioned issues in example selection and organization, in this subsection we present a novel Text-to-SQL method named DAIL-SQL.

For example selection, inspired by $\mathrm{MQS}_S$ and $\mathrm{QRS}_S$, we propose DAIL Selection ($\mathrm{DAIL}_S$), which considers both questions and queries when selecting candidates. Specifically, DAIL Selection first masks domain-specific words in both the target question $q$ and the example questions $q_i$ in the candidate set $Q$. It then ranks the candidate examples based on the Euclidean distance between the embeddings of the masked $q$ and $q_i$. Simultaneously, it calculates the query similarity between the pre-predicted SQL query $s'$ and $s_i$ in $Q$. Finally, the selection criterion prioritizes the sorted candidates whose query similarity is greater than a predefined threshold $\tau$. In this way, the selected top-$k$ examples have good similarity in terms of both question and query.
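A minimal sketch of $\mathrm{DAIL}_S$. The question distance is assumed to come from masked-question embeddings as in the kNN sketch above, and the query similarity here is a crude Jaccard over SQL keywords, standing in for a skeleton-based measure:

```python
SQL_KEYWORDS = {"select", "from", "where", "group", "order", "by", "join",
                "having", "limit", "union", "intersect", "except", "distinct",
                "count", "avg", "min", "max", "sum", "and", "or", "not", "in"}

def query_similarity(a: str, b: str) -> float:
    # Jaccard similarity over SQL keyword sets (a simple skeleton proxy)
    ka = {t for t in a.lower().split() if t in SQL_KEYWORDS}
    kb = {t for t in b.lower().split() if t in SQL_KEYWORDS}
    return len(ka & kb) / max(len(ka | kb), 1)

def dail_select(target_q, pre_sql, candidates, question_distance, k=5, tau=0.9):
    """candidates: (question, sql) pairs; pre_sql: s' from a preliminary model.
    question_distance(q1, q2): Euclidean distance of masked-question embeddings."""
    # 1) rank all candidates by masked-question similarity (ascending distance)
    ranked = sorted(candidates, key=lambda ex: question_distance(target_q, ex[0]))
    # 2) keep the order, but prioritize candidates whose query is similar enough to s'
    preferred = [ex for ex in ranked if query_similarity(pre_sql, ex[1]) >= tau]
    rest = [ex for ex in ranked if ex not in preferred]
    # 3) the top-k examples are similar in both question and query
    return (preferred + rest)[:k]
```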

To preserve the mapping information between questions and SQL queries and also improve the token efficiency, we propose a new example organization strategy, DAIL Organization ($\mathrm{DAIL}_O$), to trade off quality and quantity. Specifically, $\mathrm{DAIL}_O$ presents both the questions $q_i$ and the corresponding SQL queries $s_i$, as illustrated in Listing 8. As a compromise between $\mathrm{FI}_O$ and $\mathrm{SO}_O$, $\mathrm{DAIL}_O$ preserves the question-SQL mapping, and reduces the token length of examples by removing the token-costly database schema.
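A minimal sketch of this organization, in the spirit of Listing 8 (example contents illustrative):

```
/* Some example questions and corresponding SQL queries are provided based on similar problems: */
/* How many singers do we have? */
SELECT count(*) FROM singer
/* How many teachers are older than 32? */
SELECT count(*) FROM teacher WHERE age > 32

${DATABASE_SCHEMA}
${TARGET_QUESTION}
```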

In DAIL-SQL, we adopt $\mathrm{CR}_P$ as the question representation. The reason is that, compared with other representations, $\mathrm{CR}_P$ contains the full information of the database, including primary and foreign keys, which may offer more useful information to LLMs, such as foreign keys for predicting "JOIN" clauses. Besides, being pre-trained on extensive coding corpora, LLMs can better understand the prompt in $\mathrm{CR}_P$ without much additional effort.

In summary, DAIL-SQL utilizes $\mathrm{CR}_P$ as the question representation, selects examples based on information from both questions and queries, and organizes them to keep the question-to-SQL mappings. With such a prompt design, LLMs work better on the Text-to-SQL task, and the proposed DAIL-SQL refreshes the Spider leaderboard with 86.2% execution accuracy.

Note that DAIL-SQL is a flexible LLM-based Text-to-SQL solution, which can easily be further extended and integrated with other components. For example, to improve the performance, we equip DAIL-SQL with self-consistency [44], which achieves 86.6% execution accuracy. Although self-consistency improves the execution accuracy by 0.4%, it is very time-consuming and costs many times more than the original DAIL-SQL. Therefore, in this paper we still focus on DAIL-SQL.

3.4 Supervised Fine-Tuning for Text-to-SQL

To enhance the performance of LLMs in the zero-shot scenario, the popular option for existing Text-to-SQL methods is in-context learning, as discussed in the above subsections. As an alternative yet promising option, supervised fine-tuning is less explored so far. Similar to supervised fine-tuning for various language tasks, we can adopt it in the field of Text-to-SQL to improve LLMs' performance on this downstream task. To further understand how supervised fine-tuning works for Text-to-SQL, we first provide a brief formulation as follows.

For Text-to-SQL, given a large language model $\mathcal{M}$ and a set of Text-to-SQL training data $\mathcal{T} = \{(q_i, s_i, \mathcal{D}_i)\}$, where $q_i$ and $s_i$ are a natural language question and its corresponding query on database $\mathcal{D}_i$, the objective of supervised fine-tuning is to minimize the following empirical loss:

$$\min_{\sigma, \mathcal{M}^*} \quad \sum_{i=1}^{|\mathcal{T}|} \mathcal{L}_{\mathcal{M}^*}\big(\sigma(q_i, \mathcal{D}_i),\ s_i\big),$$

where $\mathcal{L}$ is the loss function measuring the difference between the generated query and the ground truth query. Similar to question representation, $\sigma$ decides the question representation, together with the useful information from the schema of database $\mathcal{D}_i$. In this definition, supervised fine-tuning for Text-to-SQL covers two sub-tasks: fine-tuning the given LLM $\mathcal{M}$ with the supervised data $\mathcal{T}$ to obtain the optimal LLM $\mathcal{M}^*$, and searching for the optimal question representation $\sigma$. Since question representations have been discussed in Sec. 3.1, this section primarily focuses on the data preparation $\mathcal{T}$ and the fine-tuning.

For the general domain, each item in the supervised data $\mathcal{T} = \{(p_i, r_i)\}$ contains an input prompt $p_i$ and an expected response $r_i$ from the LLM. To ensure consistency with the inference process, we generate prompt-response pairs from a given Text-to-SQL dataset. Specifically, given a Text-to-SQL dataset $\{(q_i, s_i, \mathcal{D}_i)\}$, we fine-tune the LLM using the tuning data generated by taking the target question and the given database as the prompt, and treating the desired query as the response from the LLM, i.e., $\mathcal{T} = \{(p_i = \sigma(q_i, \mathcal{D}_i),\ r_i = s_i)\}$. Once the data is ready, we can use existing packages to fine-tune the given LLM $\mathcal{M}$ through either full fine-tuning [29] or parameter-efficient fine-tuning [13], depending on the available computational resources. After fine-tuning, the optimized LLM $\mathcal{M}^*$ can be used for inference, that is, asking it to generate queries from natural language questions. Note that we utilize the same question representation $\sigma$ in both the fine-tuning and inference processes. We conduct a series of experiments and discuss the great potential of supervised fine-tuning for Text-to-SQL.
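A minimal sketch of this data preparation, assuming $\sigma$ is the $\mathrm{CR}_P$ representation; the JSON-style field names are a common instruction-tuning layout, not a specific package's format:

```python
def cr_p(question: str, create_table_sqls: str) -> str:
    # CR_P-style prompt: CREATE TABLE statements plus the question in comments
    return (f"/* Given the following database schema: */\n"
            f"{create_table_sqls}\n"
            f"/* Answer the following: {question} */\n"
            f"SELECT")

def build_sft_pairs(dataset):
    """dataset: iterable of (question, sql, schema_ddl) triples.
    Returns prompt-response pairs T = {(p_i = sigma(q_i, D_i), r_i = s_i)}."""
    return [{"prompt": cr_p(q, ddl), "response": sql}
            for q, sql, ddl in dataset]
```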

4 EXPERIMENT

In this section, we first introduce our experimental settings. Then we conduct extensive comparisons with existing solutions in question representation, in-context learning, and supervised fine-tuning, respectively. After that, we further compare them in terms of token efficiency to inspire more efficient solutions.

4.1 Setting

Dataset. We evaluate Text-to-SQL methods on two well-recognized datasets, Spider [47] and Spider-Realistic [8]. Spider is a large-scale cross-domain Text-to-SQL dataset, which contains 8659 instances in the training split and 1034 instances in the development split, over 200 databases. Each instance consists of a natural language question on a specific database and its corresponding SQL query. In this paper, we use the development split Spider-dev for evaluation, as the test split is not released. Spider-Realistic [8] is a more challenging variant of Spider. It selects a subset of 508 examples from Spider-dev and manually revises the questions while keeping the SQL queries unchanged. For few-shot scenarios, we utilize the training split of Spider as the example candidates when testing on both Spider-dev and Spider-Realistic.

Metric. To make a fair comparison, we follow the prior study [51] and use exact-set-match accuracy (EM) and execution accuracy (EX). Exact-set-match accuracy measures the matched SQL keywords between the predicted SQL query and its corresponding ground truth. Execution accuracy, on the other hand, compares the execution output of the predicted SQL query with that of the ground truth SQL query on some database instances. The latter metric provides a more precise estimate of the model's performance, since there may be multiple valid SQL queries for a single given question. We use the released evaluation scripts at https://github.com/taoyds/test-suite-sql-eval.

LLM. To ensure a fair comparison, for all the methods we use the same maximal context length, that is, 4096 for OpenAI LLMs and 2048 for open-source LLMs. During evaluation, we leave 200 tokens for response generation. By default, we set the temperature argument to 0 to eliminate the influence of randomness. Regarding post-processing, we follow existing work to extract the first SQL query in the response and remove any additional output. For more implementation details, please refer to Appendix A.1.
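A minimal sketch of this post-processing step (the exact heuristics used in the paper's code are assumptions here):

```python
import re

def extract_first_sql(response: str, prompt_ends_with_select: bool = True) -> str:
    # Re-attach the "SELECT" prefix when the prompt itself ends with it, then
    # keep only the first query: cut at the first semicolon or blank line.
    text = "SELECT " + response if prompt_ends_with_select else response
    return re.split(r";|\n\s*\n", text, maxsplit=1)[0].strip()
```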

4.2 Question Representations

In this subsection, we evaluate the question representations presented in Sec. 3.1 under the zero-shot scenario, employing three OpenAI LLMs: GPT-4, GPT-3.5-TURBO, and TEXT-DAVINCI-003.

Fig. 1 presents the comparison of different question representations on Spider-dev. By comparing different representations, we can observe that $\mathrm{OD}_P$ fits all three LLMs and achieves 75.5% execution accuracy with GPT-3.5-TURBO. In contrast, $\mathrm{AS}_P$ exhibits poor performance with GPT-3.5-TURBO and TEXT-DAVINCI-003, necessitating a suitable LLM to work well with. Unexpectedly, GPT-4 exhibits a preference for the simple $\mathrm{BS}_P$ derived from DIN-SQL [31], indicating that a powerful LLM can mitigate the complexities associated with representation design. Besides, by comparing the average performance across the three LLMs, GPT-4 and GPT-3.5-TURBO are more capable in the zero-shot scenario. Due to the expensive cost of GPT-4, GPT-3.5-TURBO together with $\mathrm{OD}_P$ may be a better choice for the zero-shot scenario. For detailed numerical results, please refer to Appendix A.2.

Figure 1: Results of different question representations on Spider-dev under the zero-shot scenario.


To further investigate the different question representations, we conduct ablation studies to explore the effects of their individual components.

Foreign Key (FK). Foreign keys imply the relations among different relational tables, which might be helpful for the Text-to-SQL task. In our evaluation, only $\mathrm{CR}_P$ contains foreign key information. To examine its effect, we add foreign key information to the other representations and evaluate them in Fig. 2. We observe that foreign key information significantly improves the execution accuracy of LLMs, by 0.6% to 2.9%, except for the combinations of $\mathrm{TR}_P$ with GPT-4 (−0.2%) and $\mathrm{AS}_P$ with TEXT-DAVINCI-003 (−0.4%). Among all question representations, $\mathrm{OD}_P$ benefits the most from foreign key information. This observation demonstrates that foreign keys are helpful for the Text-to-SQL task. For detailed performance, please refer to Appendix A.3.


Figure 2: Ablation studies of foreign key information on Spider-dev. The green arrow indicates an increase, and the red arrow indicates a decrease.

Rule Implication (RI). Inspired by the outperformance of $\mathrm{OD}_P$, we explore the effect of rule implication. Specifically, $\mathrm{OD}_P$ instructs LLMs to generate SQL queries "with no explanation". To examine the effect of the "with no explanation" rule in question representation, we present an ablation study in Fig. 3. Specifically, we remove "with no explanation" from $\mathrm{OD}_P$, and add it to the other representations. From Fig. 3 we observe that adding this rule consistently boosts the performance of all LLMs in both exact-set-match and execution accuracy, with the most significant improvements exceeding 6% and 2%, respectively. For $\mathrm{OD}_P$, removing this rule incurs a drop of about 2.4% to 6.2% in exact-set-match accuracy, and of 1.0% to 1.8% in execution accuracy, indicating the importance of this rule implication. As a comparison, we also test a popular rule implication, "Let's think step by step" [17], which guides the LLM to generate a response with analysis. However, its performance is highly volatile on the Text-to-SQL task, as Appendix A.5 shows. Due to limited resources, we leave the exploration of other possible rule implications as an open question for future research.


Figure 3: Ablation studies of the "with no explanation" rule implication on Spider-dev. The green arrow indicates an increase, and the red arrow indicates a decrease.

In summary, both foreign keys and the "with no explanation" rule implication are beneficial for the Text-to-SQL task. In our evaluation, $\mathrm{OD}_P$ with foreign keys and GPT-3.5-TURBO is the most effective and economical combination, achieving 51.5% exact-set-match accuracy and 78.4% execution accuracy.

4.3 In-Context Learning for Text-to-SQL

In the few-shot scenario, we examine different example selection and organization strategies with GPT-4, GPT-3.5-TURBO, and TEXT-DAVINCI-003. To ensure a fair comparison, we adopt $\mathrm{CR}_P$ as the question representation for all the experiments in this subsection, due to its superior performance in the one-shot preliminary experiment in Appendix B.1.

4.3.1 Example Selection. To verify the importance of both questions and queries for example selection, we calculate the Jaccard similarities of questions and queries between the chosen examples and the target instance, and report the averaged numbers under the columns question similarity and query similarity in Table 2. Specifically, we remove database-specific information from the questions [42] and queries [19], and calculate the Jaccard similarities of the remaining tokens. Besides, we introduce Upper Limit for reference, which utilizes the ground truth query $s^*$ rather than the query generated by the preliminary predictor. To some extent, Upper Limit indicates the upper bound of performance for similarity-based selection methods.


Table 2: Results of different selection strategies on Spider-dev with few-shot evaluation.

Table 2 shows the comparison of different example selection strategies in the 1-, 3- and 5-shot scenarios on Spider-dev. By comparing the different selection strategies, it is demonstrated that $\mathrm{DAIL}_S$ generally outperforms the other strategies. In the 5-shot scenario, equipped with GPT-4, DAIL-SQL achieves 82.4% execution accuracy. Besides, in Table 2 we observe that higher question and query similarity mostly corresponds to higher execution accuracy, indicating the importance of considering both question and query similarity. Note that $\mathrm{DAIL}_S$'s execution accuracy is still lower than Upper Limit. This discrepancy can be attributed to the lower query similarity, indicating the gap between the ground truth query and the one generated by the preliminary model.

4.3.2 Example Organization. To compare different example organization strategies, we evaluate Full-Information Organization, SQL-Only Organization, and DAIL Organization in the few-shot scenario on both Spider-dev and Spider-Realistic. Fig. 4 shows the comparison results; refer to Appendix B.2 for detailed numerical results. From Fig. 4(a) and Fig. 4(d), we can observe that GPT-4 benefits steadily from contextual examples on both Spider-dev and Spider-Realistic. With DAIL Organization, its execution accuracy increases from 72.3% to 83.5% on Spider-dev and from 66.5% to 76.0% on Spider-Realistic. For GPT-3.5-TURBO and TEXT-DAVINCI-003, adding examples may incur a drop in execution accuracy due to their limited in-context learning capability. By comparing different organization strategies, we observe that GPT-4 shows a preference for DAIL Organization on both Spider-dev and Spider-Realistic, suggesting that it can effectively learn the mapping from question-SQL pairs. For GPT-3.5-TURBO (Fig. 4(b) and Fig. 4(e)), compared with its zero-shot performance in Fig. 1, its enhancement from in-context learning is the smallest among the three LLMs, due to its weakness in in-context learning. For TEXT-DAVINCI-003, Full-Information Organization is far beyond the other two strategies, especially with an increasing number of examples, as depicted in Fig. 4(c) and Fig. 4(f). By comparing the different LLMs, we infer that LLMs with greater in-context learning capability, like GPT-4, benefit from DAIL Organization the most, while weaker LLMs require more information to learn from the examples. Nevertheless, we emphasize that DAIL Organization can be a good choice for achieving higher performance, and the best execution accuracy in our evaluation is achieved by DAIL Organization with GPT-4.


Figure 4: Results of few-shot evaluation with different example organizations.

In summary, for example selection, our findings emphasize the importance of the mapping from questions to SQL queries. Considering both question and query similarity simultaneously, $\mathrm{DAIL}_S$ outperforms the other selection strategies in our evaluation. For example organization, we show the effectiveness of $\mathrm{DAIL}_O$, and point out its demand for potent LLMs. Finally, in our evaluation, we observe that our approach DAIL-SQL, equipped with GPT-4, achieves the highest performance, with an execution accuracy of 83.5% on Spider-dev and 76.0% on Spider-Realistic.

4.4 Supervised Fine-Tuning for Text-to-SQL

       In this section, we investigate supervised fine-tuning for Text-to-SQL. Due to the unaffordable cost of fine-tuning OpenAI LLMs, we focus on open-source LLMs. Given that very few existing works adopt open-source LLMs and their performance remains largely unknown, we first undertake a thorough evaluation of open-source LLMs, employing various question representation, example selection and example organization strategies. After that, we fine-tune the open-source LLMs on Text-to-SQL and observe their improvements in both zero-shot and few-shot scenarios.

4.4.1 Open-source LLM. To investigate the potential of open-source LLMs, we choose LLaMA [39] and its aligned variants at varying scales, detailed as follows. Note that an aligned variant means the LLM is aligned to be more helpful, harmless and honest [2], and the suffix “-7B” means the LLM has 7 billion parameters, and likewise for “-13B” and “-33B”.

  • LLaMA-7B/13B/33B [39] is a collection of widely recognized open-source LLMs pre-trained on massive corpora by Meta.
  • Alpaca-7B [38] is an aligned version of LLaMA-7B, fine-tuned with 52K instruction-following examples generated by TEXT-DAVINCI-003.
  • GPT4ALL-7B [1] is another aligned version of LLaMA-7B, fine-tuned with about 800K examples designed for a helpful, harmless and honest AI assistant.
  • LLaMA-2-CHAT-7B/13B [40] are the up-to-date versions of LLaMA. They are both pre-trained and aligned, and outperform the previous version on most benchmarks.
  • Vicuna-7/13/33B [6, 49] is a collection of open-source chatbots aligned from LLaMA on user-shared conversations. Vicuna-13B [6] is reported to perform similarly to OpenAI ChatGPT and Google Bard, and to outperform LLaMA and Alpaca in most scenarios.

4.4.2 Zero-shot Scenario with Open-source LLM. Table 3 shows their zero-shot performance on Spider-dev with different question representations. Due to limited space, please refer to Appendix C.1 for the performance on Spider-Realistic. Next, we provide several analyses from the aspects of question representation, model scale and alignment.


Table 3: Zero-shot evaluation results on Spider-dev with different open-source LLMs. The best performances of pre-trained and aligned LLMs are in bold.

       Effect of Question Representation. We observe that the best performance, 43.7% execution accuracy on Spider-dev, is achieved by CR_P (Code Representation Prompt). A possible reason is that the full database knowledge in CR_P compensates for the limited capability of open-source LLMs.
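       For reference, the sketch below assembles a Code Representation style prompt: the database is presented as CREATE TABLE statements, so column types and foreign keys appear in a form the model saw during code pre-training, and the question is embedded in an SQL comment. The exact template wording is an approximation, not the paper's verbatim prompt.

```python
def code_representation_prompt(create_table_stmts: list[str], question: str) -> str:
    # CR_P: full schema as CREATE TABLE statements, question as a comment,
    # ending with "SELECT" to cue the model into completing a query.
    schema = "\n".join(create_table_stmts)
    return (
        "/* Given the following database schema: */\n"
        f"{schema}\n"
        f"/* Answer the following: {question} */\n"
        "SELECT"
    )
```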

       Effect of Model Scale. From the results we observe a positive correlation between model scale and Text-to-SQL performance for both LLaMA and Vicuna. Specifically, the average execution accuracy of LLaMA shows a notable progression from 9.1% to 34.0% on Spider-dev, and Vicuna shows a similar upward trend from 15.3% to 36.6%. On the more challenging Spider-Realistic dataset, the same pattern can be observed: the execution accuracy of LLaMA rises from 7.56% to 25.4%, and that of Vicuna from 12.3% to 30.0%.

       Effect of Alignment. From the results we observe that LLM alignment can benefit Text-to-SQL. Specifically, at the same model scale, Vicuna outperforms LLaMA by about 5% in execution accuracy on both Spider-dev and Spider-Realistic. Besides, the aligned LLMs share a similar preference in question representation with the pre-trained ones, and their highest execution accuracies are all achieved with CR_P.

4.4.3 Few-shot Scenario with Open-source LLM. For the few-shot scenario, Fig. 5 shows the performance of LLaMA-33B and Vicuna-33B with CR_P. We use DAIL Selection to select examples, as it is reported as the best strategy in Sec. 4.3. For more details, refer to Appendix C.2. From this figure, we can see that LLaMA-33B benefits more than Vicuna-33B, achieving 36.4% exact-set-match accuracy with 5-shot Full-Information Organization examples. Regarding execution accuracy, increasing the number of examples benefits Text-to-SQL in most cases. Besides, among the organization strategies, Full-Information Organization outperforms the others across different k-shot scenarios, achieving 51.5% execution accuracy with Vicuna-33B.


Figure 5: Few-shot evaluation with open-source LLMs on Spider-dev.

       Notably, in both zero-shot and few-shot scenarios, the open-source LLMs are far behind the OpenAI LLMs. We next try to further enhance their performance with supervised fine-tuning.

4.4.4 Supervised Fine-tuning with Open-source LLM. To further enhance open-source LLMs' performance, we explore supervised fine-tuning for Text-to-SQL. Similar to in-context learning, fine-tuning may prefer different representations. Thus, we first fine-tune open-source LLMs on zero-shot training samples with different representations. Following the setting of supervised fine-tuning [29, 38], we block the gradients from the prompt and only update the weights with gradients from the response (the SQL query). We use the train split of Spider, which contains 8,659 training samples. For more training details, please refer to Appendix C.3.
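       A minimal sketch of this gradient blocking, assuming a Hugging Face style causal-LM setup: prompt positions receive the ignore label -100, so the cross-entropy loss, and hence the gradient, comes only from the response tokens. The function name and truncation length are illustrative.

```python
import torch

def mask_prompt_labels(prompt_ids: list[int], response_ids: list[int],
                       max_len: int = 2048):
    # Concatenate prompt and response, then label prompt positions with
    # -100 so the loss is computed on the SQL response tokens only.
    input_ids = (prompt_ids + response_ids)[:max_len]
    n_prompt = min(len(prompt_ids), len(input_ids))
    labels = [-100] * n_prompt + input_ids[n_prompt:]
    return torch.tensor(input_ids), torch.tensor(labels)
```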

       Zero-shot Scenario. Fig. 6 shows the performance of supervised fine-tuning with various LLMs and question representations in the zero-shot scenario. Compared with the zero-shot performance before fine-tuning in Table 3, their performance is greatly enhanced. Comparing the different representations, the Alpaca SFT Prompt shows an obvious advantage in supervised fine-tuning, as it is designed for exactly this scenario.
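       For reference, the standard Alpaca instruction template is shown below; the Alpaca SFT Prompt follows this instruction/input/response layout, which is why it pairs well with supervised fine-tuning. The instruction wording and the schema serialization passed as `db_schema` are illustrative assumptions.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def alpaca_sft_prompt(question: str, db_schema: str) -> str:
    # During fine-tuning, the ground-truth SQL is appended after
    # "### Response:" and only those tokens contribute to the loss.
    return ALPACA_TEMPLATE.format(
        instruction="Write a SQL query to answer the question.",
        input=f"{db_schema}\n\nQuestion: {question}",
    )
```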


Figure 6: Zero-shot evaluation results on Spider-dev with different fine-tuned open-source LLMs.

       We also observe that the gap among different representations and model scales narrows after fine-tuning. A possible reason is that, once fine-tuned, the LLM can generalize to evaluation samples even with weak representations, like the Basic Prompt. In this experiment, the best performance on Spider is achieved by the combination of LLaMA-13B and the Alpaca SFT Prompt, with 65.1% exact-set-match accuracy and 68.6% execution accuracy. For more detailed numerical results, please refer to Appendix C.4. As for larger LLMs, the combination of LLaMA-33B and the Code Representation Prompt achieves 69.1% execution accuracy and 65.9% exact-set-match accuracy. Due to limited resources, we leave LLMs larger than 33B as future work.

       In summary, supervised fine-tuning is quite beneficial for open-source LLMs in Text-to-SQL. Compared with the OpenAI LLMs in the zero-shot scenario, the fine-tuned LLaMA-13B and 33B are comparable to TEXT-DAVINCI-003 and slightly weaker than GPT-4 and GPT-3.5-TURBO.

       Few-shot Scenario. After supervised fine-tuning, an important question is: can we continue to enhance the performance of open-source LLMs by adding contextual examples? To answer this, we evaluate the fine-tuned LLaMA-7B and 13B with 0-, 1-, 3- and 5-shot prompts, as shown in Table 4. We also include the evaluation results of the original LLaMA-7B and 13B for clear comparison. Unexpectedly, the fine-tuned LLMs fail to learn from examples. Specifically, adding contextual examples to the test prompts incurs a sudden decrease in both exact-set-match and execution accuracy, and adding more examples does not help either. A possible reason is that the LLM overfits to the zero-shot prompt, which renders the examples useless.


Table 4: Few-shot evaluation results of supervised fine-tuned LLMs on Spider-dev.

       In summary, open-source LLMs demonstrate significant potential for Text-to-SQL, particularly with supervised fine-tuning. Specifically, after fine-tuning, their performance is comparable to TEXT-DAVINCI-003 in the zero-shot scenario. However, unlike the OpenAI LLMs, the fine-tuned LLMs fail to learn from contextual examples. How to preserve in-context learning ability after fine-tuning remains to be explored in future studies.

4.5 Token Efficiency

       Considering that OpenAI LLMs are charged by the number of tokens, and that an LLM's running time grows with prompt length, we underscore token efficiency in prompt engineering, which aims to achieve higher accuracy with fewer tokens. In this section, we review our experiments on Spider-dev in terms of token efficiency. Specifically, for both OpenAI and open-source LLMs, we experimentally study the trade-off between execution accuracy and token count, where the token count is mainly affected by question representation and example organization. For example selection, we fix it as DAIL_S. Besides, we also include several state-of-the-art Text-to-SQL methods in our comparison: DIN-SQL [31], STRIKE [25] and CBR-ApSQL [11]. We take their reported highest execution accuracy as their performance. For token cost, we average the token count of 10 randomly sampled instances for DIN-SQL. For STRIKE, the optimal performance is achieved by majority voting over the 1-shot to 5-shot results, resulting in a significant increase in token cost. Further, for CBR-ApSQL, the token cost is calculated with their question representation and 8-shot examples in SQL-Only Organization.
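       Token cost can be measured offline with the tokenizer matching the target model. Below is a small sketch using tiktoken, OpenAI's tokenizer library, averaging the prompt length over sampled instances as done for DIN-SQL; the model name is just an example.

```python
import tiktoken

def avg_prompt_tokens(prompts: list[str], model: str = "gpt-4") -> float:
    # Count tokens with the same encoding the OpenAI API uses for `model`.
    enc = tiktoken.encoding_for_model(model)
    return sum(len(enc.encode(p)) for p in prompts) / len(prompts)
```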

       Fig. 7 shows the comparison in terms of token efficiency. In the zero-shot scenario, compared with rule implication, prompts with foreign keys generally achieve higher execution accuracy at the expense of more tokens. In the few-shot scenario, comparing the different organization strategies, FI is very inefficient: its token count is several times that of DAIL and SO. Comparing DAIL and SO, DAIL together with GPT-4 achieves the highest accuracy of 83.5%, while having a token cost similar to SO. Therefore, DAIL is more token-efficient than SO and FI.


Figure 7: Token efficiency of different representations in Spider-dev for OpenAI LLMs. We utilize different colors to represent different question representations and different shapes to denote different example organizations as well as the usage of foreign key information and rule implication. In particular, the overlap of shapes is used to indicate the usage of both foreign key information and rule implication. The rings stand for the prompts in zero-shot scenario and the stars stand for the previous SOTA results of few-shot methods in LLMs.

       Compared with other state-of-the-art solutions, DAIL-SQL outperforms DIN-SQL and STRIKE in terms of both accuracy and efficiency. CBR-ApSQL achieves 78.2% accuracy with TEXT-DAVINCI-003, which is still lower than the optimal performance achieved by DAIL + FI.

       Besides, for the open-source LLMs in Fig. 7(d), the LLMs fine-tuned on Text-to-SQL are much more token-efficient. However, as discussed in Sec. 4.4, adding examples is unhelpful for fine-tuned open-source LLMs and even reduces their token efficiency.

       In summary, token efficiency is a critical metric for real-world applications of LLMs on Text-to-SQL. In light of this, our approach DAIL-SQL offers a compelling solution that combines high execution accuracy with improved token efficiency, making it highly practical and suitable for real-world applications.

5 DISCUSSION

       Based on our experiments, we summarize the following empirical insights and guidelines:

  • For question representation, the Code Representation Prompt and the OpenAI Demonstration Prompt are recommended, and additional information such as foreign keys and rule implication can be very helpful.

  • For example selection, the similarities of both the natural language question and the SQL query are important. Together, these two similarities are a good indicator for designing an effective selection strategy.

  • For example organization, if the adopted LLM is powerful enough, like GPT-4, presenting question-SQL pairs is an effective yet efficient choice. Otherwise, presenting full-information examples is suggested.

  • For open-source LLMs, supervised fine-tuning seems to be necessary, and the Alpaca SFT Prompt is suggested. Besides, to obtain performance comparable to the OpenAI LLMs, larger LLMs might be helpful.

       There are also some limitations in this paper. Due to limited resources, we only test two rule implications; exploring more rules could further benefit LLM-based Text-to-SQL solutions. Besides, the databases in Spider and Spider-Realistic may not be large enough, and we believe new challenges in effectiveness and efficiency will emerge when the Text-to-SQL task involves a massive number of tables.

       Furthermore, the current evaluation metrics focus more on correctness than efficiency, so how to prompt an LLM to generate an efficient SQL query among all the correct ones is an important yet unexplored question. We will keep working on these limitations and open questions.

6 CONCLUSIONS

       In this paper, we conduct a systematic study of LLM-based Text-to-SQL from the aspects of question representation, in-context learning and supervised fine-tuning. We point out that existing in-context learning techniques for Text-to-SQL neglect the mapping between questions and queries, as well as the trade-off between example quality and quantity. To address these issues, we propose a new prompt engineering method, named DAIL-SQL, which refreshes the Spider leaderboard with 86.6% execution accuracy and ranks first. Regarding supervised fine-tuning, we demonstrate the great potential of open-source LLMs for Text-to-SQL, underline the importance of question representation and model scale, and point out the degeneration of in-context learning capability after fine-tuning. Further, we examine existing solutions in terms of token efficiency, which shows that DAIL-SQL is much more efficient and emphasizes the importance of token efficiency in prompt engineering. All of these are open challenges and opportunities for future study. We hope that our work provides a comprehensive study of Text-to-SQL, offers some guidelines for real-world applications, and helps people advance its frontiers.

REFERENCES

[1] Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar. 2023. GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all.
[2] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021. A General Language Assistant as a Laboratory for Alignment. CoRR abs/2112.00861 (2021).
[3] LILY Group at Yale University. 2018. Spider 1.0, Yale Semantic Parsing and Text-to-SQL Challenge. https://yale-lily.github.io/spider.
[4] Ruichu Cai, Boyan Xu, Zhenjie Zhang, Xiaoyan Yang, Zijian Li, and Zhihao Liang. 2018. An Encoder-Decoder Framework Translating Natural Language to Database Queries. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. 3977–3983.
[5] Shuaichen Chang and Eric Fosler-Lussier. 2023. How to Prompt LLMs for Textto-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings. CoRR abs/2305.11853 (2023).
[6] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
[7] Naihao Deng, Yulong Chen, and Yue Zhang. 2022. Recent Advances in Text-toSQL: A Survey of What We Have and What We Expect. In Proceedings of the 29th International Conference on Computational Linguistics. 2166–2187.
[8] Xiang Deng, Ahmed Hassan Awadallah, Christopher Meek, Oleksandr Polozov, Huan Sun, and Matthew Richardson. 2021. Structure-Grounded Pretraining for Text-to-SQL. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1337–1350.
[9] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A Survey for In-context Learning. CoRR abs/2301.00234 (2023).
[10] Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Lu Chen, Jinshu Lin, and Dongfang Lou. 2023. C3: Zero-shot Text-to-SQL with ChatGPT. CoRR abs/2307.07306 (2023).
[11] Chunxi Guo, Zhiliang Tian, Jintao Tang, Pancheng Wang, Zhihua Wen, Kang Yang, and Ting Wang. 2023. A Case-Based Reasoning Framework for Adaptive Prompting in Cross-Domain Text-to-SQL. CoRR abs/2304.13301 (2023).
[12] Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Conference of the Association for Computational Linguistics. 4524–4535.
[13] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The 10th International Conference on Learning Representations.
[14] Binyuan Hui, Ruiying Geng, Lihan Wang, Bowen Qin, Yanyang Li, Bowen Li, Jian Sun, and Yongbin Li. 2022. S2 SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers. In Findings of the Association for Computational Linguistics. 1254–1262.
[15] George Katsogiannis-Meimarakis and Georgia Koutrika. 2023. A Survey on Deep Learning Approaches for Text-to-SQL. VLDB J. 32, 4 (2023), 905–936.
[16] Anirudh Khatry, Joyce Cahoon, Jordan Henkel, Shaleen Deep, K. Venkatesh Emani, Avrilia Floratou, Sumit Gulwani, Vu Le, Mohammad Raza, Sherry Shi, Mukul Singh, and Ashish Tiwari. 2023. From Words to Code: Harnessing Data for Program Synthesis from Natural Language. CoRR abs/2305.01598 (2023).
[17] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models Are Zero-Shot Reasoners. Advances in neural information processing systems 35 (2022), 22199–22213.
[18] Chia-Hsuan Lee, Oleksandr Polozov, and Matthew Richardson. 2021. KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2261–2273.
[19] Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. 2023. RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. In 37th AAAI Conference on Artificial Intelligence. 13067–13075.
[20] Jinyang Li, Binyuan Hui, Reynold Cheng, Bowen Qin, Chenhao Ma, Nan Huo, Fei Huang, Wenyu Du, Luo Si, and Yongbin Li. 2023. Graphix-T5: Mixing Pre-trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing. In 37th AAAI Conference on Artificial Intelligence. 13076–13084.
[21] Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs. CoRR abs/2305.03111 (2023).
[22] Aiwei Liu, Xuming Hu, Lijie Wen, and Philip S. Yu. 2023. A Comprehensive Evaluation of ChatGPT’s Zero-Shot Text-to-SQL Capability. CoRR abs/2303.13547 (2023).
[23] Hu Liu, Yuliang Shi, Jianlin Zhang, Xinjun Wang, Hui Li, and Fanyu Kong. 2023. Multi-hop Relational Graph Attention Network for Text-to-SQL Parsing. In International Joint Conference on Neural Networks. 1–8.
[24] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What Makes Good In-Context Examples for GPT-3?. In Proceedings of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. 100–114.
[25] Linyong Nan, Yilun Zhao, Weijin Zou, Narutatsu Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, and Dragomir Radev. 2023. Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies. CoRR abs/2305.12586 (2023).
[26] OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023).
[27] OpenAI. 2023. Rate limits. https://platform.openai.com/docs/guides/rate-limits/overview. Last accessed on 2023-07-24.
[28] OpenAI. 2023. SQL translate. https://platform.openai.com/examples/default-sqltranslate. Last accessed on 2023-07-24.
[29] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. In NeurIPS.
[30] Octavian Popescu, Irene Manotas, Ngoc Phuoc An Vo, Hangu Yeo, Elahe Khorashani, and Vadim Sheinin. 2022. Addressing Limitations of Encoder-Decoder Based Approach to Text-to-SQL. In Proceedings of the 29th International Conference on Computational Linguistics. 1593–1603.
[31] Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. CoRR abs/2304.11015 (2023).
[32] Jiexing Qi, Jingyao Tang, Ziwei He, Xiangpeng Wan, Yu Cheng, Chenghu Zhou, Xinbing Wang, Quanshi Zhang, and Zhouhan Lin. 2022. RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 3215–3229.
[33] Bowen Qin, Binyuan Hui, Lihan Wang, Min Yang, Jinyang Li, Binhua Li, Ruiying Geng, Rongyu Cao, Jian Sun, Luo Si, Fei Huang, and Yongbin Li. 2022. A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions. CoRR abs/2208.13629 (2022).
[34] Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the Text-to-SQL Capabilities of Large Language Models. CoRR abs/2204.00498 (2022).
[35] Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 9895–9901.
[36] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. MPNet: Masked and Permuted Pre-training for Language Understanding. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems.
[37] Ruoxi Sun, Sercan Ö. Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, and Tomas Pfister. 2023. SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL. CoRR abs/2306.00739 (2023).
[38] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[39] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR abs/2302.13971 (2023).
[40] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR abs/2307.09288 (2023).
[41] Immanuel Trummer. 2022. CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions Using GPT-3 Codex. Proceedings of the VLDB Endowment 15, 11 (2022), 2921–2928.
[42] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7567–7578.
[43] Lihan Wang, Bowen Qin, Binyuan Hui, Bowen Li, Min Yang, Bailin Wang, Binhua Li, Jian Sun, Fei Huang, Luo Si, and Yongbin Li. 2022. Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing. In The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1889–1898.
[44] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations.
[45] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. CoRR abs/1910.03771 (2019).
[46] Kun Xu, Lingfei Wu, Zhiguo Wang, Yansong Feng, and Vadim Sheinin. 2018. SQL-to-Text Generation with Graph-to-Sequence Model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 931–936.
[47] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3911–3921.
[48] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. CoRR abs/2303.18223 (2023).
[49] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. CoRR abs/2306.05685 (2023).
[50] Yanzhao Zheng, Haibin Wang, Baohua Dong, Xingjun Wang, and Changshan Li. 2022. HIE-SQL: History Information Enhanced Network for Context-Dependent Text-to-SQL Semantic Parsing. In Findings of the Association for Computational Linguistics. 2997–3007.
[51] Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic Evaluation for Text-to-SQL with Distilled Test Suites. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 396–411.
[52] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. CoRR abs/1709.00103 (2017).

A QUESTION REPRESENTATIONS

A.1 Implementation Details

        For the question similarity in QTS_S, MQS_S and DAIL_S, we first connect question words to the database with an n-gram matching based schema-linking method [42]. Then, to obtain the question skeleton, we mask out the domain-specific tokens, replacing table and column names and values with placeholder tokens. At last, we embed the masked questions with a pre-trained sentence Transformer, all-mpnet-base-v2 [36], to calculate their similarities.
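       A small sketch of the last step, assuming the questions have already been masked; it uses the sentence-transformers library and cosine similarity.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def question_similarity(masked_a: str, masked_b: str) -> float:
    # Embed the two masked questions and return their cosine similarity.
    emb = model.encode([masked_a, masked_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```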

       For the query similarity in DAIL_S, we utilize Graphix [20] as the preliminary model to generate a predicted query. We then obtain its skeleton by removing the database-specific information [19], including column names and values. Finally, we calculate the Jaccard similarity between each example candidate and the predicted query as their query similarity. For the query similarity threshold in DAIL_S, we set it to 0.9 in the experiments of this paper.
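       The sketch below computes a skeleton-level Jaccard similarity. The keyword list and the masking regexes are illustrative simplifications of the skeleton extraction described above.

```python
import re

SQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "ORDER", "HAVING",
                "JOIN", "ON", "COUNT", "AVG", "MAX", "MIN", "SUM", "LIMIT",
                "DISTINCT", "AND", "OR", "NOT", "IN", "AS", "DESC", "ASC"}

def skeleton(sql: str) -> set:
    # Mask literals, then keep only SQL keywords and punctuation so that
    # database-specific names and values do not affect the similarity.
    sql = re.sub(r"'[^']*'|\d+(?:\.\d+)?", " v ", sql)
    tokens = re.findall(r"[A-Za-z_]+|[(),.*=<>!]", sql.upper())
    return {t for t in tokens if t in SQL_KEYWORDS or not t[0].isalpha()}

def query_similarity(candidate_sql: str, predicted_sql: str) -> float:
    a, b = skeleton(candidate_sql), skeleton(predicted_sql)
    return len(a & b) / len(a | b) if a | b else 0.0
```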

       For the submission to the Spider leaderboard, we set the query similarity threshold to 0.85 and utilize GPT-4, with CR_P, MQS_S and DAIL_O, as the preliminary model, to ensure that only one model is involved. Furthermore, we perform self-consistency voting over 5 queries produced for each question, and set the temperature argument to 1.0 for diversity in voting.
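       Below is a sketch of this voting step under stated assumptions: candidates are compared by the result sets they return when executed against a SQLite database, failing candidates are discarded, and ties fall to the first winner found. The function and its fallback rule are illustrative.

```python
import sqlite3
from collections import Counter

def self_consistency(candidates: list[str], db_path: str) -> str:
    # Execute each candidate SQL query and majority-vote on result sets.
    results = {}
    for sql in candidates:
        try:
            with sqlite3.connect(db_path) as conn:
                rows = frozenset(map(tuple, conn.execute(sql).fetchall()))
            results[sql] = rows
        except sqlite3.Error:
            continue  # discard candidates that fail to execute
    if not results:
        return candidates[0]  # fall back to the first candidate
    winning_rows, _ = Counter(results.values()).most_common(1)[0]
    return next(s for s, r in results.items() if r == winning_rows)
```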

A.2 Detailed Performance of Different Question Representations

       The numerical results of different question representations in the zero-shot scenario are shown in Table 5.


Table 5: Details of zero-shot evaluation on Spider-dev with different question representations.
A.3 Detailed Performance of Different Representations with Foreign Keys

       The numerical results of the ablation study about foreign key information are shown in Table 6.


Table 6: Details of zero-shot evaluation on Spider-dev with foreign keys and comparisons with the results obtained without foreign keys in Table 5.

A.4 Detailed Performance of Different Representations with/without Explanation rule

       The numerical results of the ablation study about rule implication are shown in Table 7.

Table 7: Details of zero-shot evaluation on Spider-dev with/without rule implication “with no explanation” in instructions and comparisons with their opposites in Table 5.
A.5 Question Representations with Rule Implication “Let’s think step by step”

       The numerical results of the ablation study with the rule “Let’s think step by step” are shown in Table 8.


Table 8: Zero-shot evaluation results on Spider-dev with "Let’s think step by step" rule implication in instructions and comparisons with Table 5.
B IN-CONTEXT LEARNING FOR TEXT-TO-SQL

B.1 One-Shot Evaluation on Different Question Representation
Figure 8: Results of one-shot evaluation on Spider-dev with different question representations and comparisons with the results of zero-shot evaluation in Fig. 1. Green arrows indicate increases, and red arrows indicate decreases.


Table 9: Details of one-shot evaluation on Spider-dev with different question representations and comparisons with the results of zero-shot evaluation in Table 5.

       In Fig. 8, we present the results of our one-shot evaluation for different question representations. Specifically, we use Full-Information examples selected by DAIL_S here, and Table 9 shows the comparisons with the zero-shot scenario. Comparing the zero-shot and one-shot evaluation results, adding a contextual example shows obvious and consistent improvements in exact-set-match accuracy for all LLMs. In terms of execution accuracy, contextual examples benefit both GPT-4 and TEXT-DAVINCI-003. However, for GPT-3.5-TURBO, adding contextual examples only benefits TR_P and CR_P, indicating the bias in in-context learning capability among different LLMs. Comparing the different representations, CR_P shows an obvious advantage in execution accuracy, taking advantage of its programming style.

B.2 Detailed Performance of Different Example Organizations

       The numerical results of different example organization strategies in the few-shot scenario are shown in Tables 10 and 11.

Table 10: Details of Few-shot evaluation on Spider development split with different organizations.


Table 11: Details of Few-shot evaluation on Spider-Realistic dataset with different organizations.

C SUPERVISED FINE-TUNING FOR TEXT-TO-SQL

C.1 Detailed Performance of Open-source LLMs on Spider-Realistic

       The numerical results are shown in Table 12.


Table 12: Zero-shot evaluation results on Spider-Realistic with different open-source LLMs.

C.2 Detailed Performance of Open-source LLMs on Spider-dev in Few-shot Scenario

       The numerical results are shown in Table 13.

Table 13: Detailed performance of open-source LLMs on Spider-dev in few-shot scenario.

C.3 Details for Supervised Fine-tuning

       For the dataset, we use the train split of Spider, which contains 8,659 training samples in total. For hyper-parameters, we set the global batch size to 256, and search the learning rate in [1e-6, 1e-4] and the weight decay in {1, 0.1, 0.01, 0}. During fine-tuning, we use a cosine learning rate scheduler from Transformers [45] with a warm-up ratio of 0.03. Besides, all LLMs are fine-tuned on a server with eight 64G A100 GPUs.
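       The sketch below expresses this setup with Hugging Face TrainingArguments. The per-device batch size and gradient-accumulation split of the global batch size, the epoch count, and the concrete learning rate are illustrative choices, not the paper's exact values.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./sft-text2sql",
    per_device_train_batch_size=4,   # 8 GPUs x 4 x 8 accumulation = 256
    gradient_accumulation_steps=8,
    learning_rate=2e-5,              # searched in [1e-6, 1e-4]
    weight_decay=0.1,                # searched in {1, 0.1, 0.01, 0}
    lr_scheduler_type="cosine",      # cosine learning rate scheduler
    warmup_ratio=0.03,               # warm-up ratio of 0.03
    num_train_epochs=3,              # illustrative
    bf16=True,
)
```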

C.4 Detailed Performance of Fine-tuned Open-source LLM on Spider and Spider-Realistic

       The numerical results on Spider-dev and Spider-Realistic are shown in Table 14.


Table 14: Performance of supervised fine-tuning on Spider and Spider-Realistic with respect to different representations and LLMs.

D SPIDER LEADERBOARD

       Fig. 9 shows the performance ranking on the Spider leaderboard as of Sep 19, 2023. On the leaderboard, our solution DAIL-SQL with GPT-4 achieves 86.2% execution accuracy; further, with self-consistency, it achieves 86.6% execution accuracy, ranking first.


Figure 9: Current performance rank in Spider leaderboard. (Last accessed on 2023-09-19.)
