Lately I have been putting a fair amount of effort into AGI work, building LLM + knowledge base + Tools agents through workflows. The workflow itself can be orchestrated either with prompts or with LangGraph, and the quality of the prompt writing heavily affects the final results. This post organizes the mainstream prompt-writing techniques in use today, including prompt examples and the ideas behind each prompting style, distilled from reading and analyzing 20 papers on prompt engineering.

Zero-shot Prompting

Zero-shot means the model was never trained on a particular task during pre-training or fine-tuning, yet it can still make accurate predictions on that task. This reflects strong generalization, abstract reasoning, knowledge transfer, semantic understanding, and efficient feature-representation learning. Instruction fine-tuning can improve a model's zero-shot ability; see the paper 《FINETUNED LANGUAGE MODELS ARE ZERO-SHOT LEARNERS》.

Zero-shot prompting can handle the following kinds of tasks:

Understanding Unseen Commands

Understand and execute commands the model has never encountered during its training phase.

Task: Explain the rules of a sport the model has never seen.

Prompt: Suppose there is a sport called "Light Ball Battle", in which two teams score by controlling a beam of light. Please explain the possible rules of this sport.

Image Recognition

Identify objects or animals it has never seen before.

Task: Identify an animal the model has never seen before, the nautilus.

Prompt: The nautilus is a marine creature with a spiral shell and bright colors, commonly found in tropical seas. Based on this information, determine whether a nautilus appears in the following image, and explain your reasoning.

Recommendation Systems

Suggest items (movies, books, etc.) to users based on descriptions or attributes of the items, even if those specific items have never been rated before.

Task: Recommend unrated books the user might be interested in.

Prompt: Given that the user likes science fiction about future technology and artificial intelligence, recommend a book the user might enjoy that has no rating record in the system.

Text Classification

Categorize texts into topics or sentiments it has not been explicitly trained on.

Task: Classify an article on a new topic, blockchain technology.

Prompt: Consider whether the following paragraph about blockchain technology is a concept explanation, an application case, or a prediction about the future. Classify it as 'technical explanation', 'application case', or 'future prediction'.

Cross-lingual Transfer

A model trained on tasks in one language (e.g., English) can perform the same task in another language (e.g., French) that it was not trained on.

Task: Have a model trained on English tasks explain a Chinese concept.

Prompt: "Bagua" (八卦) is a complex concept in Chinese involving cosmology, natural science, and human destiny. Please explain in English what "Bagua" means and how it is used in the culture.

Few-shot Prompting

Although large models have decent zero-shot ability, they still fall short on more complex tasks. Few-shot prompting supports in-context learning, which gives the model better performance. Few-shot uses a limited number (10-100) of labeled examples as context; the model extracts the task from the provided demonstrations and applies it to new, unseen inputs. The technique is typically used when labeled data is scarce.

《Language Models are Few-Shot Learners》shows that scaling up model parameters improves performance in few-shot and related settings.
《Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?》offers some practical findings about few-shot usage:

  • In the demonstrations, the label space (the set of all possible labels) and the distribution of the input text (the frequency or style of different text types) are both key factors, regardless of whether the labels on individual examples are correct.

    the label space and the distribution of the input text specified by the demonstrations are both key to in-context learning (regardless of whether the labels are correct for individual inputs);

  • Specifying the overall format is crucial; when the label space is unknown, even random words used as labels are significantly better than no labels at all.

    specifying the overall format is also crucial, e.g., when the label space is unknown, using random English words as labels is significantly better than using no labels;

  • Meta-training with an in-context learning objective amplifies the model's reliance on superficial aspects of the demonstrations, such as the format, rather than on the input-label mapping. This connects to the common hallucination problem: over-reliance on the context can crowd out deeper understanding and prediction, increasing hallucinations.

    meta-training with an in-context learning objective magnifies these effects—the models almost exclusively exploit simpler aspects of the demonstrations like the format rather than the input-label mapping.

Few-shot prompting can handle the following kinds of tasks:

Language Translation

Task: Translate English into French.

Prompt:
English: I want to eat.
French: Je veux manger.
English: How are you?
French: Comment ça va?
English: It is a beautiful day.
French:

Sentiment Analysis

Task: Analyze the sentiment of customer reviews.

Prompt:
Review: This product works great!
Sentiment: Positive
Review: I am disappointed with the service.
Sentiment: Negative
Review: Their support team is very helpful.
Sentiment:
Or, in a more compact form:

This product works great! // Positive
I am disappointed with the service. // Negative
Their support team is very helpful.

Text Summarization

Task: Summarize news articles.

Prompt:
Article: The company reported a 20% increase in profit this quarter, attributing the growth to an increase in sales and cost reduction measures.
Summary: Company profit grew 20% this quarter due to higher sales and cost cutting.
Article: Sacramento is the wild card here, and having a final game against Portland on Sunday could be the difference that keeps them on the 7-8 line and away from the win-or-go-home 9-10 line.
Summary:

Question Answering

Task: Answer questions based on the provided information.

Prompt:
Text: The Eiffel Tower is located in Paris, France. It was constructed in 1889.
Question: Where is the Eiffel Tower located?
Answer: Paris, France
Text: Alexander Hamilton was one of the founding fathers of the United States.
Question: Who was Alexander Hamilton?
Answer:

Code Generation

Task: Generate simple code from a requirement description.

Prompt:
Description: Write a function that computes the sum of two numbers.
Code: def add_numbers(a, b): return a + b
Description: Write a function that returns the maximum value in a list.
Code:

Chain-of-Thought (CoT) Prompting

Chain-of-thought (CoT) prompting enables complex reasoning through intermediate reasoning steps. It can be combined with few-shot prompting to get better results on complex reasoning tasks.

CoT

The paper 《Chain-of-Thought Prompting Elicits Reasoning in Large Language Models》introduced CoT (figure omitted; see the paper). An example:

Prompt:
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:

Output:
Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.

Zero-Shot CoT

The paper 《Large Language Models are Zero-Shot Reasoners》: in essence, it appends the sentence "Let's think step by step" to push the model to reason further (figure omitted; see the paper).

Prompt:
I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

Let's think step by step.

Output:
First, you started with 10 apples.
You gave away 2 apples to the neighbor and 2 to the repairman, so you had 6 apples left.
Then you bought 5 more apples, so now you had 11 apples.
Finally, you ate 1 apple, so you would remain with 10 apples.

Auto-CoT

The paper 《AUTOMATIC CHAIN OF THOUGHT PROMPTING IN LARGE LANGUAGE MODELS》improves on Zero-shot CoT. The authors argue that Zero-shot CoT still makes reasoning errors within individual chains, while Manual-CoT demonstrations are complex and expensive to hand-write. Auto-CoT instead samples diverse questions and generates reasoning chains to serve as demonstrations. It has two main steps:

  • Stage 1: question clustering, partition the questions of a given dataset into a few clusters.
  • Stage 2: demonstration sampling, select a representative question from each cluster and generate its reasoning chain with Zero-Shot-CoT using simple heuristics.

Auto-CoT consists of two main stages:
(i) question clustering: partition questions of a given dataset into a few clusters;
(ii) demonstration sampling: select a representative question from each cluster and generate its reasoning chain using Zero-Shot-CoT with simple heuristics.
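
A minimal Python sketch of these two stages, assuming a hypothetical llm() helper that calls your model, with sentence-transformers and scikit-learn for clustering; the paper's selection heuristics are simplified here.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def auto_cot_demos(questions, n_clusters=4):
    # Stage 1: question clustering
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(questions)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

    demos = []
    for c in range(n_clusters):
        # Take one representative question per cluster (the paper sorts by
        # distance to the cluster center and filters out long questions)
        rep = next(q for q, label in zip(questions, labels) if label == c)
        # Stage 2: generate its reasoning chain with Zero-Shot-CoT
        rationale = llm(f"Q: {rep}\nA: Let's think step by step.")
        demos.append(f"Q: {rep}\nA: Let's think step by step. {rationale}")
    return "\n\n".join(demos)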


Experiments on ten public benchmark reasoning tasks show that Auto-CoT matches or exceeds Manual-CoT, demonstrating that LLMs can perform CoT reasoning with automatically constructed demonstrations.

Active-Prompt

The paper 《Active Prompting with Chain-of-Thought for Large Language Models》: the authors observe that LLMs are uncertain when answering questions on certain tasks, producing diverse, inconsistent answers. To address this, they apply active learning: from a large pool of unannotated questions, select the most uncertain ones for human annotation, and use those to build effective prompts, improving LLMs' ability to solve complex reasoning tasks.


The procedure has four steps (a sketch of the uncertainty step follows the list):

  1. Uncertainty estimation: have the LLM attempt each question in the dataset and observe where it is most uncertain. For example, let the model answer each question several times and record the different answers; if the answers vary widely across attempts, the model's uncertainty on that question is high.
  2. Question selection: based on the uncertainty estimates, pick the most uncertain questions for annotation.
  3. Human annotation: write detailed reasoning chains for the selected questions.
  4. Inference with the new annotations: when the LLM answers new, unseen questions, include the human-annotated reasoning chains as exemplars in the input so it can follow their reasoning process.
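
A minimal sketch of the uncertainty-estimation step, assuming hypothetical llm() and extract_final_answer() helpers and a pool of unlabeled questions; disagreement among k sampled answers serves as the uncertainty metric.

def uncertainty(question, k=5):
    # Sample k answers at non-zero temperature and measure disagreement
    answers = [llm(f"Q: {question}\nA: Let's think step by step.", temperature=0.7)
               for _ in range(k)]
    distinct = {extract_final_answer(a) for a in answers}
    return len(distinct) / k  # 1.0 = a different answer on every attempt

# Step 2: pick the most uncertain questions for human annotation
to_annotate = sorted(question_pool, key=uncertainty, reverse=True)[:8]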

Multimodal CoT Prompting

Paper 《Multimodal Chain-of-Thought Reasoning in Language Models》 (paper figure: Multimodal CoT Task).

CoT is language-oriented and strongest on text. Multimodal CoT combines text and vision in a two-stage framework in which both stages share the same model architecture but differ in input and output:

  • Stage 1: rationale generation, process the input question across modalities (text and image) to produce intermediate reasoning chains.
  • Stage 2: answer inference, use those reasoning chains together with the original question to infer the final answer.

Both stages share the same model architecture but differ in the input and output. In the first stage, we feed the model with
language and vision inputs to generate rationales. In the second stage, we append the original language input with the rationale generated
from the first stage. Then, we feed the updated language input with the original vision input to the model to infer the answer.

Self-Consistency CoT

Paper: 《SELF-CONSISTENCY IMPROVES CHAIN OF THOUGHT REASONING IN LANGUAGE MODELS》
Not an especially novel idea: first use CoT to generate multiple reasoning chains, then run inference along each chain and take the most frequent answer as the final result. This improves accuracy but is fairly wasteful of compute.

The self-consistency method contains three steps:
(1) prompt a language model using chain-of-thought (CoT) prompting;
(2) replace the “greedy decode” in CoT prompting by sampling from the language model’s decoder to generate a diverse set of reasoning paths;
(3) marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.
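
A minimal sketch of these three steps, assuming hypothetical llm() and extract_final_answer() helpers.

from collections import Counter

def self_consistency(question, n=10):
    # (2) replace greedy decoding with sampling to get diverse reasoning paths
    paths = [llm(f"Q: {question}\nA: Let's think step by step.", temperature=0.7)
             for _ in range(n)]
    # (3) marginalize out the paths: majority vote over the final answers
    answers = [extract_final_answer(p) for p in paths]
    return Counter(answers).most_common(1)[0][0]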

Generated Knowledge Prompting

The paper 《Generated Knowledge Prompting for Commonsense Reasoning》is relatively early (2022); later few-shot prompting is similar, or rather few-shot prompting subsumes this style of writing. The core idea is to improve a large model's performance on commonsense reasoning by generating knowledge, without access to a structured knowledge base or task-specific fine-tuning. Two main steps:

  • Knowledge generation: use the language model to generate knowledge statements for a question; this step requires no structured knowledge base.
  • Knowledge integration: supply the generated knowledge together with the question as the prompt for answer inference, improving the performance of zero-shot or fine-tuned models on the task.

A prompt example from the paper:

Prompt:
Generate some knowledge about the input. Examples:
Input: What type of water formation is formed by clouds?
Knowledge: Clouds are made of water vapor.
Input: What can prevent food spoilage?
Knowledge: Dehydrating food is used for preserving food.
Input: The process by which genes are passed is
Knowledge: Genes are passed from parent to offspring.
Input: The stomach does what in the body?
Knowledge: The stomach is part of the digestive system.
Input: What can cause rocks to break down?
Knowledge: Mechanical weathering is when rocks are broken down by mechanical means.
Input: {question}
Knowledge:
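
A minimal two-step sketch, assuming a hypothetical llm() helper; knowledge_prompt is the template above, with the question substituted for {question}.

def answer_with_generated_knowledge(question):
    # Step 1: knowledge generation (no structured knowledge base required)
    knowledge = llm(knowledge_prompt.format(question=question))
    # Step 2: knowledge integration: prepend the knowledge, then infer the answer
    return llm(f"{knowledge}\n\nQuestion: {question}\nAnswer:")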

Prompt Chaining

Prompt chaining uses the output of one prompt as the input to another. By linking prompts together and guiding the LLM through a series of subtasks, a complex goal can be achieved. The idea is task decomposition: each subtask is one link in the chain. This pattern is common in LangChain.

Prompt chaining fits the following scenarios:

  1. Multi-step tasks: when a task requires several distinct steps in a workflow, chained prompts ensure each step finishes before the next, completing a complex task.
  2. Complex instructions: when a single prompt packs in many instructions and details, LLMs struggle to follow all of them consistently. Splitting the task into a series of chained subtasks improves the completion rate of each subtask.
  3. Verifying outputs: chaining can ask the LLM to re-check its own output and then improve on it.
  4. Parallel processing: when a task has multiple independent subtasks, create a separate prompt for each and run them in parallel.

Tips for effective prompt chaining:

  • Keep subtasks simple and clear
  • Use XML tags

https://docs.anthropic.com/claude/docs/chain-prompts

Multi-step

Extract quotes relevant to a question, task one:

Here is a document, in <document></document> XML tags:

<document>
{{DOCUMENT}}
</document>

Please extract, word-for-word, any quotes relevant to the question {{QUESTION}}. Please enclose the full list of quotes in <quotes></quotes> XML tags. If there are no quotes in this document that seem relevant to this question, please say "I can't find any relevant quotes".

Then use the extracted quotes as knowledge to answer the question, task two:

I want you to use a document and relevant quotes from the document to answer a question.

Here is the document:
<document>
{{DOCUMENT}}
</document>

Here are direct quotes from the document that are most relevant to the question:
<quotes>
{{QUOTES}}
</quotes>

Please use these to construct an answer to the question "{{QUESTION}}"

Ensure that your answer is accurate and doesn't contain any information not directly supported by the quotes.
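
A minimal sketch of chaining the two tasks, assuming a hypothetical llm() helper, with quote_prompt and answer_prompt holding the two templates above.

def answer_from_document(document, question):
    # Task one: extract the relevant quotes
    quotes = llm(quote_prompt.replace("{{DOCUMENT}}", document)
                             .replace("{{QUESTION}}", question))
    # Task two: feed the quotes back in to construct the final answer
    return llm(answer_prompt.replace("{{DOCUMENT}}", document)
                            .replace("{{QUOTES}}", quotes)
                            .replace("{{QUESTION}}", question))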

Validating outputs

For example, task one lists the grammatical errors in an article:

Here is an article:
<article>
{{ARTICLE}}
</article>

Please identify any grammatical errors in the article. Please only respond with the list of errors, and nothing else. If there are no grammatical errors, say "There are no errors."

Task two re-analyzes the article and compares against the error list produced by task one:

Here is an article:
<article>
{{ARTICLE}}
</article>

Please identify any grammatical errors in the article that are missing from the following list:
<list>
{{ERRORS}}
</list>

If there are no errors in the article that are missing from the list, say "There are no additional errors."

Parallel processing

Generate versions of an essay outline for different audiences:

Here is a concept: {{CONCEPT}}

I want you to write a three sentence outline of an essay about this concept that is appropriate for this level of reader: {{LEVEL}}

Please only respond with your outline, one sentence per line, in <outline></outline> XML tags. Don't say anything else.
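
A minimal parallel-processing sketch using asyncio, assuming a hypothetical async allm() helper and outline_prompt holding the template above; each reading level is an independent subtask.

import asyncio

async def outlines_for_levels(concept, levels=("child", "high school", "expert")):
    tasks = [allm(outline_prompt.replace("{{CONCEPT}}", concept)
                                .replace("{{LEVEL}}", level))
             for level in levels]
    # Run the independent prompts concurrently
    return await asyncio.gather(*tasks)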

Tree of Thoughts (ToT)

The idea of ToT is to mimic the human thinking process by building a tree of thoughts to explore different solution paths. It strengthens LLMs' ability to reason deeply about complex tasks and to find more reasonable, effective solutions.

Two papers lay out the ToT idea and the scenarios where ToT has advantages over CoT:
《Large Language Model Guided Tree-of-Thought》
《Tree of Thoughts: Deliberate Problem Solving with Large Language Models》

(figure: structural comparison of ToT with CoT and CoT-SC)

(figure: the ToT thinking process and its place in a system architecture)

The ToT procedure is quite interesting: BFS and DFS can both be combined to drive the overall strategy, and when a local optimum is not the global optimum they can produce different results. The BFS variant assumes local optimality decides global optimality: at each step it drafts several plans, holds an odd number of votes over each plan's result, takes the highest-voted plan as that step's strategy, and then plans the next step, until a chain of plans finally yields the answer.

Reference implementation:

https://github.com/dave1010/tree-of-thought-prompting

Personally, though, I would not make the plan count or step count too large: 3 plans is enough, and at most 5 steps.
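
A minimal BFS-with-voting sketch under those limits, assuming hypothetical LLM-backed propose() and vote() helpers.

def tot_bfs(question, n_steps=5, n_plans=3, n_votes=5):
    state = question
    for _ in range(n_steps):
        candidates = propose(state, k=n_plans)  # draft several plans per step
        best = vote(candidates, n=n_votes)      # odd number of votes per step
        state = f"{state}\n{best}"              # keep only the winning plan
    return state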

Retrieval Augmented Generation (RAG) Prompting

RAG is a whole research field in itself, and a commercially promising one: it addresses problems such as model hallucination and improves the quality and effectiveness of outputs.

Here is a RAG survey, 《Retrieval-Augmented Generation for Large Language Models: A Survey》, which analyzes the strengths and weaknesses of Naive RAG, Advanced RAG, and Modular RAG. I may write a dedicated article on RAG when time allows.

(figure: a basic RAG pipeline)

(figure: the pipelines of several mainstream RAG approaches)

The survey also contrasts RAG with fine-tuning:

  • RAG outputs more precise information on retrieval tasks.
  • FT makes the model internalize the structure, style, and other characteristics of a dataset.

RAG can be likened to providing a model with a tailored textbook for information retrieval, ideal for precise information retrieval tasks. In contrast, FT is comparable to a student internalizing knowledge over time, suitable for scenarios requiring replication of specific structures, styles, or formats.

For example, a RAG prompt from LangChain:

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

Question: {{question}}
Context: {{context}}
Answer:
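
A runnable LangChain sketch built around this prompt, assuming an existing vector store `vectorstore` and chat model `llm` are already configured.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    "You are an assistant for question-answering tasks. Use the following "
    "pieces of retrieved context to answer the question. If you don't know "
    "the answer, just say that you don't know. Use three sentences maximum "
    "and keep the answer concise.\n"
    "Question: {question}\nContext: {context}\nAnswer:"
)

# Retrieved documents fill {context}; the user's question passes through
retriever = vectorstore.as_retriever()
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(chain.invoke("Where is the Eiffel Tower located?"))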

Automatic Reasoning and Tool-use (ART) Prompting

Essentially this is LLM + tools. Many frameworks implement it today, such as LangChain and ChatGPT's built-in function calling.

Paper: 《ART: Automatic multi-step reasoning and tool-use for large language models》

An example prompt:

You are an assistant that has access to the following set of tools. Here are the names and descriptions for each tool:

{tools}

Given the user input, return the name and input of the tool to use. Return your response as a JSON blob with 'name' and 'arguments' keys.

Implementing a tool in LangChain:

from langchain_core.prompts import ChatPromptTemplate
from langchain.tools.render import render_text_description
from langchain_core.tools import tool


@tool
def multiply(first_int: int, second_int: int) -> int:
    """Multiply two integers together."""
    return first_int * second_int


rendered_tools = render_text_description([multiply])

system_prompt = f"""You are an assistant that has access to the following set of tools. Here are the names and descriptions for each tool:

{rendered_tools}

Given the user input, return the name and input of the tool to use. Return your response as a JSON blob with 'name' and 'arguments' keys."""

prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt), ("user", "{input}")]
)
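
A usage sketch continuing the code above, assuming a chat model `model` (e.g., ChatOpenAI) is available; JsonOutputParser parses the JSON blob the prompt asks for, and the parsed arguments are passed to the tool.

from langchain_core.output_parsers import JsonOutputParser

chain = prompt | model | JsonOutputParser()
result = chain.invoke({"input": "what's thirteen times 4"})
# e.g. {'name': 'multiply', 'arguments': {'first_int': 13, 'second_int': 4}}
print(multiply.invoke(result["arguments"]))  # 52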

Other tool implementations:

https://python.langchain.com/docs/use_cases/tool_use/

Directional Stimulus Prompting

Paper: 《Guiding Large Language Models via Directional Stimulus Prompting》

The rough idea: use hint keywords such as entity tags to constrain the scope and content of the LLM's answer, giving manual control over the key information and producing higher-quality summaries.
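
A sketch of what such a prompt can look like; the hint keywords here are placeholders, and in the paper they are produced by a small tuned policy model rather than written by hand.

Article: {{ARTICLE}}

Hint (keywords): {{KEYWORD_1}}; {{KEYWORD_2}}; {{KEYWORD_3}}

Summarize the article in 2-3 sentences, making sure the summary covers the hint keywords.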

Program-Aided Language Models (PAL) Prompting

Paper 《PAL: Program-aided Language Models》
The paper is fairly old now, published in 2022. At heart it is still a form of CoT, and coding tasks like these are handled more accurately today with LLMs + tools and an attached interpreter, so its practical value may be limited. The idea is still worth borrowing: decompose a programming task and arrive at the answer through CoT, e.g., for Socratic-style guided programming.
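
A minimal PAL-style prompt sketch, reusing the apple problem from the Zero-Shot CoT example: the model writes its reasoning as Python, and an external interpreter runs the code to obtain the answer.

Q: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

# solution in Python:
apples = 10
apples -= 2 + 2   # given to the neighbor and the repairman
apples += 5       # bought more
apples -= 1       # eaten
answer = apples   # the interpreter evaluates this to 10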

ReAct Prompting

Paper: 《ReAct: Synergizing Reasoning and Acting in Language Models》
The core idea is to interleave multiple thought-action-observation steps to form a task-solving trajectory.
In practice, for a given task, the thought-action-observation loop is executed n times, the results of the n runs are aggregated, and the best answer is output. The overhead is fairly high, but the idea itself is simple.
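
A sketch of one such trajectory (Search and Finish follow the paper's Wikipedia-based setup; the content here is illustrative, reusing the Eiffel Tower example from the QA section):

Thought: I need to find out when the Eiffel Tower was constructed.
Action: Search[Eiffel Tower]
Observation: The Eiffel Tower is located in Paris, France. It was constructed in 1889.
Thought: The observation gives the construction year directly.
Action: Finish[1889]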

Reflection

Reflection is, in my view, a very useful prompting idea, and one that will play a key role in the development of AGI. Related paper:

《Reflexion: Language Agents with Verbal Reinforcement Learning》

Steps:

  1. Generate a trajectory
  2. Evaluate it
  3. Reflect on the result
  4. Generate the next trajectory

This reflection process applies, for example, to decision-making, programming, and reasoning.
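
A minimal sketch of the four-step loop, assuming hypothetical LLM-backed act(), evaluate(), and reflect() helpers.

def reflexion(task, max_trials=4):
    memory = []  # verbal reflections carried across trials
    trajectory = None
    for _ in range(max_trials):
        trajectory = act(task, memory)       # 1. generate a trajectory
        reward, done = evaluate(trajectory)  # 2. evaluate it
        if done:
            break
        # 3. verbal self-reflection stored in memory
        memory.append(reflect(task, trajectory, reward))
        # 4. the next iteration generates a new trajectory using the memory
    return trajectory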

代码:

https://github.com/noahshinn/reflexion

In the "western town" generative-agents project, all 25 agents created are capable of reflection.
《Generative Agents: Interactive Simulacra of Human Behavior》

The reflection tree for the agent Klaus: observe -> reflect -> observe -> reflect

A reflection tree for Klaus Mueller. The agent’s observations of the world, represented in the leaf nodes, are recursively synthesized to derive Klaus’s self-notion that he is highly dedicated to his research.

Useful Prompts

  • Let’s think step by step.
  • We should think about this step by step.
  • First,
  • Before we dive into the answer,
  • Proof followed by the answer.
  • Let’s think step by step in a realistic way.
  • Let’s think step by step using common sense and knowledge.
  • Let’s think like a detective step by step.
  • Let’s think about this logically.
  • Let’s think step by step. First,
  • Let’s think
  • Let’s solve this problem by splitting it into steps.
  • The answer is after the proof.
  • Let’s be realistic and think step by step.
  • APE: Let’s work this out in a step by step way to be sure we have the right answer
  • Now, please write…, following the formatting of the examples above.
