
CodeAsPolicies-JackyLiang等

项目地址

项目站:https://code-as-policies.github.io/

想要复现该论文的代码,请访问:https://github.com/google-research/google-research/tree/master/code_as_policies

主代码下载 Experiment_ Robot Code-Gen Benchmark.ipynb 即可。如需论文中其他效果对比实验的验证代码,可以选择下载其他 ipynb 文件。

Colab: https://colab.research.google.com/drive/124TE4TsGYyrvduzeDclufyvwcc2qbbrE

Paper原文:https://arxiv.org/abs/2209.07753

1. 摘要

Abstract—Large language models (LLMs) trained on code-completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously re-compose API calls to generate new policy code respectively. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) to ambiguous descriptions ("faster") depending on context (i.e., behavioral commonsense). This paper presents Code as Policies: a robot-centric formulation of language model generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers), as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated across multiple real robot platforms. Central to our approach is prompting hierarchical code-gen (recursively defining undefined functions), which can write more complex code and also improves state-of-the-art to solve 39.8% of problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io

翻译

经过代码补全任务训练的大型语言模型(LLMs)已被证明能够根据文档字符串(docstrings)生成简单的Python程序[1]。我们发现,这些擅长编写代码的LLM可通过重新定位用途,依据自然语言指令生成机器人策略代码。具体而言,策略代码可表述为处理感知输出(如来自目标检测器[2][3])的函数或反馈回路,并参数化控制原语的应用程序编程接口(API)。当输入多个自然语言指令样例(以注释形式格式化)及其对应策略代码(通过少量示例提示)时,LLM能够接收新指令并自主重组API调用以生成相应的新策略代码。通过链式调用经典逻辑结构及引用第三方库(如NumPy、Shapely)执行算术运算,这种LLM可生成的机器人策略能够:(i) 展现空间几何推理能力;(ii) 泛化至新指令;(iii) 根据上下文(即行为常识)对模糊描述(如"更快")赋予精确参数值(如速度值)。本文提出"代码即策略"(Code as Policies)框架——一种以机器人为核心的语言模型生成程序(LMPs)表述方式,能够表示反应式策略(如阻抗控制)和基于路径点的策略(视觉抓取放置、轨迹控制),并在多个真实机器人平台上完成验证。本方法的核心在于分层代码生成提示策略(通过递归定义未实现函数),该策略不仅能编写更复杂代码,还将HumanEval基准测试[1]的最优解决率提升至39.8%。

信息

机器人策略(Robot Policy)指的是一套系统化的决策逻辑或行为规则,用于控制机器人如何根据感知到的情况(如传感器数据、视觉信息)生成具体的动作指令。

通俗来说,就是机器人从"所感"到"所想",再到"所做"的这一层。

摘要中指出,原先的研究证明了文档字符串(docstrings)可以用来生成简单的Python程序,而该团队的进展是通过自然语言指令也可以生成策略代码。必须指出的是,文档字符串和自然语言指令是有区别的:在下文中,自然语言指令虽然同样以注释的形式输入给模型,但其内容是要执行的命令,而非对某个函数行为的说明。

2. 引言

Robots that use language need it to be grounded (or situated) to reference the physical world and bridge connections between words, percepts, and actions [4]. Classic methods ground language using lexical analysis to extract semantic representations that inform policies [5]–[7], but they often struggle to handle unseen instructions. More recent methods learn the grounding end-to-end (language to action) [8]–[10], but they require copious amounts of training data, which can be expensive to obtain on real robots. Meanwhile, recent progress in natural language processing shows that large language models (LLMs) pretrained on Internet-scale data [11]–[13] exhibit out-of-the-box capabilities [14]–[16] that can be applied to language-using robots e.g., planning a sequence of steps from natural language instructions [16]–[18] without additional model finetuning. These steps can be grounded in real robot affordances from value functions among a fixed set of skills i.e., policies pretrained with behavior cloning or reinforcement learning [19]–[21]. While promising, this abstraction prevents the LLMs from directly influencing the perception-action feedback loop, making it difficult to ground language in ways that (i) generalize modes of feedback that share percepts and actions e.g., from "put the apple down on the orange" to "put the apple down when you see the orange", (ii) express commonsense priors in control e.g., "move faster", "push harder", or (iii) comprehend spatial relationships "move the apple a bit to the left". As a result, incorporating each new skill (and mode of grounding) requires additional data and retraining – ergo the data burden persists, albeit passed to skill acquisition. This leads us to ask: how can LLMs be applied beyond just planning a sequence of skills?

Herein, we find that code-writing LLMs [1], [11], [22] are proficient at going further: orchestrating planning, policy logic, and control. LLMs trained on code-completion have shown to be capable of synthesizing Python programs from docstrings. We find that these models can be re-purposed to write robot policy code, given natural language commands (formatted as comments). Policy code can express functions or feedback loops that process perception outputs (e.g., open vocabulary object detectors [2], [3]) and parameterize control primitive APIs (see Fig. 1). When provided with several example language commands followed by corresponding policy code (via few-shot prompting, in gray), LLMs can take in new commands (in green) and autonomously re-compose the API calls to generate new policy code (highlighted) respectively:

# if you see an orange, move backwards.
if detect_object("orange"):
    robot.set_velocity(x=-0.1, y=0, z=0)
# move rightwards until you see the apple.
while not detect_object("apple"):
    robot.set_velocity(x=0, y=0.1, z=0)

Code-writing models can express a variety of arithmetic operations as well as feedback loops grounded in language. They not only generalize to new instructions, but having been trained on billions of lines of code and comments, can also prescribe precise values (e.g., velocities) to ambiguous descriptions ("faster" and "to the left") depending on context – to elicit behavioral commonsense:

# do it again but faster, to the left, and with a banana.
while not detect_object("banana"):
robot.set_velocity(x=0, y=-0.2, z=0)

Representing code as policies inherits a number of benefits from LLMs: not only the capacity to interpret natural language, but also the ability to engage in human-robot dialogue and Q&A simply by using "say(text)" as an available action primitive API:

# tell me why you stopped moving.
robot.say("I stopped moving because I saw a banana.")

We present Code as Policies (CaP): a robot-centric formulation of language model generated programs (LMPs) executed on real systems. Pythonic LMPs can express complex policies using:

• Classic logic structures e.g., sequences, selection (if/else), and loops (for/while) to assemble new behaviors at runtime.

• Third-party libraries to interpolate points (NumPy), analyze and generate shapes (Shapely) for spatial-geometric reasoning, etc.

LMPs can be hierarchical: prompted to recursively define new functions, accumulate their own libraries over time, and self-architect a dynamic codebase. We demonstrate across several robot systems that LLMs can autonomously interpret language commands to generate LMPs that represent reactive low-level policies (e.g., PD or impedance controllers), and waypoint-based policies (e.g., for vision-based pick and place, or trajectory-based control). Our main contributions are: (i) code as policies: a formulation of using LLMs to write robot code, (ii) a method for hierarchical code-gen that improves state-of-the-art on both robotics and standard code-gen problems with 39.8% P@1 on HumanEval [1], (iii) a new benchmark to evaluate future language models on robotics code-gen problems, and (iv) ablations that analyze how CaP improves metrics of generalization [23] and that it abides by scaling laws – larger models perform better. Code as policies presents a new approach to linking words, percepts, and actions; enabling applications in human-robot interaction, but is not without limitations. We discuss these in Sec. V. Full prompts and generated outputs are in the Appendix, which can be found along with additional results, videos, and code at code-as-policies.github.io.

翻译

为实现与物理世界的交互,依赖语言的机器人需将语言具象化(grounded)——即建立词汇、感知与动作之间的关联[4]。传统方法通过词法分析提取语义表征以指导策略(policies)[5]–[7],但面对未见指令时表现受限。较新的端到端学习方法(语言直接映射到动作)[8]–[10]虽然能够提升泛化性,却需要大量真实机器人数据进行训练,成本高昂。

与此同时,自然语言处理领域的最新进展表明:通过互联网规模数据预训练的大型语言模型(LLMs) [11]–[13]具备直接应用于语言交互机器人的能力——例如,无需额外微调即可从自然语言指令中规划行为序列[16]–[18]。这些行为序列可基于固定技能集合(通过行为克隆或强化学习预训练的既定策略[19]–[21])在真实机器人功能可供性中通过价值函数建立关联。尽管此类方法前景广阔,但抽象分层(LLM仅规划技能调用)隔断了其对感知-动作反馈回路的直接影响,导致以下局限性:

  1. 难以泛化共享感知与动作的反馈模式

例如将“将苹果放在橙子上”扩展为“当检测到橙子时放下苹果”需重新设计逻辑。

  2. 无法表达控制中的常识性先验

如“更快移动”、“加大推力”需依赖隐含物理知识。

  3. 空间关系理解受限

如“将苹果稍微左移”需调用几何计算而非单纯路径规划。

因此,每增加一项新技能(或其关联的具象化模式)均需重新收集数据并整体训练——这虽将数据负担转移至技能获取阶段,但问题本质未变。这引出了核心问题:如何扩展LLMs的应用,使其超越单一技能序列规划?

我们的发现是:擅长代码生成的LLMs[1][11][22]可更进一步——能够统筹规划、策略逻辑与控制。代码补全训练的LLMs已可基于文本注释生成Python程序,而本研究证明其可被改造为按自然语言指令(注释形式)编写机器人策略代码。策略代码可定义函数或反馈循环以处理感知输出(如开放词汇物体检测器[2][3]结果)并参数化控制原语API(见图1)。当输入包含多个<指令-代码>示例(通过少量示例提示,灰色部分)后,LLM可针对新指令(绿色部分)自主重组API调用生成对应策略代码(高亮部分):

# if you see an orange, move backwards.
if detect_object("orange"):
    robot.set_velocity(x=-0.1, y=0, z=0)
# move rightwards until you see the apple.
while not detect_object("apple"):
    robot.set_velocity(x=0, y=0.1, z=0)

代码生成模型能够通过语言具象化表达多样化的算术运算与反馈循环。此类模型不仅能够泛化理解新指令,其基于海量代码与注释训练(规模达数十亿行)的特性,还使其可根据上下文环境——例如对"更快"这类模糊描述或"向左轻微移动"等空间语义——动态赋予精确参数值(如速度值、坐标偏移量),从而在控制过程中自然融入行为常识:

# do it again but faster, to the left, and with a banana.
while not detect_object("banana"):
robot.set_velocity(x=0, y=-0.2, z=0)
提示

这里明确指出,该论文所研究的代码生成和决策能力依赖于外部大模型。经过测试,用于代码补全的最优LLM是 code-davinci-002;策略LLM可用 gpt-3.5-turbo-instruct(指令提示类模型);此外Baseline还使用了 text-curie-001 模型。基线方法为 CLIPort vs. CaP。

在下文和代码中可以看出,本文的重点是为这个外部大模型编写一套用于多层决策的框架。

将代码作为策略的表征方式继承了大型语言模型(LLMs)的多项优势:不仅能够解析自然语言指令,还能通过调用简单的基础动作API(如 say(text)),直接实现人机对话与问答交互功能。

# tell me why you stopped moving.
robot.say("I stopped moving because I saw a banana.")

本文提出代码即策略(Code as Policies, CaP):一种面向实机系统、基于语言模型生成程序(Language Model Generated Programs, LMPs)的机器人核心框架。基于Python的LMPs可通过以下方式实现复杂策略:

经典逻辑结构:序列化指令、条件分支(if/else)及循环控制(for/while),动态组合运行时行为。

第三方库调用:利用数值计算库(如NumPy插值)、几何分析库(如Shapely形状生成)实现空间推理等功能。

LMPs支持层级架构:通过递归定义新函数、持续累积自有函数库,可自主构建动态代码基座。我们在多个真实机器人系统上验证,LLMs能自主解析语言指令生成LMPs,其涵盖的能力包含:

反应式底层控制策略:PD控制、阻抗控制等。

基于路径点的策略:如视觉引导抓放、轨迹规划控制等。

主要贡献:

1.代码即策略框架:将LLMs生成代码拓展至机器人领域的形式化模型。

2.分层代码生成方法:在代码生成任务中刷新SOTA(HumanEval基准达39.8%首次通过率[1])。

3.机器人代码生成评估基准:为未来语言模型的机器人代码生成能力提供量化标准。

4.消融实验验证:分析CaP对泛化性能[23]的提升,并验证其遵循模型规模效应(参数量越大性能越优)。

CaP通过链接语言-感知-动作,开辟了人机交互新路径,但也存在局限性(详见第V节)。完整提示词、生成示例代码及附加结果、视频、源码均发布于项目网站 code-as-policies.github.io。
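上文提到 LMP 可借助 NumPy、Shapely 等第三方库进行空间几何推理。下面给出一个最小示意(非论文原代码,桌面区域坐标为假设值),演示用 Shapely 判断一个点是否落在某个多边形区域内:

from shapely.geometry import Point, Polygon

# 将桌面近似为一个矩形多边形(坐标单位:米,数值为假设)
table = Polygon([(0, 0), (0.6, 0), (0.6, 0.4), (0, 0.4)])
pt = Point(0.3, 0.2)
print(table.contains(pt))  # True:该点位于桌面区域内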

3. 相关工作

Controlling robots via language has a long history, including early demonstrations of human-robot interaction through lexical parsing of natural language [5]. Language serves not only as an interface for non-experts to interact with robots [24], [25], but also as a means to compositionally scale generalization to new tasks [9], [17]. The literature is vast (we refer to Tellex et al. [4] and Luketina et al. [26] for comprehensive surveys), but recent works fall broadly into the categories of high-level interpretation (e.g., semantic parsing [25], [27]–[32]), planning [14], [17], [18], and low-level policies (e.g., model-based [33]–[35], imitation learning [8], [9], [36], [37], or reinforcement learning [38]–[42]). In contrast, our work focuses on the code generation aspect of LLMs and uses the generated procedures as an expressive way to control the robot.

Large language models exhibit impressive zero-shot reasoning capabilities: from planning [14] to writing math programs [43]; from solving science problems [44] to using trained verifiers [45] for math word problems. These can be improved with prompting methods such as Least-to-Most [46], Think-Step-by-Step [15] or Chain-of-Thought [47]. Most closely related to this paper are works that use LLM capabilities for robot agents without additional model training. For example, Huang et al. decompose natural language commands into sequences of executable actions by text completion and semantic translation [14], while SayCan [17] generates feasible plans for robots by jointly decoding an LLM weighted by skill affordances [20] from value functions. Inner Monologue [18] expands LLM planning by incorporating outputs from success detectors or other visual language models and uses their feedback to re-plan. Socratic Models [16] uses visual language models to substitute perceptual information (in teal) into the language prompts that generate plans, and it uses language-conditioned policies e.g., for grasping [36]. The following example illustrates the qualitative differences between our approach versus the aforementioned prior works. When tasked to "move the coke can a bit to the right":

LLM Plan [14], [17], [18]
1. Pick up coke can
2. Move a bit right
3. Place coke can
Socratic Models Plan [16]
objects = [coke can]
1. robot.grasp(coke can) open vocab
2. robot.place_a_bit_right()

Plans generated by prior works assume there exists a skill that allows the robot to move an object a bit right. Our approach differs in that it uses an LLM to directly generate policy code (plans nested within) to run on the robot and avoids the requirement of having predefined policies to map every step in the plan:

Code as Policies (ours)
while not obj_in_gripper("coke can"):
robot.move_gripper_to("coke can")
robot.close_gripper()
pos = robot.gripper.position
robot.move_gripper(pos.x, pos.y+0.1, pos.z)
robot.open_gripper()

Our approach (CaP) not only leverages logic structures to specify feedback loops, but it also parameterizes (and writes parts of) low-level control primitives. CaP alleviates the need to collect data and train a fixed set of predefined skills or language-conditioned policies – which are expensive and often remain domain-specific.

Code generation has been explored with LLMs [1], [48] and without [49]. Program synthesis has been demonstrated to be capable of drawing simple figures [50] and generating policies that solve 2D tasks [51]. We expand on these works, showing that (i) code-writing LLMs enable novel reasoning capabilities (e.g., encoding spatial relationships by leaning on familiarity of third party libraries) without additional training needed in prior works [35], [36], [52]–[56], and (ii) hierarchical code-writing (inspired by recursive summarization [57]) improves state-of-the-art code generation. We also present a new robotics-themed code-gen benchmark to evaluate future language models in the robotics domain.

翻译

语言操控机器人技术发展沿革悠久,早期以自然语言词法解析为基础实现人机交互[5]。语言不仅为非专业用户提供控制接口[24][25],亦被用于通过组合式泛化扩展至新任务[9][17]。相关文献浩繁(详见Tellex等[4]与Luketina等[26]综述),近期研究可归类为:

  • 高层语义解析:从自然语言指令中提取结构化语义(如任务逻辑图)[25][27]–[32]
  • 行为规划层:分解指令为技能调用序列[14][17][18]
  • 底层策略控制:

  1. 基于模型的方法(动力学模型推导控制律)[33]–[35]
  2. 模仿学习(示教数据驱动)[8][9][36][37]
  3. 强化学习(奖励函数优化)[38]–[42]

本研究创新点聚焦于利用LLMs生成可执行代码直接控制机器人底层,突破传统范式限制。

提示

这里我认为本文强调的创新点确实属于第四种范式,原因是利用LLMs生成的代码的方式与传统范式有较大差异。

差异之处在于,LLMs是基于什么生成的代码?我们都知道LLM的输入是自然语言,而生成代码的方式依靠的主要还是LLM这一支所依赖的学习方法,这是一大差异之处。

如果说强化学习和模仿学习是一种思考如何去做的方式,那么大模型的提示(甚至是推理大模型的推理链)方式就是一种不一样的思考方式。为了补足这一思考方式对于精细化、专业化的不足之处,本文提出了逻辑构建和底层控制参数化,还有供以验证的RoboCodeGen基准。

在下文中提及了LLMs生成代码的方式是自回归方式,也印证了这一点。

大型语言模型(LLMs)的零样本推理能力拓展:

多领域研究已证明LLMs具备零样本推理能力:数学程序生成[43]、科学问题解答[44]、训练验证器辅助数学题求解[45]等,且通过提示优化策略(如Least-to-Most[46]、逐步思考[15]、思想链[47])可进一步提升性能。

机器人领域相关方法对比:

近年研究尝试直接利用LLMs赋能机器人,无须额外微调:

  • Huang等[14]: 通过文本补全与语义翻译分解指令为动作序列

  • SayCan[17]: 结合技能可供性价值函数进行联合解码生成可行计划

  • Inner Monologue[18]: 整合检测器与视觉语言模型反馈实现动态重规划

  • Socratic Models[16]: 使用视觉语言模型注入感知信息(青色部分)至规划提示词

示例对比:CaP与传统方法的核心差异:

以"将可乐罐稍向右移"任务为例,不同方法的指令解析结果如下。先前方法生成的计划(假设存在预定义技能):

LLM Plan [14], [17], [18]
1. Pick up coke can
2. Move a bit right
3. Place coke can
Socratic Models Plan [16]
objects = [coke can]
1. robot.grasp(coke can) open vocab
2. robot.place_a_bit_right()

假设缺陷:须预先定义place_a_bit_right()等具体技能,导致泛化能力受限。

本研究(Code as Policies, CaP)生成策略代码:

Code as Policies (ours)
while not obj_in_gripper("coke can"):
robot.move_gripper_to("coke can")
robot.close_gripper()
pos = robot.gripper.position
robot.move_gripper(pos.x, pos.y+0.1, pos.z)
robot.open_gripper()

核心突破:

  • 动态逻辑构建:集成反馈循环(如夹爪运动控制)

  • 底层控制参数化:直接生成运动学参数(如y轴+0.1米位移)

  • 规避预定义技能需求:降低数据收集与域适配成本

早前代码生成研究包含基于LLMs[1][48]与非LLMs方法[49],如生成二维任务策略[51]。本文贡献包括:

  • 代码生成新能力验证: 利用第三方库(如Shapely)实现空间几何推理,无需先前工作[35][36][52]–[56]中所需的额外训练;分层式代码生成(受递归摘要[57]启发)显著提升HumanEval基准至39.8%首次通过率
  • 机器人专用评测基准: 为未来模型在机器人代码生成领域提供量化评估标准

4. 方法

In this section, we characterize the extent to which pretrained LLMs can be prompted to generate code as policies – represented as a set of language model programs (LMPs). Broadly, we use the term LMP to refer to any program generated by a language model and executed on a system. This work investigates Code as Policies, a class of LMPs that maps from language instructions to code snippets that (i) react to perceptual inputs (i.e., from sensors or modules on top of sensors), (ii) parameterize control primitive APIs, and (iii) are directly compiled and executed on a robot, for example:

# stack the blocks in the empty bowl.
empty_bowl_name = parse_obj('empty bowl')
block_names = parse_obj('blocks')
obj_names = [empty_bowl_name] + block_names
stack_objs_in_order(obj_names=obj_names)

Input instructions are formatted as comments (green), which can be provided by humans or written by another LMP. Predicted outputs from the LLM (highlighted) are expected to be valid Python code, generated autoregressively [11], [12]. LMPs are few-shot prompted with examples to generate different subprograms that may process object detection results, build trajectories, or sequence control primitives. LMPs can be generated hierarchically by composing known functions (e.g., get_obj_names() using perception modules) or invoking other LMPs to define undefined functions:

# define function stack_objs_in_order(obj_names).
def stack_objs_in_order(obj_names):
    for i in range(len(obj_names) - 1):
        put_first_on_second(obj_names[i + 1], obj_names[i])

where put_first_on_second is an existing open vocabulary pick and place primitive (e.g., CLIPort [36]). For new embodiments, these active function calls can be replaced with available control APIs that represent the action space (e.g., set_velocity ) of the agent. Hierarchical code-gen with verbose variable names can be viewed as a variant of chain of thought prompting [47] via functional programming. Functions defined by LMPs can progressively accumulate over time, where new LMPs can reference previously constructed functions to expand policy logic.

To execute an LMP, we first check that it is safe to run by ensuring there are no import statements, special variables that begin with __ , or calls to exec and eval . Then, we call Python's exec function with the code as the input string and two dictionaries that form the scope of that code execution: (i) globals , containing all APIs that the generated code might call, and (ii) locals , an empty dictionary which will be populated with variables and new functions defined during exec . If the LMP is expected to return a value, we obtain it from locals after exec finishes.

翻译

本章探讨如何通过提示(prompting)使预训练大型语言模型(LLMs)生成代码即策略(Code as Policies)——表现为一组语言模型生成程序(LMPs)。广义上,LMP指代由语言模型生成并在系统上执行的程序。本文研究的CaP类LMPs可将语言指令映射为满足以下特性的代码片段:

  • 感知响应性:接收传感器或感知模块(如目标检测器)输入
  • 控制原语参数化:调整底层控制API的参数
  • 可直接编译执行:无须中间层转换,例如:
# stack the blocks in the empty bowl.
empty_bowl_name = parse_obj('empty bowl')
block_names = parse_obj('blocks')
obj_names = [empty_bowl_name] + block_names
stack_objs_in_order(obj_names=obj_names)

输入指令以绿色注释形式提供(可由人类或另一LMP生成),LLM生成的响应代码(高亮部分)需为有效Python代码(基于自回归生成机制[11][12])。LMPs通过少量示例提示生成不同子程序,实现目标检测结果处理、轨迹构建或控制原语序列化。通过分层生成策略:

  • 组合已知函数:如利用感知模块的get_obj_names()

  • 递归定义未实现函数:通过调用子LMPs填补功能缺口

提示

相当于是说,分层逻辑的功能实现基本都是基于构建更合理的LMP。

# define function stack_objs_in_order(obj_names).
def stack_objs_in_order(obj_names):
for i in range(len(obj_names) - 1):
put_first_on_second(obj_names[i + 1], obj_names[i])

其中put_first_on_second为预设的开放词汇抓放原语(如CLIPort[36])。针对新机器人本体,可替换为对应动作空间API(如set_velocity)。采用语义化变量名的分层代码生成可视作函数式编程形态的思想链提示[47]。LMPs定义的函数可持续累积,后续代码可引用前期函数扩展策略逻辑。
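下面给出分层代码生成的一个最小化示意(非论文官方实现;llm 为假设的"提示→代码"接口,known_fns 为已知函数名集合):先用 AST 找出生成代码中被调用但尚未定义的函数,再递归地为它们生成实现:

import ast
import builtins

def undefined_fn_calls(code_str, known_fns):
    # 用 AST 找出代码中被调用、但既未定义也非内置的函数名
    tree = ast.parse(code_str)
    called = {n.func.id for n in ast.walk(tree)
              if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
    return called - set(known_fns) - set(dir(builtins))

def hierarchical_codegen(fn_name, llm, known_fns):
    # llm(prompt) -> 代码字符串;假设模型最终收敛到只调用已知函数
    code = llm(f"# define function {fn_name}.")
    for sub_fn in sorted(undefined_fn_calls(code, known_fns)):
        known_fns = known_fns | {sub_fn}  # 记录已生成的函数,避免重复递归
        code = hierarchical_codegen(sub_fn, llm, known_fns) + "\n" + code
    return code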

执行前进行安全检查,禁用危险操作:禁止import语句、双下划线变量及exec/eval调用。

沙箱执行环境:

  1. 调用Python的exec函数执行代码
  2. 通过两个字典界定作用域:globals包含所有可调用API的全局命名空间;locals初始为空,用于存储执行中生成的变量与函数
  3. 若需返回值,从执行后locals字典中提取
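按上述描述,可以写出一个最小化的执行示意(非论文官方实现;安全检查用朴素的子串匹配,仅作演示):

import numpy as np

def exec_safe(code_str, gvars=None, lvars=None):
    # 朴素的安全检查(子串匹配,仅为示意):禁止 import、双下划线与 exec/eval
    for banned in ['import', '__', 'exec', 'eval']:
        assert banned not in code_str, f'unsafe code: {banned}'
    gvars = dict(gvars or {})
    lvars = {} if lvars is None else lvars
    noop = lambda *a, **k: None
    gvars.update({'exec': noop, 'eval': noop})  # 进一步屏蔽内置 exec/eval
    exec(code_str, gvars, lvars)
    return lvars

# 用法示意:globals 提供可调用 API,返回值从 locals 中取出
out = exec_safe("ret_val = np.mean([1.0, 2.0, 3.0])", gvars={'np': np})
print(out['ret_val'])  # 2.0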

A. Prompting Language Model Programs

Prompts to generate LMPs contain two elements:

  1. Hints e.g., import statements that inform the LLM which APIs are available and type hints on how to use those APIs.
import numpy as np
from utils import get_obj_names, put_first_on_second
  2. Examples are instruction-to-code pairs that present few-shot "demonstrations" of how natural language instructions should be converted into code. These may include performing arithmetic, calling other APIs, and other features of the programming language. Instructions are written as comments directly preceding a block of corresponding solution code. We can maintain an LMP "session" by incrementally appending new instructions and responses to the prompt, allowing later instructions to refer back to previous instructions, like "undo the last action".
翻译

生成LMP的提示包含两类要素:

  1. 核心提示元素(Hints):
  • API导入声明:限定模型可用函数库

  • 类型提示:引导参数使用规范

import numpy as np
from utils import get_obj_names, put_first_on_second
  2. 示例指令-代码对:
  • 格式模板:指令以注释前置,代码块紧随其后

  • 上下文延续:依会话持续添加新指令与响应,支持引用历史操作(如"撤销上一步")
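下面是提示组装与会话维护的一个极简示意(提示内容与函数名均为示例假设):Hints 在前,历史 <指令, 代码> 对在后,新指令以注释形式追加,交由 LLM 续写代码:

# 极简的 LMP 提示组装与会话维护示意(所有提示内容均为假设示例)
HINTS = "import numpy as np\nfrom utils import get_obj_names, put_first_on_second\n"

def build_prompt(session, new_instruction):
    # Hints + 历史 <指令, 代码> 对 + 新指令注释,交给 LLM 续写代码
    return HINTS + session + f"# {new_instruction}.\n"

def append_exchange(session, instruction, generated_code):
    # 将新一轮交互追加进会话,使后续指令能引用历史(如 "undo the last action")
    return session + f"# {instruction}.\n{generated_code}\n"

session = append_exchange("", "put the red block on the blue bowl",
                          "put_first_on_second('red block', 'blue bowl')")
print(build_prompt(session, "undo the last action"))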

B. Example Language Model Programs (Low-Level)

LMPs are perhaps best understood through examples, to which the following section builds up from simple pure-Python instructions to more complex ones that can complete robot tasks. All examples and experiments in this paper, unless otherwise stated, use OpenAI Codex code-davinci-002 with temperature 0 (i.e., deterministic greedy token decoding). Here, the prompt (in gray) starts with a Hint to indicate we are writing Python. It then gives one Example to specify the format of the return values, to be assigned to a variable called ret_val . Input instructions are green, and generated outputs are highlighted:

# Python script
# get the variable a.
ret_val = a
# find the sum of variables a and b.
ret_val = a + b
# see if any number is divisible by 3 in a list called xs.
ret_val = any(x % 3 == 0 for x in xs)

Third-party libraries. Python code-writing LLMs store knowledge of many popular libraries. LMPs can be prompted to use these libraries to perform complex instructions without writing all of the code e.g., using NumPy to elicit spatial reasoning with coordinates. Hints here include import statements, and Examples define cardinal directions. Variable names are also important to indicate that pts_np and pt_np are NumPy arrays. Operations with 2D vectors imply that the points are also 2D. Example:

import numpy as np
# move all points in pts_np toward the right.
ret_val = pts_np + [0.3, 0]
# move a pt_np toward the top.
ret_val = pt_np + [0, 0.3]
# get the left most point in pts_np.
ret_val = pts_np[np.argmin(pts_np[:, 0]), :]
# get the center of pts_np.
ret_val = np.mean(pts_np, axis=0)
# the closest point in pts_np to pt_np.
ret_val = pts_np[np.argmin(np.sum((pts_np - pt_np)**2, axis=1))]

First-party libraries. LMPs can also use first-party libraries (perception or control primitive APIs) not found in the training data if those functions have meaningful names and are provided in Hints/Examples. For example (full prompt in B.2):

from utils import get_pos, put_first_on_second
...
# move the purple bowl toward the left.
target_pos = get_pos('purple bowl') + [-0.3, 0]
put_first_on_second('purple bowl', target_pos)
objs = ['blue bowl', 'red block', 'red bowl', 'blue block']
# move the red block a bit to the right.
target_pos = get_pos('red block') + [0.1, 0]
put_first_on_second('red block', target_pos)
# put the blue block on the bowl with the same color.
put_first_on_second('blue block', 'blue bowl')

The Hints import two functions for a robot domain: one to obtain the 2D position of an object by name (using an open vocabulary object detector [2]) and another to put the first object on the second target, which can be an object name or a 2D position. Note the LMP's ability to adapt to new instructions — the first modifies the movement magnitude by using "a bit," while the second associates the object with "the same color." Language reasoning can be few-shot prompted using code-writing LLMs (full prompt in B.1) to e.g., associate object names with natural language descriptions ("sea-colored block"), categories ("bowls"), or past context ("other block"):

objs = ['blue bowl', 'red block', 'red bowl', 'blue block']
# the bowls.
ret_val = ['blue bowl', 'red bowl']
# sea-colored block.
ret_val = 'blue block'
# the other block.
ret_val = 'red block'
翻译

通过示例展现LMPs从基础指令到复杂任务的能力(实验默认使用OpenAI Codex code-davinci-002,温度参数0即确定性生成)。

基础数值操作示例:提示部分(灰色)以Python声明开头,并通过一个示例(Example)定义返回值变量的规范:

# Python script
# get the variable a.
ret_val = a
# find the sum of variables a and b.
ret_val = a + b
# see if any number is divisible by 3 in a list called xs.
ret_val = any(x % 3 == 0 for x in xs)

第三方库调用:空间坐标推理

利用LLM对常见库的知识存储(如NumPy),通过变量命名(pts_np/pt_np)暗示数据结构:

import numpy as np
# move all points in pts_np toward the right.
ret_val = pts_np + [0.3, 0]
# move a pt_np toward the top.
ret_val = pt_np + [0, 0.3]
# get the left most point in pts_np.
ret_val = pts_np[np.argmin(pts_np[:, 0]), :]
# get the center of pts_np.
ret_val = np.mean(pts_np, axis=0)
# the closest point in pts_np to pt_np.
ret_val = pts_np[np.argmin(np.sum((pts_np - pt_np)**2, axis=1))]

自有库调用:语义化机器人操作

结合自主实现的感知与控制API:

from utils import get_pos, put_first_on_second
...
# move the purple bowl toward the left.
target_pos = get_pos('purple bowl') + [-0.3, 0]
put_first_on_second('purple bowl', target_pos)
objs = ['blue bowl', 'red block', 'red bowl', 'blue block']
# move the red block a bit to the right.
target_pos = get_pos('red block') + [0.1, 0]
put_first_on_second('red block', target_pos)
# put the blue block on the bowl with the same color.
put_first_on_second('blue block', 'blue bowl')

语义推理增强

通过少量示例提示,令LLM关联:

objs = ['blue bowl', 'red block', 'red bowl', 'blue block']
# the bowls.
ret_val = ['blue bowl', 'red bowl']
# sea-colored block.
ret_val = 'blue block'
# the other block.
ret_val = 'red block'

核心机制对比优势

CaP通过层级代码生成将自然语言直接映射至精确控制参数(如坐标偏移量),突破传统方法对预定义技能的依赖,在减少数据收集成本的同时实现更高泛化能力。

提示

有必要指出的是,Baseline代码还是比较简单的:针对模糊描述,可以构建提示词,baseline中位移幅度只取0.1、0.15、0.2三档,可以看作是一种近似模拟;后续如需更细腻的调整,可以改进此处代码。
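若要自行实验,可以按此思路写一个假设性的"模糊程度→位移量"映射(档位数值沿用上面提到的 0.1/0.15/0.2,纯属示意,非论文代码):

# 假设的"模糊程度 → 位移量(米)"映射,对应上文的三档取值
FUZZY_OFFSETS = {'a tiny bit': 0.1, 'a bit': 0.15, 'a lot': 0.2}

def offset_for(phrase, default=0.15):
    # 未覆盖的描述取中档默认值
    return FUZZY_OFFSETS.get(phrase, default)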

C. Example Language Model Programs (High-Level)

Control flows. Programming languages allow using control structures such as if-else and loop statements. Previously we showed LMPs can express for-loops in the form of list comprehensions. Here we show how they can write a while-loop to form a simple feedback policy. Note that the prompt (same as the one in B.2) does not contain such Examples:

# while the red block is to the left of the blue bowl, move it to the right 5cm at a time.
while get_pos('red block')[0] < get_pos('blue bowl')[0]:
    target_pos = get_pos('red block') + [0.05, 0]
    put_first_on_second('red block', target_pos)

LMPs can be composed via nested function calls. This allows including more few-shot examples into individual prompts to improve functional accuracy and scope, while remaining within the LLM's maximum input token length. The following (full prompt in B.4) generates a response that uses parse_obj , another LMP that associates object names with language descriptions:

objs = ['red block', 'blue bowl', 'blue block', 'red bowl']
# while the left most block is the red block, move it toward the right.
block_name = parse_obj('the left most block')
while block_name == 'red block':
    target_pos = get_pos(block_name) + [0.3, 0]
    put_first_on_second(block_name, target_pos)
    block_name = parse_obj('the left most block')

The parse_obj LMP (full prompt in Appendix B.5):

objs = ['red block', 'blue bowl', 'blue block', 'red bowl']
# the left most block.
block_names = ['red block', 'blue block']
block_positions = np.array([get_pos(name) for name in block_names])
left_block_name = block_names[np.argmin(block_positions[:, 0])]
ret_val = left_block_name

We describe more on prompt engineering in the Appendix A.

翻译

控制流结构:

编程语言的控制结构(如条件分支、循环)支持构建动态策略。LMPs可生成包含循环逻辑的反馈策略代码。

注意

注意:生成此代码所用的提示(与B.2节相同)并未包含此类循环示例:

# while the red block is to the left of the blue bowl, move it to the right 5cm at a time.
while get_pos('red block')[0] < get_pos('blue bowl')[0]:
    target_pos = get_pos('red block') + [0.05, 0]
    put_first_on_second('red block', target_pos)

函数嵌套组合:

通过分层调用LMPs函数,可在单次提示中整合多个示例以提升功能精度与范围(同时受限于模型最大输入长度)。下例(完整提示见附录B.4)的代码生成过程使用了parse_obj(另一LMP,功能为实现语言描述与物体名称的关联解析):

objs = ['red block', 'blue bowl', 'blue block', 'red bowl']
# while the left most block is the red block, move it toward the right.
block_name = parse_obj('the left most block')
while block_name == 'red block':
    target_pos = get_pos(block_name) + [0.3, 0]
    put_first_on_second(block_name, target_pos)
    block_name = parse_obj('the left most block')

parse_obj LMP实现细节(完整提示见附录B.5):

objs = ['red block', 'blue bowl', 'blue block', 'red bowl']
# the left most block.
block_names = ['red block', 'blue block']
block_positions = np.array([get_pos(name) for name in block_names])
left_block_name = block_names[np.argmin(block_positions[:, 0])]
ret_val = left_block_name

D. Language Model Programs as Policies

In the context of robot policies, LMPs can compose perception-to-control feedback logic given natural language instructions, where the high-level outputs of perception model(s) (states) can be programmatically manipulated and used to inform the parameters of low-level control APIs (actions). Prior information about available perception and control APIs can be guided through Examples and Hints. These APIs "ground" the LMPs to a real-world robot system, and improvements in perception and control algorithms can directly lead to improved capabilities of LMP-based policies. For example, in real-world experiments below, we use recently developed open-vocabulary object detection models like ViLD [3] and MDETR [2] off-the-shelf to obtain object positions and bounding boxes. The benefits of LMP-based policies are threefold: they (i) can adapt policy code and parameters to new tasks and behaviors specified by unseen natural language instructions, (ii) can generalize to new objects and environments by bootstrapping off of open-vocabulary perception systems and/or saliency models, and (iii) do not require any additional data collection or model training. The generated plans and policies are also interpretable as they are represented in code, allowing for easy modification and reuse. Using LMPs for high-level user interactions inherits the benefits of LLMs, including parsing expressive natural language with commonsense knowledge, taking prior context into account, multilingual capabilities, and engaging in dialog. In the experiment section that follows, we demonstrate multiple instantiations of LMPs across different robots and different tasks, showcasing the approach's flexible capabilities and ease of use.


翻译

在机器人策略的语境下,语言模型生成程序(LMPs)能够根据自然语言指令组合感知到控制的反馈逻辑。感知模型的高层输出(状态)可通过编程方式操作,并用于参数化底层控制API(动作)。通过示例和提示可引导机器人系统关于可用感知与控制API的先验知识。这些API将LMPs具象化到现实世界的机器人系统,而感知与控制算法的改进可直接提升基于LMP的策略能力。例如,下述真实世界实验中,我们直接使用近期开发的开放词汇目标检测模型ViLD[3]和MDETR[2],以获取物体位置与边界框。

基于LMP的策略具有三重优势:

  1. 适应性与泛化能力

可针对未见过自然语言指令描述的新任务与行为调整策略代码与参数,通过开放词汇感知系统或显著性模型引导,泛化至新物体与环境

  2. 零数据依赖与可解释性

无需额外数据收集或模型训练,生成的计划与策略以代码形式呈现,易于修改复用

  3. 继承LLM的自然语言交互优势

解析含常识性知识的自然语言指令,考虑上下文历史、支持多语言及对话式交互。后续实验展示不同机器人/任务中LMP的实例化,证明该方法的灵活性与易用性。

表III结果显示:在涉及已见任务属性的场景中,CaP与监督式基线方法CLIPort性能相当,尽管CaP仅通过单任务单样本提示生成策略;面对未见过任务属性时,CLIPort性能显著下降,而基于LLM的方法保持稳定。对于需精确数值化空间几何推理的任务,自然语言规划方法[14][16][18]无法适用,而CaP通过代码生成能力(详见附录C)展现显著优势。

5. 实验

The goals of our experiments are threefold: (i) evaluate the impact of using hierarchical code generation (across different language models) and analyze modes of generalization, (ii) compare Code as Policies (CaP) against baselines in simulated language-instructed manipulation tasks, and (iii) demonstrate CaP on different robot systems to show its flexibility and ease-of-use. Additional experiments can be found in the Appendix, such as generating reactive controllers to balance a cartpole and perform end-effector impedance control (Appendix F). The Appendix also contains the prompt and responses for all experiments. Videos and Colab Notebooks that reproduce these experiments can be found on the website. Due to the difficulty of evaluating open-ended tasks and a lack of comparable baselines, quantitative evaluations of a robot system using CaP is limited to a constrained set of simulated tasks in IV-D, while in IV-B, IV-C, and IV-E we demonstrate the system's full range of supported commands without quantitative evaluations.

总结

本实验体系旨在验证以下三重核心目标:

  1. 分层代码生成效果评估与泛化模式分析

横向对比不同语言模型(LLMs)的分层代码生成能力,探究代码生成策略对跨任务场景泛化行为的影响机制。

  2. 模拟环境下的策略性能对比

在语言驱动的机械臂操控任务中,量化比较代码即策略(CaP)与传统基线方法(如CLIPort[36])的优劣。

  3. 跨平台部署验证

通过在不同机器人系统(移动机器人、机械臂等)的实机实验,验证CaP框架的灵活性与易用性。

补充实验与资源详述

附录扩展实验(Appendix F):

  • 生成反应式控制器实现倒立摆平衡
  • 末端执行器阻抗控制策略验证

量化评估的局限性说明

由于开放式任务设计的复杂性及领域可比基线方法的缺失,系统性量化评估主要集中于:

  • IV-D节:限定场景的模拟任务性能测试

而以下章节侧重能力覆盖验证:

  • IV-B节(逻辑控制泛化)
  • IV-C节(空间几何推理)
  • IV-E节(跨本体任务适应性)

其目标为通过定性案例展示CaP在真实场景中的完整功能范围。

A. Hierarchical LMPs on Code-Generation Benchmarks

We evaluate our code-generation approach on two code-generation benchmarks: (i) a robotics-themed RoboCodeGen and (ii) HumanEval [1], which consists of standard code-gen problems.

RoboCodeGen: we introduce a new benchmark with 37 function generation problems with several key differences from previous code-gen benchmarks: (i) it is robotics-themed with questions on spatial reasoning (e.g., find the closest point to a set of points), geometric reasoning (e.g., check if one bounding box is contained in another), and controls (e.g., PD control), (ii) using third-party libraries (e.g. NumPy) are both allowed and encouraged, (iii) provided function headers have no docstrings nor explicit type hints, so LLMs need to infer and use common conventions, and (iv) using not-yet-defined functions are also allowed, which can be created with hierarchical code-gen. Example benchmark questions can be found in Appendix E. We evaluate on four LLMs accessible from the OpenAI API. As with standard benchmarks [1], our evaluation metric is the percentage of the generated code that passes human-written unit tests. See Table I. Domain-specific language models (Codex model) generally perform better. Within each model family, performance improves with larger models. Hierarchical performs better across the board, showing the benefit of allowing the LLM to break down complex functions into hierarchical parts and generate code for each part separately. We also analyze how code generation performance varies across the five types of generalization proposed in [23]. Hierarchical helps Productivity the most, which is when the new instruction requires longer code, or code with more logic layers than those in Examples. These improvements however, only occur with the two davinci models, and not cushman, suggesting that a certain level of code-generation capability needs to be reached first before hierarchical code-gen can bring further improvements. More results are in Appendix E.2.

Evaluations in HumanEval [1] demonstrate that hierarchical code-gen improves not only policy code, but also general-purpose code. See Table II. Numbers achieved are higher than in recent works [1], [11], [58]. More details in Appendix D.

总结

本研究基于两个代码生成基准展开评估:

  1. 自研机器人主题测试集RoboCodeGen

包含37个函数生成问题

差异化特性:

  • 领域专精:问题聚焦空间推理(如求点集中最近点)、几何关系判断(如包围盒包含检测)、控制策略(PD控制实现)
  • 第三方库激励:允许并推荐使用NumPy等库简化实现
  • 弱约束函数定义:函数头不含文档描述与显式类型提示,模型需按命名惯例推断参数类型
  • 支持分层生成:允许调用未定义函数(通过递归生成填充)
提示

RoboCodeGen区别于已有的HumanEval,对机器人领域的代码生成填补了空白。它包含37个函数生成问题,有空间推理、几何推理、控制问题(包括PD控制),应对专门的机器人应用场景,揭示模型在不同类型泛化任务上的优劣势。

  2. 标准化基准HumanEval[1]: 覆盖通用编程问题

测试模型:通过OpenAI API开放的4种LLMs(包含Codex系列模型)

评估标准:生成代码通过人工编写单元测试的百分比(详见表I)

核心发现:

  1. 领域专精模型的性能优势

Codex(代码特化模型)整体表现最优,同系列内模型规模越大性能越强。

提示

通俗一点说就是堆算力,再通俗一点说就是追逐当时最优,参数最大的模型。(该研究发布于2023年)

  2. 分层生成技术的普适增益

在各模型中,允许模型将复杂函数拆分为分层子任务并独立生成代码,平均性能提升显著(详见表I)。

  3. 泛化维度深析(基于[23]分类)

生产力泛化(Productivity) 提升最大:针对代码长度或逻辑层数超出示范案例的任务(例如复杂控制律代码生成),分层生成使大模型(如Davinci)有效拆解问题。

模型规模阈值效应:仅在Davinci级别模型中观测到分层增益,Cushman模型未见明显提升,表明分层技术需基础代码生成能力达标方能生效(更多结果见附录E.2)。

提示

可以考虑换用目前代码最优的模型,比如Grok 3、DeepSeek、GPT-o3等。

分层生成方法不仅提升机器人策略代码质量,也显著改进通用代码生成性能(对比近期研究[1][11][58],本文成果指标全面超越,详见表II),具体分析参见附录D。
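为了直观理解 RoboCodeGen 中问题的形态,下面给出两个同类问题的参考实现示意(非基准原题答案;包围盒格式假设为 (xmin, ymin, xmax, ymax)):

def pd_control(x, x_target, dx, kp, kd):
    # 经典PD控制律:比例项跟踪位置误差,微分项抑制速度
    return kp * (x_target - x) - kd * dx

def bbox_contains(outer, inner):
    # 判断 inner 包围盒是否完全落在 outer 内
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1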

B. CaP: Drawing Shapes via Generated Waypoints

In this domain, we task a real UR5e robot arm to draw various shapes by generating and following a series of 2D waypoints. For perception, the LMPs are given APIs that detect object positions, implemented with off-the-shelf open vocabulary object detector MDETR [2]. For actions, an end-effector trajectory following API is provided. There are four LMPs: (i) parse user commands, maintain a session, and call action APIs, (ii) parse object names from language descriptions, (iii) parse waypoints from language descriptions, and (iv) generate new functions. Examples of successful on-robot executions of unseen language commands are in Fig. 2c. The system can elicit spatial reasoning to draw entirely new shapes from language commands. Additional examples which demonstrate the ability to parse precise dimensions, manipulate previous shapes, and multi-step commands, as well as full prompts, are in Appendix H.

总结

实验配置:

  • 机器人本体:UR5e工业机械臂

核心任务:根据语言指令生成并追踪2D路径点,完成多样化图形绘制

系统模块分解:

  1. 感知层API

采用现成的MDETR开放词汇目标检测器[2],实时获取物体位置坐标。

注意

MDETR 目前只能处理静态的图像,不能处理视频;只能处理单个文本查询和单张图像的情况;只能输出边界框,而不能输出更精细的掩码或关键点。这一部分如果要移植,需要再写一层边界-物体掩码的代码。

  2. 执行层API

提供末端执行器轨迹跟踪接口,将路径点序列转化为关节运动指令。

  3. 分层LMPs架构(4个核心模块)

主控LMP:

  • 解析用户指令语义
  • 维护会话上下文
  • 协调子LMP生成并调用动作API

物体解析LMP:

  • 从语言描述(如“红色方框”)映射至目标物体名称

路径点生成LMP:

  • 根据指令(如“绘制边长30cm的正五边形”)转化为几何坐标序列

函数动态生成LMP:

  • 按需创建新功能函数(如轨迹平滑算法)

能力验证结果:

  • 零样本新形状绘制

图2c展示了首次出现的指令(如“螺旋线”、“六芒星”)的成功执行轨迹。

  • 精确尺寸解析

根据语言描述的定量参数(如“半径15cm的圆”)生成对应路径。

  • 多步指令组合

支持组合操作(如“擦除前一图形,并在其右侧绘制箭头”)。

  • 动态策略扩展

通过函数生成模块实现实时算法增强(如轨迹插值优化)。

完整提示模板与示例代码参见附录H。
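上文的路径点生成LMP大致对应这样一类代码:用 NumPy 由语言参数(边数、半径、中心)计算 2D 路径点。下面是一个自拟的示意(follow_traj 为假设的轨迹跟踪 API):

import numpy as np

def regular_polygon_waypoints(center, radius, n_sides):
    # 生成正多边形顶点作为2D路径点,并重复首点使轨迹闭合
    angles = np.linspace(0, 2 * np.pi, n_sides, endpoint=False) + np.pi / 2
    pts = np.stack([center[0] + radius * np.cos(angles),
                    center[1] + radius * np.sin(angles)], axis=1)
    return np.concatenate([pts, pts[:1]], axis=0)

waypoints = regular_polygon_waypoints(center=(0.4, 0.0), radius=0.15, n_sides=5)
# follow_traj(waypoints)  # 假设的末端执行器轨迹跟踪API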

C. CaP: Pick & Place Policies for Table-Top Manipulation

The table-top manipulation domain tasks a UR5e robot arm to pick and place various plastic toy objects on a table. The arm is equipped with a suction gripper and an in-hand Intel Realsense D435 camera. We provide perception APIs that detect the presences of objects, their positions, and bounding boxes, via MDETR [2]. We also provide a scripted primitive that picks an object and places it on a target position. Prompts are similar to those from the last domain, except trajectory parsing is replaced with position parsing. Examples of on-robot executions of unseen language commands are in Fig. 2 panels a and b, showing the capacity to reason about object descriptions and spatial relationships. Other commands that use historical context (e.g., "undo that"), reason about objects via geometric (e.g., "smallest") and spatial (e.g., "right-most") descriptions are in Appendix I.

注解

系统配置与硬件参数

  1. 机器人本体:
  • 机械臂型号:UR5e协作型机械臂
  • 末端执行器:真空吸盘夹具(支持非结构化物体抓取)
  • 视觉模块:Intel Realsense D435手眼相机(实时RGB-D感知)
  2. 感知层API:MDETR开放词汇检测器[2] 提供物体存在性检测、2D/3D坐标及边界框信息。
  3. 动作原语:

pick_and_place(obj, target_pos)

参数obj:物体名称(如'红色积木')

参数target_pos:目标坐标(通过视觉系统解析或数值生成)

语言指令解析能力

  1. 零样本场景验证(图2a&b)
  • 物体描述推理:"将三角形块放在绿色圆形块左侧" → 需识别形状与颜色
  • 空间关系解析:"将最大的积木移动到离机器人最近的碗中" → 需计算尺寸与距离
  2. 上下文敏感指令处理(更多案例见附录I)
  • 动作回溯:"撤销上一步操作" → 需维护执行历史栈
  • 几何属性推理:"移动最小的蓝色物体" → 需联合颜色筛选与尺寸排序
  • 动态空间定位:"最右侧的碗" → 需实时计算X轴最大坐标

对比先前的轨迹规划实验,此处核心改进为:

从生成连续路径点(如绘制图形)转为直接获取离散目标点(简单抓放)

删除多余的运动规划关键词,强化物体属性与位置关系描述。

危险

受限于MDETR的检测精度与语言描述粒度,系统暂无法处理亚毫米级定位或抽象符号指令(如"摆成笑脸图案")。需混合高层策略与低层反馈控制实现复杂任务(如动态避障)。

如果想要移植,特别是处理比较精细的如柔性插拔这样的工业场景,需要进一步调研合适的通用感知检测模型。

D. CaP: Table-Top Manipulation Simulation Evaluations

We evaluate CaP on a simulated table-top manipulation environment from [16], [18]. The setup tasks a UR5e arm and Robotiq 2F85 gripper to manipulate 10 colored blocks and 10 colored bowls. We inherit all 8 tasks, referred to as "long-horizon" tasks due to their multi-step nature (e.g., "put the blocks in matching bowls"). We define 6 new tasks that require more challenging and precise spatial-geometric reasoning capabilities (e.g., "place the blocks in a diagonal line"). Each task is parameterized by some attributes (e.g., "pick up <obj> and place it in <corner>"), which are sampled during each trial. We split the task instructions (I) and the attributes (A) into "seen" (SI, SA) and "unseen" categories (UI, UA), where "seen" means it's allowed to appear in the prompts or be trained on (in the case of supervised baseline). More details in Appendix K. We consider two baselines: (i) language-conditioned multi-task CLIPort [36] policies trained via imitation learning on 30k demonstrations, and (ii) few-shot prompted LLM planner using natural language instead of code.

Results are in Table III. CaP compares competitively to the supervised CLIPort baseline on tasks with seen attributes and instructions, despite only few-shot prompted with one example rollout for each task. With unseen task attributes, CLIPort's performance degrades significantly, while LLM-based methods retain similar performance. On unseen tasks and attributes, end-to-end systems like CLIPort struggle to generalize, and CaP outperforms LLM reasoning directly with language (also observed in [20]). Moreover, the natural-language planners [14], [16]–[18] are not applicable for tasks that require precise numerical spatial-geometric reasoning. We additionally show the benefits of reasoning with code over natural language (both direct question and answering and Chain of Thought [47]), specifically the ability of the former to perform precise numerical computations, in Appendix C.

注解

基于文献[16][18]构建的模拟桌面操作平台,系统配置如下:

机器人装置:UR5e机械臂 + Robotiq 2F85夹具

操作对象:10个彩色积木 + 10个彩色碗

基础任务(8类):多步骤长时程任务(如“积木颜色配对入碗”)

新增任务(6类):需精确空间几何推理的任务(如“将积木沿对角线排列”)

参数化设计:

任务属性(如“将<物体>放置至<角落>”)分为两类:

  • 已见属性(SA):可包含在提示示例或监督学习基线训练集中
  • 未见属性(UA):完全未在训练或提示中出现的属性组合

指令分类(附录K详述):

  • 已见指令(SI):指令模板已在提示中示范
  • 未见指令(UI):新颖任务描述语句
提示

是否“可见”主要指的是是否是已有知识、固定技能组合、能够在预训练数据中匹配的参数。

基线方法与实验结果(表III)

  1. 监督学习基线CLIPort[36]

数据依赖:30,000次专家演示的模仿学习

性能表现:

  • SA/SI任务:表现良好
  • UA/UI任务:成功率显著下降
提示

未见任务上的低成功率说明,这类端到端监督学习方法在泛化能力上仍有欠缺。

  2. 自然语言LLM规划器

方法特点:纯语言规划,不含代码生成

功能限制:缺乏数值计算能力,无法完成需坐标运算的任务

  3. 代码即策略(CaP)

配置要求:每个任务仅需单次示例提示

关键结果:

  • SA/SI任务:性能与CLIPort相当
  • UA/UI任务:成功率保持稳定(代码生成实现零样本泛化)

对比结论:

  • 监督学习局限:CLIPort受限于训练数据分布,难以应对属性/指令的开放组合
  • 自然语言规划缺陷:无法处理需精确空间计算的几何任务
  • 代码生成优势:直接生成数学运算代码(如坐标转换、距离计算),性能完胜自然语言链式推理(CoT)
  • 深层技术验证(附录C): 针对几何推理场景(如“构建垂直排列”),对比分析显示:

代码即策略框架的核心优势——通过代数运算生成精确控制指令,规避自然语言的模糊性局限。

E. CaP: Mobile Robot Navigation and Manipulation

In this domain, a robot with a mobile base and a 7 DoF arm is tasked to perform navigation and manipulation tasks in a real-world kitchen. For perception, the LMPs are given object detection APIs implemented via ViLD [3]. For actions, the robot is given APIs to navigate to locations and grasp objects via both names and coordinates. Examples of on-robot executions of unseen language commands are in Fig. 2. This domain shows that CaP can be deployed across realistic tasks on different robot systems with different APIs. It also illustrates the ability to follow long-horizon reactive commands with control structures as well as precise spatial reasoning, which cannot be easily accomplished by prior works [16], [17], [36]. See prompts and additional examples in Appendix J.

翻译
提示

感知由专用的大模型ViLD、MDETR完成,可以获取物体的位置和边界框。这里并不属于LLM的管辖范围

包括处理新物体也是应用了开放词汇检测系统(Open-Vocabulary Perception)和显著性模型。

在此领域,配备移动底盘与7自由度机械臂的机器人在真实厨房环境中执行导航与操作任务。感知层通过ViLD[3]实现目标检测API,动作层提供基于名称/坐标的导航与抓取API。图2展示了未训练语言指令的实机执行示例,例如:

  • 长时程反应式指令:"若餐桌上的杯子未被清理,则先将其移至水槽再整理餐具"
  • 精密空间推理指令:"绕过左侧障碍物,移至右侧台面中央位置抓取勺子"

该实验证明:

  1. 跨机器人系统部署能力 CaP可适配不同API的机器人平台(移动导航与机械臂联动)。
  2. 超越先前工作的性能 传统方法[16][17][36]难以合成含控制结构(如循环判断)的复杂策略, 无法生成依赖数值计算(如坐标差补偿)的精确动作序列

完整提示词与附加案例见附录J。

提示

移动机器人平台选择:

Everyday Robots作为移动机器人平台;前文演示所用机械臂为UR5e,夹爪为Robotiq 2F85 gripper,相机为Realsense D435 camera。
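为了说明这类长时程反应式策略的代码形态,下面是一个自包含的演示片段(detect_obj、goto_pos、pick_obj 均为假设的 API 存根,仅打印动作,不对应论文原接口):

# 感知与控制 API 的占位存根,名称均为假设
def detect_obj(name):
    print(f"detecting {name} via ViLD ...")
    return True

def goto_pos(location):
    print(f"navigating to {location}")

def pick_obj(name):
    print(f"picking up {name}")

# 一段可能由LMP生成的反应式策略:
# if there is a sponge on the counter, bring it to the sink.
goto_pos('counter')
if detect_obj('sponge'):
    pick_obj('sponge')
    goto_pos('sink')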

6. 讨论和限制

CaP generalizes at a specific layer in the robot stack: interpreting natural language instructions, processing perception outputs, then parameterizing low-dimensional inputs to control primitives. This fits into systems with factorized perception and control, and it imparts a degree of generalization (acquired from pretrained LLMs) without the magnitude of data collection needed for end-to-end learning. Our method also inherits LLM capabilities unrelated to code writing e.g., supporting instructions with non-English languages or emojis (Appendix L). CaP can also express cross-embodied plans that perform the same task differently depending on the available APIs (Appendix M). However, this ability is brittle with existing LLMs, and it may require larger ones trained on domain-specific code.

CaP today are restricted by the scope of (i) what the perception APIs can describe (e.g., no visual-language models to date can describe whether a trajectory is "bumpy" or "more C-shaped"), and (ii) which control primitives are available. Only a handful of named primitive parameters can be adjusted without over-saturating the prompts. CaP also struggle to interpret commands that are significantly longer or more complex, or operate at a different abstraction level than the given Examples. In the tabletop domain, it would be difficult for LMPs to "build a house with the blocks," since there are no Examples on building complex 3D structures. Our approach also assumes all given instructions are feasible, and we cannot tell if a response will be correct a priori.

翻译

CaP在机器人技术栈的特定层级实现泛化:即解析自然语言指令、处理感知输出,进而为控制原语生成低维输入参数。这种方法适用于感知与控制模块解耦的架构,通过预训练LLMs引入泛化能力,同时规避端到端学习所需的海量数据收集。CaP还继承了LLMs的非代码相关能力,例如支持非英语指令或表情符号(附录L),以及根据可用API生成跨本体适应性任务计划(附录M)。然而,这种能力在现有LLMs中较为脆弱,可能需依赖针对领域代码特化的更庞大模型。

当前CaP的局限性主要源于:

  1. 感知API的描述范围限制

现有视觉语言模型无法描述轨迹的抽象属性(如“颠簸路径”或“更接近C形的曲线”)。

  2. 控制原语功能有限性

若需调节的参数过多(超出命名原语承载范围),将导致提示信息过载。

  3. 指令复杂度与抽象层级适配性

面对远超示例复杂度的长指令(如“用积木搭建房屋”),或与给定示例抽象层级不符的任务(如要求定义复杂三维结构),LMPs难以生成有效代码。

此外,CaP默认假设所有输入指令均可行,缺乏对响应正确性的先验判断能力。

附录

A. 提示词工程

Using LMPs to reliably complete tasks via code generation requires careful prompt engineering. While these prompts do not have to be long, they do need to be relevant and specific. Here, we discuss a few general guidelines that we followed while developing prompts for this paper.

It is very important for prompts to contain code that has no bugs. Bugs in the prompt lead to unreliable and incorrect responses. Conversely, if the LMP is writing incorrect code for a given Instruction, the prompt engineer should first verify that the prompt, especially the Examples most closely related to the Instruction, is bug-free. To reduce bugs related to syntax errors, one simple method is writing prompts in a code editor with syntax highlighting.

There are many cases where the prompt contains variables or functions whose names are ambiguous. To produce reliable responses under these conditions, Examples in the prompt should treat these ambiguities consistently. If a variable named point is treated as a numpy.ndarray object in one Example and as a shapely.geometry.Point object in another, the LMP will not be able to "decide" on which convention to use, resulting in unreliable responses. Another way to handle ambiguity is by providing informal type hints, such as appending _np to variable names to indicate its type, or appending it to function names to indicate the type of the returned variable. In general, more specific variable and function names give more consistent results.

For using third party libraries, including import statements in the prompt may not be necessary, as we found that LMPs can generate code that calls NumPy and SciPy without them. However, explicit import statements do improve reliability and increase the chance of LMPs using those libraries when the need arises. For using first party libraries, meaningful function names that follow popular conventions (e.g., begin with set_ and get_ ) and specify return object formats (e.g., get_bbox_xyxy ) induce more accurate usages. Import statements in the Hints should be formatted as if we were importing functions. For example, in Python this means using from utils import function_name instead of import function_name . If the latter is used, the LMP may treat the imported name as a package, and the generated code might write function_name.function_name() .

One type of LMP failure relates to code generation correctness. For example, minor coding mistakes when calling internal or external APIs, such as missing arguments, can be fixed with a Hint or Example demonstrating the correct usage. Incorrect assumptions on variable types can also be fixed in similar fashions. Other coding failures may be addressed by descriptive function names to encourage appropriate library usage ( perform_function_with_np() ) or succinct code logic ( # implement in one line. ). While it is possible to use LLMs to edit code and fix bugs (e.g., by using OpenAI's code edit API), in our experience this yielded inconsistent results (not always able to correct mistakes, and sometimes changed what the function was doing), so we did not employ this method in our experiments.

翻译

要利用语言模型生成程序(LMPs)通过代码生成可靠完成任务,需进行谨慎的提示工程。虽然提示的篇幅不需冗长,但必须具有相关性和针对性。本文讨论在开发提示时遵循的若干通用准则:

示例代码质量至关重要。提示中包含的代码必须完全无缺陷。若提示中存在错误,将导致生成不可靠或不正确的响应。反之,若LMP为给定指令编写的代码存在错误,提示工程师应首先验证提示本身(尤其是与指令最相关的示例代码)是否无漏洞。减少语法错误的小技巧:在支持语法高亮的代码编辑器中编写提示。

消除变量与函数名的歧义。当变量名存在多义性时,提示中的示例需保持跨示例的类型一致性。例如:若变量名point在某示例中作为numpy.ndarray类型使用,在另个示例中却作为shapely.geometry.Point类型,LMP将无法确定应采用哪种类型约定,导致结果不可靠。

命名修正技巧:通过以下方式暗示类型信息:

  • 为变量名添加后缀(如_np表示NumPy数组类型)
  • 在函数名中标记返回类型(如get_bbox_xyxy()明确边界框坐标格式)

普遍原则是:命名越具体,生成结果越稳定。

第三方与自建库的调用优化

第三方库(如NumPy/SciPy):

尽管LMP可不依赖显式import语句调用这些库,但包含import numpy as np等声明可提高代码生成的可靠性,并增加模型使用该库的概率。

自建函数库的调用规范:

  • 遵循常用命名约定(例如set_/get_前缀)
  • 在函数名中明确返回值格式(例如get_pose_6d()表示返回6自由度位姿)
  • 导入格式必须使用from utils import function_name而非import function_name,否则LMP可能生成function_name.function_name()的错误调用链。

代码错误的调试策略:

API参数缺失或变量类型误用可通过添加示例或提示(Hints)来修复。

引导技巧:

通过函数名驱动功能选择(如perform_filter_with_np()暗示使用NumPy实现过滤)

通过注释限制代码复杂度(如# 用一行代码实现)

避免自动纠错工具:

实验中发现,使用LLM自动修复代码(如OpenAI的代码编辑API)会导致结果不稳定(无法保证错误完全修正或意外修改功能逻辑),故未采用此方法。

B. 方法选择的提示词

  1. Language-based reasoning: Full prompt:
objs = ['green block', 'green bowl', 'yellow block', 'yellow bowl']
# the yellow block.
ret_val = 'yellow block'
# the blocks.
ret_val = ['green block', 'yellow block']
注解

注释如果指的是单个有具体描述的方块,那么输出会是这个具体描述的方块。

如果指的是一组方块,那么输出会是这个群块。(没有限制的话,所有符合block性质的都会被选中)

  2. First-party: Full prompt:
from utils import get_pos, put_first_on_second
objs = ['gray block', 'gray bowl']
# put the gray block on the gray bowl.
put_first_on_second('gray block', 'gray bowl')

objs = ['purple block', 'purple bowl']
# move the purple bowl toward the left.
target_pos = get_pos('purple bowl') + [-0.3, 0]
put_first_on_second('purple bowl', target_pos)
注解

这里有必要解释一下utils里这两个函数的作用。

首先get_pos函数的作用是获取一个物体在当前场景中的位置。put_first_on_second函数的作用是把一个物体放在另一个物体的上方。

target_pos = get_pos('purple bowl') + [-0.3, 0]这里指的是紫色碗的当前位置(x,y坐标),那么加上[-0.3,0],就是x坐标减去0.3,y坐标不变。

  3. Combining language reasoning, third-party, and first-party libraries: Full prompt:
import numpy as np
from utils import get_pos, put_first_on_second
objs = ['cyan block', 'cyan bowl', 'pink bowl']
# put the cyan block in cyan bowl.
put_first_on_second('cyan block', 'cyan bowl')
objs = ['gray block', 'silver block', 'gray bowl']
# place the top most block on the gray bowl.
names = ['gray block', 'silver block']
positions = np.array([get_pos(name) for name in names])
name = names[np.argmax(positions[:,1])]
put_first_on_second(name, 'gray bowl')
objs = ['purple block', 'purple bowl']
# put the purple bowl to the left of the purple block.
target_pos = get_pos('purple block') + [-0.3, 0]
put_first_on_second('purple bowl', target_pos)
注解

这里的argmax是用来找y坐标最大(所处最高)的那个物体。

  4. LMPs can be composed: Full prompt:
import numpy as np
from utils import get_pos, put_first_on_second, parse_obj
objs = ['yellow block', 'yellow bowl', 'gray block', 'gray bowl']
# move the sun colored block toward the left.
block_name = parse_obj('sun colored block')
target_pos = get_pos(block_name) + [-0.3, 0]
put_first_on_second(block_name, target_pos)
objs = ['white block', 'white bowl', 'yellow block', 'yellow bowl']
# place the block closest to the blue bowl on the other bowl.
block_name = parse_obj('the block closest to the blue bowl')
bowl_name = parse_obj('a bowl other than the blue bowl')
put_first_on_second(block_name, bowl_name)
注解

这里的parse_obj扮演的是一个翻译的角色。通过自然语言描述的物体属性(如位置、颜色、类别)转化为具体物体名称或标识符。这里就使用到了开放词汇检测器,翻译一个陌生的物体。

  5. parse_obj prompt: Full prompt:
import numpy as np
from utils import get_pos
objs = ['brown bowl', 'green block', 'brown block', 'green bowl']
# the blocks.
ret_val = ['brown block', 'green block']
# the sky colored block.
ret_val = 'blue block'
objs = ['orange block', 'cyan block', 'purple bowl', 'gray bowl']
# the right most block.
block_names = ['orange block', 'cyan block']
block_positions = np.array([
    get_pos(block_name) for block_name in block_names])
right_block_name = block_names[np.argmax(block_positions[:, 0])]
ret_val = right_block_name

C. 代码推理 vs. 自然语言

To investigate how robot-relevant reasoning through LLMs can be performed with LMPs rather than with natural language, we created a benchmark that consists of two sets of tasks: (i) selecting objects in a scene from spatial-geometric descriptions, and (ii) selecting position coordinates from spatial-geometric descriptions. Object selection has 28 questions with commands such as "find the name of the block closest to the blue bowl," where a list of block and bowl positions are provided as input context in the prompt. Position selection has 23 questions with commands such as "interpolate 3 points on a line from the cyan bowl to the blue bowl." An LLM-generated answer for position selection is considered correct if all coordinates are within 1cm of the ground truth.

We evaluate LMPs against two variants of reasoning with natural language: (i) Vanilla, given a description of the setting (e.g., list of object positions) and the question, directly outputs the answer (e.g., "Q: What is the top-most block?" → "A: red block"), and (ii) Chain of Thought (CoT) [47], which performs step-by-step reasoning given examples of intermediate steps in the prompt (e.g., encouraging the LLM to list out y-coordinates of all blocks in the scene before identifying the top-most block).

Results in Table IV show that LMPs achieve accuracies in the high 90s, outperforming CoT, which outperforms Vanilla. CoT enables LLMs to reason about relations and orders (e.g. which coordinate is to the right of another coordinate), but failures occur for precise and multi-step numerical computations. By contrast, code from LMPs can use Python to perform such computations, and they often leverage external libraries to perform more complex operations (e.g., NumPy for vector addition). CoT and LMPs are not mutually exclusive – it is possible to prompt "step-by-step" code-generation to solve more complex tasks via CoT, but this is a direction not explored in this work.

翻译

为探究如何通过语言模型生成程序(LMPs)而非纯自然语言实现机器人相关的推理任务,我们构建了包含两类任务的基准测试集:

(i) 物体选择任务:基于空间几何描述从场景中选取目标物体(共28道问题,指令示例:"找到距离蓝色碗最近的积木名称"),输入提示中提供积木和碗的坐标列表作为上下文;

(ii) 位置选择任务:根据几何描述生成坐标点(共23道问题,指令示例:"在青色碗至蓝色碗连线上插值3个点"),若所有坐标值与真实值误差在1厘米以内,则认为LLM生成的位置选择答案正确。

对比方法设计:将LMP与两类自然语言推理方法进行对比:

  1. 直接输出法(Vanilla):

给定场景描述(如物体位置列表)与问题,直接输出答案(例如:"Q: 最顶部的积木是什么?" → "A: 红色积木")

  2. 思维链推理法(CoT)[47]:

在提示中提供分步推理示例(例如诱导LLM先列举场景中所有积木的Y坐标,再识别最顶部积木)

提示

思维链的应用比较适合接入DeepSeek。

实验结果(表IV)

LMPs取得接近90%的高准确率,显著优于CoT方法,而CoT优于直接输出法。

CoT使LLM能处理关系与顺序推理(如判断坐标相对方位),但在需精确数值计算的多步骤任务中存在失败案例。

LMPs则通过Python代码生成能力实现数值计算,常借助外部库(如NumPy实现向量加法)完成复杂运算。

CoT与LMPs并非互斥——可结合"分步生成代码"提示策略通过思维链解决更复杂任务,但本文未探索该方向。

D. Additional HumanEval Code-Gen Results

Here we provide additional results to our HumanEval experiments. In total, three variants of the bigger Codex model (code-davinci-002) are tested. Our approach is Hier. CodeGen + Hier Prompts, where the prompt encourages the LLM to call yet-to-be-defined functions by including such examples. For comparison, we evaluate against Flat CodeGen + No Prompt, essentially just using the LLM directly, and Flat CodeGen + Flat Prompt, for a fair comparison with flat code generation, since our hierarchical approach has a prompt. The prompts contain only two examples:

Prompt for Flat CodeGen:

prompt_f_gen_flat = '''
def get_total(xs: List[float]) -> float:
    """Find the sum of a list of numbers called xs.
    """
    return sum(xs)
# end of function

def get_abs_diff_between_means(xs0: List[float],
                               xs1: List[float]) -> float:
    """Get the absolute difference between the means of two
    lists of numbers.
    """
    m0 = sum(xs0) / len(xs0)
    m1 = sum(xs1) / len(xs1)
    return abs(m0 - m1)
# end of function
'''

Prompt for Hierarchical CodeGen:

def get_total(xs: List[float]) -> float:
    """Find the sum of a list of numbers called xs.
    """
    return sum(xs)
# end of function

def get_abs_diff_between_means(xs0: List[float],
                               xs1: List[float]) -> float:
    """Get the absolute difference between the means of two lists of
    numbers.
    """
    m0 = get_mean(xs0)
    m1 = get_mean(xs1)
    return abs(m0 - m1)
# end of function

Note that the only difference in the hierarchical prompt is the use of a yet-to-be-defined function get_mean instead of calculating the mean directly. This "allows" the LLM to generate code that also calls yet-to-be-defined functions.
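To illustrate (our own example completions, not from the paper), hierarchical code-gen proceeds in two passes: the first pass may freely call a helper that does not exist yet, and a second query then defines it:

from typing import List

# pass 1: the hierarchically prompted LLM writes the parent function,
# calling the yet-to-be-defined helper get_mean
def get_variance(xs: List[float]) -> float:
    """Find the variance of a list of numbers called xs."""
    m = get_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# pass 2: get_mean is detected as undefined, so a new query
# "# define function: m = get_mean(xs)." yields, e.g.:
def get_mean(xs: List[float]) -> float:
    """Find the mean of a list of numbers called xs."""
    return sum(xs) / len(xs)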

We report pass rates for when using the most likely outputs ("greedy", which is done by setting temperature to 0), as well as pass rates for at least one solution from sampling various numbers of solutions (1, 10, and 100) with temperature set to 0.8, similar to those used in prior works [1], [11], [58].
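For reference, the "at least one correct out of n samples" pass rates are conventionally computed with the unbiased pass@k estimator from [1]; a minimal sketch:

import numpy as np

def pass_at_k(n, c, k):
    # n: total samples drawn, c: samples that passed, k: budget
    # pass@k = 1 - C(n - c, k) / C(n, k), evaluated as a numerically stable product
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))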

See results in Table II. In all instances hierarchical code generation outperforms flat code generation, and the numbers achieved are higher than those reported in recent works [1], [11], [58]. Note that we use code-davinci-002, while previous works use code-davinci-001, but the relative improvements with hierarchical code generation are consistent across the board. Out of the 164 questions in HumanEval, 6.5% led to hierarchical code generation; on this subset, both Flat CodeGen variants achieved 44% success, while Hier CodeGen achieved 56%. While the success rate when sampling 100 responses is above 90% across the board, we note that sampling multiple solutions is not practical for LMPs, which need to perform tasks in a zero-shot manner without engineering prior unit tests. As such, for LMPs we always set temperature to 0 and use the most likely output.

Summary

Additional results for the HumanEval experiments. Three variants built on the larger Codex model (code-davinci-002) are tested.

Compared methods

  1. Hierarchical code generation (Hier. CodeGen + Hier Prompts): the prompt encourages the LLM to call yet-to-be-defined functions by including examples of such calls.

  2. Flat code-generation baselines:

  • Flat CodeGen + No Prompt: the raw model, used directly without a prompt.

  • Flat CodeGen + Flat Prompt: a prompt with two complete function examples that compute intermediate values directly.

Key difference:

The only difference between the hierarchical and the flat prompt is whether yet-to-be-implemented functions (such as get_mean()) may be used in place of explicit computation.

Tip

Decoding strategies:

  • Deterministic (temperature = 0, greedy decoding): take only the model's most likely output.
  • Sampling (temperature = 0.8, with 1/10/100 samples): measure the chance that at least one sample is correct.

Results (Table II)

  1. Performance: hierarchical generation beats flat generation under every decoding strategy. On the 6.5% of questions that triggered hierarchical generation, hierarchical code passed 56% versus 44% for the flat variants. The overall numbers also exceed recent works [1][11][58], despite the newer model version (code-davinci-002 vs. code-davinci-001 in prior work).

  2. Sampling in practice: with 100 samples both approaches exceed 90% success, but this is impractical for robot tasks, where an LMP must produce reliable code zero-shot in a single attempt. In deployment, temperature is therefore fixed at 0 to take the highest-confidence output and avoid randomness.

E. Robot Code-Generation Benchmark

  1. Example Questions: Here are four types of benchmark questions and their examples:

• Vector operations with NumPy:

pts = interpolate_pts_np(start, end, n)

Info

The source problem spec for this interpolation question is:

{
    'f_name': 'interpolate_pts_np',
    'f_sig': 'pts = interpolate_pts_np(start, end, n)',
    'f_sol': lambda start, end, n: np.linspace(start, end, n),
    'test_input_args': [(np.random.random(2), np.random.random(2), 3) for _ in range(5)],
    'test_equiv': np.allclose
},

• Simple controls:

u = pd_control(x_curr, x_goal, x_dot, Kp, Kv)

• Manipulating shapes with shapely (a plausible solution is sketched after this list):

circle = make_circle(radius, center)

• Using first-party libraries:

ret_val = obj_shape_does_not_contain_others(obj_name, other_obj_names)
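For instance, a plausible solution to the make_circle question (our sketch; the benchmark's reference solution may differ) buffers a point with shapely:

from shapely.geometry import Point

def make_circle(radius, center):
    # returns a polygon approximating a circle of the given radius around center
    return Point(center).buffer(radius)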

Reproduction Workflow

Google Research has done a lot of related work, so the repository contains many ipynb files, not all of which belong to CaP. The reproduction target for this article is:

-> Experiment_ Reasoning with Code vs Natural Language.ipynb

The model must be set to code-davinci-002, not text-davinci-002; LLMs not trained for code completion degrade performance.

$ pip install google-colab is likely to fail; installing via conda worked in our tests:

$ conda install -c conda-forge google-colab

IPython will then most likely raise errors: you need to recreate the deprecated coloransi module and its TermColors class (adding a definition for the color Green).
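A minimal shim that is enough for the import to succeed (our own; the attribute names mirror the old coloransi interface and are an assumption about what the failing import expects):

# coloransi.py -- stand-in for the deprecated module (assumed interface)
class TermColors:
    Normal = '\033[0m'    # reset
    Green = '\033[0;32m'  # the color definition the old class was missing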

Install the traitlets module as well.

Prepare an official OpenAI api_key; GPT-4 keys from mainland-China relay services return 405 Not Allowed.

You may also need to update ipykernel.

The official Colab notebook is not usable from mainland China even behind a proxy; it can hang on the ffmpeg installation (still to be verified).

Python: 3.11.11 (in the Colab environment)

Environment: RTX 4090, Ubuntu 20.04

1. Fill in the api_key

Simply fill in an api_key that starts with sk-.

You may consider downgrading openai to 0.28.0 here (because of certain dependency constraints); if you downgrade, parts such as the api_key loading need to be rewritten.

The preset !pip commands can be used to check that the installed versions are correct.

2. Building the LMPs (key part)

If you switch to DeepSeek in the future, drop the openai package and rewire the calls to DeepSeek's API as its documentation requires.
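For example, DeepSeek documents an OpenAI-compatible endpoint, so a completion call might look like the following sketch (the model name, base URL usage, and prompt handling here are assumptions to verify against DeepSeek's docs):

from openai import OpenAI  # openai>=1.0 client, pointed at DeepSeek's endpoint

client = OpenAI(api_key='sk-...', base_url='https://api.deepseek.com')

def lmp_query(prompt, query):
    resp = client.chat.completions.create(
        model='deepseek-chat',
        messages=[{'role': 'user', 'content': prompt + '\n' + query}],
        temperature=0,  # LMPs always take the most likely output
    )
    return resp.choices[0].message.content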


def solve_problems(problems, prompt, context_vars, log=False, recurse=False, bug_fix=False, query_kwargs=None):
    f_names = [problem['f_name'] for problem in problems]
    f_sigs = [problem['f_sig'] for problem in problems]
    f_gens = lmp_fgen_batch(
        prompt, f_names, f_sigs,
        context_vars=context_vars, bug_fix=bug_fix, recurse=recurse,
        log=log, query_kwargs=query_kwargs
    )
    return f_gens

The code above hands each problem to the LMP for code generation. The recurse flag controls whether generation is recursive; pass True to enable recursion (the core idea of this paper).

The bug_fix flag controls whether code repair is applied. The paper mentions that repair is available but does not work well; it is unclear whether this is the feature being referred to. The repair model is code-davinci-edit-001.

Both recurse and bug_fix default to False.

The implementation of lmp_fgen_batch:

def lmp_fgen_batch(prompt, f_names, f_sigs, stop_tokens=['# define function:', '# example:'],
                   recurse=False, context_vars=None, bug_fix=False, log=True, query_kwargs=None,
                   rate_limit_time=6):
    queries = [f'# define function: {f_sig}.' for f_sig in f_sigs]
    f_srcs_list = lmp_batch(prompt, queries, stop_tokens=stop_tokens, query_kwargs=query_kwargs)

    if bug_fix:
        for idx in trange(len(f_srcs_list)):
            sleep(rate_limit_time)
            f_srcs_list[idx] = openai.Edit.create(
                model='code-davinci-edit-001',
                input='# ' + f_srcs_list[idx],
                temperature=0,
                instruction="Fix syntax errors. Keep same inputs and outputs. Only small changes. No comments.",
            )['choices'][0]['text'].strip()

    f_srcs = {f_name: f_src for f_name, f_src in zip(f_names, f_srcs_list)}

    fs = {}
    all_child_fs, all_child_f_srcs = {}, {}
    for f_name, f_sig, f_src in zip(f_names, f_sigs, f_srcs_list):
        if context_vars is None:
            context_vars = {}
        gvars = merge_dicts([context_vars, fs, all_child_fs])
        lvars = {}

        try:
            exec_safe(f_src, gvars, lvars)
            fs[f_name] = lvars[f_name]
        except Exception as e:
            print(f_name)
            print(f_src)
            print(e)
            fs[f_name] = lambda *args, **kwargs: None
            continue

        # recursively define child_fs in the function body if needed
        if recurse:
            f_def_body = astunparse.unparse(ast.parse(f_src).body[0].body)
            potential_child_fs, potential_child_f_sigs = {}, {}
            f_parser = FunctionParser(potential_child_fs, potential_child_f_sigs)
            f_parser.visit(ast.parse(f_def_body))
            for potential_child_f_name, potential_child_f_sig in potential_child_f_sigs.items():
                if potential_child_f_name in potential_child_fs:
                    potential_child_fs[potential_child_f_name] = potential_child_f_sig

            child_fs, child_f_srcs = {}, {}
            for child_f_name, child_f_sig in potential_child_fs.items():
                all_vars = merge_dicts([context_vars, fs, all_child_fs, child_fs])
                if not var_exists(child_f_name, all_vars):
                    sleep(rate_limit_time)
                    child_f, child_f_src = lmp_fgen(
                        prompt, child_f_name, child_f_sig,
                        stop_tokens=stop_tokens,
                        context_vars=all_vars,
                        bug_fix=bug_fix,
                        log=False,
                        recurse=True,
                        return_src=True,
                        query_kwargs=query_kwargs
                    )

                    child_fs[child_f_name] = child_f
                    child_f_srcs[child_f_name] = child_f_src

            if len(child_fs) > 0:
                # redefine parent f so newly created child_fs are in scope
                gvars = merge_dicts([context_vars, fs, all_child_fs, child_fs])
                lvars = {}

                exec_safe(f_src, gvars, lvars)

                fs[f_name] = lvars[f_name]
                all_child_fs.update(child_fs)
                all_child_f_srcs.update(child_f_srcs)

    if log:
        for query, f_src in zip(queries, f_srcs_list):
            to_print = highlight(f'{query}\n{f_src}', PythonLexer(), TerminalFormatter())
            print(f'LMP FGEN created:\n\n{to_print}\n')

    all_fs = merge_dicts([fs, all_child_fs])
    all_f_srcs = merge_dicts([f_srcs, all_child_f_srcs])
    f_gens = {
        f_name: {
            'f': f,
            'f_src': all_f_srcs[f_name]
        }
        for f_name, f in all_fs.items()
    }

    return f_gens

3. Prompt construction (key part)

The preset comment-style prompts:

# these are the prompt setups for the two comparison conditions
prompt_f_gen = '''
import numpy as np
from shapely.geometry import *
from shapely.affinity import *
from utils import get_obj_outer_pts_np
# define function line = make_line_by_length(length=x).
def make_line_by_length(length):
    start = np.array([0, 0])
    end = np.array([length, 0])
    line = make_line(start=start, end=end)
    return line
# example: scale a line by 2 around the centroid.
line = make_line_by_length(1)
new_shape = scale(line, xfact=2, yfact=2, origin='centroid')
'''

prompt_f_gen_flat = '''
import numpy as np
from shapely.geometry import *
from shapely.affinity import *
from utils import get_obj_outer_pts_np
def make_line_by_length(length):
    line = LineString([[0, 0], [length, 0]])
    return line
# example: scale a line by 2.
line = make_line_by_length(1)
new_shape = scale(line, xfact=2, yfact=2)
'''

prompt_f_gen_exec = '''
import numpy as np
from shapely.geometry import *
from shapely.affinity import *
# define function line = make_line_by_length(length=x).
def make_line_by_length(length):
    line = LineString([[0, 0], [length, 0]])
    return line
'''
# the exec prompt needs no example

context_vars = {}
exec(prompt_f_gen_exec, context_vars)

For prompts with fuzzy descriptions ("more...", "a bit more...", "a little less..."), similar to the [-1, 1] smoothing used by VoxPoser, the paper gives a handling method, though it remains fairly simple; a rough sketch of the idea follows.
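A minimal sketch of that idea (ours; the paper actually lets the LLM pick magnitudes from few-shot context rather than from a fixed table, so the numbers below are placeholders):

import numpy as np

# hypothetical mapping from fuzzy modifiers to offsets in meters
FUZZY_OFFSETS = {'a tiny bit': 0.03, 'a bit': 0.1, 'a lot': 0.3}

def move_left(pos, modifier='a bit'):
    # shift a 2-D position leftward by an amount keyed on the modifier
    return np.asarray(pos) + np.array([-FUZZY_OFFSETS[modifier], 0.0])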


import numpy as np
import shapely
from shapely.geometry import *
from shapely.affinity import *
from operator import eq

# these two functions handle the notion of "around"

def rotate_pts_around_pts_center_np(pts_np, angle_deg):
    angle_rad = np.deg2rad(angle_deg)
    centroid = np.mean(pts_np, axis=0)
    R = np.array([
        [np.cos(angle_rad), -np.sin(angle_rad)],
        [np.sin(angle_rad), np.cos(angle_rad)]
    ])
    new_pts = (pts_np - centroid) @ R.T + centroid
    return new_pts

def scale_pts_around_centroid_np(pts_np, scale_x=1.5, scale_y=1.5):
    centroid = np.mean(pts_np, axis=0)
    new_pts = pts_np - centroid
    new_pts[:, 0] = new_pts[:, 0] * scale_x
    new_pts[:, 1] = new_pts[:, 1] * scale_y
    new_pts = new_pts + centroid
    return new_pts

problems_np = [
    {
        'f_name': 'get_right_most_idx',
        'f_sig': 'idx = get_right_most_idx(points_np)',
        'f_sol': lambda points_np: np.argmax(points_np[:, 0]),
        'test_input_args': [(np.random.random((10, 2)),) for _ in range(5)],
        'test_equiv': np.allclose
    },
    # superlatives are easier to extract than comparatives: once the candidate
    # set is bounded, argmax directly returns the extreme value
    {
        'f_name': 'bbox_xyxy_contains_pt',
        'f_sig': 'contains = bbox_xyxy_contains_pt(bbox_xyxy, pt)',
        'f_sol': lambda bbox_xyxy, pt: pt[0] >= bbox_xyxy[0] and pt[0] <= bbox_xyxy[2] and pt[1] >= bbox_xyxy[1] and pt[1] <= bbox_xyxy[3],
        'test_input_args': [(np.random.random(4) * [-1, -1, 1, 1], np.random.random(2) * 2 - 1) for _ in range(5)],
        'test_equiv': np.allclose
    },
    # containment-style spatial relations are likewise handled by bounding,
    # then checking ranges
    {
        'f_name': 'interpolate_pts_np',
        'f_sig': 'pts = interpolate_pts_np(start, end, n)',
        'f_sol': lambda start, end, n: np.linspace(start, end, n),
        'test_input_args': [(np.random.random(2), np.random.random(2), 3) for _ in range(5)],
        'test_equiv': np.allclose
    },
    # interpolation; coordinate transforms and flips are also covered
    {
        'f_name': 'rotate_pts_around_pts_center_np',
        'f_sig': 'new_pts_np = rotate_pts_around_pts_center_np(pts_np, angle_deg)',
        'f_sol': rotate_pts_around_pts_center_np,
        'test_input_args': [(np.random.random((10, 2)), np.random.random()) for _ in range(5)],
        'test_equiv': np.allclose
    },
    {
        'f_name': 'scale_pts_around_centroid_np',
        'f_sig': 'new_pts_np = scale_pts_around_centroid_np(pts_np, scale_x=1.5, scale_y=1.5)',
        'f_sol': scale_pts_around_centroid_np,
        'test_input_args': [(np.random.random((10, 2)), np.random.random(), np.random.random()) for _ in range(5)],
        'test_equiv': np.allclose
    },
    # random inputs exercise the "around" handling
]

In addition, the handling of control problems:

problems_ctrl = [
    {
        'f_name': 'pd_control',
        'f_sig': 'u = pd_control(x_curr, x_goal, x_dot, Kp, Kv)',
        'f_sol': lambda x_curr, x_goal, x_dot, Kp, Kv: Kp * (x_goal - x_curr) - Kv * x_dot,
        'test_input_args': [(
            np.random.random(2), np.random.random(2), np.random.random(2), np.random.random(), np.random.random()
        ) for _ in range(5)],
        'test_equiv': np.allclose
    },
    # template for constructing the PD-control problem
    {
        'f_name': 'end_effector_impedance_control',
        'f_sig': 'tau = end_effector_impedance_control(x_curr, x_goal, x_dot, K_x_mat, D_x_mat, J)',
        'f_sol': lambda x_curr, x_goal, x_dot, K_x_mat, D_x_mat, J: J.T @ (K_x_mat @ (x_goal - x_curr) - D_x_mat @ x_dot),
        'test_input_args': [(
            np.random.random(3), np.random.random(3), np.random.random(3),
            np.diag(np.random.random(3)), np.diag(np.random.random(3)),
            np.random.random((3, 6))
        ) for _ in range(5)],
        'test_equiv': np.allclose
    },
]
# impedance control
def get_direction_orthogonal_to_line_to_point(line, point):
    # get the line's direction.
    direction = np.array(line.coords[1]) - np.array(line.coords[0])
    direction = direction / np.linalg.norm(direction)
    # get the orthogonal direction.
    orthogonal_direction = np.array([-direction[1], direction[0]])
    # get the point's direction.
    point_direction = np.array(point) - np.array(line.coords[0])
    # get the sign of the point's direction.
    sign = np.sign(np.dot(point_direction, orthogonal_direction))
    # get the orthogonal direction from the line to the point.
    orthogonal_direction_from_line_to_point = sign * orthogonal_direction
    return orthogonal_direction_from_line_to_point

def get_direction_orthogonal_to_line(line):
    direction = np.array(line.coords[1]) - np.array(line.coords[0])
    direction = direction / np.linalg.norm(direction)
    direction = np.array([direction[1], -direction[0]])
    return direction

# two shapes are "equal" if their union has the same area as each shape
shape_eq = lambda f_sol_out, f_gen_out: np.isclose(f_sol_out.union(f_gen_out).area, f_sol_out.area) and np.isclose(f_sol_out.area, f_gen_out.area)

make_rectangle = lambda width, height, center: box(center[0] - width / 2, center[1] - height / 2, center[0] + width / 2, center[1] + height / 2)

problems_shapely = [
    {
        'f_name': 'interpolate_pts_on_line',
        'f_sig': 'pts_coords = interpolate_pts_on_line(line, n)',
        'f_sol': lambda line, n: [line.interpolate(i / (n - 1), normalized=True).coords[0] for i in range(n)],
        'test_input_args': [(LineString(np.random.random((2, 2))), np.random.randint(2, 100)) for _ in range(5)],
        'test_equiv': np.allclose
    },
    {
        'f_name': 'make_line',
        'f_sig': 'line = make_line(start_pt_np, end_pt_np)',
        'f_sol': lambda start_pt_np, end_pt_np: LineString([start_pt_np, end_pt_np]),
        'test_input_args': [(np.random.random(2), np.random.random(2)) for _ in range(5)],
        'test_equiv': eq
    },
]
# problems that require shapely for geometric computation

Parameters for the fuzzily specified scene:

obj_names = ['block0', 'block1', 'block2', 'block3', 'block4', 'block5']
obj_positions = np.array([
    [0, 0],
    [0.1, 0],
    [0.5, 0.5],
    [0, 0],
    [0.6, 0.6],
    [0, 0.9],
    [0.4, 0.4]
])
sizes = [0.15, 0.15, 0.1, 0.2, 0.1, 0.05, 0.1]
obj_boxes = [make_rectangle(size, size, pos) for pos, size in zip(obj_positions, sizes)]
# note: there are 7 positions/sizes but only 6 names; the zip below truncates
# to the 6 named objects
obj_outer_pts_np_map = {
    obj_name: np.array(obj_box.exterior.coords)
    for obj_name, obj_box in zip(obj_names, obj_boxes)
}
get_obj_outer_pts_np = lambda obj_name: obj_outer_pts_np_map[obj_name]
obj_names0 = obj_names[:3]
obj_names1 = obj_names[3:]

Handling of the recursion layer:

def obj_shape_does_not_contain_others(obj_name, other_obj_names):
    obj_shape = get_obj_shape(obj_name)
    for other_obj_name in other_obj_names:
        other_obj_shape = get_obj_shape(other_obj_name)
        if obj_shape.contains(other_obj_shape):
            return False
    return True

def is_obj0_bigger_than_obj1(obj0_name, obj1_name):
    obj0_pts = get_obj_outer_pts_np(obj0_name)
    obj1_pts = get_obj_outer_pts_np(obj1_name)
    obj0_area = get_area(obj0_pts)
    obj1_area = get_area(obj1_pts)
    return obj0_area > obj1_area

The API problems:

problems_api = [
    {
        'f_name': 'obj_shape_does_not_contain_others',
        'f_sig': 'ret_val = obj_shape_does_not_contain_others(obj_name, other_obj_names)',
        'f_sol': obj_shape_does_not_contain_others,
        'test_input_args': [(obj_name0, obj_names1) for obj_name0 in obj_names0],
        'test_equiv': eq
    },
    {
        'f_name': 'is_obj0_bigger_than_obj1',
        'f_sig': 'ret_val = is_obj0_bigger_than_obj1(obj0_name, obj1_name)',
        'f_sol': is_obj0_bigger_than_obj1,
        'test_input_args': [(np.random.choice(obj_names0), np.random.choice(obj_names1)) for _ in range(5)],
        'test_equiv': eq
    },
]

Final integration:

all_problems = problems_np + problems_ctrl + problems_shapely + problems_api
print('n problems:', len(all_problems))

context_vars['get_obj_outer_pts_np'] = get_obj_outer_pts_np

4. Verification

print('Codex | Hierarchical Code-Gen | Hierarchical Prompt')

f_gens_codex_hc_hp = solve_problems(all_problems, prompt_f_gen, context_vars, recurse=True, query_kwargs={'engine': 'code-davinci-002'})

results_codex_hc_hp = eval_problems(all_problems, f_gens_codex_hc_hp)
for r in results_codex_hc_hp:
    print(int(r['success']))

# failures_codex_hc_hp = [r for r in results_codex_hc_hp if not r['success']]
# for failure in failures_codex_hc_hp:
#     print(failure['f_name'], failure['info'], failure['f_gen']['f_src'])
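eval_problems is defined earlier in the notebook; a minimal sketch of what such a harness does with the problem specs above (our reconstruction from the fields f_sol, test_input_args, and test_equiv; the actual implementation may differ):

def eval_problems(problems, f_gens):
    results = []
    for problem in problems:
        f_gen = f_gens[problem['f_name']]
        success, info = True, ''
        for args in problem['test_input_args']:
            try:
                # a problem passes if the generated function matches the
                # reference solution (per test_equiv) on every test input
                if not problem['test_equiv'](problem['f_sol'](*args), f_gen['f'](*args)):
                    success, info = False, 'output mismatch'
            except Exception as e:
                success, info = False, str(e)
        results.append({'f_name': problem['f_name'], 'success': success,
                        'info': info, 'f_gen': f_gen})
    return results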


In our run, the code-davinci-002 model reaches a success rate of roughly 97%, essentially solving the classic problems posed in the paper.

References

These are references consulted during this survey.

[1] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion. "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding." arXiv preprint arXiv:2104.12763, 2021.

[2].