Code as Policies (Jacky Liang et al.)
Project Links
Project site: https://code-as-policies.github.io/
To reproduce the paper's code, visit: https://github.com/google-research/google-research/tree/master/code_as_policies
For the main code, download Experiment_ Robot Code-Gen Benchmark.ipynb. If you need the validation code for the other comparisons in the paper, download the other .ipynb files.
Colab: https://colab.research.google.com/drive/124TE4TsGYyrvduzeDclufyvwcc2qbbrE
Paper: https://arxiv.org/abs/2209.07753
1. Abstract
Abstract—Large language models (LLMs) trained on code-completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously re-compose API calls to generate new policy code respectively. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) to ambiguous descriptions ("faster") depending on context (i.e., behavioral commonsense). This paper presents Code as Policies: a robot-centric formulation of language model generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers), as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated across multiple real robot platforms. Central to our approach is prompting hierarchical code-gen (recursively defining undefined functions), which can write more complex code and also improves state-of-the-art to solve 39.8% of problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io
Notes
A robot policy is a systematic set of decision logic or behavior rules that determines how a robot turns what it perceives (e.g., sensor data, visual information) into concrete action commands.
Put simply, a policy is the layer that takes a robot from what it senses, to what it decides, to what it does.
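To make the term concrete, here is a minimal sketch of a policy in this sense (the robot object and its detect_object/set_velocity methods are illustrative placeholders echoing the snippets quoted later, not the paper's actual APIs):
# Minimal sketch of a reactive policy: perception in, action out, each tick.
# The robot handle and its methods are illustrative placeholders.
def avoid_obstacle_policy(robot):
    if robot.detect_object("obstacle"):           # sense
        robot.set_velocity(x=-0.1, y=0.0, z=0.0)  # act: back away
    else:
        robot.set_velocity(x=0.1, y=0.0, z=0.0)   # act: move forward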
The abstract notes that earlier work showed docstrings can be used to synthesize simple Python programs; this team's advance is generating policy code from natural language instructions as well. It must be pointed out that docstrings and natural language instructions are not the same thing; the distinction comes up again below.
2. Introduction
Robots that use language need it to be grounded (or situated) to reference the physical world and bridge connections between words, percepts, and actions [4]. Classic methods ground language using lexical analysis to extract semantic representations that inform policies [5]–[7], but they often struggle to handle unseen instructions. More recent methods learn the grounding end-to-end (language to action) [8]–[10], but they require copious amounts of training data, which can be expensive to obtain on real robots. Meanwhile, recent progress in natural language processing shows that large language models (LLMs) pretrained on Internet-scale data [11]–[13] exhibit out-of-the-box capabilities [14]–[16] that can be applied to language-using robots, e.g., planning a sequence of steps from natural language instructions [16]–[18] without additional model finetuning. These steps can be grounded in real robot affordances from value functions among a fixed set of skills, i.e., policies pretrained with behavior cloning or reinforcement learning [19]–[21]. While promising, this abstraction prevents the LLMs from directly influencing the perception-action feedback loop, making it difficult to ground language in ways that (i) generalize modes of feedback that share percepts and actions, e.g., from "put the apple down on the orange" to "put the apple down when you see the orange", (ii) express commonsense priors in control, e.g., "move faster", "push harder", or (iii) comprehend spatial relationships, e.g., "move the apple a bit to the left". As a result, incorporating each new skill (and mode of grounding) requires additional data and retraining – ergo the data burden persists, albeit passed to skill acquisition. This leads us to ask: how can LLMs be applied beyond just planning a sequence of skills?
Herein, we find that code-writing LLMs [1], [11], [22] are proficient at going further: orchestrating planning, policy logic, and control. LLMs trained on code-completion have been shown to be capable of synthesizing Python programs from docstrings. We find that these models can be re-purposed to write robot policy code, given natural language commands (formatted as comments). Policy code can express functions or feedback loops that process perception outputs (e.g., open vocabulary object detectors [2], [3]) and parameterize control primitive APIs (see Fig. 1). When provided with several example language commands followed by corresponding policy code (via few-shot prompting, in gray), LLMs can take in new commands (in green) and autonomously re-compose the API calls to generate new policy code (highlighted) respectively:
# if you see an orange, move backwards.
if detect_object("orange"):
    robot.set_velocity(x=-0.1, y=0, z=0)
# move rightwards until you see the apple.
while not detect_object("apple"):
    robot.set_velocity(x=0, y=0.1, z=0)
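For context, one hedged sketch of how such a perception primitive might be backed (the detector and camera handles below are assumptions for illustration, not the paper's APIs; the paper cites open-vocabulary detectors [2], [3] for perception):
# Hypothetical backing for the perception primitive above (sketch only);
# assumes module-level detector and camera handles with these interfaces.
def detect_object(name, score_threshold=0.3):
    # Query an open-vocabulary detector with the object name as a text prompt.
    detections = detector.detect(image=camera.read(), queries=[name])
    return any(d.score > score_threshold for d in detections)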
Code-writing models can express a variety of arithmetic operations as well as feedback loops grounded in language. They not only generalize to new instructions, but having been trained on billions of lines of code and comments, can also prescribe precise values (e.g., velocities) to ambiguous descriptions ("faster" and "to the left") depending on context, to elicit behavioral commonsense:
# do it again but faster, to the left, and with a banana.
while not detect_object("banana"):
    robot.set_velocity(x=0, y=-0.2, z=0)
Representing code as policies inherits a number of benefits from LLMs: not only the capacity to interpret natural language, but also the ability to engage in human-robot dialogue and Q&A simply by using "say(text)" as an available action primitive API:
# tell me why you stopped moving.
robot.say("I stopped moving because I saw a banana.")
We present Code as Policies (CaP): a robot-centric formulation of language model generated programs (LMPs) executed on real systems. Pythonic LMPs can express complex policies using:
• Classic logic structures e.g., sequences, selection (if/else), and loops (for/while) to assemble new behaviors at runtime.
• Third-party libraries to interpolate points (NumPy), analyze and generate shapes (Shapely) for spatial-geometric reasoning, etc.
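For instance, a small sketch of the spatial-geometric reasoning such libraries make expressible (object names and coordinates below are invented for illustration, not from the paper's prompts):
# Sketch: spatial reasoning with NumPy and Shapely on made-up block positions.
import numpy as np
from shapely.geometry import Point, Polygon

block_xy = {"red": (0.10, 0.20), "blue": (0.30, 0.25), "green": (0.22, 0.40)}

# "put the gripper between the red and blue blocks": interpolate a midpoint.
midpoint = np.mean([block_xy["red"], block_xy["blue"]], axis=0)

# "is the green block inside the triangle formed by the red block, the blue
# block, and the table corner?": test containment with Shapely.
triangle = Polygon([block_xy["red"], block_xy["blue"], (0.0, 0.5)])
inside = triangle.contains(Point(block_xy["green"]))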
LMPs can be hierarchical: prompted to recursively define new functions, accumulate their own libraries over time, and self-architect a dynamic codebase. We demonstrate across several robot systems that LLMs can autonomously interpret language commands to generate LMPs that represent reactive low-level policies (e.g., PD or impedance controllers), and waypoint-based policies (e.g., for vision-based pick and place, or trajectory-based control). Our main contributions are: (i) code as policies: a formulation of using LLMs to write robot code, (ii) a method for hierarchical code-gen that improves state-of-the-art on both robotics and standard code-gen problems with 39.8% P@1 on HumanEval [1], (iii) a new benchmark to evaluate future language models on robotics code-gen problems, and (iv) ablations that analyze how CaP improves metrics of generalization [23] and that it abides by scaling laws – larger models perform better. Code as policies presents a new approach to linking words, percepts, and actions, enabling applications in human-robot interaction, but is not without limitations. We discuss these in Sec. V. Full prompts and generated outputs are in the Appendix, which can be found along with additional results, videos, and code at code-as-policies.github.io.
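To illustrate the hierarchical idea, here is a minimal sketch (assumed and simplified, not the authors' implementation) of recursively defining undefined functions before executing generated code:
# Hierarchical code-gen sketch: if generated code calls a function that is
# not yet defined, ask the LLM to write it, recurse on that definition, and
# only then execute. generate_function stands in for an LLM query; scope
# must be pre-populated with the allowed robot/perception APIs.
import ast

def undefined_names(code, scope):
    # Names that are called in the code but missing from the given scope.
    called = {node.func.id for node in ast.walk(ast.parse(code))
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    return [name for name in called if name not in scope]

def run_hierarchical(code, scope, generate_function):
    for name in undefined_names(code, scope):
        body = generate_function(name)                    # LLM writes the definition
        run_hierarchical(body, scope, generate_function)  # define its helpers, then it
    exec(code, scope)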
Notes
Note that the code generation and decision making studied in this paper still rely on an external large model. From testing, the best code-completion LLM is code-davinci-002; the policy LLM is gpt-3.5-turbo-instruct (an instruction-following model); in addition, the baselines use the text-curie-001 model. Baseline method: CLIPort vs. CaP.
As the rest of the paper and the code make clear, the paper's focus is on building a multi-level decision-making framework around this external large model.
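As a rough sketch of what issuing one LMP query can look like (simplified and assumed here; the released Colab constructs prompts and calls the OpenAI completion API in a similar spirit):
# Rough sketch of an LMP query via the legacy OpenAI Completion API.
# The prompt pairs instruction comments with policy code (few-shot), then
# ends with a new instruction for the model to complete. Assumes
# openai.api_key is configured.
import openai

prompt = '''# move rightwards until you see the apple.
while not detect_object("apple"):
    robot.set_velocity(x=0, y=0.1, z=0)

# if you see an orange, move backwards.'''

response = openai.Completion.create(
    engine="code-davinci-002",  # code-completion model reported by the authors
    prompt=prompt,
    temperature=0,              # greedy decoding for repeatable code
    max_tokens=128,
    stop="#",                   # stop before the next instruction comment
)
generated_code = response["choices"][0]["text"]
exec(generated_code, api_scope)  # api_scope must expose the robot APIs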
3. Related Work
Controlling robots via language has a long history, including early demonstrations of human-robot interaction through lexical parsing of natural language [5]. Language serves not only as an interface for non-experts to interact with robots [24], [25], but also as a means to compositionally scale generalization to new tasks [9], [17]. The literature is vast (we refer to Tellex et al. [4] and Luketina et al. [26] for comprehensive surveys), but recent works fall broadly into the categories of high-level interpretation (e.g., semantic parsing [25], [27]–[32]), planning [14], [17], [18], and low-level policies (e.g., model-based [33]–[35], imitation learning [8], [9], [36], [37], or reinforcement learning [38]–[42]). In contrast, our work focuses on the code generation aspect of LLMs and uses the generated procedures as an expressive way to control the robot.
Large language models exhibit impressive zero-shot reasoning capabilities: from planning [14] to writing math programs [43]; from solving science problems [44] to using trained verifiers [45] for math word problems. These can be improved with prompting methods such as Least-to-Most [46], Think-Step-by-Step [15] or Chain-of-Thought [47]. Most closely related to this paper are works that use LLM capabilities for robot agents without additional model training. For example, Huang et al. decompose natural language commands into sequences of executable actions by text completion and semantic translation [14], while SayCan [17] generates feasible plans for robots by jointly decoding an LLM weighted by skill affordances [20] from value functions. Inner Monologue [18] expands LLM planning by incorporating outputs from success detectors or other visual language models and uses their feedback to re-plan. Socratic Models [16] uses visual language models to substitute perceptual information (in teal) into the language prompts that generate plans, and it uses language-conditioned policies, e.g., for grasping [36]. The following example illustrates the qualitative differences between our approach versus the aforementioned prior works. When tasked to "move the coke can a bit to the right":
LLM Plan [14], [17], [18]
1. Pick up coke can
2. Move a bit right
3. Place coke can
Socratic Models Plan [16]
objects = [coke can]
1. robot.grasp(coke can)  (open vocab)
2. robot.place_a_bit_right()
Plans generated by prior works assume there exists a skill that allows the robot to move an object a bit right. Our approach differs in that it uses an LLM to directly generate policy code (plans nested within) to run on the robot and avoids the requirement of having predefined policies to map every step in the plan:
Code as Policies (ours)
while not obj_in_gripper("coke can"):
    robot.move_gripper_to("coke can")
robot.close_gripper()
pos = robot.gripper.position
robot.move_gripper(pos.x, pos.y+0.1, pos.z)
robot.open_gripper()
Our approach (CaP) not only leverages logic structures to specify feedback loops, but it also parameterizes (and writes parts of) low-level control primitives. CaP alleviates the need to collect data and train a fixed set of predefined skills or language-conditioned policies, which are expensive and often remain domain-specific.
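To make "parameterizing low-level control primitives" concrete, a minimal sketch (the function name, gains, and the "push harder" mapping are invented here for illustration):
# Sketch of a parameterized low-level primitive an LMP could call and tune:
# a 1-D PD position controller returning a commanded acceleration.
def pd_step(x, x_target, v, kp=20.0, kd=2.0):
    return kp * (x_target - x) - kd * v

# An LMP can ground "push harder" in context simply by regenerating the call
# with a larger gain, e.g., pd_step(x, x_target, v, kp=40.0).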
Code generation has been explored with LLMs [1], [48] and without [49]. Program synthesis has been demonstrated to be capable of drawing simple figures [50] and generating policies that solve 2D tasks [51]. We expand on these works, showing that (i) code-writing LLMs enable novel reasoning capabilities (e.g., encoding spatial relationships by leaning on familiarity of third-party libraries) without the additional training needed in prior works [35], [36], [52]–[56], and (ii) hierarchical code-writing (inspired by recursive summarization [57]) improves state-of-the-art code generation. We also present a new robotics-themed code-gen benchmark to evaluate future language models in the robotics domain.
Notes
Here I would argue that the innovation this paper emphasizes does constitute a fourth paradigm beyond the three categories above (semantic parsing, planning, and low-level policies), because generating code with LLMs differs substantially from the traditional paradigms.
The difference lies in what the LLM generates code from. An LLM's input is natural language, and the way it produces code rests mainly on the learning methods underlying LLMs themselves, which is a major point of departure.
If reinforcement learning and imitation learning are one way of reasoning about how to act, then prompting a large model (or even the reasoning chains of a reasoning model) is a different way of thinking. To compensate for this way of thinking being less precise and less specialized, the paper proposes logic construction plus parameterized low-level control, along with the RoboCodeGen benchmark for validation.
Later in the paper it is noted that LLMs generate code autoregressively, which corroborates this point.