专栏名称: ChatAI42技术与产品

智能聊天机器人（Chatbots）是交互的新趋势，Google、Facebook、Microsoft、百度、阿里等众多公司已加入此阵列，就等你了！我们会定期发布聊天机器人的各种信息，其中使用的机器学习/深度学习技术、产品、分享活动等等

Aguvis：提升的不仅是 UI Agent 的规划推理能力

ChatAI42技术与产品 · 公众号 · 机器人 · 2024-12-13 20:08

正文

请到「今天看啥」查看全文

Home ^[1] | GitHub ^[2] | Twitter ^[3] | Youtube ^[4] | Bilibili ^[5]

本文介绍 来自 HKU & Salesforce 的 Aguvis 。如我之前所说，这篇论文（数据、代码都会开源）至少值 2 个算法工程师 1 个月的工资。论文里面有很多细节都值得深挖，属于外行看热闹，内行看门道的那种。

本文是视频 UI Agent 论文分享：Aguvis-来自 HKU & Salesforce 的大一统训练数据和训练框架 ^[6] 对应的文字版，建议与视频对照着看。

Aguvis 相关资料：

[2412.04454] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction ^[7] , HKU & Salesforce
https://aguvis-project.github.io ^[8]
【视频分享】 UI Agent 论文分享：Aguvis-来自 HKU & Salesforce 的大一统训练数据和训练框架 ^[9]

Aguvis 这个词应该是作者造的，没查到什么意思。发现这个工作的作者跟 OS-Copilot ^[10] 还有耦合，而 OS-Copilot ^[11] 跟 OS-Atlas ^[12] 是相同的一作。

Aguvis 基于 Qwen2-VL-7B 和 Qwen2-VL-72B 进行全量微调（只 freeze ViT 部分） ，设置最大序列长度为 8192，max pixels 为 1280 x 720。

本文主要贡献：

生成了 IM（observation、thought、low-level instruction）数据，相当于 planning & reasoning 数据，用于第二阶段的模型微调。验证了 IM 数据能大幅提升模型的效果
构建了统一的 grounding 和 reasoning 大数据集，数据即将开源

利用 pyautogui 统一了不同平台的动作空间，这样来自不同平台的数据可以统一使用

训练数据使用 grounding packing strategy 方法，把训练效率提升了 5 倍

把多个单轮的 grounding 任务合成一个多轮的单个任务

统一了 grounding 和 planning & reasoning 2 个训练阶段的数据格式

论文详解

比较标准的两阶段训练方式。第一阶段主要针对 grounding 能力，第二阶段主要针对 planning & reasoning 能力。

Inner Monologue （ 内心独白 ，简称 IM ）包括 3 个部分：

1. observation description
2. internal reasoning (thought)
3. low-level action instruction

决策过程可以分为 2 步完成：Planner 生成 IM 内容，然后 Grounder 按照产生具体的 grounding 信息。

可插拔的动作空间

把动作执行 统一成了函数调用 （可以借力 base 模型的 function call 能力）：

类似函数调用的方式在 prompt 中告知有哪些函数是可调用的。

Aguvis Collection 数据集

Aguvis Collection 数据集 是作者汇总其他数据集构建的训练数据集；包括以下 2 部分，顾名思义，对应上面的两阶段训练； 后续会开源

1. grounding split ：作者把以下数据集中的 Meta 信息都统一成 pyautogui 命令格式的数据

2. planning & reasoning split

"Thanks to our detailed inner monologue trajectory data, we implement a reasoning mixture approach , where the model is exposed to various levels of cognitive complexity , from straightforward low-level action instructions to full inner monologues that include observation descriptions, thoughts, and detailed action plans. By dynamically adjusting the complexity of these trajectories, we train the model to be adaptable, fostering step-by-step reasoning and high-level decision-making abilities. This diversity in reasoning ensures that the model can handle a wide range of tasks with nuanced understanding and precision."

Grounding Stage

以下是 grounding 阶段训练使用的数据格式：

⁉️ 疑问：

1. 对于 grounding 数据，Prompt 中的 overall_goal 和 previous_actions 分别是什么？

2. 这个标记的用途是什么？

模型可以利用这个标记来 识别需要关注的特定部分 ，从而生成更加相关和准确的内容。例如，在进行内容编辑或补全时，模型能够基于此标记理解上下文中的变化。

Grounding Packing Strategy

效率提升了 5 倍，效果还稍微有点提升。

reduces overall GPU hours from 6 hours to 1 hour. Moreover, this strategy even marginally improve the performance of ScreenSpot website split from 73.3 to 76.8.
可以在 16 个节点的机器上花费 2 天微调 72B VLM。

⛔ "We train AGUVIS on a cluster of H100-80G GPUs: AGUVIS-7B uses 8 nodes and completes the grounding training within 5 hours and planning & reasoning training within 1 hour . AGUVIS-72B uses 16 nodes and completes the grounding training within 30 hours and planning & reasoning training within 6 hours ."

Planning & Reasoning Stage

IM 是用户自己通过 GPT-4o 构造出来的。

使用 GPT-4o 生成 planning & reasoning 数据，以下是 prompt 和示例：

上面获得的增强数据需要满足以下条件才被认为是成功的：

Match the action type and action target elements of the ground truth
Correctly describe the step’s intention
Establish a clear connection between the step’s intention and the overall goal
Assist the agent in successfully completing the task

在抽样的数据当中，作者发现 86.7％ 展现出了与真实动作和总体目标的动作意图相一致的中间推理。剩下的 7.8％ 的案例受到 数据集噪声 的影响（任务中的不相关或不必要动作）， 5.5％ 的案例则是由于在干净数据下对动作意图的误读。

作者分析发现，训练数据中的 非必要动作 可能致使 VLM 无法在这些多余动作和总体目标之间建立关联，最终造成不正确的推理和规划。

以下是此阶段训练使用的数据格式：

all ：预测 IM； os ：预测具体动作

作为对比，以下是上面给出的 Grounding 阶段的数据格式：

一些注意点：

planning 阶段的具体动作选择，形式上和 grounding 阶段是一样的
"Thanks to our detailed inner monologue trajectory data, we implement a reasoning mixture approach , where the model is exposed to various levels of cognitive complexity , from straightforward low-level action instructions to full inner monologues that include observation descriptions, thoughts, and detailed action plans. By dynamically adjusting the complexity of these trajectories, we train the model to be adaptable, fostering step-by-step reasoning and high-level decision-making abilities. This diversity in reasoning ensures that the model can handle a wide range of tasks with nuanced understanding and precision."

第二阶段的训练数据中，也混合了 low-level instructions 数据？

Enforced Plan & Self Plan

all ：预测 IM； os ：预测具体动作

Enforced Plan : employ the all\nThought prompt to compel the model to first generate a planning phase , and then a pyautogui command .

Self Plan : do not add any word after , so the model can choose to generate os to directly produce a pyautogui command, or generate all to first create natural language reasoning and then generate a pyautogui command.

作者发现 使用 Enforced Plan 能获得更好的效果 ，把 grounding Error 降低 20%。

各阶段训练效果

Grounding 能力：

Planning 能力：

消融实验

省略第二阶段（规划和推理）对模型的步骤成功率有更显著的负面影响，表明 规划训练 对于提高代理处理复杂 GUI 任务的能力至关重要。

提升可归因于两个关键因素：使用 IM 让模型能够引出对当前步骤的推理，同时推理作为背景也有助于为后续步骤进行更有效的规划。

另外，将训练数据中的 low-level instructions 纳入进来提高了模型动作执行的准确性。

一些未来的优化方向

improving instruction clarity through the agent model itself（40% 的错误来自于指令不够清晰）
developing adaptive planning mechanisms
refining training data to include more diverse planning scenarios（更多任务类型）

UI Agents 知识星球&分享视频

UI Agents 技术发展迅猛，想紧跟 UI agents 技术前沿？我们的知识星球每周以视频方式 解读最新论文 ，为你开启技术新视野，快来加入吧！

加入知识星球，每周获取会员专享视频👇

扫码加微信小助手为好友，备注「agent」，小助手会定期邀请入群👇

当前星球包含的专享视频包括：

【2024.12.08】 UI Agent 论文分享：Aguvis-来自 HKU & Salesforce 的大一统训练数据和训练框架 ^[13]
【2024.12.01】 UI Agent 论文分享：ShowUI-当前最好的 UI Agents 开源模型，还适用中文 APP？ ^[14]
【2024.11.24】 UI Agent 论文分享：使用世界模型提升 UI Agents 效果？ ^[15]
【2024.11.17】 UI Agent 论文分享：来自华为诺亚方舟实验室的 LiMAC ^[16]
【2024.11.11】 UI Agent 论文分享：来自 LG AI Research 的 Auto-Intent ^[17]

Aguvis：提升的不仅是 UI Agent 的规划推理能力

正文

请到「今天看啥」查看全文

论文详解

可插拔的动作空间

Aguvis Collection 数据集

Grounding Stage

Grounding Packing Strategy

Planning & Reasoning Stage

Enforced Plan & Self Plan

各阶段训练效果

消融实验

一些未来的优化方向

UI Agents 知识星球&分享视频

更多推荐阅读

引用链接

请到「今天看啥」查看全文