1. grounding split: the authors convert the meta information in the following datasets into a unified pyautogui command format
2. planning & reasoning split
"Thanks to our detailed inner monologue trajectory data, we implement a reasoning mixture approach, where the model is exposed to various levels of cognitive complexity, from straightforward low-level action instructions to full inner monologues that include observation descriptions, thoughts, and detailed action plans. By dynamically adjusting the complexity of these trajectories, we train the model to be adaptable, fostering step-by-step reasoning and high-level decision-making abilities. This diversity in reasoning ensures that the model can handle a wide range of tasks with nuanced understanding and precision."
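The reasoning mixture described in the quote can be sketched roughly as follows. This is only an illustration: the `Step` schema, the level names `low_level` and `full`, the field labels, and the uniform random choice between levels are all assumptions made here, not the paper's actual templates or sampling scheme.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    """One annotated trajectory step (hypothetical schema, for illustration only)."""
    observation: str  # description of what is on screen
    thought: str      # reasoning about what to do next
    instruction: str  # low-level action instruction
    action: str       # target pyautogui command, kept as a plain string


def render_target(step: Step, level: str) -> str:
    """Build the training target for a step at the requested reasoning level."""
    if level == "low_level":
        # Simplest form: only the low-level instruction precedes the command.
        parts = [f"Action: {step.instruction}"]
    elif level == "full":
        # Full inner monologue: observation, thought, then the action plan.
        parts = [
            f"Observation: {step.observation}",
            f"Thought: {step.thought}",
            f"Action: {step.instruction}",
        ]
    else:
        raise ValueError(f"unknown level: {level}")
    parts.append(step.action)
    return "\n".join(parts)


def sample_target(step: Step, rng: random.Random) -> str:
    """Pick a level at random so training sees both simple and full targets."""
    return render_target(step, rng.choice(["low_level", "full"]))


example = Step(
    observation="A search page with an empty query box at the top.",
    thought="I need to focus the query box before typing.",
    instruction="Click the search box.",
    action="pyautogui.click(x=0.42, y=0.08)",  # illustrative command string
)
print(render_target(example, "full"))
```

Both levels end in the same pyautogui command; only the amount of reasoning text in front of it changes.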
reduces overall GPU hours from 6 hours to 1 hour. Moreover, this strategy even marginally improves performance on the ScreenSpot website split, from 73.3 to 76.8.
A 72B VLM can be fine-tuned in 2 days on a 16-node cluster.
⛔ "We train AGUVIS on a cluster of H100-80G GPUs: AGUVIS-7B uses 8 nodes and completes the grounding training within 5 hours and planning & reasoning training within 1 hour. AGUVIS-72B uses 16 nodes and completes the grounding training within 30 hours and planning & reasoning training within 6 hours."
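To make the quoted figures easier to compare, the arithmetic can be written out as below. This is a rough sketch that treats the "within N hours" figures as upper bounds; the function name `training_budget` is made up for this note, and GPU-hours are left out because the quote does not state the number of GPUs per node.

```python
# (nodes, grounding hours, planning & reasoning hours), taken from the quote above.
RUNS = {
    "AGUVIS-7B": (8, 5, 1),
    "AGUVIS-72B": (16, 30, 6),
}


def training_budget(nodes: int, grounding_h: float, planning_h: float) -> dict:
    """Return total wall-clock hours, days, and node-hours across both stages."""
    total_h = grounding_h + planning_h
    return {
        "wall_clock_hours": total_h,
        "wall_clock_days": total_h / 24,
        "node_hours": nodes * total_h,
    }


for name, run in RUNS.items():
    print(name, training_budget(*run))
```

Under these bounds the 72B run takes at most 36 wall-clock hours (1.5 days) on 16 nodes, i.e. 576 node-hours, while the 7B run takes at most 6 hours on 8 nodes, i.e. 48 node-hours.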
Does the second-stage training data also mix in low-level instruction data?
Enforced Plan & Self Plan
all: predict the inner monologue (IM); os: predict the concrete action
Enforced Plan: employ the <|recipient|>all\nThought prompt to compel the model to first generate a planning phase, and then a pyautogui command.
Self Plan: do not add any word after <|recipient|>, so the model can choose to generate os to directly produce a pyautogui command, or generate all to first create natural language reasoning and then generate a pyautogui command.
The authors find that Enforced Plan performs better, reducing the grounding error by 20%.
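The difference between the two modes comes down to what text is pre-filled before the model starts generating. A minimal sketch follows, assuming the special-token string `<|recipient|>` and using the hypothetical helpers `generation_prefix` and `route`; the real chat template and decoding code may differ.

```python
from typing import Tuple

RECIPIENT = "<|recipient|>"  # assumed special-token string, for illustration


def generation_prefix(mode: str) -> str:
    """Return the text pre-filled at the start of the assistant turn."""
    if mode == "enforced":
        # Enforced Plan: force the "all" recipient and open a Thought,
        # so the model must plan before emitting a pyautogui command.
        return f"{RECIPIENT}all\nThought"
    if mode == "self":
        # Self Plan: stop right after the token and let the model choose
        # between "os" (direct command) and "all" (reason first).
        return RECIPIENT
    raise ValueError(f"unknown mode: {mode}")


def route(prefix: str, completion: str) -> Tuple[str, str]:
    """Split prefix + completion into (recipient, body)."""
    text = prefix + completion
    if text.startswith(RECIPIENT):
        text = text[len(RECIPIENT):]
    recipient, _, body = text.partition("\n")
    return recipient.strip(), body


# Self Plan: here the model chose to answer directly with a command.
print(route(generation_prefix("self"), "os\npyautogui.click(x=0.5, y=0.5)"))
# Enforced Plan: the completion continues the forced Thought.
print(route(generation_prefix("enforced"), ": the button is in the center."))
```

With the enforced prefix the recipient is always all, so a reasoning segment always precedes the command; with the self prefix the first generated line decides the route.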