GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
- Qianhui Wu
- Kanzhi Cheng
- Rui Yang
- Chaoyun Zhang
- Jianwei Yang
- Huiqiang Jiang
- Jian Mu
- Baolin Peng
- Bo Qiao
- Reuben Tan
- Si Qin
- Lars Liden
- Qingwei Lin 林庆维
- Huan Zhang
- Tong Zhang
- Jianbing Zhang
- Dongmei Zhang
- Jianfeng Gao
One of the principal challenges in building VLM-powered GUI agents is visual grounding—localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment due to the lack of explicit spatial supervision; inability to handle ambiguous supervision targets, as single-point predictions penalize valid variations; and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated