GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Qianhui Wu; Kanzhi Cheng; Rui Yang; Chaoyun Zhang; Jianwei Yang; Huiqiang Jiang; Jian Mu; Baolin Peng; Bo Qiao; Reuben Tan; Si Qin; Lars Liden; Qingwei Lin 林庆维; Huan Zhang; Tong Zhang; Jianbing Zhang; Dongmei Zhang; Jianfeng Gao

ArXiv | June 2025

One of the principal challenges in building VLM-powered GUI agents is visual grounding: localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment due to the lack of explicit spatial supervision; inability to handle ambiguous supervision targets, as single-point predictions penalize valid variations; and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate the candidate regions and select the most plausible one for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, on ScreenSpot-Pro, GUI-Actor-7B achieves scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones, outperforming UI-TARS-72B (38.1) with significantly fewer parameters and less training data. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for the 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
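The core idea of attending from a dedicated token to visual patch tokens, rather than emitting coordinates as text, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`patch_attention`, `candidate_points`), the use of plain scaled dot-product attention without learned projections, and the top-k selection of candidate patches are all assumptions made for illustration, and the grounding verifier is omitted entirely.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def patch_attention(action_vec, patch_vecs):
    """Attention of one action-token vector over all patch-token vectors.

    Hypothetical simplification: a real action head would apply learned
    projections first; here we use raw scaled dot products.
    """
    scale = math.sqrt(len(action_vec))
    scores = [
        sum(a * p for a, p in zip(action_vec, vec)) / scale
        for vec in patch_vecs
    ]
    return softmax(scores)

def candidate_points(weights, grid_w, grid_h, k=1):
    """Return normalized (x, y) centers of the k highest-weight patches.

    Patches are assumed to be stored in row-major order on a
    grid_w x grid_h grid (an assumption of this sketch).
    """
    assert len(weights) == grid_w * grid_h
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    points = []
    for idx in ranked:
        row, col = divmod(idx, grid_w)
        points.append(((col + 0.5) / grid_w, (row + 0.5) / grid_h))
    return points

# Toy example: a 2x2 patch grid with 2-dimensional features.
action = [1.0, 0.0]
patches = [[0.0, 1.0], [0.1, 0.0], [3.0, 0.0], [0.5, 0.5]]
weights = patch_attention(action, patches)
points = candidate_points(weights, grid_w=2, grid_h=2, k=2)
```

Because the output is a distribution over patches rather than a single coordinate string, several candidate regions fall out of one forward pass. A separate scoring step, such as the verifier the abstract describes, could then choose among them.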