Meet PC-Agent: A Hierarchical Multi-Agent Collaborative Framework for Complex Task Automation on PC

Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across different domains, propelling their development into multimodal assistants for humans. GUI automation agents for PCs face particularly daunting challenges compared to their smartphone counterparts. PC environments present significantly more complex interactive elements, with dense, diverse icons and widgets that often lack textual labels, leading to perception difficulties. Even advanced models such as Claude-3.5 achieve only 24.0% accuracy on GUI grounding tasks. PC productivity tasks also involve intricate workflows spanning multiple applications, with long operation sequences and inter-subtask dependencies, causing dramatic performance drops: GPT-4o's success rate falls from 41.8% at the subtask level to only 8% for complete instructions.

Previous approaches have developed frameworks to tackle PC task complexity with different strategies. UFO implements a dual-agent architecture that separates application selection from specific control interactions, while Agent-S enhances planning capabilities by combining online search with local memory. However, these methods show significant limitations in the fine-grained perception and manipulation of on-screen text, a critical requirement for productivity scenarios such as document editing. In addition, they generally fail to address the complex dependencies between subtasks, resulting in poor performance on the realistic intra- and inter-app workflows that characterize everyday PC use.

Researchers from MAIS, Institute of Automation, Chinese Academy of Sciences; the School of Artificial Intelligence, University of Chinese Academy of Sciences; Alibaba Group; Beijing Jiaotong University; and the School of Information Science and Technology, ShanghaiTech University introduce the PC-Agent framework to address complex PC scenarios through three innovative designs. First, an Active Perception Module improves fine-grained interaction by extracting the locations and meanings of interactive elements via accessibility trees, while using MLLM-driven intention understanding and OCR for precise text localization. Second, Hierarchical Multi-Agent Collaboration implements a three-level decision process (Instruction-Subtask-Action): a Manager Agent decomposes instructions into parameterized subtasks and manages dependencies, a Progress Agent tracks operation history, and a Decision Agent executes steps using perception and progress information. Third, Reflection-Based Dynamic Decision-Making introduces a Reflection Agent that assesses execution correctness and provides feedback, enabling top-down task decomposition with bottom-up precision feedback across all four collaborating agents.
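The Instruction-Subtask-Action hierarchy described above can be sketched in a few lines of Python. This is a hedged toy illustration, not the authors' implementation: all function names and the string-splitting "decomposition" are assumptions, and each stub stands in for an MLLM-backed agent in the real system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Subtask:
    description: str
    depends_on: List[int] = field(default_factory=list)  # inter-subtask dependencies

def manager_decompose(instruction: str) -> List[Subtask]:
    """Stub Manager Agent: split on ', then ' as a toy decomposition.
    Each subtask depends on the one before it."""
    parts = instruction.split(", then ")
    return [Subtask(p, depends_on=[i - 1] if i else []) for i, p in enumerate(parts)]

def decision_step(sub: Subtask, progress: List[str]) -> str:
    """Stub Decision Agent: emit one symbolic GUI action per subtask."""
    return f"do({sub.description})"

def reflect(action: str) -> bool:
    """Stub Reflection Agent: accept every action in this sketch."""
    return True

def run(instruction: str) -> List[str]:
    progress: List[str] = []  # operation history, maintained by the Progress Agent
    for sub in manager_decompose(instruction):
        # The real system would re-perceive the screen before every step.
        action = decision_step(sub, progress)
        if reflect(action):  # bottom-up correctness feedback
            progress.append(action)
    return progress

print(run("open Excel, then enter the totals"))
# → ['do(open Excel)', 'do(enter the totals)']
```

The point of the hierarchy is visible even in this stub: the Manager only ever reasons about subtasks, while the Decision/Reflection loop only ever reasons about one action at a time, so no single agent faces the full instruction-length decision problem.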

PC-Agent's architecture addresses GUI interaction through a formalized approach in which an agent ρ processes a user instruction I, observations O, and history H to determine actions A. The Active Perception Module improves element recognition by using pywinauto to extract accessibility trees for interactive elements, while applying MLLM-driven intention understanding together with OCR for precise text localization. For complex workflows, PC-Agent implements hierarchical multi-agent collaboration across three levels: the Manager Agent decomposes instructions into parameterized subtasks and manages dependencies; the Progress Agent tracks operation progress within subtasks; and the Decision Agent executes step-by-step actions based on environmental perception and progress information. This hierarchical division effectively reduces decision-making complexity by breaking complex tasks into manageable components with clear interdependencies.
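A rough sketch of how the Active Perception Module's two sources might be combined: the paper's module walks the Windows accessibility (UIA) tree via pywinauto and localizes text with OCR, but here both sources are represented as plain records so the merging idea stands alone. The field names, the IoU-based de-duplication, and the 0.5 threshold are all illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Element:
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels
    label: str                       # accessibility-tree name, or OCR'd text
    source: str                      # "a11y" or "ocr"

def merge_perception(a11y: List[Element], ocr: List[Element],
                     iou_threshold: float = 0.5) -> List[Element]:
    """Keep every accessibility element; add OCR boxes only when they do
    not substantially overlap an existing element (toy de-duplication)."""
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0
    merged = list(a11y)
    for text_box in ocr:
        if all(iou(text_box.bbox, e.bbox) < iou_threshold for e in a11y):
            merged.append(text_box)
    return merged

# A labeled "File" button is seen by both sources; only OCR sees loose text.
controls = [Element((0, 0, 100, 30), "File", "a11y")]
texts = [Element((2, 2, 98, 28), "File", "ocr"),
         Element((0, 40, 80, 60), "Total: 42", "ocr")]
print([e.label for e in merge_perception(controls, texts)])
# → ['File', 'Total: 42']
```

The duplicate "File" OCR box is dropped because it overlaps the accessibility element, while the free-floating document text survives, which is exactly the gap (unlabeled or text-only regions) that OCR is meant to fill.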

Experimental results demonstrate PC-Agent's superior performance compared to both single-agent and multi-agent alternatives. Single MLLM-based agents (GPT-4o, Gemini-2.0, Claude-3.5, Qwen2.5-VL) consistently fail on complex instructions, with even the best performer achieving only a 12% success rate, confirming that single agents struggle with long operation sequences and complex dependencies. Multi-agent frameworks such as UFO and Agent-S show modest improvements but remain limited by perception shortcomings and dependency-management issues. They struggle with fine-grained operations such as text editing in Word or correct data entry in Excel, and they often fail to use information from previous subtasks. In contrast, PC-Agent substantially outperforms all previous methods, surpassing UFO by 44% and Agent-S by 32% in success rate through its Active Perception Module and hierarchical collaboration.

This study introduces the PC-Agent framework, a significant advance in handling complex PC-based tasks through three key innovations. The Active Perception Module provides refined perception and operation capabilities, enabling precise interaction with GUI elements and text. The hierarchical multi-agent collaborative architecture effectively distributes decision-making across the instruction, subtask, and action levels, while reflection-based dynamic decision-making enables real-time error detection and correction. Validation on the newly created PC-Eval benchmark, with its realistic, complex instructions, confirms PC-Agent's superior performance over previous methods, demonstrating its effectiveness in navigating the complicated workflows and interactive environments characteristic of PC productivity scenarios.

Check out the paper and the GitHub page. All credit for this research goes to the researchers of this project.


Asjad is a consulting intern at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.
