Problem Statement
On desktops, agents have it easy. macOS and Linux offer open file systems, accessible APIs, terminal access, and minimal sandboxing. An agent on your laptop can read your files, run scripts, open applications, and interact with the OS at will.
Mobile platforms are the opposite.
iOS and Android sandbox every application. Every app runs in its own container. There's no terminal. There's no shared filesystem. Inter-app communication is restricted to narrow, platform-defined channels. Accessibility APIs exist but are limited and heavily gated.
Yet mobile phones are where agents would be most useful:
| Mobile Advantage | Why It Matters for Agents |
|---|---|
| Always-on, always-carried | Continuous monitoring, real-time assistance |
| Rich sensors | GPS, accelerometer, magnetometer, camera, microphone — data desktops don't have |
| Communication hub | Calls, SMS, messaging apps, email — the primary human communication device |
| App ecosystem | Banking, health, social media, productivity — the richest app ecosystem |
| Personal context | Calendar, contacts, photos, health data — the most personal device |
An agent that can natively interact with your phone — not through a browser proxy, not through a remote API, but directly on the device — can:
- Track daily habits and provide health insights using sensor data
- Manage communications across all messaging apps
- Assist elderly users by operating apps on their behalf
- Automate repetitive tasks across multiple apps
- Provide real-time contextual assistance based on location and activity
The Two Pathways
There are fundamentally two approaches to giving agents mobile access:
┌───────────────────────────────────────────────────────┐
│ PATH A: Bridge/Driver                                 │
│                                                       │
│ Agent runs EXTERNALLY (laptop/server) and CONTROLS    │
│ the phone remotely:                                   │
│ - ADB (Android Debug Bridge)                          │
│ - Accessibility Services                              │
│ - Screen mirroring + computer vision                  │
│ - USB/WiFi debugging protocols                        │
│ - Custom driver layer                                 │
│                                                       │
│ Pros: Works with existing phones                      │
│ Cons: Latency, requires tethering, limited access     │
└───────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────┐
│ PATH B: Native Agent OS                               │
│                                                       │
│ Build or modify a MOBILE OS that is agent-native:     │
│ - Custom Android ROM with agent privileges            │
│ - iOS jailbreak framework for agent access            │
│ - Agent-first mobile OS from scratch                  │
│ - Privileged agent app with system-level access       │
│                                                       │
│ Pros: Full access, native performance                 │
│ Cons: Requires custom hardware/ROM, security concerns │
└───────────────────────────────────────────────────────┘
Both pathways are valid submissions. You may also propose a hybrid approach.
Path A: Bridge/Driver Approach
What Must Be Solved
App Interaction Without Source Access: The agent must interact with apps (tap buttons, read text, navigate) without access to the app's source code. This means:
- Screen reading via accessibility APIs or OCR
- Touch injection via ADB or accessibility services
- App state detection without internal hooks
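The read-locate-tap loop above can be sketched with stock ADB tooling: dump the UI hierarchy with `uiautomator`, find a node by its visible text, and inject a tap at its center. A minimal sketch, assuming `adb` is on the PATH and a device is connected; the XML attribute regex is a heuristic, not a full parser:

```python
import re
import subprocess

def dump_ui() -> str:
    """Dump the current UI hierarchy as XML via uiautomator (device required)."""
    subprocess.run(["adb", "shell", "uiautomator", "dump", "/sdcard/ui.xml"], check=True)
    return subprocess.run(["adb", "shell", "cat", "/sdcard/ui.xml"],
                          capture_output=True, text=True, check=True).stdout

def center_of(bounds: str) -> tuple[int, int]:
    """Convert a uiautomator bounds string like '[0,60][1080,220]' to a tap point."""
    x1, y1, x2, y2 = map(int, re.findall(r"\d+", bounds))
    return (x1 + x2) // 2, (y1 + y2) // 2

def tap(x: int, y: int) -> None:
    """Inject a touch event at (x, y) via ADB input (device required)."""
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

def tap_button(xml: str, label: str) -> bool:
    """Find a node whose text matches `label` in the dumped XML and tap its center."""
    match = re.search(rf'text="{label}"[^>]*bounds="(\[[^"]+\])"', xml)
    if not match:
        return False
    tap(*center_of(match.group(1)))
    return True
```

In practice, an accessibility service on the device gives a richer, event-driven view of the same hierarchy; the `uiautomator` dump is the zero-install fallback.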
Cross-App Workflows: The agent must chain actions across multiple apps:
- Read a message in WhatsApp → check calendar → send a reply → create a reminder
- This requires fast app switching, state preservation, and reliable coordination
Sensor Access: The agent must read phone sensors:
- GPS location for context-aware assistance
- Accelerometer/gyroscope for activity detection
- Camera for visual understanding
Low-Latency Control: The bridge must be fast enough for real-time interaction:
- Touch-to-response under 200ms
- Screen capture at 5+ FPS
- Reliable state synchronization
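These budgets are worth measuring rather than assuming. A minimal Android-side benchmark over ADB, assuming a connected device: capture frames with `screencap` and compute the sustained frame rate:

```python
import subprocess
import time

def capture_frame() -> bytes:
    """Grab one PNG screenshot over ADB (device required)."""
    return subprocess.run(["adb", "exec-out", "screencap", "-p"],
                          capture_output=True, check=True).stdout

def fps(timestamps: list[float]) -> float:
    """Frames per second from a list of capture timestamps (seconds)."""
    if len(timestamps) < 2:
        return 0.0
    return (len(timestamps) - 1) / (timestamps[-1] - timestamps[0])

def benchmark(n: int = 20) -> float:
    """Measure sustained screen-capture FPS; 5 FPS is the target floor above."""
    stamps = []
    for _ in range(n):
        capture_frame()
        stamps.append(time.monotonic())
    return fps(stamps)
```

Plain `screencap` polling rarely clears 5 FPS on its own; tools like scrcpy hit much higher rates by streaming the device's hardware video encoder instead of one-shot screenshots.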
For Android (Via ADB + Accessibility)
┌──────────────┐      USB/WiFi       ┌───────────────┐
│  Agent Host  │────────────────────▶│    Android    │
│  (Laptop/    │    ADB Protocol     │    Device     │
│   Server)    │◀────────────────────│               │
│              │    Screen/State     │ Accessibility │
│  Agent Logic │                     │   Service     │
│  + Planning  │                     │  + ADB Daemon │
└──────────────┘                     └───────────────┘
For iOS (Via Instruments/XCTest)
iOS is significantly harder due to Apple's restrictions:
- No ADB equivalent (must use Xcode Instruments or libimobiledevice)
- Accessibility API is more restricted
- App sandboxing is stricter
- May require developer certificate or MDM enrollment
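As a starting point on the iOS side, libimobiledevice's CLI tools can at least establish a channel. A sketch assuming `ideviceinfo` is installed and a trusted device is attached over USB:

```python
import subprocess

def device_info() -> dict[str, str]:
    """Query a USB-connected iOS device via libimobiledevice's `ideviceinfo`.
    Requires libimobiledevice installed and a paired, trusted device."""
    out = subprocess.run(["ideviceinfo"], capture_output=True,
                         text=True, check=True).stdout
    return parse_info(out)

def parse_info(out: str) -> dict[str, str]:
    """ideviceinfo prints 'Key: value' lines; fold them into a dict."""
    info = {}
    for line in out.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            info[key] = value
    return info
```

Device identification is the easy part; UI-level control on iOS still requires an XCTest/WebDriverAgent-style runner signed with a developer certificate.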
Path B: Native Agent OS Approach
What Must Be Solved
Agent-Privileged Layer: A system service or framework that gives the agent:
- Root-level or system-level access to all apps
- Ability to read app data, inject touches, intercept notifications
- Direct sensor access without app permission prompts
Security Model: Agent access must be controlled:
- Not all agents should have full access
- User must be able to define what the agent can and cannot do
- Audit trail of all agent actions
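Independent of platform, the access-control core can be small: an allowlist of capabilities plus an append-only audit log consulted before every privileged action. A minimal in-memory sketch; the capability names are illustrative:

```python
import time
from dataclasses import dataclass, field

@dataclass
class AgentPolicy:
    """User-defined scope: which capabilities this agent may use."""
    allowed: set[str]                                         # e.g. {"read_screen", "tap"}
    audit_log: list[tuple[float, str, bool]] = field(default_factory=list)

    def authorize(self, capability: str) -> bool:
        """Check a capability against the allowlist and record the attempt
        (timestamp, capability, granted) -- denials are logged too."""
        ok = capability in self.allowed
        self.audit_log.append((time.time(), capability, ok))
        return ok

# Every privileged call in the agent layer goes through authorize() first:
policy = AgentPolicy(allowed={"read_screen", "tap"})
policy.authorize("tap")        # granted and logged
policy.authorize("read_sms")   # denied, but still logged for the audit trail
```

In a real system the policy would live in a privileged daemon and persist its log to tamper-evident storage, but the check-then-log shape stays the same.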
App Compatibility: The modified OS must still run standard apps:
- Play Store / App Store apps must work
- Banking apps (which detect rooting) should still function
- Performance should not degrade
For Android (Custom ROM)
- Fork AOSP (Android Open Source Project)
- Add an agent service layer between the framework and apps
- Expose agent APIs for app interaction, sensor access, and cross-app communication
- Package as a flashable ROM for common devices
For iOS (Jailbreak Framework or Supervised Mode)
- Use Apple's MDM (Mobile Device Management) or Supervised Mode for enterprise-level control
- Or develop a jailbreak-based framework (with clear security documentation)
- Expose agent APIs through a privileged daemon
Deliverables
1. Working Prototype (Required)
For Path A (Bridge/Driver):
- A host-side agent application that connects to a phone
- Demonstrated ability to:
  - Open and interact with at least 3 different apps
  - Read on-screen text and respond to it
  - Execute a multi-app workflow end-to-end
  - Read at least one sensor (GPS or accelerometer)
- Supported on at least one platform (Android OR iOS)
For Path B (Native Agent OS):
- A bootable/installable OS image or ROM
- Demonstrated ability to:
  - Run standard apps from the respective app store
  - Interact with apps natively (not through screen scraping)
  - Access sensors directly
  - Share data across apps through the agent layer
- Supported on at least one device or emulator
2. README.md (Technical Documentation)
- Architecture: Complete system design with component diagrams
- Security Model: What access does the agent have? How is it controlled? What are the risks?
- Latency Measurements: Touch-to-response time, screen capture FPS, sensor polling rate
- Supported Devices: Which phones/Android versions/iOS versions are supported
- Limitation Analysis: What the system cannot do and why
- Privacy Framework: How user data is protected from unauthorized agent access
3. SDK / Integration Guide
- API documentation for agent developers
- Sample agent that demonstrates the full workflow
- Setup instructions (how to install, configure, and run)
Evaluation Criteria
| Criteria | Weight | Description |
|---|---|---|
| Functional Demo | 30% | Does the agent actually control a phone and complete real tasks? |
| Coverage | 20% | How many apps, sensors, and workflows are supported? |
| Latency & Reliability | 15% | Is it fast and reliable enough for practical use? |
| Security Model | 15% | Is user data protected? Is the access model well-designed? |
| Documentation & Usability | 10% | Can another developer use this system? |
| Novelty | 10% | Does this approach offer something new beyond existing tools? |
Constraints
- Must support at least one platform (Android OR iOS) — supporting both is bonus
- For Android: Must work on Android 12+ (API 31+)
- For iOS: Must work on iOS 16+ (or latest jailbreakable version)
- The agent must complete at least one end-to-end workflow involving 2+ apps
- Must include a security/privacy model — "root access to everything" is not acceptable without access controls
- Must work on a real device or official emulator (not just a simulator)
Bonus Points
- Dual Platform Support: Works on both Android AND iOS
- App Store Submission: A companion app that can be legitimately installed (for Path A)
- Offline Capable: Agent can operate without constant internet connection
- Elderly/Accessibility Use Case: Demonstrate an agent that helps elderly users navigate their phone
- Sensor Fusion: Agent uses multiple sensors simultaneously (GPS + accelerometer + time) for context understanding
Resources & Inspiration
- Android Debug Bridge (ADB) — Android remote control
- Android Accessibility Services — Programmatic UI interaction
- UI Automator (Android) — UI testing framework
- Appium — Cross-platform mobile automation
- libimobiledevice — Open-source iOS communication library
- AOSP (Android Open Source Project) — Base for custom Android ROMs
- LineageOS — Popular custom Android ROM
- Apple MDM Protocol — Enterprise device management
- Scrcpy — Screen mirroring and control for Android
- Frida — Dynamic instrumentation toolkit for apps