3PO - The AI Agent that can See and Talk
A voice-first AI agent that connects to my Hermes agent, reads my screen on demand, and writes to my Obsidian vault.
Project Scope
AI Skills:
Multi-Model Coordination, Tool Architecture, Permission Design, Voice agent latency, MacOS Process Responsibility (TCC)
Stack:
OpenAI Realtime API, gpt-4o (vision), Gmail API, Google Drive API, PyWebView, Python
Workflow:
Spec-first build with Claude Design and Claude Code
Starting up 3PO and having a sample conversation.
The Problem
I tried talking to my main agent (Hermes) through a Discord voice channel and the latency was killing the interaction model. I also had a recurring annoyance - sending screenshots to different chat models whenever something on my screen confused me - that turned every quick question into a multi-step process. I wanted a voice-first agent with screen vision and constrained permissions, sitting between me and my main agent.
My Approach
The voice-first interface — no keyboard input for the core interactions.
3PO has seven tools built directly into the Python server: create_note, web_search, describe_screen, search_email, draft_email, send_draft, search_drive. Gmail uses gmail.readonly + gmail.compose - deletion was made impossible by making OAuth scopes not permit it. Drive uses drive.readonly. Email sending is two-step: draft_email creates a draft, send_draft requires a separate explicit instruction. The model doesn't auto-send.
The hardest engineering problem was screen capture. OpenAI's Realtime API has no vision capability. The workaround was to set up a dual-model: when the agent calls describe_screen, the Python server runs screencapture as a subprocess, then makes a separate synchronous call to gpt-4o with the screenshot as a base64 image. The text description gets injected back into the realtime session as a tool result.
Screen Recording permission is tied to the responsible process, not the Python process doing the capture. This meant I needed to build a proper .app bundle so the bundle itself could becomes the responsible process and the permission could be allowed.
Obsidian integration writes notes mid-conversation. Email is two-step: draft now, send only on explicit instruction.
3PO reads three markdown files from the Obsidian vault on startup - personal context, project task board, weekly status - and injects them into the system prompt before every session. It already knows ongoing projects and priorities before I say a word.
Mid-conversation, create_note writes new markdown files directly to the vault. No sync delay, no separate step. The agent speaks any language and responds in kind.
A focused look at the dual-model vision flow:
Voice Query → Screen Capture → GPT-4o Vision Call → Spoken Response.
The Outcome
Failure modes of AI tools are architectural, not just probabilistic. Whether the model calls describe_screen or answers from training isn't a prompt engineering problem; it's a system design problem about how you frame tool availability and context. The model needs structural evidence the tool works before it'll call it reliably.
Permission design is product design — choosing gmail.readonly + gmail.compose over full Gmail access isn't a security afterthought; it's the feature that lets you tell users exactly what the agent can and can't do with a hard technical guarantee rather than a policy promise.
Latency shapes the entire interaction model in a way text agents hide. A 2-4 second vision round-trip feels fine in chat and completely wrong in voice, which changes what tools are worth building and how they should behave.
The complete agent in use, end to end.
