Agentic workflows and AI architecture. Projects that lie outside the scope of design.

Caleb Kim

Builder

1 year of experience

Agentic workflows and AI architecture. Projects that lie outside the scope of design.

Caleb Kim

Builder

1 year of experience

3PO - The AI Agent that can See and Talk

A voice-first AI agent that connects to my Hermes agent, reads my screen on demand, and writes to my Obsidian vault.

Project Scope

AI Skills:

Multi-Model Coordination, Tool Architecture, Permission Design, Voice agent latency, MacOS Process Responsibility (TCC)

Stack:

OpenAI Realtime API, gpt-4o (vision), Gmail API, Google Drive API, PyWebView, Python

Workflow:

Spec-first build with Claude Design and Claude Code

Starting up 3PO and having a sample conversation.

The Problem

I tried talking to my main agent (Hermes) through a Discord voice channel and the latency was killing the interaction model. I also had a recurring annoyance - sending screenshots to different chat models whenever something on my screen confused me - that turned every quick question into a multi-step process. I wanted a voice-first agent with screen vision and constrained permissions, sitting between me and my main agent.


My Approach


The voice-first interface — no keyboard input for the core interactions.

3PO has seven tools built directly into the Python server: create_note, web_search, describe_screen, search_email, draft_email, send_draft, search_drive. Gmail uses gmail.readonly + gmail.compose - deletion was made impossible by making OAuth scopes not permit it. Drive uses drive.readonly. Email sending is two-step: draft_email creates a draft, send_draft requires a separate explicit instruction. The model doesn't auto-send.

The hardest engineering problem was screen capture. OpenAI's Realtime API has no vision capability. The workaround was to set up a dual-model: when the agent calls describe_screen, the Python server runs screencapture as a subprocess, then makes a separate synchronous call to gpt-4o with the screenshot as a base64 image. The text description gets injected back into the realtime session as a tool result.

Screen Recording permission is tied to the responsible process, not the Python process doing the capture. This meant I needed to build a proper .app bundle so the bundle itself could becomes the responsible process and the permission could be allowed.

Obsidian integration writes notes mid-conversation. Email is two-step: draft now, send only on explicit instruction.

3PO reads three markdown files from the Obsidian vault on startup - personal context, project task board, weekly status - and injects them into the system prompt before every session. It already knows ongoing projects and priorities before I say a word.

Mid-conversation, create_note writes new markdown files directly to the vault. No sync delay, no separate step. The agent speaks any language and responds in kind.

A focused look at the dual-model vision flow:

Voice Query → Screen Capture → GPT-4o Vision Call → Spoken Response.

The Outcome

Failure modes of AI tools are architectural, not just probabilistic. Whether the model calls describe_screen or answers from training isn't a prompt engineering problem; it's a system design problem about how you frame tool availability and context. The model needs structural evidence the tool works before it'll call it reliably.

Permission design is product design — choosing gmail.readonly + gmail.compose over full Gmail access isn't a security afterthought; it's the feature that lets you tell users exactly what the agent can and can't do with a hard technical guarantee rather than a policy promise.

Latency shapes the entire interaction model in a way text agents hide. A 2-4 second vision round-trip feels fine in chat and completely wrong in voice, which changes what tools are worth building and how they should behave.

The complete agent in use, end to end.