PitCrew - Pre-Deploy Agent Token Analysis Tool
A web tool that forecasts what your AI agent will cost before you build it, and tells you what to change before you ship to cut that cost further.
Project Scope
AI Skills:
Token economics, multimodal cost modeling, model selection, prompt caching, batch APIs, uncertainty propagation, validation methodology
Stack:
Next.js, Supabase, Vercel, gpt-tokenizer + official count_tokens APIs, Claude Sonnet (parsing + generation layers)
Workflow:
Spec-first build with Claude Design and Claude Code
A walkthrough of what a user sees upon first entering PitCrew.
The Problem
A friend was paying $400/month for an OpenClaw agent that scraped his Granola notes and read a few files for context, work that should cost five dollars. The gap came down to architecture decisions made in the first hour of building and never revisited.
Most builders hit this same gap: provider pricing pages show per-token rates that don't translate to a monthly bill, and by the time real usage tells you the truth, your agent is already deployed.
PitCrew lives in the window before that — pre-deploy, when changing the architecture is still cheap.
My Approach
The wizard's "I don't know" toggle — every gating field has one, so users without firm numbers can still produce a forecast.
The strategic decision that defined the product: pre-deploy forecast, not post-deploy ingestion. Most cost tools require an existing agent and API keys. A developer who reads "you'll spend $400/mo" before writing code can change the architecture. A developer who reads "you spent $400 last month" can mostly just regret it.
The hardest architectural decision: where to draw the AI line. The seductive path was end-to-end LLM: paste a description, the model imagines the agent, the model returns a price. I went the other way. AI lives at the input layer (parsing free-form descriptions into structured variables) and at the output layer (generating IDE-specific build specs from a finished forecast). Everything in between (every multiplication, every rate-card lookup, every cascade step) is deterministic and traceable. When a user shows the report to their CFO and the CFO asks "where does $347 come from," the answer is a worked example with a public rate card as the citation, not "the model said so."
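A minimal sketch of what a deterministic, traceable line item could look like. This is an illustration, not PitCrew's actual engine: the `RATE_CARD` entry, its prices, and every type and function name below are hypothetical placeholders. The point is that each dollar figure carries the formula and the rate source that produced it.

```typescript
// Hypothetical rate card: USD per million tokens. The numbers are
// placeholders for illustration, not any provider's published prices.
interface Rate {
  inputPerMTok: number;
  outputPerMTok: number;
  source: string; // where a reader can verify the rate
}

const RATE_CARD: Record<string, Rate> = {
  "example-model": {
    inputPerMTok: 3,
    outputPerMTok: 15,
    source: "placeholder rate card",
  },
};

interface Usage {
  model: string;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
  callsPerMonth: number;
}

interface LineItem {
  label: string;
  formula: string; // the worked example shown next to the number
  usd: number;
}

function forecast(usage: Usage): { items: LineItem[]; totalUsd: number } {
  const rate = RATE_CARD[usage.model];
  if (!rate) throw new Error(`No rate card entry for ${usage.model}`);

  // Plain arithmetic only: tokens x calls / 1e6 x rate. No model in the loop.
  const line = (label: string, tokensPerCall: number, perMTok: number): LineItem => ({
    label,
    formula:
      `${tokensPerCall} tok x ${usage.callsPerMonth} calls / 1e6 ` +
      `x $${perMTok}/MTok [${rate.source}]`,
    usd: ((tokensPerCall * usage.callsPerMonth) / 1e6) * perMTok,
  });

  const items = [
    line("input", usage.inputTokensPerCall, rate.inputPerMTok),
    line("output", usage.outputTokensPerCall, rate.outputPerMTok),
  ];
  return { items, totalUsd: items.reduce((sum, item) => sum + item.usd, 0) };
}

const monthly = forecast({
  model: "example-model",
  inputTokensPerCall: 4000,
  outputTokensPerCall: 500,
  callsPerMonth: 10000,
});
console.log(monthly.items.map((item) => `${item.label}: $${item.usd}  (${item.formula})`));
console.log(`total: $${monthly.totalUsd}`);
```

With these placeholder inputs the input line is 40 MTok x $3 = $120 and the output line is 5 MTok x $15 = $75, so the total is $195, and each line prints the formula that produced it.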
Wrapped around all of it: closed-form uncertainty propagation. Every input is flagged as inferred, typed, or unknown, and confidence bands tighten or widen based on what the user actually knows. The bands are calibrated against a validation library of published cost disclosures. A separate held-out test set, which the engine has never been tuned against, is checked on every commit to catch overfitting.
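One common closed form, shown here as a sketch, is first-order propagation for a product of independent inputs: the relative uncertainties add in quadrature. Whether PitCrew uses this exact formula is not stated in the writeup. The `REL_SIGMA` values and all names below are assumptions chosen only to show how a band widens as inputs move from typed to inferred to unknown.

```typescript
type InputSource = "typed" | "inferred" | "unknown";

// Hypothetical 1-sigma relative uncertainty assigned to each source flag.
const REL_SIGMA: Record<InputSource, number> = {
  typed: 0.05,
  inferred: 0.25,
  unknown: 0.6,
};

interface ForecastInput {
  label: string;
  value: number;
  source: InputSource;
}

interface Band {
  point: number;    // product of all input values
  relSigma: number; // combined relative uncertainty
  low: number;
  high: number;
}

// For cost = x1 * x2 * ... * xn with independent inputs, the first-order
// relative uncertainty is sqrt(sum of each input's relative sigma squared).
function propagate(inputs: ForecastInput[]): Band {
  const point = inputs.reduce((product, i) => product * i.value, 1);
  const variance = inputs.reduce(
    (sum, i) => sum + REL_SIGMA[i.source] * REL_SIGMA[i.source],
    0,
  );
  const relSigma = Math.sqrt(variance);
  return {
    point,
    relSigma,
    low: Math.max(0, point * (1 - relSigma)), // a cost cannot be negative
    high: point * (1 + relSigma),
  };
}

// Same point estimate, different knowledge about the inputs.
const allTyped = propagate([
  { label: "tokens per call", value: 4500, source: "typed" },
  { label: "calls per month", value: 10000, source: "typed" },
  { label: "USD per token", value: 0.000004, source: "typed" },
]);
const volumeUnknown = propagate([
  { label: "tokens per call", value: 4500, source: "inferred" },
  { label: "calls per month", value: 10000, source: "unknown" },
  { label: "USD per token", value: 0.000004, source: "typed" },
]);
console.log(allTyped, volumeUnknown);
```

Both calls give the same point estimate of about $180/month, but the combined relative uncertainty grows from roughly 9% when everything is typed to roughly 65% when the call volume is unknown.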
The savings cascade and sensitivity panel - every dollar traces to a rate card, and every forecast shows what changes if the user's assumptions were wrong.
PitCrew takes structured input (archetype, system prompt, model, tools, expected volume across text and generation modalities), runs an ensemble of cost-optimization analyzers in parallel, and composes them into a cascade where each step's contribution is shown atop the cumulative running total. A sensitivity grid surfaces how the forecast moves under the inputs the user is least sure about. The output: a forecasted monthly bill, a ranked action plan, confidence bands on every figure, and a diagnostic warning when the configuration falls outside what's been validated.
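The cascade and sensitivity grid described above can be sketched as follows. This assumes each optimization is expressed as a fraction saved off the running total, so that later steps never count dollars an earlier step already removed. The step names come from the skills listed earlier; the fractions, the baseline, and all identifiers are hypothetical.

```typescript
interface CascadeStep {
  label: string;
  savingFraction: number; // hypothetical share of the running total saved
}

interface CascadeRow {
  label: string;
  savedUsd: number;        // this step's own contribution
  runningTotalUsd: number; // cumulative bill after this step
}

// Apply each step to the running total, not the baseline, so savings
// from different steps are not double-counted.
function cascade(baselineUsd: number, steps: CascadeStep[]): CascadeRow[] {
  let running = baselineUsd;
  return steps.map((step) => {
    const savedUsd = running * step.savingFraction;
    running -= savedUsd;
    return { label: step.label, savedUsd, runningTotalUsd: running };
  });
}

function finalTotal(baselineUsd: number, steps: CascadeStep[]): number {
  const rows = cascade(baselineUsd, steps);
  const last = rows[rows.length - 1];
  return last ? last.runningTotalUsd : baselineUsd;
}

// One-at-a-time sensitivity: rerun the cascade with the baseline scaled,
// for example because the real call volume is half or double the guess.
function sensitivity(
  baselineUsd: number,
  steps: CascadeStep[],
  multipliers: number[],
): { multiplier: number; totalUsd: number }[] {
  return multipliers.map((multiplier) => ({
    multiplier,
    totalUsd: finalTotal(baselineUsd * multiplier, steps),
  }));
}

const planSteps: CascadeStep[] = [
  { label: "prompt caching", savingFraction: 0.5 },
  { label: "batch API", savingFraction: 0.25 },
];
const cascadeRows = cascade(400, planSteps);
const volumeGrid = sensitivity(400, planSteps, [0.5, 1, 2]);
console.log(cascadeRows, volumeGrid);
```

With a $400 baseline, the first step saves $200 (running total $200) and the second saves $50 (running total $150). The grid then shows the final bill at half, expected, and double volume: $75, $150, and $300.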
A focused look at the "I don't know" toggle producing a useful forecast despite missing inputs.
What I Learned
A few things I don't think I could have learned from theory:
The Right AI Tool Placement and Usage: The real product-design question is "what should be a model and what should be code?" Theory glosses over it because the easy answer is "use AI for everything."
Trust in AI Output: It collapses the moment a user can't audit where a number came from. "The model said so" isn't an answer — a worked example traceable to public rate cards is. The auditability question scales: the more AI surfaces a product has, the more places that trust can break.
The Same Product, Two Users: One wants the AI to fill the form, the other wants to fill it themselves. Supporting both meant building an uncertainty model that knows where each value came from — the inferred / typed / unknown source distinction. The same pattern scales when adding new AI surfaces: each AI integration carries its own confidence signal and is allowed to fail visibly.
Calibration: It is the difference between an AI feature you can ship and one you can defend. Tracking inferred-vs-final-typed deltas, validating against real cost disclosures the engine has never been tuned against, and surfacing the engine's own uncertainty boundaries: that's the work that turns the AI half from a vibes feature into an accuracy claim. Without it, no matter how thoughtful the architecture is, you're guessing with a percentage sign attached.
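As a rough sketch of the kind of check the calibration point describes, a held-out set can be scored on two numbers: how often the disclosed cost lands inside the forecast band, and how far the point forecast is from it on average. The cases below are made-up placeholders, not real disclosures or PitCrew results, and the names are hypothetical.

```typescript
interface HeldOutCase {
  label: string;
  forecastUsd: number;
  lowUsd: number;
  highUsd: number;
  actualUsd: number; // the published figure the engine was never tuned on
}

// Share of cases whose actual cost falls inside the forecast band.
function bandCoverage(cases: HeldOutCase[]): number {
  if (cases.length === 0) throw new Error("no cases to score");
  const hits = cases.filter(
    (c) => c.actualUsd >= c.lowUsd && c.actualUsd <= c.highUsd,
  ).length;
  return hits / cases.length;
}

// Mean absolute percentage error of the point forecast against the actual.
function meanAbsPctError(cases: HeldOutCase[]): number {
  if (cases.length === 0) throw new Error("no cases to score");
  const total = cases.reduce(
    (sum, c) => sum + Math.abs(c.forecastUsd - c.actualUsd) / c.actualUsd,
    0,
  );
  return total / cases.length;
}

// Placeholder data for illustration only.
const heldOut: HeldOutCase[] = [
  { label: "case A", forecastUsd: 100, lowUsd: 80, highUsd: 120, actualUsd: 110 },
  { label: "case B", forecastUsd: 200, lowUsd: 150, highUsd: 250, actualUsd: 240 },
  { label: "case C", forecastUsd: 50, lowUsd: 40, highUsd: 60, actualUsd: 75 },
  { label: "case D", forecastUsd: 400, lowUsd: 300, highUsd: 500, actualUsd: 320 },
];

const coverageRate = bandCoverage(heldOut);
const pointError = meanAbsPctError(heldOut);
console.log({ coverageRate, pointError });
```

Here three of the four placeholder cases fall inside their bands, so coverage is 0.75, and the mean absolute percentage error is about 21%. A check like this could run in CI and fail the build when coverage on the held-out set drops below the level the bands claim.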
The complete tool, end to end.
