Onagi
A multimodal autonomous reasoning agent
- Role: AI engineering + product
- Timeline: Gemini 3 Hackathon build
Real-time multimodal RAG over live screen content
Native function calling that triggers DB writes and API calls
Sub-millisecond reasoning latency via context caching
The problem
Most 'AI assistants' can describe what they'd do but can't actually do it, and the ones that act tend to act blindly. The brief for Onagi was an agent that genuinely understands what's on screen, decides what the user wants, and takes the right action, without losing the plot on large, fast-moving context.
Constraints
- Hackathon timeline: a working multimodal pipeline, not a slide deck
- Vision + retrieval + action had to run fast enough to feel live
- Actions touch real data, so intent detection had to be reliable
The approach
See, then reason
A Playwright + Gemini 3 Vision pipeline captures live screen content and feeds it into a retrieval-augmented reasoning loop, so the agent reasons over what is actually there rather than a stale snapshot.
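A minimal sketch of the capture-then-retrieve loop described above, not Onagi's actual code. It assumes Playwright's sync API for the capture step and stands in a toy word-overlap retriever for the real retrieval layer. `ScreenCapture`, `capture_screen`, `retrieve`, `reason_over_screen` and the `model` callable are hypothetical names; the vision model call is injected rather than shown, since the source does not specify how Gemini is invoked.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScreenCapture:
    png: bytes  # raw screenshot bytes for the vision model
    text: str   # visible page text, used for retrieval


def capture_screen(url: str) -> ScreenCapture:
    """Grab a fresh screenshot and the visible text with Playwright."""
    # Imported lazily so the rest of the sketch runs without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        capture = ScreenCapture(png=page.screenshot(), text=page.inner_text("body"))
        browser.close()
    return capture


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Toy retriever (an assumption): rank chunks by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(c.lower().split())), c) for c in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked[:k] if score > 0]


def reason_over_screen(
    request: str,
    capture: ScreenCapture,
    model: Callable[[str, bytes], str],
) -> str:
    """Ground the request in what is on screen right now, then ask the model."""
    chunks = [line.strip() for line in capture.text.splitlines() if line.strip()]
    context = retrieve(request, chunks)
    prompt = "Screen context:\n" + "\n".join(context) + "\n\nRequest: " + request
    return model(prompt, capture.png)
```

Calling `capture_screen` immediately before each `reason_over_screen` call is what keeps the agent off a stale snapshot; a production retriever would likely use embeddings instead of word overlap.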
Intent to action via function calling
Native function calling maps the user's intent to concrete tools (database writes and API integrations), so the agent doesn't just answer; it executes the right operation.
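The intent-to-action step can be sketched as a small tool registry plus a dispatcher. This is an illustration under assumptions, not Onagi's implementation: the `{"name": ..., "args": ...}` call shape, the `create_task` tool, the `tasks` table and the in-memory SQLite database are all hypothetical stand-ins for whatever the model and backend actually use.

```python
import inspect
import sqlite3
from typing import Any, Callable, Dict

# Hypothetical registry: tool name -> Python callable.
TOOLS: Dict[str, Callable[..., Any]] = {}

# In-memory database standing in for the real datastore (an assumption).
DB = sqlite3.connect(":memory:")
DB.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the dispatcher can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def create_task(title: str) -> dict:
    """Example database write triggered by a function call."""
    cursor = DB.execute("INSERT INTO tasks (title) VALUES (?)", (title,))
    DB.commit()
    return {"id": cursor.lastrowid, "title": title}


def dispatch(call: dict) -> dict:
    """Execute a model-emitted call, refusing unknown tools and bad arguments."""
    name = call.get("name")
    args = call.get("args", {})
    fn = TOOLS.get(name)
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        # Check the arguments against the signature before touching real data.
        inspect.signature(fn).bind(**args)
    except TypeError as exc:
        return {"ok": False, "error": f"bad arguments: {exc}"}
    return {"ok": True, "result": fn(**args)}
```

For example, `dispatch({"name": "create_task", "args": {"title": "Pay invoice 42"}})` inserts a row and returns it, while a call naming an unregistered tool comes back as an error instead of raising. Rejecting bad calls before execution matters here because the actions touch real data.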
Keep huge context cheap
Context caching holds massive datasets in working memory, bringing reasoning latency down to sub-millisecond on repeat queries instead of re-reading everything each turn.
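The idea behind reusing a large context can be sketched locally. This is a stand-in under assumptions, not the provider's caching API: `ContextCache` and the injected `upload` callable are hypothetical, and the handle it returns represents whatever reference a provider-side cache would hand back. The sketch only shows that an identical context is sent once and then referenced by handle; it makes no claim about latency.

```python
import hashlib
from typing import Callable, Dict


class ContextCache:
    """Upload each distinct context once, then reuse its handle on later turns."""

    def __init__(self, upload: Callable[[str], str]) -> None:
        self._upload = upload                # hypothetical: sends context, returns a handle
        self._handles: Dict[str, str] = {}   # content hash -> handle
        self.uploads = 0                     # how many times the context was actually sent

    def handle_for(self, context: str) -> str:
        # Key on a content hash so identical contexts share one cache entry.
        key = hashlib.sha256(context.encode("utf-8")).hexdigest()
        if key not in self._handles:
            self._handles[key] = self._upload(context)
            self.uploads += 1
        return self._handles[key]
```

Each turn then passes the short handle alongside the new question, so only a changed dataset triggers another upload.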
Results
- A demoable agent that turns a screen plus a request into a grounded action
- Showed that multimodal RAG + function calling can run at interactive speed
- A reusable pattern for vision-grounded agents, not a one-off script
Tradeoffs
Built under hackathon pressure, Onagi prioritized a convincing end-to-end loop over breadth of tools and hardened guardrails. The architecture leaves clear seams to add permissioning and a wider tool registry before anything like production use.