How We Built Polaris: A Production AI Life Assistant
A technical deep-dive into building Polaris, an autonomous AI assistant that manages calendars, tracks finances, enriches data, and learns from every interaction. Lessons from shipping a real agent to production.
Building an AI agent that works in a demo is easy. Building one that works reliably in production, every day, handling real user data — that is a fundamentally different challenge.
Polaris is our production AI life assistant, integrated into Dayze (a life-tracking platform). It manages calendars, tracks expenses, enriches personal data, answers questions with web search, and maintains long-term memory across conversations. Here is how we built it.
The Architecture
Polaris uses a ReAct (Reasoning + Acting) architecture. At its core:
- User sends a message (text or voice)
- The agent reasons about what tools it needs
- It calls tools (calendar API, database queries, web search)
- It synthesizes results into a natural response
- It updates memory for future context
This loop runs for every interaction, and the agent can chain multiple tool calls before responding.
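The loop above can be sketched in a few lines. This is a minimal illustration, not Polaris's actual implementation: `call_model` is a stub standing in for a real LLM call, and the tool registry holds one fake tool.

```python
# Minimal ReAct-style loop. `call_model` is a hypothetical stand-in for an
# LLM API call that either requests a tool or produces a final answer.
def call_model(messages):
    last = messages[-1]["content"]
    # Stub decision logic: ask for the calendar tool once, then answer.
    if "calendar" in last and not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "tool": "calendar_read", "args": {}}
    return {"type": "answer", "content": "You have 2 events today."}

TOOLS = {"calendar_read": lambda args: [{"title": "Standup"}, {"title": "1:1"}]}

def react_loop(user_message, max_steps=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):  # cap chained tool calls to avoid infinite loops
        decision = call_model(messages)
        if decision["type"] == "answer":
            return decision["content"]
        result = TOOLS[decision["tool"]](decision["args"])
        messages.append({"role": "tool", "content": str(result)})
    return "I couldn't complete that request."

print(react_loop("What's on my calendar?"))
```

The `max_steps` cap matters in production: without it, a model that keeps requesting tools can loop forever.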
Why ReAct Over Other Patterns
We evaluated several agent architectures:
- Simple chain: Too rigid. Cannot handle multi-step tasks.
- Plan-and-Execute: Good for complex tasks but adds latency. Users expect fast responses.
- ReAct: Best balance of flexibility and speed. The agent reasons one step at a time, which keeps latency low while handling complex workflows.
For tasks that genuinely need planning (like "organize my week"), Polaris internally breaks them into sub-tasks within the ReAct loop.
The Tool System
Polaris has access to these production tools:
| Tool | What It Does |
|---|---|
| calendar_read | Fetches events from Google Calendar |
| calendar_write | Creates, updates, and deletes calendar events |
| expense_track | Logs and categorizes expenses |
| web_search | Searches the web for real-time information |
| memory_store | Saves facts to long-term memory |
| memory_recall | Retrieves relevant memories |
| people_lookup | Queries the notable people database |
| dayze_query | Reads user life data (days lived, milestones) |
Each tool has a strict schema that the LLM must follow. We use structured output (JSON mode) for tool calls to ensure reliability.
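A tool schema in the JSON-Schema style used by most function-calling APIs looks something like this. The field names and validator are illustrative assumptions, not Polaris's exact schema:

```python
# Hypothetical schema for calendar_write, in the shape most
# function-calling APIs expect (names here are illustrative).
CALENDAR_WRITE_SCHEMA = {
    "name": "calendar_write",
    "description": "Create, update, or delete a calendar event.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["create", "update", "delete"]},
            "title": {"type": "string"},
            "start": {"type": "string", "description": "ISO 8601 datetime"},
            "end": {"type": "string", "description": "ISO 8601 datetime"},
        },
        "required": ["action", "title"],
    },
}

def validate_call(schema, args):
    """Reject a model-generated tool call whose arguments break the schema."""
    props = schema["parameters"]["properties"]
    for key in schema["parameters"]["required"]:
        if key not in args:
            return False, f"missing required field: {key}"
    for key, value in args.items():
        if key not in props:
            return False, f"unknown field: {key}"
        if "enum" in props[key] and value not in props[key]["enum"]:
            return False, f"invalid value for {key}: {value}"
    return True, "ok"

print(validate_call(CALENDAR_WRITE_SCHEMA, {"action": "create", "title": "Dentist"}))
```

Validating before execution means a malformed call produces a clear error the model can retry from, rather than a cryptic API failure.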
Tool Reliability
The biggest lesson: tools fail. APIs time out. Databases return unexpected shapes. Rate limits hit.
Every tool in Polaris has:
- Timeout handling (3-second default, 10-second for web search)
- Retry logic with exponential backoff
- Graceful degradation (if calendar fails, tell the user instead of crashing)
- Input validation before the API call
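The retry-with-backoff layer can be sketched as a small wrapper. This is a simplified version under the assumption that per-call timeouts are enforced by the HTTP client itself:

```python
import random
import time

def call_with_retries(fn, retries=3, base_delay=0.5):
    """Retry a tool call with exponential backoff plus jitter (sketch).
    Per-call timeouts are assumed to be enforced by the HTTP client."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # surface the failure so the caller can degrade gracefully
            # Backoff: base_delay, 2x, 4x, ... with jitter to avoid retry storms.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
```

Jitter is the easy detail to forget: without it, many clients retrying on the same schedule hammer a recovering API in lockstep.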
Memory: The Hardest Problem
Stateless chatbots are simple. An agent that remembers your preferences, past conversations, and personal context is not.
Polaris uses a hybrid memory system:
Short-Term Memory
The current conversation context, including all tool calls and results. This is passed to the LLM on every turn. We use a sliding window with summarization when the context gets too long.
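The sliding window with summarization can be sketched as follows, with the summarizer stubbed out (in practice it would be an LLM call):

```python
def trim_context(messages, max_messages=20, summarize=None):
    """Keep the most recent turns and fold older ones into a single
    summary message. `summarize` stands in for an LLM summarization call."""
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-max_messages], messages[-max_messages:]
    summarizer = summarize or (lambda ms: f"[summary of {len(ms)} earlier messages]")
    return [{"role": "system", "content": summarizer(old)}] + recent
```

The window size (20 here) is an arbitrary placeholder; a real cutoff would be token-based rather than message-based.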
Long-Term Memory
Stored in Supabase as vector embeddings. When a user says something worth remembering ("I am allergic to shellfish", "My anniversary is March 15"), Polaris stores it with an embedding.
On each new conversation, we retrieve the top-K most relevant memories using cosine similarity. This gives the agent persistent context without bloating the prompt.
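Top-K retrieval by cosine similarity looks like this in miniature. In production the ranking happens inside the vector store (e.g. a pgvector query in Supabase), not in application code, and embeddings have hundreds of dimensions rather than two:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_memories(query_vec, memories, k=3):
    """Return the texts of the top-k memories ranked by similarity."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["embedding"]),
                    reverse=True)
    return [m["text"] for m in ranked[:k]]
```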
The Cold Start Problem
New users have no memories. Polaris handles this by:
- Asking key onboarding questions naturally
- Extracting facts from early conversations
- Pre-loading public data (if the user links social profiles)
Streaming and UX
Users expect instant responses. But agent reasoning takes time — especially when calling tools.
We use Server-Sent Events (SSE) to stream:
- Thinking indicators while the agent reasons
- Tool status ("Searching the web...", "Checking your calendar...")
- Partial responses as the final answer generates
This keeps the UI responsive even when the agent takes 5-10 seconds to complete a complex task.
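The wire format is simple: each SSE frame is an event name plus a JSON payload, separated by a blank line. A sketch of one agent turn as a stream (the event names here are illustrative, not Polaris's actual protocol):

```python
import json

def sse_event(event_type, data):
    """Format one Server-Sent Events frame: event name + JSON data."""
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"

def stream_agent_turn():
    """Hypothetical event sequence for one turn; a web framework would
    flush each frame to the client as soon as it is yielded."""
    yield sse_event("thinking", {"status": "reasoning"})
    yield sse_event("tool", {"name": "web_search", "status": "Searching the web..."})
    yield sse_event("token", {"text": "Here is "})
    yield sse_event("token", {"text": "what I found."})
    yield sse_event("done", {})

for frame in stream_agent_turn():
    print(frame, end="")
```

Because SSE is plain HTTP, it works through most proxies and needs no WebSocket infrastructure, which is why it is a common choice for one-directional agent streams.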
Evaluation and Monitoring
You cannot improve what you do not measure.
Offline Evals
We maintain a test suite of 200+ conversation scenarios covering:
- Tool selection accuracy
- Response quality
- Edge cases (ambiguous queries, tool failures)
- Safety (refusing harmful requests)
Production Monitoring
- Latency P50/P95 for each tool and overall response time
- Tool success rates per tool per day
- User satisfaction (thumbs up/down on responses)
- Error classification (model error vs tool error vs infrastructure)
Lessons Learned
1. Start with One Tool, Get It Perfect
We launched with just calendar management. Once that was bulletproof, we added expense tracking, then search, then memory. Each tool compounds complexity.
2. Prompt Engineering Is Not Enough
The real work is in the system around the model: tool schemas, error handling, memory management, evaluation. The prompt is maybe 10% of the effort.
3. Users Do Not Care About AI
They care about outcomes. "Did my event get created?" matters more than "Was the reasoning chain optimal?" Optimize for task completion, not impressiveness.
4. Cost Optimization Is Ongoing
We reduced costs 60% by:
- Caching common tool results
- Using smaller models for simple classification tasks
- Batching memory retrievals
- Truncating conversation history intelligently
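The first of those, caching tool results, can be as simple as a TTL cache keyed on the tool name and arguments. This is a sketch (production would use Redis or similar), and it only applies to idempotent reads like `web_search` or `people_lookup`, never to writes:

```python
import time

class ToolCache:
    """Tiny TTL cache for idempotent tool results (illustrative sketch)."""

    def __init__(self, ttl=300.0):
        self.ttl = ttl       # seconds a cached result stays fresh
        self._store = {}     # key -> (stored_at, value)

    def get_or_call(self, key, fn, now=None):
        """Return the cached value for `key`, or call `fn` and cache it."""
        now = time.monotonic() if now is None else now
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = fn()
        self._store[key] = (now, value)
        return value
```

The `now` parameter is there so expiry can be tested without sleeping; a real cache would also bound its size with an eviction policy.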
What Is Next
We are working on:
- Multi-agent collaboration (specialized agents for finance, health, productivity that coordinate)
- Proactive agents that act without being asked ("You have a meeting in 30 minutes, traffic is heavy — leave now")
- Voice-first interaction with real-time speech-to-speech models
Want to build something like this for your business? Phi Intelligence helps companies design and deploy production AI agents. Book a discovery call to discuss your use case.