How We Built Polaris: A Production AI Life Assistant
A technical deep-dive into building Polaris, an autonomous AI assistant that manages calendars, tracks finances, enriches data, and learns from every interaction. Lessons from shipping a real agent to production.
Building an AI agent that works in a demo is easy. Building one that works reliably in production, every day, handling real user data — that is a fundamentally different challenge.
Polaris is our production AI life assistant, integrated into Dayze (a life-tracking platform). It manages calendars, tracks expenses, enriches personal data, answers questions with web search, and maintains long-term memory across conversations. Here is how we built it.
The Architecture
Polaris uses a ReAct (Reasoning + Acting) architecture. At its core:
- User sends a message (text or voice)
- The agent reasons about what tools it needs
- It calls tools (calendar API, database queries, web search)
- It synthesizes results into a natural response
- It updates memory for future context
This loop runs for every interaction, and the agent can chain multiple tool calls before responding.
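The loop above can be sketched in a few lines. This is a minimal illustration, not Polaris's actual implementation: `call_model` is a stub standing in for a real LLM call, and the tool registry holds one fake tool.

```python
# Minimal ReAct-style loop. `call_model` is a hypothetical stand-in for an
# LLM API call that either requests a tool or produces a final answer.
def call_model(messages):
    last = messages[-1]["content"]
    # Stub decision logic: ask for the calendar tool once, then answer.
    if "calendar" in last and not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "tool": "calendar_read", "args": {}}
    return {"type": "answer", "content": "You have 2 events today."}

TOOLS = {"calendar_read": lambda args: [{"title": "Standup"}, {"title": "1:1"}]}

def react_loop(user_message, max_steps=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):  # cap chained tool calls to avoid infinite loops
        decision = call_model(messages)
        if decision["type"] == "answer":
            return decision["content"]
        result = TOOLS[decision["tool"]](decision["args"])
        messages.append({"role": "tool", "content": str(result)})
    return "I couldn't complete that request."

print(react_loop("What's on my calendar?"))
```

The `max_steps` cap matters in production: without it, a model that keeps requesting tools can loop forever.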
Why ReAct Over Other Patterns
We evaluated several agent architectures:
- Simple chain: Too rigid. Cannot handle multi-step tasks.
- Plan-and-Execute: Good for complex tasks but adds latency. Users expect fast responses.
- ReAct: Best balance of flexibility and speed. The agent reasons one step at a time, which keeps latency low while handling complex workflows.
For tasks that genuinely need planning (like "organize my week"), Polaris internally breaks them into sub-tasks within the ReAct loop.
The Tool System
Polaris has access to these production tools:
| Tool | What It Does |
|---|---|
| calendar_read | Fetches events from Google Calendar |
| calendar_write | Creates, updates, and deletes calendar events |
| expense_track | Logs and categorizes expenses |
| web_search | Searches the web for real-time information |
| memory_store | Saves facts to long-term memory |
| memory_recall | Retrieves relevant memories |
| people_lookup | Queries the notable people database |
| dayze_query | Reads user life data (days lived, milestones) |
Each tool has a strict schema that the LLM must follow. We use structured output (JSON mode) for tool calls to ensure reliability.
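A tool schema in the JSON-Schema style used by most function-calling APIs looks something like this. The field names and validator are illustrative assumptions, not Polaris's exact schema:

```python
# Hypothetical schema for calendar_write, in the shape most
# function-calling APIs expect (names here are illustrative).
CALENDAR_WRITE_SCHEMA = {
    "name": "calendar_write",
    "description": "Create, update, or delete a calendar event.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["create", "update", "delete"]},
            "title": {"type": "string"},
            "start": {"type": "string", "description": "ISO 8601 datetime"},
            "end": {"type": "string", "description": "ISO 8601 datetime"},
        },
        "required": ["action", "title"],
    },
}

def validate_call(schema, args):
    """Reject a model-generated tool call whose arguments break the schema."""
    props = schema["parameters"]["properties"]
    for key in schema["parameters"]["required"]:
        if key not in args:
            return False, f"missing required field: {key}"
    for key, value in args.items():
        if key not in props:
            return False, f"unknown field: {key}"
        if "enum" in props[key] and value not in props[key]["enum"]:
            return False, f"invalid value for {key}: {value}"
    return True, "ok"

print(validate_call(CALENDAR_WRITE_SCHEMA, {"action": "create", "title": "Dentist"}))
```

Validating before execution means a malformed call produces a clear error the model can retry from, rather than a cryptic API failure.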
Tool Reliability
The biggest lesson: tools fail. APIs time out. Databases return unexpected shapes. Rate limits hit.
Every tool in Polaris has:
- Timeout handling (3-second default, 10-second for web search)
- Retry logic with exponential backoff
- Graceful degradation (if calendar fails, tell the user instead of crashing)
- Input validation before the API call
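The retry-with-backoff layer can be sketched as a small wrapper. This is a simplified version under the assumption that per-call timeouts are enforced by the HTTP client itself:

```python
import random
import time

def call_with_retries(fn, retries=3, base_delay=0.5):
    """Retry a tool call with exponential backoff plus jitter (sketch).
    Per-call timeouts are assumed to be enforced by the HTTP client."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # surface the failure so the caller can degrade gracefully
            # Backoff: base_delay, 2x, 4x, ... with jitter to avoid retry storms.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
```

Jitter is the easy detail to forget: without it, many clients retrying on the same schedule hammer a recovering API in lockstep.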
Memory: The Hardest Problem
Stateless chatbots are simple. An agent that remembers your preferences, past conversations, and personal context is not.
Polaris uses a hybrid memory system:
Short-Term Memory
The current conversation context, including all tool calls and results. This is passed to the LLM on every turn. We use a sliding window with summarization when the context gets too long.
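The sliding window with summarization can be sketched as follows, with the summarizer stubbed out (in practice it would be an LLM call):

```python
def trim_context(messages, max_messages=20, summarize=None):
    """Keep the most recent turns and fold older ones into a single
    summary message. `summarize` stands in for an LLM summarization call."""
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-max_messages], messages[-max_messages:]
    summarizer = summarize or (lambda ms: f"[summary of {len(ms)} earlier messages]")
    return [{"role": "system", "content": summarizer(old)}] + recent
```

The window size (20 here) is an arbitrary placeholder; a real cutoff would be token-based rather than message-based.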
Long-Term Memory
Stored in Supabase as vector embeddings. When a user says something worth remembering ("I am allergic to shellfish", "My anniversary is March 15"), Polaris stores it with an embedding.
On each new conversation, we retrieve the top-K most relevant memories using cosine similarity. This gives the agent persistent context without bloating the prompt.
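Top-K retrieval by cosine similarity looks like this in miniature. In production the ranking happens inside the vector store (e.g. a pgvector query in Supabase), not in application code, and embeddings have hundreds of dimensions rather than two:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_memories(query_vec, memories, k=3):
    """Return the texts of the top-k memories ranked by similarity."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["embedding"]),
                    reverse=True)
    return [m["text"] for m in ranked[:k]]
```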
The Cold Start Problem
New users have no memories. Polaris handles this by:
- Asking key onboarding questions naturally
- Extracting facts from early conversations
- Pre-loading public data (if the user links social profiles)
Streaming and UX
Users expect instant responses. But agent reasoning takes time — especially when calling tools.
We use Server-Sent Events (SSE) to stream:
- Thinking indicators while the agent reasons
- Tool status ("Searching the web...", "Checking your calendar...")
- Partial responses as the final answer generates
This keeps the UI responsive even when the agent takes 5-10 seconds to complete a complex task.
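The wire format is simple: each SSE frame is an event name plus a JSON payload, separated by a blank line. A sketch of one agent turn as a stream (the event names here are illustrative, not Polaris's actual protocol):

```python
import json

def sse_event(event_type, data):
    """Format one Server-Sent Events frame: event name + JSON data."""
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"

def stream_agent_turn():
    """Hypothetical event sequence for one turn; a web framework would
    flush each frame to the client as soon as it is yielded."""
    yield sse_event("thinking", {"status": "reasoning"})
    yield sse_event("tool", {"name": "web_search", "status": "Searching the web..."})
    yield sse_event("token", {"text": "Here is "})
    yield sse_event("token", {"text": "what I found."})
    yield sse_event("done", {})

for frame in stream_agent_turn():
    print(frame, end="")
```

Because SSE is plain HTTP, it works through most proxies and needs no WebSocket infrastructure, which is why it is a common choice for one-directional agent streams.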
Evaluation and Monitoring
You cannot improve what you do not measure.
Offline Evals
We maintain a test suite of 200+ conversation scenarios covering:
- Tool selection accuracy
- Response quality
- Edge cases (ambiguous queries, tool failures)
- Safety (refusing harmful requests)
Production Monitoring
- Latency P50/P95 for each tool and overall response time
- Tool success rates per tool per day
- User satisfaction (thumbs up/down on responses)
- Error classification (model error vs tool error vs infrastructure)
Lessons Learned
1. Start with One Tool, Get It Perfect
We launched with just calendar management. Once that was bulletproof, we added expense tracking, then search, then memory. Each tool compounds complexity.
2. Prompt Engineering Is Not Enough
The real work is in the system around the model: tool schemas, error handling, memory management, evaluation. The prompt is maybe 10% of the effort.
3. Users Do Not Care About AI
They care about outcomes. "Did my event get created?" matters more than "Was the reasoning chain optimal?" Optimize for task completion, not impressiveness.
4. Cost Optimization Is Ongoing
We reduced costs 60% by:
- Caching common tool results
- Using smaller models for simple classification tasks
- Batching memory retrievals
- Truncating conversation history intelligently
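The first of those, caching tool results, can be as simple as a TTL cache keyed on the tool name and arguments. This is a sketch (production would use Redis or similar), and it only applies to idempotent reads like `web_search` or `people_lookup`, never to writes:

```python
import time

class ToolCache:
    """Tiny TTL cache for idempotent tool results (illustrative sketch)."""

    def __init__(self, ttl=300.0):
        self.ttl = ttl       # seconds a cached result stays fresh
        self._store = {}     # key -> (stored_at, value)

    def get_or_call(self, key, fn, now=None):
        """Return the cached value for `key`, or call `fn` and cache it."""
        now = time.monotonic() if now is None else now
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = fn()
        self._store[key] = (now, value)
        return value
```

The `now` parameter is there so expiry can be tested without sleeping; a real cache would also bound its size with an eviction policy.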
What Is Next
We are working on:
- Multi-agent collaboration (specialized agents for finance, health, productivity that coordinate)
- Proactive agents that act without being asked ("You have a meeting in 30 minutes, traffic is heavy — leave now")
- Voice-first interaction with real-time speech-to-speech models
Want to build something like this for your business? Phi Intelligence helps companies design and deploy production AI agents. Book a discovery call to discuss your use case.