The Problem Every AI Chat App Faces
You're deep in conversation with your AI companion—maybe you've been discussing a complex project, sharing personal stories, or building on ideas over dozens of messages. The conversation is flowing naturally, building on everything you've said before. Then suddenly, the AI seems to forget earlier parts of your discussion. It asks questions you already answered. It loses track of important details you shared 50 messages ago.
This isn't the AI being forgetful—it's hitting a hard technical wall: the context window limit.
Every AI model has a maximum number of tokens (roughly words) it can process at once. For many models, that's 8,000-32,000 tokens. A typical conversation message might be 50-200 tokens. Do the math: even with a generous 32K context window, you're looking at 160-640 messages before you run out of space. And that's before accounting for system prompts, character personalities, and other necessary context.
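To make the arithmetic concrete, here's the same capacity estimate in a few lines of TypeScript (the overhead figure is an illustrative assumption, not a measured value):

```typescript
// Rough capacity math for a 32K context window (illustrative numbers only).
const contextWindow = 32_000;     // tokens the model can see at once
const promptOverhead = 2_000;     // system prompt, character card, etc. (assumed)
const tokensPerMessage = { short: 50, long: 200 }; // typical range from above

const usable = contextWindow - promptOverhead;
console.log(
  `${Math.floor(usable / tokensPerMessage.long)}-` +
  `${Math.floor(usable / tokensPerMessage.short)} messages`,
); // "150-600 messages" -- and that's with overhead already set aside
```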
For chat applications aiming to provide truly continuous, long-term conversations, this is a fundamental problem. You can't just keep feeding the entire conversation history to the AI—you'll hit token limits, costs will spiral, and response times will crawl.
The naive solutions don't work:
- Truncating old messages creates amnesia—the AI suddenly forgets everything
- Keeping only recent messages loses long-term context and conversation continuity
- Using massive context windows becomes prohibitively expensive and slow
We needed something better. Something that mimics how human memory actually works.
How Human Memory Inspired Our Solution
Think about how you remember a conversation from last week. You don't recall every single word—that would be overwhelming and unnecessary. Instead, you remember:
- Recent details vividly (what was said in the last few minutes)
- Key points from earlier (important topics, decisions, revelations)
- General themes from way back (the overall arc and context of your relationship)
Your memory naturally compresses older information while keeping recent experiences sharp. The further back you go, the more compressed and summarized the memories become. You remember that you discussed project planning last Tuesday, but not the exact words—just the important outcomes.
This natural compression is exactly what we built into Haroo Chat.
Introducing 3-Tier Memory Summarization
Our system maintains three distinct tiers of conversation memory, each with different levels of detail and compression. Together, they create the illusion of infinite memory within finite token budgets.
Tier 1: Recent Messages (24,000-30,000 tokens)
This is your working memory—the most recent part of the conversation in full, unsummarized detail. Every message, every word, every nuance is preserved exactly as it was written.
- Full fidelity: Complete messages with all context
- Dynamic size: Grows from 24K to 30K tokens before summarization triggers
- Sent as-is: These messages go to the AI in their original form
When you're chatting, the last 150-300 messages (depending on length) are always available in perfect detail. This ensures the AI responds naturally to your immediate context.
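For reference, here are the thresholds used throughout this post, gathered into one hypothetical config object. The field names are illustrative, not our actual schema:

```typescript
// The numbers from this post, collected in one place.
// Field names are illustrative, not Haroo Chat's actual schema.
const memoryConfig = {
  tier1TriggerTokens: 30_000, // summarization kicks in past this point
  tier1FloorTokens: 24_000,   // Tier 1 shrinks back to roughly this size
  batchTokens: 2_000,         // oldest slice compressed per step (~10-20 messages)
  summaryTokens: 200,         // target size of each Tier 2 summary
  tier2MaxSummaries: 100,     // ~20,000 tokens before meta-summarization (see Tier 3)
  tier3Tokens: 1_000,         // target size of the meta-summary
} as const;
```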
Tier 2: Mid-Term Summaries (Up to 20,000 tokens)
As your conversation grows beyond 30,000 tokens in Tier 1, we don't discard the old messages—we compress them intelligently.
Here's how it works:
- When Tier 1 reaches 30,000 tokens, we take the oldest ~2,000 tokens (roughly 10-20 messages)
- An AI summarization process distills those messages into a concise 200-token summary
- This summary captures the key points, topics, and context from that batch
- The original messages are archived, and the summary moves to Tier 2
- Batches are compressed one after another until Tier 1 returns to ~24,000 tokens, ready to grow again
Compression ratio: 10:1 (2,000 tokens → 200 tokens)
Tier 2 can hold up to 100 active summaries (~20,000 tokens total) before it needs to compress further. Each summary represents a "chapter" of your conversation—not word-for-word, but preserving what mattered.
Think of this like meeting notes: you don't transcribe every word, but you capture decisions, key points, and action items.
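Here's a minimal sketch of the rollover described above. The data shapes and the `summarizeBatch` helper (which stands in for the model call) are assumptions for illustration, not our production code:

```typescript
// Minimal sketch of the Tier 1 → Tier 2 rollover, assuming hypothetical
// helpers and data shapes. `summarizeBatch` stands in for the model call.
interface Message { id: string; text: string; tokens: number }
interface Summary { text: string; tokens: number; covers: string[] }

declare function summarizeBatch(batch: Message[]): Promise<string>;

const totalTokens = (items: { tokens: number }[]) =>
  items.reduce((sum, item) => sum + item.tokens, 0);

async function maybeRollOver(tier1: Message[], tier2: Summary[]): Promise<void> {
  if (totalTokens(tier1) < 30_000) return; // trigger condition not met yet

  // Compress 2,000-token batches until Tier 1 is back to ~24,000 tokens.
  while (totalTokens(tier1) > 24_000) {
    const batch: Message[] = [];
    while (tier1.length > 0 && totalTokens(batch) < 2_000) {
      batch.push(tier1.shift()!); // pull the oldest messages first
    }
    const text = await summarizeBatch(batch); // ~200-token distillation
    tier2.push({ text, tokens: 200, covers: batch.map((m) => m.id) });
    // In the real system the originals are archived here, not deleted.
  }
}
```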
Tier 3: Meta-Summary (~1,000 tokens)
When Tier 2 accumulates 100 or more summaries (meaning your conversation is getting quite long), we perform a second level of compression:
- The oldest Tier 2 summaries are combined
- A meta-summarization process creates a single, high-level overview
- This meta-summary captures the broad themes and long-term context
- It replaces the older summaries, freeing space in Tier 2
Tier 3 is your long-term memory—a bird's-eye view of where the conversation has been. It might not remember the exact joke you told 200 messages ago, but it knows the overall arc of your relationship, major topics you've explored, and persistent themes.
This is like remembering that "we've been working on the project together for months" without recalling every single meeting.
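A sketch of this second-level compaction follows, reusing the types from the rollover sketch. How many summaries to fold at once isn't specified above, so folding the oldest half is an assumption, and `summarizeSummaries` stands in for another model call:

```typescript
// Sketch of the Tier 2 → Tier 3 compaction, reusing the Summary type from
// the rollover sketch. `summarizeSummaries` stands in for another model call.
interface MetaSummary { text: string; tokens: number }

declare function summarizeSummaries(texts: string[]): Promise<string>;

async function maybeCompact(tier2: Summary[], tier3: MetaSummary): Promise<void> {
  if (tier2.length < 100) return; // Tier 2 still fits its ~20K budget

  // Fold the oldest half of Tier 2, plus the current meta-summary, into a
  // fresh overview. (Folding half is an assumption; the post doesn't say.)
  const oldest = tier2.splice(0, Math.floor(tier2.length / 2));
  const inputs = [tier3.text, ...oldest.map((s) => s.text)].filter((t) => t.length > 0);
  tier3.text = await summarizeSummaries(inputs);
  tier3.tokens = 1_000; // target size; real code would measure the result
}
```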
How Context is Assembled for the AI
When you send a new message, here's the exact order in which context is assembled and sent to the AI:
1. System Prompt (character personality, instructions)
2. Tier 3 Summary (oldest compressed history)
3. Tier 2 Summaries (mid-term context, newest first)
4. Tier 1 Messages (recent conversation, oldest to newest)
5. Your new message
This ordering is intentional: it mirrors how you might brief someone on a conversation. Start with the big picture, add more detail as you get closer to the present, then dive into the immediate context.
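In code, the assembly might look like this. The ordering is the one described above; the role mapping for summaries is an assumption for illustration:

```typescript
// How the layers might be stitched into a single request, in the order
// described above. The role mapping for summaries is an assumption.
interface ChatTurn { role: 'system' | 'user' | 'assistant'; content: string }

function assembleContext(opts: {
  systemPrompt: string;  // character personality, instructions
  tier3: string;         // meta-summary: the big picture
  tier2: string[];       // mid-term summaries, newest first
  tier1: ChatTurn[];     // recent messages, oldest to newest
  newMessage: string;
}): ChatTurn[] {
  return [
    { role: 'system', content: opts.systemPrompt },
    { role: 'system', content: `Long-term memory:\n${opts.tier3}` },
    ...opts.tier2.map((s): ChatTurn => ({ role: 'system', content: `Earlier: ${s}` })),
    ...opts.tier1,
    { role: 'user', content: opts.newMessage },
  ];
}
```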
Total token budget per request: ~45,000-53,000 tokens
- Tier 3: ~1,000 tokens
- Tier 2: ~20,000 tokens (100 active summaries × 200 tokens each)
- Tier 1: 24,000-30,000 tokens
- System prompt: ~500-1,000 tokens
- New message: ~100-500 tokens
This fits comfortably within the context windows of today's large-context models while providing rich, layered context.
A Real-World Example
Let's walk through a conversation lifecycle to see this in action:
Messages 1-30
You start chatting with an AI character about planning a trip to Japan. All messages are in Tier 1, full detail preserved.
Messages 31-150
You dive into specific cities, food recommendations, cultural tips. At message 150, Tier 1 reaches 30,000 tokens. The system automatically summarizes messages 1-20 into a 200-token Tier 2 summary:
"User discussed trip to Japan, interested in Tokyo and Kyoto, asked about cherry blossom season, vegetarian food options, and traditional accommodations."
Messages 21-150 remain in Tier 1 with full detail.
Messages 151-300
You've now moved on to discussing budget, booking flights, and creating an itinerary. Each time Tier 1 hits 30K tokens, another batch summarizes into Tier 2. You now have 3-4 Tier 2 summaries, each capturing different phases of the planning conversation.
Messages 301-1000+
The conversation evolves—you're now sharing your experiences as you travel, uploading photos (in your imagination), discussing what you've seen. After many conversations, Tier 2 accumulates 100 summaries. The oldest summaries (covering messages 1-300+) are meta-summarized into Tier 3:
"User has been planning and discussing a trip to Japan, covering destinations (Tokyo, Kyoto), cultural interests, food preferences (vegetarian), accommodation (traditional ryokans), and budget considerations for a spring visit during cherry blossom season."
Message 1000+
You continue chatting for hundreds more messages. The AI still "remembers" that you're interested in Japan, that you were planning a spring trip, and that you're vegetarian—even though those specific messages are long gone from Tier 1. Meanwhile, your recent detailed experiences in Tokyo are right there in full detail in Tier 1.
The conversation feels continuous and coherent, even though you've long exceeded what would fit in a standard context window.
The Benefits: Why This Matters
1. Truly Unlimited Conversations
You're not constrained by arbitrary message limits. Chat for hundreds or thousands of messages—the system adapts automatically.
2. Natural Memory Degradation
Like human memory, older details fade while recent context stays sharp. This actually makes conversations feel more natural, not less. The AI doesn't have perfect recall of something from 500 messages ago—it has the gist, which is often more appropriate.
3. Cost Efficiency
By compressing older context, we drastically reduce token usage. Sending 10,000 tokens of old messages in full detail would cost 10x more than sending 1,000 tokens of compressed summaries. This keeps the service affordable while maintaining quality.
4. Faster Responses
Smaller context windows mean faster processing. The AI doesn't wade through hundreds of old messages—it gets a curated, relevant context digest.
5. Zero User Intervention
You don't manage this. You don't click "summarize." You don't clear history. It happens automatically, invisibly, as you chat. The experience is seamless.
6. Maintained Coherence
Crucially, the AI maintains a sense of continuity. It doesn't suddenly "forget" who you are or what you've discussed. The summaries preserve key information, relationships, and ongoing threads.
The Technical Implementation
For developers curious about the mechanics:
- Trigger condition: Automatic summarization fires when `unsummarized_tokens >= 30000` (see the sketch after this list)
- Summarization: Edge functions handle async summarization using Claude or similar models
- Storage: PostgreSQL stores messages, summaries, and metadata
- Assembly: Server-side context assembly before each AI request
- Optimization: Summaries are cached and only regenerated when necessary
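Tying the earlier sketches together, a simplified per-request flow might look like the following. `loadTiers` and `persist` stand in for the PostgreSQL reads and writes, and the summarization runs inline here for clarity, whereas in production it's handled asynchronously by edge functions:

```typescript
// Simplified per-request flow, tying the earlier sketches together.
// `loadTiers` and `persist` stand in for the PostgreSQL reads and writes;
// in production the summarization runs async in edge functions, not inline.
declare function loadTiers(conversationId: string): Promise<{
  tier1: Message[];
  tier2: Summary[];
  tier3: MetaSummary;
}>;
declare function persist(conversationId: string, tiers: unknown): Promise<void>;

async function handleIncomingMessage(conversationId: string, text: string): Promise<void> {
  const tiers = await loadTiers(conversationId);

  // Compress before assembling so the request stays inside the token budget.
  await maybeRollOver(tiers.tier1, tiers.tier2); // Tier 1 → Tier 2
  await maybeCompact(tiers.tier2, tiers.tier3);  // Tier 2 → Tier 3
  await persist(conversationId, tiers);          // summaries cached for reuse

  // ...then assembleContext(...) with `text` and call the model.
}
```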
The system is designed to be transparent to users while being highly observable for developers. We track token counts, summary quality, and compression ratios to continuously improve.
What's Next: Future Enhancements
We're exploring several improvements:
- Adaptive compression: Adjust compression ratios based on conversation density
- Semantic importance: Weight summaries based on emotional significance or user feedback
- Multi-modal memory: Extend the system to handle images, voice, and other media
- User-controllable summaries: Optional manual editing or emphasis of important moments
- Cross-conversation memory: Share context between related conversations with the same character
The Bigger Picture
The 3-tier memory system isn't just a technical solution—it's a philosophy about how AI conversations should work. We believe the future of AI chat isn't about perfect recall or infinite context windows. It's about natural, human-like memory that balances detail with efficiency, recency with history.
As AI models evolve and context windows expand, the principles here will remain valuable: intelligent compression, layered context, and automatic management. Because no matter how large context windows become, there will always be conversations that exceed them. And there will always be value in mimicking how humans naturally remember.
If you're building an AI chat application, consider how your users will experience long-term conversations. Are you creating amnesia-prone bots that forget after 100 messages? Or are you building something that can truly grow with your users over time?
We chose the latter. And with 3-tier memory summarization, we've made it work.
Ready to experience unlimited AI conversations?
Try Haroo Chat and see how natural long-term AI relationships can feel when memory is handled right.
Start Chatting Now

Have questions about our memory system or want to learn more? Contact us or join our community to discuss the technical details.