Your AI Agent Forgets Everything. Here's How to Fix That.

Every conversation with an AI coding agent starts from scratch. It doesn’t remember that you rejected that approach last week. It doesn’t know why the code looks the way it does. It will confidently suggest the exact thing you already tried and abandoned.

This is the memory problem. And it’s the single biggest source of friction in AI-assisted development.

I spent a week studying every serious memory architecture on GitHub, then built memory-setup — a decision-aware memory system for Claude Code that runs on markdown files and two daily habits. No databases, no vector stores, no infrastructure. Just structured text and Git.

The Research

The starting point was a deep research session — four parallel agents pulling from GitHub repos, papers, and community knowledge. The goal: understand how production memory systems actually work, then figure out what’s worth stealing for a file-based workflow.

Here’s what the landscape looks like.

The Titans

Mem0 (51k+ stars) is the most adopted universal memory layer. Multi-level memory — user, session, agent state — with +26% accuracy over OpenAI’s memory on the LOCOMO benchmark and 90% fewer tokens than full-context approaches. The key insight: extract atomic facts from conversations, reconcile against existing memories, retrieve by semantic relevance.
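The extract-then-reconcile loop can be sketched in a few lines. This is an illustrative toy, not the actual Mem0 API — the function names, the string-based facts, and the naive "same subject supersedes" rule are all stand-ins for what Mem0 does with LLM calls:

```python
# Toy sketch of two-stage fact extraction (hypothetical names, not Mem0's API).
# Stage 1: extract atomic statements. Stage 2: reconcile each against memory.

def extract_facts(conversation: str) -> list[str]:
    """Stage 1 stand-in: in Mem0 this step is LLM-driven."""
    return [line.strip() for line in conversation.splitlines() if line.strip()]

def reconcile(fact: str, memory: list[str]) -> list[str]:
    """Stage 2: drop duplicates, let a newer fact supersede a stale one."""
    if fact in memory:
        return memory                                  # duplicate: no-op
    subject = fact.split(" is ")[0]
    kept = [m for m in memory if m.split(" is ")[0] != subject]  # evict stale
    return kept + [fact]

memory: list[str] = []
for fact in extract_facts("favorite db is Postgres\nfavorite db is SQLite"):
    memory = reconcile(fact, memory)

print(memory)  # the newer fact replaced the older one
```

The point of the two stages is that storage decisions happen per-fact, not per-conversation — which is what keeps retrieval cheap compared to full-context approaches.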

Letta/MemGPT (21k+ stars) pioneered stateful agents with self-modifying memory. The agent reads and writes its own memory — it decides what to remember. This is the pattern that makes memory feel alive rather than passive.

Hindsight (6.5k stars) made a critical distinction that shaped the entire project: most memory systems recall history, but Hindsight extracts lessons from history. The difference between “we used 2x oversampling” and “we learned that 2x oversampling is always worth the CPU cost for this type of effect.” State-of-the-art on the LongMemEval benchmark, verified by Virginia Tech and the Washington Post.

The Pattern Library

Across 14 systems — Mem0, Letta, Cognee, MemOS, Hindsight, HMLR, Mnemis (Microsoft), A-Mem (NeurIPS 2025), GAM, Nemori, Zep/Graphiti, Memoripy, and more — ten patterns kept recurring:

| Pattern | Source | What It Does |
| --- | --- | --- |
| Two-stage fact extraction | Mem0 | Extract atomic statements, then reconcile against existing memory |
| Three-tier memory | Hindsight | Raw facts → observations → mental models |
| Proof counting | Hindsight | More evidence = higher confidence = surface first |
| Self-modifying memory | Letta, A-Mem | The agent decides what to store and what to forget |
| Dual-route retrieval | Mnemis | Fast intuitive lookup + slow deliberate reasoning |
| Memory decay | Memoripy | Older, less-accessed memories gradually fade |
| Episodic granularity | Nemori | Match human episodic memory granularity; simple methods rival complex ones |
| Skill memory | MemOS | Skills remember what worked across sessions |
| Hierarchical compression | HMLR, GAM | Chunks → profiles → facts → temporal ordering |
| Background scribe | HMLR | Async extraction of hard facts from conversations |

The most surprising finding came from Nemori — a minimalist system proving that you don’t need complex infrastructure. Match the granularity of human episodic memory and simple retrieval works. That validated the markdown-first approach before I wrote a line of code.

What I Built

Memory-setup is five slash commands for Claude Code:

/memory-setup      → Bootstrap a project's memory layer (run once)
/manifest          → Capture reasoning before creative work
/match-outcomes    → Match results to prior reasoning
/memory-compact    → Synthesize, promote, archive
/memory-audit      → Score memory health

The daily habit is two commands: /manifest before creative work, /match-outcomes after results come in. Everything else is periodic.

The Intent Manifest

This is the core idea. Before any subjective or creative AI work, capture the reasoning:

Level 1 (30 seconds):

  • Goal — what this aims to accomplish
  • Audience — who it’s for
  • Key Choice — the subjective decision being made
  • Expected Outcome — how you’ll know it worked

Level 2 (2 minutes) adds strategy, alternatives considered, constraints, evaluation criteria, and confidence levels. Use Level 2 for high-stakes work — ad campaigns, architecture decisions, brand positioning.
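A Level 1 manifest might look like this — the filename, field layout, and example content are illustrative, not the exact format the commands emit:

```markdown
<!-- intent-manifests/2025-06-12-hero-copy.md (illustrative example) -->
## Intent Manifest — Level 1
- **Goal:** Rewrite the landing-page hero copy to emphasize speed
- **Audience:** Technical founders evaluating CI tools
- **Key Choice:** Lead with a benchmark number instead of a feature list
- **Expected Outcome:** Hero CTA click-through improves over the current copy
```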

The manifest exists because outcomes alone don’t teach you anything. You need to know what the hypothesis was before you can learn from the result.

Evidence-Based Promotion

When the same learning is validated 3+ times under comparable conditions with no contradictions, it gets promoted from a manifest insight to a permanent rule in CLAUDE.md. Failed approaches get recorded as anti-patterns so the agent never suggests them again.

The thresholds are borrowed from Hindsight’s proof counting pattern. Three validations isn’t arbitrary — it’s the minimum for distinguishing signal from coincidence while being low enough that the system actually promotes things.
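The promotion rule is simple enough to state as code. This is a sketch of the logic described above, with hypothetical data shapes — memory-setup tracks this in markdown, not Python:

```python
# Sketch of evidence-based promotion (hypothetical types, not the real code):
# 3+ validations with no contradictions → promote to a permanent rule.

from dataclasses import dataclass

PROMOTION_THRESHOLD = 3

@dataclass
class Learning:
    statement: str           # e.g. "2x oversampling is worth the CPU cost"
    scope: str               # the "when Y" condition promotion requires
    validations: int = 0     # outcomes that matched the prior reasoning
    contradictions: int = 0  # outcomes that went against it

def promotion_status(l: Learning) -> str:
    if l.contradictions > 0:
        # Contradicted learnings never promote; repeated failures become anti-patterns.
        return "anti-pattern-candidate" if l.contradictions >= l.validations else "contested"
    if l.validations >= PROMOTION_THRESHOLD:
        return "promote"     # becomes a CLAUDE.md rule, scoped to l.scope
    return "keep-observing"
```

Note the mandatory `scope` field: a learning is never promoted as a universal truth, only as "this works for X when Y."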

Domain Templates

The system ships with seven domain-specific templates: marketing, software dev, game dev, audio/DSP, knowledge management, agency/client work, and product strategy. Each template includes directory structures, canonical context documents, domain-critical anti-patterns, and filled examples.

Projects can combine domains. A VST plugin project gets dev + audio. A client-facing SaaS gets dev + marketing + strategy. The templates overlay — they don’t conflict.

Lazy Generation

Setup doesn’t dump a dozen empty files into your repo. Level 0 projects (pure mechanical execution, no subjective decisions) get just CLAUDE.md and PROJECT_CONTEXT.md. Additional files spawn on first use. If you never run /manifest, the intent manifests directory never gets created.

What It Actually Prevents

The system is designed to stop specific failure modes:

“Here’s a fresh idea” — that was actually rejected two weeks ago. The anti-pattern log prevents the agent from resurrecting dead-end approaches.

“I improved it” — but broke something that was intentionally rough, simple, or constrained. Design invariants and non-goals are documented, so the agent knows what not to optimize.

“This is best practice” — when the project intentionally deviates from best practice. The CLAUDE.md rules capture deliberate exceptions.

“I used a similar pattern from…” — that leaks another client’s context. The agency template enforces strict data isolation between client projects.

Strategy drift — brand voice, design pillars, sonic identity, and positioning slowly diverging across sessions. The canonical context documents keep them consistent, even when different people (or different AI sessions) are contributing.

Design Decisions

A few choices worth calling out:

Markdown-first, zero infrastructure. Every system I studied — Mem0, Letta, Cognee — requires a server, a vector database, or both. That’s the right call for production agents handling thousands of users. For a solo developer or small team using Claude Code, it’s overkill. Markdown files version-controlled in Git give you history, diffs, branching, and collaboration for free.

Threshold-based maintenance, not calendar-based. Compaction runs when material accumulates (10+ matched manifests, 15+ decisions, 20+ total memory items), not on a schedule. Calendar-based maintenance means you’re either maintaining too early (nothing to compact) or too late (memory is already stale).
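The trigger condition is a one-liner. The threshold numbers come from the post; the function itself is a hypothetical sketch, not memory-setup's actual implementation:

```python
# Threshold-based compaction trigger (thresholds from the post; code is a sketch).

def should_compact(matched_manifests: int, decisions: int, total_items: int) -> bool:
    """Compact when material accumulates, not on a calendar."""
    return matched_manifests >= 10 or decisions >= 15 or total_items >= 20

print(should_compact(4, 6, 12))   # False: nothing worth compacting yet
print(should_compact(11, 2, 13))  # True: matched manifests alone trigger it
```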

Promotion requires evidence. The 3+ validation threshold with mandatory scope conditions prevents both false confidence and over-generalization. “This works” becomes “this works for X when Y” — which is the only form of learning that’s actually transferable.

Never destructive. Archive with provenance, never delete. The agent can always trace back to why a decision was made, even if the conclusion has since been superseded.

The Bigger Picture

The research surfaced a principle from the Continuous Claude project that stuck with me: “Compound, don’t compact.” Most people think about memory as reducing what you have. The better model is compounding what you’ve learned — turning raw observations into validated patterns, patterns into rules, rules into institutional knowledge.

That’s what memory-setup does. Not a knowledge base. Not a note-taking system. A learning loop that turns AI conversations into accumulated project intelligence.

The repo is public: github.com/frankmanley/memory-setup. Copy the .claude/ directory into your project, run /memory-setup, and start building memory.