
Memory Configuration

Memory behavior is controlled through environment variables. All settings have sensible defaults. The advanced search features (multi-query, re-ranking, contextual rewrite) are disabled by default and can be enabled by setting their respective model variables.

| Variable | Default | Description |
| --- | --- | --- |
| `MEMORY_EXTRACTION_MODEL` | Provider default | Model used for automatic memory extraction after each turn |
| `MEMORY_MAX_RELEVANT` | `10` | Maximum number of relevant memories injected into context per turn |
| `MEMORY_SIMILARITY_THRESHOLD` | `0.7` | Minimum cosine similarity for vector search results (0–1) |
| `MEMORY_EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model for memory vectors |
| `MEMORY_EMBEDDING_DIMENSION` | `1536` | Vector dimension for embeddings |
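For example, a dotenv-style override file might look like the following. The model names and values here are illustrative, not recommendations:

```shell
# Core memory settings (all optional; defaults apply when unset)
MEMORY_EXTRACTION_MODEL=gpt-4.1-mini     # cheaper model for per-turn extraction
MEMORY_MAX_RELEVANT=15                   # inject up to 15 memories per turn
MEMORY_SIMILARITY_THRESHOLD=0.6          # slightly looser vector matching
MEMORY_EMBEDDING_MODEL=text-embedding-3-small
MEMORY_EMBEDDING_DIMENSION=1536
```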

The following variables control the hybrid search, scoring, and result selection pipeline.

| Variable | Default | Description |
| --- | --- | --- |
| `MEMORY_RRF_K` | `60` | Reciprocal Rank Fusion smoothing constant. Higher values give more weight to lower-ranked results |
| `MEMORY_FTS_BOOST` | `1.2` | Multiplier for full-text search (FTS) results in RRF scoring. Values > 1 favor keyword matches |
| `MEMORY_SUBJECT_BOOST` | `1.3` | Score multiplier when a memory's subject matches an entity in the query |
| `MEMORY_CATEGORY_BOOST` | `1.25` | Score multiplier for category-matching memories |
| `MEMORY_TEMPORAL_DECAY_LAMBDA` | `0.01` | Temporal decay rate. Higher = faster decay; 0 disables. Category-adjusted: facts decay 10× slower than decisions |
| `MEMORY_TEMPORAL_DECAY_FLOOR` | `0.7` | Minimum score multiplier from temporal decay, so old memories are never completely suppressed |
| `MEMORY_TOKEN_BUDGET` | `0` | Max tokens for the memory block in the prompt. 0 = unlimited (no budget enforcement) |
| `MEMORY_RECENCY_BOOST` | `true` | Recency-based score boost (×1.5 today, ×1.25 this week, ×1.1 this month). Set to `false` to disable |
| `MEMORY_ADAPTIVE_K` | `true` | Adaptive result trimming based on score distribution |
| `MEMORY_ADAPTIVE_K_MIN_SCORE_RATIO` | `0.3` | Minimum score as a ratio of the top result's score; results below this are dropped |
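To make the interaction of these knobs concrete, here is a minimal sketch of the fusion, decay, and trimming steps. The function names and data shapes are assumptions for illustration, not KinBot's actual internals:

```python
import math

# Illustrative constants mirroring the defaults in the table above
RRF_K = 60
FTS_BOOST = 1.2
DECAY_LAMBDA = 0.01
DECAY_FLOOR = 0.7
ADAPTIVE_K_MIN_SCORE_RATIO = 0.3

def rrf_scores(vector_ranked, fts_ranked):
    """Fuse two best-first ranked lists of memory ids with Reciprocal
    Rank Fusion; FTS contributions are boosted by FTS_BOOST."""
    scores = {}
    for rank, mem_id in enumerate(vector_ranked):
        scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (RRF_K + rank + 1)
    for rank, mem_id in enumerate(fts_ranked):
        scores[mem_id] = scores.get(mem_id, 0.0) + FTS_BOOST / (RRF_K + rank + 1)
    return scores

def temporal_decay(score, age_days):
    """Exponential decay clamped at the floor so old memories survive."""
    return score * max(math.exp(-DECAY_LAMBDA * age_days), DECAY_FLOOR)

def adaptive_k(ranked_scores):
    """Drop results scoring below a fixed ratio of the top result."""
    if not ranked_scores:
        return []
    cutoff = ranked_scores[0] * ADAPTIVE_K_MIN_SCORE_RATIO
    return [s for s in ranked_scores if s >= cutoff]
```

Note how the floor dominates for old memories: at λ = 0.01, a one-year-old memory's raw decay is exp(−3.65) ≈ 0.026, so the 0.7 floor is what actually applies.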

These features use additional LLM calls to improve retrieval quality. Each is disabled by default (no model set). Set a model name to enable.

| Variable | Default | Description |
| --- | --- | --- |
| `MEMORY_MULTI_QUERY_MODEL` | (disabled) | Model for generating query variations. Expands each query into 3 alternatives targeting different aspects |
| `MEMORY_HYDE_MODEL` | (disabled) | Model for HyDE (Hypothetical Document Embedding). Generates a hypothetical answer to use as an additional search query for better semantic matching |
| `MEMORY_RERANK_MODEL` | (disabled) | Model for re-ranking. If a rerank provider (Cohere/Jina) is configured, uses their cross-encoder API (~20× faster). Otherwise falls back to LLM-based scoring (0–10 scale) |
| `MEMORY_CONTEXTUAL_REWRITE_MODEL` | (disabled) | Model for rewriting short/ambiguous messages into standalone queries using conversation context |
| `MEMORY_CONTEXTUAL_REWRITE_THRESHOLD` | `80` | Character length threshold. Messages shorter than this are candidates for contextual rewriting |
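The enable-by-setting-a-model convention and the rewrite length gate could be sketched like this (the helper names are hypothetical; only the variable semantics come from the table above):

```python
import os

# Default mirrors MEMORY_CONTEXTUAL_REWRITE_THRESHOLD from the table
REWRITE_THRESHOLD = int(os.getenv("MEMORY_CONTEXTUAL_REWRITE_THRESHOLD", "80"))

def enabled(model_var):
    """An enhancement is on only when its model variable is set and non-empty."""
    return bool(os.getenv(model_var, "").strip())

def needs_contextual_rewrite(message):
    """Short messages are candidates for rewriting into standalone queries."""
    return (enabled("MEMORY_CONTEXTUAL_REWRITE_MODEL")
            and len(message) < REWRITE_THRESHOLD)
```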

| Variable | Default | Description |
| --- | --- | --- |
| `MEMORY_CONSOLIDATION_MODEL` | (disabled) | Model for memory consolidation (merging similar memories) |
| `MEMORY_CONSOLIDATION_SIMILARITY` | `0.85` | Cosine similarity threshold for considering two memories as candidates for consolidation |
| `MEMORY_CONSOLIDATION_MAX_GEN` | `5` | Maximum number of consolidated memories generated per run |

Consolidation clusters are capped at 3 memories to preserve detail. Larger groups are split and merged incrementally across runs. The LLM can also abort a merge if it determines the memories are about different topics.
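A minimal sketch of the candidate-clustering step, assuming a greedy grouping strategy (the actual algorithm may differ; `similarity(a, b)` is a stand-in for cosine similarity over stored embeddings):

```python
# Illustrative constants mirroring the defaults described above
CONSOLIDATION_SIMILARITY = 0.85
CLUSTER_CAP = 3  # clusters are capped at 3 memories to preserve detail

def cluster_candidates(memories, similarity):
    """Greedily group memory ids whose similarity clears the threshold.

    Groups never exceed CLUSTER_CAP; leftovers merge on a later run.
    """
    clusters, used = [], set()
    for a in memories:
        if a in used:
            continue
        group = [a]
        for b in memories:
            if b in used or b == a:
                continue
            if similarity(a, b) >= CONSOLIDATION_SIMILARITY and len(group) < CLUSTER_CAP:
                group.append(b)
                used.add(b)
        used.add(a)
        if len(group) > 1:
            clusters.append(group)
    return clusters
```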

Session compacting uses incremental micro-batch summarization: instead of compacting all messages at once, old messages are summarized in fixed-size batches (by turn count) from the oldest end after each LLM turn, spreading LLM cost over time.

A turn is defined as one user message plus all following messages (assistant responses, tool calls, tool results) until the next user message. This avoids false triggers in tool-heavy conversations where a single turn can produce 10-30 messages.
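The turn definition above can be sketched as a simple grouping pass (message shape assumed to be dicts with a `role` key):

```python
def group_into_turns(messages):
    """Group a flat, chronological message list into turns.

    A turn starts at a user message and includes every following
    message (assistant, tool call, tool result) up to, but not
    including, the next user message.
    """
    turns = []
    for msg in messages:
        if msg["role"] == "user" or not turns:
            turns.append([msg])
        else:
            turns[-1].append(msg)
    return turns
```

Counting turns rather than messages is what keeps a single tool-heavy exchange, which may span dozens of messages, from tripping thresholds early.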

| Variable | Default | Description |
| --- | --- | --- |
| `COMPACTING_BATCH_TURNS` | `10` | Number of oldest turns to summarize per micro-compaction batch |
| `COMPACTING_MIN_KEEP_TURNS` | `15` | Minimum non-compacted turns to keep as raw context. A batch is only taken when the non-compacted turn count exceeds `COMPACTING_BATCH_TURNS` + `COMPACTING_MIN_KEEP_TURNS` |
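The trigger condition reduces to a single comparison; a sketch with the defaults above (function name assumed):

```python
# Illustrative constants mirroring the defaults in the table above
BATCH_TURNS = 10
MIN_KEEP_TURNS = 15

def take_compaction_batch(turns):
    """Return (batch, remaining) given oldest-first turns.

    A batch of the oldest turns is taken only when more than
    MIN_KEEP_TURNS raw turns would remain afterwards.
    """
    if len(turns) > BATCH_TURNS + MIN_KEEP_TURNS:
        return turns[:BATCH_TURNS], turns[BATCH_TURNS:]
    return [], turns
```

With the defaults, nothing happens until the 26th non-compacted turn; then the oldest 10 are summarized, leaving 16 raw.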

| Variable | Default | Description |
| --- | --- | --- |
| `COMPACTING_MODEL` | Provider default | Model used for session compacting/summarization. Supports `providerId:modelId` format |
| `COMPACTING_MAX_SNAPSHOTS` | `10` | Maximum compacting snapshots kept per Kin |

The threshold can also be configured per-Kin (overriding the global setting) via the Compaction tab in the Kin's settings. Available per-Kin fields: `turnThreshold` (total turn threshold), `compactingModel`, and `compactingProviderId`.

Before compacting runs, KinBot applies a progressive pipeline to reduce context size without LLM calls:

| Variable | Default | Description |
| --- | --- | --- |
| `TOOL_RESULT_MASK_KEEP_LAST` | `2` | Number of recent tool call groups kept fully intact. Older groups are collapsed to one-line summaries |
| `OBSERVATION_COMPACTION_WINDOW` | `10` | Number of recent turns kept at full resolution. Older turns have tool results truncated. 0 = disabled |
| `OBSERVATION_MAX_CHARS` | `200` | Max characters for truncated tool results in the observation zone |
| `HISTORY_TOKEN_BUDGET` | `0` (disabled) | Emergency safety net: max tokens for conversation history. Messages are trimmed from the oldest end if exceeded. 0 = no limit |
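As an illustration of the observation-zone step, a sketch of truncating tool results outside the full-resolution window (function name and message shape are assumptions):

```python
# Illustrative constants mirroring the defaults in the table above
OBSERVATION_WINDOW = 10
OBSERVATION_MAX_CHARS = 200

def truncate_old_observations(turns):
    """Truncate tool results in turns older than the recent window.

    `turns` is oldest-first; each turn is a list of message dicts.
    The most recent OBSERVATION_WINDOW turns are left untouched.
    """
    cutoff = max(len(turns) - OBSERVATION_WINDOW, 0)
    for turn in turns[:cutoff]:
        for msg in turn:
            content = msg.get("content", "")
            if msg.get("role") == "tool" and len(content) > OBSERVATION_MAX_CHARS:
                msg["content"] = content[:OBSERVATION_MAX_CHARS] + " …[truncated]"
    return turns
```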

Large tool results are automatically spilled to temporary files instead of being included inline in the LLM context:

| Variable | Default | Description |
| --- | --- | --- |
| `TOOL_OUTPUT_SPILL_THRESHOLD` | `10000` | Byte threshold before spilling to file. 0 = disabled |
| `TOOL_OUTPUT_PREVIEW_LINES` | `200` | Lines included in the compact preview reference |
| `TOOL_OUTPUT_TTL_HOURS` | `24` | Hours before spilled files are cleaned up |
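The spill decision could look roughly like this sketch (file naming and reference format are assumptions; only the threshold and preview semantics come from the table):

```python
import os
import tempfile

# Illustrative constants mirroring the defaults in the table above
SPILL_THRESHOLD = 10_000   # bytes; 0 disables spilling
PREVIEW_LINES = 200

def maybe_spill(output):
    """Spill a large tool result to a temp file, returning a compact
    preview plus a file reference. Small outputs pass through inline."""
    if SPILL_THRESHOLD == 0 or len(output.encode("utf-8")) <= SPILL_THRESHOLD:
        return output
    fd, path = tempfile.mkstemp(prefix="tool-output-", suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write(output)
    preview = "\n".join(output.splitlines()[:PREVIEW_LINES])
    return f"{preview}\n[full output spilled to {path}]"
```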

Memory requires an embedding provider to be configured in Settings > Providers. Supported embedding providers:

  • OpenAI — `text-embedding-3-small`, `text-embedding-3-large`, `text-embedding-ada-002`
  • Voyage — Specialized embedding models
  • Jina AI — Multilingual embeddings
  • Nomic — Open-source embeddings
  • Mistral — Built-in embedding support
  • DeepSeek — Embedding support
  • Cohere — `embed-english-v3.0`, `embed-multilingual-v3.0`
  • Together AI — Various embedding models
  • Fireworks AI — Embedding support
  • Ollama — Local embedding models
  • OpenRouter — Access to multiple embedding providers
  • xAI — Embedding support

Common tuning adjustments:

  • Lower `MEMORY_SIMILARITY_THRESHOLD` (e.g., 0.5) to retrieve more memories at the cost of relevance
  • Raise `MEMORY_MAX_RELEVANT` if your Kin needs broader context awareness
  • Lower `COMPACTING_BATCH_TURNS` for more frequent, smaller compaction cycles
  • Raise `COMPACTING_MIN_KEEP_TURNS` to keep more raw context visible to the LLM
  • Enable multi-query (`MEMORY_MULTI_QUERY_MODEL=gpt-4.1-mini`) for better recall on complex queries
  • Enable re-ranking (`MEMORY_RERANK_MODEL=gpt-4.1-mini`) for better precision when you have many memories
  • Enable contextual rewrite (`MEMORY_CONTEXTUAL_REWRITE_MODEL=gpt-4.1-mini`) if your users send lots of short follow-up messages
  • Increase `MEMORY_FTS_BOOST` (e.g., 1.5) if keyword matching should matter more than semantic similarity
  • Use a faster/cheaper model for `MEMORY_EXTRACTION_MODEL`, since it runs on every turn
  • LLM enhancements (multi-query, re-rank, rewrite) each add one LLM call per retrieval; enable them selectively based on your needs
  • Disable temporal decay (`MEMORY_TEMPORAL_DECAY_LAMBDA=0`) if all memories should be treated equally regardless of age