Context window management is a fundamental skill for security engineers building AI-powered systems. Every Large Language Model operates within a finite context window—the maximum amount of text it can process in a single request. Modern models range from a few thousand tokens (GPT-3.5) to 200,000 tokens and beyond (Claude 3, Gemini 1.5), but even the largest windows fill quickly when processing security data.

The context window must accommodate multiple competing demands: system prompts that define AI behavior, retrieved documents from knowledge bases, conversation history for continuity, enrichment data from security tools, and the user’s current query—all while reserving space for the model’s response. Security data is particularly challenging because a single alert with full enrichment can consume thousands of tokens, and investigations often span hundreds of related events.

Effective context management directly impacts AI quality. According to Anthropic’s Long Context Guide, optimal context utilization can improve response accuracy by 40-60% while reducing token costs. Poor management leads to truncated context, lost information, degraded responses, and unnecessary API costs.

Why Context Management Matters for Security

Security AI applications face unique context challenges that don’t exist in typical chatbot or content generation use cases.

Volume challenges drive most context problems. Security operations centers process millions of log events daily, each potentially relevant to an investigation. A single alert might reference dozens of related events, multiple affected systems, and various threat intelligence sources. Fitting this information into even a 200K token window requires intelligent selection and summarization.

Continuity requirements add complexity. Unlike simple question-answering, security investigations span hours or days with multiple analysts. The AI must maintain investigation state across sessions, remember previous findings, track hypotheses, and avoid asking questions already answered. This requires persistent memory systems beyond the basic context window.

Compliance and audit needs mean you cannot simply discard context. Investigations require complete audit trails showing what information the AI considered, what it recommended, and why. External memory systems must preserve this provenance even when compressing content for the active context window.

Multi-analyst collaboration requires shared context that transfers between team members. When a senior analyst reviews a junior analyst’s AI-assisted investigation, the context must convey the full investigation state including reasoning, not just raw findings.

Context Window Fundamentals

Understanding how context windows are consumed helps optimize allocation. Every piece of text sent to an LLM is converted to tokens—roughly corresponding to word fragments—using model-specific tokenizers. Different content types consume tokens at different rates, per OpenAI’s Tokenization Guide.

System prompts typically consume 500-2,000 tokens and define the AI’s role, capabilities, constraints, and output format. Security system prompts often include organization-specific policies, tool descriptions, and response templates. These prompts remain constant across conversations and should be cached and versioned.

Retrieved context from RAG systems consumes the largest portion—often 50-70% of available tokens. This includes log entries, alert details, threat intelligence, asset information, and previous investigation notes. Each piece of retrieved context must justify its token cost through relevance to the current query.

Conversation history accumulates over multi-turn interactions, growing from near-zero to potentially thousands of tokens. Without management, history eventually crowds out space for new retrieved context. Strategies include sliding windows that keep recent turns, summarization of older exchanges, and extraction of key facts into structured memory.

Response reservation must be planned in advance. Models cannot generate responses that exceed available tokens, so 15-25% of the context window should be reserved for output. Response-heavy tasks like report generation require larger reservations (up to 45%).

Token consumption varies significantly by data type. A single syslog entry might use 50-150 tokens, while a JSON log event with nested fields consumes 200-500 tokens. Fully enriched alerts with threat intelligence, asset context, and related events can exceed 2,000 tokens each. Understanding these patterns informs how many items can fit in available space.
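As a rough illustration of how these allocations can be wired together, the sketch below defines a hypothetical ContextBudget helper; the class name and default percentages are illustrative, not taken from any particular library.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    """Splits a model's context window across the consumers described above."""
    window_tokens: int                 # total context window for the model
    system_prompt_pct: float = 0.10
    retrieved_pct: float = 0.55
    history_pct: float = 0.10
    response_pct: float = 0.20         # response reservation, planned up front
    safety_margin_pct: float = 0.05    # headroom for tokenizer variance

    def allocation(self) -> dict:
        shares = {
            "system_prompt": self.system_prompt_pct,
            "retrieved_context": self.retrieved_pct,
            "history": self.history_pct,
            "response_reservation": self.response_pct,
            "safety_margin": self.safety_margin_pct,
        }
        return {name: int(self.window_tokens * share) for name, share in shares.items()}

# Example: a 200K-token window
print(ContextBudget(window_tokens=200_000).allocation())
```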

Window Management Strategies

Sliding Window Approaches

Sliding windows are the most common context management technique, maintaining recent information while progressively discarding older content. The MemGPT research from Berkeley demonstrates how operating system concepts like virtual memory can be adapted for LLM context management.

The basic sliding window maintains a queue of conversation messages with a fixed token budget. When new messages would exceed the budget, the oldest messages are evicted. However, this simple FIFO (first-in-first-out) approach has limitations for security applications—an important finding from early in an investigation shouldn’t be discarded just because it’s old.

More sophisticated approaches add priority levels to messages. Security-critical content like identified IOCs, confirmed findings, or explicit analyst decisions can be marked as high-priority and protected from eviction. Lower-priority content like clarifying questions or intermediate reasoning gets evicted first regardless of age.

Recency-decay combines age and importance, where messages gradually lose their priority score over time. This naturally phases out routine exchanges while preserving important content longer. Semantic grouping ensures related messages are evicted together—removing half of a question-answer pair would leave confusing context.

Pinned messages represent content that should never be evicted, such as the initial investigation request or confirmed malicious indicators. These consume fixed token budget regardless of window pressure, so they should be used sparingly.
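A minimal sketch of a priority-aware sliding window, assuming messages carry an explicit priority level and pinned flag; the PriorityWindow name and eviction ordering are illustrative.

```python
import itertools

class PriorityWindow:
    """Illustrative priority-aware sliding window.

    Non-pinned messages are evicted lowest-priority-first, then oldest-first;
    pinned messages are never evicted.
    """

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self._seq = itertools.count()
        self.messages = []  # dicts with: text, tokens, priority, pinned, order

    def add(self, text: str, tokens: int, priority: int = 0, pinned: bool = False):
        self.messages.append({
            "text": text, "tokens": tokens, "priority": priority,
            "pinned": pinned, "order": next(self._seq),
        })
        self._evict()

    def _evict(self):
        # Evict until the window fits the token budget.
        while sum(m["tokens"] for m in self.messages) > self.max_tokens:
            candidates = [m for m in self.messages if not m["pinned"]]
            if not candidates:
                break  # only pinned content remains; the budget must be raised instead
            victim = min(candidates, key=lambda m: (m["priority"], m["order"]))
            self.messages.remove(victim)
```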

Hierarchical Memory

Rather than treating all context equally, hierarchical memory systems maintain multiple tiers with different retention policies and compression levels. This pattern, described in LlamaIndex’s memory architecture, mirrors how human memory works—detailed working memory for current tasks, summarized short-term memory for recent context, and compressed long-term memory for historical knowledge.

Working memory holds the current session’s full detail. Every message, retrieved document, and response remains available in its original form. This tier typically spans minutes to hours and provides the immediate context for ongoing interactions.

Short-term memory stores summarized versions of recent sessions. When a working memory session ends or exceeds capacity, its content is summarized—extracting key findings, decisions, entities mentioned, and action items—then stored for hours to days. This tier answers questions like “what did we discuss yesterday?” without consuming full working memory budget.

Long-term memory preserves only the most essential information across weeks or months. A multi-day investigation might compress to a paragraph summary plus a list of confirmed IOCs and affected systems. This tier provides background context for related future investigations.

Episodic memory handles specific memorable events that should be retained indefinitely—successful attack patterns, novel techniques, or significant organizational incidents. These entries are indexed for retrieval when similar situations arise.

The tier promotion process moves content through these levels: monitor capacity in each tier, identify candidates for promotion (typically oldest or least-accessed), summarize content appropriately for the destination tier, transfer with metadata preserved, then clean up the source. Each transition involves compression—working to short-term might preserve 30% of tokens, while short-term to long-term keeps only 10-15%.
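The promotion step might look roughly like the sketch below. The summarize callable stands in for an LLM-based summarizer (not a real library function), and the compression ratios follow the figures mentioned above.

```python
# Hypothetical tier-promotion sketch; tier names, ratios, and fields are illustrative.
TIERS = ["working", "short_term", "long_term"]
COMPRESSION = {
    ("working", "short_term"): 0.30,    # keep ~30% of tokens
    ("short_term", "long_term"): 0.12,  # keep ~10-15% of tokens
}

def promote(entry: dict, source: str, summarize) -> dict:
    """Move an entry to the next tier, compressing it for the destination."""
    dest = TIERS[TIERS.index(source) + 1]
    ratio = COMPRESSION[(source, dest)]
    target_tokens = int(entry["tokens"] * ratio)
    summary = summarize(entry["text"], max_tokens=target_tokens)
    return {
        "text": summary,
        "tokens": target_tokens,
        "tier": dest,
        # Provenance is preserved so audits can trace back to the original content.
        "source_id": entry.get("source_id", entry.get("id")),
        "promoted_from": source,
    }
```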

Dynamic Allocation

Different security tasks have fundamentally different context needs. Alert triage requires extensive retrieved context (related alerts, asset information, threat intelligence) but minimal conversation history since each alert is independent. Incident investigations need balanced history and context to maintain continuity across many interactions. Report generation requires substantial response reservation since outputs are lengthy, while system prompts can be minimal.

Effective systems detect task type and adjust allocation accordingly. An alert triage task might allocate 55% to retrieved context, 20% to response reservation, 10% each to system prompt and query, leaving just 5% for history. An investigation session might shift to 35% retrieved, 25% history, 25% response, and smaller allocations for prompts.

The key principles: context-heavy tasks should prioritize retrieved data over conversation history, response-heavy tasks need larger output reservations, interactive tasks balance history and context for conversational coherence, and precision tasks like query generation benefit from larger system prompts that constrain output format.
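One way to express task-specific profiles is a simple lookup table. The triage and investigation splits below follow the examples above; the report-generation numbers are illustrative assumptions.

```python
# Fractions of the context window per section; profile names are illustrative.
ALLOCATION_PROFILES = {
    "alert_triage":      {"system": 0.10, "query": 0.10, "history": 0.05,
                          "retrieved": 0.55, "response": 0.20},
    "investigation":     {"system": 0.08, "query": 0.07, "history": 0.25,
                          "retrieved": 0.35, "response": 0.25},
    "report_generation": {"system": 0.05, "query": 0.05, "history": 0.15,
                          "retrieved": 0.30, "response": 0.45},
}

def budget_for(task: str, window_tokens: int) -> dict:
    """Convert a profile into absolute token budgets for a given window size."""
    profile = ALLOCATION_PROFILES[task]
    return {section: int(window_tokens * share) for section, share in profile.items()}
```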

Conversation Summarization

As conversations grow, summarization becomes essential to prevent history from consuming the entire context budget. LangChain ConversationSummaryMemory provides reference implementations of this pattern.

Security conversations require careful summarization that preserves critical details. Indicators of compromise must be extracted verbatim—you cannot summarize an IP address. Decisions made during investigation should include their rationale. Systems and users mentioned should appear in list format for reference. Timeline events need chronological structure. Outstanding questions should remain visible.

Summarization triggers can be threshold-based (summarize when history exceeds 40% of budget), turn-based (summarize every 10-15 conversation turns), topic-based (summarize when investigation focus shifts to a new area), or session-based (always summarize at session boundaries). Most production systems combine multiple triggers.
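Two supporting pieces can be sketched as below: pulling IOCs out verbatim before summarization, and a combined threshold- and turn-based trigger. The regex patterns and thresholds are illustrative; production systems need more robust indicator extraction.

```python
import re

# Rough illustrative patterns; real deployments should use a dedicated IOC extractor.
IOC_PATTERNS = {
    "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5":    re.compile(r"\b[a-fA-F0-9]{32}\b"),
}

def extract_iocs(text: str) -> dict:
    """Pull indicators out verbatim so they survive summarization intact."""
    return {name: set(pattern.findall(text)) for name, pattern in IOC_PATTERNS.items()}

def should_summarize(history_tokens: int, budget_tokens: int,
                     turns_since_summary: int) -> bool:
    """Combined threshold- and turn-based trigger, per the options described above."""
    return history_tokens > 0.40 * budget_tokens or turns_since_summary >= 12
```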

Security-Specific Patterns

Log Window Management

Security logs present the most challenging context management scenario. A single server might generate thousands of events per minute, and investigations often span multiple systems over extended time periods. You cannot fit all potentially relevant logs into any context window—intelligent selection is mandatory.

Following NIST SP 800-92 log management guidelines, effective log selection combines multiple factors into a priority score. Severity weighting ensures critical and high-severity events are always considered first. Relevance scoring using embeddings or keyword matching boosts logs that relate to the current investigation focus. Recency factors favor recent events while allowing important historical events to remain visible.

The selection algorithm calculates a composite score for each candidate log entry, sorts by score, and adds logs to the context until the token budget is exhausted. After selection, logs should be reordered chronologically for presentation—the AI needs temporal sequence to understand event progression, even though selection wasn’t time-based.

When even selective inclusion exceeds budget, log summarization provides an alternative. Rather than including individual entries, generate a summary covering the time range examined, key events in sequence, repeated error patterns, affected assets, and potential security implications. This approach trades granular detail for broader coverage.
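The select-then-reorder approach might be sketched as follows. The weights, half-life, field names, and the relevance_fn and count_tokens callables are illustrative assumptions, not a prescribed implementation.

```python
import math
import time

SEVERITY_WEIGHT = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2, "info": 0.1}

def log_score(entry: dict, relevance: float, now: float, half_life_s: float = 6 * 3600) -> float:
    """Composite priority: severity, relevance to the investigation, and recency."""
    severity = SEVERITY_WEIGHT.get(entry["severity"], 0.1)
    recency = math.exp(-(now - entry["timestamp"]) / half_life_s)
    return 0.4 * severity + 0.4 * relevance + 0.2 * recency

def select_logs(entries, relevance_fn, token_budget: int, count_tokens) -> list:
    """Pick the highest-scoring entries that fit, then present them chronologically."""
    now = time.time()
    ranked = sorted(entries, key=lambda e: log_score(e, relevance_fn(e), now), reverse=True)
    selected, used = [], 0
    for entry in ranked:
        cost = count_tokens(entry["raw"])
        if used + cost > token_budget:
            continue  # skip entries that no longer fit the budget
        selected.append(entry)
        used += cost
    # Reorder chronologically so the model can follow event progression.
    return sorted(selected, key=lambda e: e["timestamp"])
```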

Alert Context Prioritization

Alerts require prioritization because investigations typically involve multiple related alerts, each with its own enrichment data. The primary alert under analysis needs full context, but related alerts provide valuable background without requiring complete detail. Priority scoring combines base severity (critical alerts score highest), recency (newer alerts receive a boost that decays over time), entity overlap (alerts sharing IP addresses, usernames, or hostnames with the focus alert score higher), and technique overlap (alerts mapping to the same MITRE ATT&CK techniques suggest related activity). Entity and technique overlap are particularly important for security context because they identify potentially related attack activity. An alert involving the same source IP as the current investigation is far more relevant than an unrelated critical alert. This relationship-aware prioritization helps AI systems understand attack patterns rather than just individual alerts.
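A possible composite priority score is sketched below, using Jaccard overlap for shared entities and ATT&CK techniques; the weights, half-life, and field names are illustrative assumptions.

```python
import math
import time

SEVERITY_SCORE = {"critical": 1.0, "high": 0.75, "medium": 0.5, "low": 0.25}

def _overlap(a: set, b: set) -> float:
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def alert_priority(alert: dict, focus: dict, now: float = None) -> float:
    """Score a related alert against the alert currently under investigation."""
    now = now if now is not None else time.time()
    severity = SEVERITY_SCORE.get(alert["severity"], 0.25)
    # Recency boost decays with a 24-hour half-life.
    recency = math.exp(-(now - alert["created_at"]) / (24 * 3600))
    entity_overlap = _overlap(set(alert["entities"]), set(focus["entities"]))
    technique_overlap = _overlap(set(alert["techniques"]), set(focus["techniques"]))
    return 0.3 * severity + 0.2 * recency + 0.3 * entity_overlap + 0.2 * technique_overlap
```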

Incident Investigation Context

Investigations span multiple sessions and require persistent context that survives conversation boundaries. Unlike ad-hoc queries, investigations have formal state: affected systems discovered so far, users potentially compromised, indicators confirmed as malicious, MITRE ATT&CK techniques identified, findings documented, and next steps planned. This state must be serialized into the context window at session start and updated throughout the investigation.

Different detail levels serve different needs: minimal serialization (~100 tokens) includes just the investigation ID, status, and top findings for quick reference; standard serialization (~300-500 tokens) adds asset lists, indicators, techniques, and next steps for typical working sessions; full serialization (~800-1,500 tokens) includes the complete timeline, all hypotheses, and detailed findings for comprehensive review or handoff.

The serialization format matters for AI comprehension. Structured formats like YAML or JSON help models parse investigation state reliably. Including explicit section headers (“Affected Systems:”, “Confirmed IOCs:”, “Next Steps:”) improves information retrieval within the context.
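A minimal sketch of serialization at the different detail levels, assuming PyYAML is available; the field names are illustrative.

```python
import yaml

def serialize_state(inv: dict, level: str = "standard") -> str:
    """Render investigation state at minimal, standard, or full detail."""
    doc = {
        "investigation_id": inv["id"],
        "status": inv["status"],
        "top_findings": inv["findings"][:3],
    }
    if level in ("standard", "full"):
        doc.update({
            "affected_systems": inv["systems"],
            "confirmed_iocs": inv["iocs"],        # preserved verbatim, never summarized
            "attack_techniques": inv["techniques"],
            "next_steps": inv["next_steps"],
        })
    if level == "full":
        doc.update({
            "timeline": inv["timeline"],
            "hypotheses": inv["hypotheses"],
            "findings": inv["findings"],
        })
    return yaml.safe_dump(doc, sort_keys=False)
```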

Threat Intelligence Context

Threat intelligence has natural time-decay characteristics that should inform context management. IP-based IOCs become stale within days as attackers rotate infrastructure. File hashes remain relevant longer since malware variants persist. TTPs and threat actor profiles change slowly and remain relevant for months. Weighting threat intelligence by recency prevents outdated IOCs from consuming valuable context space. Active campaign information should refresh daily, network indicators need hourly updates, while strategic intelligence on threat actors can refresh weekly. The decay function also matters—IP reputation should decay exponentially (rapidly becoming irrelevant), while TTP information decays linearly (gradually losing relevance over months).
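These decay shapes can be captured in two small weighting functions; the half-life and lifetime values below are illustrative assumptions consistent with the guidance above.

```python
import math

def ip_ioc_weight(age_hours: float, half_life_hours: float = 72) -> float:
    """Exponential decay: IP reputation becomes stale within days."""
    return math.exp(-math.log(2) * age_hours / half_life_hours)

def ttp_weight(age_days: float, lifetime_days: float = 180) -> float:
    """Linear decay: TTP and threat-actor intelligence loses relevance over months."""
    return max(0.0, 1.0 - age_days / lifetime_days)
```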

Implementation Techniques

Token Counting and Budgeting

Accurate token counting is essential because exceeding context limits causes failures or truncation. Different models use different tokenization schemes—GPT-4 uses cl100k_base encoding, while Claude uses a proprietary tokenizer per Anthropic’s documentation. Using the wrong tokenizer leads to inaccurate estimates that cause overflow errors. For OpenAI models, the tiktoken library provides exact token counts. For Claude, the API returns token counts with responses, and Anthropic provides estimation guidance. For quick estimates when exact counts aren’t critical, the heuristic of approximately 4 characters per token provides reasonable approximations for English text, though this varies significantly for code, JSON, or non-English content. Budget allocation should follow clear principles: reserve response tokens first (15-25% of context), guarantee system prompt allocation (your instructions must always fit), track utilization continuously across conversation turns, and maintain safety margins of 5-10% for tokenization variance and unexpected content.
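For OpenAI-style models, exact counting with tiktoken might look like the sketch below, with the approximately-4-characters-per-token heuristic as a fallback when an exact tokenizer isn't available.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Exact count via tiktoken when installed, otherwise a rough character heuristic."""
    try:
        import tiktoken
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # Fallback: ~4 characters per token for English prose; less accurate
        # for JSON, code, or non-English content.
        return max(1, len(text) // 4)
```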

Context Pruning Strategies

When context exceeds available tokens, intelligent pruning removes low-value content while preserving critical information. The MemGPT research demonstrates how operating system concepts like paging can inform pruning approaches.

Oldest-first (FIFO) pruning removes the earliest messages, working well for simple chat but potentially losing important early context in investigations. Relevance-based pruning computes semantic similarity between content and the current query, removing the least relevant items first—effective but requires embedding computation. Priority-based pruning removes low-priority content before high-priority regardless of age, requiring explicit priority tagging during content creation.

Summarize-and-replace offers a middle ground: rather than deleting content, compress older material into summaries. This preserves information (albeit in reduced form) while freeing tokens. Hybrid scoring combines multiple factors—recency, relevance, priority, uniqueness—into a composite score for production systems that need nuanced pruning behavior.

The key pruning decision factors include recency (when was it added?), relevance (how related to current query?), priority (is it marked critical?), uniqueness (is this information available elsewhere?), and reference frequency (how often is it cited?).
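A hybrid scoring sketch combining those factors; the weights, field names, and the relevance_fn and count_tokens callables are illustrative assumptions.

```python
def prune_score(item: dict, relevance: float, now: float) -> float:
    """Composite keep-score: higher means more worth retaining."""
    age_hours = (now - item["added_at"]) / 3600
    recency = max(0.0, 1.0 - age_hours / 48)            # fades over two days
    priority = 1.0 if item.get("critical") else 0.3
    uniqueness = 0.2 if item.get("duplicated_elsewhere") else 1.0
    references = min(1.0, item.get("ref_count", 0) / 5)
    return (0.2 * recency + 0.3 * relevance + 0.3 * priority
            + 0.1 * uniqueness + 0.1 * references)

def prune(items, relevance_fn, token_budget: int, count_tokens, now: float) -> list:
    """Drop the lowest-scoring items until the remainder fits the budget."""
    ranked = sorted(items, key=lambda i: prune_score(i, relevance_fn(i), now))
    while ranked and sum(count_tokens(i["text"]) for i in ranked) > token_budget:
        ranked.pop(0)  # lowest score is evicted first
    return ranked
```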

Memory Persistence

Context that spans sessions or exceeds window limits requires external memory systems. Per LlamaIndex documentation, effective persistent memory uses multiple storage mechanisms for different access patterns. Vector databases like Pinecone, Weaviate, or Chroma store embeddings for semantic retrieval—finding content similar to the current query. These excel at retrieving relevant historical context but require embedding computation. Key-value stores like Redis provide fast session state access with sub-millisecond latency, ideal for maintaining current investigation state that needs frequent reads and writes. Document stores like MongoDB or Elasticsearch handle structured investigation data with complex querying needs—finding all investigations involving a specific IP address or user. Knowledge graphs like Neo4j track entity relationships, enabling queries like “show all systems connected to this compromised host” or “find users who accessed this sensitive file.”
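As a small example of the vector-database path, storing and retrieving investigation notes with Chroma might look roughly like this; the collection name, document, and metadata fields are illustrative.

```python
import chromadb

client = chromadb.Client()  # in-memory; use a persistent client in production
notes = client.get_or_create_collection("investigation_notes")

# Store a finding with provenance so it can be retrieved by later sessions.
notes.add(
    ids=["inv-42-note-1"],
    documents=["Beaconing from host web-01 to 203.0.113.7 every 60 seconds."],
    metadatas=[{"investigation_id": "inv-42", "type": "finding"}],
)

# Semantic retrieval of historical context relevant to the current query.
results = notes.query(query_texts=["suspicious outbound C2 traffic"], n_results=3)
```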

Multi-Turn Optimization

Context allocation should adapt as conversations progress. Initial queries typically need heavy retrieved context (60%) with minimal history (5%) since there’s nothing to remember yet. Follow-up questions shift toward more history (30%) as the conversation develops shared context. Deep investigation phases balance history and retrieval equally—the AI needs both what’s been discussed and new information being analyzed. Summary and conclusion phases reserve substantial space for response (40%) since outputs are comprehensive. Progressive summarization compresses older turns while keeping recent exchanges verbatim, giving the AI both detailed recent context and compressed historical background. Entity tracking maintains running lists of mentioned IOCs, users, and systems that persist even when discussion is pruned. Decision logging preserves key conclusions and analyst decisions regardless of what else gets removed.
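Entity tracking and decision logging can be as simple as a running structure that survives pruning; the EntityTracker name and fields below are illustrative.

```python
class EntityTracker:
    """Keeps mentioned entities and analyst decisions even after turns are pruned."""

    def __init__(self):
        self.entities = {"iocs": set(), "users": set(), "systems": set()}
        self.decisions = []  # key conclusions preserved regardless of pruning

    def record_turn(self, iocs=(), users=(), systems=(), decision=None):
        self.entities["iocs"].update(iocs)
        self.entities["users"].update(users)
        self.entities["systems"].update(systems)
        if decision:
            self.decisions.append(decision)

    def as_context(self) -> str:
        """Render a compact block suitable for re-injection into the context window."""
        lines = [f"{kind}: {', '.join(sorted(values))}"
                 for kind, values in self.entities.items() if values]
        lines += [f"decision: {d}" for d in self.decisions]
        return "\n".join(lines)
```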

Context Quality Metrics

Measuring context management effectiveness requires metrics aligned with NIST AI Risk Management Framework guidance:

  • Context utilization: percentage of the available window used effectively; target greater than 80% to avoid wasting capacity.
  • Relevance score: quality of included context, assessed through semantic similarity to the query; target greater than 0.8.
  • Information density: unique information per token, identifying redundancy that wastes space.
  • Response coherence: output quality despite context limits, typically through human evaluation; target 90%+ acceptable responses.
  • Context overflow rate: share of queries exceeding limits; target below 5%.
  • Summarization retention: whether key information survives compression; target 95%+ retention of critical details like IOCs.
  • Retrieval precision: relevance of context pulled from external systems; target 85%+ relevance judgments.
  • Token cost efficiency: cost per successful query, identifying opportunities to reduce API spend through better context management.

Architecture Patterns

Stateless vs. Stateful Designs

Architecture choices significantly impact context management complexity and effectiveness.

Stateless architectures rebuild context from external sources for each request. They offer low complexity and high horizontal scalability since any server can handle any request. However, context quality suffers because state must be reconstructed, adding latency and potentially losing nuance that would be preserved in continuous memory.

Stateful architectures maintain context across requests, typically with session affinity routing requests to specific servers. Context quality is higher because state persists naturally, but scalability is constrained by session affinity requirements, and failure recovery requires state rehydration.

Hybrid architectures combine approaches: maintain hot session state in memory for active conversations while persisting to external storage for durability. This provides high context quality with better scalability than pure stateful designs, at the cost of moderate implementation complexity.

Cost models differ significantly: stateless systems pay per-query reconstruction costs, stateful systems incur ongoing memory costs, and hybrid systems balance both. Failure recovery is simplest for stateless (no state to lose) but complex for stateful (requires state rehydration). Stateless works best for simple Q&A and stateless APIs, stateful excels at investigations and chat, while hybrid suits most production systems.

External Memory Systems

When context requirements exceed window limits, external memory systems provide overflow capacity with retrieval-based access. The choice of system depends on access patterns and query types.

Vector databases store content as embeddings and retrieve based on semantic similarity. Pinecone offers managed infrastructure with low latency for large-scale deployments. Weaviate combines vector search with graph capabilities through a GraphQL interface. Chroma provides a simple Python API ideal for development and prototyping. Milvus targets enterprise scale with high throughput.

Key-value and caching systems handle session state and frequently accessed data. Redis delivers sub-millisecond access for session caching and now supports vector similarity search.

Graph databases like Neo4j excel at relationship tracking—understanding how entities connect, which is critical for attack path analysis and lateral movement detection. Search engines like Elasticsearch combine full-text search with vector capabilities, making them well-suited for log analysis where both keyword and semantic queries are needed.

Context Caching

Caching frequently used context reduces both latency and API costs. Different cache types serve different purposes. System prompt caches store static instructions for hours, invalidating only when prompts are versioned and updated. Since system prompts rarely change mid-session, this cache has high hit rates. Entity context caches store frequently queried asset information (user profiles, system details, threat actor summaries) with minute-scale TTLs, invalidating when underlying data changes through event-driven mechanisms. Investigation state caches persist for the session duration, updating explicitly when analysts add findings or change investigation direction. Threat intelligence caches store IOC lookups for hours with time-based refresh, balancing API costs against intelligence freshness. Query result caches with LRU eviction prevent redundant embedding or retrieval operations for repeated queries.
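An entity context cache with a time-based TTL might be sketched as below, assuming the cachetools library is available; the TTL, size, and fetch callable are illustrative.

```python
from cachetools import TTLCache

# 5-minute TTL for asset/user context; size bound keeps memory predictable.
entity_cache = TTLCache(maxsize=10_000, ttl=300)

def get_entity_context(entity_id: str, fetch) -> str:
    """Return cached enrichment for an entity, fetching on miss or expiry."""
    if entity_id not in entity_cache:
        entity_cache[entity_id] = fetch(entity_id)  # e.g., CMDB, IdP, or TI lookup
    return entity_cache[entity_id]
```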

Anti-Patterns to Avoid

Context window management requires avoiding common pitfalls that degrade AI performance:

  • Context stuffing: including irrelevant information “just in case” wastes tokens and degrades response quality as the model struggles to identify what’s important. The fix is relevance scoring before inclusion: measure semantic similarity to the query and apply a threshold.
  • Recency bias: over-prioritizing recent context can miss critical historical patterns. An IOC from last week’s investigation might be more relevant than today’s routine alert. Balance recency with explicit importance scoring.
  • Fixed allocation: static budgets regardless of task type. Report generation needs different allocation than alert triage. Implement task-specific profiles that adapt context distribution to current needs.
  • Ignoring token costs: failing to account for content density differences. JSON consumes more tokens than prose for the same information; code uses more than natural language. Content-type-aware budgeting prevents unexpected overflows.
  • Summarization loss: aggressive compression loses critical security details. IP addresses, file hashes, and timestamps cannot be summarized; they must be preserved verbatim even when surrounding context is compressed.
  • Context fragmentation: splitting related information across the context window reduces coherence. Keep related data together; an alert and its enrichment should appear adjacently, not scattered.
  • Missing provenance: including information without source attribution prevents verification and reduces analyst trust. Always include references to original logs, threat intelligence sources, or investigation notes.
  • Premature optimization: over-engineering context management before measuring actual needs. Start with simple approaches, measure utilization and overflow rates, then optimize based on evidence.

References

Vector and Memory Storage

  • Pinecone — Managed vector database for large-scale semantic search
  • Weaviate — Open-source vector database with GraphQL and hybrid search
  • Chroma — Lightweight embedding database for development
  • Milvus — Open-source vector database for enterprise scale
  • Redis — In-memory data store with vector search capabilities
  • Neo4j — Graph database for relationship tracking and entity modeling
  • Elasticsearch — Search and analytics engine with vector search support