Data Value Tiers and Retention
Value-Based Classification

Security data should be classified by investigative value, with retention and storage tier aligned to that value. High-value data such as authentication logs and security alerts warrants longer retention in fast storage; low-value data such as verbose application logs can be retained briefly or sampled. Classification should weigh detection use cases, investigation requirements, and compliance obligations. Value also decays over time: recent data has higher investigative value than historical data, and tiered storage reflects that decay.

Tiered Storage Architecture

Hot storage (30-90 days) provides fast access for active detection and investigation, typically on SSDs or in-memory databases, and should hold the highest-value and most recent data. Warm storage offers moderate access speed for less frequent queries on standard disks or mid-tier cloud storage, balancing cost with accessibility. Cold storage provides cost-effective long-term retention for compliance and historical analysis on object or archival storage; queries are slower but dramatically cheaper.

Retention Strategy

Differential retention per source aligns retention with data value and compliance requirements: critical security logs may require one-year retention while verbose application logs may need only 30 days. Compliance retention should be separated from operational retention, with compliance data moved to cold storage once operational needs expire. Legal holds and evidence requirements may demand indefinite retention of specific data, so legal hold processes should be automated and auditable. A policy sketch follows below.
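To make the tiering and retention rules concrete, here is a minimal sketch of a per-source policy table with an age-based tier lookup. The source names, tier boundaries, and retention periods are illustrative assumptions drawn from the examples above, not prescribed values.

```python
from datetime import date

# Hypothetical per-source policy: value tier and retention window.
RETENTION_POLICY = {
    "auth_logs":        {"value": "high", "retention_days": 365},
    "security_alerts":  {"value": "high", "retention_days": 365},
    "app_logs_verbose": {"value": "low",  "retention_days": 30},
}

HOT_DAYS, WARM_DAYS = 90, 365  # assumed tier boundaries

def storage_tier(event_date: date, today: date) -> str:
    """Map an event's age to hot/warm/cold storage."""
    age = (today - event_date).days
    if age <= HOT_DAYS:
        return "hot"
    if age <= WARM_DAYS:
        return "warm"
    return "cold"

def should_retain(source: str, event_date: date, today: date) -> bool:
    """Drop data past its source's retention window (absent legal holds)."""
    policy = RETENTION_POLICY.get(source)
    if policy is None:
        return True  # unknown sources: keep until classified
    return (today - event_date).days <= policy["retention_days"]
```

In practice a table like this would live in version control and feed lifecycle rules in the storage layer (for example, object-storage lifecycle transitions) rather than application code.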
Storage and Compute Optimization

Decoupled Architecture

Lakehouse architectures decouple storage from compute, enabling independent scaling: storage can grow without increasing compute costs, and compute can scale for query bursts without affecting storage. Decoupling enables cost-effective long-term retention on cheap object storage while preserving query performance through compute scaling.

Columnar Storage Formats

Columnar formats such as Parquet and ORC provide efficient compression and strong query performance for analytical workloads, because readers fetch only the columns a query needs, reducing I/O. Compression ratios of 10x or higher are common with columnar formats, dramatically reducing storage costs, though compression level should be balanced against query performance.

Partitioning and Pruning

Time-based partitioning lets queries prune irrelevant data: a query over recent data scans only recent partitions, reducing query cost. Tenant-based partitioning in multi-tenant environments enables per-tenant queries without scanning all data. The partitioning strategy should match common query patterns; partition pruning can cut query costs by 90% or more by eliminating unnecessary scans (see the sketch after this section).

Indexing and Query Optimization

Compressed indexes provide fast lookups with minimal storage overhead, and Bloom filters enable efficient existence checks without full data scans. Vectorized query engines process data in batches, improving CPU efficiency; vectorization can improve query performance by 10x or more.
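As an illustration of columnar storage combined with partition pruning, the sketch below writes events to Parquet partitioned by date and tenant, then reads back a single day's partition. It assumes pyarrow is available; the field names and paths are hypothetical.

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# A tiny batch of log events (hypothetical fields).
events = pa.table({
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "tenant":     ["acme", "globex", "acme"],
    "action":     ["login", "login_failed", "logout"],
})

# Columnar Parquet, laid out as logs/event_date=.../tenant=.../ so a
# query constrained to one day and tenant touches only those files.
pq.write_to_dataset(events, root_path="logs",
                    partition_cols=["event_date", "tenant"])

# Reading with a partition filter scans only matching directories
# (partition pruning); files for other days are never opened.
dataset = ds.dataset("logs", format="parquet", partitioning="hive")
recent = dataset.to_table(filter=ds.field("event_date") == "2024-05-02")
print(recent.num_rows)  # rows from 2024-05-02 only
```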
Pipeline Cost Levers

Pre-Aggregation and Rollups

Pre-aggregation computes statistics such as counts, sums, and percentiles during ingestion, enabling fast queries without scanning raw data. Rollups progressively aggregate older data, reducing storage while preserving analytical capability; daily rollups may be sufficient for data older than 90 days. Pre-aggregation trades ingestion cost for query cost, a trade that is usually favorable because data is queried more than once.

Sampling Strategies

Sampling reduces volume from high-volume, low-value sources while maintaining statistical representativeness; it must be applied carefully to avoid dropping security events. Head-based sampling keeps a fixed percentage of requests, while tail-based sampling decides based on request characteristics and can preserve every error while sampling successful requests (see the sketch after this section). Adaptive sampling adjusts rates with traffic volume, holding data volume roughly constant regardless of traffic.

Deduplication and Suppression

Deduplication eliminates duplicate log entries, reducing storage and processing costs; it should preserve first- and last-occurrence timestamps. Suppression drops low-value repetitive logs such as successful health checks; suppression rules must be designed carefully to avoid eliminating security-relevant events.

Edge Filtering

Edge filtering drops low-value data before ingestion, reducing pipeline and storage costs, and should occur as early as possible to minimize processing. Filtering rules should be version-controlled and tested to prevent accidental loss of valuable data.

Schema Optimization

Compact schemas reduce storage and processing costs. High-cardinality fields should be avoided or indexed carefully, and field types should match the actual data, with integers used instead of strings where possible. Type optimization reduces storage and improves query performance.
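A minimal sketch of the tail-based approach described above: every error is kept, while successful requests are sampled deterministically by request ID so that all events belonging to one request share the same fate. The 5% rate, field names, and status threshold are illustrative assumptions.

```python
import hashlib

SUCCESS_SAMPLE_RATE = 0.05  # keep ~5% of successful requests (assumed rate)

def keep_event(event: dict) -> bool:
    """Tail-based sampling: preserve all errors, sample successes."""
    if event["status"] >= 400:  # errors are always security-relevant: keep
        return True
    # Hash the request ID so the keep/drop decision is deterministic and
    # consistent across every log line belonging to the same request.
    digest = hashlib.sha256(event["request_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SUCCESS_SAMPLE_RATE

events = [
    {"request_id": "r-1", "status": 200},  # kept ~5% of the time
    {"request_id": "r-2", "status": 500},  # always kept
]
sampled = [e for e in events if keep_event(e)]
```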
Cost Governance and Tooling

Cost Visibility

Cost dashboards break down security data costs by source, team, and data type, enabling informed decisions about retention and sampling. Per-team budgets with chargeback or showback create accountability: teams that generate data should see its cost. Cost allocation should be automated and accurate so that costs are distributed fairly.

Query Governance

Query guards prevent expensive queries from consuming excessive resources, typically through timeouts, result size limits, and concurrency limits (a minimal guard sketch follows this section). Query quotas cap resource consumption per user or team so that no individual can degrade service for others; quotas should be monitored and adjusted to match legitimate needs. Expensive query patterns, such as full table scans and unbounded time ranges, should be identified and then optimized or blocked.

Detection Rule Optimization

Periodic reviews identify detection rules whose cost is high and whose value is low; rules should be measured by both cost and detection effectiveness. Expensive rules with low detection rates should be optimized or sunset, for example by adding filters, reducing query frequency, or improving query efficiency.

Source Rationalization

Low-value data sources should be sunset or sampled more aggressively, with value measured by detection coverage and investigation usage. Sources that are never queried provide no value and should be eliminated; source usage metrics inform these rationalization decisions.
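A sketch of the query-guard idea: before a query is dispatched, reject the expensive patterns named above (unbounded or oversized time ranges, excessive per-team concurrency). The limits and the `Query` shape are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_RANGE = timedelta(days=7)  # assumed per-query time-range cap
MAX_CONCURRENT = 5             # assumed per-team concurrency cap

@dataclass
class Query:
    team: str
    start: datetime
    end: datetime

running: dict[str, int] = {}  # team -> queries currently executing

def admit(q: Query) -> None:
    """Raise before dispatch if the query violates a guard."""
    if q.end <= q.start:
        raise ValueError("time range must be bounded and non-empty")
    if q.end - q.start > MAX_RANGE:
        raise ValueError(f"time range exceeds {MAX_RANGE}; narrow the query")
    if running.get(q.team, 0) >= MAX_CONCURRENT:
        raise RuntimeError(f"team {q.team} is at its concurrency limit")
    running[q.team] = running.get(q.team, 0) + 1

def release(q: Query) -> None:
    """Callers decrement the count when the query finishes."""
    running[q.team] -= 1
```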
Common Anti-Patterns

Infinite Retention

Retaining all data forever creates unsustainable cost without proportional value. Retention should be driven by value and compliance requirements, not by technical capability.

SIEM-Only Architectures

Forcing all data through an expensive SIEM platform inflates cost. Tiered architectures that route cold data to a data lake dramatically reduce it.

Expensive Query Patterns

Regular expressions on hot query paths create unnecessary CPU cost; indexed lookups should be preferred where possible (see the sketch after this section).

No Cost Ownership

Security teams without cost ownership lack incentives to optimize. Ownership should be explicit, with clear accountability for budget management.
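To illustrate the regex anti-pattern and its fix: pay the regex cost once at ingest to extract and index the field, so the hot query path becomes a dictionary lookup instead of a per-query scan. The log format and field names here are made up for the example.

```python
import re
from collections import defaultdict

LOGIN_RE = re.compile(r"user=(\w+) action=login_failed")

# At ingest time (paid once per event): extract the field and index it.
index: dict[str, list[int]] = defaultdict(list)

def ingest(line_no: int, line: str) -> None:
    m = LOGIN_RE.search(line)
    if m:
        index[m.group(1)].append(line_no)

# On the hot query path (paid per query): an O(1) dict lookup replaces
# re-running the regex over every stored line.
def failed_logins_for(user: str) -> list[int]:
    return index.get(user, [])
```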
Conclusion

Security data cost optimization requires architectural controls: tiered storage, intelligent sampling, and query optimization. Security engineers design cost-effective security data architectures that maintain comprehensive visibility while managing costs sustainably. Success requires treating cost as a first-class requirement alongside security effectiveness, with continuous monitoring and optimization of data costs. Organizations that invest in these fundamentals maintain security visibility at scale without budget overruns.