Scalable Infrastructure: Design a Distributed Log Aggregation System

📌 Question

"Design a system that collects and aggregates logs from distributed services across multiple servers, allowing querying and visualization in real time."

This system design question is frequently asked by companies like Datadog, Snowflake, and Google. It evaluates your ability to design observability infrastructure that is scalable, durable, and efficient to query.


✅ Solution

1. Clarify Requirements

Functional Requirements:

  • Collect logs from distributed microservices in real time
  • Ingest logs from multiple formats and sources (e.g., application, access, error logs)
  • Store logs for querying and dashboarding
  • Support filtering, full-text search, and time-range queries
  • Enable alerts based on log patterns or errors

Non-Functional Requirements:

  • High availability and durability
  • Horizontal scalability for ingestion and querying
  • Low-latency indexing for real-time use
  • Retention policy (e.g., 7–30 days of searchable logs)

2. Core System Components

  • Log Forwarders / Agents: Deployed on each host to collect and ship logs (e.g., Fluent Bit, Filebeat)
  • Ingestion Layer: Receives, parses, and buffers logs (e.g., Kafka, Logstash)
  • Indexing Layer: Indexes logs for search (e.g., Elasticsearch, OpenSearch, ClickHouse)
  • Storage Layer: Stores raw and indexed logs (e.g., S3, HDFS, cold storage)
  • Query Engine: Enables full-text search, filtering, and analytics
  • Dashboard / Alert Layer: Visualizes logs and sends alerts
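
To make the data flowing between these components concrete, here is a minimal Python sketch of a normalized log event. The schema is an assumption for illustration; real agents and pipelines each define their own field names:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogEvent:
    """A normalized log record as it moves through the pipeline (illustrative schema)."""
    timestamp_ms: int   # event time in epoch milliseconds
    service: str        # name of the emitting service
    host: str           # host or pod that produced the log
    level: str          # e.g., "INFO", "WARN", "ERROR"
    message: str        # raw log line or rendered message
    attributes: dict = field(default_factory=dict)  # parsed key/value fields
    trace_id: Optional[str] = None                  # optional correlation id
```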

3. Log Ingestion Flow

  1. Applications generate logs and write them to local files or stdout
  2. Agents on each node tail logs and forward them to the ingestion pipeline
  3. Logs are enriched (e.g., with host, service, and timestamp fields) and published to Kafka
  4. Processing layer parses logs, transforms them, and ships to indexers
  5. Logs are indexed and stored for querying

The system must handle logs in near real-time while being fault-tolerant and backpressure-aware.
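
As a sketch of steps 2 and 3, the Python snippet below enriches log lines and publishes them to Kafka using the kafka-python client. The broker address, topic name, and service name are assumptions for illustration:

```python
import json
import socket
import sys
import time

from kafka import KafkaProducer  # pip install kafka-python

SERVICE = "checkout-service"  # normally injected via env vars or config

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # trade some latency for durability (see Trade-Offs below)
)

for line in sys.stdin:  # stdin stands in for tailing a log file
    event = {
        "timestamp_ms": int(time.time() * 1000),
        "host": socket.gethostname(),
        "service": SERVICE,
        "message": line.rstrip("\n"),
    }
    # Keying by service keeps one service's logs ordered within a partition.
    producer.send("logs.raw", key=SERVICE.encode("utf-8"), value=event)

producer.flush()
```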


4. Indexing and Storage

  • Use inverted indexes to support full-text search
  • Time-based sharding: partition data by timestamp (e.g., daily or hourly); see the sketch after this list
  • Compress data for cost efficiency
  • Separate hot vs. cold storage (e.g., Elasticsearch for 7 days, S3 for 30+ days)
  • Apply retention policies or TTL for automatic log expiration
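
The sketch below illustrates the time-based sharding and retention points above: routing each event to a daily index and identifying indices that have aged out of the hot tier. The index naming scheme and the 7-day hot window are assumptions:

```python
from datetime import datetime, timedelta, timezone

def index_for(timestamp_ms: int, prefix: str = "logs") -> str:
    """Route an event to a daily index, e.g. 'logs-2024.05.01'."""
    day = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{prefix}-{day:%Y.%m.%d}"

def expired_indices(existing: list[str], hot_days: int = 7) -> list[str]:
    """Indices past the hot window: candidates to snapshot to S3 and delete."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=hot_days)
    expired = []
    for name in existing:
        try:
            day = datetime.strptime(name.split("-", 1)[1], "%Y.%m.%d")
        except (IndexError, ValueError):
            continue  # skip indices that don't follow the naming scheme
        if day.replace(tzinfo=timezone.utc) < cutoff:
            expired.append(name)
    return expired
```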

5. Query Layer and Dashboarding

  • Support filters like service name, log level, keyword, and time range
  • Enable aggregations (e.g., number of errors per service), as sketched after this list
  • Connect to tools like Grafana or Kibana for visualization
  • Create alerts for error spikes or unusual patterns using rule-based triggers
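
For example, the "number of errors per service" aggregation could be expressed as an Elasticsearch-style request body. This sketch only builds the body; field names follow the illustrative schema used earlier, and it assumes the level field is indexed as a keyword:

```python
import time

now_ms = int(time.time() * 1000)

# ERROR logs mentioning "timeout" in the last 15 minutes, counted per service.
query = {
    "size": 0,  # aggregation only; skip individual hits
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "ERROR"}},
                {"match": {"message": "timeout"}},
                {"range": {"timestamp_ms": {"gte": now_ms - 15 * 60 * 1000}}},
            ]
        }
    },
    "aggs": {
        "errors_per_service": {"terms": {"field": "service", "size": 10}}
    },
}
# This body would be sent to the search endpoint, e.g. POST /logs-*/_search.
```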

6. Fault Tolerance and Scalability

  • Kafka ensures durability and decouples ingestion from indexing
  • Use replication in Kafka and Elasticsearch/OpenSearch
  • Scale indexers horizontally based on log volume
  • Add backpressure and rate limits to handle traffic spikes (see the token-bucket sketch below)
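
One common way to implement the rate-limiting point above is a token bucket at the agent or ingestion edge. A minimal sketch, with illustrative rates:

```python
import time

class TokenBucket:
    """Simple token bucket for shedding or delaying load during spikes."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec   # steady-state refill rate
        self.capacity = burst      # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should buffer to disk or signal backpressure

# e.g., cap an agent at 5,000 events/sec with a 10,000-event burst
limiter = TokenBucket(rate_per_sec=5_000, burst=10_000)
```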

7. Monitoring and Observability

Track:

  • Ingestion latency and throughput
  • Queue backlog and dropped messages
  • Storage utilization and index health
  • Query performance and slow searches

Expose these metrics via Prometheus and surface them on dashboards for SRE teams.
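
A sketch of how an ingestion worker might expose such metrics with the prometheus_client library; the metric names and scrape port are assumptions:

```python
from prometheus_client import Counter, Histogram, start_http_server

INGESTED = Counter("logs_ingested_total", "Log events accepted", ["service"])
DROPPED = Counter("logs_dropped_total", "Log events dropped", ["reason"])
INGEST_LATENCY = Histogram(
    "log_ingest_latency_seconds",
    "Time from event creation to successful index write",
)

start_http_server(9100)  # expose /metrics for Prometheus to scrape

def record_indexed(event: dict, created_ms: int, indexed_ms: int) -> None:
    """Call after a log event is successfully written to the index."""
    INGESTED.labels(service=event.get("service", "unknown")).inc()
    INGEST_LATENCY.observe((indexed_ms - created_ms) / 1000)
```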


8. Trade-Offs and Considerations

  • Ingestion reliability: fire-and-forget vs. acknowledgments (performance vs. durability)
  • Storage tiering: fast but expensive SSD vs. cheap, slower cold object storage
  • Search capabilities: full-text indexing (costly) vs. structured querying only
  • Real-time vs. batch: low-latency streaming vs. cost-effective batch processing
  • Index retention: longer retention = higher cost

9. Extensions

  • Correlate logs with metrics and traces for full observability
  • Implement role-based access control for logs
  • Mask PII data before storage (a simple sketch follows this list)
  • Support log streaming to external systems (e.g., SIEM)
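
PII masking is typically applied in the processing layer, before logs are indexed or stored. A deliberately simple regex-based sketch; the patterns are illustrative and far from exhaustive:

```python
import re

# Illustrative patterns; production systems need locale- and domain-specific rules.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),  # naive card-number match
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),    # US SSN format
]

def mask_pii(message: str) -> str:
    """Redact common PII patterns before the log leaves the processing stage."""
    for pattern, replacement in PII_PATTERNS:
        message = pattern.sub(replacement, message)
    return message

print(mask_pii("user alice@example.com paid with 4111 1111 1111 1111"))
# -> user <EMAIL> paid with <CARD>
```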

10. What Interviewers Look For

  • End-to-end pipeline thinking: from log generation to query
  • Awareness of indexing strategies and storage trade-offs
  • Handling of backpressure and traffic spikes
  • Real-world observability concerns and operational readiness
  • Consideration of both real-time and long-term needs

✅ Summary

A distributed log aggregation system is the foundation of observability in modern infrastructure. A solid design includes:

  • Log collectors on every node
  • A durable, scalable ingestion pipeline (e.g., Kafka)
  • Efficient indexing and time-partitioned storage
  • Real-time query engine and alerting
  • Clear handling of faults, spikes, and scaling needs