Scalable Infrastructure: Design a Distributed Log Aggregation System

📌 Question

"Design a system that collects and aggregates logs from distributed services across multiple servers, allowing querying and visualization in real time."

This system design question is frequently asked by companies like Datadog, Snowflake, and Google. It evaluates your ability to design observability infrastructure that is scalable, durable, and efficient to query.


✅ Solution

1. Clarify Requirements

Functional Requirements:

  • Collect logs from distributed microservices in real time
  • Ingest logs from multiple formats and sources (e.g., application, access, error logs)
  • Store logs for querying and dashboarding
  • Support filtering, full-text search, and time-range queries
  • Enable alerts based on log patterns or errors

Non-Functional Requirements:

  • High availability and durability
  • Horizontal scalability for ingestion and querying
  • Low-latency indexing for real-time use
  • Retention policy (e.g., 7–30 days of searchable logs)

2. Core System Components

  • Log Forwarders / Agents: Deployed on each host to collect and ship logs (e.g., Fluent Bit, Filebeat)
  • Ingestion Layer: Receives, parses, and buffers logs (e.g., Kafka, Logstash)
  • Indexing Layer: Indexes logs for search (e.g., Elasticsearch, OpenSearch, ClickHouse)
  • Storage Layer: Stores raw and indexed logs (e.g., S3, HDFS, cold storage)
  • Query Engine: Enables full-text search, filtering, and analytics
  • Dashboard / Alert Layer: Visualizes logs and sends alerts
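
To make the data flowing between these components concrete, here is a minimal Python sketch of a normalized log event. The schema is an assumption for illustration; real agents and pipelines each define their own field names:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogEvent:
    """A normalized log record as it moves through the pipeline (illustrative schema)."""
    timestamp_ms: int   # event time in epoch milliseconds
    service: str        # name of the emitting service
    host: str           # host or pod that produced the log
    level: str          # e.g., "INFO", "WARN", "ERROR"
    message: str        # raw log line or rendered message
    attributes: dict = field(default_factory=dict)  # parsed key/value fields
    trace_id: Optional[str] = None                  # optional correlation id
```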

3. Log Ingestion Flow

  1. Applications generate logs and write them to local files or stdout
  2. Agents on each node tail logs and forward them to the ingestion pipeline
  3. Logs are enriched (e.g., with host, service, and timestamp fields) and published to Kafka
  4. Processing layer parses logs, transforms them, and ships to indexers
  5. Logs are indexed and stored for querying

The system must handle logs in near real-time while being fault-tolerant and backpressure-aware.
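
As a sketch of steps 2 and 3, the Python snippet below enriches log lines and publishes them to Kafka using the kafka-python client. The broker address, topic name, and service name are assumptions for illustration:

```python
import json
import socket
import sys
import time

from kafka import KafkaProducer  # pip install kafka-python

SERVICE = "checkout-service"  # normally injected via env vars or config

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # trade some latency for durability (see Trade-Offs below)
)

for line in sys.stdin:  # stdin stands in for tailing a log file
    event = {
        "timestamp_ms": int(time.time() * 1000),
        "host": socket.gethostname(),
        "service": SERVICE,
        "message": line.rstrip("\n"),
    }
    # Keying by service keeps one service's logs ordered within a partition.
    producer.send("logs.raw", key=SERVICE.encode("utf-8"), value=event)

producer.flush()
```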


4. Indexing and Storage

  • Use inverted indexes to support full-text search
  • Time-based sharding: partition data by timestamp (e.g., daily or hourly); see the sketch after this list
  • Compress data for cost efficiency
  • Separate hot vs. cold storage (e.g., Elasticsearch for 7 days, S3 for 30+ days)
  • Apply retention policies or TTL for automatic log expiration
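
The sketch below illustrates the time-based sharding and retention points above: routing each event to a daily index and identifying indices that have aged out of the hot tier. The index naming scheme and the 7-day hot window are assumptions:

```python
from datetime import datetime, timedelta, timezone

def index_for(timestamp_ms: int, prefix: str = "logs") -> str:
    """Route an event to a daily index, e.g. 'logs-2024.05.01'."""
    day = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{prefix}-{day:%Y.%m.%d}"

def expired_indices(existing: list[str], hot_days: int = 7) -> list[str]:
    """Indices past the hot window: candidates to snapshot to S3 and delete."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=hot_days)
    expired = []
    for name in existing:
        try:
            day = datetime.strptime(name.split("-", 1)[1], "%Y.%m.%d")
        except (IndexError, ValueError):
            continue  # skip indices that don't follow the naming scheme
        if day.replace(tzinfo=timezone.utc) < cutoff:
            expired.append(name)
    return expired
```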

5. Query Layer and Dashboarding

  • Support filters like service name, log level, keyword, and time range
  • Enable aggregations (e.g., number of errors per service), as sketched after this list
  • Connect to tools like Grafana or Kibana for visualization
  • Create alerts for error spikes or unusual patterns using rule-based triggers
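
For example, the "number of errors per service" aggregation could be expressed as an Elasticsearch-style request body. This sketch only builds the body; field names follow the illustrative schema used earlier, and it assumes the level field is indexed as a keyword:

```python
import time

now_ms = int(time.time() * 1000)

# ERROR logs mentioning "timeout" in the last 15 minutes, counted per service.
query = {
    "size": 0,  # aggregation only; skip individual hits
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "ERROR"}},
                {"match": {"message": "timeout"}},
                {"range": {"timestamp_ms": {"gte": now_ms - 15 * 60 * 1000}}},
            ]
        }
    },
    "aggs": {
        "errors_per_service": {"terms": {"field": "service", "size": 10}}
    },
}
# This body would be sent to the search endpoint, e.g. POST /logs-*/_search.
```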

6. Fault Tolerance and Scalability

  • Kafka ensures durability and decouples ingestion from indexing
  • Use replication in Kafka and Elasticsearch/OpenSearch
  • Scale indexers horizontally based on log volume
  • Add backpressure and rate limits to handle traffic spikes (see the token-bucket sketch below)
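
One common way to implement the rate-limiting point above is a token bucket at the agent or ingestion edge. A minimal sketch, with illustrative rates:

```python
import time

class TokenBucket:
    """Simple token bucket for shedding or delaying load during spikes."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec   # steady-state refill rate
        self.capacity = burst      # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should buffer to disk or signal backpressure

# e.g., cap an agent at 5,000 events/sec with a 10,000-event burst
limiter = TokenBucket(rate_per_sec=5_000, burst=10_000)
```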

7. Monitoring and Observability

Track:

  • Ingestion latency and throughput
  • Queue backlog and dropped messages
  • Storage utilization and index health
  • Query performance and slow searches

Expose these metrics via Prometheus and surface them on dashboards for SRE teams.
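
A sketch of how an ingestion worker might expose such metrics with the prometheus_client library; the metric names and scrape port are assumptions:

```python
from prometheus_client import Counter, Histogram, start_http_server

INGESTED = Counter("logs_ingested_total", "Log events accepted", ["service"])
DROPPED = Counter("logs_dropped_total", "Log events dropped", ["reason"])
INGEST_LATENCY = Histogram(
    "log_ingest_latency_seconds",
    "Time from event creation to successful index write",
)

start_http_server(9100)  # expose /metrics for Prometheus to scrape

def record_indexed(event: dict, created_ms: int, indexed_ms: int) -> None:
    """Call after a log event is successfully written to the index."""
    INGESTED.labels(service=event.get("service", "unknown")).inc()
    INGEST_LATENCY.observe((indexed_ms - created_ms) / 1000)
```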


8. Trade-Offs and Considerations

  • Ingestion reliability: fire-and-forget vs. acknowledgments (performance vs. durability)
  • Storage tiering: fast but expensive SSD vs. cheap, slower cold object storage
  • Search capabilities: full-text indexing (costly) vs. structured querying only
  • Real-time vs. batch: low-latency streaming vs. cost-effective batch processing
  • Index retention: longer retention = higher cost

9. Extensions

  • Correlate logs with metrics and traces for full observability
  • Implement role-based access control for logs
  • Mask PII data before storage (a simple sketch follows this list)
  • Support log streaming to external systems (e.g., SIEM)
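
PII masking is typically applied in the processing layer, before logs are indexed or stored. A deliberately simple regex-based sketch; the patterns are illustrative and far from exhaustive:

```python
import re

# Illustrative patterns; production systems need locale- and domain-specific rules.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),  # naive card-number match
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),    # US SSN format
]

def mask_pii(message: str) -> str:
    """Redact common PII patterns before the log leaves the processing stage."""
    for pattern, replacement in PII_PATTERNS:
        message = pattern.sub(replacement, message)
    return message

print(mask_pii("user alice@example.com paid with 4111 1111 1111 1111"))
# -> user <EMAIL> paid with <CARD>
```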

10. What Interviewers Look For

  • End-to-end pipeline thinking: from log generation to query
  • Awareness of indexing strategies and storage trade-offs
  • Handling of backpressure and traffic spikes
  • Real-world observability concerns and operational readiness
  • Consideration of both real-time and long-term needs

✅ Summary

A distributed log aggregation system is the foundation of observability in modern infrastructure. A solid design includes:

  • Log collectors on every node
  • A durable, scalable ingestion pipeline (e.g., Kafka)
  • Efficient indexing and time-partitioned storage
  • Real-time query engine and alerting
  • Clear handling of faults, spikes, and scaling needs