📌 Question
"Design a real-time analytics system that processes events (e.g., clicks, views, purchases) and allows querying aggregated results within seconds."
This system design question is a favorite at companies like Pinterest, LinkedIn, Uber, and Stripe. It tests your understanding of streaming pipelines, distributed processing, scalability, and data querying strategies.
✅ Solution
1. Clarify Requirements
Functional Requirements:
- Ingest millions of events per second from web/mobile apps
- Process events in near real-time
- Aggregate and store results (e.g., total clicks per user or per ad)
- Provide low-latency querying (<5s)
- Support dashboards and alerting use cases
Non-Functional Requirements:
- Horizontal scalability and fault tolerance
- Event ordering and deduplication (if needed)
- Exactly-once or at-least-once processing
- Time-windowed analytics (e.g., per minute, hour)
2. Core System Components
- Event Producer: Frontend/mobile clients or services emitting events
- Data Ingestion Layer: Distributed message queue (e.g., Kafka, Kinesis)
- Stream Processing Engine: Processes and aggregates events (e.g., Flink, Spark Streaming, Kafka Streams)
- Storage Layer: Stores aggregated results (e.g., ClickHouse, Druid, BigQuery, Redis)
- Query Engine / API Layer: Allows frontend to request analytics results
- Monitoring / Alerting: Observes system health and triggers alerts on anomalies
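To ground the discussion, here is a hypothetical shape for a single event as it might leave the client. All field names are illustrative, not a required schema:

```python
# A hypothetical click/view event emitted by a web or mobile client.
# Field names here are examples only, not a fixed schema.
event = {
    "event_id": "b8f3c2d1-0a4e-4f7e-9c1d-2e5f6a7b8c9d",  # unique ID, used later for deduplication
    "event_type": "click",                               # click | view | purchase
    "user_id": "user_42",
    "ad_id": "ad_1337",
    "timestamp_ms": 1718000000000,                        # client event time, not ingestion time
    "properties": {"page": "/home", "device": "ios"},
}
```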
3. Data Flow Overview
- Clients generate events (e.g., page views, ad clicks)
- Events are pushed to a Kafka topic for ingestion
- Stream processors consume from Kafka and apply transformations
- Aggregated results are written to an OLAP store
- A dashboard or analytics API layer queries these results
This design decouples ingestion, processing, storage, and querying, which improves modularity and scalability.
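To make the first hop concrete, here is a minimal producer sketch using the kafka-python client. The broker address, topic name `events.raw`, and the event shape are placeholders, not part of the design itself:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Producer that sends JSON-encoded events to a raw events topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_event(event: dict) -> None:
    # Key by user_id so all events for a user land in the same partition,
    # preserving per-user ordering for downstream processors.
    producer.send("events.raw", key=event["user_id"], value=event)

emit_event({"event_id": "e-1", "event_type": "click",
            "user_id": "user_42", "timestamp_ms": 1718000000000})
producer.flush()  # block until buffered records are sent
```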
4. Stream Processing Logic
- Windowed Aggregations: Compute aggregates over sliding or tumbling time windows
- Grouping: Aggregate by keys (e.g., user ID, ad ID)
- Deduplication: Drop repeats based on a unique event/message ID (optionally combined with the event timestamp)
- Watermarking: Handles late-arriving data
- Backpressure: Ensures system stability during spikes
Use Flink or Kafka Streams for stateful stream processing and event-time semantics.
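The snippet below sketches the core aggregation semantics in plain Python: tumbling one-minute windows keyed by (window, ad_id), plus a simple event-ID deduplication set. A real deployment would let Flink or Kafka Streams manage this state (with checkpointing and watermarks); this is only to illustrate what the processor computes.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # tumbling one-minute windows

# State a stream processor would normally manage for us:
# counts per (window_start, ad_id) and the set of seen event IDs for dedup.
counts: dict[tuple[int, str], int] = defaultdict(int)
seen_ids: set[str] = set()

def process(event: dict) -> None:
    """Deduplicate, assign a tumbling window, and update the aggregate."""
    if event["event_id"] in seen_ids:
        return  # duplicate delivery (at-least-once), ignore
    seen_ids.add(event["event_id"])

    window_start = (event["timestamp_ms"] // WINDOW_MS) * WINDOW_MS
    counts[(window_start, event["ad_id"])] += 1

process({"event_id": "e-1", "ad_id": "ad_1337", "timestamp_ms": 1718000030000})
process({"event_id": "e-1", "ad_id": "ad_1337", "timestamp_ms": 1718000030000})  # duplicate, no effect
print(counts)  # one click counted for ad_1337 in its one-minute window
```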
5. Storage & Query Layer
- Hot Data: Store recent aggregates (e.g., last 24 hours) in fast OLAP engines like ClickHouse or Druid
- Cold Data: Archive historical raw events to a data lake (e.g., S3, queried with Presto/Trino)
- Query API: Expose endpoints to fetch aggregates or stream trends to dashboards
For high performance, pre-aggregate common metrics in the stream layer so that most queries read precomputed results instead of computing them on demand.
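A thin query layer over the OLAP store might look like the FastAPI sketch below. The endpoint path, the `fetch_counts` helper, and the data it returns are hypothetical placeholders for whatever ClickHouse/Druid query you would actually run:

```python
from fastapi import FastAPI  # pip install fastapi uvicorn

app = FastAPI()

def fetch_counts(ad_id: str, minutes: int) -> list[dict]:
    # Placeholder: in practice this issues a SQL query against ClickHouse/Druid,
    # e.g. reading pre-aggregated per-minute counts for the given ad.
    return [{"window_start_ms": 1718000000000, "clicks": 42}]

@app.get("/ads/{ad_id}/clicks")
def ad_clicks(ad_id: str, minutes: int = 60):
    """Return per-minute click counts for the last N minutes."""
    return {"ad_id": ad_id, "series": fetch_counts(ad_id, minutes)}

# Run with: uvicorn analytics_api:app --reload
```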
6. Fault Tolerance and Scalability
- Use partitioned topics for horizontal scalability
- Checkpoint state and offsets in the stream processor
- Use replication for Kafka and HA setups for Flink/Spark
- Recover from failure using checkpoints and idempotent writes
- Use distributed storage with replication and compaction
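A common recovery pattern combines at-least-once consumption with idempotent writes: commit the Kafka offset only after the result is durably written, and key the write so that a replay overwrites rather than double-counts. A rough sketch with kafka-python (the consumer group, topic name, and `upsert` helper are assumptions):

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "events.aggregated",
    bootstrap_servers="localhost:9092",
    group_id="olap-writer",
    enable_auto_commit=False,  # we commit offsets ourselves, after the write
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

def upsert(row: dict) -> None:
    # Placeholder for an idempotent write keyed by (window_start, ad_id):
    # replaying the same row after a crash overwrites instead of adding.
    pass

for record in consumer:
    upsert(record.value)
    consumer.commit()  # offset committed only once the write has succeeded
```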
7. Latency and Throughput Tuning
- Tune Kafka partition count and retention settings
- Batch events in producers and consumers
- Use in-memory aggregation where possible
- Ensure fast serialization (e.g., Avro or Protobuf)
- Scale horizontally based on QPS or window size
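Several of these knobs live directly on the producer. The kafka-python settings below are real configuration options; the specific values are illustrative starting points, not recommendations:

```python
import json
from kafka import KafkaProducer

# Batching and compression trade a little latency for much higher throughput.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    linger_ms=20,            # wait up to 20 ms to fill a batch before sending
    batch_size=64 * 1024,    # up to 64 KiB of records per partition batch
    compression_type="lz4",  # smaller payloads on the wire (needs the lz4 package)
    acks=1,                  # leader-only acks: lower latency, weaker durability
)
```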
8. Monitoring and Observability
Track:
- Ingestion lag and processing time
- Dropped or delayed events
- Aggregation errors and retries
- Storage size and query performance
Use Prometheus + Grafana, or tools built into Kafka/Flink for real-time health dashboards.
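With the prometheus_client library, the processor can expose such metrics directly; Prometheus scrapes the endpoint and Grafana plots it. The metric names below are made up for illustration:

```python
from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metric names for a stream-processing worker.
EVENTS_PROCESSED = Counter("events_processed_total", "Events successfully aggregated")
EVENTS_DROPPED = Counter("events_dropped_total", "Events dropped (late or malformed)")
INGESTION_LAG = Gauge("ingestion_lag_seconds", "Now minus newest event timestamp consumed")

start_http_server(9100)  # expose /metrics for Prometheus to scrape

def on_event(event: dict, now_s: float) -> None:
    EVENTS_PROCESSED.inc()
    INGESTION_LAG.set(now_s - event["timestamp_ms"] / 1000)
```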
9. Trade-Offs and Considerations
| Concern | Trade-Off Options |
|---|---|
| Delivery Guarantee | At-least-once (simpler) vs. exactly-once (more complex) |
| Storage Type | OLAP (fast aggregates) vs. Data Lake (archival) |
| Query Flexibility | Pre-aggregation (faster) vs. on-demand computation |
| Latency vs. Accuracy | Delay results for completeness vs. stream immediately |
| Cost Efficiency | Buffering + TTLs vs. long-term batch pipelines |
10. What Interviewers Look For
- Understanding of stream processing and aggregation models
- Design for scalability and data correctness
- Awareness of trade-offs between latency, complexity, and storage
- Modularity: ingestion, processing, storage, and querying as separate layers
- Monitoring and operational safety in real-time systems
✅ Summary
A real-time analytics system must balance ingestion speed, correctness, fault tolerance, and low-latency querying. A solid design includes:
- Message queue for decoupling and buffering
- Stream processor for real-time transformation and aggregation
- OLAP engine for querying and serving
- Checkpoints, replication, and monitoring for reliability