Scalable Infrastructure: Design a Real-Time Analytics System

📌 Question

"Design a real-time analytics system that processes events (e.g., clicks, views, purchases) and allows querying aggregated results within seconds."

This system design question is a favorite at companies like Pinterest, LinkedIn, Uber, and Stripe. It tests your understanding of streaming pipelines, distributed processing, scalability, and data querying strategies.


✅ Solution

1. Clarify Requirements

Functional Requirements:

  • Ingest millions of events per second from web/mobile apps
  • Process events in near real-time
  • Aggregate and store results (e.g., total clicks per user or per ad)
  • Provide low-latency querying (<5s)
  • Support dashboards and alerting use cases

Non-Functional Requirements:

  • Horizontal scalability and fault tolerance
  • Event ordering and deduplication (if needed)
  • Exactly-once or at-least-once processing
  • Time-windowed analytics (e.g., per minute, hour)

2. Core System Components

  • Event Producer: Frontend/mobile clients or services emitting events (a minimal event shape is sketched after this list)
  • Data Ingestion Layer: Distributed message queue (e.g., Kafka, Kinesis)
  • Stream Processing Engine: Processes and aggregates events (e.g., Flink, Spark Streaming, Kafka Streams)
  • Storage Layer: Stores aggregated results (e.g., ClickHouse, Druid, BigQuery, Redis)
  • Query Engine / API Layer: Allows frontend to request analytics results
  • Monitoring / Alerting: Observes system health and triggers alerts on anomalies
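
To make these components concrete, below is a minimal sketch of the event shape a producer might emit. The class name and fields (ClickEvent, eventId, userId, adId, eventTimeMillis) are illustrative assumptions, not a prescribed schema.

```java
// Hypothetical event shape emitted by clients; names are illustrative.
public record ClickEvent(
        String eventId,       // unique ID, useful for deduplication downstream
        String userId,        // aggregation key for per-user counts
        String adId,          // aggregation key for per-ad counts
        String eventType,     // "click", "view", "purchase", ...
        long eventTimeMillis  // event time (not ingestion time), used for windowing
) {}
```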

3. Data Flow Overview

  1. Clients generate events (e.g., page views, ad clicks)
  2. Events are pushed to a Kafka topic for ingestion
  3. Stream processors consume from Kafka and apply transformations
  4. Aggregated results are written to an OLAP store
  5. A dashboard or analytics API layer queries these results

This design decouples ingestion, processing, storage, and querying, which improves modularity and scalability.
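
As a sketch of steps 1-2, the snippet below shows a producer pushing a JSON-encoded event to a Kafka topic. The topic name "events", the JSON payload, and keying by ad ID are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by ad ID so all events for one ad land in the same partition,
            // which preserves per-key ordering and enables keyed aggregation downstream.
            String adId = "ad-42";
            String event = "{\"eventId\":\"e-1\",\"userId\":\"u-7\",\"adId\":\"ad-42\","
                    + "\"eventType\":\"click\",\"eventTimeMillis\":" + System.currentTimeMillis() + "}";
            producer.send(new ProducerRecord<>("events", adId, event));
        }
    }
}
```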


4. Stream Processing Logic

  • Windowed Aggregations: Compute aggregates over sliding or tumbling time windows
  • Grouping: Aggregate by keys (e.g., user ID, ad ID)
  • Deduplication: Drop duplicates using a unique event/message ID (optionally combined with the event timestamp)
  • Watermarking: Handles late-arriving data
  • Backpressure: Ensures system stability during spikes

Use Flink or Kafka Streams for stateful stream processing and event-time semantics.
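
A minimal Kafka Streams sketch of this logic (assuming Kafka Streams 3.x for ofSizeAndGrace): per-ad click counts over one-minute tumbling windows with a 30-second grace period for late events. Topic names and the key choice are assumptions; a Flink job would express the same idea with keyBy, window, and aggregate operators.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class ClickCountJob {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-count-per-minute");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()                                           // records are keyed by ad ID
               .windowedBy(TimeWindows.ofSizeAndGrace(                 // 1-minute tumbling windows,
                       Duration.ofMinutes(1), Duration.ofSeconds(30))) // 30s grace for late events
               .count()                                                // stateful count per (key, window)
               .toStream()
               .map((windowedKey, count) -> KeyValue.pair(
                       windowedKey.key() + "@" + windowedKey.window().start(), count))
               .to("click-counts-per-minute", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```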


5. Storage & Query Layer

  • Hot Data: Store recent aggregates (e.g., last 24 hours) in fast OLAP engines like ClickHouse or Druid
  • Cold Data: Archive historical raw events to a data lake (e.g., S3 + Presto)
  • Query API: Expose endpoints to fetch aggregates or stream trends to dashboards

For high performance, pre-compute the aggregates that most queries need in the stream layer, so the query path does little on-demand computation.
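
As a sketch of the query path, the snippet below sums pre-aggregated per-minute counts for the last hour from a ClickHouse table over JDBC. The connection URL, table, and column names are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AdClickQuery {
    // Total clicks for one ad over the last hour, summed from
    // pre-aggregated per-minute buckets (hypothetical table/columns).
    public static long clicksLastHour(String adId) throws Exception {
        String url = "jdbc:clickhouse://localhost:8123/analytics"; // assumed ClickHouse JDBC URL
        String sql = "SELECT sum(clicks) FROM click_counts_per_minute "
                   + "WHERE ad_id = ? AND window_start >= now() - INTERVAL 1 HOUR";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, adId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}
```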


6. Fault Tolerance and Scalability

  • Use partitioned topics for horizontal scalability
  • Checkpoint state and offsets in the stream processor (see the Flink sketch after this list)
  • Use replication for Kafka and HA setups for Flink/Spark
  • Recover from failure using checkpoints and idempotent writes
  • Use distributed storage with replication and compaction
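
A brief Flink sketch of the checkpointing and restart settings listed above, assuming the classic DataStream API; the intervals and retry counts are illustrative, not tuned recommendations.

```java
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state and Kafka offsets every 10 seconds so the job
        // can resume from the last consistent checkpoint after a failure.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);
        env.getCheckpointConfig().setCheckpointTimeout(60_000);

        // Restart the job a bounded number of times before giving up.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // ... build the streaming pipeline here (source -> keyBy -> window -> sink) ...
        // env.execute("real-time-analytics");
    }
}
```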

7. Latency and Throughput Tuning

  • Tune Kafka partition count and retention settings
  • Batch events in producers and consumers (see the producer settings sketched after this list)
  • Use in-memory aggregation where possible
  • Ensure fast serialization (e.g., Avro or Protobuf)
  • Scale horizontally based on QPS or window size
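
Producer-side batching, compression, and idempotence settings as a sketch; the specific values are illustrative starting points rather than tuned recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducerConfig {
    public static Properties highThroughputProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Batch more records per request: wait up to 10 ms or until 64 KB accumulates.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);

        // Compress batches to cut network and broker I/O.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        // Idempotent writes avoid duplicates on retry (supports the delivery guarantees above).
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return props;
    }
}
```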

8. Monitoring and Observability

Track:

  • Ingestion lag and processing time
  • Dropped or delayed events
  • Aggregation errors and retries
  • Storage size and query performance

Use Prometheus + Grafana, or the metrics built into Kafka and Flink, for real-time health dashboards.
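
Ingestion lag (how far the stream processor's consumer group trails the log end) can be measured with the Kafka AdminClient, as sketched below. The group ID is an assumption, and in practice the lag would be exported as a metric to Prometheus rather than printed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the stream processor has committed (hypothetical group ID).
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("click-count-per-minute")
                    .partitionsToOffsetAndMetadata().get();

            // Latest offsets currently in the log for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(request).all().get();

            // Lag per partition = log end offset - committed offset.
            committed.forEach((tp, meta) -> System.out.printf("%s lag=%d%n",
                    tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```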


9. Trade-Offs and Considerations

  • Delivery Guarantee: At-least-once (simpler) vs. exactly-once (more complex)
  • Storage Type: OLAP (fast aggregates) vs. Data Lake (archival)
  • Query Flexibility: Pre-aggregation (faster) vs. on-demand computation
  • Latency vs. Accuracy: Delay results for completeness vs. stream immediately
  • Cost Efficiency: Buffering + TTLs vs. long-term batch pipelines

10. What Interviewers Look For

  • Understanding of stream processing and aggregation models
  • Design for scalability and data correctness
  • Awareness of trade-offs between latency, complexity, and storage
  • Modularity: ingestion, processing, storage, and querying as separate layers
  • Monitoring and operational safety in real-time systems

✅ Summary

A real-time analytics system must balance ingestion speed, correctness, fault tolerance, and low-latency querying. A solid design includes:

  • Message queue for decoupling and buffering
  • Stream processor for real-time transformation and aggregation
  • OLAP engine for querying and serving
  • Checkpoints, replication, and monitoring for reliability