Scalable Infrastructure: Design a Real-Time Analytics System

📌 Question

"Design a real-time analytics system that processes events (e.g., clicks, views, purchases) and allows querying aggregated results within seconds."

This system design question is a favorite at companies like Pinterest, LinkedIn, Uber, and Stripe. It tests your understanding of streaming pipelines, distributed processing, scalability, and data querying strategies.


✅ Solution

1. Clarify Requirements

Functional Requirements:

  • Ingest millions of events per second from web/mobile apps
  • Process events in near real-time
  • Aggregate and store results (e.g., total clicks per user or per ad)
  • Provide low-latency querying (<5s)
  • Support dashboards and alerting use cases

Non-Functional Requirements:

  • Horizontal scalability and fault tolerance
  • Event ordering and deduplication (if needed)
  • Exactly-once or at-least-once processing
  • Time-windowed analytics (e.g., per minute, hour)

2. Core System Components

  • Event Producer: Frontend/mobile clients or services emitting events (a minimal event shape is sketched after this list)
  • Data Ingestion Layer: Distributed message queue (e.g., Kafka, Kinesis)
  • Stream Processing Engine: Processes and aggregates events (e.g., Flink, Spark Streaming, Kafka Streams)
  • Storage Layer: Stores aggregated results (e.g., ClickHouse, Druid, BigQuery, Redis)
  • Query Engine / API Layer: Allows frontend to request analytics results
  • Monitoring / Alerting: Observes system health and triggers alerts on anomalies
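
To make these components concrete, below is a minimal sketch of the event shape a producer might emit. The class name and fields (ClickEvent, eventId, userId, adId, eventTimeMillis) are illustrative assumptions, not a prescribed schema.

```java
// Hypothetical event shape emitted by clients; names are illustrative.
public record ClickEvent(
        String eventId,       // unique ID, useful for deduplication downstream
        String userId,        // aggregation key for per-user counts
        String adId,          // aggregation key for per-ad counts
        String eventType,     // "click", "view", "purchase", ...
        long eventTimeMillis  // event time (not ingestion time), used for windowing
) {}
```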

3. Data Flow Overview

  1. Clients generate events (e.g., page views, ad clicks)
  2. Events are pushed to a Kafka topic for ingestion
  3. Stream processors consume from Kafka and apply transformations
  4. Aggregated results are written to an OLAP store
  5. A dashboard or analytics API layer queries these results

This design decouples ingestion, processing, storage, and querying, which improves modularity and scalability.
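
As a sketch of steps 1-2, the snippet below shows a producer pushing a JSON-encoded event to a Kafka topic. The topic name "events", the JSON payload, and keying by ad ID are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by ad ID so all events for one ad land in the same partition,
            // which preserves per-key ordering and enables keyed aggregation downstream.
            String adId = "ad-42";
            String event = "{\"eventId\":\"e-1\",\"userId\":\"u-7\",\"adId\":\"ad-42\","
                    + "\"eventType\":\"click\",\"eventTimeMillis\":" + System.currentTimeMillis() + "}";
            producer.send(new ProducerRecord<>("events", adId, event));
        }
    }
}
```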


4. Stream Processing Logic

  • Windowed Aggregations: Compute aggregates over sliding or tumbling time windows
  • Grouping: Aggregate by keys (e.g., user ID, ad ID)
  • Deduplication: Drop duplicates using a unique event/message ID (optionally combined with the event timestamp)
  • Watermarking: Handles late-arriving data
  • Backpressure: Ensures system stability during spikes

Use Flink or Kafka Streams for stateful stream processing and event-time semantics.
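
A minimal Kafka Streams sketch of this logic (assuming Kafka Streams 3.x for ofSizeAndGrace): per-ad click counts over one-minute tumbling windows with a 30-second grace period for late events. Topic names and the key choice are assumptions; a Flink job would express the same idea with keyBy, window, and aggregate operators.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class ClickCountJob {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-count-per-minute");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()                                           // records are keyed by ad ID
               .windowedBy(TimeWindows.ofSizeAndGrace(                 // 1-minute tumbling windows,
                       Duration.ofMinutes(1), Duration.ofSeconds(30))) // 30s grace for late events
               .count()                                                // stateful count per (key, window)
               .toStream()
               .map((windowedKey, count) -> KeyValue.pair(
                       windowedKey.key() + "@" + windowedKey.window().start(), count))
               .to("click-counts-per-minute", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```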


5. Storage & Query Layer

  • Hot Data: Store recent aggregates (e.g., last 24 hours) in fast OLAP engines like ClickHouse or Druid
  • Cold Data: Archive historical raw events to a data lake (e.g., S3 + Presto)
  • Query API: Expose endpoints to fetch aggregates or stream trends to dashboards

For high performance, pre-compute the aggregates that most queries need in the stream layer, so the query path does little on-demand computation.
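
As a sketch of the query path, the snippet below sums pre-aggregated per-minute counts for the last hour from a ClickHouse table over JDBC. The connection URL, table, and column names are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AdClickQuery {
    // Total clicks for one ad over the last hour, summed from
    // pre-aggregated per-minute buckets (hypothetical table/columns).
    public static long clicksLastHour(String adId) throws Exception {
        String url = "jdbc:clickhouse://localhost:8123/analytics"; // assumed ClickHouse JDBC URL
        String sql = "SELECT sum(clicks) FROM click_counts_per_minute "
                   + "WHERE ad_id = ? AND window_start >= now() - INTERVAL 1 HOUR";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, adId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}
```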


6. Fault Tolerance and Scalability

  • Use partitioned topics for horizontal scalability
  • Checkpoint state and offsets in the stream processor (see the Flink sketch after this list)
  • Use replication for Kafka and HA setups for Flink/Spark
  • Recover from failure using checkpoints and idempotent writes
  • Use distributed storage with replication and compaction
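
A brief Flink sketch of the checkpointing and restart settings listed above, assuming the classic DataStream API; the intervals and retry counts are illustrative, not tuned recommendations.

```java
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state and Kafka offsets every 10 seconds so the job
        // can resume from the last consistent checkpoint after a failure.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);
        env.getCheckpointConfig().setCheckpointTimeout(60_000);

        // Restart the job a bounded number of times before giving up.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // ... build the streaming pipeline here (source -> keyBy -> window -> sink) ...
        // env.execute("real-time-analytics");
    }
}
```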

7. Latency and Throughput Tuning

  • Tune Kafka partition count and retention settings
  • Batch events in producers and consumers (see the producer settings sketched after this list)
  • Use in-memory aggregation where possible
  • Ensure fast serialization (e.g., Avro or Protobuf)
  • Scale horizontally based on QPS or window size
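
Producer-side batching, compression, and idempotence settings as a sketch; the specific values are illustrative starting points rather than tuned recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducerConfig {
    public static Properties highThroughputProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Batch more records per request: wait up to 10 ms or until 64 KB accumulates.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);

        // Compress batches to cut network and broker I/O.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        // Idempotent writes avoid duplicates on retry (supports the delivery guarantees above).
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return props;
    }
}
```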

8. Monitoring and Observability

Track:

  • Ingestion lag and processing time
  • Dropped or delayed events
  • Aggregation errors and retries
  • Storage size and query performance

Use Prometheus + Grafana, or the metrics built into Kafka and Flink, for real-time health dashboards.
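
Ingestion lag (how far the stream processor's consumer group trails the log end) can be measured with the Kafka AdminClient, as sketched below. The group ID is an assumption, and in practice the lag would be exported as a metric to Prometheus rather than printed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the stream processor has committed (hypothetical group ID).
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("click-count-per-minute")
                    .partitionsToOffsetAndMetadata().get();

            // Latest offsets currently in the log for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(request).all().get();

            // Lag per partition = log end offset - committed offset.
            committed.forEach((tp, meta) -> System.out.printf("%s lag=%d%n",
                    tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```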


9. Trade-Offs and Considerations

  • Delivery Guarantee: At-least-once (simpler) vs. exactly-once (more complex)
  • Storage Type: OLAP (fast aggregates) vs. Data Lake (archival)
  • Query Flexibility: Pre-aggregation (faster) vs. on-demand computation
  • Latency vs. Accuracy: Delay results for completeness vs. stream immediately
  • Cost Efficiency: Buffering + TTLs vs. long-term batch pipelines

10. What Interviewers Look For

  • Understanding of stream processing and aggregation models
  • Design for scalability and data correctness
  • Awareness of trade-offs between latency, complexity, and storage
  • Modularity: ingestion, processing, storage, and querying as separate layers
  • Monitoring and operational safety in real-time systems

✅ Summary

A real-time analytics system must balance ingestion speed, correctness, fault tolerance, and low-latency querying. A solid design includes:

  • Message queue for decoupling and buffering
  • Stream processor for real-time transformation and aggregation
  • OLAP engine for querying and serving
  • Checkpoints, replication, and monitoring for reliability