System Design

A comprehensive guide to system design interviews — frameworks, core concepts, and technologies you need to know.

1. How to Prepare

1. Study 10-15 classic system design problems (URL shortener, chat system, news feed, etc.) and understand the patterns they share.
2. Learn the building blocks: load balancers, caches, databases, message queues, CDNs. Know when and why to use each.
3. Practice back-of-the-envelope estimation: QPS, storage, and bandwidth calculations.
4. Read engineering blogs from companies like Netflix, Uber, Airbnb, and Meta to see real-world system design in action.
5. Practice drawing diagrams and explaining them out loud. Communication is half the interview.
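
The estimation practice mentioned above can be rehearsed with a few lines of arithmetic. The sketch below is only an illustration: the function name `estimate`, its parameters, and every input number are assumptions chosen for the example, not figures from any real system.

```python
SECONDS_PER_DAY = 86_400

def estimate(dau, reads_per_user, writes_per_user, bytes_per_write, peak_factor=2):
    """Rough QPS, storage, and bandwidth figures from assumed daily usage."""
    reads_per_day = dau * reads_per_user
    writes_per_day = dau * writes_per_user
    avg_qps = (reads_per_day + writes_per_day) / SECONDS_PER_DAY
    storage_per_day = writes_per_day * bytes_per_write
    return {
        "avg_qps": avg_qps,
        # Peak traffic is assumed to be a fixed multiple of the average.
        "peak_qps": avg_qps * peak_factor,
        "storage_bytes_per_day": storage_per_day,
        "write_bandwidth_bytes_per_sec": storage_per_day / SECONDS_PER_DAY,
    }

# Hypothetical inputs: 10M DAU, 10 reads and 1 write per user per day, 1 KB per write.
result = estimate(dau=10_000_000, reads_per_user=10, writes_per_user=1, bytes_per_write=1_000)
```

With these inputs the average load is roughly 1,270 requests per second and storage grows by about 10 GB per day. In an interview, round aggressively; the order of magnitude is what drives the design.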

2. Interview Walkthrough

Clarify Requirements

Ask about functional requirements (what the system should do) and non-functional requirements (scale, latency, availability, consistency). Nail down the scope before designing anything.

Define Core Entities

Identify the main data entities (e.g., User, Post, Order) and their relationships. This grounds your design in concrete data.

Design the API

Define the key API endpoints or interfaces. This clarifies what the system exposes and how clients interact with it.

Draw High-Level Architecture

Sketch the main components: clients, load balancers, application servers, databases, caches, message queues. Show how data flows through the system.

Deep Dive into Components

Pick 1-2 components to zoom into based on interviewer interest. Discuss database schema, caching strategy, or how a specific service works internally.

Discuss Trade-offs

Every design decision has trade-offs. Discuss consistency vs availability, SQL vs NoSQL, synchronous vs asynchronous processing, and why you chose what you chose.

3. Common Mistakes

1. Jumping into the solution without clarifying requirements first.
2. Over-engineering the design by adding components you can't justify.
3. Ignoring non-functional requirements like scalability and fault tolerance.
4. Not discussing trade-offs. Every choice should have a 'why'.
5. Staying too high-level and never diving deep into any component.
6. Forgetting about data storage: how and where is data persisted?

4. Tips

✓ Always start with requirements; never jump into drawing boxes.
✓ Drive the conversation. The interviewer wants to see you lead.
✓ Use real numbers: 'We need to support 10M DAU' is better than 'a lot of users'.
✓ Name specific technologies when discussing components (e.g., 'Redis for caching' not just 'a cache').
✓ Draw clean diagrams: label everything and show data flow direction.
✓ Time-box each section: ~5 min requirements, ~5 min high-level, ~15 min deep dive, ~5 min trade-offs.

Concepts You Should Know

Scalability & Load Balancing

How systems handle increasing load by scaling horizontally or vertically, and how load balancers distribute traffic.

• What is horizontal vs vertical scaling?
• How do load balancers work? (Round-robin, least connections, IP hash)
• What is auto-scaling and when do you use it?
• How do you handle stateful services behind a load balancer?
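
The three balancing strategies named above can be sketched in a few lines each. This is a toy, in-process illustration; the class and function names are hypothetical, and a real load balancer also tracks health checks, weights, and timeouts.

```python
import hashlib
import itertools

class RoundRobin:
    """Hand out servers in a fixed rotation."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Send each request to the server with the fewest open connections."""
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

def ip_hash(client_ip, servers):
    """Map a client IP to the same server every time (sticky routing)."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

IP hashing is one simple answer to the stateful-service question: the same client keeps landing on the same server as long as the server list does not change. Moving session state into a shared store is the other common answer.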

Caching

Storing frequently accessed data in fast storage to reduce latency and database load.

• What is a cache? When do you use it?
• What caching strategies exist? (write-through, write-back, write-around)
• Where can caching be applied? (client, CDN, server, database)
• What are cache invalidation strategies?
• What are the trade-offs?
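
To make the read path and invalidation questions concrete, here is a minimal sketch of one common combination: reads check the cache first and fall back to the database, writes go to the database and invalidate the cached entry (a write-around style), and entries also expire after a TTL. The `CacheAside` class, the dict standing in for the database, and the injectable `clock` are all assumptions made for the example.

```python
import time

class CacheAside:
    def __init__(self, db, ttl_seconds=60.0, clock=time.monotonic):
        self.db = db                # a dict standing in for the real database
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._cache = {}            # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > self.clock():
            self.hits += 1
            return entry[0]
        # Miss or expired entry: read from the database and repopulate.
        self.misses += 1
        value = self.db.get(key)
        if value is not None:
            self._cache[key] = (value, self.clock() + self.ttl)
        return value

    def put(self, key, value):
        # Write to the database, then invalidate so the next read is fresh.
        self.db[key] = value
        self._cache.pop(key, None)
```

The trade-off shows up directly: a longer TTL raises the hit rate but widens the window in which a reader can see stale data if a write bypasses this code path.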

Database Design (SQL vs NoSQL, Sharding, Replication)

Choosing the right database, designing schemas, and scaling data storage.

• When do you use SQL vs NoSQL?
• What is database sharding and what are its trade-offs?
• What is replication? (leader-follower, multi-leader)
• How do you handle schema migrations at scale?
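
One sharding trade-off worth being able to quantify is the cost of resharding. The sketch below assumes the simplest possible scheme, integer keys routed by `key % num_shards`, and counts how many keys change shards when a shard is added. The function names are hypothetical.

```python
def shard_for(key, num_shards):
    """Route an integer key to a shard with plain modulo."""
    return key % num_shards

def moved_fraction(keys, old_shards, new_shards):
    """Fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    moved = sum(
        1 for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    )
    return moved / len(keys)

# Growing from 4 to 5 shards with keys 0..999.
fraction = moved_fraction(range(1000), 4, 5)  # 0.8
```

Adding one shard relocates 80% of the keys in this example, which is why modulo sharding makes scaling out expensive.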

Consistent Hashing

A technique for distributing data across nodes that minimizes redistribution when nodes are added or removed.

• How does consistent hashing work?
• Why is it better than simple modulo hashing?
• What are virtual nodes and why are they useful?
• Where is consistent hashing used? (distributed caches, databases)
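
A hash ring with virtual nodes can be sketched as a sorted list of (hash, node) points; a key belongs to the first point at or after its own hash, wrapping around at the end. This is a minimal illustration, not a production implementation. The `HashRing` class, the `vnodes` default, and the choice of SHA-256 are assumptions.

```python
import bisect
import hashlib

def _hash(value):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node is placed on the ring many times (virtual nodes)
        # so that load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [point for point in self._ring if point[1] != node]

    def get(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash, wrapping past the end.
        index = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[index % len(self._ring)][1]
```

The property to point out in an interview: when a node is added, the only keys that move are the ones that now belong to the new node. Every other key keeps its owner, unlike modulo hashing, where changing the node count can remap most keys.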

CAP Theorem

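A distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition Tolerance. Because network partitions cannot be ruled out in practice, the real choice is between consistency and availability while a partition is in effect.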

• What does each letter in CAP stand for?
• Why can you only pick two?
• What is the difference between CP and AP systems?
• How does eventual consistency fit in?
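
The CP vs AP difference is easiest to see on a single replica that has been cut off from its leader. The toy model below is an illustration only; `Follower`, `Unavailable`, and the `partitioned` flag are hypothetical names, and real systems detect partitions through timeouts rather than a boolean.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""

class Follower:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True while cut off from the leader

    def replicate(self, value):
        """Apply an update from the leader; it is lost while partitioned."""
        if self.partitioned:
            return False
        self.value = value
        return True

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise Unavailable("cannot confirm the latest value")
        # Availability first: answer with whatever we have, possibly stale.
        return self.value
```

An AP follower keeps serving the last value it saw and catches up once the partition heals, which is the eventual consistency behavior. A CP follower returns an error for the same read.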

Message Queues & Event-Driven Architecture

Decoupling services using asynchronous message passing for reliability and scalability.

• When should you use a message queue?
• What is pub/sub vs point-to-point?
• How do you handle message ordering and deduplication?
• What is event sourcing and CQRS?
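
Many queues deliver a message at least once, so a consumer may see the same message twice. One common way to handle the deduplication question is an idempotent consumer that remembers which message IDs it has processed. The sketch assumes every message carries a unique ID; the `IdempotentConsumer` class is hypothetical, and the in-memory set would need to be bounded or moved to a durable store in practice.

```python
class IdempotentConsumer:
    def __init__(self, handler):
        self.handler = handler
        self._seen = set()  # IDs of messages already processed

    def consume(self, message_id, payload):
        """Process a message once; return False for a redelivered duplicate."""
        if message_id in self._seen:
            return False
        self.handler(payload)
        # Record the ID only after the handler succeeds, so a failed
        # message is retried on redelivery instead of being dropped.
        self._seen.add(message_id)
        return True
```

Usage: pass any callable as the handler, for example `IdempotentConsumer(orders.append)`, and call `consume(message_id, payload)` for each delivery.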

API Design (REST, GraphQL, gRPC)

Designing clean, scalable interfaces for communication between services and clients.

• When do you use REST vs GraphQL vs gRPC?
• How do you design idempotent APIs?
• What is API versioning and pagination?
• How do you handle rate limiting at the API level?
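
For the pagination question, cursor-based pagination is a frequent choice because it stays stable when rows are inserted between requests. The sketch below assumes items are dicts sorted by an ascending integer `id` and uses the last returned `id` as the cursor; `list_page` is a hypothetical helper, and a real endpoint would push the filter into the database query.

```python
def list_page(items, limit, cursor=None):
    """Return (page, next_cursor) for items sorted by ascending 'id'."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Keep only items after the cursor; None means start from the beginning.
    remaining = [item for item in items if cursor is None or item["id"] > cursor]
    page = remaining[:limit]
    # A next cursor is returned only when more items exist beyond this page.
    next_cursor = page[-1]["id"] if len(remaining) > limit else None
    return page, next_cursor
```

A client keeps calling with the returned `next_cursor` until it comes back as `None`.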

Rate Limiting

Controlling the rate of requests to protect services from abuse and overload.

• What algorithms exist? (token bucket, leaky bucket, sliding window)
• Where do you implement rate limiting? (API gateway, application, load balancer)
• How do you rate limit in a distributed system?
• How do you handle rate limit exceeded responses?
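
Of the algorithms listed, the token bucket is the one most often sketched on a whiteboard. The version below is a single-process illustration; the `TokenBucket` class and its parameters are assumptions for the example, and a distributed limiter would keep the counters in a shared store instead of local memory.

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so refill can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits the budget."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

When `allow` returns `False`, an HTTP API commonly responds with status 429 Too Many Requests, optionally with a `Retry-After` header telling the client when to try again.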

Microservices vs Monolith

Architectural patterns for structuring applications as single units or collections of small services.

• What are the trade-offs between microservices and monolith?
• How do microservices communicate? (REST, gRPC, message queues)
• What is service discovery?
• How do you handle distributed transactions?

Networking Essentials (DNS, TCP/UDP, HTTP)

Core networking concepts that underpin all distributed systems.

• How does DNS resolution work?
• What is the difference between TCP and UDP?
• How does HTTP/2 improve over HTTP/1.1?
• What are WebSockets and when do you use them?

Storage (Block, Object, File)

Different storage types and when to use each for different data access patterns.

• What is block vs object vs file storage?
• When do you use S3 (object) vs EBS (block) vs EFS (file)?
• How do you design for durability and availability?
• What is data replication and erasure coding?

Consensus Algorithms (Raft, Paxos)

Algorithms that allow distributed systems to agree on a single value, even in the presence of failures.

• Why do distributed systems need consensus?
• How does Raft work at a high level?
• What is leader election?
• What is the difference between Raft and Paxos?
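
Both Raft and Paxos make progress only when a majority of nodes agree, and that is also how a Raft candidate wins an election. The arithmetic is small enough to sketch; the helper names below are hypothetical and the code covers only the vote count, not the protocol itself.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size):
    """How many nodes can fail while a majority is still reachable."""
    return (cluster_size - 1) // 2

def wins_election(votes_received, cluster_size):
    """A candidate becomes leader once a majority has voted for it."""
    return votes_received >= majority(cluster_size)
```

A 5-node cluster needs 3 votes and survives 2 failures. A 4-node cluster also needs 3 votes but survives only 1 failure, which is why clusters are usually sized with an odd number of nodes.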

Technologies You Should Know

Redis

Caching / In-Memory Store

In-memory data store used for caching, session management, rate limiting, and pub/sub messaging.

Kafka

Message Queue / Streaming

Distributed event streaming platform for building real-time data pipelines and streaming applications.

PostgreSQL

Relational Database

Powerful open-source relational database with strong ACID compliance, JSON support, and extensibility.

MongoDB

Document Database

Document-oriented NoSQL database designed for flexible schemas and horizontal scaling.

Cassandra / DynamoDB

Wide-Column / Key-Value

Wide-column / key-value stores designed for high write throughput and horizontal scalability across regions.

Elasticsearch

Search Engine

Distributed search and analytics engine for full-text search, log analysis, and real-time indexing.

Nginx

Load Balancer / Reverse Proxy

High-performance web server, reverse proxy, and load balancer used to handle concurrent connections efficiently.

Docker / Kubernetes

Containerization / Orchestration

Containerization and orchestration tools for packaging, deploying, and managing applications at scale.

ZooKeeper

Coordination / Service Discovery

Centralized service for distributed coordination, configuration management, and service discovery.

CDN (CloudFront, Cloudflare)

Content Delivery

Content delivery networks that cache and serve content from edge locations close to users for low latency.