System Design Practice
Mind Map Summary
- Topic: System Design Practice
- Definition: The process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. It involves making strategic decisions about technology, infrastructure, and patterns to build robust, scalable, and maintainable solutions.
- Key Concepts:
- Non-functional Requirements (NFRs):
- Scalability: Ability to handle increasing load (horizontal vs. vertical).
- Availability: System uptime and resilience to failures (redundancy, failover).
- Latency: Time taken for a request to be processed (response time).
- Consistency: Data consistency models (strong, eventual, causal).
- Durability: Guarantee that committed data will not be lost.
- Reliability: Probability of a system performing its intended function without failure.
- Security: Protection against unauthorized access and attacks.
- Cost: Infrastructure, development, and operational expenses.
- Maintainability: Ease of modifying, extending, and debugging the system.
- Core System Components:
- Load Balancers: Distribute incoming traffic across multiple servers.
- Web Servers: Handle HTTP requests (e.g., Nginx, Apache).
- Application Servers: Execute business logic (e.g., Node.js, Spring Boot, ASP.NET Core).
- Databases: Store and retrieve data (SQL: PostgreSQL, MySQL; NoSQL: MongoDB, Cassandra, Redis).
- Caching: Improve performance by storing frequently accessed data (e.g., Redis, Memcached).
- Message Queues: Enable asynchronous communication and decouple services (e.g., Kafka, RabbitMQ, SQS).
- Content Delivery Networks (CDNs): Deliver static content closer to users.
- Search Engines: Provide full-text search capabilities (e.g., Elasticsearch).
- Monitoring & Logging: Observe system health and troubleshoot issues.
- Design Principles & Patterns:
- Horizontal vs. Vertical Scaling: Adding more machines vs. adding more resources to existing machines.
- Database Sharding/Partitioning: Distributing data across multiple database instances.
- Database Replication: Maintaining multiple copies of data for availability and read scaling.
- Microservices vs. Monolith: Architectural styles and their trade-offs.
- Event-Driven Architecture: Services communicate via events.
- API Gateway: Single entry point for clients to access microservices.
- Circuit Breaker: Prevents cascading failures in distributed systems.
- Trade-offs: Understanding that optimizing for one NFR often impacts others (e.g., strong consistency vs. high availability).
- Benefits (Pros):
- Holistic Understanding: Develops a comprehensive view of how different components interact.
- Problem-Solving Skills: Enhances ability to break down complex problems into manageable parts.
- Architectural Thinking: Fosters the ability to design robust, scalable, and resilient systems.
- Interview Preparation: Essential skill for senior and staff engineering roles.
- Better Decision Making: Leads to more informed choices about technology and infrastructure.
- Challenges (Cons):
- Breadth of Knowledge: Requires understanding across many domains (networking, databases, distributed systems, security).
- No Single Solution: Often multiple valid designs exist, requiring justification of trade-offs.
- Abstract Nature: Can be difficult to grasp without practical experience.
- Keeping Up-to-Date: The technology landscape evolves rapidly.
- Practical Use:
- Designing new software systems from scratch.
- Re-architecting existing systems to address performance or scalability issues.
- Evaluating and selecting appropriate technologies.
- Preparing for technical interviews.
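Patterns like the Circuit Breaker listed above are easier to grasp with a small sketch. Below is a minimal in-memory circuit breaker in Python; the class name, thresholds, and states are illustrative, not taken from any specific library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    then allows a trial call once a cooldown period has elapsed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of hammering an unhealthy downstream service.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # A success resets the breaker.
            self.failures = 0
            self.opened_at = None
            return result
```

After the threshold is hit, callers get an immediate error rather than a slow timeout, which is exactly what prevents cascading failures in a distributed system.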
Core Concepts
System design is the art and science of architecting complex software systems. It’s not just about writing code, but about making high-level decisions that impact the system’s performance, reliability, scalability, and maintainability. A key aspect of system design is understanding and balancing non-functional requirements (NFRs), as optimizing for one often comes at the expense of another.
The process typically involves:
- Understanding Requirements: Clarifying functional and non-functional requirements.
- Estimating Scale: Determining expected user load, data volume, and traffic.
- High-Level Design: Sketching out major components and their interactions.
- Deep Dive: Detailing specific components (e.g., database schema, API design).
- Identifying Bottlenecks & Trade-offs: Analyzing potential issues and making informed decisions.
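The "Estimating Scale" step usually boils down to a few lines of arithmetic. A back-of-envelope sketch (all traffic numbers below are invented for illustration):

```python
# Back-of-envelope capacity estimation. The inputs are illustrative,
# not real measurements.
SECONDS_PER_DAY = 86_400

def estimate(daily_active_users, requests_per_user_per_day,
             avg_payload_bytes, peak_factor=2):
    """Return (average QPS, peak QPS, daily storage in GB)."""
    daily_requests = daily_active_users * requests_per_user_per_day
    avg_qps = daily_requests / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor          # crude peak-vs-average multiplier
    storage_gb = daily_requests * avg_payload_bytes / 1e9
    return avg_qps, peak_qps, storage_gb

# e.g., 10M DAU, 20 requests/user/day, ~1 KB per request
avg_qps, peak_qps, storage_gb = estimate(10_000_000, 20, 1_000)
# roughly 2,300 average QPS and 200 GB of new data per day
```

Numbers at this granularity are enough to decide whether a single database will do or whether sharding and caching need to enter the design.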
Practice Exercise
Design a high-level architecture for a real-time notification system. Discuss:
- How clients (web/mobile) would connect.
- How the backend would push notifications.
- How to handle millions of concurrent connections.
- The data model for storing notifications. (Hint: Consider SignalR, WebSockets, Message Queues).
Answer (High-Level Architecture for a Real-Time Notification System)
1. Requirements & Scale Estimation
- Functional: Send real-time notifications to users (web, mobile).
- Non-functional:
- Scalability: Handle millions of concurrent connections and high notification throughput.
- Availability: High uptime, notifications should be delivered reliably.
- Latency: Near real-time delivery (low latency).
- Durability: Notifications should be persisted.
- Security: Secure user authentication and authorization.
2. High-Level Components
- Clients: Web browsers, iOS/Android apps.
- API Gateway / Load Balancer: Entry point for all client requests.
- Notification Service: Core service for sending/managing notifications.
- WebSocket/Long Polling Server (e.g., SignalR Hub, custom WebSocket server): Manages persistent connections with clients.
- Message Queue/Broker (e.g., Kafka, RabbitMQ, Azure Service Bus): Decouples notification producers from consumers, handles fan-out.
- Database: Stores notification history, user preferences, and connection mapping.
- Cache (e.g., Redis): Stores active user connections, recent notifications for quick retrieval.
3. Architecture Diagram (Conceptual)
```
+-----------------+      +-----------------+      +-----------------+
|                 |      |                 |      |                 |
|   Web/Mobile    |      |   API Gateway   |      |  Notification   |
|    Clients      +-----> (Load Balancer)  +----->    Service       |
|                 |      |                 |      |                 |
+--------+--------+      +--------+--------+      +--------+--------+
         |                        |                        |
         | (WebSocket/Long Polling)                        | (Pub/Sub)
         |                        |                        |
+--------v--------+      +--------v--------+      +--------v--------+
|                 |      |                 |      |                 |
| WebSocket/Long  |      |  Message Queue  |      |    Database     |
| Polling Servers |<-----+   (Kafka/MQ)    +----->   (NoSQL/SQL)    |
|                 |      |                 |      |                 |
+-----------------+      +-----------------+      +-----------------+
         ^
         | (Cache for active connections)
         |
+--------+--------+
|                 |
|      Cache      |
|     (Redis)     |
+-----------------+
```
4. Discussion Points
1. How clients (web/mobile) would connect:
- WebSockets: Ideal for real-time, bi-directional communication. Clients establish a persistent connection with the WebSocket/Long Polling Servers. This is the preferred method for low-latency updates.
- Long Polling/Server-Sent Events (SSE): As a fallback for older browsers or specific scenarios where WebSockets are not feasible.
- Mobile Push Notifications (APNS/FCM): For mobile apps, integrate with platform-specific push notification services (Apple Push Notification Service, Firebase Cloud Messaging) for background notifications when the app is not active.
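For the mobile path, the backend ultimately assembles a push payload. A hedged sketch of building a message shaped like the FCM HTTP v1 `messages:send` request body (the token and values are placeholders; authentication and the actual HTTP call are omitted):

```python
import json

def build_fcm_message(device_token, title, body, extra=None):
    """Build a payload shaped like the FCM HTTP v1 message body.

    Field layout follows the FCM v1 format; note that FCM requires
    all values in the `data` map to be strings.
    """
    message = {
        "message": {
            "token": device_token,          # placeholder device token
            "notification": {"title": title, "body": body},
            "data": {k: str(v) for k, v in (extra or {}).items()},
        }
    }
    return json.dumps(message)

payload = build_fcm_message(
    "DEVICE_TOKEN", "New message", "You have 1 unread message",
    extra={"unread": 1},
)
```

The same Notification Service can branch here: connected users get the event over WebSockets, offline mobile users get it via APNS/FCM.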
2. How the backend would push notifications:
- Notification Producer: Any service that needs to send a notification (e.g., a new order service, a social media service) would publish a notification event to the Message Queue. This decouples the producer from the notification delivery mechanism.
- Notification Service (Consumer): This service subscribes to the Message Queue. When it receives a notification event:
- It persists the notification in the Database.
- It looks up the active user connections (from Cache or Database).
- It then pushes the notification to the relevant WebSocket/Long Polling Servers which are maintaining the client connections.
- For mobile, it would send the notification payload to APNS/FCM.
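The persist/look-up/push flow above can be sketched with Python's `queue.Queue` standing in for Kafka and plain in-memory structures standing in for the database and the WebSocket transport (all names here are illustrative):

```python
import queue

notification_queue = queue.Queue()   # stands in for Kafka/RabbitMQ
database = []                        # stands in for the notification store
delivered = {}                       # stands in for pushes to WebSocket servers

def publish(event):
    """Producer side: any service drops a notification event on the queue."""
    notification_queue.put(event)

def consume_one(active_connections):
    """Notification Service consumer: persist, look up connection, push."""
    event = notification_queue.get()
    database.append(event)                                  # 1. persist
    server = active_connections.get(event["userId"])        # 2. find connection
    if server is not None:
        delivered.setdefault(server, []).append(event)      # 3. push to server
    # else: user is offline -> fall back to APNS/FCM (not shown)

publish({"userId": "u1", "body": "Your order shipped"})
consume_one(active_connections={"u1": "ws-server-3"})
```

Because producers only touch the queue, adding a new notification source never requires changes to the delivery path.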
3. How to handle millions of concurrent connections:
- Horizontal Scaling of WebSocket/Long Polling Servers: Deploy multiple instances of these servers behind a Load Balancer (e.g., Nginx, HAProxy) that supports sticky sessions or WebSocket proxying. Each server can handle a large number of concurrent connections.
- Distributed Connection Management: The WebSocket servers need a way to know which user is connected to which server. A Cache (like Redis) can store this mapping (e.g., `userId -> serverId/connectionId`). When the Notification Service needs to send a notification, it queries the cache to find the correct server(s) to route the message to.
- Message Queue for Fan-out: The Message Queue is critical for fanning out notifications efficiently to potentially millions of connected clients. The Notification Service publishes to the queue, and the WebSocket servers consume from it.
- Stateless Application Servers: Keep the Notification Service and other backend services stateless to allow for easy horizontal scaling.
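The connection-mapping idea can be sketched with a dict standing in for Redis; in production this would be a shared Redis hash with heartbeat-refreshed expiry, and all names below are illustrative:

```python
import time

class ConnectionRegistry:
    """In-memory stand-in for a Redis-backed userId -> server mapping."""

    def __init__(self):
        self._by_user = {}

    def register(self, user_id, server_id, connection_id):
        # In Redis this could be a hash write plus a TTL refreshed on heartbeat.
        self._by_user[user_id] = {
            "serverId": server_id,
            "connectionId": connection_id,
            "lastActive": time.time(),
        }

    def unregister(self, user_id):
        # Called when a client disconnects or its heartbeat expires.
        self._by_user.pop(user_id, None)

    def route(self, user_id):
        """Return the server holding this user's connection, or None if offline."""
        entry = self._by_user.get(user_id)
        return entry["serverId"] if entry else None

registry = ConnectionRegistry()
registry.register("u42", server_id="ws-7", connection_id="conn-abc")
```

Keeping this mapping in a shared cache is what lets the Notification Service itself stay stateless and scale horizontally.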
4. The data model for storing notifications:
- Database Choice: A NoSQL database (e.g., MongoDB, Cassandra) might be suitable for its flexibility and scalability with high write/read volumes, especially for storing potentially large numbers of notifications per user. A relational database could also work, but might require more careful scaling.
- Notification Schema (Example):

```
Notification {
  id: UUID            (Primary Key)
  userId: UUID        (Index for quick lookup)
  type: String        (e.g., "message", "alert", "promotion")
  title: String
  body: String
  timestamp: DateTime (Index for chronological order)
  isRead: Boolean     (Index for unread count)
  link: String        (Optional URL)
  metadata: JSON      (Flexible field for additional data)
}
```

- User Connection Mapping (in Cache/Database):

```
UserConnection {
  userId: UUID         (Primary Key)
  connectionId: String (WebSocket connection ID)
  serverId: String     (ID of the WebSocket server handling the connection)
  lastActive: DateTime
}
```
- Indexes: Crucial for efficient querying (e.g., `userId` and `timestamp` for retrieving a user’s notifications, `isRead` for unread counts).
Trade-offs:
- Complexity vs. Scalability: This distributed architecture is more complex than a monolithic approach but offers significantly higher scalability and resilience.
- Real-time vs. Cost: Maintaining persistent WebSocket connections can be resource-intensive, impacting cost. Mobile push notifications are more cost-effective for non-active users.
- Consistency vs. Latency: For notifications, eventual consistency is often acceptable (a notification might arrive slightly delayed but will eventually arrive), allowing for lower latency and high availability.
This high-level design provides a foundation for building a robust and scalable real-time notification system.