System Design Practice

Mind Map Summary

  • Topic: System Design Practice
  • Definition: The process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. It involves making strategic decisions about technology, infrastructure, and patterns to build robust, scalable, and maintainable solutions.
  • Key Concepts:
    • Non-functional Requirements (NFRs):
      • Scalability: Ability to handle increasing load (horizontal vs. vertical).
      • Availability: System uptime and resilience to failures (redundancy, failover).
      • Latency: Time taken for a request to be processed (response time).
      • Consistency: Data consistency models (strong, eventual, causal).
      • Durability: Guarantee that committed data will not be lost.
      • Reliability: Probability of a system performing its intended function without failure.
      • Security: Protection against unauthorized access and attacks.
      • Cost: Infrastructure, development, and operational expenses.
      • Maintainability: Ease of modifying, extending, and debugging the system.
    • Core System Components:
      • Load Balancers: Distribute incoming traffic across multiple servers.
      • Web Servers: Handle HTTP requests (e.g., Nginx, Apache).
      • Application Servers: Execute business logic (e.g., Node.js, Spring Boot, ASP.NET Core).
      • Databases: Store and retrieve data (SQL: PostgreSQL, MySQL; NoSQL: MongoDB, Cassandra, Redis).
      • Caching: Improve performance by storing frequently accessed data (e.g., Redis, Memcached).
      • Message Queues: Enable asynchronous communication and decouple services (e.g., Kafka, RabbitMQ, SQS).
      • Content Delivery Networks (CDNs): Deliver static content closer to users.
      • Search Engines: Provide full-text search capabilities (e.g., Elasticsearch).
      • Monitoring & Logging: Observe system health and troubleshoot issues.
    • Design Principles & Patterns:
      • Horizontal vs. Vertical Scaling: Adding more machines vs. adding more resources to existing machines.
      • Database Sharding/Partitioning: Distributing data across multiple database instances.
      • Database Replication: Maintaining multiple copies of data for availability and read scaling.
      • Microservices vs. Monolith: Architectural styles and their trade-offs.
      • Event-Driven Architecture: Services communicate via events.
      • API Gateway: Single entry point for clients to access microservices.
      • Circuit Breaker: Prevents cascading failures in distributed systems.
    • Trade-offs: Understanding that optimizing for one NFR often impacts others (e.g., strong consistency vs. high availability).
  • Benefits (Pros):
    • Holistic Understanding: Develops a comprehensive view of how different components interact.
    • Problem-Solving Skills: Enhances ability to break down complex problems into manageable parts.
    • Architectural Thinking: Fosters the ability to design robust, scalable, and resilient systems.
    • Interview Preparation: Essential skill for senior and staff engineering roles.
    • Better Decision Making: Leads to more informed choices about technology and infrastructure.
  • Challenges (Cons):
    • Breadth of Knowledge: Requires understanding across many domains (networking, databases, distributed systems, security).
    • No Single Solution: Often multiple valid designs exist, requiring justification of trade-offs.
    • Abstract Nature: Can be difficult to grasp without practical experience.
    • Keeping Up-to-Date: The technology landscape evolves rapidly.
  • Practical Use:
    • Designing new software systems from scratch.
    • Re-architecting existing systems to address performance or scalability issues.
    • Evaluating and selecting appropriate technologies.
    • Preparing for technical interviews.

Core Concepts

System design is the art and science of architecting complex software systems. It’s not just about writing code, but about making high-level decisions that impact the system’s performance, reliability, scalability, and maintainability. A key aspect of system design is understanding and balancing non-functional requirements (NFRs), as optimizing for one often comes at the expense of another.

The process typically involves:

  1. Understanding Requirements: Clarifying functional and non-functional requirements.
  2. Estimating Scale: Determining expected user load, data volume, and traffic.
  3. High-Level Design: Sketching out major components and their interactions.
  4. Deep Dive: Detailing specific components (e.g., database schema, API design).
  5. Identifying Bottlenecks & Trade-offs: Analyzing potential issues and making informed decisions.

Practice Exercise

Design a high-level architecture for a real-time notification system. Discuss:

  1. How clients (web/mobile) would connect.
  2. How the backend would push notifications.
  3. How to handle millions of concurrent connections.
  4. The data model for storing notifications. (Hint: Consider SignalR, WebSockets, Message Queues).

Answer (High-Level Architecture for a Real-Time Notification System)

1. Requirements & Scale Estimation

  • Functional: Send real-time notifications to users (web, mobile).
  • Non-functional:
    • Scalability: Handle millions of concurrent connections and high notification throughput (a rough back-of-envelope estimate follows this list).
    • Availability: High uptime, notifications should be delivered reliably.
    • Latency: Near real-time delivery (low latency).
    • Durability: Notifications should be persisted.
    • Security: Secure user authentication and authorization.
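
A rough back-of-envelope estimate helps anchor the design. The figures below are illustrative assumptions, not numbers given in the exercise:

    10 M daily active users, ~10% connected at peak   ->  ~1 M concurrent connections
    20 notifications per user per day                 ->  200 M notifications/day
    200 M notifications / 86,400 s                    ->  ~2,300 notifications/s on average (peaks 5-10x higher)
    ~1 KB per stored notification, 1 year retained    ->  200 M x 365 x 1 KB ≈ 73 TB of history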

2. High-Level Components

  • Clients: Web browsers, iOS/Android apps.
  • API Gateway / Load Balancer: Entry point for all client requests.
  • Notification Service: Core service for sending/managing notifications.
  • WebSocket/Long Polling Server (e.g., SignalR Hub, custom WebSocket server): Manages persistent connections with clients.
  • Message Queue/Broker (e.g., Kafka, RabbitMQ, Azure Service Bus): Decouples notification producers from consumers, handles fan-out.
  • Database: Stores notification history, user preferences, and connection mapping.
  • Cache (e.g., Redis): Stores active user connections, recent notifications for quick retrieval.

3. Architecture Diagram (Conceptual)

+-----------------+     +-----------------+     +-----------------+
|                 |     |                 |     |                 |
|   Web/Mobile    |     |   API Gateway   |     |  Notification   |
|     Clients     +---->| (Load Balancer) +---->|     Service     |
|                 |     |                 |     |                 |
+--------+--------+     +--------+--------+     +--------+--------+
         |                       |                       |
         | (WebSocket/           |                       | (Persist)
         |  Long Polling)        |                       |
         v                       v                       v
+-----------------+     +-----------------+     +-----------------+
|                 |     |                 |     |                 |
| WebSocket/Long  |     |  Message Queue  |     |    Database     |
| Polling Servers |<----+   (Kafka/MQ)    +---->|   (NoSQL/SQL)   |
|                 |     |                 |     |                 |
+--------+--------+     +-----------------+     +-----------------+
         ^
         | (Cache for active connections)
         |
+--------+--------+
|                 |
|      Cache      |
|     (Redis)     |
+-----------------+

4. Discussion Points

1. How clients (web/mobile) would connect:

  • WebSockets: Ideal for real-time, bi-directional communication. Clients establish a persistent connection with the WebSocket/Long Polling Servers. This is the preferred method for low-latency updates (a client-side connection sketch follows this list).
  • Long Polling/Server-Sent Events (SSE): Fallbacks for older browsers, restrictive proxies, or other scenarios where WebSockets are not feasible; note that SSE only supports server-to-client delivery.
  • Mobile Push Notifications (APNS/FCM): For mobile apps, integrate with platform-specific push notification services (Apple Push Notification Service, Firebase Cloud Messaging) for background notifications when the app is not active.
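
A minimal client-side sketch of this connection logic, written in TypeScript for a browser. The endpoint URLs, token handling, and message shape are illustrative assumptions, not part of the exercise:

    // Prefer WebSocket; fall back to Server-Sent Events where WebSocket is unavailable.
    interface NotificationMessage {
      id: string;
      type: string;   // e.g., "message", "alert", "promotion"
      title: string;
      body: string;
    }

    function connect(authToken: string, onNotify: (n: NotificationMessage) => void): void {
      if ("WebSocket" in window) {
        // Hypothetical endpoint; real deployments would route through the load balancer.
        const ws = new WebSocket(`wss://notifications.example.com/ws?token=${authToken}`);
        ws.onmessage = (event) => onNotify(JSON.parse(event.data));
        ws.onclose = () => setTimeout(() => connect(authToken, onNotify), 1000); // naive reconnect
      } else {
        // SSE fallback: server-to-client only, which is enough for receiving notifications.
        const sse = new EventSource(`https://notifications.example.com/sse?token=${authToken}`);
        sse.onmessage = (event) => onNotify(JSON.parse(event.data));
      }
    }

    connect("user-jwt", (n) => console.log(`[${n.type}] ${n.title}: ${n.body}`));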

2. How the backend would push notifications:

  • Notification Producer: Any service that needs to send a notification (e.g., a new order service, a social media service) would publish a notification event to the Message Queue. This decouples the producer from the notification delivery mechanism.
  • Notification Service (Consumer): This service subscribes to the Message Queue (a sketch of this flow follows the list). When it receives a notification event:
    • It persists the notification in the Database.
    • It looks up the active user connections (from Cache or Database).
    • It then pushes the notification to the relevant WebSocket/Long Polling Servers which are maintaining the client connections.
    • For mobile, it would send the notification payload to APNS/FCM.
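
A sketch of this consumer flow in TypeScript. The MessageQueue, NotificationStore, ConnectionRegistry, SocketServerClient, and MobilePushClient interfaces are hypothetical stand-ins for whichever broker, database, cache, and push clients are actually chosen:

    // Hypothetical interfaces standing in for real broker, database, cache, and push clients.
    interface NotificationEvent { userId: string; type: string; title: string; body: string; }
    interface MessageQueue { subscribe(topic: string, handler: (e: NotificationEvent) => Promise<void>): void; }
    interface NotificationStore { save(e: NotificationEvent): Promise<string>; }        // returns notification id
    interface ConnectionRegistry { serversFor(userId: string): Promise<string[]>; }     // WebSocket server ids
    interface SocketServerClient { push(serverId: string, userId: string, payload: object): Promise<void>; }
    interface MobilePushClient { send(userId: string, payload: object): Promise<void>; } // wraps APNS/FCM

    function startNotificationConsumer(
      queue: MessageQueue,
      store: NotificationStore,
      registry: ConnectionRegistry,
      sockets: SocketServerClient,
      mobilePush: MobilePushClient,
    ): void {
      queue.subscribe("notifications", async (event) => {
        // 1. Persist first so the notification survives even if real-time delivery fails.
        const id = await store.save(event);

        // 2. Find the WebSocket servers currently holding connections for this user.
        const serverIds = await registry.serversFor(event.userId);

        // 3. Push to each server with an active connection.
        await Promise.all(serverIds.map((serverId) => sockets.push(serverId, event.userId, { id, ...event })));

        // 4. No active connection: fall back to platform push (APNS/FCM).
        if (serverIds.length === 0) {
          await mobilePush.send(event.userId, { id, ...event });
        }
      });
    }

Persisting before delivery means a notification that misses a live connection can still be fetched from history later.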

3. How to handle millions of concurrent connections:

  • Horizontal Scaling of WebSocket/Long Polling Servers: Deploy multiple instances of these servers behind a Load Balancer (e.g., Nginx, HAProxy) that supports sticky sessions or WebSocket proxying. Each server can handle a large number of concurrent connections.
  • Distributed Connection Management: The WebSocket servers need a way to know which user is connected to which server. A Cache (like Redis) can store this mapping (e.g., userId -> serverId/connectionId). When the Notification Service needs to send a notification, it queries the cache to find the correct server(s) to send the message to (a sketch of this mapping follows the list).
  • Message Queue for Fan-out: A queue or pub/sub channel between the Notification Service and the WebSocket servers is critical for fanning out notifications efficiently to potentially millions of connected clients: the Notification Service publishes delivery messages, and each WebSocket server consumes the ones addressed to connections it holds. This internal fan-out topic can be separate from the ingress queue that notification producers publish to.
  • Stateless Application Servers: Keep the Notification Service and other backend services stateless to allow for easy horizontal scaling.
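
A sketch of the connection-mapping piece, assuming the ioredis client and a conn:<userId> key scheme (both assumptions for illustration):

    import Redis from "ioredis";

    // One Redis set per user, holding the ids of the WebSocket servers that
    // currently have a connection for that user.
    const redis = new Redis(); // defaults to localhost:6379
    const key = (userId: string) => `conn:${userId}`;

    // Called by a WebSocket server when a client connects or disconnects.
    export async function registerConnection(userId: string, serverId: string): Promise<void> {
      await redis.sadd(key(userId), serverId);
      await redis.expire(key(userId), 3600); // guard against leaked entries; refreshed by heartbeats
    }

    export async function unregisterConnection(userId: string, serverId: string): Promise<void> {
      await redis.srem(key(userId), serverId);
    }

    // Called by the Notification Service to decide where to route a user's notification.
    export async function serversFor(userId: string): Promise<string[]> {
      return redis.smembers(key(userId));
    }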

4. The data model for storing notifications:

  • Database Choice: A NoSQL database (e.g., MongoDB, Cassandra) might be suitable for its flexibility and scalability with high write/read volumes, especially for storing potentially large numbers of notifications per user. A relational database could also work, but might require more careful scaling.
  • Notification Schema (Example):
    Notification {
        id: UUID (Primary Key)
        userId: UUID (Index for quick lookup)
        type: String (e.g., "message", "alert", "promotion")
        title: String
        body: String
        timestamp: DateTime (Index for chronological order)
        isRead: Boolean (Index for unread count)
        link: String (Optional URL)
        metadata: JSON (Flexible field for additional data)
    }
  • User Connection Mapping (in Cache/Database):
    UserConnection {
        userId: UUID (Primary Key)
        connectionId: String (WebSocket connection ID)
        serverId: String (ID of the WebSocket server handling the connection)
        lastActive: DateTime
    }
  • Indexes: Crucial for efficient querying (e.g., userId and timestamp for retrieving a user’s notifications, isRead for unread counts). An example index definition and query sketch follows below.
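
If MongoDB were the chosen store, the indexes and typical queries might look like the sketch below (connection string, database, and collection names are illustrative):

    import { MongoClient } from "mongodb";

    async function example(): Promise<void> {
      const client = new MongoClient("mongodb://localhost:27017");
      await client.connect();
      const notifications = client.db("notifications").collection("notifications");

      // Compound index: a user's notifications, newest first, without a collection scan.
      await notifications.createIndex({ userId: 1, timestamp: -1 });
      // Partial index keeps the unread-count query cheap.
      await notifications.createIndex({ userId: 1 }, { partialFilterExpression: { isRead: false } });

      // Latest 20 notifications for a user.
      const recent = await notifications
        .find({ userId: "some-user-id" })
        .sort({ timestamp: -1 })
        .limit(20)
        .toArray();

      // Unread badge count.
      const unread = await notifications.countDocuments({ userId: "some-user-id", isRead: false });
      console.log(recent.length, unread);
    }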

Trade-offs:

  • Complexity vs. Scalability: This distributed architecture is more complex than a monolithic approach but offers significantly higher scalability and resilience.
  • Real-time vs. Cost: Maintaining persistent WebSocket connections can be resource-intensive, impacting cost. Mobile push notifications are more cost-effective for non-active users.
  • Consistency vs. Latency: For notifications, eventual consistency is often acceptable (a notification might arrive slightly delayed but will eventually arrive), allowing for lower latency and high availability.

This high-level design provides a foundation for building a robust and scalable real-time notification system.