System Design Part 1

A Beginner's Guide to System Design: Key Concepts for Scalable and Reliable Systems

Introduction

In addition to coding interviews, system design is a crucial component of the technical interview process at many tech companies. In this post, I will provide you with a basic understanding of common system design principles, including what they are, how they are used, and their pros and cons.

Let's dive in!

System design is the art and science of building large-scale systems like Google, Facebook, Amazon, and other applications that serve millions of users daily. It draws upon core computer science concepts such as networking, distributed systems, and database management to ensure these systems are scalable, reliable, and efficient.

When designing large-scale applications, distributed systems play a central role. Imagine an application accessed by millions of users: the server side has to be engineered carefully to stay reliable under that load. The main requirements are:

  • Redundancy: Multiple copies of each server must exist so that a single failure does not cause downtime.

  • Consistency: Information across servers must be synchronized to avoid conflicting data responses.

  • Load Balancing: Traffic must be evenly distributed to prevent server overload.

  • Performance: Optimized networks and databases are critical to ensure timely responses to user requests.

System design addresses these challenges by providing methodologies to build robust systems capable of meeting user demands while maintaining high availability, consistency, and performance.
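As a small illustration of the load balancing idea above, here is a minimal round-robin sketch. The class name `RoundRobinBalancer` and the server names are made up for this example; real load balancers also track server health and current load, which this sketch ignores.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation so each gets an equal share."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        # cycle() repeats the list forever: a, b, c, a, b, c, ...
        self._rotation = cycle(list(servers))

    def next_server(self):
        """Return the server that should handle the next request."""
        return next(self._rotation)


# Hypothetical server names, used only for illustration.
balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
assignments = [balancer.next_server() for _ in range(6)]
print(assignments)
```

With six requests and three servers, each server ends up with two requests, which is the "evenly distributed" behaviour described in the list.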

Performance vs scalability

A service is considered scalable if adding resources to the system increases its performance in proportion to the resources added. Increasing performance generally means serving more units of work, but it can also refer to handling larger units of work, such as when datasets grow.

Scalability measures a system's ability to handle increasing amounts of work or to grow without losing performance. If a system isn't scalable, adding users, data, or traffic might cause slowdowns or failures.

For example:

  • A non-scalable system might crash if 1,000 users try to access it simultaneously because it wasn't designed to handle high traffic.

  • A scalable system could handle millions of users without performance issues by using techniques like load balancing, caching, or database sharding.

Another way to differentiate performance and scalability:

  • Performance refers to how fast or efficient your system is for individual actions or users. If it’s slow for one user, you have a performance issue.

  • Scalability refers to how well your system can handle growth (more users or data). If the system works well for a few users but struggles under heavy load, it’s a scalability issue.
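One rough way to put a number on "proportional to the resources added" is to compare the throughput gain with the resource gain. The sketch below assumes a hypothetical helper, `scaling_efficiency`, and made-up measurements; a value close to 1.0 suggests near-linear scaling, while a much lower value hints at a bottleneck somewhere.

```python
def scaling_efficiency(base_throughput, base_servers, new_throughput, new_servers):
    """Ratio of throughput gain to resource gain (1.0 means perfectly linear)."""
    if min(base_throughput, base_servers, new_throughput, new_servers) <= 0:
        raise ValueError("all inputs must be positive")
    speedup = new_throughput / base_throughput
    resource_growth = new_servers / base_servers
    return speedup / resource_growth


# Made-up numbers: 1 server handles 1,000 requests/s, 4 servers handle 3,600.
good = scaling_efficiency(1000, 1, 3600, 4)

# Made-up numbers: 4 servers only reach 1,200 requests/s, e.g. because
# every request still waits on one shared database.
poor = scaling_efficiency(1000, 1, 1200, 4)

print(f"good scaling: {good:.2f}, poor scaling: {poor:.2f}")
```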

Latency vs throughput

Latency is the time required to perform some action or produce a result. It is measured in units of time—hours, minutes, seconds, nanoseconds, or clock periods.

Throughput refers to the number of actions executed or results produced per unit of time.

Example:

  1. Latency:

    • A food delivery app takes 30 minutes to deliver one order.

    • The latency for delivering one order is 30 minutes.

  2. Throughput:

    • The app can handle 100 deliveries in one hour.

    • The throughput is 100 orders per hour.
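The delivery example also shows that throughput is not simply one divided by latency. If orders were delivered one at a time, a 30-minute latency would allow only 2 orders per hour. Reaching 100 orders per hour means many deliveries are in progress at once. The sketch below uses Little's law (average items in progress = throughput × latency) to estimate how many; the variable names are my own.

```python
latency_hours = 0.5          # 30 minutes per order, from the example
throughput_per_hour = 100    # 100 orders per hour, from the example

# If only one order could be in progress at a time:
serial_throughput = 1 / latency_hours

# Little's law: average number of orders in progress at any moment.
orders_in_progress = throughput_per_hour * latency_hours

print(f"serial throughput: {serial_throughput:.0f} orders/hour")
print(f"orders in progress on average: {orders_in_progress:.0f}")
```

So the app in the example needs roughly 50 deliveries running in parallel. Adding more drivers raises throughput without making any single delivery faster, which is why the two metrics are tracked separately.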

Availability vs consistency

The CAP theorem describes a fundamental tradeoff in designing distributed systems. It states that a distributed system can guarantee at most two of the following three properties at the same time:

  1. Consistency (C): All users see the same data at the same time.

    • Consistency means that all the nodes (databases) in a network hold the same copy of a replicated data item, visible to all transactions. It guarantees that every node in a distributed cluster returns the most recent successful write. This ensures every client has the same view of the data.

For example:

  • A user spends 200 rupees, reducing their balance from 500 to 300.

  • If this change isn't updated everywhere, some databases might still show 500 rupees, which is incorrect.

This is a Consistency problem because not all parts of the system are showing the same, correct data.
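The balance example can be reproduced with a tiny simulation. This is only a sketch: the `Replica` class and `spend` function are invented for illustration, and real databases replicate through logs and network messages rather than direct assignment.

```python
class Replica:
    """One database node holding a copy of the user's balance."""

    def __init__(self, balance):
        self.balance = balance


def spend(amount, replicas):
    """Apply a purchase to every replica in the given list."""
    for replica in replicas:
        replica.balance -= amount


primary = Replica(500)
secondary = Replica(500)

# The write reaches only the primary; replication has not happened yet.
spend(200, [primary])
before_sync = (primary.balance, secondary.balance)

# Once the change is copied over, both nodes agree again.
secondary.balance = primary.balance
after_sync = (primary.balance, secondary.balance)

print("before replication:", before_sync)
print("after replication:", after_sync)
```

Between the write and the replication step, a client reading from the second node would still see 500 rupees, which is exactly the inconsistency described above.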

  2. Availability (A): The system is always ready to respond to requests.

    • Availability means that each read or write request for a data item will either be processed successfully or will receive a message that the operation cannot be completed.

For example:

  • User 1 has 1,000 subscribers.

  • User 2 wants to subscribe to User 1’s channel, but they connect to a different database node because they are geographically distant.

  • If the system prioritizes Availability (A), User 2’s action (subscribing) must succeed immediately, even if the update has not yet reached all database nodes.

This ensures that the system remains usable and responsive for everyone.

  3. Partition Tolerance (P): The system works even if there are network issues or parts of the system are disconnected.

    • Partition Tolerance refers to the ability of a distributed system to continue operating despite network partitions that prevent some nodes in the system from communicating with others.

Take the example of the same social media network, where two users are trying to find the subscriber count of a particular channel. Due to a network outage, User 2's connection to the second database node is lost. However, the system can still answer the request by using a replica of the data on the first node, copied before the outage occurred. Hence, the system is partition-tolerant.

Networks aren't reliable, so in practice you'll need to support partition tolerance. That leaves you with a tradeoff between consistency and availability.

CP - consistency and partition tolerance

In CP systems, waiting for a response from a partitioned node might result in a timeout error. CP is a good choice if your business needs require atomic reads and writes. For financial systems, prioritize Consistency and Partition Tolerance (CP).

AP - availability and partition tolerance

In AP systems, responses return the most readily available version of the data, which might not be the latest. Writes may take time to propagate when the partition is resolved. For real-time services, prioritize Availability and Partition Tolerance (AP).
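To contrast the two choices, here is a minimal sketch of how a single node might answer a read while it is cut off from its peers. The `Node` class, its `mode` flag, and `PartitionError` are all hypothetical names for this example; real systems make this decision through quorum and replication settings rather than a single switch.

```python
class PartitionError(Exception):
    """Raised when a CP node refuses to answer during a partition."""


class Node:
    def __init__(self, mode, subscribers):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.subscribers = subscribers
        self.partitioned = False  # True while cut off from its peers

    def read(self):
        # A CP node would rather fail than risk returning stale data.
        if self.partitioned and self.mode == "CP":
            raise PartitionError("cannot confirm the latest value")
        # An AP node always answers with whatever it currently holds.
        return self.subscribers


cp_node = Node("CP", subscribers=1000)
ap_node = Node("AP", subscribers=1000)

# A partition starts. Suppose a new subscription lands on an unreachable
# peer, so the true count is 1001 but neither node can learn that.
cp_node.partitioned = True
ap_node.partitioned = True

ap_result = ap_node.read()
try:
    cp_result = cp_node.read()
except PartitionError as err:
    cp_result = f"error: {err}"

print("AP:", ap_result)
print("CP:", cp_result)
```

The AP node stays responsive but returns a possibly outdated count, while the CP node returns an error until the partition heals. Which behaviour is acceptable depends on what a wrong answer costs in your application.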