Back of the Envelope Calculations

Back of the envelope calculation:
A quick and approximate calculation that gives us further insight. Presumably carried out on the back of an envelope or a napkin.

Why do we need it?

Sometimes we face a choice between alternative architectures. Is a single server sufficient? If not, how many servers would we need? The calculation gives us a rough estimate.

The calculation tells us

  • if the architecture can fulfill the requirements, for example the number of supported users or the response latency,
  • the resource requirements.

How to do it?

Identify a limited resource, then approximate the required amount. For example, each server is capped by its 2GHz CPU. Can we serve all requests using a single server?

How to approximate the required amount? We divide and conquer: break the usage down into its constituent factors, make a rough estimate of each factor, and combine them.

For example, we might expect to have 1K active users, each issuing 15 requests per day. That’s 15K requests per day, or 15K/86400 requests per second.

When combining the parts, a trick is to round aggressively. Division by 86400? No thanks. So let’s round to 20K/100K, which is 0.2 requests per second, or one request every 5 seconds. If we know that a single request roughly takes 0.7 seconds to serve, a single machine handling one request at a time is busy only about 15% of the time, so one machine is enough on average. Of course you don’t want to live on the edge: traffic is rarely spread evenly over the day, so leave some buffer for peaks.
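The same estimate can be written as a short script. This is a minimal sketch, assuming each machine serves one request at a time and that traffic is spread evenly over the day; the function name machines_needed and the headroom parameter are illustrative, not part of any library. Note that 15K/86400 is a rate (requests per second), so the machine count is that rate multiplied by the time needed to serve one request.

```python
import math

SECONDS_PER_DAY = 86_400


def machines_needed(users, requests_per_user_per_day, seconds_per_request, headroom=1.0):
    """Estimate machine count, assuming each machine serves one request at a time."""
    requests_per_second = users * requests_per_user_per_day / SECONDS_PER_DAY
    # Offered load: machine-seconds of work arriving per wall-clock second.
    load = requests_per_second * seconds_per_request * headroom
    return max(1, math.ceil(load))


exact_rate = 1_000 * 15 / SECONDS_PER_DAY  # about 0.17 requests per second
rough_rate = 20_000 / 100_000              # aggressive rounding: 0.2 requests per second
print(f"exact: {exact_rate:.2f} rps, rough: {rough_rate:.1f} rps")

# headroom=2.0 is an assumed safety factor for peaks, not a measured value.
print(machines_needed(1_000, 15, 0.7, headroom=2.0))
```

The rounded and the exact rate lead to the same conclusion, which is the point of rounding aggressively. Under the same assumptions, 1M users instead of 1K would need 122 machines before any headroom.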

On quick operations

Prefer small numbers with an abbreviation for the magnitude (or, if needed, an exponent) over writing out the full number: 10K instead of 10000.

If given a large and overly-precise number, for example 7432, instantly convert and write it down as 7K. You are approximating anyway.

Having numbers in this form makes multiplication and division fast. K*K is M. G/M is K. 4K*7M=28G. To work with larger numbers, round both of them towards a small multiple of a power of 2 or 10.

  • 27*14 ~ 30*10 = 300.
  • 6500/250 ~ 6400/256 ~ 100 * 2^6 / 2^8 ~ 100 / 2^2 = 25.
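To check the shortcuts above, here is a small sketch of magnitude arithmetic. The helpers parse, short and rough are hypothetical names made up for this illustration, and K, M, G, T are taken as decimal powers of 1000.

```python
import math

UNITS = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse(text):
    """Turn '4K' into 4000.0; plain numbers pass through unchanged."""
    text = text.strip().upper()
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def short(value):
    """Format a number with the largest suffix that keeps it at or above 1."""
    for suffix, unit in sorted(UNITS.items(), key=lambda item: -item[1]):
        if abs(value) >= unit:
            return f"{value / unit:g}{suffix}"
    return f"{value:g}"


def rough(value):
    """Round a positive number to one significant digit."""
    magnitude = 10 ** math.floor(math.log10(abs(value)))
    return round(value / magnitude) * magnitude


print(short(rough(7432)))                  # 7K
print(short(parse("1K") * parse("1K")))    # 1M
print(short(parse("1G") / parse("1M")))    # 1K
print(short(parse("4K") * parse("7M")))    # 28G

# How far off are the rounded results from the exact ones?
print(27 * 14, "vs rough", 30 * 10)        # 378 vs rough 300
print(6500 / 250, "vs rough", 6400 / 256)  # 26.0 vs rough 25.0
```

The rounded answers are within roughly 20% of the exact ones, which is usually good enough to decide between one machine and ten.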

Dimensions to approximate

Below are typical limited dimensions, each with an exercise.

Network bandwidth

Assuming 1Gbps link per machine, if we want to crawl 70TB of websites every day, how many machines would a crawler system need?
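One possible way to work through this exercise is sketched below. It assumes decimal units (1TB = 10^12 bytes), a link that stays fully saturated around the clock, and no protocol overhead. All three are simplifications, so treat the result as a lower bound.

```python
import math

SECONDS_PER_DAY = 86_400

link_bits_per_second = 1e9                        # 1Gbps link per machine
link_bytes_per_second = link_bits_per_second / 8  # 125MB per second
bytes_per_machine_per_day = link_bytes_per_second * SECONDS_PER_DAY  # 10.8TB

crawl_bytes_per_day = 70e12                       # 70TB of websites every day
machines = math.ceil(crawl_bytes_per_day / bytes_per_machine_per_day)
print(machines)  # 7

# The napkin version: 1Gbps ~ 100MB/s, a day ~ 100K seconds, so ~10TB per machine per day.
rough_machines = math.ceil(70e12 / (100e6 * 100e3))
print(rough_machines)  # 7
```

The exact and the napkin version agree on 7 machines, before adding any buffer for overhead or uneven load.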

Storage space

How much space would it take to store the contents of 100M web pages? What if we substitute each word by an integer index? How many machines with a 64GB SSD each would it take to hold them?

IO throughput

You store fetched web pages on mechanical hard drives, which have limited random access speed. Users issue 100 queries per second (qps), each request typically returning the content of 20 pages. How many hard drives would you need to keep request latency low?
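A possible approach to this exercise, as a sketch only. The figure of about 100 random reads per second per mechanical drive is an assumed ballpark (roughly 10 ms per seek), not a number from this article. The code also assumes that each page costs one random read and that drives should run at about 50% utilization to keep latency low.

```python
import math

queries_per_second = 100
pages_per_query = 20
random_reads_per_second = queries_per_second * pages_per_query  # 2K reads per second

# Assumptions, not measurements:
reads_per_drive_per_second = 100  # ~10 ms per random read on a mechanical drive
target_utilization = 0.5          # leave headroom so requests do not queue up

drives_at_full_load = math.ceil(random_reads_per_second / reads_per_drive_per_second)
drives_with_headroom = math.ceil(
    random_reads_per_second / (reads_per_drive_per_second * target_utilization)
)
print(drives_at_full_load, drives_with_headroom)  # 20 40
```

Swap in measured numbers for your own drives where you have them. The structure of the calculation stays the same.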

Engineering effort

You need to deliver a new feature. There are 5 programmers and 40 tasks. How many weeks until possible launch?

Money

A user pays $10 a month for your image store service, storing all their photos, each downsized to 3MB. During a month a user fetches 1K photos. Find the pricing page of your favorite cloud provider, and calculate the cost associated with each user. How much is your profit per user? Check for different assumed photo counts.

Others include CPU time, RAM size, latencies of various kinds (disk access, RAM access, network), thread count.

Where to start?

Enumerate typical use-cases of the system and determine the most critical resources they need. A document store will need lots of storage. Guesstimating document sizes and counts is a good start, but further details will depend on usage. How often are new documents added? Are the documents searchable? Do we need any indices? Is it a write-heavy or read-heavy store?

Sometimes different use-cases will need very different shapes of resources. For example, serving the documents might need lots of RAM but little CPU, while preprocessing new documents is the other way around. Hosting all those services on homogeneous machines would waste both CPU and RAM, since you would need machines that are maxed out on both dimensions.

Such differences indicate those features should be split to different services, hosted on independent sets of machines.

Real data

Outside of system design interviews, you can reach for actual data. Spend some time mining the monitoring dashboards to get the usual CPU usage. Perform a load test to measure peak RAM consumption. Run a SQL query to get the average number of photos stored per user.

It doesn’t need to be an either-or. Complement assumptions with data-backed facts if needed.

Useful resources

A summary of numbers every engineer should know. Or at least know where to look up ;)