Hard drives in the cloud

By Will Braynen

Whether as a backend for mobile apps or for something else, Amazon Web Services (AWS) is an exciting platform.  And yet hard drives (AWS's "EBS volumes") have a soporific effect on me, even when they are networked drives coupled with virtual machines that compute things ("EC2 instances").  Even when they can be attached, detached and reattached at will, persisting data from one EC2 instance to the next.  Still.  Perhaps it's because I am a software guy.  Basically, I find it difficult to get excited about the characteristics of EBS volumes.  And when I look at the documentation, I can't help but see just that: a mapping from EBS volume types to their performance metrics.  For me, this gets things backwards.  What I want instead is a mapping from performance metrics to EBS volume types: a lookup table from my needs to what Amazon has to offer, not the other way around.  The Amazon-specific names of pieces of networked storage are the last thing I want to see and think about.  It's true: I do need to know what configuration choices to make.  But the first thing on my mind should be my or my client's needs (which also leaves open the question of whether any current Amazon offering meets those needs).

So, to help me flip my thinking, I made a simple calculator.  This calculator takes the documentation and helps me think about it in reverse.  Instead of saying, "Let me choose the type of EBS volume first and then see how much throughput I will get", this calculator asks, "How much throughput do you need?" (measured in IOPS, or Input/Output Operations Per Second, pronounced "eye-ops") and then recommends a volume type to consider first.  Try it out:

Which of the following describes your workload better?
  Frequent read/write operations with small I/O size, with non-sequential reads not unlikely.
  Large streaming workloads (e.g. streaming video).

(It's two steps: first, above, select a description of your workload and then enter your performance target.)  This calculator uses the following subset of the docs, reformatted "backwards" for this post:

Workload | Maximum IOPS per volume | IOPS based on I/O size | Maximum throughput per volume (MiB/second) | Volume name (API name)
--- | --- | --- | --- | ---
Streaming | 250 | 1 MiB | 250 | Cold HDD (sc1)
Streaming | 500 | 1 MiB | 500 | Throughput-Optimized HDD (st1)
Transactional | 10,000 | 16 KiB | 160 | General-Purpose SSD (gp2)
Transactional | 20,000 | 16 KiB | 320 | Provisioned IOPS SSD (io1)
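
For the curious, here is roughly what such a reverse lookup could look like in Python.  This is a minimal sketch, not the calculator's actual code: the names `Volume`, `VOLUMES` and `recommend` are mine, and I am assuming that the rows are ordered from least to most capable within each workload, so that the first row that meets the target is the one to consider first.

```python
from typing import NamedTuple, Optional


class Volume(NamedTuple):
    workload: str    # "streaming" or "transactional"
    max_iops: int    # maximum IOPS per volume, from the table above
    api_name: str    # Amazon's API name for the volume type


# The rows of the table above.  Assumption: within each workload they are
# listed from least to most capable.
VOLUMES = [
    Volume("streaming", 250, "sc1"),
    Volume("streaming", 500, "st1"),
    Volume("transactional", 10_000, "gp2"),
    Volume("transactional", 20_000, "io1"),
]


def recommend(workload: str, target_iops: int) -> Optional[Volume]:
    """Return the first listed volume type for this workload whose
    maximum IOPS covers the target, or None if no row does."""
    for volume in VOLUMES:
        if volume.workload == workload and target_iops <= volume.max_iops:
            return volume
    return None
```

Calling `recommend("transactional", 15_000)` walks past gp2 and lands on io1, while a target above 20,000 IOPS returns `None`, which is the "maybe Amazon doesn't offer what I need" case.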

The calculator also takes into account that, according to the same docs, "[b]etween a minimum of 100 IOPS (at 33.33 GiB and below) and a maximum of 10,000 IOPS (at 3,334 GiB and above), baseline performance scales linearly at 3 IOPS per GiB of volume size."
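
That sentence from the docs is really just a clamp, and it can be run backwards too: given a target, how big does the volume need to be?  A minimal sketch of both directions follows.  The function names are mine, and returning 1 GiB for small targets assumes that 1 GiB is the smallest volume you can ask for.

```python
from typing import Optional

GP2_IOPS_PER_GIB = 3    # baseline scales at 3 IOPS per GiB
GP2_MIN_IOPS = 100      # floor, per the quoted docs
GP2_MAX_IOPS = 10_000   # ceiling, per the quoted docs


def gp2_baseline_iops(size_gib: int) -> int:
    """Docs direction: volume size in, baseline IOPS out."""
    scaled = GP2_IOPS_PER_GIB * size_gib
    return min(max(scaled, GP2_MIN_IOPS), GP2_MAX_IOPS)


def gp2_min_size_gib(target_iops: int) -> Optional[int]:
    """Reverse direction: target IOPS in, smallest whole-GiB size out.
    Returns None when the target exceeds what the baseline can reach."""
    if target_iops > GP2_MAX_IOPS:
        return None
    if target_iops <= GP2_MIN_IOPS:
        return 1  # assumption: 1 GiB is the smallest size on offer
    # Integer ceiling of target_iops / 3.
    return (target_iops + GP2_IOPS_PER_GIB - 1) // GP2_IOPS_PER_GIB
```

As a sanity check, `gp2_min_size_gib(10_000)` gives 3334, which matches the 3,334 GiB figure in the quote (3 × 3,334 is 10,002, which the ceiling clips to 10,000).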

Beyond that, the full picture is a little more complicated, so the above is only meant as a starting point or a helpful simplification.  For one thing, the calculator is fiscally conservative.  To boot, there is the "bursting" behavior.  Three of the current-generation volumes (gp2, st1 and sc1) can burst their IOPS or throughput, and the duration of this turbo boost depends on how long your drive has had to rest (that is, how much time it has had to accrue "credits").  But that's probably too fine-grained for our purposes here.  Also, I left out "EBS Magnetic (standard)" volumes because those are being phased out.

And of course throughput and IOPS are not all you should care about when choosing networked storage.  Be it a solid-state drive (SSD) or an old-school hard disk drive (HDD), you should care about at least three benchmarks: throughput or IOPS (the how much or how many, per second, depending on the nature of your workload), latency (the how long, in milliseconds), and the size of the drive (in GiB to TiB).  My point isn't that throughput or IOPS is the most important thing or even the first thing you should think about when choosing a networked drive in the cloud.

Rather, my point is that the first thing I want to think about is needs, not seller-specific drive names.  That's because my client's needs are not a function of Amazon's offerings.  The type of drive I should choose is a function of my or my client's needs, not the other way around.

NB: GiB and TiB are not necessarily the same as GB and TB.  The binary units sound like a very appropriate choice, as they get us into the mindset of thinking about data transmission (i.e. networked storage) rather than data storage simpliciter.
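
To make the difference concrete: the binary units count in powers of two and the decimal ones in powers of ten.  A small sketch of the conversion (the constant and function names are mine):

```python
GIB = 2**30   # bytes in one gibibyte
TIB = 2**40   # bytes in one tebibyte
GB = 10**9    # bytes in one gigabyte
TB = 10**12   # bytes in one terabyte


def gib_to_gb(gib: float) -> float:
    """Convert a size in GiB to the same number of bytes expressed in GB."""
    return gib * GIB / GB


def tib_to_tb(tib: float) -> float:
    """Convert a size in TiB to the same number of bytes expressed in TB."""
    return tib * TIB / TB
```

So one GiB is about 7% bigger than one GB, and one TiB is about 10% bigger than one TB.  The gap widens as the units grow.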