AWS doesn't make sense for scientific computing

Generic cloud computing like AWS isn't cost-effective for most scientific computing. In 2022, it's crazy to build your own server infrastructure from the start for an app, but it actually makes sense to build your own scientific computing infrastructure if you're a university, NASA, or a growing bio company.

Scientific computing has a completely different usage profile than modern apps. Scientists need powerful computers and massive data transfers, both of which can run on relatively simple infrastructure. Most cloud computing has infrastructural complexity that isn't necessary for scientific computing, and that complexity comes at a cost.

Scientific computing has different needs than web and mobile apps

Scientific computing is fundamentally different. It has different availability requirements, networking needs, load profiles, and latency tolerance.

Availability requirements of scientific computing are different. Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers – that's the API equivalent of storing your inbound API requests, and then responding to them months later. If the supercomputer is down for a day, most requests in the queue don't see anything different. Since some downtime is acceptable, you can skip redundant networking, power, backup generators, over-provisioned cooling, and a lot of what adds cost to something like AWS.

Networking needs are also different: supercomputers rarely communicate with the public internet, but when they do it's for large chunks of data. The typical flow includes one large transfer of data before the run and one after the run. In the middle, there's almost no network traffic.

The load profile is much more plannable. You can keep your servers at 100% utilization by maintaining a queue of requested jobs. This slows down results, but your researchers aren't leaving to go to another supercomputer – they just wait longer. Base costs are significantly lower for on-prem infrastructure, because you don't need to use more servers to meet peak demand.
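As a rough illustration of why queueing lowers base costs, the sketch below compares how many servers you need to serve every request immediately against how many you need if work can wait in a queue. The hourly demand numbers are invented for the example, and the queue case assumes the backlog never fully drains.

```python
import math

# Hypothetical demand: server-hours of work arriving in each hour of one day.
# These numbers are made up for illustration; they are not measurements.
demand = [20, 10, 5, 5, 10, 40, 80, 120, 150, 160, 140, 100,
          90, 110, 130, 150, 120, 80, 60, 40, 30, 30, 20, 20]

hours = len(demand)
total_work = sum(demand)

# App-style provisioning: requests are served as they arrive,
# so capacity has to cover the busiest hour.
peak_servers = max(demand)

# Queue-style provisioning: jobs can wait, so capacity only has to
# cover the average arrival rate (rounded up to a whole server).
queue_servers = math.ceil(total_work / hours)

# Average utilization is work done divided by capacity available.
# For the queue case this assumes there is always a job waiting.
peak_utilization = total_work / (peak_servers * hours)
queue_utilization = total_work / (queue_servers * hours)

print(f"peak provisioning:  {peak_servers} servers, {peak_utilization:.0%} utilized")
print(f"queue provisioning: {queue_servers} servers, {queue_utilization:.0%} utilized")
```

With this made-up demand curve, the queue-based cluster needs fewer than half the servers of the peak-provisioned one, and the price is paid in waiting time rather than hardware.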

Scientific computing has much higher latency tolerance. Milliseconds of latency affect user experience and conversion rates in the browser. For scientific computing, you can have minutes of latency and nobody will notice. Faster scientific computing jobs take hours, and the longer ones take months.

These requirements are different from an average app. Cloud computing pricing reflects the needs of the average app, not scientific computing.

It's 10x cheaper to build your own infrastructure than use AWS on-demand instances

Most modern supercomputers are a cluster of smaller servers. Running a modern AMD-based server that has 48 cores, at least 192 GB of RAM, and no included disk space costs:

  • ~$2670.36/mo for a c5a.24xlarge AWS on-demand instance
  • ~$1014.70/mo for a c5a.24xlarge AWS reserved instance on a three-year term, paid upfront
  • ~$558.65/mo on OVH Cloud[1]
  • ~$512.92/mo on Hetzner[2]
  • ~$200/mo on your own infrastructure as a large institution[3]

That means it's over 10x more expensive to use AWS on-demand instances than your own infrastructure, and still 5x more expensive to use AWS reserved instances. OVH Cloud and Hetzner come closer, but are still roughly 2.5x to 2.8x more expensive. Even 2.5x over building your own infrastructure is significant for a $50M/yr supercomputer.
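The arithmetic behind these ratios can be reproduced with a short sketch. It uses the monthly prices from the list above and the own-infrastructure estimate from footnote 3 (power plus amortized hardware, doubled for overhead). The variable names are only for illustration.

```python
# Monthly prices quoted in the list above, in USD per server.
monthly_cost = {
    "AWS on-demand": 2670.36,
    "AWS reserved (3-year, upfront)": 1014.70,
    "OVH Cloud": 558.65,
    "Hetzner": 512.92,
}

# Own-infrastructure estimate, following footnote 3.
power_per_month = 33.24      # raw electricity for one server
hardware_per_month = 67.08   # hardware amortized over five years
overhead_multiplier = 2      # other hardware, cooling, space, sysadmin share

own_infra = (power_per_month + hardware_per_month) * overhead_multiplier

# How many times more expensive each option is than running it yourself.
ratios = {name: cost / own_infra for name, cost in monthly_cost.items()}

print(f"own infrastructure: ${own_infra:.2f}/mo")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Without rounding the own-infrastructure figure down to $200, the multiples come out to roughly 13.3x, 5.1x, 2.8x, and 2.6x.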

An aside: data egress on AWS

AWS charges for data egress, in addition to base server costs. Their data egress charges aren't tied to their COGS, unlike other core services like EC2. Instead, they charge over 30x their underlying cost[4].

AWS realized that large successful web apps have high data egress[5]. Those customers can afford extra bandwidth charges and will struggle to migrate off AWS, which makes them an easy way to increase margins. In contrast, a small scientific computing project can transfer more data than large successful apps: a month-long DNA sequencing project can generate 90 TB of data[6]. At $0.09 per gigabyte, that costs more than $8k to transfer out of AWS.

AWS, Azure, and GCP all have similarly exorbitant bandwidth charges. Hetzner doesn't meter, and OVH Cloud includes 20 TB/mo. Transferring data from your in-house data generation (DNA sequencers, telescopes, etc.) is just the cost of a cable.
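The egress figures in this section are simple enough to check by hand, but here is a minimal sketch of the calculation. It assumes a flat $0.09 per gigabyte (ignoring any volume discounts), decimal units (1 TB = 1000 GB), a 30-day month, and reads "every couple days" in footnote 6 as every two days.

```python
GB_PER_TB = 1000            # decimal units, as an assumption
EGRESS_PRICE_PER_GB = 0.09  # flat rate quoted in the text; tiers are ignored
DAYS_PER_MONTH = 30         # assumed month length

def egress_cost(terabytes, price_per_gb=EGRESS_PRICE_PER_GB):
    """Cost in USD to transfer the given number of terabytes out."""
    return terabytes * GB_PER_TB * price_per_gb

# Footnote 6: a sequencer producing 6 TB every two days for a month.
sequencing_tb = 6 * (DAYS_PER_MONTH / 2)
sequencing_cost = egress_cost(sequencing_tb)

# Footnote 4: a 1 Gbps link, fully utilized for a month.
link_gbps = 1
seconds_per_month = DAYS_PER_MONTH * 24 * 3600
link_tb = link_gbps / 8 * seconds_per_month / GB_PER_TB  # bits to bytes, GB to TB
link_cost_on_aws = egress_cost(link_tb)

# Markup against the "at most $1k/mo" data center price in footnote 4.
data_center_link_cost = 1000
markup = link_cost_on_aws / data_center_link_cost

print(f"sequencing run: {sequencing_tb:.0f} TB, ${sequencing_cost:,.0f} in egress")
print(f"1 Gbps link: {link_tb:.0f} TB/mo, ${link_cost_on_aws:,.0f} at the flat rate")
print(f"markup over a $1k/mo link: {markup:.0f}x")
```

At the $1k/mo upper bound the markup works out to roughly 29x, and it grows if the link actually costs less than that.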

I still use AWS for scientific computing

There are benefits to AWS, which is why I use it for computational biology. You don't have to configure or maintain the underlying physical infrastructure, and on-demand scalability is unmatched[7].

Still, generic cloud computing like AWS isn't cost-effective for most scientific computing. AWS is a great place to start, but custom infrastructure quickly becomes the rational choice – much earlier for scientific computing than for an app.

Thank you Cameron Ferguson, Paul Butler, and Bryce Cai for reading drafts of this.

  1. Slightly different CPU, and 256 GB of RAM. 0 GB of disk. 1 Gb/s unmetered public bandwidth. 24-month agreement to save 15%. ↩︎

  2. Includes 960 GB of disk. 20 TB of metered traffic. No long-term agreement. ↩︎

  3. Assumes an AMD EPYC 7552 run at 100% load in Boston with high electricity prices of $0.23/kWh, for $33.24/mo in raw power. Hardware is amortized over five years, for an average monthly price of $67.08/mo. We assume that your large institution already has 24/7 security and public internet bandwidth, but multiply base hardware and power costs by 2x to account for other hardware, cooling, physical space, and a half-a-$120k-sysadmin amortized across 100 servers. ↩︎

  4. A 1 Gbps link at a data center – or 324 TB/mo in data transfer if fully utilized – usually costs at most $1k/mo. AWS charges ~$30k/mo for 324 TB of data egress. ↩︎

  5. I'm not positive that this is the true reason. If you know of another reason, let me know! ↩︎

  6. An Illumina NovaSeq with a dual S4 flow cell produces 6 TB of data every couple days. ↩︎

  7. This is how I built Toolchest, which runs computational biology software on AWS. Some software needs clusters with thousands of vCPUs, and we can spawn them in minutes with AWS. ↩︎