AWS has a low elasticity ceiling for big servers
As of 2022, the elasticity that cloud computing is known for doesn't apply to big servers on AWS. I didn't know this when I started building scientific computing pipelines on AWS, and as a result we based our architecture on faulty availability assumptions.
I think this is because of the dynamics of filling virtualized servers: if you can't move running VMs between host servers, then over time the majority of your hosts become partially full with long-lived VMs. That leaves little room for new VMs that need the majority of a host's resources.
AWS hasn't had much available capacity for high-memory or GPU instances lately, even though it's flush with smaller instances. There are often zero on-demand high-memory servers in all of us-east-1, which isn't quite the virtually unlimited compute capacity that AWS advertises for HPC.
For web apps, that's usually fine: you don't need a lot of RAM or vCPUs to run an API, and it's easy to scale out horizontally. But for use cases with a high minimum server size – very common in scientific computing – this becomes a problem.
I regularly can't launch a single on-demand c5a.24xlarge – an instance with 48 cores and 192 GB of RAM – in any availability zone in us-east-1. We built some servers of our own to have guaranteed and highly utilized baseline capacity, which has led to the weird case where a single one of our on-prem servers sometimes has more available on-demand single-node RAM capacity than all of AWS' us-east-1 region.
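The only reliable way I've found to discover where capacity actually exists is to try launching. Below is a minimal sketch of that probe, assuming a boto3-style EC2 client: `RunInstances` and the `InsufficientInstanceCapacity` error code are real AWS API details, but the function itself and its duck-typed error handling are my own illustration, not an official recipe.

```python
# Probe each availability zone by attempting a real launch and catching
# the capacity error. In practice `ec2` is a boto3 EC2 client and the
# exception is botocore's ClientError; the handling below is duck-typed
# to keep the sketch self-contained.
def find_zone_with_capacity(ec2, instance_type, zones, image_id):
    """Return the first zone where a launch succeeds, terminating the
    probe instance immediately, or None if every zone is full."""
    for zone in zones:
        try:
            resp = ec2.run_instances(
                ImageId=image_id,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
                Placement={"AvailabilityZone": zone},
            )
        except Exception as err:  # botocore.exceptions.ClientError in practice
            code = getattr(err, "response", {}).get("Error", {}).get("Code", "")
            if code == "InsufficientInstanceCapacity":
                continue  # this zone is full; try the next one
            raise
        instance_id = resp["Instances"][0]["InstanceId"]
        ec2.terminate_instances(InstanceIds=[instance_id])  # just probing
        return zone
    return None
```

This is a blunt instrument – you briefly pay for a probe instance when a zone does have capacity – but it answers a question AWS gives you no direct API for.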
Our AWS contact suggested capacity reservations. A capacity reservation is like a placeholder instance: you pay the hourly cost of a running on-demand instance, and when you need an actual instance your request has priority. For us, paying for capacity reservations defeats the cost advantage of rapidly scaling on the cloud – and they don't always work: if there's insufficient capacity, the request to create the reservation itself fails.
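Here's roughly what that failure mode looks like in code. `CreateCapacityReservation` and its parameters are the real EC2 API; the wrapper function and its duck-typed error handling are a sketch of mine, assuming a boto3-style client.

```python
def try_reserve(ec2, instance_type, zone, count):
    """Attempt to reserve on-demand capacity. Returns the reservation id,
    or None if the zone can't satisfy it: when capacity is insufficient,
    the CreateCapacityReservation call itself fails."""
    try:
        resp = ec2.create_capacity_reservation(
            InstanceType=instance_type,
            InstancePlatform="Linux/UNIX",
            AvailabilityZone=zone,
            InstanceCount=count,
        )
    except Exception as err:  # botocore.exceptions.ClientError in practice
        code = getattr(err, "response", {}).get("Error", {}).get("Code", "")
        if code == "InsufficientInstanceCapacity":
            return None  # no capacity means no reservation, either
        raise
    return resp["CapacityReservation"]["CapacityReservationId"]
```

In other words: the tool AWS offers for guaranteeing capacity is subject to the same capacity shortage it's meant to insure against.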
An aside: availability in other regions and transferring data
Sometimes there's better availability in another region, but there's no clear way to find that availability on AWS other than trying to spawn instances. Neither my VC-backed company nor the bio startup I used to work at – which spent over $1M on AWS – received any official AWS guidance on where to deploy to get the resources we needed.
If you've found availability elsewhere, you're not in the clear yet: you need to pay for data transfer between regions. Data transfer fees get expensive very quickly!
Spot instance availability
Spot instances – cheaper servers that can be shut off by AWS with a short warning – are hailed as a way to reduce costs. When people estimate spot instance pricing, they usually look at a graph that looks like this:
and they choose the lowest-priced availability zone. For big servers, the cheapest zone usually has no availability: I can't launch the instance type I want as a spot instance in any availability zone in us-east-1. Maybe another instance type would work, but as of this month the same is true in every us-east-1 availability zone for the half-sized c5a.12xlarge, and for c5.24xlarge, c6a.24xlarge, r6a.24xlarge, and r5a.24xlarge.
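To make the trap concrete, here's a sketch of the usual price-based zone picker, operating on the `SpotPriceHistory` list that EC2's `DescribeSpotPriceHistory` returns. The picker function is my illustration – and the zone it puts first is exactly the one most likely to have no capacity.

```python
def zones_by_spot_price(history):
    """Order availability zones by their most recent spot price, cheapest
    first. `history` is the SpotPriceHistory list of dicts returned by
    EC2's describe_spot_price_history."""
    latest = {}
    for entry in history:
        zone = entry["AvailabilityZone"]
        # Keep only the newest price per zone, whatever order the API
        # returned entries in.
        if zone not in latest or entry["Timestamp"] > latest[zone]["Timestamp"]:
            latest[zone] = entry
    # The cheapest zone is where every price-chasing bidder piles in,
    # so for big instance types it's often the zone with zero capacity.
    return sorted(latest, key=lambda z: float(latest[z]["SpotPrice"]))
```

Price history tells you what a spot instance would cost, not whether you can get one; the two questions need separate answers.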
High memory spot instances are very hard to find, and especially difficult to use as the backing for a high-scale pipeline.
AWS is still valuable for big servers, but takes extra engineering
There are ways around this: keeping tasks waiting in a queue, using AWS Batch, or restructuring the pipeline to use smaller instances. It can still be worth it, but it takes a significant amount of engineering effort and runs counter to the infinite capacity that many (myself formerly included!) envision.
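One shape that extra engineering takes is treating the instance type and zone as a search space rather than a fixed choice. A sketch of that idea – `launch` here is a hypothetical callable of your own that returns an instance id on success and None on a capacity failure:

```python
import itertools
import time


def launch_with_fallback(launch, instance_types, zones, attempts=3, wait_s=30):
    """Sweep through acceptable instance types and zones in preference
    order, retrying the whole sweep a few times before giving up.
    Returns (instance_type, zone, instance_id), or None on failure."""
    for attempt in range(attempts):
        for itype, zone in itertools.product(instance_types, zones):
            instance_id = launch(itype, zone)
            if instance_id is not None:
                return itype, zone, instance_id
        if attempt < attempts - 1:
            # Availability fluctuates constantly; wait and re-sweep.
            time.sleep(wait_s)
    return None
```

The preference-ordered sweep matters: you take the instance type you actually want when it's available, and degrade to an acceptable substitute instead of failing the pipeline.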
Thank you Paul Butler for reading drafts of this.
By "big" I mean >100 GB of RAM – generally, instances that take up more than half the resources of their host server. ↩︎
Is this true? If you work on EC2, please enlighten me! ↩︎
us-east-1b, us-east-1c, and us-east-1d all fail. us-east-1e doesn't support this type of instance. This is highly variable; availability fluctuates constantly. ↩︎