
One main problem we observed was that big parts of their IAM / auth setup were overloaded or down, which led to all kinds of cascading problems. It sounds as if Dynamo was reported to be a root cause, so is IAM dependent on Dynamo internally?

Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potentially (global) SPOF that you'd want to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-tested DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?



There was a very large outage back in ~2017 that was caused by DynamoDB going down. Because EC2 stored its list of servers in DynamoDB, EC2 went down too. Because DynamoDB ran its compute on EC2, it was suddenly no longer able to spin up new instances to recover.

It took several days to manually spin up DynamoDB/EC2 instances so that both services could recover slowly together. Since then, there was a big push to remove dependencies between the “tier one” systems (S3, DynamoDB, EC2, etc.) so that one system couldn’t bring down another one. Of course, it’s never foolproof.


I don't remember an event like that, but I'm rather certain the scenario you described couldn't have happened in 2017.

The very large 2017 AWS outage originated in S3. Maybe you're thinking about a different event?

https://share.google/HBaV4ZMpxPEpnDvU9


Sorry, the 2015 one. I misremembered the year.

https://aws.amazon.com/message/5467D2/

I imagine this was impossible in 2017 because of actions taken after the 2015 incident


Definitely impossible in 2015.

If you're talking about this part:

> Initially, we were unable to add capacity to the metadata service because it was under such high load, preventing us from successfully making the requisite administrative requests.

It isn't about spinning up EC2 instances or provisioning hardware. It is about logically adding the capacity to the system. The metadata service is a storage service, so adding capacity necessitates data movement. There are a lot of things that need to happen to add capacity while maintaining data correctness and availability (mind you, at that point it was still trying to fulfill all requests).


I’m referring to impact on other services


When I worked at AWS several years ago, IAM was not dependent on Dynamo. It might have changed, but I highly doubt this. Maybe some kind of network issue with high-traffic services?

> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.

IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized; if you ever wondered why the signature key derivation has so many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
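For reference, the multi-step derivation being alluded to scopes the signing key to a date, a region, and a service, which is what lets signature verification happen regionally without shipping the long-term secret around. A minimal sketch of the standard SigV4 key derivation (the secret and scope values below are placeholders):

    import hashlib
    import hmac

    def derive_sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
        """Standard SigV4 derivation: each HMAC step narrows the key's scope,
        so a derived key is only valid for one date, one region, one service."""
        k_date = hmac.new(("AWS4" + secret_key).encode(), date.encode(), hashlib.sha256).digest()
        k_region = hmac.new(k_date, region.encode(), hashlib.sha256).digest()
        k_service = hmac.new(k_region, service.encode(), hashlib.sha256).digest()
        return hmac.new(k_service, b"aws4_request", hashlib.sha256).digest()

    # e.g. a key scoped to 2025-10-20 / us-east-1 / dynamodb
    key = derive_sigv4_signing_key("wJalr...EXAMPLEKEY", "20251020", "us-east-1", "dynamodb")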


Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.


Which is interesting because per their health dashboard,

> We recommend customers continue to retry any failed requests.


They should continue to retry but with exponential backoff and jitter. Not in a busy loop!
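Something like this is the minimum clients should be doing; a sketch of capped exponential backoff with full jitter (names and limits are illustrative, not any particular SDK's defaults):

    import random
    import time

    def call_with_backoff(operation, max_attempts=8, base=0.1, cap=20.0):
        """Retry a failing call with capped exponential backoff and full jitter,
        so a fleet of clients retrying at once doesn't synchronize into a storm."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Full jitter: sleep a random amount up to the capped exponential delay.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))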


If the reliability of your system depends upon the competence of your customers then it isn't very reliable.


Have you ever built a service designed to operate at planetary scale? One that's built of hundreds or thousands of smaller service components?

There's no such thing as infinite scalability. Even the most elastic services are not infinitely elastic. When resources are short, you either have to rely on your customers to retry nicely, or you have to shed load during overload scenarios to protect goodput (which will deny service to some). For a high demand service, overload is most likely during the first few hours after recovery.

See e.g., https://d1.awsstatic.com/builderslibrary/pdfs/Resilience-les...
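Concretely, "shedding load to protect goodput" usually means capping in-flight work and rejecting the excess immediately instead of queueing it. A minimal, illustrative sketch (not how any AWS service actually does it):

    import threading

    class AdmissionController:
        """Cap concurrent in-flight requests; reject the excess right away so the
        requests that are admitted still finish within their deadlines (goodput)."""
        def __init__(self, max_in_flight: int):
            self._slots = threading.BoundedSemaphore(max_in_flight)

        def try_handle(self, handler):
            if not self._slots.acquire(blocking=False):
                return None  # shed: the caller should turn this into a fast 503
            try:
                return handler()
            finally:
                self._slots.release()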


Probably stupid question (I am not a network/infra engineer) - can you not simply rate limit requests (by IP or some other method)?

Yes, your customers may well implement stupidly aggressive retries, but that shouldn't break your stuff; they should just start getting 429s?


Load shedding effectively does that. 503 is the correct error code here to indicate temporary failure; 429 means you've exhausted a quota.
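A toy sketch of that distinction (the gate below is illustrative, not a real AWS mechanism): reject a client that has burned through its own quota with 429, and reject requests with 503 only when the service as a whole is over capacity:

    class RequestGate:
        """Illustrative admission gate: 429 for a client over its own quota,
        503 when the whole service is shedding load."""
        def __init__(self, max_in_flight: int, per_client_quota: int):
            self.max_in_flight = max_in_flight
            self.per_client_quota = per_client_quota
            self.in_flight = 0
            self.client_counts = {}  # client_id -> requests in the current window

        def admit(self, client_id: str) -> int:
            if self.client_counts.get(client_id, 0) >= self.per_client_quota:
                return 429  # this client exhausted its quota; slowing down won't help anyone else
            if self.in_flight >= self.max_in_flight:
                return 503  # temporary overload; retry later with backoff
            self.client_counts[client_id] = self.client_counts.get(client_id, 0) + 1
            self.in_flight += 1
            return 200

        def finish(self):
            self.in_flight -= 1  # call when an admitted request completes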


You can't exactly change existing widespread practice so that clients are ready for that kind of handling.


I think Amazon uses an internal platform called Dynamo as a KV store. It's different from DynamoDB, so I'm thinking the outage could be either a DNS routing issue or some kind of node deployment problem.

Both of which seem to crop up in post-mortems for these widespread outages.


They said the root cause was DNS for DynamoDB. Inside AWS, relying on DynamoDB is highly encouraged, so it's not surprising that a failure there would cascade broadly. The fact that EC2 instance launching is affected is surprising, though. Loops in the service dependency graph are known to be a bad idea.
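The dependency-loop point is easy to check mechanically if you have a declared dependency map; a toy cycle detector (the graph is made up for illustration, not AWS's real topology):

    def find_cycle(deps):
        """Depth-first search over a service dependency map; returns one cycle
        as a list of service names, or None if the graph is acyclic."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {}
        stack = []

        def visit(node):
            color[node] = GRAY
            stack.append(node)
            for dep in deps.get(node, ()):
                if color.get(dep, WHITE) == GRAY:          # back edge -> cycle
                    return stack[stack.index(dep):] + [dep]
                if color.get(dep, WHITE) == WHITE:
                    found = visit(dep)
                    if found:
                        return found
            stack.pop()
            color[node] = BLACK
            return None

        for node in list(deps):
            if color.get(node, WHITE) == WHITE:
                found = visit(node)
                if found:
                    return found
        return None

    print(find_cycle({
        "ec2-launch": ["dynamodb"],
        "dynamodb": ["dns"],
        "dns": ["ec2-launch"],   # closes the loop
    }))
    # -> ['ec2-launch', 'dynamodb', 'dns', 'ec2-launch']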


It's not a direct dependency. Route 53 is humming along... DynamoDB decided to edit its DNS records that are propagated by Route 53... they were bogus, but Route 53 happily propagated the toxic change to the rest of the universe.

DynamoDB is not going to set up its own DNS service or its own Route 53.

Maybe DynamoDB should have had tooling that tested DNS edits before sending them to Route 53, or Route 53 should have tooling to validate changes before accepting them. I'm sure smart people at AWS are yelling at each other about it right now.
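Such tooling could be as simple as a pre-flight gate that refuses to publish an empty or unreachable record set. A hypothetical sketch, not AWS's actual Route 53 or DynamoDB tooling:

    import ipaddress
    import socket

    def preflight_dns_change(record_name: str, new_a_values: list, port: int = 443) -> None:
        """Hypothetical pre-flight check for a DNS edit: refuse to publish a change
        that would leave the name empty or pointing at endpoints that don't answer."""
        if not new_a_values:
            raise ValueError("refusing to publish an empty record set for " + record_name)
        for value in new_a_values:
            ipaddress.ip_address(value)  # raises if the value isn't a well-formed IP
            with socket.create_connection((value, port), timeout=2):
                pass  # endpoint accepted a TCP connection; a crude but cheap smoke test
        # Only after this passes would the change batch go to the DNS control plane.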


Dynamo is, AFAIK, not used by core AWS services.


I find it very interesting that this is the same issue that took down GCP recently.



