Cool, building in resilience seems to have worked. Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted (not sure if it would have been anyway).
My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.
The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).
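To make that concrete, the last layer (destination selection) is conceptually something like this - an illustrative Python sketch, not our actual code, with made-up region ordering and helper names:

    # Illustrative only: pick the first destination in a healthy region,
    # in a fixed preference order. DNS and routing failover sit in front of this.
    REGION_PREFERENCE = ["us-west-2", "us-east-2", "eu-west-1", "us-east-1"]

    def pick_destination(destinations_by_region, is_healthy):
        """destinations_by_region: {region: destination ARN}; is_healthy: region -> bool."""
        for region in REGION_PREFERENCE:
            dest = destinations_by_region.get(region)
            if dest and is_healthy(region):
                return dest
        raise RuntimeError("no healthy destination available")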
Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.
Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
Yeah, it's not clear how resilient CloudFront is, but it seems good. Since content is copied to the points of presence and cached, it's the lightly used stuff that can break (we don't do writes through CloudFront, which IMHO is an anti-pattern). We set up multiple "origins" for the content so hopefully that provides some resiliency -- not sure if it contributed positively in this case since CF is such a black box. I might set up some metadata for the different origins so we can tell which is in use.
There is always more than one way to do things with AWS. But CloudFront origin groups can’t use HTTP POST; they’re limited to read requests. Without origin groups you opt out of some resiliency. IMHO that’s a bad trade-off. To each their own.
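For reference, an origin group in the distribution config is roughly this shape (boto3-style dict; the origin IDs are placeholders, not our actual setup):

    # Sketch of the OriginGroups section of a CloudFront DistributionConfig.
    # Failover only applies to read requests (GET/HEAD/OPTIONS).
    origin_groups = {
        "Quantity": 1,
        "Items": [{
            "Id": "static-site-origin-group",
            "FailoverCriteria": {
                "StatusCodes": {"Quantity": 3, "Items": [500, 502, 503]}
            },
            "Members": {
                "Quantity": 2,
                "Items": [
                    {"OriginId": "s3-origin-us-west-2"},  # primary
                    {"OriginId": "s3-origin-us-east-2"},  # failover
                ],
            },
        }],
    }
    # This sits under DistributionConfig["OriginGroups"] alongside matching
    # entries in DistributionConfig["Origins"], e.g. via cloudfront.update_distribution().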
Yep, if you write Lambda@Edge functions, which are part of CloudFront and can be used for authentication among other things, they can only be deployed to us-east-1.
I was under the impression it's similar to IAM where the control plane is in us-east-1 and the config gets replicated to other regions. In that case, existing stuff would likely continue to work but updates may fail
True for certs but not the log bucket (but it’s still going to be in a single region, just doesn’t have to be Virginia). I’m guessing those certs are cached where needed, but I can also imagine a perfect storm where I’m unable to rotate them due to an outage.
I prefer the API Gateway model where I can create regional endpoints and sew them together in DNS.
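Roughly like this (boto3 sketch; the hosted zone ID and domain are placeholders - the regional domain name and zone ID come from the API Gateway custom domain in each region):

    import boto3

    route53 = boto3.client("route53")

    def add_regional_api(region, regional_domain, regional_zone_id):
        """Latency-based alias record pointing api.example.com at one region's endpoint."""
        route53.change_resource_record_sets(
            HostedZoneId="ZEXAMPLE123",  # placeholder public hosted zone
            ChangeBatch={"Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "A",
                    "SetIdentifier": f"api-{region}",
                    "Region": region,  # latency-based routing
                    "AliasTarget": {
                        "HostedZoneId": regional_zone_id,
                        "DNSName": regional_domain,
                        "EvaluateTargetHealth": True,
                    },
                },
            }]},
        )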
We use AWS for keys and certs, with aliases for keys so they resolve properly to the specific resources in each region. For any given HTTP endpoint there is a cert that is part of the stack in that region (different regions use different certs).
The hardest part is that our customers' resources aren't always available in multiple regions. When they are, we fall back to the next-closest region (by latency, courtesy of https://www.cloudping.co/) where they exist.
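The fallback itself is simple - something like this sketch, where the latency numbers are illustrative placeholders (ours come from cloudping-style measurements):

    # Pick the next-closest region (by inter-region latency) where the
    # customer's resource actually exists.
    LATENCY_MS = {
        "us-west-2": {"us-east-2": 50, "us-east-1": 65, "eu-west-1": 130},
        "us-east-2": {"us-west-2": 50, "us-east-1": 12, "eu-west-1": 85},
    }

    def fallback_region(home_region, regions_with_resource):
        candidates = [
            (LATENCY_MS[home_region].get(r, float("inf")), r)
            for r in regions_with_resource
            if r != home_region
        ]
        return min(candidates)[1] if candidates else None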
That’s what I’d expect a basic setup to look like - region/space specific
So you’re minimally hydrating everyone’s data everywhere so that you can have some failover. Seems smart and a good middle ground to maximize HA. I’m curious what your retention window for the failover data redundancy is. Days/weeks? Or just a FIFO with a total data cap?
Just config information, not really much customer data. Customer data stays in their own AWS accounts with our service. All we hold is the ARNs of the resources serving as destinations.
We’ve gone to great lengths to minimize the amount of information we hold. We don’t even collect an email address upon sign-up, just the information passed to us by AWS Marketplace, which is very minimal (the account number is basically all we use).
The data layer is DynamoDB with Global Tables providing replication between regions, so we can write to any region. It's not easy to get this right, but our use case is narrow enough and the rate of change low enough (intentionally) that it works well. That said, it still isn't clear that replication to us-east-1 would be perfect, so we did "diff" the tables just to be sure (it has been for us).
There is some S3 replication as well in the CI/CD pipeline, but that doesn't impact our customers directly. If we'd seen errors there it would have meant manually taking Virginia out of the pipeline so we could deploy everywhere else.
Our stacks in us-east-1 stopped getting traffic when the errors started and we’ve kept them out of service for now, so those tables aren’t being used. When we manually checked around noon (Pacific) they were fine (data matched) but we may have just gotten lucky.
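The check was nothing fancy - conceptually something like this (boto3 sketch; table and key names are placeholders):

    import boto3

    def table_items(region, table_name, key_attr="pk"):
        """Scan one region's copy of the table and index items by key."""
        table = boto3.resource("dynamodb", region_name=region).Table(table_name)
        items, kwargs = {}, {}
        while True:
            page = table.scan(**kwargs)
            for item in page["Items"]:
                items[item[key_attr]] = item
            if "LastEvaluatedKey" not in page:
                return items
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    def diff_regions(table_name, region_a="us-west-2", region_b="us-east-1"):
        a, b = table_items(region_a, table_name), table_items(region_b, table_name)
        only_in_one = set(a) ^ set(b)
        mismatched = {k for k in set(a) & set(b) if a[k] != b[k]}
        return only_in_one, mismatched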
Cool, thanks. We've been considering DynamoDB Global Tables for the same. We have S3 replication set up for cold storage data. For the primary/hot DB there don't seem to be many other options for doing local writes.