
Cool, building in resilience seems to have worked. Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted (not sure if it would have been anyway).

My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.

The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).

Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.

Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.



> Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted

This seems like such a low bar for 2025, but here we are.


You're also betting that CloudFront isn't one of the several AWS services that only works when us-east-1 is up.


Yeah, it's not clear how resilient CloudFront is, but it seems good. Since content is copied to the points of presence and cached, it's the lightly used stuff that can break (we don't do writes through CloudFront, which IMHO is an anti-pattern). We set up multiple "origins" for the content so hopefully that provides some resiliency -- not sure if it contributed positively in this case since CF is such a black box. I might set up some metadata for the different origins so we can tell which is in use.
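For anyone curious, a multi-origin setup with failover is expressed as an "origin group" in the CloudFront distribution config; a rough sketch (bucket names and IDs here are made up):

```json
{
  "Origins": [
    { "Id": "primary-usw2", "DomainName": "site-usw2.s3.us-west-2.amazonaws.com" },
    { "Id": "failover-euw1", "DomainName": "site-euw1.s3.eu-west-1.amazonaws.com" }
  ],
  "OriginGroups": {
    "Quantity": 1,
    "Items": [
      {
        "Id": "static-site-group",
        "FailoverCriteria": {
          "StatusCodes": { "Quantity": 3, "Items": [500, 502, 504] }
        },
        "Members": {
          "Quantity": 2,
          "Items": [
            { "OriginId": "primary-usw2" },
            { "OriginId": "failover-euw1" }
          ]
        }
      }
    ]
  }
}
```

CloudFront retries the second member only when the primary returns one of the listed status codes (or fails to connect), which is why it only helps for read traffic.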


CloudFront isn't just a CDN; it also provides DDoS protection. Writes through CloudFront are not an anti-pattern.


There is always more than one way to do things with AWS. But CloudFront origin groups can’t use HTTP POST; they’re limited to read requests. Without origin groups you opt out of some resiliency. IMHO that’s a bad trade-off. To each their own.


WAF is cheaper on CloudFront and so is traffic (compared to the ALB). It keeps bad traffic near the sender rather than near the recipient.


Yep, and if you write Lambda@Edge functions (which are part of CloudFront and can be used for authentication, among other things), they can only be deployed to us-east-1


I was under the impression it's similar to IAM where the control plane is in us-east-1 and the config gets replicated to other regions. In that case, existing stuff would likely continue to work but updates may fail


afaik CloudFront TLS certs and access-log S3 buckets must be stored in us-east-1


True for certs but not the log bucket (it’s still going to be in a single region, it just doesn’t have to be Virginia). I’m guessing those certs are cached where needed, but I can also imagine a perfect storm where I’m unable to rotate them due to an outage.

I prefer the API Gateway model where I can create regional endpoints and sew them together in DNS.
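The "sew them together in DNS" part is usually latency-based Route 53 alias records, one per regional endpoint, sharing a name; a sketch of one such record (domain, DNS name, and zone ID are placeholders):

```json
{
  "Action": "UPSERT",
  "ResourceRecordSet": {
    "Name": "api.example.com",
    "Type": "A",
    "SetIdentifier": "us-west-2",
    "Region": "us-west-2",
    "AliasTarget": {
      "HostedZoneId": "ZEXAMPLE12345",
      "DNSName": "d-abc123.execute-api.us-west-2.amazonaws.com",
      "EvaluateTargetHealth": true
    }
  }
}
```

With `EvaluateTargetHealth` on, a region whose endpoint is failing health checks drops out of resolution and traffic shifts to the next-best-latency record.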


How did you do resilient auth for keys and certs?


We use AWS for keys and certs, with aliases for keys so they resolve properly to the specific resources in each region. For any given HTTP endpoint there is a cert that is part of the stack in that region (different regions use different certs).

The hardest part is that our customers' resources aren't always available in multiple regions. When they are we fall back to a region where they exist that is next closest (by latency, courtesy of https://www.cloudping.co/).
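A minimal sketch of that next-closest fallback, assuming a static inter-region latency table (the numbers below are illustrative, not measured values from cloudping):

```python
# Hypothetical latency table: milliseconds from a home region to candidates.
# A real system would refresh these from measurements (e.g. cloudping data).
LATENCY_MS = {
    "us-west-2": {"us-east-1": 67, "eu-west-1": 130, "ap-southeast-1": 160},
    "us-east-1": {"us-west-2": 67, "eu-west-1": 75, "ap-southeast-1": 230},
}


def pick_fallback(home_region: str, available: set) -> "str | None":
    """Return the lowest-latency region (from home_region) where the
    customer's resources actually exist, or None if there is nowhere to go."""
    candidates = [
        (ms, region)
        for region, ms in LATENCY_MS[home_region].items()
        if region in available
    ]
    return min(candidates)[1] if candidates else None
```

For example, if a customer's resources exist only in eu-west-1 and ap-southeast-1, a request homed in us-west-2 falls back to eu-west-1.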


That’s what I’d expect a basic setup to look like - region/space specific

So you’re minimally hydrating everyone’s data everywhere so that you can have some failover. Seems smart and a good middle ground to maximize HA. I’m curious what your retention window for the failover data redundancy is. Days/weeks? Or just a fifo with total data cap?


Just config information, not really much customer data. Customer data stays in their own AWS accounts with our service. All we hold is the ARNs of the resources serving as destinations.

We’ve gone to great lengths to minimize the amount of information we hold. We don’t even collect an email address upon sign-up, just the information passed to us by AWS Marketplace, which is very minimal (the account number is basically all we use).


Ah well that certainly makes it easier


active/active? curious what the data stack looks like as that tends to be the hard part


The data layer is DynamoDB with Global Tables providing replication between regions, so we can write to any region. It's not easy to get this right, but our use case is narrow enough and the rate of change low enough (intentionally) that it works well. That said, it still wasn't clear that replication to us-east-1 would be perfect, so we "diffed" the tables just to be sure (replication has been perfect for us).

There is some S3 replication as well in the CI/CD pipeline, but that doesn't impact our customers directly. If we'd seen errors there it would have meant manually taking Virginia out of the pipeline so we could deploy everywhere else.
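The "diff" check can be as simple as comparing full scans of the same Global Table taken from two regions, keyed on the partition key. A sketch, with plain dicts standing in for paginated DynamoDB scans:

```python
# Sketch of a cross-region table diff. In practice each dict would be built
# from a paginated Scan of the same Global Table in a different region,
# keyed on the partition key; plain dicts stand in here.
def diff_tables(region_a: dict, region_b: dict) -> dict:
    """Return items that are missing or mismatched between two snapshots,
    mapped to their (region_a, region_b) values (None = absent)."""
    keys = region_a.keys() | region_b.keys()
    return {
        k: (region_a.get(k), region_b.get(k))
        for k in keys
        if region_a.get(k) != region_b.get(k)
    }
```

An empty result means the two regions agree; anything else lists exactly which keys drifted and what each region holds.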


So your global tables weren't impacted in us-east-1... I thought I read their status showed issues with global table replication


Our stacks in us-east-1 stopped getting traffic when the errors started and we’ve kept them out of service for now, so those tables aren’t being used. When we manually checked around noon (Pacific) they were fine (data matched) but we may have just gotten lucky.


cool thanks, we've been considering Dynamo global tables for the same. We have S3 replication set up for cold-storage data. For the primary/hot DB there don't seem to be many other options for doing local writes



