They don't appear to *have* a rollout procedure for some of their globally repli...

JB_Dev · 2025-12-05T13:44:18 1764942258

Yeah, agreed. This is the same discussion point that came up the last time they had an incident.

I really don’t buy this requirement to always deploy state changes 100% globally immediately. Why can’t they just roll out to 1%, scaling to 100% over 5 minutes (configurable), with automated health checks and pauses? That would go a long way towards reducing the impact of these regressions.

If they really think something is so critical that it has to go everywhere immediately, then sure, set the rollout to start at 100%.

Point is, design the rollout system to give you that flexibility. Routine/non-critical state changes should go through slower ramping rollouts.
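
To make that concrete, here is a minimal sketch of a ramping rollout with a health gate at each stage. Everything in it is hypothetical: `ramp_schedule`, `run_rollout`, and the `apply_stage` / `healthy` / `wait` callbacks are names made up for illustration, the geometric ramp is just one possible schedule, and rolling back to 0% on a failed check is an assumed policy (pausing at the current stage would be another).

```python
from typing import Callable, List, Tuple


def ramp_schedule(start_pct: float, steps: int) -> List[float]:
    """Geometric ramp from start_pct up to 100% in `steps` stages."""
    if not 0 < start_pct <= 100:
        raise ValueError("start_pct must be in (0, 100]")
    if start_pct == 100 or steps <= 1:
        return [100.0]  # the "goes everywhere immediately" case
    ratio = (100.0 / start_pct) ** (1.0 / (steps - 1))
    stages = [start_pct * ratio ** i for i in range(steps)]
    stages[-1] = 100.0  # avoid floating-point drift on the final stage
    return stages


def run_rollout(
    stages: List[float],
    apply_stage: Callable[[float], None],
    healthy: Callable[[float], bool],
    wait: Callable[[], None],
) -> Tuple[float, bool]:
    """Apply each stage, wait, then check health; roll back on failure.

    Returns (highest percentage reached, whether the rollout completed).
    """
    reached = 0.0
    for pct in stages:
        apply_stage(pct)
        reached = pct
        wait()  # e.g. sleep for total_duration / len(stages)
        if not healthy(pct):
            apply_stage(0.0)  # assumed policy: revert everything
            return reached, False
    return reached, True
```

With `ramp_schedule(1, 5)` and a 60-second `wait`, that is roughly the "1% to 100% over 5 minutes" case; `ramp_schedule(100, 5)` collapses to a single 100% stage for the emergency case.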

franktankbank · 2025-12-05T15:05:13 1764947113

Can't get hacked when you are down.

ethbr1 · 2025-12-05T13:42:03 1764942123

For hypothetical conflicting changes (read worst case: unupgraded nodes/services can't interop with upgraded nodes/services), what's best practice for a partial rollout?

Blue/green and temporarily ossify capacity? Regional?

cryptonym · 2025-12-05T15:38:49 1764949129

- Push a version that contains the new logic but leaves it disabled: it still runs the legacy logic, but can handle both

- Push a version that enables new logic for 1% of traffic

- Continue rollout until 100%
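
A rough sketch of what those three steps can look like in code, assuming the rollout percentage comes from config and traffic is split by hashing some request key. The names here (`bucket`, `use_new_logic`, `handle`, the `"new-logic"` salt, and the two placeholder handlers) are all hypothetical:

```python
import hashlib


def bucket(request_key: str, salt: str = "new-logic") -> int:
    """Map a key to a stable bucket in [0, 9999] (0.01% granularity)."""
    digest = hashlib.sha256(f"{salt}:{request_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def use_new_logic(request_key: str, rollout_pct: float) -> bool:
    """True if this key falls inside the currently enabled percentage."""
    return bucket(request_key) < rollout_pct * 100


def legacy_logic(request_key: str) -> str:
    return f"legacy:{request_key}"  # placeholder for the old code path


def new_logic(request_key: str) -> str:
    return f"new:{request_key}"  # placeholder for the new code path


def handle(request_key: str, rollout_pct: float) -> str:
    # Step 1 ships this binary with rollout_pct = 0 (legacy only).
    # Steps 2 and 3 only change rollout_pct, not the deployed code.
    if use_new_logic(request_key, rollout_pct):
        return new_logic(request_key)
    return legacy_logic(request_key)
```

Because the bucket is a pure function of the key, a key that gets the new logic at 1% keeps getting it at 10% and 100%, and dropping `rollout_pct` back to 0 is the rollback.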

nrhrjrjrjtntbt · 2025-12-05T23:36:10 1764977770

You can also do a canary rollout before that. Canary here means rolling out to endpoints that only CF itself uses for testing. Monitor metrics and automated test results.

cryptonym · 2025-12-06T08:45:06 1765010706

That's ok but doesn't solve issues you notice only on actual prod traffic. While it can be a nice addition to catch issues earlier with minimal user impact, best practice on large scale systems still requires a staged/progressive prod rollout.

nrhrjrjrjtntbt · 2025-12-06T08:49:00 1765010940

Yep. This is definitely an "as well as"

Unit test, Integration Test, Staging Test, Staging Rollout, Production Test, Canary, Progressive Rollout

This can all be automated, so you can smash through all of that quickly with no human intervention.
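
As a sketch of what "automated, no human intervention" means here: run the gates in order and stop at the first one that fails. `run_pipeline` and the gate callables are hypothetical; each gate would wrap whatever actually runs that stage.

```python
from typing import Callable, List, Tuple

# Stage names taken from the list above, in order.
STAGES = [
    "unit test",
    "integration test",
    "staging test",
    "staging rollout",
    "production test",
    "canary",
    "progressive rollout",
]


def run_pipeline(
    gates: List[Tuple[str, Callable[[], bool]]],
) -> Tuple[bool, List[str]]:
    """Run gates in order; return (all passed, names of gates that passed)."""
    passed: List[str] = []
    for name, gate in gates:
        if not gate():
            return False, passed  # stop here, later stages never run
        passed.append(name)
    return True, passed
```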

tehlike · 2025-12-05T12:36:49 1764938209

You can selectively bypass many roll out procedures in a properly designed system.

lima · 2025-12-05T12:54:34 1764939274

If there is a proper rollout procedure that would've caught this, and they bypass it for routine WAF configuration changes, they might as well not have one.

nrhrjrjrjtntbt · 2025-12-05T23:34:52 1764977692

Not sure I buy it. Do 1% for 10 minutes. I mean, it must have taken over half a day to code and test a patch. Why not wait another 10 minutes?