It has been quite a while, wondering how many 9s are dropped. 365 day \* 24 \* 0...

rdtsc · 2025-10-20T17:44:32 1760982272

9s don’t have to drop if you increase the time period! “We still guarantee the same 9s just over 3450 years now”.

rvba · 2025-10-20T18:43:41 1760985821

In a company where I worked, the tool measuring downtime was at the same server, so even if the server was down they still showed 100% up.

If the server didnt work - the tool too measure didnt work too! Genius

bityard · 2025-10-20T18:58:28 1760986708

This happened to AWS too.

February 28, 2017. S3 went down and took down a good portion of AWS and the Internet in general. For almost the entire time that it was down, the AWS status page showed green because the up/down metrics were hosted on... you guessed it... S3.

https://aws.amazon.com/message/41926/

CaptainOfCoit · 2025-10-20T20:57:31 1760993851

Happened a couple of times :)

- 2008 - https://news.ycombinator.com/item?id=116445

- 2010 - https://news.ycombinator.com/item?id=1396191

- 2015 - https://news.ycombinator.com/item?id=10033172

- 2017 - https://news.ycombinator.com/item?id=13755673 (Postmortem: https://news.ycombinator.com/item?id=13775667)

- 2024 - https://news.ycombinator.com/item?id=41770111

hinkley · 2025-10-20T23:41:42 1761003702

Five times is no longer a couple. You can use stronger words there.

bapak · 2025-10-21T06:44:58 1761029098

It happened a murder of times.

hinkley · 2025-10-21T17:38:10 1761068290

Ha! Shall I bookmark this for the eventual wiki page?

casey2 · 2025-10-22T10:44:05 1761129845

https://www.youtube.com/watch?v=HxP4wi4DhA0

Maybe they should start using real software instead of mathematicians' toy langs

Scoundreller · 2025-10-20T22:02:22 1760997742

Have we ever figured out what “red” means? I understand they’ve only ever gone to yellow.

kokanee · 2025-10-20T23:03:47 1761001427

If it goes red, we aren't alive to see it

Cthulhu_ · 2025-10-21T08:35:26 1761035726

I'm sure we need to go to Blackwatch Plaid first.

subpar · 2025-10-20T20:19:36 1760991576

obligatory https://x.com/lintzston/status/791761626890469377

belter · 2025-10-20T20:42:59 1760992979

Published in the same week of October ...9 years ago ...Spooky...

decimalenough · 2025-10-20T19:30:12 1760988612

I used to work at a company where the SLA was measured as the percentage of successful requests on the server. If the load balancer (or DNS or anything else network) was dropping everything on the floor, you'd have no 500s and 100% SLA compliance.

conductr · 2025-10-20T19:15:53 1760987753

Similar to hosting your support ticketing system with same infra. "What problem? Nobody's complaining"

hinkley · 2025-10-20T23:40:23 1761003623

I’ve been customer for at least four separate products where this was true.

I can’t explain why Saucelabs was the most grating one, but it was. I think it’s because they routinely experienced 100% down for 1% of customers, and we were in that one percent about twice a year. <long string of swears omitted>

bigiain · 2025-10-20T22:11:14 1760998274

I spent enough time ~15 years back to find an external monitoring service that did not run on AWS and looked like a sustainable business instead of a VC fueled acquisition target - for our belts-n-braces secondary monitoring tool since it's not smart to trust CloudWatch to be able to send notifications when it's AWS's shit that's down.

Sadly while I still use that tool a couple of jobs/companies later - I no longer recommend it because it migrated to AWS a few years back.

(For now, my out-of-AWS monitoring tool is a bunch of cron jobs running on a collections of various inexpensive vpses and my and other dev's home machines.)

6031769 · 2025-10-21T10:28:01 1761042481

Nagios is still a thing and you can host it wherever you like.

bigiain · 2025-10-22T03:21:57 1761103317

Interestingly, the reason I originally looked for and started using it was an unapproved "shadow IT" response to an in-house Nagios setup that was configured and managed so badly it had _way_ more downtime than any of the services I'd get shouted about at if customers noticed them down before we did...

(No disrespect to Nagios, I'm sure a competently managed installation is capable of being way better than what I had to put up with.)

AbstractH24 · 2025-10-21T14:39:15 1761057555

If its not on the dashboard, it didn't happen

echelon · 2025-10-20T18:20:36 1760984436

Common SLA windows are hour, day, week, month, quarter, and year. They're out of SLA for all of those now.

When your SLA holds within a joke SLA window, you know you goofed.

"Five nines, but you didn't say which nines. 89.9999...", etc.

SlightlyLeftPad · 2025-10-20T20:11:25 1760991085

These are typically calculated system-wide, so if you include all regions, technically only a fraction of customers are impacted.

alkhimey · 2025-10-20T20:38:55 1760992735

Customers in all regions were affected…

prmoustache · 2025-10-20T21:48:32 1760996912

Indirectly yes but not directly.

Our only impact was some atlassian tools.

captainkrtek · 2025-10-20T19:31:41 1760988701

I shoot for 9 fives of availability.

dare944 · 2025-10-21T03:38:44 1761017924

5555.55555% Really stupendous availableness!!!

president_zippy · 2025-10-20T23:01:44 1761001304

I see what you did there, mister :P

hamburglar · 2025-10-20T18:42:06 1760985726

I prefer shooting for eight eights.

decimalenough · 2025-10-20T19:30:58 1760988658

You mean nine fives.

Veserv · 2025-10-20T17:41:21 1760982081

You added a zero. There are ~8760 hours per year, so 8 hours is ~1 in 1000, 99.9%.

nine_k · 2025-10-20T22:01:02 1760997662

An outage like this does not happen every year, The last big outage happened in December 2021, roughly 3 years 10 month = 46 months ago.

The duration of the outage in relation to that uptime is (8 h / 33602 h) * 100% = 0.024%, so the uptime is 99.976%, slightly worse than 99.99%, but clearly better than 99.90%.

They used to be five nines, and people used to say that it's not worth the while to prepare for an outage. With less than four nines, the perception might shift, but likely not enough to induce a mass migration to outage-resistant designs.

hinkley · 2025-10-20T23:44:55 1761003895

Won’t the end result be people keeping more servers warm in other AWS regions which means Amazon profits from their own fuckups?

pinkgolem · 2025-10-21T02:29:09 1761013749

There was a pretty big outage 2023

markus_zhang · 2025-10-20T18:00:33 1760983233

Oh you are right!

codeduck · 2025-10-20T17:25:49 1760981149

I'm sure they'll find some way to weasel out of this.

elchananHaas · 2025-10-20T19:32:37 1760988757

For DynamoDB, I'm not sure but I think its covered. https://aws.amazon.com/dynamodb/sla/. "An "Error" is any Request that returns a 500 or 503 error code, as described in DynamoDB". There were tons of 5XX errors. In addition, this calculation uses percentage of successful requests, so even partial degradation counts against the SLA.

From reading the EC2 SLA I don't think this is covered. https://aws.amazon.com/compute/sla/

The reason is the SLA says "For the Instance-Level SLA, your Single EC2 Instance has no external connectivity.". Instances that were already created kept working, so this isn't covered. The SLA doesn't cover creation of new instances.

alex_young · 2025-10-20T18:27:37 1760984857

It's not down time, it's degradation. No outage, just degradation of a fraction[0] of the resources.

[0] Fraction is ~ 1

indoordin0saur · 2025-10-20T19:29:57 1760988597

This 100% seems to be what they're saying. I have not been able to get a single Airflow task to run since 7 hours ago. Being able to query Redshift only recently came back online. Despite this all their messaging is that the downtime was limited to some brief period early this morning and things have been "coming back online". Total lie, it's been completely down for the entire business day here on the east coast.

randomname11 · 2025-10-20T21:58:43 1760997523

We continue to see early signs of progress!

Keyframe · 2025-10-20T19:32:34 1760988754

It doesn't count. It's not downtime, it's unscheduled maintenance event.

8organicbits · 2025-10-20T19:12:32 1760987552

Check the terms of your contract. The public terms often only offer partial service credit refunds, if you ask for it, via a support request.

hinkley · 2025-10-20T23:46:20 1761003980

If you aren’t making $10 for every dollar you pay Amazon you need to look at your business model.

The refund they give you isn’t going to dent lost revenue.

hinkley · 2025-10-20T23:38:31 1761003511

Where were you guys the other day when someone was calling me crazy for trying to make this same sort of argument?

abraae · 2025-10-20T20:34:46 1760992486

I haven't done any RFP responses for a while but this question always used to make me furious. Our competitors (some of who had had major incidents in the past) claimed 99.99% availability or more, knowing they would never have to prove it, and knowing they were actually 100% until the day they weren't.

We were more honest, and it probably cost us at least once in not getting business.

d1sxeyes · 2025-10-20T21:35:15 1760996115

An SLA is a commitment, and an RFP is a business document, not a technical one. As an MSP, you don’t think in terms of “what’s our performance”, you think of “what’s the business value”.

If you as a customer ask for 5 9s per month, with service credit of 10% of at-risk fees for missing on a deal where my GM is 30%, I can just amortise that cost and bake it into my fee.

procaryote · 2025-10-21T07:22:56 1761031376

it's a matter of perspective... 9.9999% is real easy

dgoldstein0 · 2025-10-21T07:33:29 1761032009

Only if you remember to spend your unavailability budget

hvb2 · 2025-10-20T18:02:10 1760983330

It's a single region?

I don't think anyone would quote availability as availability in every region I'm in?

While this is their most important region, there's a lot of clients that are probably unaffected if they're not in use1.

They COULD be affected even if they don't have anything there because of the AWS services relying on it. I'm just saying that most customers that are multi region should have their east region out and are just humming along.

reactordev · 2025-10-20T18:21:40 1760984500

It’s THE region. All of AWS operates out of it. All other regions bow before it. Even the government is there.

idontwantthis · 2025-10-20T18:33:09 1760985189

"The Cloud" is just a computer that you don't own that's located in Reston, VA.

reactordev · 2025-10-20T19:29:02 1760988542

Facts.

hinkley · 2025-10-20T23:47:42 1761004062

The Rot Starts at the Head.

derektank · 2025-10-20T19:40:01 1760989201

AWS GovCloud East is actually located in Ohio IIRC. Haven't had any issues with GovCloud West today; I'm pretty sure they're logically separated from the commercial cloud.

vasco · 2025-10-20T19:09:18 1760987358

> All of AWS operates out of it.

I don't think this is true anymore. In the early days bad enough outages in us-east-1 would bring down everything because some metadata / control pane stuff was there, I remember getting affected while in other regions, but there's been many years since this has happened.

Today for example no issues. I just avoid us-east-1 and everyone else should to. It's their worst region by far in terms of reliability because they launch all the new stuff there and are always messing it up.

Root_Denied · 2025-10-20T19:47:46 1760989666

A secondary problem is that a lot of the internal tools are still on US East, so likely the response work is also being impacted by the outage. Been a while since there was a true Sev1 LSE (Large Scale Event).

phinnaeus · 2025-10-20T20:49:53 1760993393

What the heck? Most internal tools were in Oregon when I worked in BT pre 2021.

Root_Denied · 2025-10-21T22:10:18 1761084618

The primary ticketing system was up and down apparently, so tcorp/SIM must still have critical components there.

reactordev · 2025-10-20T19:29:37 1760988577

tell me it isn't true while telling me there isn't an outage across AWS because us-east-1 is down...

vasco · 2025-10-20T19:46:05 1760989565

I help run quite a big operation in a different region and had zero issues. And this has happened many times before.

reactordev · 2025-10-20T20:52:39 1760993559

If that were true, you’d be seeing the same issues we are in us-west-1 as well. Cheers.

alkhimey · 2025-10-20T20:48:43 1760993323

Global services such as STS have regional endpoints, but is it really that common to hit specific endpoint rather than use the default?

hamburglar · 2025-10-20T18:43:43 1760985823

The regions are independent, so you measure availability for each on its own.

logifail · 2025-10-20T19:08:31 1760987311

Except if they aren't quite as independent as people thought

hamburglar · 2025-10-20T19:43:58 1760989438

Well that’s the default pattern anyway. When I worked in cloud there were always some services that needed cross-regional dependencies for some reason or other and this was always supposed to be called out as extra risk, and usually was. But as things change in a complex system, it’s possible for long-held assumptions about independence to change and cause subtle circular dependencies that are hard to break out of. Elsewhere in this thread I saw someone mentioning being migrated to auth that had global dependencies against their will, and I groaned knowingly. Sometimes management does not accept “this is delicate and we need to think carefully” in the midst of a mandate.

I do not envy anyone working on this problem today.

oxfordmale · 2025-10-20T20:01:55 1760990515

But is is a partial outage only, so it doesn't count. If you retry a million times everything still works /s