
It has been quite a while now; I'm wondering how many 9s have been dropped.

365 days * 24 * 0.0001 is roughly 8 hours, so it has already lost its 99.99% status.



9s don’t have to drop if you increase the time period! “We still guarantee the same 9s, just over 3450 years now”.


At a company where I worked, the tool measuring downtime ran on the same server, so even when the server was down it still showed 100% uptime.

If the server didn't work, the tool to measure it didn't work either! Genius.


This happened to AWS too.

February 28, 2017. S3 went down and took down a good portion of AWS and the Internet in general. For almost the entire time that it was down, the AWS status page showed green because the up/down metrics were hosted on... you guessed it... S3.

https://aws.amazon.com/message/41926/



Five times is no longer a couple. You can use stronger words there.


It happened a murder of times.


Ha! Shall I bookmark this for the eventual wiki page?


https://www.youtube.com/watch?v=HxP4wi4DhA0

Maybe they should start using real software instead of mathematicians' toy langs


Have we ever figured out what “red” means? I understand they’ve only ever gone to yellow.


If it goes red, we aren't alive to see it


I'm sure we need to go to Blackwatch Plaid first.



Published in the same week of October... 9 years ago... Spooky...


I used to work at a company where the SLA was measured as the percentage of successful requests on the server. If the load balancer (or DNS or anything else network) was dropping everything on the floor, you'd have no 500s and 100% SLA compliance.
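
A toy sketch (numbers entirely made up) of why that measurement reads 100% when traffic never reaches the server:

    served_ok = 10_000         # requests that reached the app and returned 2xx
    served_errors = 0          # requests that reached the app and returned 5xx
    dropped_upstream = 50_000  # dropped by the LB/DNS; the app never saw them

    reported_sla = served_ok / (served_ok + served_errors)                       # 100%
    actual_success = served_ok / (served_ok + served_errors + dropped_upstream)  # ~17%
    print(f"reported: {reported_sla:.2%}, actual: {actual_success:.2%}")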


Similar to hosting your support ticketing system on the same infra. "What problem? Nobody's complaining."


I’ve been a customer of at least four separate products where this was true.

I can’t explain why Saucelabs was the most grating one, but it was. I think it’s because they routinely experienced 100% down for 1% of customers, and we were in that one percent about twice a year. <long string of swears omitted>


I spent enough time ~15 years back to find an external monitoring service that did not run on AWS and looked like a sustainable business instead of a VC-fueled acquisition target - for our belt-and-braces secondary monitoring tool, since it's not smart to trust CloudWatch to be able to send notifications when it's AWS's shit that's down.

Sadly, while I still use that tool a couple of jobs/companies later, I no longer recommend it, because it migrated to AWS a few years back.

(For now, my out-of-AWS monitoring tool is a bunch of cron jobs running on a collection of various inexpensive VPSes, plus my and other devs' home machines.)
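
Roughly this kind of thing, wired to whatever alerting you like (the endpoint and paths here are placeholders):

    # */5 * * * *  /usr/bin/python3 /opt/checks/ping_site.py
    import sys
    import urllib.request

    URL = "https://example.com/healthz"  # placeholder; one cron entry per endpoint
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            if resp.status >= 400:
                raise RuntimeError(f"HTTP {resp.status}")
    except Exception as exc:
        # cron mails stderr by default; swap in SMS/webhook/pager as needed
        print(f"DOWN {URL}: {exc}", file=sys.stderr)
        sys.exit(1)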


Nagios is still a thing and you can host it wherever you like.


Interestingly, the reason I originally looked for and started using it was an unapproved "shadow IT" response to an in-house Nagios setup that was configured and managed so badly it had _way_ more downtime than any of the services I'd get shouted at about if customers noticed them down before we did...

(No disrespect to Nagios, I'm sure a competently managed installation is capable of being way better than what I had to put up with.)


If it's not on the dashboard, it didn't happen.


Common SLA windows are hour, day, week, month, quarter, and year. They're out of SLA for all of those now.

When your SLA only holds within a joke SLA window, you know you goofed.

"Five nines, but you didn't say which nines. 89.9999...", etc.


These are typically calculated system-wide, so if you include all regions, technically only a fraction of customers are impacted.


Customers in all regions were affected…


Indirectly yes but not directly.

Our only impact was some Atlassian tools.


I shoot for 9 fives of availability.


5555.55555% Really stupendous availableness!!!


I see what you did there, mister :P


I prefer shooting for eight eights.


You mean nine fives.


You added a zero. There are ~8760 hours per year, so 8 hours is ~1 in 1000, 99.9%.


An outage like this does not happen every year. The last big outage happened in December 2021, roughly 3 years 10 months = 46 months ago.

The duration of the outage in relation to that uptime is (8 h / 33602 h) * 100% = 0.024%, so the uptime is 99.976%, slightly worse than 99.99%, but clearly better than 99.90%.
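
Same arithmetic in Python, if you want to play with the window:

    outage_hours = 8
    window_hours = 46 * 730.5   # ~46 months, roughly 33,600 hours
    uptime = 1 - outage_hours / window_hours
    print(f"{uptime:.3%}")      # ~99.976%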

They used to be five nines, and people used to say that it's not worthwhile to prepare for an outage. With less than four nines, the perception might shift, but likely not enough to induce a mass migration to outage-resistant designs.


Won’t the end result be people keeping more servers warm in other AWS regions which means Amazon profits from their own fuckups?


There was a pretty big outage in 2023.


Oh you are right!


I'm sure they'll find some way to weasel out of this.


For DynamoDB, I'm not sure, but I think it's covered: https://aws.amazon.com/dynamodb/sla/. "An "Error" is any Request that returns a 500 or 503 error code, as described in DynamoDB". There were tons of 5XX errors. In addition, this calculation uses the percentage of successful requests, so even partial degradation counts against the SLA.

From reading the EC2 SLA I don't think this is covered. https://aws.amazon.com/compute/sla/

The reason is that the SLA says "For the Instance-Level SLA, your Single EC2 Instance has no external connectivity." Instances that were already created kept working, so this isn't covered. The SLA also doesn't cover creation of new instances.
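
For illustration only (this is the gist of "percentage of successful requests", not the official SLA formula), counting 500/503 as errors per that DynamoDB wording:

    # Hypothetical sample of response codes seen during the outage window
    statuses = [200] * 9_500 + [500] * 300 + [503] * 200
    errors = sum(1 for s in statuses if s in (500, 503))
    success_rate = 1 - errors / len(statuses)
    print(f"successful requests: {success_rate:.2%}")  # 95.00% for this made-up sample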


It's not downtime, it's degradation. No outage, just degradation of a fraction[0] of the resources.

[0] Fraction is ~ 1


This 100% seems to be what they're saying. I haven't been able to get a single Airflow task to run in the last 7 hours, and querying Redshift only recently came back. Despite this, all their messaging is that the downtime was limited to some brief period early this morning and things have been "coming back online". Total lie; it's been completely down for the entire business day here on the east coast.


We continue to see early signs of progress!


It doesn't count. It's not downtime, it's an unscheduled maintenance event.


Check the terms of your contract. The public terms often only offer partial service credit refunds, if you ask for it, via a support request.


If you aren’t making $10 for every dollar you pay Amazon you need to look at your business model.

The refund they give you isn’t going to dent lost revenue.


Where were you guys the other day when someone was calling me crazy for trying to make this same sort of argument?


I haven't done any RFP responses for a while, but this question always used to make me furious. Our competitors (some of whom had had major incidents in the past) claimed 99.99% availability or more, knowing they would never have to prove it, and knowing they were actually 100% until the day they weren't.

We were more honest, and it probably cost us at least once in not getting business.


An SLA is a commitment, and an RFP is a business document, not a technical one. As an MSP, you don’t think in terms of “what’s our performance”, you think of “what’s the business value”.

If you as a customer ask for 5 9s per month, with service credit of 10% of at-risk fees for missing on a deal where my GM is 30%, I can just amortise that cost and bake it into my fee.
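
Toy numbers (entirely made up) to show the amortisation:

    monthly_fee = 100_000          # hypothetical at-risk fee per month
    gross_margin = 0.30            # 30% GM on the deal
    credit_rate = 0.10             # 10% service credit per missed month
    expected_misses_per_year = 2   # how often we expect to blow five 9s

    amortised_credit = monthly_fee * credit_rate * expected_misses_per_year / 12
    print(f"~${amortised_credit:,.0f}/month in expected credits "
          f"vs ${monthly_fee * gross_margin:,.0f}/month in margin")

A couple of grand a month in expected credits against $30k of margin just gets priced in.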


it's a matter of perspective... 9.9999% is real easy


Only if you remember to spend your unavailability budget


It's a single region?

I don't think anyone would quote availability as availability across every region they're in, would they?

While this is their most important region, there are a lot of clients that are probably unaffected if they're not in us-east-1.

They COULD be affected even if they don't have anything there, because of the AWS services relying on it. I'm just saying that most customers that are multi-region should have taken their east region out and are just humming along.


It’s THE region. All of AWS operates out of it. All other regions bow before it. Even the government is there.


"The Cloud" is just a computer that you don't own that's located in Reston, VA.


Facts.


The Rot Starts at the Head.


AWS GovCloud East is actually located in Ohio IIRC. Haven't had any issues with GovCloud West today; I'm pretty sure they're logically separated from the commercial cloud.


> All of AWS operates out of it.

I don't think this is true anymore. In the early days, bad enough outages in us-east-1 would bring down everything, because some metadata / control plane stuff was there (I remember getting affected while in other regions), but it's been many years since that happened.

Today, for example: no issues. I just avoid us-east-1, and everyone else should too. It's their worst region by far in terms of reliability, because they launch all the new stuff there and are always messing it up.


A secondary problem is that a lot of the internal tools are still on US East, so likely the response work is also being impacted by the outage. Been a while since there was a true Sev1 LSE (Large Scale Event).


What the heck? Most internal tools were in Oregon when I worked in BT pre 2021.


The primary ticketing system was up and down apparently, so tcorp/SIM must still have critical components there.


tell me it isn't true while telling me there isn't an outage across AWS because us-east-1 is down...


I help run quite a big operation in a different region and had zero issues. And this has happened many times before.


If that were true, you’d be seeing the same issues we are in us-west-1 as well. Cheers.


Global services such as STS have regional endpoints, but is it really that common to hit a specific endpoint rather than use the default?
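
If you do want to pin it, it's just an explicit endpoint. Region and endpoint below are only an example (using boto3), and which endpoint the default resolves to depends on SDK config such as AWS_STS_REGIONAL_ENDPOINTS:

    import boto3

    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",  # regional, not the global sts.amazonaws.com
    )
    print(sts.get_caller_identity()["Account"])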


The regions are independent, so you measure availability for each on its own.


Except if they aren't quite as independent as people thought


Well that’s the default pattern anyway. When I worked in cloud there were always some services that needed cross-regional dependencies for some reason or other and this was always supposed to be called out as extra risk, and usually was. But as things change in a complex system, it’s possible for long-held assumptions about independence to change and cause subtle circular dependencies that are hard to break out of. Elsewhere in this thread I saw someone mentioning being migrated to auth that had global dependencies against their will, and I groaned knowingly. Sometimes management does not accept “this is delicate and we need to think carefully” in the midst of a mandate.

I do not envy anyone working on this problem today.


But it is a partial outage only, so it doesn't count. If you retry a million times, everything still works /s



