Surely this post must have the opposite effect of what he intended. Even if you side with Cloudflare on the core issue, this post is so cringey my butthole collapsed into itself.
Are Americans not embarrassed by the way these tech bros operate? As a European it’s obvious that the US has gone from an ally to an enemy. I would feel like a traitor if I picked US tech these days.
That’s why you should only export anonymous information to external parties. There is no valid reason for OpenAI to export my personal information like this.
I will report OpenAI to the data protection agency in my country and I encourage others to do the same. They cannot blame Mixpanel when they sprinkle other people's personal information around like this. NOT OK.
* Name that was provided to us on the API account
* Email address associated with the API account
* Approximate coarse location based on API user browser (city, state, country)
* Operating system and browser used to access the API account
* Referring websites
* Organization or User IDs associated with the API account
Rookie mistake for a billion dollar plus company, let alone the most valuable in the world.
Throwing Mixpanel under the bus while ignoring the giant elephant in the room ("why were you giving them that user data in the first place?") leaves a sour taste.
How can you write the proxy without handling the case where the config contains more than the maximum feature limit you set yourself?
How can the database export query not have a limit set when there is a hard limit on the number of features? (See the sketch below these questions for what such a guard might look like.)
Why do they do non-critical changes in production before testing in a stage environment?
Why did they think this was a cyberattack and only after two hours realize it was the config file?
Why are they that afraid of a botnet? It does not leave me confident that they will handle the next Aisuru attack.
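To make that second question concrete, here is roughly the kind of guard I would expect on a metadata query like that. This is only a sketch; the table name, the 200-feature cap, and the exact query shape are my assumptions, not Cloudflare's actual code:

    -- Illustrative only: table name and the 200-feature cap are assumed.
    SELECT name, type
    FROM system.columns
    WHERE database = currentDatabase()      -- avoid duplicate rows from other databases
      AND table = 'http_requests_features'  -- hypothetical table name
    ORDER BY name
    LIMIT 201;                              -- cap + 1: if 201 rows come back, the caller can
                                            -- fail loudly instead of emitting an oversized file

Even with an application-side limit in place, a query-side guard like this turns "a permissions change doubled the row count" into an explicit error rather than a corrupt config.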
I'm migrating my customers off Cloudflare. I don't think they can absorb the next botnet attacks, and everyone on Cloudflare will go down with the ship, so it will be safer not to be behind Cloudflare when it hits.
Exactly. The only way this could happen in the first place was _because_ they failed at so many levels. And as a result, more layers of Swiss cheese will be added, and holes in existing ones will be patched. This process is the reason flying is so safe, and the reason why Cloudflare will be a little bit more resilient tomorrow than it was yesterday.
> Why do they do non-critical changes in production before testing in a stage environment?
I guess the non-critical change here was the change to the database? My experience has been that a lot of teams do a poor job of keeping a faithful replica of their databases in staging environments to expose this type of issue.
In part because it is somewhere between really hard and impossible. Is your staging DB going to be as big? Seeing the same RPS as prod? Seeing the same scenarios?
Permissions stuff might be caught without a completely faithful replica, but there are always going to be attributes of the system that only exist in prod.
I know it's easy to criticize what happened after the fact, with a clear(er) picture of all the moving parts and the timeline of events, but while most of the people in the thread are pointing out either Rust-related issues or the lack of configuration validation, what really grinds my gears is something that - in my opinion - is just bad engineering.
Having an unprivileged application query system.columns to infer the table layout is just bad; not having a proper, well-defined table structure indicates sloppiness in the overall schema design, especially if it changes quickly. Considering ClickHouse specifically, even if this approach were a good idea, the unprivileged way of doing it would be "DESCRIBE TABLE <name>", NOT iterating system.columns. The gist of it: sloppy design, not even well implemented.
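For the ClickHouse-literate, the difference is roughly this (the table name is illustrative, not Cloudflare's actual schema):

    -- Scoped to one table, resolved in the current database; no cross-database duplicates:
    DESCRIBE TABLE http_requests_features;

    -- Versus iterating the system metadata, which returns one row per database in which
    -- the user can see a table with that name, unless you filter explicitly:
    SELECT name, type
    FROM system.columns
    WHERE table = 'http_requests_features';

The first form gives you the column list of exactly one table; the second silently changes shape the moment the user is granted visibility into another database that contains a table of the same name.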
Having a critical application issue ad-hoc commands against the system.* tablespace instead of using a well-tested library is just amateurism, and again, bad engineering. IMO it is good practice to treat all querying of system.* as privileged and to keep it completely separate from your application logic; sometimes system tables change, and fields are added and/or removed - not planning for this will make future compatibility a nightmare.
Not only the problematic query itself, but the whole context of this screams "lack of proper application design" and devs not knowing how to use the product and/or read the documentation. Granted, this is a bit "close to home" for me, because I use ClickHouse extensively (at a scale - I'm assuming - several orders of magnitude smaller than Cloudflare's) and I have spent a lot of time designing specifically to avoid at least some of these kinds of mistakes. But if I can do it at my scale, why aren't they doing it?
On all the other issues, I thought they wanted to do the right thing at heart, but failed to make it fail-safe. I can pass that off as part of the journey to maturity, or simply the fact that you can't get everything perfect. Maybe even a bit of sloppiness here and there.
The database issue screamed at me: lack of expertise. I don't use CH, but seeing someone mess with a production system and then be surprised with "Oh, it does that?" is really bad. And this is obviously not knowledge that is hard to acquire, buried deep in a manual, or an edge case only discoverable from source code; it's bread-and-butter knowledge you should have.
What is confusing is that they didn't add this to their follow-up steps. Giving them the benefit of the doubt, I'd assume they didn't want to put something this basic out there as a reason, just to protect the people behind it from widespread blame. But if that's not the case, then it's a general problem. Sadly it's not uncommon for components like databases to be dealt with on a low-effort basis: just a thing you plug in and it works. But it obviously isn't.
I don't think these are realistic requirements for any engineered system to be honest. Realistic is to have contingencies for such cases, which are simply errors.
But the case for Cloudflare here is complicated. Every engineer is very free to make a better system though.
What is not realistic? To do simple input validation on data that has the potential to break 20% of the internet? To not have a system in place to roll back to the last known-good state when things crash?
Cloudflare builds a global scale system, not an iphone app. Please act like it.
Cloudflare's success came from the simplicity of building a distributed system across data centers around the world, one that could be deployed by third-party IT workers while Cloudflare itself was only a few people. There are probably a lot of shitty iPhone apps that do less important work and are vastly more complex than the original Cloudflare server node configuration.
Every system carries an irreducible risk, and no data rollback is trivial, especially for a CDN.
Yeah, I don't quite understand the people cutting Cloudflare massive slack. It's not about nailing blame on a single person or a team, it's about keeping a company that is THE closest thing to a public utility for the web accountable. They more or less did a Press Release with a call to action to buy or use their services at the end and everybody is going "Yep, that's totally fine. Who hasn't sent a bug to prod, amirite?".
It goes over my head why Cloudflare is HN's darling while others like Google, Microsoft and AWS don't usually enjoy the same treatment.
>It goes over my head why Cloudflare is HN's darling while others like Google, Microsoft and AWS don't usually enjoy the same treatment.
Do the others you mentioned provide such detailed outage reports, within 24 hours of an incident? I’ve never seen others share the actual code that related to the incident.
Or the CEO or CTO replying to comments here?
>Press Release
This is not a press release; they have done these outage posts since the start of the company.
> Do the others you mentioned provide such detailed outage reports, within 24 hours of an incident? I’ve never seen others share the actual code that related to the incident.
The code sample might as well be COBOL for people not familiar with Rust and its error handling semantics.
> Or the CEO or CTO replying to comments here?
I've looked around the thread and I haven't seen the CTO here nor the CEO, probably I'm not familiar with their usernames and that's on me.
> This is not press release, they always did these outage posts from the start of the company.
My mistake calling them press releases. Newspapers and online publications also skim this outage report to inform their news stories.
I wasn't clear enough in my previous comment. I'd like all major players in internet and web infrastructure to be held to higher standards. As it stands, compared to the tech department of a retail store, the retail store must answer to more laws once the surface area of their combined activities is taken into account.
Yes, Cloudflare excels where others don't or barely bother, and I too enjoyed the pretty graphs and diagrams, and learned some nifty Rust tricks.
EDIT: I've removed some unwarranted snark from my comment which I apologize for.
> To do simple input validation on data that has the potential to break 20% of the internet?
There will always be bugs in code, even simple code, and sometimes those things don't get caught before they cause significant trouble.
The failing here was not having a quick rollback option, or having it and not hitting the button soon enough (even if they thought the problem was probably something else, I think my paranoia about my own code quality is such that I would have been rolling back much sooner just in case I was wrong about the “something else”).
Name me global, redundant systems that have not (yet) failed.
And if you used Cloudflare to protect against botnets and now move off Cloudflare... you are vulnerable and may experience more downtime if you cannot absorb the traffic.
I mean, no service has 100% uptime; some just have more nines than others.
Ask yourself instead whether your service is really so important that it needs 99.999% uptime. Because I get the impression that people are so fixated on this uptime concept that being down for a few hours is treated as the most horrible issue in the world. To the point that they would rather hand over control of their own system to a third party than accept any downtime.
The fact that Cloudflare can literally read every bit of communication (as it sits between the client and your server) is already plenty bad. And yet we accept this more easily than a bit of downtime. We shall not ask about the price of that service ;)
To me it's nothing more than the whole "everybody on the cloud" issue, where most don't need the resources (or the bill) that cloud companies like AWS provide, and yet get totally tied down to that one service.
Not when you start pushing into the terabytes of monthly data... when you get that dreaded phone call from a CF rep, because the bill that is coming is no joke.
It's free as long as you really are small, not worth milking. The moment you can afford to run your own mini DC at your office, you start to enter "well, hello there" territory for CF.
> The moment you can afford to run your own mini DC at your office, you start to enter "well, hello there" territory for CF.
As someone who has run (and is still running) a DC, with all the electrical/UPS, cooling, piping, and HVAC+D stuff to deal with: it can be a lot of time and overhead.
Especially if you don't have a number of folks in-house to deal with all that 'non-IT' equipment (I'm a bit strange in that I have an interest in both IT and HVAC-y stuff).
> There are many self-hosted alternatives to protect against botnet.
What would some good examples of those be? I think something like Anubis is mostly against bot scraping, not sure how you'd mitigate a DDoS attack well with self-hosted infra if you don't have a lot of resources?
On that note, what would be a good self-hosted WAF? I recall using mod_security with Apache and the OWASP ruleset; apparently the Nginx version was a bit slower (e.g. https://www.litespeedtech.com/benchmarks/modsecurity-apache-... ). There was also the Coraza project, but I haven't heard much about it: https://coraza.io/ Or maybe the people who say that running a WAF isn't strictly necessary also have a point (depending on the particular attack surface).
There is haproxy-protection, which I believe is the basis of Kiwiflare. Clients making new connections have to solve a proof-of-work challenge that takes about 3 seconds of compute time.
Well, if you self-host a DDoS protection service, that would be VERY expensive. You would need to rent rack space along with a very fast internet connection at multiple data centers to host it.
If you're buying transit, you'll have a hard time getting away with less than a 10% commit, i.e. you'll have to pay for 10 Gbps of transit to have a 100 Gbps port, which will typically run into 4 digits USD / month. You'll need a few hundred Gbps of network and scrubbing capacity to handle common amplification DDoS attacks from script kids with a 10 Gbps uplink server at a host that allows spoofing, and probably on the order of 50+ Tbps to handle Aisuru.
If you're just renting servers instead, you have a few options that are effectively closer to a 1% commit, but better have a plan B for when your upstreams drop you if the incoming attack traffic starts disrupting other customers - see Neoprotect having to shut down their service last month.
We had better uptime with AWS WAF in us-east-1 than we've had in the last 1.5 years of Cloudflare.
I do like Cloudflare's flat cost and feature set better, but they have quite a few outages compared to other large vendors--especially with Access (their zero trust product).
I'd lump them into GitHub levels of reliability
We had a comparable but slightly higher quote from an Akamai VAR.
But at the same time, what value do they add if they:
* Took down the customers' sites due to their bug.
* Never protected against an attack that our infra could not have handled by itself.
* Don't think that they will be able to handle the "next big ddos" attack.
It's just an extra layer of complexity for us. I'm sure there are attacks that they could help our customers with; that's why we're using them in the first place. But until the customers are hit with multiple DDoS attacks that we cannot handle ourselves, it's just not worth it.
> • Took down the customers' sites due to their bug.
That is always a risk with using a 3rd party service, or even adding extra locally managed moving parts. We use them in DayJob, and despite this huge issue and the number of much smaller ones we've experienced over the last few years their reliability has been pretty darn good (at least as good as the Azure infrastructure we have their services sat in front of).
> • Never protected against an attack that our infra could not have handled by itself.
But what about the next one… Obviously this is a question sensitive to many factors in our risk profiles and attitudes to that risk, there is no one right answer to the “but is it worth it?” question here.
On a slightly facetious point: if something malicious does happen to your infrastructure that it does not cope well with, you won't have the "everyone else is down too" shield :) [Only slightly facetious because, while some of our clients are asking for a full report including justification for continued use of CF and any other 3rd parties, which is their right both morally and as written in our contracts, most, especially those who had locally managed services affected, have taken the "yeah, half our other stuff was affected too, what can you do?" viewpoint.]
> • Don't think that they will be able to handle the "next big ddos" attack.
It is a war of attrition. At some point a new technique, or just a new botnet significantly larger than those seen before, will come along that they might not be able to deflect quickly. I'd be concerned if they were conceited enough not to be concerned about that possibility. Any new player is likely to practise on smaller targets first before directly attacking CF (in fact I assume that it is rather rare that CF is attacked directly) or a large enough segment of their clients to cause them specific issues. Could your infrastructure do any better if you happen to be chosen as one of those earlier targets?
Again, I don't know your risk profile, so I can't say which is the right answer, if there even is an easy one, other than "not thinking about it at all" being a truly wrong answer. Also, DDoS protection is not the only service many use CF for, so those other services need to be considered too if you aren't using them for that one thing.
I agree. I think the comments about how "it is fine, because so many things had to fail" do not apply in this case.
It's not that many things had to fail; it's that many obvious things hadn't been done. It would be a valid excuse if many "exotic" scenarios had to align, not when it's obvious error cases that weren't handled and changes that weren't tested.
While having wrong first assumptions is just how things work when you try to analyze the issue[1], not testing changes before production is just stupidity and nothing else.
The story would be different if, e.g., multiple unlikely, hard-to-track things had happened at once without any clearly linkable event, something that could just as well have happened in staging. Most of the things mentioned could essentially be statically checked. This is the prime example of what you want as a tech person, because it's not hard to prevent compared to the many scenarios where you are balancing likelihoods, timings, etc.
You don't think someone is a great plumber just because a lot of things failed at once: they forgot their tools, missed that big hole in the pipe, and rang at the wrong door. You think someone is a good plumber if they say they have to go back to fetch a bulky specialized tool because this is the rare case in which they need it, or if they can do this other thing in this specific case instead. They are great plumbers if they tell you how this happened in the first place and how to fix it. They are great plumbers if they manage to fix something outside their usual scope.
Here pretty much all of the things that you pay them for failed. At a large scale.
I am sure there are reasons for this which we don't know about, and I hope that Cloudflare can fix them. Be it management focusing on the wrong things, be it developers not being in the right position or not annoyed enough to care, or something else entirely. However, not doing these things is (likely) a sign that currently they are not in a state to create reliable systems - at least none reliable enough for what they are doing. It would be perfectly fine if they ran a web shop or something, but if, as we experienced, many other companies rely on you being up or their stuff fails, then maybe you should not run a company with products like "Always Online".
[1] And it should make you adapt your process for analyzing issues, e.g. making sure config changes are "very loud" in monitoring. It's one of the most easily tracked things that can go wrong, and it can relatively easily be mapped to a point in time compared to many other things.
We do Thursday to Thursday, and then you get the Friday off after a completed on-call week.
Being on call gives you no extra pay by itself, but if you get paged off-hours and need to work, you get paid 150 to 200% of your normal hourly wage, depending on what time of day you have to work.
No pay for being on call by itself is still poor, particularly when it comes to swapping rotations between team members to provide flexibility amongst each other.
You’re making yourself available 24/7. That has a non trivial lifestyle impact which I’ve always thought deserves more than is typically rewarded.
As long as the on-call coverage is as specified at the time of hiring, this is just a difference in form of payment.
If I receive 100 total units of compensation, I'd way rather get 100 units of base pay (and 0 on-call pay) than 90 units of base pay and 10 units of specific on-call pay. (What if the company eliminates on-call? What if I get injured and my insurance only covers base pay? Severance is usually based only on base pay; I would not be paid on-call while I'm on PTO or other paid leave, annual raise percentages typically apply to base pay, etc...)
How can the on-call coverage be specified at hiring? Can the company guarantee that my team will never shrink or that the page rate won't increase?
What will financially encourage my company to stop paging me overnight if there isn't a labor cost to the company every time an on-call incident occurs?
> What if I get injured and my insurance only covers base pay?
Insurance payouts can be easily based on wages that include reported commissions, tips, and overtime. They can very easily be based on an average of past actual wages paid in the last handful of months at the company.
> Severance is usually based only on base pay
Severance is a completely optional practice that is based entirely on what the company wants to do. I would argue that severance is more accurately based on "The lowest safe number to pay to this particular employee to make sure their termination does not become a legal risk."
> I would not be paid on-call while I'm on PTO or other paid leave
But also, PTO days and on-call days don't intersect. If you took time off during an on-call shift you would be trading it with a team member, so you would never lose that extra wage.
Example: I'm taking a week off, it's during my scheduled on-call shift. I would normally get paid my on-call hours but I didn't this week. But when I get back from my vacation, I'm picking up an extra on-call shift because my team member covered my shift when I was on vacation.
Now, I'm taking a week off, but it's not during my on-call shift. I wouldn't have been paid on-call hours this week anyway. When I get back from my vacation, I am going on my normally scheduled on-call shift.
I personally have never felt compensated dynamically enough for on-call schedules. Most corporate jobs seem to pay for a sliver of the life disruption, maybe paying for half my phone and Internet bill or something like that. They all say that the on-call is baked into the compensation, but I'm not so sure.
>"Severance is a completely optional practice that is based entirely on what the company wants to do. I would argue that severance is more accurately based on "The lowest safe number to pay to this particular employee to make sure their termination does not become a legal risk."
Almost right! I see it as an extension of what I call the basic rules, "I am as nice to you as you are to me", and "I care exactly as much as you do."
That does, in some cases, expand severance a little beyond the cold risk calculation. If the severance is going to someone who helped the company make it, then helping make sure they make it to their next gig is part of the equation.
Not everyone boils it all down that far, but a whole lot of us do!
Which makes your comment solid, and mine a quibble, but one I consider worthy of some discussion.
Germany (among other countries) has laws around this. My company pays, I think, 200 euro per day that someone is on call, so my German reports end up making a decent amount in months when they have their on-call shifts, which is especially felt when the team is smaller and rotations are more frequent!
> If you took time off during an on-call shift you would be trading it with a team member, so you would never lose that extra wage.
I think this is true in _most cases_, but is not a given. I myself have encountered scenarios where it isn’t true: switching with someone much later in the rotation, only to then end up having to switch again for instance. You could envision a nefarious teammate weaseling out of their fair share with sneaky switches like this, too, though paying for it would maybe incentivize them not to!
Not to mention that there is an incentive to keep having on-call pages, because that's how you get paid. Or to not participate at all. On the other hand, with a flat payment, there is a big incentive to prevent issues, reduce out-of-hours incidents, and participate in the rota.
In OP's case it sounds like they do get compensated with the day off, which is PTO. It's not a trade everyone would make but an extra day off into a long weekend is one I would have taken earlier in my career.
+1! You can't travel very much, you can't go hiking or biking in places without cell coverage, your whatever thing you are busy with gets interrupted, you can get woken up in the middle of the night, etc etc. That deserves some compensation.
The extra day off is probably equivalent to getting paid? Last time on-call was part of my job, I think the pay increase for standby would have amounted to roughly 8 hours as well (actual interventions were also paid at 150% for regular nights and Saturday, 200% for Saturday night and Sunday).
Lugging around a laptop and the on-call phone when going anywhere, checking every now and then when the phone was not with you for a while (e.g. pool, gym, etc.), and making sure you didn't go places with no signal was enough of a PITA that knowing we were paid for every hour of it had a nice psychological effect.
> ”we are identifying and reaching out to former employees who signed a standard exit agreement to make it clear that OpenAI has not and will not cancel their vested equity and releases them from nondisparagement obligations”
Well, they say they are. But the nondisparagement agreement repeatedly forbids revealing the agreement itself, so if it wasn't cancelled those subject to it would be forbidden to point out that the public claim they were going to release people from it was a lie (or done only for people from whom OpenAI was not particularly concerned about potential disparagement.)
"Disparagement" is whatever is defined in the agreement, which reportedly (from one of the people who declined to sign it) includes discussing the existence of the agreement.
This is not normal for being a corporate employee. This was certainly going to come out eventually and cause big problems, but to the extent Sam thinks AGI is around the corner he might not be playing the long game.
"We're hereby making a legally binding commitment that those clauses are void, whether anyone reaches out to us or we manage to reach out to them or not."
Unless and until that's what they say, looks like they’re not doing that.
I've sometimes wondered if I am the only person who actually likes the syntax :D There's a reason for it, but additionally, I like the fact that it's explicit — I can look at `some_call.()` and specifically know it's an anonymous function.
I like barewords in Ruby but they don't hold the same value for me in Elixir. In Ruby I was trying to write code in a way where it didn't matter all that much where stuff was coming from. In Elixir I want to know exactly where stuff is coming from and what it is (like how we generally don't `import Enum` or the like).
Not to say you shouldn't like barewords! It is nice that Elixir enables that possibility for those who want it!
I'm repeating myself from another thread but ah well! It has that nice parallel to Erlang too where the calls look distinct: `f()` for function call, `F()` for anonymous function call! All that to say that I agree and I also like the dot syntax :)
Then they should have a "?" status that can be triggered by automated systems, acknowledging that it looks like an issue but that they are still manually investigating.
If it's a false positive, they just resolve it without it affecting the SLA; if it's a real problem, then we customers wouldn't have to debug our own stack for 2 hours before Microsoft informs us that they are the problem.
EDIT:
Wonder how many man-years of extra debugging work their non-working status page has caused customers.
It's not a troll account. It's a throwaway account, because you can't trust people these days not to go after your company/family/reputation because you hold an unpopular opinion. I am not 'trolling' anyone here. In a few days I'll be back to using my regular screen name here, and you'll like me as much as you used to.
No identity, no stakes, no reason to believe you. Put your reputation on the line if you want your opinions to be treated as anything other than computer-generated noise.
You don't have to believe me when all I do is ask a question. If your position is so weak that you can't answer the question well, that's a problem, and not simply proof that I'm a "troll".
No need. If you had a good answer, you'd post it for others to see. The purpose of a forum is not to have one-on-one conversations only; it's to inform others as well. My question makes others think, and you are incapable of giving an answer. That's enough. I'm not going to risk business/financial loss (due to fanatics going after people... Mozilla CEO ring a bell?) to have a little more street cred with someone like you.
Kevlar helmets and ceramic body armor are pretty standard.
Go for body armor that can take a hit from 7.62 rounds.
Body armor is heavy, so most people prefer the versions without plates on the sides and over the shoulders, as a compromise between weight and protection.
Isn’t it better to give money to the Ukrainian military, though, so they can get what they need? How will you even get the equipment into Ukraine now?