

I work for one of the several European companies building open source software that has been chosen as components of openDesk.

openDesk is solid, legit and serious.

Open source is a requirement. As such, money doesn't go to a startup building proprietary software that gets bought a few years later by a big tech company, with all the investment lost. They audit the licenses to check that they are open source and that the dependencies have compatible licenses.

It's publicly funded, by Germany* (for their own needs, but it will grow larger than them). Their strategy is to give money to established European open source software companies so they improve their software in areas that matter to the project, including integration features (user management, for instance, or file / event sharing with other software, among many things) as well as accessibility. They take all these pieces of software and build a coherent (with a common theme / look & feel), turn-key, feature-rich suite. This strategic decision has its drawbacks, but it allows them to get something working fast with what exists today.

I'm not sure the communication and the business strategy are all figured out / polished yet, but with the high-profile institutions adopting it, that will come. Each company involved wants this to succeed too.

I think this is huge. I'm quite enthusiastic. The software might not be perfect, but with the potential momentum this thing has, it could improve fast, and each piece of open source software that is part of it will improve along the way as well.

* see also caubin's comment


The author is lucky the phrasing wasn't "there won't be another major version of htmx", or even "a third version".


From what I know of Carson from his writing and presentations, he probably worded it that way on purpose knowing he'd eventually do a new version, and he didn't want to miss an opportunity to troll everyone a bit.


> I make sure that as much state as possible is saved in a URL

Do you have advice on how to achieve this (for purely client-side stuff)?

- How do you represent the state? (a list of key=value pair after the hash?)

- How do you make sure it stays in sync?

-- do you parse the hash part in JS to restore some stuff on page load and when the URL changes?

- How do you manage previous / next?

- How do you manage server-side stuff that can be updated client side? (a checkbox that's by default checked and you uncheck it, for instance)
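
To make these questions concrete, here is a minimal sketch of one possible approach, assuming purely client-side state kept as key=value pairs after the hash (the AppState shape, render, and all names are illustrative, not taken from any particular site):

    // Hypothetical client-side state we want reflected in the URL.
    type AppState = { tab: string; showDone: boolean };

    const defaults: AppState = { tab: "inbox", showDone: true };

    // Serialize only the non-default parts into the hash, e.g. #tab=archive&showDone=false
    function writeState(state: AppState): void {
      const params = new URLSearchParams();
      if (state.tab !== defaults.tab) params.set("tab", state.tab);
      if (state.showDone !== defaults.showDone) params.set("showDone", String(state.showDone));
      // pushState creates history entries, so previous/next works; use replaceState
      // instead when a change shouldn't be a separate history step.
      history.pushState(null, "", "#" + params.toString());
    }

    // Parse the hash back into state, falling back to defaults for anything missing
    // (this also covers "checkbox checked by default, unchecked by the user").
    function readState(): AppState {
      const params = new URLSearchParams(location.hash.slice(1));
      return {
        tab: params.get("tab") ?? defaults.tab,
        showDone: params.has("showDone") ? params.get("showDone") === "true" : defaults.showDone,
      };
    }

    // Placeholder renderer: update the UI from the state here.
    function render(state: AppState): void {
      console.log("rendering", state);
    }

    // Restore on page load, and re-render when the user goes back/forward.
    window.addEventListener("popstate", () => render(readState()));
    render(readState());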


One example I think is super interesting is the NWS Radar site, https://radar.weather.gov/

If you go there, that's the URL you get. However, if you do anything with the map, your URL changes to something like

https://radar.weather.gov/?settings=v1_eyJhZ2VuZGEiOnsiaWQiO...

If you take the base64-encoded string, strip off the control characters, and pad it out to a valid base64 string, you get

"eyJhZ2VuZGEiOnsiaWQiOm51bGwsImNlbnRlciI6Wy0xMTUuOTI1LDM2LjAwNl0sImxvY2F0aW9uIjpudWxsLCJ6b29tIjo2LjM1MzMzMzMzMzMzMzMzMzV9LCJhbmltYXRpbmciOmZhbHNlLCJiYXNlIjoic3RhbmRhcmQiLCJhcnRjYyI6ZmFsc2UsImNvdW50eSI6ZmFsc2UsImN3YSI6ZmFsc2UsInJmYyI6ZmFsc2UsInN0YXRlIjpmYWxzZSwibWVudSI6dHJ1ZSwic2hvcnRGdXNlZE9ubHkiOmZhbHNlLCJvcGFjaXR5Ijp7ImFsZXJ0cyI6MC44LCJsb2NhbCI6MC42LCJsb2NhbFN0YXRpb25zIjowLjgsIm5hdGlvbmFsIjowLjZ9fQ==", which decodes into:

{"agenda":{"id":null,"center":[-115.925,36.006],"location":null,"zoom":6.3533333333333335},"animating":false,"base":"standard","artcc":false,"county":false,"cwa":false,"rfc":false,"state":false,"menu":true,"shortFusedOnly":false,"opacity":{"alerts":0.8,"local":0.6,"localStations":0.8,"national":0.6}}

I only know this because I've spent a ton of time working with the NWS data - I'm founding a company that's working on bringing live local weather news to every community that needs it - https://www.lwnn.news/
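
For reference, a rough sketch of what that round trip might look like in browser code (assuming the v1_ prefix is just a version tag, which is what it appears to be; this is not NWS's actual implementation):

    // Encode a settings object into a ?settings=v1_<base64 JSON> style parameter.
    function encodeSettings(settings: object): string {
      // btoa only handles Latin-1; fine here since the JSON is plain ASCII.
      return "v1_" + btoa(JSON.stringify(settings));
    }

    // Decode it back: drop the version prefix, base64-decode, parse the JSON.
    function decodeSettings(param: string): unknown {
      return JSON.parse(atob(param.replace(/^v1_/, "")));
    }

    // Usage: read the current map state back out of the page URL.
    const raw = new URL(location.href).searchParams.get("settings");
    if (raw) console.log(decodeSettings(raw)); // { agenda: { center: [...], zoom: ... }, ... }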


In this case, why encode the string instead of just having the options as plain text parameters?


Nesting, mostly (having used that trick a lot, though I usually sign that record if originating from server).

I've almost entirely moved to Rust/WASM for browser logic, and I just use serde crate to produce compact representation of the record, but I've seen protobufs used as well.

Otherwise you end up with parsing monsters like ?actions[3].replay__timestamp[0]=0.444 vs {"actions": [,,,{"replay":{"timestamp":[0.444, 0.888]}}]}


Sorry, but this is legitimately a terrible way to encode this data. The number 0.8 is encoded as base64-encoded ASCII decimals, and the boolean flags similarly. URLs should not be long, for many reasons, like sharing and preventing them from being cut off.


The “cut off” thing is generally legacy thinking, the web has moved on and you can reliably put a lot of data in the URI… https://stackoverflow.com/questions/417142/what-is-the-maxim...


Links with lots of data in them are really annoying to share. I see the value in storing some state there, but I don’t think there is room for much of it.


What makes them annoying to share? I bet it's more an issue with the UX of whatever app or website you're sharing the link in. Take that stackoverflow link in the comment you're replying to, for example: you can see the domain and most of the path, but HN elides link text after a certain length because it's superfluous.


SO links require just the question ID; short enough to memorize.


Sure, but the SO link was just an example. HN does it with any link, like this one which is 1000 characters long:

https://example.com/some/path?foo=bar&baz=bat&foo=bar&baz=ba...

If the website or app has a good UX for displaying/sharing URLs, the length doesn't really matter.


The URL spec already takes care of a lot of this, for example /shopping/shirts?color=blue&size=M&page=3 or /articles/my-article-title#preface
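
A quick sketch of how little code that flat style needs, using the standard URL / URLSearchParams APIs (the example URL and parameter names roughly follow the comment above):

    // Read flat state straight from the query string and the fragment.
    const url = new URL("https://example.com/shopping/shirts?color=blue&size=M&page=3#top");
    const color = url.searchParams.get("color");              // "blue"
    const page = Number(url.searchParams.get("page") ?? "1"); // 3
    const fragment = url.hash;                                // "#top"

    // Update one piece of state; in a browser you could then reflect it with
    // history.replaceState(null, "", url) without adding a history entry.
    url.searchParams.set("page", String(page + 1));
    console.log(url.toString()); // .../shirts?color=blue&size=M&page=4#top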


The OP gives great guidance on these questions.


XSLT is, to my knowledge, the only client-side technology that lets you include chunks of HTML without using JavaScript and without server-side technology.

XSLT lets you build completely static websites without having to use copy paste or a static website generator to handle the common stuff like menus.


> XSLT lets you build completely static websites without having to use copy paste or a static website generator to handle the common stuff like menus.

How many people ever do this?


Plain text, markup and Markdown to HTML with XSLT:

REPO: https://github.com/gregabbott/skip

DEMO: https://gregabbott.pages.dev/skip

(^ View Source: 2 lines of XML around a .md file)


Parsing the XSLT file fails in Firefox :)


Thanks! Reworked for Firefox.


I did that. You can write .rst, then transform it into XML with 'rst2xml' and then generate both HTML and PDF (using XSL-FO). (I myself also did a little literate programming this way: I added a special reStructuredText directive to mark code snippets, then extracted and joined them together into files.)


If this is "declarative XSL Processing Instructions", apparently 0.001% of global page loads.


skechers.com (a shoe manufacturer) used to do this:

https://web.archive.org/web/20140101011304/http://www.skeche...

They don't anymore. It was a pretty strange design.


Lies in user agent strings were for bypassing bugs, poor workarounds and assumptions that became wrong; they are nothing like what we are talking about.


A server returning HTML for Chrome but not cURL seems like a bug, no?

This is why there are so many libraries to make requests that look like they came from a browser, to work around buggy servers or server operators with wrong assumptions.


> A server returning HTML for Chrome but not cURL seems like a bug, no?

tell me you've never heard of https://wttr.in/ without telling me. :P

It would absolutely be a bug iff this site returned html to curl.

> This is why there are so many libraries to make requests that look like they came from browser, to work around buggy servers or server operators with wrong assumptions.

This is a shallow take; the best counterexample is how Googlebot has no problem identifying itself both in and out of the user agent. Do note that user agent packing is distinctly different from a fake user agent selected randomly from a list of the most common ones.

The existence of many libraries with the intent to help conceal the truth about a request doesn't feel like proof that's what everyone should be doing. It feels more like proof that most people only want to serve traffic to browsers and real users. And it's the bots and scripts that are the fuckups.


Googlebot has no problem identifying itself because Google knows that you want it to index your site if you want visitors. It doesn't identify itself to give you the option to block it. It identifies itself so you don't.


I care much less about being indexed by Google than you might think.

Google bot doesn't get blocked from my server primarily because it's a *very* well behaved bot. It sends a lot of requests, but it's very kind, and has never acted in a way that could overload my server. It respects robots.txt, and identifies itself multiple times.

Google bot doesn't get blocked, because it's a well behaved bot that eagerly follows the rules. I wouldn't underestimate how far that goes towards the reason it doesn't get blocked. Much more than the power gained by being google search.


Yes, the client wanted the server to deliver content it had intended for a different client, regardless of what the service operator wanted, so it lied using its user agent. Exact same thing we are talking about. The difference is that people don't want companies to profit off of their content. That's fair. In this case, they should maybe consider some form of real authentication, or if the bot is abusive, some kind of rate limiting control.


Add "assumptions that became wrong" to "intended" and the perspective radically changes, to the point that omitting this part from my comment changes everything.

I would even add:

> the client wanted the server to deliver content it had intended for a different client

In most cases, the webmaster intended their work to look good, not really to send different content to different clients. That latter part is a technical means, a workaround. The intent of bringing the OK version to the end user was respected… even better with the user agent lies!

> The difference is that people don't want companies to profit off of their content.

Indeed¹, and also they don't want terrible bots to bring down their servers.

1: well, my open source work explicitly allows people to profit off of it - as long as the license is respected (attribution, copyleft, etc)


> Yes, the client wanted the server to deliver content it had intended for a different client, regardless of what the service operator wanted, so it lied using its user agent.

I would actually argue it's not nearly the same type of misconfiguration. The reason scripts that have never been a browser omit their real identity is to evade bot detection. The reason browsers pack their UA with so much legacy data is because of misconfigured servers. The server owner wants to send data to users and their browsers, but through incompetence they've made a mistake. Browsers adapted by including extra strings in the UA to account for the expectations of incorrectly configured servers. Extra strings are the critical part; Googlebot's UA is an example of this being done correctly.


I'm confused, how are those two things related?


The commenter you replied to was implying that the EU does not respect the privacy/freedom of mobile device users.


Okay, thanks.

I was confused because anonymity against the state is hardly the only, or even a main, point of Android forks.

Privacy usually is, but against big tech typically.


Nanny state


More like surveillance state


Which states aren't? And for the love of god do not write US now


> When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.

"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"

Yeah, well…

As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so it doesn't "return a response with a 200-series status code" when they don't agree to this scraping.

As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.

As if technical details were good carriers of human intents.


The locked door is a ridiculous analogy when it comes to the open web. Pretty much all "door" analogies are flawed, but sure let's imagine your web server has a door. If you want to actually lock the door, you're more than welcome to put an authentication gate around your content. A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way.


Any analogy is flawed and you can kill most analogies very fast. They are meant to illustrate a point, hopefully efficiently, not to be mathematically true. They are not to everyone's taste, mine included in most cases. They are mostly fine as long as they are not used to make a point, but only to illustrate it.

I agree with this criticism of this analogy, I actually had this flaw in mind from the start. There are other flaws I have in mind as well.

I developed the point further, without the analogy, in the rest of the comment. How about we focus on the crux of the matter?

> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way

The point is that these scrapers use tricks so that it's difficult not to grant them access. What is unreasonable here is to think that 200 means consent, especially knowing about the tricks.

Edit:

> you're more than welcome to put an authentication gate around your content.

I don't want to. Adding auth so that LLM providers don't abuse my servers and the work I meant to share publicly is not a working solution.


People need to have a better mental model of what it means to host a public web site, and what they are actually doing when they run the web server and point it at a directory of files. They're not just serving those files to customers. They're not just serving them to members. They're not just serving them to human beings. They're not even necessarily serving files to web browsers. They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET. There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

As the web server operator, you can try to figure out if there's a human behind the IP, and you might be right or wrong. You can try to figure out if it's a web browser, or if it's someone typing in curl from a command line, or if it's a massively parallel automated system, and you might be right or wrong. You can try to guess what country the IP is in, and you might be right or wrong. But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.


> They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET.

Legally in the US a “public” web server can have any set of usage restrictions it feels like even without a login screen. Private property doesn’t automatically give permission to do anything even if there happens to be a driveway from the public road into the middle of it.

The law cares about authorized access, not the specific technical implementation of access. Which has caused serious legal trouble for many people when they make seemingly reasonable assumptions, say that access to someURL/A12.jpg also gives them permission to access someURL/A13.jpg, etc.


...but the matter of "what the law cares about" is not really the point of contention here - what matters here is what happens in the real world.

In the real world, these requests are being made, and servers are generating responses. So the way to change that is to change the logic of the servers.


> In the real world, these requests are being made, and servers are generating responses.

Except that’s not the end of the story.

If you’re running a scraper and risking serious legal consequences when you piss off someone running a server enough, then it suddenly matters a great deal independent of what was going on up to that point. Having already made these requests you’ve just lost control of the situation.

That’s the real world we’re all living in, you can hope the guy running a server is going to play ball but that’s simply not under your control. Which is the real reason large established companies care about robots.txt etc.


> There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

I disagree. If your mental model doesn't allow conceptualizing (abusive) scrapers, it is too simplistic to be useful to understand and deal with reality.

But I'd like to re-state the frame / the concern: it's not about any bot or any scraper, it is about the despicable behavior of LLM providers and their awful scrapers.

I'm personally fine with bots accessing my web servers, there are many legitimate use cases for this.

> But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.

It is not about denying access to the content to some and allowing access to others.

It is about having to deal with abuses.

Is a world in which people stop sharing their work publicly because of these abuses desirable? Hell no.


Technically, you are not serving anything - it's just voltage levels going up and down with no meaning at all.


The CFAA wants to have a word. The fact that a server responds with a 200 OK has no bearing on the legality of your request, there's plenty of precedent by now.


How about AI companies just act ethically and obey norms?


Here's my analogy: it's like you own a museum and you require entrance by a "secret" password (your user agent filtering or whatnot). The problem is that the password is the same for everyone, so would you be surprised when someone figures it out or gets it from a friend and visits your museum? Either require a fee (processing power, captcha, etc.) or make a private password (auth).

It is inherently a cat and mouse game that you CHOOSE to play. Implement throttling for clients that consume too many resources on your server / require auth / captcha / JavaScript / whatever whenever the client is using too many resources. If the client still chooses to go through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.


> Either require a fee (processing power, captcha etc) or make a private password (auth)

Well, I shouldn't have to work or make things worse for everybody because the LLM bros decided to screw us.

> It is inherently a cat and mouse game that you CHOOSE to play

No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly to the world but never asked for it to be abused.


That's like saying you shouldn't have to sanitize your database inputs because you never asked for people to SQL inject your database. This stance is truly mind boggling to me


Would you defend attackers using SQL injection? Because it feels like people here, including you, are defending the LLM scrapers against sysadmins and authors who dare to share their work publicly.

Ensuring basic security and robustness of a piece of software is simply not remotely comparable to countering the abuse these LLM companies carry out.

But that's not even the point. And preventing SQL injections (through healthy programming practices) doesn't make things worse for any legitimate user either.


It’s both. You should sanitize your inputs because there are bad actors, but you also categorize attempts to SQL inject as abuse, and there is legal recourse.


When I open an HTTP server to the public web, I expect and welcome GET requests in general.

However,

(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new, and it's for this reason that curl has had options to throttle repeated requests to servers forever. In real life, there are many instances of things being offered for free; it's usually not okay to take it all. Yes, this would be abuse. And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.

(2) there's a difference between (a) a regular user reading my website or even copying and redistributing my content as long as the license of this work / the fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter, theft is not the right word, let's call a spade a spade)

(3) well-behaved robots are expected to respect robots.txt. This is not the law, this is about being respectful. It is only fair that badly behaved robots get called out.

Well behaved robots do not usually use millions of residential IPs through shady apps to "Perform a get request to an open HTTP server".


> robots.txt. This is not the law

In Germany, it is the law. § 44b UrhG says (translated):

(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.

(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.

(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.


I doubt robots.txt would fit. robots.txt allows or disallows access, but it does not state any claim. You can license content you don't own, put it on your website, and then exclude it in robots.txt without that implying any claims of rights to that content.


> A reservation of rights for works accessible online is only effective if it is in machine-readable form.

What if MY machine can't read it though?


That’s your problem.

A solution has been offered and you can adhere to it, or stop doing that thing which causes problems for many of us.


> Well behaved robots do not usually use millions of residential IPs

Some antivirus and parental control software will scan links sent to someone from their machine (or from access points/routers).

Even some antivirus services will fetch links from residential IPs in order to detect malware from sites configured to serve malware only to residential IPs.

Actually, I'm not entirely sure how one would tell the difference between user software scanning links to detect adult content/malware/etc., randos crawling the web searching for personal information/vulnerable sites/etc., and these supposed "AI crawlers" just from access logs.

While I'm certainly not going to dismiss the idea that these are poorly configured crawlers at some major AI company, I haven't seen much in the way of evidence that this is the case.


Occasionally fetching a link will probably go unnoticed.

If your antivirus software hammers the same website several times a second for hours on end, in a way that is indistinguishable from an "AI crawler", then maybe it's really misbehaving and should be stopped from doing so.


Legitimate software that scans links is often well behaved in isolation. It's when that software is installed on millions of computers that, in aggregate, it can behave poorly. This isn't particularly new though. RSS software used to blow up small websites that couldn't handle it. Now, with some browsers speculatively loading links, you can be hammered simply because you're linked to from a popular site, even if no one actually clicks on the link.

Personally, I'm skeptical of blaming everything on AI scrapers. Everything people are complaining about has been happening for decades - mostly by people searching for website vulnerabilities/sensitive info who don't care if they're misbehaving, sometimes by random individuals who want to archive a site or are playing with a crawler and don't see why they should slow them down.

Even the techniques for poisoning aggressive or impolite crawlers are at least 30 years old.


Yes, and sysadmins have been quietly banning those misbehaving programs for the last 30 years.

The only thing that seems to have changed is that today's thread is full of people who think they have some sort of human right to access any website by any means possible, including their sloppy vibe-coded crawler. In the past, IIRC, people used to be a little more apologetic about consuming other people's resources and did their best to fly below the radar.

It's my website. I have every right to block anyone at any time for any reason whatsoever. Whether or not your use case is "legitimate" is beside the point.


The entitlement of so many modern vibe coders (or as we called them before, script kiddies) is absolutely off the charts. Just because there is not a rule or law expressly against what you're doing doesn't mean it's perfectly fine to do. Websites are hosted by and funded by people, and if your shitty scraper racks up a ton of traffic on one of my sites, I may end up on the hook for that. I am perfectly within both my rights and ethical boundaries to block your IP(s).

And just to not leave it merely implied, I don't give a rats ass if that slows down your "innovation." Go away.


> And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free".

The answer to THAT could be: "It is free but leave some for others, you greedy fuck"


Dependency management is work. And almost nobody does this work seriously because it has become unrealistic to do, which is the big concern here.

You now have to audit the hundreds of dependencies. Each time you upgrade them.

Rust is compiled and source code doesn't weigh that much; you could have the compiler remove dead code.

And sometimes it's just better to review and then copy paste small utility functions once.
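
As an illustration (in TypeScript for brevity, but the same trade-off applies to a Rust crate): a utility this small is easy to review once and vendor into the project, instead of pulling in a dependency that has to be re-audited on every upgrade.

    // A typical micro-dependency candidate: split an array into fixed-size chunks.
    // Reviewed once, copied once, owned forever.
    function chunk<T>(items: T[], size: number): T[][] {
      const out: T[][] = [];
      for (let i = 0; i < items.length; i += size) {
        out.push(items.slice(i, i + size));
      }
      return out;
    }

    chunk([1, 2, 3, 4, 5], 2); // [[1, 2], [3, 4], [5]]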


> Rust is compiled and source code doesn't weigh that much, you could have the compiler remove dead code.

I get the impression that one driver for making micro-dependencies in Rust is that code does weigh a lot, because the Rust compiler is so slow.

For a language with a focus on safety, it's a pretty bad choice

