Show HN: Web0.cc – Generate clutter, ad and tracker free article pages to share (web0.cc)
136 points by web0_cc on March 13, 2023 | 61 comments
I recently observed that the majority of my family members and friends are not using ad blockers or reader mode, for reasons such as lack of knowledge about plugins & laziness.

So their online reading experience is not pleasant, and as a result very few of them read anything, including what I share.

It's an attempt to give them a clutter-, tracker- & ad-free reading experience right off the bat.



I saw "generate clutter..." in the title and thought "why would I want to generate clutter?"

I know it's a minor point, but I'd go with "Remove clutter, ads and trackers from any article" for a stronger initial proposition.


Mine is

> Generate pages free of ads, trackers and other clutter.

Put the major purpose in the first two words.

[edit - otoh it doesn't generate pages really but generates versions of existing pages]


How about “Declutter web pages, remove ads and trackers”. Then the thing being improved (web pages) is also mentioned at the start and not only at the end.


> Generate clutter-, ad-, and tracker-free article pages to share

I miss hyphens.


This is correct, and without the hyphens it reads like a garden-path sentence[1], which is one with an ambiguous structure that normal readers tend to parse incorrectly.

[1] https://en.wikipedia.org/wiki/Garden-path_sentence


Was also thrown off by the ambiguous grammar. But for me the thought was "Finally, a submitted example of using AI as a means to generate garbage to be consumed by "tech" companies as 'data, the new oil'."

Faculty cannot detect the use of "AI". We hear much about that. But neither can "tech" companies.

As for the submission: unlike archive.is, archive.ph, etc., web0.cc does not require JavaScript, does not require the user to solve a CAPTCHA, does not insist on EDNS0 (ECS),^1 and does not connect to the following third-party URLs:

https://www.google.com/recaptcha/api.js?onload=onloadCallbac...

https://www.google.com/webmasters/tools/ping?sitemap=https:/...

https://a.publir.com/platform/1100.js

https://a.publir.com/sellers.json

https://a.publir.com/ads-txt/505/ads.txt

https://top-fwz1.mail.ru/js/code.js

1. Including disabling the use of Cloudflare DNS. Even when a user is not using EDNS0, archive.ph still collects the user's IP address using DNS. For example,

   https://[YOUR_IP_ADDRESS].[YOUR_COUNTRY].inc1.358424231.pixel.archive.ph/x.gif
Further, archive.ph does not seem to work with archive.org. Web0.cc seems to work:

https://web.archive.org/web/20230313185743if_/https://web0.c...


To hop on this tagline train...open to any critiques.

"Share any link with ads and tracking removed."

Rationale:

Sharing = the most important verb, so it comes first. Followed by the subject (links).

Ads and tracking - people know what ads are, and they might know what tracking implies. "Trackers" may be more abstract.

Clutter - removed this word. What does clutter mean in this context?


My bad.

First of all, thank you all for the feedback & suggestions.

I think these are all good suggestions. I will pick one for sure.


Agreed. Some example links or a single small before/after image would have helped.


“Generate removal of clutter, ads and trackers” and then everyone is happy :D


"Generate a URL that brings the content of any article into focus, leaving the clutter, ads, and tracking behind"


I use archive.is because it removes ads and cookie banners, and doesn't execute the archived webpage's JavaScript.


That's good.

Honest question: would you enjoy reading articles from archive.is on your mobile phone?

As far as I remember, it's not mobile-friendly.


It is mobile-friendly enough for iPhone Safari's reader mode, which is more than enough for me since I have reader mode on by default for articles.

But I agree that it doesn't render pages in a mobile-friendly way by default.


Nice. How about creating a bookmarklet to make it super easy to share a web0.cc link from the current page?

Then people don't need to copy the URL, go to web0.cc, paste it and submit.
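Something like this, roughly (assuming web0.cc accepts the article URL as a ?url= query parameter at its root, which I haven't verified):

    // assumption: the ?url= submission endpoint below is hypothetical
    javascript:location.href='https://web0.cc/?url='+encodeURIComponent(location.href)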

Also maybe if the refresh can be reliably done after a few seconds then you could make the page refresh itself automatically (it's not difficult for the user but it's one less thing to handle). You could tweak the refresh time by how busy your site is.


Thanks for the feedback & suggestions.

>> Nice. How about creating a bookmarklet to make it super easy to share a web0.cc link from the current page?

Sure, I will do it.

>> Also maybe if the refresh can be reliably done after a few seconds then you could make the page refresh itself automatically (it's not difficult for the user but it's one less thing to handle). You could tweak the refresh time by how busy your site is.

That's a good feature, but it involves JS, and for various reasons I decided not to put JS on the client side.


> That's a good feature, but it involves JS

Actually, you don’t need JS. The simplest mechanism is probably:

https://en.m.wikipedia.org/wiki/Meta_refresh

(Both HTML tag and HTTP header routes would probably work)
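A rough sketch of how the "still processing" page could refresh itself without any client-side JS (the markup and the 5-second interval are just illustrative):

    // Sketch: render the pending page so the browser re-requests it after 5 seconds.
    // The HTTP-header route would be a "Refresh: 5" response header instead
    // (non-standard but widely supported).
    const pendingPage = (articleUrl: string): string => `<!doctype html>
    <html>
      <head>
        <meta http-equiv="refresh" content="5">
        <title>Preparing article…</title>
      </head>
      <body>Still processing ${articleUrl}. This page reloads itself automatically.</body>
    </html>`;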



Yeah, due to lazy loading. The fix is simple but it comes at a cost.

Thanks for finding the bug.


If the site ever gets popular expect to get blocked by publishers and/or to receive legal threats. Big publishers really don't like people republishing content without a license.


Reddit page converted to text only. Headline claim checks out. It would be even cooler if this were just in the browser, so that you could link someone to web0.cc/#url and it would do the computation client-side, needing no queueing system and (depending on CORS headers) being slightly more privacy-friendly.

Given that it now has queueing, server-side processing, and forever-links, clearly there's computing power and storage behind it. What will this cost to run? Would I not risk ads being shown when we start to use this frequently?


Is the source code available anywhere?


would be nice


web0 - Zero ads, Zero trackers & Zero Clutter. Awesome.

I think this is a good tool to share articles with non-tech-savvy parents who can't distinguish between ads and real content.


This tool is super nice. I just tried the Wired article about the electron[1] and it handled it really well; Safari reader only got the first couple of paragraphs, while this tool delivered the complete thing.

[1] https://www.wired.com/story/the-electron-is-having-a-magneti...


I can only think of two websites that I really despise visiting, CNN and Fandom. Tried a random link[0][1] from both; the CNN article did not get decluttered at all but actually moved underneath the clutter, while Fandom looked mostly okay but had weird brackets after all the titles.

[0]: https://www.cnn.com/2023/03/13/business/svb-employees-angry-... , https://web0.cc/a/8ev_GjUcz-

[1]: https://deadcells.fandom.com/wiki/Promenade_of_the_Condemned , https://web0.cc/a/tymsJ3ZIH0


Nice domain name! It'd be cool to leverage archive.md as well to help bypass paywalls. I tried it on a First Things link and it was only able to grab the first two paragraphs.


Mozilla's Readability library does the heavy lifting behind the scenes. It's an amazing library, but it fails on some sites.
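For context, the typical server-side pattern with it looks roughly like this (a simplified sketch, not the exact code):

    import { JSDOM } from "jsdom";
    import { Readability } from "@mozilla/readability";

    // Simplified sketch: fetch the page, build a DOM, and let Readability extract
    // the article. Queueing, storage and sanitization are omitted here.
    async function extractArticle(url: string) {
      const html = await (await fetch(url)).text();
      const dom = new JSDOM(html, { url }); // passing url helps resolve relative links
      return new Readability(dom.window.document).parse();
      // => { title, content (clean HTML), textContent, length, excerpt, ... } or null
    }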

I will definitely try to improve the library or find another way to support the mentioned website.

Thanks for the feedback.


Tried to load in a Yahoo article. After waiting about 30 seconds I hit the refresh button and it said "failed after 1 attempts" or some such. I went to click "report a problem" and the Email client prompt showed up (I don't have a desktop email client), so I'm afraid I couldn't report feedback.

Just some feedback for you here, instead.


The prompt comes up from auto-detecting an email address/mailto in the link. You should be able to right-click and "Copy Email Address".


The point is more that I don't want to use email for feedback.


A valid opinion for sure, but not one that was apparent from your original comment.


Thanks for giving it a try. Sure, I will look into it.


Is it reading the web page, parsing it into an AST, sanitizing the AST, then writing it out? I have a similar function on my site: https://roastidio.us

There are 2 caveats:

* There are broken HTML files that somehow still work with a browser.

* There are web pages that rely on JavaScript even for the main text.


* With (legacy) HTML being one of the most complex things to parse, that AST-parsing step poses an increased attack surface. Trying to mitigate that risk might keep the cost of operating this service high in the long run.


Hmm, reminds me of Google's AMP in a way. I applaud the author's work. It's really nice. Unfortunately, if this gets popular, all the negatives that stemmed from Google trying to host AMP content, with similarly good intentions, might come back to haunt us. Hopefully your product remains niche. Great work.


Tried https://nytimes.com/ to see what I get. The output https://web0.cc/a/rLk0Tnn2K- was not what I expected.


It's designed for reading articles, I wouldn't expect the front page to work well.

Here's an opinion article from the Times, seems to work fine:

https://web0.cc/a/zTHtZFEXci

Suggestion for web0: include the final part of the URL in the new URL, so you can at least make a guess as to what it is. Include a GUID at the end if you need to disambiguate.
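Something along these lines (a sketch; the function name and exact sanitization are made up):

    // Sketch: derive a readable id like "english-literature-study-zTHtZFEXci"
    // from the source URL's last path segment plus a short disambiguating suffix.
    function readableId(sourceUrl: string, suffix: string): string {
      const lastSegment =
        new URL(sourceUrl).pathname.split("/").filter(Boolean).pop() ?? "article";
      const slug = lastSegment
        .toLowerCase()
        .replace(/\.[a-z0-9]+$/, "") // drop a trailing extension such as .html
        .replace(/[^a-z0-9]+/g, "-")
        .replace(/^-+|-+$/g, "");
      return `${slug}-${suffix}`;
    }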


Thanks for checking the app, SamBam.

Do you mean like this https://web0.cc/a/zTHtZFEXci?url=https://www.nytimes.com/202...


That's good, although anyone can put anything there.

https://web0.cc/a/zTHtZFEXci?url=https://haxx0r.c0m

I was thinking more along the lines of

https://web0.cc/a/opinion/english-literature-study-zTHtZFEXc...

Of course it doesn't actually grant you security.

What if your original link with the `url=` refused to load if the target didn't match the query parameter? That would actually make it secure, no?
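i.e. roughly this check on the server (a sketch; how the stored source URL is looked up by id is assumed):

    // Sketch: only serve the page if the ?url= hint matches the source URL that
    // was recorded when the article was generated (the lookup side is hypothetical).
    function urlHintMatches(storedSourceUrl: string, hint: string | null): boolean {
      if (hint === null) return true; // no hint supplied, nothing to verify
      try {
        return new URL(hint).href === new URL(storedSourceUrl).href;
      } catch {
        return false; // malformed hint, reject
      }
    }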


> majority of my family members and friends are not using ad blockers

Yep, I'm also surprised when I notice people not having an ad blocker. That's a one-time setup though; I often offer to help and install uBlock Origin, and usually people keep it for a long time afterwards.


Does this insert affiliate links?


No


I think this is a great idea.

The two sites I tried this on still had ads showing and were rendered entirely unreadable.

I don't think that is really a critique. The task is exceedingly difficult. If it works on a lot of sites, that is a victory.


Pretty neat; it's something I've wished for in the past for things like browsing on limited E-Readers.

How'd you build it? I built something vaguely similar (but generating epub) using Mercury Parser.


Why isn't this achieved client side?


It can be; Firefox and some other browsers have a built-in reader view. This website is useful regardless, for sharing articles with people who don't know about reader view or whose browsers don't support it.


CORS makes this difficult. For non-CORS enabled sites you would need to use a server-side proxy. But at that point why not do the filtering on the server to avoid transferring unnecessary data?
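For reference, the client-side route ends up needing a proxy along these lines anyway (minimal sketch, Node 18+ with global fetch assumed):

    import { createServer } from "node:http";

    // Minimal sketch of a CORS-stripping proxy: fetch the target server-side and
    // re-serve it with permissive CORS headers so a browser app could then read it.
    createServer(async (req, res) => {
      const target = new URL(req.url ?? "/", "http://localhost").searchParams.get("url");
      if (!target) {
        res.writeHead(400).end("missing ?url=");
        return;
      }
      const upstream = await fetch(target);
      res.writeHead(upstream.status, {
        "content-type": upstream.headers.get("content-type") ?? "text/html",
        "access-control-allow-origin": "*",
      });
      res.end(await upstream.text());
    }).listen(8080);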


> For non-CORS enabled sites

Which is essentially every website, as CORS restrictions underpin how the web works. Only if there is a public API for a given site would it be possible to do otherwise. And then you'd usually still need URL-to-API mapping, plus any further API sign-up steps that may or may not be required for simple GETs.


CORS was a hacky fix to a bad design that we are still paying for.

But backwards compatibility is nice I guess.


I've always seen it as a feature that you can't access data from outside your own sandbox, kind of the killer thing that made the internet so useful for so many things without worrying about other sites being able to access your domain's data. Now copied to newer OSes and the reason why I trust my mom on Android a thousand times more than on Windows (or Linux, if it had had enough market share to have a serious malware industry).

But I can see how you could see it another way indeed


But you can access other domains' data! You just need to proxy via the server.

As I understand it, the main "bug" that CORS fixes is that cookies were sent by default on cross-domain requests. This basically means that APIs were always authenticated no matter what site you were on. The funny thing is that browsers are stopping these cross-domain cookies anyway by adding domain isolation to prevent tracking, so this main feature of CORS is becoming obsolete. I wish that domain isolation had been the initial fix (at least by default), but at the time it was thought that this backwards incompatibility wasn't worth it, so CORS is what we got.

There is one other feature of CORS, which is protecting resources that are only reachable from the requester's network perspective. However, it isn't an effective solution anyway due to DNS rebinding attacks, so it is a best-effort mitigation at best. Blocking basically all client-side application use (RSS readers, API explorers, URL previews in chat apps ...) by default seems like a really high price to pay for this minor mitigation. A better approach would probably be browsers just blocking requests from public sites to internal IPs by default. That would actually be reliable (as long as you aren't abusing public IP space for your private services), would block requests that aren't CORS-protected (like form posts!) and would avoid the huge cost of CORS.

As it is, if you want to do stuff client-side you need to set up a CORS-stripping proxy server, which is really annoying and creates a dependency on your service. At the end of the day, CORS is a hacky mitigation for the braindead choice to send cross-site cookies by default. If you want real security you should protect your API via a real technical measure, not just hope that the browser will block requests.


This is neat! Can I ask what you use to generate this part: "7 minutes"?


The formula I used is Ceil((CL / WL) / WPM)

CL = content length in characters, WL = average word length (I used 5), WPM = words per minute, reading speed (I used 180).

Not perfect but good enough.
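
In code it's roughly (illustrative, not the exact implementation):

    // Sketch of the estimate above: characters -> approximate word count -> minutes.
    function readingTimeMinutes(contentLength: number, avgWordLength = 5, wordsPerMinute = 180): number {
      return Math.ceil(contentLength / avgWordLength / wordsPerMinute);
    }

    readingTimeMinutes(6300); // => 7, displayed as "7 minutes"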

Thanks, breck, for giving the app a try.


Awesome work! Really cool.

What's the short of why manual refresh is needed?


Does this also remove paywalls? Either intentionally or unintentionally?


Unfortunately, no, it can't.


Does anyone know how archive.is is able to remove paywalls if this can't?


It would either have to be pooled subscriptions, or a maintained set of different per-site workarounds (using specific referrers and/or user-agents, disabling or modifying on-page JavaScript, hidden URLs, etc.), or a mix of the two.


How do you bypass paywalled content?

Thanks for sharing.


I started working on this a few days ago, so I haven't had much time to think about paywalled content. Maybe this week I will try to find a solution.

There won't be much we can do if it's behind CAPTCHAs.

Thanks for giving it a try.




