Show HN: Web0.cc – Generate clutter, ad and tracker free article pages to share (web0.cc)
136 points by web0_cc on March 13, 2023 | 61 comments
I recently observed that the majority of my family members and friends are not using ad blockers or reader mode, for reasons such as lack of knowledge about plugins & laziness.

So their online reading experience is not pleasant, and as a result very few of them read anything, including what I share.

It's an attempt to give them a clutter-, tracker- & ad-free reading experience right off the bat.



I saw "generate clutter..." in the title and thought "why would I want to generate clutter?"

I know it's a minor point, but I'd go with "Remove clutter, ads and trackers from any article" for a stronger initial proposition.


Mine is

> Generate pages free of ads, trackers and other clutter.

Put the major purpose in the first two words.

[edit - otoh it doesn't generate pages really but generates versions of existing pages]


How about “Declutter web pages, remove ads and trackers”. Then the thing being improved (web pages) is also mentioned at the start and not only at the end.


> Generate clutter-, ad-, and tracker-free article pages to share

I miss hyphens.


This is correct, and without the hyphens it reads like a garden-path sentence[1], which is one with an ambiguous structure that normal readers tend to parse incorrectly.

[1] https://en.wikipedia.org/wiki/Garden-path_sentence


Was also thrown off by the ambiguous grammar. But for me the thought was "Finally, a submitted example of using AI as a means to generate garbage to be consumed by "tech" companies as 'data, the new oil'."

Faculty cannot detect the use of "AI". We hear much about that. But neither can "tech" companies.

As for the submission: unlike archive.is, archive.ph, etc., web0.cc does not require JavaScript, does not require the user to solve a CAPTCHA, does not insist on EDNS0 (ECS),^1 and does not connect to the following third-party URLs:

https://www.google.com/recaptcha/api.js?onload=onloadCallbac...

https://www.google.com/webmasters/tools/ping?sitemap=https:/...

https://a.publir.com/platform/1100.js

https://a.publir.com/sellers.json

https://a.publir.com/ads-txt/505/ads.txt

https://top-fwz1.mail.ru/js/code.js

1. Including disabling the use of Cloudflare DNS. Even when a user is not using EDNS0, archive.ph still collects the user's IP address using DNS. For example,

   https://[YOUR_IP_ADDRESS].[YOUR_COUNTRY].inc1.358424231.pixel.archive.ph/x.gif
Further, archive.ph does not seem to work with archive.org. Web0.cc seems to work:

https://web.archive.org/web/20230313185743if_/https://web0.c...


To hop on this tagline train...open to any critiques.

"Share any link with ads and tracking removed."

Rationale:

Sharing = the most important verb, so it comes first. Followed by the subject (links).

Ads and tracking - people know what ads are, and they might know what tracking implies. "Trackers" may be more abstract.

Clutter - removed this word. What does clutter mean in this context?


My bad.

First of all, thank you all for the feedback & suggestions.

I think these are all good suggestions. I will pick one for sure.


Agreed. Some example links or a single small before/after image would have helped.


“Generate removal of clutter, ads and trackers” and then everyone is happy :D


"Generate a URL that brings the content of any article into focus, leaving the clutter, ads, and tracking behind"


I use archive.is because it removes ads and cookie banners, and doesn't execute the archived webpage's JavaScript.


That's good.

Honest question: would you enjoy reading articles from archive.is on your mobile phone?

As far as I remember, it's not mobile-friendly.


It is mobile-friendly enough for iPhone Safari's reader mode, which is more than enough for me since I have reader mode on by default for articles.

But I agree that it doesn't render pages in a mobile-friendly way by default.


Nice. How about creating a bookmarklet to make it super easy to share a web0.cc link from the current page?

Then people don't need to copy the URL, go to web0.cc, paste it and submit.
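Something like this, roughly (assuming web0.cc accepts the article URL as a ?url= query parameter at its root, which I haven't verified):

    // assumption: the ?url= submission endpoint below is hypothetical
    javascript:location.href='https://web0.cc/?url='+encodeURIComponent(location.href)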

Also maybe if the refresh can be reliably done after a few seconds then you could make the page refresh itself automatically (it's not difficult for the user but it's one less thing to handle). You could tweak the refresh time by how busy your site is.


Thanks for the feedback & suggestions.

>> Nice. How about creating a bookmarklet to make it super easy to share a web0.cc link from the current page?

Sure, I will do it.

>> Also maybe if the refresh can be reliably done after a few seconds then you could make the page refresh itself automatically (it's not difficult for the user but it's one less thing to handle). You could tweak the refresh time by how busy your site is.

That's a good feature, but it involves JS, and for various reasons I decided not to put JS on the client side.


> That's a good feature, but it involves JS

Actually, you don’t need JS. The simplest mechanism is probably:

https://en.m.wikipedia.org/wiki/Meta_refresh

(Both HTML tag and HTTP header routes would probably work)
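A rough sketch of how the "still processing" page could refresh itself without any client-side JS (the markup and the 5-second interval are just illustrative):

    // Sketch: render the pending page so the browser re-requests it after 5 seconds.
    // The HTTP-header route would be a "Refresh: 5" response header instead
    // (non-standard but widely supported).
    const pendingPage = (articleUrl: string): string => `<!doctype html>
    <html>
      <head>
        <meta http-equiv="refresh" content="5">
        <title>Preparing article…</title>
      </head>
      <body>Still processing ${articleUrl}. This page reloads itself automatically.</body>
    </html>`;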



Yeah, due to lazy loading. The fix is simple but it comes at a cost.

Thanks for finding the bug.


If the site ever gets popular expect to get blocked by publishers and/or to receive legal threats. Big publishers really don't like people republishing content without a license.


Reddit page converted to text only. Headline claim checks out. It would be even cooler if this were just in the browser, so that you could link someone to web0.cc/#url and it would do the computation client-side, needing no queueing system and (depending on CORS headers) being slightly more privacy-friendly.

Given that it now has queueing, server-side processing, and forever-links, clearly there's computing power and storage behind it. What will this cost to run? Would I not risk ads being shown when we start to use this frequently?


Is the source code available anywhere?


would be nice


web0 - Zero ads, Zero trackers & Zero Clutter. Awesome.

I think this is a good tool to share articles with non-tech-savvy parents who can't distinguish between ads and real content.


This tool is super nice. I just tried the Wired article about the electron[1] and it handled it really well; Safari reader only got the first couple of paragraphs, while this tool delivered the complete thing.

[1] https://www.wired.com/story/the-electron-is-having-a-magneti...


I can only think of two websites that I really despise visiting, CNN and Fandom. Tried a random link[0][1] from both; the CNN article did not get decluttered at all but actually moved underneath the clutter, while Fandom looked mostly okay but had weird brackets after all the titles.

[0]: https://www.cnn.com/2023/03/13/business/svb-employees-angry-... , https://web0.cc/a/8ev_GjUcz-

[1]: https://deadcells.fandom.com/wiki/Promenade_of_the_Condemned , https://web0.cc/a/tymsJ3ZIH0


Nice domain name! It'd be cool to leverage archive.md as well to help bypass paywalls. I tried it on a First Things link and it was only able to grab the first two paragraphs.


Mozilla's Readability library does the heavy lifting behind the scenes. It's an amazing library, but it fails on some sites.
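For context, the typical server-side pattern with it looks roughly like this (a simplified sketch, not the exact code):

    import { JSDOM } from "jsdom";
    import { Readability } from "@mozilla/readability";

    // Simplified sketch: fetch the page, build a DOM, and let Readability extract
    // the article. Queueing, storage and sanitization are omitted here.
    async function extractArticle(url: string) {
      const html = await (await fetch(url)).text();
      const dom = new JSDOM(html, { url }); // passing url helps resolve relative links
      return new Readability(dom.window.document).parse();
      // => { title, content (clean HTML), textContent, length, excerpt, ... } or null
    }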

I will definitely try to improve the library or find another way to support the mentioned website.

Thanks for the feedback.


Tried to load in a Yahoo article. After waiting about 30 seconds I hit the refresh button and it said "failed after 1 attempts" or some such. I went to click "report a problem" and the Email client prompt showed up (I don't have a desktop email client), so I'm afraid I couldn't report feedback.

Just some feedback for you here, instead.


The prompt comes up from auto-detecting an email address/mailto in the link. You should be able to right-click and "Copy Email Address".


The point is more that I don't want to use email for feedback.


A valid opinion for sure, but not one that was apparent from your original comment.


Thanks for giving it a try. Sure, I will look into it.


Is it reading the web page, parsing it into an AST, sanitizing the AST, then writing it out? I have a similar function on my site: https://roastidio.us

There are 2 caveats:

* There are broken HTML files that somehow still work with a browser.

* There are web pages that rely on JavaScript even for the main text.


* With (legacy) HTML being one of the most complex things to parse, that AST-parsing step poses an increased attack surface. Trying to mitigate that risk might keep the cost of operating this service high in the long run.


Hmm, reminds me of Google's AMP in a way. I applaud the author's work. It's really nice. Unfortunately, if this gets popular, all the negatives that stemmed from Google trying to host AMP content, with similarly good intentions, might come back to haunt us. Hopefully your product remains niche. Great work.


Tried https://nytimes.com/ to see what I get. The output https://web0.cc/a/rLk0Tnn2K- was not what I expected.


It's designed for reading articles, I wouldn't expect the front page to work well.

Here's an opinion article from the Times, seems to work fine:

https://web0.cc/a/zTHtZFEXci

Suggestion for web0: include the final part of the URL in the new URL, so you can at least make a guess as to what it is. Include a GUID at the end if you need to disambiguate.
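Something along these lines (a sketch; the function name and exact sanitization are made up):

    // Sketch: derive a readable id like "english-literature-study-zTHtZFEXci"
    // from the source URL's last path segment plus a short disambiguating suffix.
    function readableId(sourceUrl: string, suffix: string): string {
      const lastSegment =
        new URL(sourceUrl).pathname.split("/").filter(Boolean).pop() ?? "article";
      const slug = lastSegment
        .toLowerCase()
        .replace(/\.[a-z0-9]+$/, "") // drop a trailing extension such as .html
        .replace(/[^a-z0-9]+/g, "-")
        .replace(/^-+|-+$/g, "");
      return `${slug}-${suffix}`;
    }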


Thanks for checking the app, SamBam.

Do you mean like this https://web0.cc/a/zTHtZFEXci?url=https://www.nytimes.com/202...


That's good, although anyone can put anything there.

https://web0.cc/a/zTHtZFEXci?url=https://haxx0r.c0m

I was thinking more along the lines of

https://web0.cc/a/opinion/english-literature-study-zTHtZFEXc...

Of course it doesn't actually grant you security.

What if your original link with the `url=` refused to load if the target didn't match the query parameter? That would actually make it secure, no?
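i.e. roughly this check on the server (a sketch; how the stored source URL is looked up by id is assumed):

    // Sketch: only serve the page if the ?url= hint matches the source URL that
    // was recorded when the article was generated (the lookup side is hypothetical).
    function urlHintMatches(storedSourceUrl: string, hint: string | null): boolean {
      if (hint === null) return true; // no hint supplied, nothing to verify
      try {
        return new URL(hint).href === new URL(storedSourceUrl).href;
      } catch {
        return false; // malformed hint, reject
      }
    }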


> majority of my family members and friends are not using ad blockers

Yep, I'm also surprised when I notice people not having an ad blocker. That's a one-time setup though; I often offer to help and install uBlock Origin, and usually people keep it for a long time afterwards.


Does this insert affiliate links?


No


I think this is a great idea.

The two sites I tried this on still had ads showing and were rendered entirely unreadable.

I don't think that is really a critique. The task is exceedingly difficult. If it works on a lot of sites, that is a victory.


Pretty neat; it's something I've wished for in the past for things like browsing on limited E-Readers.

How'd you build it? I built something vaguely similar (but generating epub) using Mercury Parser.


Why isn't this achieved client side?


It can be; Firefox and some other browsers have a built-in reader view. This website is useful regardless, for sharing articles with people who don't know about reader view or whose browsers don't support it.


CORS makes this difficult. For non-CORS enabled sites you would need to use a server-side proxy. But at that point why not do the filtering on the server to avoid transferring unnecessary data?
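For reference, the client-side route ends up needing a proxy along these lines anyway (minimal sketch, Node 18+ with global fetch assumed):

    import { createServer } from "node:http";

    // Minimal sketch of a CORS-stripping proxy: fetch the target server-side and
    // re-serve it with permissive CORS headers so a browser app could then read it.
    createServer(async (req, res) => {
      const target = new URL(req.url ?? "/", "http://localhost").searchParams.get("url");
      if (!target) {
        res.writeHead(400).end("missing ?url=");
        return;
      }
      const upstream = await fetch(target);
      res.writeHead(upstream.status, {
        "content-type": upstream.headers.get("content-type") ?? "text/html",
        "access-control-allow-origin": "*",
      });
      res.end(await upstream.text());
    }).listen(8080);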


> For non-CORS enabled sites

Which is essentially every website, as CORS restrictions underpin how the web works. Only if there is a public API for a given site would it be possible to do otherwise. And then you'd usually still need URL-to-API mapping, plus any further API sign-up steps that may or may not be required for simple GETs.


CORS was a hacky fix to a bad design that we are still paying for.

But backwards compatibility is nice I guess.


I've always seen it as a feature that you can't access data from outside your own sandbox, kind of the killer thing that made the internet so useful for so many things without worrying about other sites being able to access your domain's data. Now copied to newer OSes and the reason why I trust my mom on Android a thousand times more than on Windows (or Linux, if it had had enough market share to have a serious malware industry).

But I can see how you could see it another way indeed


But you can access other domains' data! You just need to proxy via the server.

As I understand it, the main "bug" that CORS fixes is that cookies were sent by default on cross-domain requests. This basically means that APIs were always authenticated no matter what site you were on. The funny thing is that browsers are stopping these cross-domain cookies anyway by adding domain isolation to prevent tracking, so this main feature of CORS is becoming obsolete. I wish that domain isolation had been the initial fix (at least by default), but at the time it was thought that this backwards incompatibility wasn't worth it, so CORS is what we got.

There is one other feature of CORS, which is protecting resources that are only reachable from the requester's network perspective. However, it isn't an effective solution anyway due to DNS rebinding attacks, so it is a best-effort mitigation at best. Blocking basically all client-side application use (RSS readers, API explorers, URL previews in chat apps ...) by default seems like a really high price to pay for this minor mitigation. A better approach would probably be browsers just blocking requests from public sites to internal IPs by default. That would actually be reliable (as long as you aren't abusing public IP space for your private services), would block requests that aren't CORS-protected (like form posts!) and would avoid the huge cost of CORS.

As it is, if you want to do stuff client-side you need to set up a CORS-stripping proxy server, which is really annoying and creates a dependency on your service. At the end of the day, CORS is a hacky mitigation for the braindead choice to send cross-site cookies by default. If you want real security you should protect your API via a real technical measure, not just hope that the browser will block requests.


This is neat! Can I ask what you use to generate this part: "7 minutes"?


The formula I used is Ceil((CL / WL) / WPM)

CL = content length in characters, WL = average word length (I used 5), WPM = words per minute, reading speed (I used 180).

Not perfect but good enough.
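
In code it's roughly (illustrative, not the exact implementation):

    // Sketch of the estimate above: characters -> approximate word count -> minutes.
    function readingTimeMinutes(contentLength: number, avgWordLength = 5, wordsPerMinute = 180): number {
      return Math.ceil(contentLength / avgWordLength / wordsPerMinute);
    }

    readingTimeMinutes(6300); // => 7, displayed as "7 minutes"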

Thanks, breck, for giving the app a try.


Awesome work! Really cool.

What's the short of why manual refresh is needed?


Does this also remove paywalls? Either intentionally or unintentionally?


Unfortunately, no, it can't.


Does anyone know how archive.is is able to remove paywalls if this can't?


It would either have to be pooled subscriptions, or a maintained set of different per-site workarounds (using specific referrers and/or user-agents, disabling or modifying on-page JavaScript, hidden URLs, etc.), or a mix of the two.


How do you bypass paywalled content?

Thanks for sharing.


I started working on this a few days ago, so I haven't had much time to think about paywalled content. Maybe this week I will try to find a solution.

There won't be much we can do if it's behind CAPTCHAs.

Thanks for giving it a try.




