
I'm working on a Chrome extension for use in a headless browser in Marginalia Search, to capture information about network traffic, ads, and popovers when visiting a website, so it can better identify nuisance websites.

A bit of a janky setup, but I've mostly gotten it to do what I want it to do after some head scratching.



Holy shit, so you are uh building a search engine from scratch. Do you crawl yourself? What is your infrastructure? What is your goal for search.marginalia ?


> Holy shit, so you are uh building a search engine from scratch.

Yup

> Do you crawl yourself?

Yup

> What is your infrastructure?

All custom built in Java, sitting on a rack server in a basement in Sweden.

> What is your goal for search.marginalia ?

I'm basically building what I feel is lacking in internet search and discovery: tools for finding stuff based on something other than a popularity metric, since those metrics tend to feed into themselves and make the web seem small.


I love the 'coffee stain' indicator! How do you rank results?

Can you give some rough indications of how many pages you index in total? How many pages do you crawl each day? Size of the machine(s) in RAM and HDD?

Sorry, many questions, just genuinely intrigued!


> How do you rank results?

There's a ton of factors.

https://github.com/MarginaliaSearch/MarginaliaSearch/blob/ma...

> Can you give some rough indications of how many pages you index in total?

I index like 300 million documents right now, though I crawl something like 1.4 billion (and could index them all). The search engine is pretty judicious about filtering out low quality results, mostly because this improves the search results.

> How many pages do you crawl each day?

I don't know if I have a good answer for that. In general the crawling isn't really much of a bottleneck. I try to refresh the index completely every ~8 weeks, and also have some capabilities for discovering recent changes via RSS feeds.
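
(For illustration only, not Marginalia's actual code, which is Java: a minimal Python sketch of what RSS-based change discovery could look like. It assumes a plain RSS 2.0 feed with `<item>`, `<link>`, and `<pubDate>` elements; the function name `changed_since` and the sample feed are hypothetical.)

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(rss_xml: str, last_crawl: datetime) -> list[str]:
    """Return links of RSS items published after last_crawl (timezone-aware)."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        if not link or not pub:
            continue  # skip items missing a link or a date
        try:
            published = parsedate_to_datetime(pub.strip())
        except (TypeError, ValueError):
            continue  # skip items whose date cannot be parsed
        if published.tzinfo is None:
            # Assumption: treat dates without a zone as UTC
            published = published.replace(tzinfo=timezone.utc)
        if published > last_crawl:
            fresh.append(link.strip())
    return fresh


# Hypothetical feed used only to demonstrate the function
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example</title>
<item><link>https://example.com/old</link><pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>
<item><link>https://example.com/new</link><pubDate>Fri, 01 Mar 2024 12:00:00 GMT</pubDate></item>
</channel></rss>"""

if __name__ == "__main__":
    cutoff = datetime(2024, 2, 1, tzinfo=timezone.utc)
    print(changed_since(SAMPLE_FEED, cutoff))
```

The idea is that pages listed in a feed with a date newer than the last crawl can be re-fetched without waiting for the next full refresh.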

> Size of the machine(s) in RAM and HDD?

It's an EPYC 7543 x2 SMP machine with 512 GB RAM and something like 90 TB disk space, all NVMe storage.



