I'm working on a Chrome extension for use in a headless browser in Marginalia Search, to capture information about network traffic, ads, and popovers when visiting a website, to better identify nuisance websites.
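To give a flavor of how captured traffic might feed nuisance detection, here's a minimal sketch of the kind of heuristic such an extension could support: flag a page as ad-heavy when a large share of its network requests go to known ad or tracking hosts. The blocklist, function names, and threshold are all illustrative assumptions, not Marginalia's actual rules.

```javascript
// Hypothetical heuristic: what fraction of a page's requests hit ad hosts?
// AD_HOSTS is a tiny illustrative blocklist, not a real one.
const AD_HOSTS = new Set(["doubleclick.net", "googlesyndication.com", "adnxs.com"]);

// Reduce a URL to its registrable-ish domain (naive two-label suffix).
function hostOf(url) {
  return new URL(url).hostname.split(".").slice(-2).join(".");
}

// Share of requests whose host is on the blocklist, in [0, 1].
function adRequestShare(requestUrls) {
  const ads = requestUrls.filter((u) => AD_HOSTS.has(hostOf(u))).length;
  return requestUrls.length ? ads / requestUrls.length : 0;
}
```

In a real extension the request URLs would come from something like the `chrome.webRequest` listeners rather than being passed in directly; keeping the scoring as a pure function like this makes it easy to test outside the browser.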
A bit of a janky setup, but I've mostly gotten it to do what I want it to do after some head scratching.
Holy shit, so you are uh building a search engine from scratch.
Do you crawl yourself? What is your infrastructure? What is your goal for search.marginalia?
> Holy shit, so you are uh building a search engine from scratch.
Yup
> Do you crawl yourself?
Yup
> What is your infrastructure?
All custom built in Java, sitting on a rack server in a basement in Sweden.
> What is your goal for search.marginalia?
I'm basically building what I feel is lacking in internet search and discovery: tools for finding stuff based on something other than a popularity metric, as those metrics tend to feed into themselves and make the web seem small.
> Can you give some rough indications of how many pages you index in total?
I index like 300 million documents right now, though I crawl something like 1.4 billion (and could index them all). The search engine is pretty judicious about filtering out low-quality documents, mostly because this improves the search results.
> How many pages do you crawl each day?
I don't know if I have a good answer for that. In general the crawling isn't really much of a bottleneck. I try to refresh the index completely every ~8 weeks, and also have some capabilities for discovering recent changes via RSS feeds.
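The RSS-based change discovery mentioned above can be sketched roughly like this: compare item publication dates in a fetched feed against the time of the last full crawl, and only revisit pages with newer items. This is a toy illustration under my own assumptions, with a naive regex over `<pubDate>` tags rather than a real feed parser, and it is not how Marginalia actually implements it.

```javascript
// Hypothetical sketch: pick out RSS item dates newer than the last crawl.
// feedXml is raw RSS text; lastCrawlIso is an ISO-8601 timestamp string.
function recentItemDates(feedXml, lastCrawlIso) {
  const last = Date.parse(lastCrawlIso);
  return [...feedXml.matchAll(/<pubDate>([^<]+)<\/pubDate>/g)]
    .map((m) => m[1].trim())
    .filter((d) => Date.parse(d) > last);
}
```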
> Size of the machine(s) in RAM and HDD?
It's a dual-socket EPYC 7543 SMP machine with 512 GB RAM and something like 90 TB of disk, all NVMe storage.