I was reading another thread about web scraping where someone mentioned that CSS selectors are way quicker than XPath. I'm easy either way, but apart from a more powerful syntax, what other benefits are there?
In my experience, it's not that CSS selectors are "more powerful," but rather "more legible." XPath is for sure more powerful, but it usually has a lower signal-to-noise ratio:
response.css("#the-id")
# vs
response.xpath("//*[@id='the-id']")
Thankfully, Scrapy (well, pedantically, "parsel") allows mixing and matching, so you can use whichever one makes the most sense for a given query.
CSS is nice because it's more readable than XPath for longer queries, and it's friendlier to newer programmers who didn't come up when XML was big.
XPath is generally more powerful for really gnarly things and for backtracking, e.g. "show me the 3rd paragraph that's a sibling of the fourth div id="subhed" and contains the text 'starting'."
Here’s an example of parsing some particularly annoying old-school HTML. I’m not claiming it’s the _best_ way to do it, just that you can, and I’m not sure this one is doable with CSS selectors. https://github.com/openstates/openstates-scrapers/blob/40246...
I'm not sure about quicker. Doesn't Scrapy use elementpath, which converts a CSS query to XPath under the hood, since there is no complete CSSOM available for Python? That's likely because there is no modern, standards-based Python DOM to operate on, so running the query against an lxml tree is probably the best option. The main difference I find is that XPath can return an attribute value, whereas CSS returns the node. You can use either from the terminal in my lib: https://github.com/byteface/domonic (it uses elementpath, like Scrapy).
In my experience, XPath selectors are easier to write but usually result in selectors that are less robust to DOM changes. It is possible to write reliable XPath selectors as well, but I often see XPath selectors breaking because of implicit assumptions about the DOM structure. I don't see this as often for CSS selectors since they encourage you to make more explicit assumptions.
This is in the context of test automation of modern web apps with a virtual DOM. I'm sure things might be different in other areas.
With a large codebase like ours, we find that XPath is more readable, but I understand that's a personal preference. We don't do high-frequency scraping, so we never looked at the performance of CSS vs. XPath.
It's an interesting point I'd like to write more about. Thanks for sharing.