I was reading another thread about webscraping, someone mentioned CSS selectors ...

mdaniel · on May 27, 2022

In my experience, it's not that CSS selectors are "more powerful," but rather "more legible." XPath is for sure more powerful, but it also usually has a lower signal-to-noise ratio:

    response.css("#the-id")
    # vs
    response.xpath("//*[@id='the-id']")

Thankfully, Scrapy (well, pedantically "parsel") allows mixing and matching, using the one which makes the most sense

    response.css(".someClass").xpath(".//*[starts-with(text(), 'Price')]")

showerst · on May 27, 2022

CSS is nice because it's more readable than XPATH for longer queries, and is friendlier to newer programmers who didn't come up when XML was big.

XPATH is generally more powerful for really gnarly things and for backtracking: "Show me the 3rd paragraph that's a sibling of the fourth div id='subhed' and contains the text 'starting'."

dotancohen · on May 27, 2022

  > XPATH is generally more powerful...

That is a convincing argument if you can back it up with an XPATH expression.

showerst · on May 27, 2022

Here’s an example of parsing some particularly annoying old school html. I’m not claiming it’s the _best_ way to do it, just that you can, and I’m not sure this one is doable with selectors. https://github.com/openstates/openstates-scrapers/blob/40246...

mdaniel · on May 27, 2022

Well, the rest of their sentence summed it up pretty well; try and implement that example using CSS selectors

Hell, even "find id=subhead and _go up one element_" isn't possible in CSS because that's not a problem it was designed to solve
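A quick sketch of the "go up one element" step, assuming a made-up snippet and using only the standard library's `xml.etree.ElementTree`, which implements a limited XPath subset that still includes the `..` parent step:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for illustration.
doc = ET.fromstring(
    "<html><body>"
    "<section class='wrapper'><h2 id='subhead'>Title</h2><p>Body</p></section>"
    "</body></html>"
)

# Find the element with id=subhead, then step up to its parent with "..".
parent = doc.find(".//*[@id='subhead']/..")
print(parent.tag, parent.attrib)
```

In parsel or lxml the equivalent full XPath would be `//*[@id='subhead']/..`.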

byteface · on May 28, 2022

I'm not sure about quicker. Doesn't Scrapy use elementpath, which converts a CSS query to XPath under the hood, as there is no complete CSSOM available for Python? There is likely no modern standards-based Python DOM to operate on, so doing it on an lxml tree is probably the best option. I find the main difference is that XPath can return an attribute value, whereas CSS returns the node. You can use either from the terminal in my lib... https://github.com/byteface/domonic (as it uses elementpath like scrapy)

byteface · on May 28, 2022

Sorry, I meant cssselect... https://pypi.org/project/cssselect/ which converts to XPath.

lancebeet · on May 28, 2022

In my experience, XPath selectors are easier to write but usually result in selectors that are less robust to DOM changes. It is possible to write reliable XPath selectors as well, but I often see XPath selectors breaking because of implicit assumptions about the DOM structure. I don't see this as often for CSS selectors since they encourage you to make more explicit assumptions.

This is in the context of test automation of modern web apps with a virtual DOM. I'm sure things might be different in other areas.

PigiVinci83 · on May 27, 2022

With a large codebase like ours, we find that XPath expressions are more readable, but I understand that's a personal preference. We don't do high-frequency scraping, so the performance of CSS vs XPath was not considered. It's an interesting point I'd like to write more about, thanks for sharing.