I think it's common for tools to assume that file names are valid unicode, not s...

SAI_Peregrinus · 2025-03-19T14:34:32 1742394872

Common, but rather stupid. Filenames aren't even text. `fd` is written in Rust and uses std::path for paths, while the regex pattern defaults to matching text. That said, matching non-UTF-8 bytes is possible by turning off the Unicode flag: `(?-u:\x??)`, where `??` is a raw byte in hex, e.g. `(?-u:\xFF)` for OP. See "Opt out of Unicode support"[1] in the regex docs.

[1] https://docs.rs/regex/latest/regex/#opt-out-of-unicode-suppo...

jcranmer · 2025-03-19T15:39:41 1742398781

IMHO, the kernel should have filesystem mount options to just reject path names that are non-UTF-8, and distros should default to those when creating new filesystems on new systems.

For >99.99% of use cases, file paths are textual data, and people do expect to view them as text. It's high time kernels started enforcing that paths act as text, because non-textual paths constitute a security vulnerability for a good deal of software while providing exceedingly low value.

xorcist · 2025-03-19T18:32:02 1742409122

So just turn off support for external media, which could possibly be created on other platforms, and all old file systems? Legacy platforms, like modern Windows which still uses UCS-2 (or some half broken variant thereof)?

While I support the UTF-8 everywhere movement with every fiber of my body, that still sounds like a hard sell for all vintage computer enthusiasts, embedded developers, and anyone else, really.

jcranmer · 2025-03-19T22:05:55 1742421955

As I said in another comment, you can handle the legacy systems by giving a mount option that transcodes filenames using Latin-1. (Choosing Latin-1 because it's a trivial mapping that doesn't require lookup tables). UCS-2 is easily handled by WTF-8 (i.e., don't treat an encoded unpaired surrogate as an error).
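To illustrate why Latin-1 is called a trivial mapping above, here is a minimal userspace sketch (not a kernel mount option). The function names `latin1_to_string` and `string_to_latin1` are hypothetical, chosen only for this example. Every byte maps to the Unicode code point with the same value, so decoding needs no table and cannot fail.

```rust
/// Decode Latin-1 (ISO-8859-1) bytes into a UTF-8 `String`.
/// Each byte 0x00..=0xFF becomes the code point with the same numeric value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Encode a string back to Latin-1 bytes.
/// Returns `None` if any character lies above U+00FF and so has no Latin-1 byte.
fn string_to_latin1(s: &str) -> Option<Vec<u8>> {
    s.chars().map(|c| u8::try_from(c).ok()).collect()
}
```

Because the mapping is one-to-one in both directions, a name such as the bytes `a`, `0xFF`, `.txt` decodes to `"aÿ.txt"` and encodes back to the original bytes unchanged.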

The reality is that non-UTF-8 filenames already break most modern software, and it's probably more useful for the few people who need to care about it to figure out how to make their workflows work in a UTF-8-only filename world rather than demanding that everybody else has to fix their software to handle a case where there kind of isn't a fix in the first place.

burntsushi · 2025-03-19T15:11:07 1742397067

What is text? Are the contents of files text? How does one determine if something is text?

(I'm the author of ripgrep, and this is my way of gently suggesting that "filenames aren't even text" isn't an especially useful model.)

SAI_Peregrinus · 2025-03-19T15:46:11 1742399171

Oh, I agree that "text" isn't well-defined. The best I can come up with is that "text" is a valid sequence of bytes when interpreted in some text encoding. I think that something designed to search filenames should clearly document how to search for all valid filenames in its help or manual, not require looking up the docs of a dependency. Filenames are paths, which are weird on every platform. 99% of the time you can search paths using some sort of text encoding, but IMO it should be pointed out in the man page that non-Unicode filenames can actually be searched for. `fd`'s man page just links to the regex crate docs; it doesn't generate a man page of its own for them and point to that.

As for "filenames aren't even text" not being a useful model, to me text is a `&str` or `String` or `OsString`, filenames are a `Path` or `PathBuf`. We have different types for paths & strings because they represent different things, and have different valid contents. All I mean by that is the types are different, and the types you use for text shouldn't be the same as the types you use for paths.
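A small sketch of the type distinction described above, assuming a Unix target (the `std::os::unix` extension trait is not available elsewhere). The helper name `path_from_bytes` is hypothetical. It shows that a `Path` can hold bytes that can never become a `&str`.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait
use std::path::Path;

/// Build a `Path` directly from raw bytes.
/// On Unix an `OsStr` accepts arbitrary bytes; UTF-8 validity is not checked here.
fn path_from_bytes(bytes: &[u8]) -> &Path {
    Path::new(OsStr::from_bytes(bytes))
}
```

With this helper, `path_from_bytes(b"notes.txt").to_str()` yields `Some("notes.txt")`, while `path_from_bytes(b"caf\xFF.txt").to_str()` yields `None`. The only way to get a string from the second path is a lossy conversion such as `to_string_lossy()`, which substitutes U+FFFD for the invalid byte.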

burntsushi · 2025-03-19T17:40:34 1742406034

I'd suggest engaging with this question, which I think you ignored:

> Are the contents of files text?

It is perhaps the most pertinent of all. What is the OS interface for files? Does it tell you, "This is a UTF-8 encoded text file containing short human readable lines"? No, it does not. All you get is bytes, and if you're lucky, you can maybe infer something from the extension of the file's path (but this is only a convention).

How do you turn bytes into a `&str`? Do you think ripgrep converts an entire file to `&str` before searching it? Does ripgrep even do UTF-8 validation at all? No no no, it does not.

I'd suggest giving https://burntsushi.net/bstr/#motivation-based-on-concepts and the crate docs of https://docs.rs/bstr/latest/bstr/ a read.

To be clear, there is no perfect answer here. You've got to do the best with what you've got. But the model I work with is, "treat file contents and file paths as text until you heuristically believe otherwise." But I work on Unix CLI tooling that needs to be fast. For most people, I would say, "validate file contents and file paths as text" is the right model to start with.
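The two models contrasted above can be sketched roughly as follows. This is an illustration only, not ripgrep's code. The function names, the NUL-byte test, and the 8 KiB sniff length are all assumptions made for the example.

```rust
/// "Validate" model: accept the bytes as text only if they are valid UTF-8.
/// This scans the whole input before any searching can start.
fn validate_as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// "Heuristic" model: assume text unless something suggests otherwise.
/// Assumption for this sketch: a NUL byte in the first 8 KiB means "binary".
fn looks_like_text(bytes: &[u8]) -> bool {
    const SNIFF_LEN: usize = 8 * 1024;
    let head = &bytes[..bytes.len().min(SNIFF_LEN)];
    !head.contains(&0)
}
```

The models disagree on input like `b"caf\xFF"`: strict validation rejects it, while the heuristic still treats it as searchable text because it contains no NUL byte.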

> but IMO it should be pointed out in the man page

Docs can always be improved, sure, but that is not what I'm trying to engage with you about. :-)

SAI_Peregrinus · 2025-03-19T17:54:10 1742406850

I'd say some files are text, some are not. And I agree that there's no good way to tell! I think ripgrep has a much harder job than fd, because at least fd can always know that all paths it's searching are valid paths for the OS in use.

burntsushi · 2025-03-19T18:27:57 1742408877

My point is that you can apply the answer to the question "are the contents of files text?" to the question "are file paths text?"

SAI_Peregrinus · 2025-03-19T20:48:14 1742417294

I get it. I think you're right that they both have the same problem, but paths have a std type for handling them, while file contents don't. As long as you're on an OS you can use std::path::Path (or PathBuf) for paths, and ensure they're valid. I suppose I should have said "Paths aren't Strings" or similar: they might be text but they might not be, and fundamentally the issue is that they're different data types. "Text" isn't universally defined.

burntsushi · 2025-03-20T02:14:33 1742436873

You can't really just use `std::path::Path` though. Because it's largely opaque. How do you run a regex or a glob on a `std::path::Path`? Doing a UTF-8 check first is expensive at ripgrep's scale. So it just gets it to `&[u8]` as quickly as it can and treats it as if it were text. (These days you can use `OsStr::as_encoded_bytes`.)
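As a rough sketch of the "get it to `&[u8]`" approach mentioned above, the following uses `OsStr::as_encoded_bytes` (stable since Rust 1.74) to run a plain byte-substring search over a path with no UTF-8 check. The function name `path_contains` is hypothetical, and the naive window scan stands in for a real regex or glob engine.

```rust
use std::path::Path;

/// Report whether `needle` occurs anywhere in the path's encoded bytes.
/// No UTF-8 validation is done; the path is searched as opaque bytes.
fn path_contains(path: &Path, needle: &[u8]) -> bool {
    let haystack = path.as_os_str().as_encoded_bytes();
    if needle.is_empty() {
        return true; // `windows(0)` would panic, so handle the empty needle first
    }
    haystack.windows(needle.len()).any(|w| w == needle)
}
```

For example, `path_contains(Path::new("/tmp/report.txt"), b"report")` is true. On Unix the encoded bytes are the raw filename bytes, so a needle containing bytes like `0xFF` can match as well.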

`std::path::Path` isn't necessarily a better design. I mean, on some days, I like it. On other days, I wonder if it was a mistake because it creates so much ceremony. And in many of those cases, the ceremony is totally unwarranted.

And I'm saying this as someone who has been adjudicating Rust's standard library API since Rust 1.0.

jakeogh · 2025-03-19T17:07:37 1742404057

Tools must be general. I'm not going to invest time using a new one if it can't handle arbitrary valid filesystems. But that's just me.

https://github.com/jakeogh/angryfiles

burntsushi · 2025-03-19T17:32:24 1742405544

`fd` does, as pointed out in this thread in numerous places. So I don't know what your point is, and you didn't engage at all with my prompt.