
Is it really common enough for files not to be annotated with a useful/correct file type extension (e.g. .mp3, .txt) that a library like this is needed?


Yes!

Sometimes a file has no extension. Other times the extension is a lie. Still other times, you may be dealing with an unnamed bytestring and wish to know what kind of content it is.

This last case happens quite a lot in Nosey Parker [1], a detector of secrets in textual data. There, it is possible to come across unnamed files in Git history, and it would be useful to the user to still indicate what type of file it seems to be.

I added file type detection based on libmagic to Nosey Parker a while back, but it's not compiled in by default because libmagic is slow and complicates the build process. Also, libmagic is implemented as a large C library whose primary job is parsing, which makes the security side of me jittery.

I will likely add enabled-by-default filetype detection to Nosey Parker using Magika's ONNX model.

[1] https://github.com/praetorian-inc/noseyparker
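
For the three cases above (no extension, a misleading extension, an unnamed bytestring), the simplest content-based check is to compare leading bytes against known signatures. The following is a minimal sketch of that idea only, not how libmagic or Magika work internally; the `SIGNATURES` table, `ALIASES`, `sniff_type`, and `extension_lies` are hypothetical names made up for illustration, and the table covers only a handful of formats.

```python
import os

# Hypothetical, deliberately tiny signature table (leading "magic" bytes).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),
    (b"\x7fELF", "elf"),
]

# Extensions considered consistent with each sniffed label (an assumption).
ALIASES = {
    "jpeg": {"jpg", "jpeg"},
    "zip": {"zip", "jar", "docx", "xlsx"},
}


def sniff_type(data: bytes) -> str:
    """Guess a type from the leading bytes; 'unknown' if nothing matches."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def extension_lies(filename: str, data: bytes) -> bool:
    """True when a recognized signature contradicts the file's extension."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    sniffed = sniff_type(data)
    if sniffed == "unknown" or ext == "":
        return False  # nothing to compare against
    return ext not in ALIASES.get(sniffed, {sniffed})
```

Calling `sniff_type` on a blob pulled out of Git history needs no filename at all, which is the unnamed-bytestring case. A fixed prefix table like this breaks down quickly on text formats and containers, which is where a heavier detector earns its keep.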


Nothing is ever simple. Even for the most basic .txt files it’s still useful to know what the character encoding is (UTF-8? UTF-16? Latin-whatever?) and what the line endings are (\n, \r\n, \r), as well as whether some maniac removed all the indentation characters and replaced them with a mystery number of spaces.

Then there are all the container formats that have different kinds of formats embedded in them (MOV, MKV, PDF, etc.).
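
To make the plain-text point concrete, here is a rough sketch of guessing an encoding and counting line-ending styles from raw bytes. It only checks byte-order marks and whether the bytes decode as UTF-8, so it is a heuristic that can guess wrong; `guess_encoding` and `count_line_endings` are hypothetical helper names, and the line-ending count assumes an ASCII-compatible encoding (it would miscount UTF-16).

```python
import codecs

# BOMs to check. UTF-32 LE must come before UTF-16 LE because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Heuristic guess: BOM first, then 'does it decode as UTF-8?'."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Could be Latin-1, Windows-1252, or many others; bytes alone
        # cannot settle it, so this sketch does not pretend to.
        return "unknown-8bit"


def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR (ASCII-compatible encodings only)."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,
        "cr": data.count(b"\r") - crlf,
    }
```

A file that reports more than one nonzero count has mixed line endings, which is usually worth flagging on its own.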


A fun read in service of your first point: https://en.wikipedia.org/wiki/Bush_hid_the_facts


At multiple points in my career I've been responsible for APIs that accept PDFs. Many non-tech-savvy people, seeing this, will just change the extension of the file they're uploading to `.pdf`.

To make matters worse, there is some business software out there that will actually bastardize the PDF format and put garbage before the PDF file header. So for some things you end up writing custom validation and cleanup logic anyway.
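
The kind of validation and cleanup described above might look roughly like the sketch below: search for the `%PDF-` marker near the start of the upload and drop anything before it. The function name `strip_pdf_preamble` and the 1024-byte `SEARCH_WINDOW` are assumptions for illustration, not something taken from a particular API or from the software mentioned.

```python
from typing import Optional

PDF_MAGIC = b"%PDF-"
SEARCH_WINDOW = 1024  # assumption: how far into the upload to look


def strip_pdf_preamble(data: bytes) -> Optional[bytes]:
    """Return the bytes starting at the PDF header, or None if no header
    is found within the first SEARCH_WINDOW bytes."""
    offset = data.find(PDF_MAGIC, 0, SEARCH_WINDOW)
    if offset == -1:
        return None  # renamed .docx, image, etc.: reject the upload
    # Caveat: byte offsets recorded inside the PDF may have been computed
    # against the original file start, so a real cleanup step may need to
    # re-save the result with a PDF library rather than only slicing.
    return data[offset:]
```

A `None` result covers the renamed-extension case, while a nonzero offset covers the garbage-before-header case; logging which one occurred helps when talking to whoever produced the file.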


Malware can intentionally obfuscate itself.



