> Data files are useless without a program that knows how to utilize the data.
As I see it, the point is that the exact details of how the bits are encoded are not really interesting from the perspective of the program reading the data.
Consider a program that reads CSV files and processes the data in them. The first column contains a timestamp, the second a filename, and the third a size.
As long as there's a well-defined interface the program can use to extract rows from a file, where each row contains one or more columns of data values with the correct data types, the program doesn't really care whether the data comes from a CSV file. It could just as easily be a 7zip-compressed JSON file, or something else entirely.
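Something like this sketch is what I have in mind; the names (`Row`, `RowSource`, `CsvRowSource`) are made up for illustration and aren't from any real library:

```python
# A minimal sketch of a format-agnostic row interface, assuming the
# three-column layout described above. Names here are hypothetical.
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, Protocol
import csv


@dataclass
class Row:
    timestamp: datetime
    filename: str
    size: int


class RowSource(Protocol):
    """Anything that can yield typed rows, regardless of on-disk format."""
    def rows(self) -> Iterator[Row]: ...


class CsvRowSource:
    """One possible implementation: plain CSV on disk."""

    def __init__(self, path: str) -> None:
        self._path = path

    def rows(self) -> Iterator[Row]:
        with open(self._path, newline="") as f:
            for ts, name, size in csv.reader(f):
                yield Row(datetime.fromisoformat(ts), name, int(size))


def total_size(source: RowSource) -> int:
    # The consumer only sees typed rows; it can't tell (and doesn't care)
    # whether they came from CSV, compressed JSON, or something else.
    return sum(row.size for row in source.rows())
```

A `JsonRowSource` backed by a 7zip-compressed file would satisfy the same `RowSource` protocol, and `total_size` wouldn't change at all.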
Now, granted, this file format isn't well-suited as a generic file format. After all, the decoding API they specify returns data as Apache Arrow arrays, which probably isn't a good fit for all uses.
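To make that concrete, here's roughly what a column-oriented result looks like with pyarrow; this is just an illustration of the shape of the data, not this format's actual decoding API:

```python
# Illustrative only: the shape of a column-oriented (Arrow) result,
# not the format's actual decoding API.
import pyarrow as pa

timestamps = pa.array(["2024-01-01T00:00:00", "2024-01-01T00:05:00"])
filenames = pa.array(["a.log", "b.log"])
sizes = pa.array([1024, 2048])

batch = pa.record_batch([timestamps, filenames, sizes],
                        names=["timestamp", "filename", "size"])

# Columnar consumers get the arrays directly; row-oriented consumers
# have to convert before they can iterate record by record.
print(batch.column(2))     # the whole "size" column as one array
print(batch.to_pydict())   # converted into plain Python lists per column
```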
I think the counter argument here is that you're now including a CSV decoder in every CSV data file. At the data sizes we're talking about, this is negligible overhead, but it seems overly complicated to me. Almost like it's trying too hard to be clever.
How many different storage format implementations will there realistically be?
It does open up the possibility for specialized compressors for the data in the file, which might be interesting for archiving where improved compression ratio is worth a lot.
That makes sense. I think fundamentally you’re trading off space between the compressed data and the lookup tables stored in your decompression code. I can see that amortizing well if the compressed payloads are large or if there are a lot of payloads with the same distribution of sequences though.
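As a rough illustration of that amortization, zstd's trained dictionaries do essentially this: you pay for the shared tables once, and they only earn their keep if enough payloads share the same distribution. A sketch using the Python `zstandard` package (the payloads are made up; nothing here is from the format under discussion):

```python
# Sketch of "shared tables amortize across payloads" using zstd's
# trained dictionaries. Payload contents below are invented.
import zstandard as zstd

# Many small payloads drawn from the same distribution of byte sequences.
payloads = [
    f'{{"timestamp": "2024-01-01T00:00:{i % 60:02d}", '
    f'"filename": "log-{i}.txt", "size": {i * 37}}}'.encode()
    for i in range(2000)
]

# Train the shared dictionary once; its size is paid once, not per payload.
dictionary = zstd.train_dictionary(1024, payloads)

plain = zstd.ZstdCompressor(level=19)
shared = zstd.ZstdCompressor(level=19, dict_data=dictionary)

plain_total = sum(len(plain.compress(p)) for p in payloads)
shared_total = sum(len(shared.compress(p)) for p in payloads)

# The dictionary only pays off if enough payloads reuse it.
print("no dictionary:  ", plain_total)
print("with dictionary:", shared_total + len(dictionary.as_bytes()))
```

A specialized compressor baked into the file is the same trade-off taken further: the distribution knowledge lives in code instead of a dictionary, and the break-even point depends on how much payload reuses it.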