
Forgive me for not understanding the difference between columnar storage and otherwise, but why is this so cool? Is the big win that you can send around custom little vector embedding databases with a built in sandbox?

I get the "step change/foundational building block" vibe from the paper - and the project name itself implies more than a little "je ne sais quoi" - but unfortunately I only understand a few sentences per page. The pictures are pretty though, and the colour choices are tasteful yet bold. Two thumbs up from the easily swayed.



> Is the big win that you can send around custom little vector embedding databases with a built in sandbox?

No, this is a compatibility layer for future encoding changes.

For example, ORCv2 has never shipped because we tried to bundle all the new features into a new format version, ship all the writers with the features disabled, then ship all the readers with support, and only then flip the writers to write the new format.

Specifically, there was a new bit-flipped float encoding which sent the exponent, mantissa and sign as separate integers for maximum compression - this would've been so much easier to ship if I could have shipped a wasm shim with the new file and skipped the year+ wait for all readers to support it.
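Roughly the kind of split I mean (a sketch of the idea, not the actual encoder code):

    // Sketch: split each float into sign / exponent / mantissa streams so each
    // stream can be integer-encoded (RLE, delta, bit-packing) on its own.
    import java.util.List;

    class FloatSplitSketch {
        static void split(float[] values, List<Integer> signs,
                          List<Integer> exponents, List<Integer> mantissas) {
            for (float v : values) {
                int bits = Float.floatToRawIntBits(v);   // raw IEEE-754 bits
                signs.add(bits >>> 31);                  // 1 bit
                exponents.add((bits >>> 23) & 0xFF);     // 8 bits
                mantissas.add(bits & 0x7FFFFF);          // 23 bits
            }
        }

        static float join(int sign, int exponent, int mantissa) {
            return Float.intBitsToFloat((sign << 31) | (exponent << 23) | mantissa);
        }
    }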

We'd have made progress with the format, and we'd also have been able to deprecate a reader implementation in code without losing compatibility, since the older files would carry their own decoding information.

Today, something like Spark's variant type would benefit from this - the sub-columnarization it does would be so much easier to ship as bytecode than as an interpreter that has to support all possible recombinations of the split-up columns.

PS: having spent a lot of nights tweaking TPC-H with ORC and fixing OOMs in the writer, it warms my heart to see it sort of hold its own in those bits of the benchmark.


I'm less confident. Your description highlights a real problem, but this particular solution looks like an attempt to shoehorn a technical fix onto a political/people problem. It feels like one of those great ideas that, years later, results in thousands of different decoders, breakages and a maintenance nightmare. Then someone starts an initiative to move decoding out of the bundled files and back to just defining the data format.

Sometimes the best option is to do the hard political work: improve the standard and get everyone moving with it. People have pushed Parquet and Arrow, which are absolutely great technologies that I use regularly, but 8 years after someone asked how to write Parquet in Java, the best answer is to use DuckDB: https://stackoverflow.com/questions/47355038/how-to-generate...

Not having a good Parquet writer for Java shows a poor attempt at pushing forward a standard. Similarly, Arrow has problems in Java land. If they can't be bothered to consider how to actually implement and roll out the standard in a top-5 language, I'm not sure throwing WASM into the mix will fix it.


Columnar storage is great because a complex nested schema can be decomposed into its leaf values and stored as primitives. You can directly access leaf values, avoiding a ton of IO and parsing. Note that all these formats are actually partitioned into groups of rows at the top level.

One big win here is that it's possible to get Apache Arrow buffers directly from the data pages - either by using the provided WASM or bringing a native decoder.

In Parquet this is currently very complicated. Parquet uses the Dremel encoding which stores primitive values alongside two streams of integers (repetition and definition levels) that drive a state machine constructed from the schema to reconstruct records. Even getting those integer streams is hard - Parquet has settled on “RLE” which is a mixture of bit-packing and run-length encoding and the reference implementation uses 74,000 lines of generated code just for the bit-packing part.

So getting Arrow buffers out of Parquet is a significant amount of work. F3 should make this much easier and more future-proof.
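For a sense of what that reassembly involves, here's a toy version for the simplest possible case - a single repeated int32 column - nothing like the general state machine a real schema needs:

    // Toy Dremel reassembly for a single `repeated int32 xs` column.
    import java.util.ArrayList;
    import java.util.List;

    class DremelToy {
        // rep level 0 = value starts a new record, 1 = continues the current list.
        // def level 0 = this record's list is empty (no value), 1 = value present.
        static List<List<Integer>> assemble(int[] values, int[] rep, int[] def) {
            List<List<Integer>> records = new ArrayList<>();
            int v = 0; // index into `values`, which only holds defined entries
            for (int i = 0; i < rep.length; i++) {
                if (rep[i] == 0) records.add(new ArrayList<>());
                if (def[i] == 1) records.get(records.size() - 1).add(values[v++]);
            }
            return records;
        }

        public static void main(String[] args) {
            // Encodes the records {xs:[1,2]}, {xs:[]}, {xs:[3]}
            int[] values = {1, 2, 3};
            int[] rep    = {0, 1, 0, 0};
            int[] def    = {1, 1, 0, 1};
            System.out.println(assemble(values, rep, def)); // [[1, 2], [], [3]]
        }
    }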

One of the suggested wins here is random access to metadata. When using GeoParquet I index the metadata in SQLite; otherwise a spatial query on e.g. Overture Maps would take about 10 minutes instead of a few milliseconds, because I'd need to parse the footers of ~500 files - roughly 150MB of Thrift to parse and query.
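Roughly what that index looks like - a sketch with made-up table and column names, assuming the xerial sqlite-jdbc driver is on the classpath:

    // Index per-row-group bounding boxes in SQLite so a spatial query doesn't
    // have to re-parse hundreds of Parquet footers.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    class FooterIndexSketch {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:footers.db")) {
                try (Statement st = db.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS row_groups(" +
                               "file TEXT, row_group INT, " +
                               "xmin REAL, ymin REAL, xmax REAL, ymax REAL)");
                }
                // ... one INSERT per row group, values pulled from each file's footer ...

                // Now "which row groups might intersect this bbox?" is milliseconds:
                String q = "SELECT file, row_group FROM row_groups " +
                           "WHERE xmax >= ? AND xmin <= ? AND ymax >= ? AND ymin <= ?";
                try (PreparedStatement ps = db.prepareStatement(q)) {
                    ps.setDouble(1, -0.5); ps.setDouble(2, 0.3);   // query bbox x range
                    ps.setDouble(3, 51.2); ps.setDouble(4, 51.7);  // query bbox y range
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getString(1) + " #" + rs.getInt(2));
                        }
                    }
                }
            }
        }
    }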

However, the choice of Google's FlatBuffers is an odd one. Memory safety with FlatBuffers is a known problem [1]. I seriously doubt that generated code which must mitigate these threats will show any real-world performance benefits. Actually - why not just embed a SQLite database?

[1] https://rustsec.org/advisories/RUSTSEC-2021-0122.html


AFAIK the win of columnar storage comes from the fact that you can very quickly scan an entire column across all rows, making very efficient use of OS buffering etc., so queries like select a where b = 'x' are very quick.


> so queries like select a where b = 'x' are very quick

I wouldn’t say “very quick”. They only need to read and look at the data for columns a and b, whereas with a row-oriented approach, since storage is block-based, you will read additional data, often the entire dataset.

That’s faster, but for large datasets you need an index to make things “very quick”. This format supports that, but whether to have an index is orthogonal to being row- or column-oriented.
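To make that concrete, here's a toy version of such a scan over plain in-memory columns (names made up) - it still visits every value of b, which is why an index matters if you want point lookups to be genuinely fast:

    import java.util.ArrayList;
    import java.util.List;

    class ColumnFilterToy {
        // select a where b = 'x': only columns a and b are touched at all,
        // but column b is still scanned end to end.
        static List<Long> selectAWhereB(long[] colA, String[] colB, String x) {
            List<Long> out = new ArrayList<>();
            for (int i = 0; i < colB.length; i++) {
                if (x.equals(colB[i])) out.add(colA[i]);
            }
            return out;
        }
    }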


Sum(x) is a better example. Indexing x won’t help when you need all the values.


Another useful one is aggregations. Think sum(), concat(), max(), etc. You can operate on the column.

This is in contrast to row-based storage, where you have to scan the full row just to get one column. Think of how you'd usually read a CSV (read line, parse line).
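A toy contrast of the two layouts (class and field names made up), summing one column:

    class LayoutToy {
        // Row-oriented: you iterate whole records (think: read line, parse line)
        // even though only `amount` is needed.
        record Order(long id, String customer, double amount, String notes) {}

        static double sumRowOriented(Order[] orders) {
            double total = 0;
            for (Order o : orders) total += o.amount();
            return total;
        }

        // Column-oriented: `amount` is already a contiguous array, so the scan
        // reads only the values it needs and is cache/vectorization friendly.
        static double sumColumnOriented(double[] amounts) {
            double total = 0;
            for (double a : amounts) total += a;
            return total;
        }
    }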


When b = 'x' is true for many rows and you select * or multiple columns, it's the opposite, because reading all of a row's data is slower in column-based data structures than in row-based ones.

IMO it's easier to explain in terms of workload:

- OLTP (T = transactional workloads): row-based, for operating on rows
- OLAP (A = analytical workloads): column-based, for operating on columns (sum/min/max/...)



