
That's true. Parquet went through the weirdest changes between its various revisions and because it was used for Hadoop data lakes, there's a whole bunch of data that is being stored in legacy formats. Off the top of my head:

- different physical types to store timestamps: INT96 vs INT64

- different ways to interpret timestamps that predate tzdb records (apply the current rule vs the earliest tzdb record)

- different ways to handle proleptic Gregorian dates and timestamps

- different ways to handle time zones (Parquet only has the equivalents of LocalDateTime and Instant, with no OffsetDateTime or ZonedDateTime, and earlier versions of Hive 3 were terribly confused about which was which)

- the decimal type was written differently: as a fixed-length byte array in older versions, and as int32/int64/byte array/binary in newer ones

- the Hadoop ecosystem doesn't support decimals wider than 38 digits, but the file format itself does
