
OLAP databases can be (and usually are) still relational databases. The difference is that they're optimised for different workloads.

SQLite/MySQL/Postgres/MSSQL etc. are all OLTP databases, optimised around operations on single (or a few) rows at a time.

OLAP databases like ClickHouse, DuckDB, MonetDB, Redshift, etc. are optimised for operating on columns: bulk aggregations, group-bys, pivots, and similar operations over large subsets of a table or whole tables.

If I was recording user purchases/transactions: SQLite.

If I was aggregating and analysing a batch of data on a machine: DuckDB.
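
Roughly, the two shapes look like this (a minimal Python sketch; the schema and numbers are made up for illustration):

    import sqlite3
    import duckdb

    # OLTP shape: record one purchase at a time -- SQLite territory.
    tx = sqlite3.connect("shop.db")
    tx.execute("CREATE TABLE IF NOT EXISTS purchases (user_id INT, amount REAL)")
    with tx:  # commits on success
        tx.execute("INSERT INTO purchases VALUES (?, ?)", (42, 9.99))

    # OLAP shape: scan a million rows and aggregate by group -- DuckDB territory.
    con = duckdb.connect()
    con.execute("CREATE TABLE events AS SELECT * FROM range(1000000) t(i)")
    print(con.execute(
        "SELECT i % 10 AS bucket, count(*) AS n, avg(i) AS mean "
        "FROM events GROUP BY bucket ORDER BY bucket"
    ).fetchall())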

I read an interesting engineering blog post from Spotify(?) where, instead of passing CSVs or rows of JSON around their data pipeline, they passed around SQLite databases. DuckDB would probably be a good fit there.
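
DuckDB can even query a SQLite file in place via its sqlite extension, so a pipeline stage could receive a .db file and run analytical SQL straight over it. A rough sketch (the file, table, and column names are hypothetical):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL sqlite; LOAD sqlite;")  # the extension ships separately
    daily = con.execute(
        "SELECT CAST(ts AS DATE) AS day, SUM(amount) AS revenue "
        "FROM sqlite_scan('stage_input.db', 'transactions') "
        "GROUP BY day ORDER BY day"
    ).df()  # hand the result to Pandas for the next stage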



I gather there may be some differences in the functionality offered, but probably not much. So I presume performance is the biggest area of difference. In that regard: how big are the performance differences on each category's representative workloads?


Before the latest optimization, and using only 1 core, we were seeing 133x the performance of SQLite on a basic group-by or join, and about 4x on a pretty complex query. It was roughly even with Pandas in performance, but it can scale to larger-than-memory data, and now it can use multiple cores! As an example, I could build views from 2 Pandas DataFrames with 2 columns and 1 million rows each, join them, and return the 1-million-row result back to Pandas in 2 seconds, vs. 40 seconds with SQLite/SQLAlchemy... Pretty sweet. I bet DuckDB is going to be even faster now!
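
For anyone curious, that round trip is only a few lines with DuckDB's Python API. A sketch, assuming made-up frame and column names:

    import duckdb
    import numpy as np
    import pandas as pd

    n = 1_000_000
    left = pd.DataFrame({"id": np.arange(n), "x": np.random.rand(n)})
    right = pd.DataFrame({"id": np.arange(n), "y": np.random.rand(n)})

    con = duckdb.connect()
    con.register("left_df", left)    # registers a view over the frame
    con.register("right_df", right)
    joined = con.execute(
        "SELECT l.id, l.x, r.y FROM left_df l JOIN right_df r USING (id)"
    ).df()  # back to Pandas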


Have you compared this to dask from a performance standpoint? That is the larger than memory solution we're currently using for analytics.


How about the other way? When will SQLite perform better than DuckDB?


If your workload is lots of singular INSERTs or UPDATEs.
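
That pattern in a nutshell, with Python's built-in sqlite3 (hypothetical schema):

    import sqlite3

    con = sqlite3.connect("orders.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, paid INT)"
    )

    def place_order() -> int:
        # One small transactional write -- the row-oriented B-tree
        # storage behind SQLite is built for exactly this.
        with con:  # commits on success, rolls back on error
            cur = con.execute("INSERT INTO orders (paid) VALUES (0)")
        return cur.lastrowid

    def mark_paid(order_id: int) -> None:
        with con:
            con.execute("UPDATE orders SET paid = 1 WHERE id = ?", (order_id,))

    mark_paid(place_order())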


I routinely see order-of-magnitude or greater improvements moving analytical workloads from rowstore engines to compressed columnstore engines.

The shortest useful description I can give of what constitutes an analytical workload is that it is table-scan-heavy and often concerned with a small subset of the columns in a table.
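
Concretely, the shape is something like this: a hypothetical wide table stored as Parquet, queried with DuckDB. The query scans every row but touches only two of the columns, which is exactly where a columnstore skips the bytes a rowstore would have to read:

    import duckdb

    print(duckdb.sql("""
        SELECT region, SUM(revenue) AS total
        FROM read_parquet('sales.parquet')  -- columnar: only 2 columns read
        GROUP BY region
    """).df())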


To add to the other comment (albeit with different DBs, because I haven't used DuckDB yet), comparing MSSQL and ClickHouse on similar hardware and the same dataset: for simpler queries, ClickHouse responds in ~30-60 ms vs. 300-400+ ms for MSSQL; for more complex queries, 90-180 ms vs. several seconds, up to about 30 s.

I could add more indices to MSSQL and do all sorts of query optimisation, but out of the box the OLAP database wins hands down for that workload.



