> It's good to know that OOTB duckdb can replace snowflake et al. in these situations, especially with how expensive they are.
Does this article demonstrate that, though? I get, and agree, that a lot of people are using "big data" tools for datasets that are way too small to require them. But this article consists of exactly one very simple aggregation query, and even then it takes 16m to run (in the best case). As others have mentioned, the long execution time is almost certainly dominated by IO because of limited network bandwidth, but network bandwidth is exactly one of the resources you get more of in a distributed computing environment.
But my bigger issue is just that real analytical queries are often quite a bit more complicated than a simple count by timestamp. As soon as you start adding non-trivial compute to the query, or multiple joins (and g*d forbid you have a nested-loop join in there somewhere), or sorting, the single-node execution time is going to explode.
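To make the contrast concrete, here is a rough sketch in Python with DuckDB of the two query shapes being discussed: a single count bucketed by timestamp, like the article's benchmark, versus a query that adds a join, a computed expression, and a sort. The file paths and column names are made up for illustration and aren't taken from the article.

```python
import duckdb

con = duckdb.connect()

# The kind of query the article benchmarks: one count grouped by a
# timestamp bucket, scanned straight off parquet.
# (Paths and column names below are hypothetical placeholders.)
simple = con.execute("""
    SELECT date_trunc('hour', event_ts) AS hour, count(*) AS n
    FROM read_parquet('events/*.parquet')
    GROUP BY 1
    ORDER BY 1
""").fetch_df()

# A more realistic analytical shape: a join against a second dataset,
# a non-trivial expression, and a sort on the result.
realistic = con.execute("""
    SELECT u.segment,
           date_trunc('day', e.event_ts) AS day,
           sum(e.amount * e.fx_rate) AS revenue
    FROM read_parquet('events/*.parquet') e
    JOIN read_parquet('users/*.parquet') u USING (user_id)
    GROUP BY 1, 2
    ORDER BY revenue DESC
""").fetch_df()
```

The first query is a single streaming aggregation; the second has to build a hash table for the join and materialize the sort, which is where a single node starts to hit memory and spill limits rather than just network IO.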
I completely agree, real-world queries involve complicated joins, aggregations, staged intermediary datasets, and further manipulations. Even if you start with a single coherent 650GB dataset, if you have a downstream product based on it, you will have multiple copies and iterations, which also need to be reproducible, tracked in source control, and visualized in other tools in real time. Honestly, yes, parquet and duckdb make all this easier than awk. But they still need to be integrated into a larger system.