1. The query mainly tested column pruning, and the data you actually access would have been 2 columns plus metadata from the Parquet files, so it probably fit in memory even without streaming (see the sketch after this list).
2. Most of the processing time would be IO-bound on S3, so the access patterns, simultaneous connection limits, etc. would have more of an impact than any processing code.
Love that you went through the pain of trying the different systems, but I'd like to see an actual larger-than-memory query.
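A rough sketch of what I mean by column pruning, assuming DuckDB over S3 (the bucket, region, and column names are made up): only the referenced columns and the Parquet footers get fetched, not the full 650 GB of row data.

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs;")               # HTTP/S3 support for reading remote Parquet
con.sql("LOAD httpfs;")
con.sql("SET s3_region = 'us-east-1';")  # assumed region; credentials via env/instance role

# Only the two referenced columns (plus Parquet metadata) are read from S3,
# which is why the working set can fit in memory even without streaming.
result = con.sql("""
    SELECT station_id, avg(temperature) AS avg_temp
    FROM read_parquet('s3://example-bucket/weather/*.parquet')
    GROUP BY station_id
""").df()
print(result.head())
```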
1. Important point: the query is a projection that only returns a fraction of the 650 GB, and that fraction fits in memory. DuckDB is good at streaming larger-than-memory queries; Polars is less mature there. That would show up in the results (see the sketch after this list).
2. S3 defaults shouldn't prevent all available threads/CPUs from reading the files in parallel, so I would assume the network bandwidth of the VM (or container) would be the bottleneck.
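One way to make the comparison actually exercise larger-than-memory execution, sketched with DuckDB and made-up paths and settings: cap the memory limit so the engine has to spill/stream, and run something like a full-table sort whose intermediate state can't fit in RAM, rather than a narrow projection.

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs;")
con.sql("LOAD httpfs;")
con.sql("SET memory_limit = '4GB';")              # cap RAM so the engine must spill/stream
con.sql("SET preserve_insertion_order = false;")  # lets more operators run in streaming mode
con.sql("SET threads = 16;")                      # match the VM's cores for parallel S3 reads

# A full-table sort keeps its intermediate state well above the memory limit,
# so it actually tests out-of-core execution instead of column pruning.
con.sql("""
    COPY (
        SELECT *
        FROM read_parquet('s3://example-bucket/weather/*.parquet')
        ORDER BY temperature
    ) TO 'sorted.parquet' (FORMAT parquet)
""")
```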