Can you shed light on the biggest challenges and steps you (and the team) took to overcome them to succeed with the rewrite? We often hear about how major rewrites can fail or be massively delayed, but you seem to have succeeded.
This is a difficult one to answer succinctly, but I'll leave some quick thoughts.
One of the things that made this tricky is that we weren't just replacing some small system behind a single API. We fundamentally changed the underlying architecture of the database and built it around an entirely different querying paradigm: a columnar query engine paired with a database architecture designed for the cloud and object storage.
So we made a bunch of changes all at once. We didn't start out that way, though. We wanted to enable some things in the DB like unlimited cardinality, tiered data storage, SQL capabilities, and a bunch more. Once we laid all that out, I knew we'd be rewriting the database one way or another.
This was in early 2020. And I figured if we were going to look at some significant rewrite, I'd probably want to do it in Rust. But rewriting your core in a new language is a highly risky endeavor. Honestly, if you can figure out a way to do it iteratively, that's what I'd recommend. A big bang rewrite is the worst possible thing you can do. And it's super stressful.
But... I didn't see a way around that. So we started small: me and one other person working on it, beginning around March of 2020. Then we added another team member in May (hey Andrew). The three of us spent the next 6 months treating it as a kind of research project. We evaluated building it around existing database engines (like DuckDB and ClickHouse) and looked at what tools we'd want to use.
By August of 2020 we'd settled on building it in Rust with Apache Arrow, Apache DataFusion, and Parquet as the persistence format. I announced this crazy plan in November of 2020 at our online conference and said we were hiring.
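To make that stack concrete, here's a minimal sketch of the pattern it enables: Parquet files on disk (or object storage) queried through DataFusion's columnar engine. The file name and query are made up for illustration, and the exact API surface varies a bit between DataFusion versions:

```rust
use datafusion::prelude::*;

// Requires the `datafusion` and `tokio` crates.
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Expose a Parquet file as a SQL-queryable table.
    ctx.register_parquet("cpu", "cpu.parquet", ParquetReadOptions::default())
        .await?;

    // DataFusion plans and executes this on a vectorized, columnar engine.
    let df = ctx
        .sql("SELECT host, avg(usage) FROM cpu GROUP BY host")
        .await?;
    df.show().await?;
    Ok(())
}
```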
Over the first 3 months of 2021 we formed a team of 9 people around it. Everyone else in the company was still focused on everything else we were doing, so the majority of our engineering efforts were pointed elsewhere. I think this was critical. Actually, it was quite difficult to have 9 people this early in the project. We hadn't originally planned to scale up that quickly, but we had such a flood of great people interested in joining (new hires and internal transfers) that we decided to go for it.
Over the next few years we kept this small group working on the new DB while everyone else worked on previous versions of the product. In mid-2022 we were far enough along to bring the database up alongside one of our production environments and start mirroring workloads onto the new DB. That mirroring was critical over the following 6 months or so.
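For anyone curious what workload mirroring looks like in practice, here's a rough sketch of the idea (not our actual code; the trait and function names are hypothetical): production traffic stays on the old system, and a best-effort copy of each write is teed to the new one.

```rust
// Hypothetical sketch of workload mirroring: production is served by the
// existing system, while a copy of each request also exercises the new
// database under real load, without any user-visible impact.

trait Database {
    fn handle_write(&self, line_protocol: &str) -> Result<(), String>;
}

fn mirrored_write(
    primary: &dyn Database,
    shadow: &dyn Database,
    body: &str,
) -> Result<(), String> {
    // The production system remains the source of truth.
    let result = primary.handle_write(body);

    // The shadow write is best-effort: errors are logged, never
    // surfaced to the client.
    if let Err(e) = shadow.handle_write(body) {
        eprintln!("shadow write failed: {e}");
    }
    result
}
```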
We started getting more people from the engineering team looped into the effort in the 4 months leading up to the first launch.
Starting with a small team and scaling up as you get farther along is critical, I think.
There's so much more I could probably write about this, but I'll leave it at this for now :)
Thanks for stopping in. I've been seeing a lot of InfluxDB 3.0 content in the past few days. It would be helpful, for me at least, to see more comparison between 2.x and 3.0. Not sure if there is a changelog or list of things that were added/deleted/are now incompatible between versions. Cheers
The differences between 2.x and 3.x are quite significant. The 3.0 database was a ground-up rewrite in a new language (v1 and v2 were in Go, v3 is in Rust).
InfluxDB v2 was all about the Flux language and a much broader set of API capabilities along with an integrated UI.
For 3.0 we focused on the core database technology. We were able to bring the 1.x APIs forward, which means 3.0 supports both InfluxQL and SQL natively. We were only able to add Flux support through a separate process and a lower-level gRPC API that the two use to communicate.
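As a quick illustration of what "both natively" means, here's roughly the same query written in each dialect (illustrative only: the measurement and field names are invented, and functions like date_bin depend on the engine version). Rust here is just a wrapper to hold the two strings:

```rust
fn main() {
    // InfluxQL, carried forward from the 1.x API:
    // mean of `usage` per 5-minute window over the last hour.
    let influxql = "SELECT MEAN(usage) FROM cpu \
                    WHERE time > now() - 1h GROUP BY time(5m)";

    // Roughly equivalent SQL, executed natively by the columnar engine.
    let sql = "SELECT date_bin(INTERVAL '5 minutes', time) AS bucket, avg(usage) \
               FROM cpu \
               WHERE time > now() - INTERVAL '1 hour' \
               GROUP BY bucket";

    println!("{influxql}\n\n{sql}");
}
```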
The underlying database architecture is also completely different. v1 and v2 are essentially an inverted index paired with a time series store. v3 organizes data into larger Parquet files and pairs that with a columnar query engine (Apache DataFusion) to execute fast queries against it.
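To give a feel for the columnar side of that, here's a small sketch using the arrow crate (the same in-memory format DataFusion operates on). The schema is a made-up time series table, not InfluxDB's actual internal schema:

```rust
use std::sync::Arc;

use arrow::array::{Float64Array, StringArray, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    // Columnar layout: each field is one contiguous array, which is what
    // lets a vectorized engine scan and aggregate a single column quickly.
    let schema = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("host", DataType::Utf8, false),
        Field::new("usage", DataType::Float64, false),
    ]));

    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(TimestampNanosecondArray::from(vec![1_000, 2_000, 3_000])),
            Arc::new(StringArray::from(vec!["a", "b", "a"])),
            Arc::new(Float64Array::from(vec![0.5, 0.7, 0.6])),
        ],
    )?;

    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```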
I (or someone on our team) should probably write a detailed post about the underlying database architecture to highlight the differences between the versions.
We built 3.0 mainly to accomplish some things that we were unable to deliver in v1 or v2:
* Unlimited cardinality
* Tiered data storage
* Fast analytic queries
* SQL compatibility
* Bulk data import and export (coming soon to v3)
Then there are the systems architecture changes highlighted in this blog post. v1 InfluxDB was a monolithic database that combined all of these components in one process. The v3 design allows us to scale ingest, query, and compaction separately, which is something that kept coming up in larger-scale use cases.
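A hand-wavy sketch of that separation (the component and method names are hypothetical, not our actual interfaces): each piece runs and scales on its own, coordinating only through shared object storage.

```rust
// Hypothetical component boundaries for a disaggregated time series DB.
// Each trait could be deployed and scaled as its own service; the only
// shared state is Parquet files (plus a catalog) in object storage.

trait Ingester {
    /// Accept writes, buffer in memory, persist Parquet to object storage.
    fn write(&mut self, line_protocol: &str);
}

trait Querier {
    /// Answer SQL/InfluxQL queries by reading Parquet from object storage.
    fn query(&self, query: &str) -> Vec<String>;
}

trait Compactor {
    /// Background job: merge many small Parquet files into fewer large ones.
    fn compact(&mut self);
}
```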