They say the "stop the world" approach, which causes more downtime, is:

  Turn off all writes.
  Wait for 16 to catch up
  Enable writes again — this time they all go to 16
and instead they used a better algorithm:

  Pause all writes.
  Wait for 16 to catch up. 
  Resume writes on 16.
These seem pretty similar.

1. What is the difference in the algorithm? Is it just that in the "stop the world" approach the client sees their txns fail until "wait for 16 to catch up" is done? Whereas in the latter approach the client never sees their txns fail, they just have a bit more latency?

2. Why does the second approach result in less downtime?



> in the "stop the world" approach the client sees their txns fail until "wait for 16 to catch up" is done? Whereas in the latter approach the client never sees their txns fail, they just have a bit more latency?

Yes, this is the main difference. For "stop the world", we imagined a simpler algorithm: for example, instead of writing a script, we could manually toggle a switch.

However, because we wrote the script, users only experience a bit more latency rather than failed transactions.
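
To make the distinction concrete, here's a tiny sketch of the two write paths (hypothetical names, not our actual code): in the first, writes error while the switch is off; in the second, they block on a gate and resume once we flip it back.

  ;; Hypothetical stand-ins for the real write path.
  (defonce current-db (atom :pg-13))
  (defn run-tx! [tx] (println "ran" tx "on" @current-db))

  ;; "Stop the world": writes fail while the switch is off.
  (defonce writes-enabled? (atom true))

  (defn stop-the-world-write! [tx]
    (if @writes-enabled?
      (run-tx! tx)
      (throw (ex-info "writes disabled during upgrade" {:tx tx}))))

  ;; "Pause": writes block on a gate, then run against whichever
  ;; instance is current, so callers see latency instead of errors.
  (defonce write-gate (atom (doto (promise) (deliver :open))))

  (defn pause-writes!  [] (reset! write-gate (promise)))
  (defn resume-writes! [] (deliver @write-gate :open))

  (defn pausing-write! [tx]
    (deref @write-gate) ;; blocks while paused
    (run-tx! tx))

The sketch ignores races around flipping the gate; it's only meant to show where the failed-txns-versus-extra-latency difference comes from.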


> If we went with the ‘stop the world approach’, we’d have about the same kind of downtime as blue-green deployments: a minute or so.

> After about a 3.5 second pause [13], the failover function completed smoothly! We had a new Postgres instance serving requests

> [13] About 2.5 seconds to let active queries complete, and about 1 second for the replica to catch up

Why is the latter approach faster, though? In the "stop the world" approach, wouldn't it still take only about 1 second for the replica to catch up? Where do the other ~59 seconds of write downtime come from?


In the "stop the world approach", I imagined our algorithm to be a bit more manual: for example, we would turn the switch on manually, wait, and then turn it back on.

You make a good point, though: with enough effort it could also be just a few seconds. I updated the essay to reflect this:

https://github.com/instantdb/instant/pull/774/files


Did you test the "stop the world" approach? I wonder how the write downtime compares. The 1 second of replication lag seems unavoidable, but the arbitrary 2.5 seconds of waiting for txns to finish could be removed by just killing all running txns, which your new approach already does for txns longer than 2.5 seconds:

  ;; 2. Give existing transactions 2.5 seconds to complete.
  (Thread/sleep 2500)

  ;; Cancel the rest
  (sql/cancel-in-progress sql/default-statement-tracker)
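
Something like this instead, in other words (just a sketch, reusing the names from your snippet):

  ;; 2. Cancel in-flight transactions immediately; no grace period.
  ;;    Cancelled clients rely on retries rather than the 2.5s wait.
  (sql/cancel-in-progress sql/default-statement-tracker)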

Then you have 2.5 seconds less downtime and I think you can avoid the problem of holding all connections on one big machine.

> Our switching algorithm hinges on being able to control all active connections. If you have tons of machines, how could you control all active connections?

> Well, since our throughput was still modest, we could temporarily scale our sync servers down to just one giant machine

> In December we were able to scale down to one big machine. We’re approaching the limits to one big machine today. [15] We’re going to try to evolve this into a kind of two-phase-commit, where each machine reports their stage, and a coordinator progresses when all machines hit the same stage.
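
For what it's worth, I imagine the stage-reporting coordinator described there would look roughly like this (hypothetical names; in practice the stage map would have to live somewhere shared, e.g. Postgres itself or a coordination service):

  ;; machine-id -> latest reported stage
  (defonce stages (atom {}))

  (defn report-stage! [machine-id stage]
    (swap! stages assoc machine-id stage))

  (defn await-stage
    "Block until every machine has reported `stage`."
    [machine-ids stage]
    (while (not (every? #(= stage (get @stages %)) machine-ids))
      (Thread/sleep 50)))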

I guess it depends on what your SLO is. With your approach, the only failures are txns started before the upgrade that run longer than 2.5 seconds, whereas with the "stop the world" approach there would be a period, lower-bounded by the replication lag, where all txns fail.

Cool work, thanks for sharing!

Edit: A relevant question for the SLO that I'm not considering is how txns make their way from your customers to your DB. Do your customers make requests to your API, with your application servers sending the txns to your Postgres instance? If so, you could set up a reasonable retry policy in your application code and use the "stop the world" approach: once your DB is available again, the retries succeed. Then your customers never see any txns fail (even the long-running ones), just a slight increase in latency. If you're worried about retrying in cases unrelated to this upgrade, you could change the retry policy's configuration shortly before/after the upgrade, or return an error code specific to this scenario so your retry code knows.
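
Roughly what I have in mind on the application-server side (a sketch; with-upgrade-retry and send-tx-to-postgres! are hypothetical names, not your API):

  (defn with-upgrade-retry
    "Call f, retrying on exceptions with a fixed backoff, so txns that
     fail during the upgrade window succeed once the new instance is up."
    [f]
    (let [max-attempts 10
          backoff-ms   500]
      (loop [attempt 1]
        (let [result (try
                       {:value (f)}
                       (catch Exception e
                         (if (< attempt max-attempts) ::retry (throw e))))]
          (if (= ::retry result)
            (do (Thread/sleep backoff-ms)
                (recur (inc attempt)))
            (:value result))))))

  ;; e.g. wrap whatever sends a txn to Postgres:
  ;; (with-upgrade-retry #(send-tx-to-postgres! tx))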

Then you get the best of both worlds: no downtime perceivable to customers, no waiting for 2.5 seconds, and you don't have to write a two-phase-commit approach for it to scale.

If your customers send txns to your Postgres instance directly, this wouldn't work, I think.



