How to Migrate Data With Minimal Downtime

Igor Antonov
3 mins

  • Tech
  • How to

One of the recent technical tasks we have been working on is migrating our services’ persistent storage from MongoDB to MySQL.

The usual way we do that is to put all the migration logic (which talks to both databases) into a single DB migration and hope everything goes fine. This leads to a “very minor risk of serious data damage”: the source data may be in an unexpected format, one of the DB connections may fail during the migration, or there may be so much data that the migration takes hours (or worse, fails with a timeout somewhere), and so on.

I would like to describe a general approach to the data migration problem, which allows us to reduce the risks mentioned above.

The approach described in this post is definitely not the only one possible, and it is not a ready-made recipe for your migration. You need to decide whether to use this approach or another one, which steps to take, and how to implement those steps for your specific service and data.

The Problem

Let’s say we have an “old” DB, which is used by some service, and a “new” DB. We are switching all the operations and moving all the data from the “old” DB to the “new” one.

Unlike the “DB migration” path, we want to have a “fallback” to the old DB in cases where there are some issues with the new one or with the data migration process itself. And using this fallback must not affect the data consistency of the old DB in any way.

Additionally, while the migration can include “critical” downtime parts, like configuring the databases and updating the services, the data should be moved in the background while the service is running.

It’s quite obvious that you can’t start the new DB, migrate the data to it, and then shut down the old DB without any downtime. There will be a period of time when both DBs are working simultaneously in the live environment, and we have to migrate data and operations gradually.

Steps To Success

Unsurprisingly, the best way to approach the problem is to decompose it.

We can express the way we regularly work with DBs using two main kinds of operations — reading the data and writing it. While there are some additional operations like atomic or locking ones, they are used in very specific cases and should be reviewed case-by-case for migration too.

Let’s define the “roles” for the databases on the service, depending on whether they are used for reading or writing the data:

  1. Read Primary (RP) — read database, which is considered the Source of Truth for the data.
  2. Read Secondary (RS) — read database, which is an additional source, but not necessarily a consistent one.
  3. Write Primary (WP) — write database, which is considered the Source of Truth for the data.
  4. Write Secondary (WS) — write database, which is an additional target for updates. Writes to WS happen after a successful write to WP.
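
To make these roles concrete, here is a minimal sketch in Python of a repository wrapper that routes reads and writes according to the roles. The store class and all method names are hypothetical stand-ins for your real MongoDB and MySQL clients; the point is only that swapping which database plays which role becomes a configuration change rather than a code change.

    class InMemoryStore:
        """Stand-in for a real database client (MongoDB or MySQL)."""
        def __init__(self):
            self.records = {}

        def get(self, record_id):
            return self.records.get(record_id)

        def put(self, record_id, data):
            self.records[record_id] = data


    class MigratingRepository:
        """Routes reads and writes according to the RP/RS/WP/WS roles."""
        def __init__(self, read_primary, write_primary,
                     read_secondary=None, write_secondary=None):
            self.read_primary = read_primary        # RP: source of truth for reads
            self.read_secondary = read_secondary    # RS: additional, possibly inconsistent
            self.write_primary = write_primary      # WP: source of truth for writes
            self.write_secondary = write_secondary  # WS: written only after WP succeeds

        def read(self, record_id):
            record = self.read_primary.get(record_id)
            if record is None and self.read_secondary is not None:
                record = self.read_secondary.get(record_id)
            return record

        def write(self, record_id, data):
            self.write_primary.put(record_id, data)        # WP must succeed first
            if self.write_secondary is not None:
                self.write_secondary.put(record_id, data)  # then mirror the write to WS


    # Example: the old DB is still the primary, the new DB receives secondary writes.
    old_db, new_db = InMemoryStore(), InMemoryStore()
    repo = MigratingRepository(read_primary=old_db, write_primary=old_db,
                               write_secondary=new_db)
    repo.write("user:1", {"name": "Ada"})
    print(repo.read("user:1"))

Switching a role then means passing a different store for the corresponding argument, without touching the business logic.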

Now, we can gradually switch the old and new DB roles on the service side so that reads and writes keep happening correctly throughout the migration.

The exact steps you follow for a specific service/DB migration depend on the amount of data and on your confidence in the service’s integration with the new DB.

As an example, for a very simple service with a large amount of immutable data, you most likely need only step 6 as an intermediate one: you can check the read/write code locally, immediately switch the old DB to a read-only state, and perform the background migration.
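
For that simple case, the background migration can be a plain, re-runnable copy loop. Below is a minimal sketch, again with a hypothetical in-memory store standing in for the real DB clients; because the old DB is read-only and the data is immutable, the copy is idempotent and can simply be restarted if it fails halfway through.

    class SimpleStore:
        """Stand-in for a real database client."""
        def __init__(self, records=None):
            self.records = dict(records or {})

        def get(self, record_id):
            return self.records.get(record_id)

        def put(self, record_id, data):
            self.records[record_id] = data

        def all_records(self):
            # A real client would use a batched or cursor-based scan here.
            return list(self.records.items())


    def copy_immutable_data(old_db, new_db):
        """Copy every record that is not yet present in the new DB."""
        copied = 0
        for record_id, data in old_db.all_records():
            if new_db.get(record_id) is None:   # skip already-copied records: safe to re-run
                new_db.put(record_id, data)
                copied += 1
        return copied


    old_db = SimpleStore({"doc:1": {"title": "a"}, "doc:2": {"title": "b"}})
    new_db = SimpleStore()
    print(copy_immutable_data(old_db, new_db))  # prints 2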

Caveats

There are some common issues you can encounter while migrating the data, which I’ve documented here.

1. Update operations on new DB with missing data

If you have mutable data in your DB, you’ll most likely encounter a case where you want to update a record but it’s not yet present in the new DB.

In this case, you’ll have to transfer the record from the old DB first and then perform the update. Be extra careful with the transfer, as it is a concurrent write operation: another service request or the background data migration can write the same record at the same time, so some form of per-record locking is most likely needed.
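
Below is a minimal sketch of this “transfer on first update” logic, with a small in-memory Store as a stand-in for the real DB clients. The in-process lock only protects a single service instance; in a real setup you would use a DB-level lock (for example SELECT ... FOR UPDATE on MySQL) or a shared lock service instead.

    import threading
    from collections import defaultdict

    class Store:
        """Stand-in for a real database client."""
        def __init__(self, records=None):
            self.records = dict(records or {})

        def get(self, record_id):
            return self.records.get(record_id)

        def put(self, record_id, data):
            self.records[record_id] = data

    # One lock per record id; this only covers a single process (see the note above).
    record_locks = defaultdict(threading.Lock)

    def update_record(old_db, new_db, record_id, changes):
        with record_locks[record_id]:
            current = new_db.get(record_id)
            if current is None:
                # Not transferred yet: bring the record over from the old DB first.
                current = old_db.get(record_id)
                if current is None:
                    raise KeyError(f"{record_id} is missing from both DBs")
                new_db.put(record_id, current)
            new_db.put(record_id, {**current, **changes})

    old_db = Store({"user:1": {"name": "Ada", "plan": "free"}})
    new_db = Store()
    update_record(old_db, new_db, "user:1", {"plan": "pro"})
    print(new_db.get("user:1"))  # {'name': 'Ada', 'plan': 'pro'}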

2. Locking data for update

In general, data locking for the business logic should happen on the WP only, as the WS is not guaranteed to be consistent.

There are concurrent writes happening on the WS as well, though: regular WS writes can conflict with the background transfer process. In this case, it’s better to use pessimistic concurrency in the background process and ignore concurrency errors per record, since such an error simply means the record in question has already been transferred.
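
Here is a minimal sketch of such a background transfer, building on the Store class and record_locks registry from the previous sketch. The insert_new helper is a hypothetical stand-in for an “insert only if absent” operation (on MySQL this would typically be an INSERT guarded by a unique key); a conflict is treated as “already transferred” and skipped.

    class RecordAlreadyExists(Exception):
        pass

    def insert_new(store, record_id, data):
        """Insert only if the record is absent; a real DB enforces this with a unique key."""
        if store.get(record_id) is not None:
            raise RecordAlreadyExists(record_id)
        store.put(record_id, data)

    def transfer_in_background(old_db, new_db):
        transferred = skipped = 0
        # A real client would iterate the old DB with a batched or cursor-based scan.
        for record_id, data in list(old_db.records.items()):
            with record_locks[record_id]:   # take the same per-record lock as the update path
                try:
                    insert_new(new_db, record_id, data)
                    transferred += 1
                except RecordAlreadyExists:
                    # The service (or another transfer worker) already wrote this record
                    # to the new DB, so it is safe to skip.
                    skipped += 1
        return transferred, skipped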

Those are the steps we took, and they worked out great. I hope this helps. Good luck migrating your database with minimal downtime!