Atomic Database Repairs

3 min read
15-Jan-21

What is the best method to wrap multiple SQL transactions in a data repair? The caveat: the transactions must be atomic, so if one fails they are all rolled back.

The Situation

Recently, we needed to conduct a large-scale data migration from a prior vendor to our database. This included many long hours of writing Python scripts to parse their exported data sets and manually mapping related values.

Naturally, something that came out of this migration was the need to conduct data repairs to adjust the misaligned data that had been transferred. These repairs were on a much larger scale than simple collation. It took a couple months of conversations with customer support, the client, the prior vendor, and within our engineering team. The end result was a list of roughly 10k records needing to have an updated foreign key.

The Considerations

We want the SQL to be wrapped in a single transaction. If one of the larger scripts (there were about 6 total) had a typo or was missing a FK reference that we learned after execution, it would be bad news all around.
The updates we were making only affected one table at a time. Fortunately, we did not need to consider anything from a separate table while the updates were executing
All changes were as simple as:

1UPDATE server.table SET column_name = new_value WHERE primary_key = id;

The Solutions

Wrap the entire series of scripts per table with a BEGIN-END statement. This is particularly useful when executing them against a test environment, where you can toss a raise exception ‘Testing Transaction’ prior to the END. That will roll back the whole transaction, enabling manual testing of runtimes, scripts, updates, etc. without committing the DB.

1DO $do$ BEGIN
2    -- updates
3    END $do$

Use a CTE and single update statement. This option is slightly more involved than the first, but also provides a common pattern than can be modified to fit cases where we are updating one table based on information in another. It tends to be more efficient than iterating over our secondary table row-by-row.

1WITH t AS (
2  SELECT ‘uuid-one’ AS foo_id, ‘uuid-two’ AS bar_id
3  UNION
4  SELECT ‘uuid-three’, ‘uuid-four’
5)
6UPDATE server.table
7SET bar_id = t.bar_id
8FROM t
9WHERE server.table.foo_id = t.foo_id

Our team chose to go with the first option. This was done for two reasons:

We did not need to reference a second table on each transaction, so the added benefits of option 2 were of no advantage.
We had the chance to run our updates against a beta server in advance, so rolling back the transaction to test (a la raise exception) was a nice perk.

The Sources

Next
CSVify