AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

Delayed Replication

Do you think your database replica will save your data in a disaster? Think again: there are plenty of scenarios in which corruption on the primary ends up corrupting the replica as well.

A recent post on the MySQL Performance Blog did a great job of explaining a problem that we often try to warn our clients about. The crux of the problem is that if you rely only on a replica for disaster recovery, you are going to lose data when something bad happens.

To minimize the impact of eventual consistency in our BASE applications, we want our replicas to be very near real time. Unfortunately, this can have unintended consequences in a disaster. Whether you’re relying on MySQL’s statement-based replication or Oracle’s redo apply replicating at the block level, both are vulnerable to data corruption.

Any scenario that results in data corruption on the primary will be replicated to the standby almost immediately. If a DBA drops a table, then by the time he stops cursing, the drop table has already been replicated to the standby. A storage subsystem failure or an HA failover can also corrupt data files, and that corruption can be propagated to the standby.

The solution to this problem is to create a standby or replica that applies the log files after a delay. We recommend a delay of 6 to 12 hours, which gives you plenty of time to catch a logical corruption and stop the replication before it is applied. You don’t need a large, production-sized server for this, since you’ll never use this database in production; you will only recover the database from it. Do this simple thing and it might save your data.
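To make the idea concrete, here is a minimal Python sketch of delayed log apply. It is a toy simulation, not any database's actual API: the `LogEvent` and `DelayedReplica` names, the timestamps, and the sample statements are all assumptions made for illustration. The replica receives events right away but only applies those that are at least as old as the configured delay, so a bad statement can still be caught in the pending log.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LogEvent:
    committed_at: float  # commit time on the primary, in seconds
    statement: str


@dataclass
class DelayedReplica:
    """Hypothetical replica that holds events until they are old enough."""
    delay_seconds: float
    relay_log: deque = field(default_factory=deque)  # received, not yet applied
    applied: list = field(default_factory=list)      # statements already applied
    stopped: bool = False

    def receive(self, event: LogEvent) -> None:
        # Events are shipped immediately; only the apply step is delayed.
        self.relay_log.append(event)

    def apply_due(self, now: float) -> None:
        # Apply, in order, every event whose age has reached the delay.
        if self.stopped:
            return
        while self.relay_log and now - self.relay_log[0].committed_at >= self.delay_seconds:
            self.applied.append(self.relay_log.popleft().statement)

    def stop(self) -> None:
        # Halt apply so that pending (possibly destructive) events never run.
        self.stopped = True


HOUR = 3600
replica = DelayedReplica(delay_seconds=6 * HOUR)
replica.receive(LogEvent(0 * HOUR, "INSERT INTO orders VALUES (1)"))
replica.receive(LogEvent(1 * HOUR, "DROP TABLE orders"))

replica.apply_due(now=6.5 * HOUR)  # the INSERT is old enough, the DROP is not
replica.stop()                     # the DBA notices the mistake and halts apply
replica.apply_due(now=12 * HOUR)   # no effect: the DROP stays unapplied
```

After this run the replica still has the `orders` data, and the `DROP TABLE` sits unapplied in the relay log where it can be skipped. In real deployments the delay is a feature of the database or its tooling rather than application code; MySQL 5.6 and later, for example, provide a `MASTER_DELAY` option on `CHANGE MASTER TO`.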

3 comments

  • Ori Lahav

    in August 22nd, 2010 @ 13:07

    Good subject to discuss.
    The sad truth is that you will discover the fault in the 13th hour after you dropped the table, when the data is already corrupted.
    At Outbrain we keep one replica in each datacenter that does nothing but drop a full DB backup to a NAS server daily, and we keep a few days' worth of these tarballs.
    This gives us the ability to recover our data pretty quickly even if a real (Sep 11th style) disaster hits one of our datacenters.

    To our understanding, replication is never considered a backup, for the reasons you have mentioned.

    The problem with backups is always how fast you can recover from them, and whether you have ever trained your team to recover from backup.
    On that matter we believe: “If you want to recover better from disasters, you had better always be in disaster.”
    When deploying new DB replicas or upgrading a DB, we always build it from the last local (to that datacenter) backup and let it catch up on replication, and we do this at least once a month.
    That way, every ops engineer knows how to recover from backup.

  • Rima

    in August 29th, 2012 @ 17:56

    Awesome tip, Fish, thanks!

    • fish

      in August 29th, 2012 @ 21:15

      Thanks Rima, I appreciate the comments!