Getting Serious about Data Redundancy

Reluctantly, I have had to assume the role of “server guy” with my translation company. I generally prefer to focus on the creative side of web application development, but I’m not naive enough to think that server backups and security can be completely ignored… so it falls to me to make sure that we are prepared for a catastrophe or any kind. This weekend I spent some time reviewing our current situation and implementing improvements.

In reviewing our backup strategy we must first consider what type of catastrophes we want to be prepared for. Some possible problems we might face include:

1. The site could be hacked and data corrupted or deleted. 2. We could experience hardware failure (e.g. hard drive could die or server could konk out). 3. We could face a major regional distaster like the earthquake/tsunami that hit Japan recently.

We also need to consider our tolerance to down time and frequency of database changes. E.g. some simple backup strategies might involve an off-site copy of the files and database off-site so that you can retrieve them if necessary. But this strategy, for larger setups may take upwards of 24 hours to get back only in the case of a failure (re-uploading the data, setting up the server configuration, etc..). If you’re working in an environment where even a few minutes of down-time is a problem, then you would need to develop a strategy that will allow you to be back online much faster.

Similarly, if you are only backing up once every 24 hours, you could potentially be losing 24 hours worth of user updates if you had to revert to a backup.

In our case we are running a 2-tier backup strategy:

1. Hot-backup: This is a backup that is always sychronized to the live site so that it can be brought online with the flip of a switch. 2. Archived Backup: The focus of this back-up is to be able to withstand a regional catastrophe, or be able to revert to a previous version of the data in case of corruption that has affected the hot-backup.

Hot Backup Strategy

For the hot backup we are using MySQL replication to run a slave server that is always in sync with the master. This is useful for 2 reasons:

1. If there is a failure on the master’s hard drive, then we can switch over to the slave without any down-time or loss of data. 2. If we need to produce a snapshot of the data (and shut down the server temporarily) it is easier to work off of this slave so that the live site never needs to be turned off for backup maintenance.

Archived Backup Strategy

We are running our more critical sites on Amazon’s Elastic Computing Cloud service (EC2) because of its ease of scalability and redundancy. We are using the Elastic Block Storage (EBS) for the file systems which store both the application files and the database data. This makes it easy for us to take snapshots of the drives at any point in time. For our database backups, we periodically take a snapshot of the EBS volume containing our database data. (First we set a read lock on the database and records the master status so we know exactly which point in the binary log this snapshot refers to). If there is a failure at any point in time, we just load in the most recent snapshot, then rebuild the data incrementally using the binary log.

The snapshot feature of Amazon for EBS is a real life saver. Actually copying the data when you’re talking hundreds of gigabytes is quite time consuming. With EBS, however is only takes a minute or two as it uses a clever incremental scheme for producing the snapshot. This is one of the key reasons why I’m opting to use EC2 for our critical sites.

Just in Case

Amazon claims to have redundant backups of all of our snapshots and distributed across different data centres…. but just in, case, I like to have a local backup available for our purposes. So I use rsync to perform a daily backup to a local hard drive. Hopefully I never need to use this backup, but it is there just in case.

We can always get better…

This backup strategy helps me sleep at night but there are still somethings about it that could be improved. Our database backups are now pretty rock solid – as we could recover from nearly any failure without experiencing any data loss using a combination of a snapshot and the binary log. However for the file system we don’t have the equivalent of a binary log to be able to rebuild the build system from the most recent snapshot. I know that this can be achieved, and more seasoned “server people” probably think this is a no-brainer, but I’m a software guy, not a server guy so go easy ….