Note: I am giving a talk on Backups and Disaster Recovery Best Practices on July 27th.
When discussing disaster recovery, it’s important to take your business’ continuity plan into consideration. Backup and recovery processes are a critical part of any application infrastructure.
A well-tested backup and recovery system can be the difference between a minor outage and the end of your business.
You will want to take three things into consideration when planning your disaster recovery strategy: recovery time objective, recovery point objective and risk mitigation.
Recovery time objective (RTO) is how long it takes to restore your backups. Recovery point objective (RPO) is what point in time you want to recover (in other words, how much data you can afford to lose after recovery). Finally, you need to understand what risks you are trying to mitigate. Risks to your data include (but are not limited to) bad actors, data corruption, user error, host failure and data center failure.
We recommend that you use both physical (Percona XtraBackup, RDS/LVM Snapshots, MySQL Enterprise Backup) and logical backups (mysqldump, mydumper, mysqlpump). Logical backups protect against the loss of single data points, while physical backups protect against total data loss or host failure.
The best practice is running Percona XtraBackup nightly, followed by mysqldump (or in 5.7+, mysqlpump). Percona XtraBackup enables you to quickly restore a server, and mysqldump enables you to quickly restore data points. These address recovery time objectives.
For point-in-time recovery, it is recommended that you download binlogs on a regular basis (once an hour, for example).
Another option is binlog streaming. You can find more information on binlog streaming in our blog: Backing up binary log files with mysqlbinlog.
There is also a whitepaper that is the basis of my webinar here: MySQL Backup and Recovery Best Practices.
One way to save on operational overhead is to create a 24-hour delayed slave. This takes the place of the logical backup (mysqldump) as well as the binlog streaming. You want to ensure that you stop the delayed slave immediately following any issues. This ensures that the data does not get corrupted on the backup as well.
A delayed slave is created in 5.6 and above with:
It is a good idea to test your backups at least once a quarter. Backups do not exist unless you know you can restore them. There are some recent high-profile cases where developers dropped tables or schemas, or data was corrupted in production, and in one case five different backup types were not viable to use to restore.
The best case scenario is an automated restore test that runs after your backup, and gives you information on how long it takes to restore (RTO) and how much data you can restore (RPO).
For more details on backups and disaster recovery, come to my webinar.