RDS Single AZ vs Multi AZ Backup & Restore Benchmarking


I need to benchmark backup and restore for Amazon RDS in Single-AZ versus Multi-AZ deployments and share the results with my client, so that he can use them to make decisions with his customers before signing an SLA.

Could someone please suggest how to approach the benchmarking and which tests I need to run?

Thanks in advance

btan (Exec Consultant) commented:
I see this more as testing the durability and availability differences between Single-AZ and Multi-AZ. Since the user pays for what is used, the billing dimensions may also serve as test metrics for measuring the difference between the use cases. For example, from the RDS FAQs:

- Throughput related -
• DB Instance hours – Based on the class (e.g. Standard Small, Large, Extra Large) of the DB Instance consumed. Partial DB Instance hours consumed are billed as full hours.
• I/O requests per month – Total number of storage I/O requests you have (for Amazon RDS Magnetic Storage only)
• Provisioned IOPS per month – Provisioned IOPS rate, regardless of IOPS consumed (for Amazon RDS Provisioned IOPS (SSD) Storage only)
• Data transfer – Internet data transfer in and out of your DB Instance.

- Storage related -
• Storage (per GB per month) – Storage capacity you have provisioned to your DB Instance. If you scale your provisioned storage capacity within the month, your bill will be pro-rated.
• Backup Storage – Backup storage is the storage associated with your automated database backups and any active database snapshots you have taken. RDS provides backup storage up to 100% of your provisioned database storage at no additional charge. Backup storage is only free for active DB Instances.

I am not sure these metrics alone can effectively prove the point to your user. Benefits such as live data replication or shorter maintenance time are transparent to the user. Maybe the better strategy is to verify one important aspect of running Multi-AZ (as against Single-AZ), namely its availability, e.g.:

A) If one of the AZs in a region goes down, production application traffic is automatically routed to the RDS instance in the alternate AZ.

B) If DB maintenance and upgrades are applied to the RDS instance on a per-AZ basis (for a Multi-AZ RDS), there should be minimal impact on uptime while users are still accessing the application.

As for cost, it depends entirely on how much downtime the hosted application can tolerate. It comes down to the user's appetite for:
> Cost (e.g. Multi-AZ proves its savings over the long run, whereas Single-AZ is cheaper up front), and
> Uptime trade-off (e.g. Multi-AZ copes better with the unexpected circumstances above; at a minimum the user gets a "sleep better" assurance, whereas a Single-AZ outage can mean significant downtime for critical "live" apps, if applicable).
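
For point A, one way to put a number on the switch-over is to probe the database endpoint at a fixed interval while a failover is triggered, then report the longest run of failed probes. The sketch below is only an illustration under stated assumptions: the function names are made up for this example, and a plain TCP connect only shows that the port answers, not that the database is accepting queries.

```python
import socket
import time
from typing import List, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, probe succeeded?)


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def collect_samples(host: str, port: int, duration_s: float,
                    interval_s: float = 1.0) -> List[Sample]:
    """Probe the endpoint repeatedly for duration_s seconds."""
    samples: List[Sample] = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append((time.monotonic(), tcp_probe(host, port)))
        time.sleep(interval_s)
    return samples


def outage_windows(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs, each running from the first failed probe
    to the next successful one (or to the last sample if still down)."""
    windows: List[Tuple[float, float]] = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))
    return windows


def longest_outage(samples: List[Sample]) -> float:
    """Length in seconds of the longest observed outage, 0.0 if none."""
    return max((end - start for start, end in outage_windows(samples)),
               default=0.0)
```

Running collect_samples against the instance endpoint for both deployment types, with the same probe interval, gives two longest_outage figures that can be compared directly. The resolution of the result is limited by the probe interval plus the connect timeout.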
cloudtechnician (Author) commented:
Thanks, btan, for sharing your comment.

I'm mainly looking for the different ways to test high availability: time to restore in case of a failover, RTO, RPO, etc. For example, if my DB instance in a Single-AZ deployment goes down, how long would it take to restore from an RDS snapshot or through point-in-time recovery? I want to benchmark all of these.
btan (Exec Consultant) commented:
Thanks for the clarification. In that case the disaster recovery use case is the relevant context for the backup and restore KPIs you stated. See http://awsmedia.s3.amazonaws.com/ARC302.pdf
(slide 24 for a DR example with metrics and a test case (RTO 8 hr, RPO 1 hr), slide 13 for the DR-related "Performance Metric – Total Time", and slide 30 for the replication throughput example, "120,000 files @ 15,000 TPS = 8 seconds").

Overall, AWS RDS provides two distinct backup mechanisms: DB snapshots, which restore the data as it was at the specific point in time the snapshot was taken, and automated backups, which can restore up to the last 5 minutes of operations. So the test will have to cycle through the following:
1. Select an RPO and RTO.
2. If automated backup does not meet your requirements, plan on setting up frequent snapshots as well as HA with Multi-AZ.
3. Plan your snapshots for when the data is stable.
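
To make step 1 testable, each recovery drill can be reduced to three timestamps and compared against the chosen targets. The following is a minimal sketch rather than an AWS-provided method: the class and field names are invented for this example, and the 8 hr / 1 hr figures in the usage note are simply the example targets from the slide deck cited above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps recorded during one recovery drill (hypothetical structure)."""
    failure_time: datetime           # when the instance was taken down
    last_restorable_time: datetime   # newest point the data could be restored to
    service_restored_time: datetime  # when the application could use the DB again

    @property
    def achieved_rpo(self) -> timedelta:
        """Window of data lost: failure time minus newest restorable point."""
        return self.failure_time - self.last_restorable_time

    @property
    def achieved_rto(self) -> timedelta:
        """Time the service was unavailable."""
        return self.service_restored_time - self.failure_time


def meets_targets(result: DrillResult, rpo_target: timedelta,
                  rto_target: timedelta) -> bool:
    """True if both the achieved RPO and RTO are within their targets."""
    return (result.achieved_rpo <= rpo_target
            and result.achieved_rto <= rto_target)
```

For example, a drill with a failure at 12:00, a last restorable time of 11:56 and service back at 12:45 gives an achieved RPO of 4 minutes and an achieved RTO of 45 minutes, which passes targets of RPO 1 hr and RTO 8 hr. Repeating the drill several times per deployment type gives a spread instead of a single figure.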

I believe the steps under "Example Disaster Recovery Scenarios with AWS" in the guide are relevant here:
Backup and restore
1. Freeze data changes to the DR site.
2. Take a backup.
3. Restore the backup to the primary site.
4. Re-point users to the primary site.
5. Unfreeze the changes.

Pilot light, warm standby, and multi-site
1. Establish reverse mirroring/replication from the DR site back to the primary site, once the primary site has caught up with the changes.
2. Freeze data changes to the DR site.
3. Re-point users to the primary site.
4. Unfreeze the changes
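
The "Restore the backup" step in the first list is the part a stopwatch can be put on. The sketch below times a snapshot restore by starting it and polling until the new instance reports "available". It assumes a boto3 RDS client is passed in as rds; restore_db_instance_from_db_snapshot and describe_db_instances are boto3 RDS client calls, while the function name, the polling interval and the timeout are assumptions made for this example.

```python
import time


def time_snapshot_restore(rds, snapshot_id, new_instance_id,
                          poll_s=30.0, timeout_s=4 * 3600,
                          clock=time.monotonic, sleep=time.sleep):
    """Start a restore from a DB snapshot and return the seconds elapsed
    until the new instance reports the status 'available'.

    rds is expected to behave like boto3.client("rds"); clock and sleep
    are injectable so the loop can be exercised without real waiting.
    """
    start = clock()
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    while True:
        resp = rds.describe_db_instances(DBInstanceIdentifier=new_instance_id)
        status = resp["DBInstances"][0]["DBInstanceStatus"]
        elapsed = clock() - start
        if status == "available":
            return elapsed
        if elapsed > timeout_s:
            raise TimeoutError(
                f"{new_instance_id} still '{status}' after {elapsed:.0f}s")
        sleep(poll_s)
```

A real run would look like time_snapshot_restore(boto3.client("rds"), "my-snapshot", "restore-test-1"), where both identifiers are placeholders. The figure covers only the time until the instance is reported available; re-pointing the application and any warm-up of the restored instance would need to be measured separately.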

It is tough to give a sample test case, but the above sets a baseline for the minimum that needs to be covered. It would be better to bring in AWS consultancy; they can probably activate a sample instance for such testing or share more backend estimates with you.

cloudtechnician (Author) commented:
Thanks, btan. One last thing: I want to measure performance when the app is connected to RDS in the same AZ, and again when the app is connected to RDS in another AZ. What are the ways to test this and make sure the benchmarking is done correctly?

Thank you!
btan (Exec Consultant) commented:
I suggest focusing on two areas instead:
- User performance: measure with a single user against each of the two AZs from the client app, using a network monitor (request/response stats, e.g. Wireshark or even Windows Perfmon) and application-level stats (especially if the client is a browser: turn on its developer mode and look at the request/response stats). Then move on to multiple concurrent users and re-measure.
- RDS performance: this remains as mentioned, based on the stats available for the instance, which are the same in either AZ. The stats will differ according to the user performance scenario (single vs multiple users).

Together these form a baseline average, provided you can settle on an acceptable median range (between the slowest and fastest request/response cases above). Pardon me for giving only high-level input rather than details.
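
The baseline described above can be computed from raw request timings with a few lines of code. This is a rough sketch under assumptions: time_calls wraps whatever single request the test uses (a query, an HTTP call), and the nearest-rank 95th percentile is one common choice, not something the thread prescribes.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(fn: Callable[[], object], n: int) -> List[float]:
    """Call fn n times and return each call's duration in milliseconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    n = len(sorted_values)
    rank = (pct * n + 99) // 100  # ceil(pct * n / 100) in integer arithmetic
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return min, median, 95th percentile and max of the latencies."""
    ordered = sorted(latencies_ms)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
    }
```

Collecting one list of timings with the app in the same AZ as the database and another with the app in a different AZ, then comparing the two summaries, shows the cross-AZ overhead. The same two runs repeated under concurrent load cover the multi-user scenario.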