Disaster Recovery with SnapManager for Oracle (SMO) & Active Data Guard

NetApp’s SnapManager for Oracle (SMO) may not be the best solution for your cloning and disaster recovery needs. Weighing the pros and cons can help determine whether your use case is suited to an SMO solution. For disaster recovery between two data centers too far apart for metro-clustering, cloning a database with SMO offers real savings in storage capacity, but it also carries some severe trade-offs.

Disaster recovery planning is an arduous task for both business and IT participants. It helps if everyone understands the full implications of a strategy, from the acceptable levels of protection to the technologies and toolsets available. As an example, let’s look at a specific use case: recovering an Oracle database backed by NetApp storage. The ERP database being protected is 10 GB and sits on an Oracle RAC cluster. There are two data centers 750 kilometers apart, connected by a one-gigabit, replication-only link.

Let's assume Active Data Guard replication is in place between two RAC clusters, one in each data center. This provides an optimal replication strategy with a very low RTO (Recovery Time Objective) and RPO (Recovery Point Objective), at least in terms of recovering the database. What about disaster recovery testing? At no point do we want to interrupt replication between the production and disaster recovery databases, which means a duplicate instance of the underlying storage is required. In essence, three distinct copies of the same database exist: Production, Disaster Recovery, and Disaster Recovery Testing. This is where NetApp positions SnapManager for Oracle as a method of storage optimization and cost reduction.

SMO can eliminate the third copy required for DR testing by creating a FlexClone of the underlying storage already being replicated via Data Guard to the DR site. This is a great way to avoid wasting hundreds of costly gigabytes of SAN storage. The problem is that, although SMO does not duplicate the entire storage footprint, it still needs a point-in-time baseline snapshot for reference.

SMO is documented not to play well with Active Data Guard, for the simple reason that it has to lock and quiesce the database in order to get a clean snapshot, which disrupts the Data Guard replication. Depending on the size of the database, the types of underlying disks, and the array performance, the snapshot process can take anywhere from a few minutes to more than 45 minutes, meaning the organization is in an unprotected state for the duration of the snapshot and cloning process.

The alternative is to use Data Guard to replicate to both a DR site and a tertiary location, or to use Data Guard to cascade the data from the DR copy down to a DR testing copy, ideally on a cheaper, lower-performing storage device.

What about removing the database component completely and doing all of the replication and cloning on the storage array? This may work, depending on the use case and the disaster recovery metrics (RTO/RPO). SnapMirror can certainly be used to replicate the desired volumes to another array or remote location. The fastest schedule for this task is every 15 minutes, and it still requires duplicate DR and DR testing copies of the database.
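To see what a 15-minute schedule means for RPO, here is a minimal sketch of the worst-case exposure for snapshot-based scheduled replication in general. It is not a NetApp formula or tool: the function name is hypothetical, and it assumes each transfer finishes before the next one is due.

```python
def worst_case_rpo_minutes(schedule_minutes: float, transfer_minutes: float) -> float:
    """Upper bound on data loss for scheduled, snapshot-based replication.

    A snapshot taken at time T only protects data once its transfer has
    finished. If the primary site fails just before the *next* transfer
    completes, everything written since T is lost: one full schedule
    interval plus the time that transfer needed.

    Assumption (hypothetical model): transfers never overlap the next run.
    """
    if schedule_minutes <= 0 or transfer_minutes < 0:
        raise ValueError("schedule must be positive and transfer non-negative")
    return schedule_minutes + transfer_minutes


# A 15-minute schedule whose transfers take 5 minutes
print(worst_case_rpo_minutes(15, 5))
```

The point of the sketch is that the schedule interval alone understates the exposure; the transfer time has to be added when comparing against the agreed RPO.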

NetApp’s Semi-Sync mode did allow the SnapMirror update interval to be shortened to 10 seconds, but it had a distance limitation of 500 kilometers. It is also no longer an option in newer versions of ONTAP, and thus far NetApp has not released anything like it to accommodate use cases not suited to metro-clustering.

Let's assume that 15-minute scheduled synchronizations would meet the requirements in terms of recovery metrics. The amount of data changed in any 15-minute interval could not exceed what the replication network path can carry in that same interval. For example, if 20 gigabytes of data changed within 15 minutes, then 20 gigabytes would have to replicate to the DR site before the start of the next replication job. Unless the distance is less than 250 kilometers and 10G dark fiber connections are used for point-to-point links, this can cause a backlog of SnapMirror synchronizations, and stale data would be recovered in the event of a disaster.
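The window arithmetic can be sketched in a few lines. This is an illustration only, not a SnapMirror sizing tool: the function names and the `efficiency` factor (the share of the nominal line rate that replication actually sustains over the WAN) are assumptions, and decimal gigabytes are assumed throughout.

```python
def transfer_seconds(changed_gb: float, link_gbps: float, efficiency: float) -> float:
    """Seconds needed to ship changed_gb over the replication link.

    efficiency is a hypothetical factor in (0, 1]: the fraction of the
    nominal line rate that is really achieved over distance.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    changed_gigabits = changed_gb * 8  # decimal GB -> gigabits
    return changed_gigabits / (link_gbps * efficiency)


def backlog_after(intervals: int, changed_gb_per_interval: float,
                  link_gbps: float, efficiency: float,
                  interval_s: float = 900.0) -> float:
    """Gigabytes still waiting to replicate after a number of intervals."""
    # GB the link can ship during one replication interval
    capacity_gb = link_gbps * efficiency * interval_s / 8
    backlog = 0.0
    for _ in range(intervals):
        # New changes arrive; the link drains what it can; backlog never goes negative
        backlog = max(0.0, backlog + changed_gb_per_interval - capacity_gb)
    return backlog


# 20 GB changed per 15 minutes over the one-gigabit link at full line rate
print(transfer_seconds(20, 1.0, 1.0))       # seconds needed per cycle
# The same load if only one eighth of the line rate were sustained
print(backlog_after(4, 20, 1.0, 0.125))     # GB queued after one hour
```

At the full nominal rate, 20 GB needs 160 seconds, so the 900-second window is missed only when sustained throughput drops below roughly 178 Mbit/s (20 GB × 8 / 900 s). Whether a given link sustains that over 750 kilometers depends on latency, protocol overhead, and competing traffic, none of which this sketch models. Once the window is missed, the backlog grows every cycle and the DR copy falls further behind.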

In terms of disaster recovery metrics and storage efficiency, SMO may not be the best technology for use cases beyond metro-cluster distances that rely on a bifurcated replication strategy. Even with synchronous replication, the need to quiesce the database will stop replication for a short time and leave the organization unprotected.

The most efficient approach in the situation discussed would be to have a third, lower-performing storage device for disaster recovery testing and to use either SnapManager or Data Guard to replicate the necessary data to it. The higher cost of this option has to be weighed against the potential monetary loss to the organization if any data were missing or corrupt in the time it took to generate an SMO clone during disaster recovery testing. This point alone will help the business understand the kinds of costs incurred in pursuing a zero RPO (no acceptable data loss).