sdruss

asked on

Running Oracle 11g 2-Node RAC on Only Single Node

Our customer wants to be convinced that if one node of our two-node Oracle RAC 11g database goes down, we will continue processing, albeit perhaps with some latency (and, of course, they want to know how much latency). Recently we had a minor glitch at the customer's site, where a Solaris Cluster resource faulted. This particular cluster resource, "asm_dg_rs", is associated with our ASM disk groups. After much investigation, we found that a standard ASM query, which runs as part of this resource check, was taking longer than the expected 120 seconds. Long story short: the query was slow because the hardware was extremely bogged down. The server was running a "zfs scrub", which I now understand is similar to a file system check ("fsck") in that it validates blocks on the disks. The "zfs scrub" was scheduled once a week, so we canceled the weekly job. The hardware team has since decided that our two Dell R810 servers are memory starved, and the memory will be significantly upgraded; apparently an extreme amount of paging was occurring.
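For reference, a running scrub can be inspected and cancelled from the command line. A sketch (the pool name `tank` is a placeholder; substitute the pool backing your ASM disk groups):

```shell
# Show pool health and any scrub in progress (look at the "scan:" line)
zpool status tank

# Cancel a scrub that is currently running
zpool scrub -s tank

# Rather than dropping the scrub entirely, consider rescheduling the
# cron job to an off-peak window, e.g. Sunday 03:00:
#   0 3 * * 0 /usr/sbin/zpool scrub tank
```

Scrubs protect against silent corruption, so moving the job off-peak is usually safer than abandoning it.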

So, because of this minor resource fault, which I thought was well explained and resolved, the customer is now extraordinarily concerned about a failure during a critical event. How can I convince my customer that our highly available two-node Oracle RAC 11g database will keep on ticking and most likely not miss a beat? Part of RAC's high-availability design is fault tolerance, meaning you should be able to continue with one server down in the cluster, correct?
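One way to build confidence before any failover demonstration is to capture a baseline of cluster health with the standard 11g clusterware tools. A sketch (the database name `ORCL` is a placeholder):

```shell
# Check clusterware health on every node
crsctl check cluster -all

# Show which database instances are running, and on which nodes
srvctl status database -d ORCL

# Show where the SCAN listeners are currently running
srvctl status scan_listener
```

Running the same commands after the test shows the customer exactly what failed over and what kept running.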
Alex [***Alex140181***]

where you should be able to continue with a server down in the cluster – correct?
Yes ;-)
The best way to convince the customer is to literally pull the plug on one of the servers.  You cannot get a better test than that.
madunix

I would demonstrate transparent failover (FO), and FO should be tested under several different scenarios.
If ALL the apps connect to the SCAN IP and have been properly configured for application failover, then yes, losing a node should be invisible.
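As a sketch of what "properly configured" can look like on the client side, a connect descriptor pointing at the SCAN with basic Transparent Application Failover (TAF) settings (`rac-scan.example.com` and the `ORCL` service name are placeholders):

```
ORCL =
  (DESCRIPTION =
    (CONNECT_TIMEOUT = 5)(RETRY_COUNT = 3)
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = ORCL)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 3))
    )
  )
```

Note that TAF only replays in-flight SELECTs; uncommitted transactions are still rolled back on failover, so the application must handle that case.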

Short of pulling the plug, which would be a true test, albeit a brutal one, shut down the instance on one node.  You might be surprised how many apps notice...
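That kind of controlled test can be staged with `srvctl` alone, without touching the OS. A sketch (database `ORCL` and instance `ORCL1` are placeholders):

```shell
# Simulate losing a node: abort one instance immediately
srvctl stop instance -d ORCL -i ORCL1 -o abort

# Watch the surviving instance carry the workload
srvctl status database -d ORCL

# Bring the instance back when the demonstration is done
srvctl start instance -d ORCL -i ORCL1
```

Using `-o abort` is closer to a crash than a normal shutdown, though still gentler than an actual power loss.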

If you aren't familiar with Netflix's Simian Army, deliberately killing production components to prove resilience is an established practice:
https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116
The only way to satisfy a customer is a true test.  An orderly shutdown is not the same as a total failure.

When Digital Equipment Corporation went into a customer site to compete with an IBM mainframe, the sales guy would set up their machine next to the mainframe they were competing against, then pull the plug on both machines.  Instant sale every time: the mainframe took 45 minutes to come back up, while their machine took only a few minutes.
ASKER CERTIFIED SOLUTION
schwertner