Oracle Recovery Case Study 1: What happens when RMAN lets you down?

RMAN is my favorite Oracle utility. It abstracts much of the dirty details of Oracle backup and recovery, and has created a generation of Oracle DBAs who are confident and productive managing their database(s) without knowing much more than a few canned scripts.

This is a good thing. Backup and Recovery of a database in Oracle's price range should not be manual and difficult. Before RMAN, recovering a database was sometimes like solving the Rubik's cube, and I felt "lucky to escape" after each production recovery. RMAN contributes much to make the process simpler and when we stand on the shoulders of RMAN we can perform some very powerful tasks. But RMAN is not perfect, and has had a bug or two in its lifespan.

There are times when it helps to know how to recover the database without it, and not knowing this can hurt you when you can least afford it, during a production crash. The new generation of RMAN DBAs may not know what to do when RMAN gives up mysteriously and they see something like this:

RMAN-00571: ===========================================================
                      RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
                      RMAN-00571: ===========================================================
                      RMAN-03002: failure of recover command at 12/25/2008 10:00:00
                      ORA-00283: recovery session canceled due to errors
                      RMAN-11003: failure during parse/execution of SQL statement: alter database recover logfile '/arch/ora1/arch_1_5000_605730252.arc'
                      ORA-00283: recovery session canceled due to errors

Open in new window

Anytime I see the words "failure", "recovery" and "ORA-" together I get a sinking feeling.

If we trust solely in RMAN we might get caught out. What would you do in this case? Call Oracle? Not a bad idea. It is most likely a bug, especially if the error stack contains an ORA-600. But submit the Service Request and waiting on response takes time, even with tiers of support, the engineer still has to collect all of your information before diagnosing, and that time is often an hour or more wasted. If your business is losing dollars every minute that the database is down, you cannot afford not to know manual recovery as a safety net.

So, let's downshift to SQLPLUS and keep moving. If RMAN fails during a standard restore or recovery, there is a good chance you can still rescue yourself or your customer from the situation, and then continue working the Oracle Service Request / Bug Report later. But remember, because some operations in recovery are irreversibile, it is adviseable to take an OS backup of all Oracle datafiles, control files and logs _as they stand_ before proceeding. If a recovery operations goes wrong, you can backtrack this way.

In my case, when this happened last, I was halfway through an RMAN tablespace restoration due to data block corruption, and I now had datafiles in an inconsistent state. I KNEW all my archive logs were available, but for some reason RMAN died before applying all of them! The next thing I did was login to SQLPLUS and try to continue recovery manually.

sqlplus / as sysdba
                      SQL> recover database until cancel;

Open in new window

ORA-00279: change 1214578 generated at 06/27/2009 22:07:15 needed for thread 1
                      ORA-00289: suggestion : D:\BACKUP\DEV\ARCHIVELOG\2009_06_30\O1_MF_1_1_%U_.ARC
                      ORA-00280: change 1214578 for thread 1 is in sequence #1

Open in new window

The first thing I do is simply press return. If Oracle is setup properly, its suggestion will be correct and it will find the next archive log where it expects to. What we hope to see is:

ORA-00279: change 1217507 generated at 06/27/2009 23:04:41 needed for thread 1
                      ORA-00289: suggestion : D:\BACKUP\DEV\ARCHIVELOG\2009_06_30\O1_MF_1_2_%U_.ARC
                      ORA-00280: change 1217507 for thread 1 is in sequence #2
                      ORA-00278: log file
                      'D:\BACKUP\DEV\ARCHIVELOG\2009_06_27\O1_MF_1_1_54FQGBBX_.ARC' no longer needed for this recovery

Open in new window

Yes, a lot of ORA "errors" but these are good ORA errors in this case. We hope to see, "log file ... no longer needed for this recovery." If we do, we can either continue pressing return, or we can just type auto to let Oracle apply the rest of the logs automatically. If Oracle cannot find a log, then either enter the full filename at each prompt, or copy the logs over to the db_recovery_file_dest where Oracle expects to find them and keep trying until you have applied the last one. How do we know when we've run out of logs? We could simply guess at it (not good), or we can query Oracle's log history in another session and refer to the file names in order as we apply them, so we'll know what to expect. So kick off a 2nd sqlplus session:

SQL> select sequence#, first_change#, name from v$archived_log order by first_change#;

Open in new window

It may be quite a huge list, depending on how busy your database is, but all you are interested in are the last couple so you can see the sequence# and change#.

SEQUENCE# FIRST_CHANGE# NAME
                      ---------- ------------- ------------------------------------------------------------------------
                         17       1340010     D:\BACKUP\DEV\ARCHIVELOG\2009_06_30\O1_MF_1_17_54O3C91H_.ARC
                         18       1340234     D:\BACKUP\DEV\ARCHIVELOG\2009_06_30\O1_MF_1_18_54O3DD44_.ARC
                      
                      743 rows selected.

Open in new window

An interesting note here is we did not order the query by sequence#. My sequences above stop at 18, but the total rows were 743. Why? Oracle always increments the sequence # when switching to the next log, right? Yes, but if your database has gone through multiple incarnations, (a new incarnation will exist if you've previously performend an incomplete recovery, opening with resetlogs option), then your sequences started over at 1. The SCN (change number) does not start over, however, across incarnations, so the proper way is to order the logs by SCN. With 10g and 11g we can actually recover across incarnations fairly easily. This was not possible prior to Oracle 7.3, and until 10g, it was a very touchy, cryptic procedure, requiring controlfiles to be mounted in phases to apply the logs for that incarnation. To be honest, I never tried it, and I'm happy I never had to. But as of 10g, Oracle no longer clears the log history in the controlfile during a resetlogs, and it tracks the incarnations with new columns in the V$ARCHIVED_LOG view (all the columns start with RESETLOGS_), so if you are running 10g or later, you can actually perform one recovery session across incarnations.

Back to our scenario; looking at the last log, we see that the sequence is 18, and we see the name of the file to confirm. Once we've fed all available logs to Oracle, and we've verified our sequence, the next step is to apply the online redo log. This is critical for full recovery, and RMAN does this automatically so many folks never know it, but in a manual recovery, the DBA must do it. Why?

All DBAs should know this: The oneline log(s), called REDO logs, are actually all transactions that were unarchived since the last archive log you applied. A little or a lot, without this last step, you'll lose data and will have no option but to use the "resetlogs" option when opening the database, and that's IF the database will open. We'll save that scenario for another case study. In this example, we know we have everything, datafile backups were restored, archive logs were applied, and we are close to home.

So let's query the online logs to find the sequence we want. Technically this is not required, you can take the brute force method and offer them one by one to Oracle, but I like to do it cleanly:

SQL> select group#, sequence# from v$log;

Open in new window

GROUP#  SEQUENCE#
                      ---------- ----------
                               1         17
                               2         18
                               3         19
                               4         16

Open in new window

We see sequences 16 - 19. We know from above that the last archive log was sequence 18 and we already applied it, so we've only to apply the 19th and we should be done. Since that is group# 3, we make a guess and choose based on the group#:

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

c:\oracle\dev\redo03.log

Open in new window

Log applied.
Media recovery complete.

Your goal is ALWAYS to see the words, "Media recovery complete"! Sweet words!

Open your database.

SQL> alter database open noresetlogs;

Open in new window

Database altered.

When you can open your database with noresetlogs, you feel like a hero, even if your users or boss assumes you just clicked a button.

Now that we've opened the database, before taking a lunch or coffee, take the time to immediately close the DB, take a cold backup, and then send that trace file to Oracle so we find out why it happened and make sure it does NOT bite us again. It is likely that you'll be doing a patch in the near future.

If you had, instead, viewed the RMAN failure as a total showstopper and waited on Oracle, your customers might have lost money during the downtime, and you as a DBA will lose credibility for being a "just call Oracle" DBA. Oracle DBAs need to be self-sufficient. Some countries have laws regarding automobile accidents. If you are not injured and the automobile is not disabled, move it out of the street while you wait on the police to come write the report!

For those who may be new or suspicious of RMAN, it is important that you do not get the impression that RMAN is unreliable. I have done many, many MANY backup and restores and various types of recoveries and clones with RMAN. As of 10.2.0.4 or 11g, RMAN virtually never fails. I do not suggest that you avoid RMAN, but I suggest that you remember it is a tool that sits on top of other mechanics, and as an Oracle DBA, you must know how to deploy your own parachute when things go wrong.

It's also a good idea to keep those patches current before you run into problems in the first place, and do periodic, offline tests of your recovery so you are always in practice.

Comments (2)

mrjoltcola

Top Expert 2009

Author

Commented: 2009-06-30

Waterstreet,

Thanks will review. I wrote this tonight after a real-world scenario while it was fresh in my mind. Did not want to wait until later when I lost the motivation, but I did write it in one sitting and did not really give it a proper review.

Will revise it tomorrow. Thanks.

mjc

Mark Wills

Topic Advisor

CERTIFIED EXPERT

Distinguished Expert 2018

Commented: 2009-07-23

Hi mrjoltcola,

Just a reminder that your Article is sitting, waiting for Author Review.

Regards,

Mark Wills
Page Editor