Solved

PowerEdge 2800 Boot Problems after resetting NVRAM_CLR jumper

Posted on 2011-02-25
Last Modified: 2012-05-11
I was getting an error on one of the rows of RAM so went about replacing it, but the machine wouldn't recognise it even though the vendor said the RAM was identical. I searched forums and found a tip to reset the NVRAM_CLR jumper in order to get the BIOS to recognise the RAM. I did as instructed, but it didn't work, so I just replaced the old row of RAM and set the jumper back to the default position. The BIOS recognised all the old RAM, including the one that was producing the errors.
Now a new problem: on reboot I got a "PXE-E61: Media test failure, check cable" error. So I went about checking the cables, swapping some new ones in, reseating the drives etc. Nothing worked. Then I thought to look at the BIOS and the SCSI RAID controller was set to OFF. So I turned that back on and watched the startup sequence.
All the drives are being recognised, but now Windows (Server 2003) doesn't start. I get to the very first splash screen for a second or two and then the machine restarts. It's stuck in a loop; I've tried every Windows boot option, but nothing works.
I have 2x SCSI 320Gb drives mirrored.
Can someone please help me out of this loop?
Question by:Impressionist
26 Comments
 
LVL 32

Expert Comment

by:PowerEdgeTech
When you clear the NVRAM, it resets all BIOS values to default, which is RAID controller OFF, and since it had no SCSI-only device to boot from, it went to the next boot device in the list: the network boot device (PXE - which is why you were getting the media test failure message).
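To make that fallthrough concrete, here is a minimal sketch of the boot-order logic described above. The `BootDevice` type and the device names are hypothetical, made up for illustration; this is not how the Dell BIOS is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class BootDevice:
    name: str       # label as it might appear in a boot-order list (hypothetical)
    enabled: bool   # whether the device is switched on in BIOS setup


def first_boot_attempt(order):
    """Return the name of the first enabled device in the boot order, or None."""
    for dev in order:
        if dev.enabled:
            return dev.name
    return None


# Assumed state after the NVRAM clear: the RAID controller has defaulted to OFF.
after_nvram_clear = [
    BootDevice("Embedded RAID controller", enabled=False),
    BootDevice("Embedded NIC (PXE)", enabled=True),
]

print(first_boot_attempt(after_nvram_clear))
```

With the controller disabled, the first thing tried is the network boot device, which matches the PXE-E61 message seen here.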

About the RAM ... what are the exact specs/model numbers/etc. of the new memory you tried to put in there?

About your Windows problem.  The changing around may have corrupted something, so it may be as simple as repairing Windows, but it also might be some complex RAID corruption that occurred that may prevent you from being able to repair it at all.

Important!  Are you getting any error or warning messages during POST, like memory/battery problems being detected, or anything?  Is the LCD screen on the server (that is normally blue) blue or amber?  If amber, what is the message?

If the LCD is blue and there are no error messages during POST, then you will need to boot to your 2003 CD and 1) make sure the Windows installation is recognized, and 2) attempt to repair it by running two commands from the Recovery Console:

First: chkdsk /r
Second: fixboot

You will likely need the RAID driver to load from floppy at F6 during Windows Setup in order to see the hard drives (or use nLiteOS.com to integrate the RAID driver into the media and create a new install CD).

Use this driver:
http://support.dell.com/support/downloads/download.aspx?c=us&cs=04&l=en&s=bsd&releaseid=R99970&SystemID=PWE_PNT_2800&servicetag=&os=WNET&osl=en&deviceid=6395&devlib=0&typecnt=0&vercnt=2&catid=-1&impid=-1&formatcnt=0&libid=35&typeid=-1&dateid=-1&formatid=-1&source=-1&fileid=129538
 
LVL 32

Expert Comment

by:PowerEdgeTech
Another important question ... at any point after re-enabling the RAID did you see a message about a "mismatch" on the controller?  If so, what exactly did you see, and what exactly did you press/do?
 

Author Comment

by:Impressionist
Thanks for the help. The LCD screen is blue with no errors, but there were some errors on the drive when I ran CHKDSK, so I'm just doing a repair now.
I'll come back to the memory problem once I can get up and running again :-)
 

Author Comment

by:Impressionist
I've done CHKDSK /r followed by FIXBOOT, but I'm still stuck in a loop. Is there any further advice you could give me? Thanks :-)
 
LVL 32

Expert Comment

by:PowerEdgeTech
Did the chkdsk /r report uncorrectable errors, or did it find and correct errors?  If it could not fix them, try running that and fixboot again.

If that doesn't help, you can also try fixmbr in the Recovery Console.

Beyond that ... how about my "mismatch" question (34985472) earlier?

Are you able to boot to Last Known Good Configuration or Safe Mode?

The "loop" might actually be a blue screen ... you might try disabling reboot on error (may be on the same screen as the above options).
 

Author Comment

by:Impressionist
Thanks, when I did chkdsk it did report errors, so then I ran chkdsk /r and it finished and didn't report that it couldn't fix them, just listed the info on the drive. I then did the fixboot.
I didn't get a mismatch error when I turned SCSI back on.
I'll try this over again, I previously tried safe mode, but haven't done it after the fix, so I'll try that as well. Thanks for your help :-)
 

Author Comment

by:Impressionist
Just rerunning the chkdsk /r now. I couldn't restart in safe mode or last known good config. I disabled restart on error and got a blue screen:
A problem has been detected......
Run Chkdsk /F....
*** STOP: 0x0000007B (0xF78A6A94, 0xC0000034, 0x00000000, 0x00000000)
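
For anyone reading this later: bug check 0x0000007B is INACCESSIBLE_BOOT_DEVICE, and 0xC0000034 is the NTSTATUS value STATUS_OBJECT_NAME_NOT_FOUND. The short sketch below just pulls those numbers out of a STOP line and looks them up. The lookup tables are deliberately tiny, the function names are made up for illustration, and reading the second parameter as an NTSTATUS is an assumption rather than something confirmed in this thread.

```python
import re

# Deliberately tiny lookup tables: only the values seen in this thread.
KNOWN_STOP_CODES = {0x0000007B: "INACCESSIBLE_BOOT_DEVICE"}
KNOWN_NTSTATUS = {0xC0000034: "STATUS_OBJECT_NAME_NOT_FOUND"}

STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Split a blue-screen STOP line into (bug check code, parameter list)."""
    match = STOP_RE.search(line)
    if match is None:
        raise ValueError("no STOP code found in line")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def describe(line):
    """Return a one-line, human-readable summary of a STOP line."""
    code, params = parse_stop_line(line)
    name = KNOWN_STOP_CODES.get(code, "unknown bug check")
    # Assumption: the second parameter is treated as an NTSTATUS value.
    status = KNOWN_NTSTATUS.get(params[1], "unknown status") if len(params) > 1 else "n/a"
    return f"{name} (0x{code:08X}), second parameter: {status}"


print(describe("*** STOP: 0x0000007B (0xF78A6A94, 0xC0000034, 0x00000000, 0x00000000)"))
```

An inaccessible boot device fits the rest of the thread: Windows started loading but could not reach the boot volume through the storage configuration it expected.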
 
LVL 47

Expert Comment

by:dlethe
Well w/o more information, I can't confirm, but it looks to me like
1) You lost the RAID config (no questions about that)
2) You had a large enough stripe size so there was enough of a partition table for the O/S to determine that this is a windows O/S

So the most likely scenario is one of several, but all just as devastating, and you probably have nearly 100% data loss .. my guess is that

The system was in degraded mode because the RAID 1 ran broken for a while, so the data on the two disks doesn't match. (Did it ever do a rebuild? Were you even checking that the 2 drives were consistent?) So when it lost the metadata, the controller treated the stale disk, which was never rebuilt properly, as the primary, and intermixed ancient stale data with some live data.

But it is moot now.  The worst thing you could have done is the rebuild.  (chkdsk & fixboot were not quite as bad.)  The repair operation destroyed all chances of recovery.  The other disk could have been just fine, and had you booted that instead you would be up.  Now it is too late: you just made the other disk as screwed up as this one.

If the rebuild is running, and you want to get the data back, just turn OFF the computer now.  You need some of the raw blocks to compare contents on both disks to determine if you built the mirror backwards.  Also, a pro could look at the disk you are overwriting to see whether there are remnants of files that haven't been touched past a certain date, to confirm.
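
As a rough illustration of what comparing raw blocks means, the sketch below walks two disk image files and reports which blocks differ. It assumes both drives have already been imaged to files on another machine; the function name, the file paths and the 64 KiB block size are hypothetical choices, and this is not a recovery tool.

```python
def differing_blocks(path_a, path_b, block_size=64 * 1024):
    """Yield the index of every block whose contents differ between two images."""
    with open(path_a, "rb") as image_a, open(path_b, "rb") as image_b:
        index = 0
        while True:
            block_a = image_a.read(block_size)
            block_b = image_b.read(block_size)
            if not block_a and not block_b:
                break  # both images are exhausted
            if block_a != block_b:
                yield index
            index += 1


# Example call (paths are placeholders for images taken from the two drives):
# for block in differing_blocks("disk0.img", "disk1.img"):
#     print("mismatch in block", block)
```

Working only on image copies matters here, because anything that writes to the original disks reduces what can be recovered.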

In any event, if the data is valuable enough to warrant spending $1000+ really do just cut the power and stop making it worse.

Sorry.


 
LVL 47

Expert Comment

by:dlethe
P.S. If the RAID controller or mobo had an embedded event log, then that would have been lost, but at least if it has such a capability then something of use may now be in there. If it is still not too late, you need to kill the rebuild.  Maybe crack open the case, or just cut power to the disks to kill it. You do NOT want a graceful kill; you want an expedient kill of the rebuild.
 

Author Comment

by:Impressionist
Thanks for that, I've got the data backed up on tape, but was trying to avoid having to rebuild the server :-(
 
LVL 32

Expert Comment

by:PowerEdgeTech
Hence the reason for asking about the mismatch error.  If there was a mismatch and the wrong version of the array was chosen (taking no action could also have resulted in the wrong one being imported if, say, you walked away or weren't looking, which is what I was afraid of), then exactly the scenario dlethe describes could be your reality.  This seems to be the case, as you're at the end of the road for repairs anyway.  Time to pull out the tapes.
 

Author Comment

by:Impressionist
It seems I've had some luck. I pulled one of the drives and chose the repair Windows option from Windows Setup. It looked like it was running through a reinstall, but it has now come back up and I can log in etc.
I'm just not sure what to do next. I assume the bad RAM caused the errors on the drive. It's still in there. I've got 4 x 1GB, and it's reporting DIMM2B to have 'correctable DDR2' errors. I'm not sure how one can correct the issue though? I'm thinking I should just remove DIMM2A and DIMM2B to be safe, then plug the second drive back in and let it rebuild?
 

Author Comment

by:Impressionist
I think I'm just about there. I have one drive all set up and looking good. Now I want to bring the other drive back online and start the mirroring. I turned RAID on in the BIOS and when I restarted it gave me a warning stating that data loss would occur. Does this mean loss of data on the second drive I just plugged in? Or does it mean I could lose all the data I just fixed on the first drive?
I wasn't sure, so I just cancelled and restarted with the one drive.
 
LVL 32

Expert Comment

by:PowerEdgeTech
It will always give that warning when switching between SCSI and RAID, and as long as you don't do any writing to the disks in the "wrong" mode, then there should be no data loss, provided you pay attention to any RAID status warnings as it boots up.
 
LVL 47

Expert Comment

by:dlethe
Yes, it means that the 2nd drive will have the contents blown away .. which is what you want.
 

Author Comment

by:Impressionist
Thanks for the extra help.
I've got another issue now. Switched to RAID and it wouldn't start up; got an error:
ntldr fatal error 7 reading boot.ini
I assumed the boot.ini was OK on the drive as it boots fine when just in SCSI mode. The drive I fixed is sitting in slot 0 and the other one is in slot 1. Should I copy the boot.ini from 0 drive to 1 drive? Or run fixboot on the drive in slot 1?
 

Author Comment

by:Impressionist
I have just realised that the RAID driver must not be present, as I can't see any mention of it on startup and I understand there should be an option to use Ctrl-M to enter it.
Once I install the RAID driver noted by PowerEdgeTech, do I then go into the BIOS and set the Embedded RAID Controller to RAID Enabled, Channel A to RAID and Channel B to SCSI?
 
LVL 47

Assisted Solution

by:dlethe
dlethe earned 100 total points
WHAT?!!!   You switched the bootable disk?   You wrote, "I turned RAID on in the BIOS and when I restarted it gave me a warning stating that data loss would occur. Does this mean loss of data on the second drive I just plugged in? Or does it mean I could lose all the data I just fixed on the first drive? I wasn't sure, so I just cancelled and restarted with the one drive."

I was going on the premise that the first disk was in RAID when it was working, and when you tried to turn the 2nd disk on in RAID mode it complained, which is normal and to be expected.  PowerEdgeTech and I both were advising from the perspective that you had the RAID enabled and clearing it blew it away, so we talked about rebuilding the RAID.

So what you are saying is that you switched ..

Forget it. Start from scratch.  Be specific. Exactly WHAT did you switch from/to what state, and when.
Based on what you are saying now, you had a booted system and you changed the state of the booted disk from non-RAID to RAID and now it won't boot??   If that was the case to begin with then how were you "rebuilding" the RAID?  

"I've got another issue now. Switched to RAID and wouldn't startup, got an error:
ntldr fatal error 7 reading boot.ini
I assumed the boot.ini was OK on the drive as it boots fine when just in SCSI mode. The drive I fixed is sitting in slot 0 and the other one is in slot 1. Should I copy the boot.ini from 0 drive to 1 drive? Or run fixboot on the drive in slot 1?"


So in this state, you are not using hardware RAID; RAID has to be turned off in the O/S.  No need for any RAID drivers, but you are doing something other than what was advised.

 

Author Comment

by:Impressionist
Sorry, I didn't know that I wasn't being clear. I'm no expert, and I'm now learning what's going on and why I stuffed this all up :-)

After everything went wrong I was so worried about losing the whole drive, and so happy that it was all up and running again, that I didn't realise I had made a mistake with the RAID. When I went into the BIOS I saw the SCSI RAID controller was set to OFF; this was only after I had fixed the data on the drive in slot 0 (I physically disconnected the other one to have as a backup). I turned the setting back ON, and I didn't realise that there were specific settings for each of the drives. I just saw that both were set to SCSI and didn't think anything of it, as I knew that both the drives were SCSI. I now realise that they can each be set to SCSI or RAID. But I also think I must have chosen the wrong option.

Before all the issues, I could view the RAID in Dell Server Manager - see the status of the drives, choose to rebuild one or the other etc. Now they are just listed as separate physical drives and you can't do anything with them (like rebuild), and if I look in the Windows device manager there are SCSI controllers but no RAID controller.

The situation now, is that I have one drive that I am perfectly happy with and would like it to copy to the other drive and have them to continue to mirror each other after that.

Is there any other info you need to help me? Again, I apologise for not being clear, I didn't realise I wasn't giving the correct info.
 
LVL 32

Expert Comment

by:PowerEdgeTech
Wow.  Not sure what you describe makes sense - at least as far as how you got here, so help me understand what you are seeing now, and maybe we go from here ...

So, is SCSI on or off right now with one drive that "works"?  
Exactly what do you see under Virtual Disks in "server manager"?  Screenshot?

 

Author Comment

by:Impressionist
I'm fairly sure SCSI is on, I'm remoting in at the moment, but can check later.
Attached is the drive info. Is this what you need?
I had to take a few screenshots and put them together as I'm working on a small screen at the moment.
[Attachment: Drive Info]
 
LVL 32

Expert Comment

by:PowerEdgeTech
According to this, RAID is off.  What are the chances that you never had RAID enabled?  I ask because it shouldn't be possible on this controller (assuming the onboard PERC 4e/Di) to switch between SCSI and RAID and have anything be readable.  Let's say somehow it did though ... you could simply "mirror" the drives in the OS.  If you want to re-enable hardware RAID 1 (mirror), then I would image/backup what you have, turn on RAID, configure your RAID 1, then restore to it.
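
If you go the image-then-restore route, it is worth confirming the image is intact before the source drive gets wiped. The sketch below computes a SHA-256 checksum of an image file in chunks so that two copies can be compared. The function names and file paths are hypothetical, and whatever imaging tool you use may already have its own verify option.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def images_match(path_a, path_b):
    """True when both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


# Example call (placeholder paths for the image and a second copy of it):
# print(images_match("server.img", "server_copy.img"))
```

Reading in chunks keeps memory use flat even for an image the size of a whole drive.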
 

Author Comment

by:Impressionist
I thought RAID was enabled before all the problems because when I first discovered the memory problems, I also got a data error on the second drive, so I went into the server administrator and I had the option to rebuild the volume, so I did this and it fixed the errors. Now when I go into it, I don't have that option.

Would I have been able to do that if RAID weren't enabled?

I looked at the hardware list on the original config of the machine (which hasn't changed), and it lists the RAID controller as Raid Controller PERC4e/Di with 256MB Cache.

The weird thing is that I expected that I would see it listed on startup and have the option to hit Ctrl-M to get into the config utility, but it's not there. Or does it only show up if I enable RAID in the BIOS?

Perhaps the RAID controller is dead. But I guess I better work that out before I go to the trouble of ghosting the drive etc. How would I go about that?

Sorry, this has turned out to be a convoluted problem. :-)
 
LVL 32

Accepted Solution

by:PowerEdgeTech
PowerEdgeTech earned 400 total points
"Would I have been able to do that if RAID weren't enabled?"

Not from Server Administrator.  If you were indeed running a software RAID from Windows, you would have done it from Disk Management.

Yes, RAID must be enabled in the BIOS in order to see the CTRL-M utility.  If it is off, you will only see a CTRL-A prompt.


 

Author Closing Comment

by:Impressionist
PowerEdgeTech was the most helpful on this one, but I gave some points to dlethe for the effort.
I think I'm on the right track now to get everything back working.
 
LVL 32

Expert Comment

by:PowerEdgeTech
Good luck!