Solved

Random Blue Screens leading to Windows 7 not booting - RAM and Hard Drive pass testing.

Posted on 2010-11-29
Last Modified: 2012-05-10
Hello,

I am working on a computer which no longer boots into Windows 7 x64.

I delivered the machine to a client in what appeared to be perfect working order. After a week or two of use the machine began blue screening. I was not able to make it to the client's for another week. During this time the blue screens became more and more erratic (no consistent error code). Eventually the machine failed to boot, loading the startup repair screen. The startup repair fails with the error: "Auto Fail Over ... no os installed".

I have the system drive set up as two 600 GB drives in RAID 0. I disabled the RAID and tested each drive individually. They both passed full read tests.

The board is an ASUS Rampage III Extreme. I am using the Intel RAID controller, and the array status shows up as Normal.

I then tested the RAM - it passed memtest repeatedly.

I am a bit stumped. I re-enabled the RAID and booted to the startup repair screen. When the startup repair screen appears it identifies the Windows 7 install and you can proceed. It always fails with the NO OS INSTALLED message.

In the command prompt I can navigate to the drive and view all the files, so the filesystem appears okay. If I check BCDEdit I see the bootloader points to the correct drive location. If I try to manually rebuild the bootloader using bootrec I receive the same error: No OS Installed.

Any help is greatly appreciated.
Question by:csialbany
17 Comments
 
LVL 47

Expert Comment

by:dlethe
ID: 34236141
Testing HDDs requires appropriate methodology, software (sometimes hardware), and knowledge of interoperability issues.

Based on what you report, my educated guess is that you are using consumer class disks which are not suitable for the Matrix controller.  Specifically, the disks don't have the proper timing / settings to deal with error recovery (google TLER).  Intel does not qualify consumer class disks for use with many of their controllers, because they will cause such problems.

Also, you probably don't even have the right kind of diagnostics that can expose such issues.  You certainly can't run them when the disk is attached to that "controller", which is not so much a controller as a $3.00 chip.   You will get a faster and more stable system with native XP RAID.  It will also do load balancing on reads.  The controller you have probably doesn't.

(Also, I think you mean you are using RAID1, not RAID0, right?)

Here is a paper I wrote which discusses some of this in more detail.
http://www.experts-exchange.com/Storage/Misc/A_2757-Disk-drive-reliability-overview.html
 
LVL 2

Expert Comment

by:agengler11
ID: 34236596
I've had issues like this in the past with RAM not being compatible with the motherboard. Try removing one RAM module at a time, or look up the motherboard to see which RAM is supported on it.
 
LVL 9

Expert Comment

by:faizbaig
ID: 34236624
-> A single .wdl (WatchDog Log) file is created in the \Windows\LogFiles\Watchdog folder for each crash. Just open the most recently dated file in your favorite text editor (or Notepad) to view details of the crash and some related information.
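A minimal sketch of picking out the newest log, in case the folder holds several. It assumes Python is available on whichever machine the files are copied to; the function name `newest_wdl` is made up for illustration, and the folder path is the one quoted above.

```python
from pathlib import Path
from typing import Optional


def newest_wdl(log_dir: Path) -> Optional[Path]:
    """Return the most recently modified .wdl file in log_dir, or None."""
    if not log_dir.is_dir():
        return None
    candidates = [
        p for p in log_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".wdl"
    ]
    if not candidates:
        return None
    # Pick the file with the latest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Path as quoted in this comment; adjust if the logs were copied elsewhere.
    latest = newest_wdl(Path(r"C:\Windows\LogFiles\Watchdog"))
    print(latest if latest else "No .wdl files found")
```

If the files were copied off through the recovery command line, point `newest_wdl` at the copy instead. Note that copying can reset modification times, in which case the date in the file name is the safer guide.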
 
LVL 9

Expert Comment

by:faizbaig
ID: 34236637
I mean if possible via Safe Mode.
 

Author Comment

by:csialbany
ID: 34238086
dlethe:

We are using two Western Digital VelociRaptor WD6000HLHX 600GB drives. These drives appear to support TLER and RAID (at least RAID 1) according to http://community.wdc.com/t5/Desktop/VelociRaptor-WD6000HLHX-RAID-and-TLER/td-p/28261 . I am 100 percent sure the drives are hooked up in RAID 0. RAID 0 was requested by the client to improve the speed of a PostgreSQL database they serve on their machine.

Regarding hard drive testing, I disabled RAID in the BIOS so both drives were hooked up to the SATA controller running in IDE mode. I am using QuickTech Pro diagnostic software, which identifies the drives correctly. I am fairly certain that the hard drive tests were run correctly.

AGengler11: I will investigate the RAM compatibility but the machine did run without a problem for at least a month and a half. I have removed the RAM and installed it in various configurations without being able to resolve the startup issue.

Faizbaig: I will check the .wdl as soon as possible, thank you. I cannot start the machine in Safe Mode, but I can use the Windows 7 recovery command line to copy the file to another machine.
 

Author Comment

by:csialbany
ID: 34238108
dlethe: To clarify, I tested the hard drives to find out if the drives themselves are damaged. I was not testing them to find out if there was an issue with the RAID.  I just re-read your post and realized you were referring to RAID issues when you said "you probably don't even have the right kind of diagnostics that can expose such issues."
 
LVL 47

Expert Comment

by:dlethe
ID: 34238480
The 'raptor does NOT have the appropriate firmware for use behind a RAID controller.   You need the RE3/RE4s.  You'll note on their website that the RE3/RE4s both specifically mention RAID compatibility/TLER, while the VelociRaptors do not.   All you have is a blog posting from a moderator that says it "should" be OK.  The spec sheets on the drives do NOT mention suitability. (Unfortunately I don't have a raptor in my lab, so I can't hook it up to our software to report the read/write TLER settings.)

The velociraptors are enterprise class for reliability and data integrity, but the error recovery logic is not tuned for the 2-3 second max window before it gives up.

Besides, you are really not even using a RAID controller in your design. This is a fakeraid.  The device driver does the work.

Read this ...
http://www.wdc.com/en/library/sata/2579-001098.pdf
 

Author Comment

by:csialbany
ID: 34238835
Thanks dlethe.

If I understand what you are saying, the problem lies in the fact that the Western Digital drives cannot handle the RAID configuration due to the TLER spec (or lack thereof). If they were up to spec, then the Intel RAID "controller" would suffice?
 
LVL 47

Expert Comment

by:dlethe
ID: 34239034
Well, it would 'work', but understand that this is a fakeraid and the device driver does all the work.  It is a low-end, cheap 3 dollar chip that you are trusting your data to: no processor, no ECC, no battery.  Frankly, IMHO, it is unsuitable.
 

Author Comment

by:csialbany
ID: 34239985
Hi again,

I just spoke with Western Digital and they claim the WD6000HLHX is RAID compatible but "did not have the information" on whether or not it has the TLER spec.

I think I trust your opinion over a Western Digital customer service representative. I'm just wondering if it is normal to see a problem with intermittent failures that increase over time on an inadequate drive in RAID? Is it also normal for the RAID status to show up as "Normal" in the Intel configuration screen on boot? Also, is it normal that I can still access the filesystem even if there are RAID issues?

Thanks for your answers so far. Unfortunately the only solution this leads me to is reinstalling the OS without RAID and waiting to find out if the problems recur.
 
LVL 47

Expert Comment

by:dlethe
ID: 34241158
Here is the deal with "RAID compatible".  A USB memory stick is RAID compatible.  A floppy drive and an EMC subsystem are also RAID compatible.

The deal is how the RAID "controller", whether it is software RAID or hardware RAID, is architected to deal with error recovery situations.  I'm going to take some shortcuts here and not write a volume on all error recovery scenarios, and generalize things, just because we all have better stuff to do, and I have a conf call in a few mins.

The intel matrix controller, as well as LSI MPT, and HP SMARTArray, and pretty much all the hardware controllers allow about 7-10 seconds for a drive to respond to queued I/O requests.  When a HDD goes into a deep recovery mode, it sits there and re-reads and re-reads the offending sector(s) until it times out based on settings in firmware.  TLER has a read and write timer.  TLER is also a marketing term, BTW, as it started as a vendor-specific feature that has since been adopted as a standard.

Basically, the firmware sets how long a disk is going to be allowed to work exclusively on recovering an unreadable block (or range, depending on the ATA command).   "Enterprise RAID" disks set it for 2-3 seconds.  The premise is that the disk is behind a RAID controller or software and there is redundancy, so just give up and let the controller get the data from XOR parity elsewhere, and it will overwrite and remap the block automatically.

A disk without TLER (which again is not the correct term, and not even the correct way to say it, but it is just easier to go with the flow here), will go for 30+ seconds.  It is presumably a desktop machine, and the block belongs to some guy's wedding photos, or important data, and there is never a backup, so it needs to pull out all the stops and try to recover before giving up.

So to the controller, the disk locks up during that time.

This is why the 2-3 second timeout found in high-end disks and standard with FC, SAS, SCSI just makes those systems so snappy, and they rarely lock up.

Now the problem is that the matrix controller allows 7-10 seconds (it varies by model) for a disk to time out before it takes "corrective actions".  Corrective actions could be a lot of things, and it depends on variables I can't get into w/o NDA and looking things up on the specific chip, and I wouldn't even do that w/o $$$ for the effort :)

The matrix controller does NOT have the intelligence to automatically re-route read requests to the other disk (if you had RAID 1), and since you are doing RAID0, where both disks must respond to each I/O, it waits.
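The timing argument above can be put into a small sketch. This is only an illustration of the reasoning, not how any controller firmware actually works: the class and function names are hypothetical, and the numbers are the rough figures quoted in this comment (2-3 seconds for drives with a short recovery limit, 30+ seconds for desktop drives, 7-10 seconds for the controller).

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    recovery_limit_s: float  # longest the firmware retries an unreadable block


def controller_outcome(drive: Drive, controller_timeout_s: float) -> str:
    """Compare the drive's recovery limit with the controller's patience."""
    if drive.recovery_limit_s <= controller_timeout_s:
        # The drive gives up and reports the error before the controller does.
        return "drive reports error in time"
    # The drive is still retrying when the controller's timer expires.
    return "controller times out first"


if __name__ == "__main__":
    # Illustrative values taken from the ranges mentioned in this comment.
    short_limit = Drive("short recovery limit", recovery_limit_s=3.0)
    desktop = Drive("desktop-style recovery", recovery_limit_s=30.0)
    for d in (short_limit, desktop):
        print(d.name, "->", controller_outcome(d, controller_timeout_s=7.0))
```

With a 7 second controller window, only the drive with the short limit answers before the controller gives up on it, which is the situation described above.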

Windows native software RAID1, as well as the Linux md driver, Solaris ZFS, and other similar RAIDs, will remap automatically, and won't kill a disk w/o knowing it is really dead.

So that is the problem.
Just turn OFF the Intel matrix firmware, reinstall, and if you want to keep those disks, you are simply going to have to go to RAID10 or get 2 larger drives and go with RAID1.   With software RAID, you have the benefit of reading from both disks at the same time, so reads will have improved throughput and IOPS.   You do not get this benefit with the matrix controller anyway, so you should actually see a nice performance boost in software RAID1 vs matrix RAID0 in reads.   In writes, real-world, it will probably be a wash.
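For the capacity side of that choice, a rough sketch of the arithmetic follows. The reasoning is an assumption (the comment does not spell out why RAID10 or larger drives are needed); it simply uses the standard usable-capacity rules for each level, and the function name is hypothetical.

```python
def usable_capacity_gb(level: str, disk_gb: int, disks: int) -> int:
    """Usable capacity for identical disks under a few common RAID levels."""
    if level == "RAID0":
        if disks < 2:
            raise ValueError("RAID0 needs at least 2 disks")
        return disk_gb * disks  # striping: all capacity is usable
    if level == "RAID1":
        if disks != 2:
            raise ValueError("this sketch models a two-disk mirror only")
        return disk_gb  # mirroring: one disk's worth
    if level == "RAID10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID10 needs an even number of disks, at least 4")
        return disk_gb * disks // 2  # striped mirrors: half the raw capacity
    raise ValueError(f"unsupported level: {level}")


if __name__ == "__main__":
    print(usable_capacity_gb("RAID0", 600, 2))   # prints 1200
    print(usable_capacity_gb("RAID1", 600, 2))   # prints 600
    print(usable_capacity_gb("RAID10", 600, 4))  # prints 1200
```

So two 600GB drives in RAID1 give half the space of the current RAID0 array; matching it means four such drives in RAID10 or two larger drives in RAID1.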
 

Author Comment

by:csialbany
ID: 34257115
So I am going to reinstall and go with software RAID1. Thanks for the information, I'll keep you posted.
 
LVL 47

Expert Comment

by:dlethe
ID: 34257228
Good, that is what you should do.  If you have not done so already, you might take the time to run some benchmarks, and establish a baseline so you can do a before / after.
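As a very rough way to capture such a baseline, the sketch below times a sequential write and read of a temporary file. It is an assumption-laden stand-in for a real benchmark tool: the function name and default sizes are made up, and the read figure will usually reflect the operating system's cache rather than the disks, so treat the numbers only as a before/after comparison on the same machine.

```python
import os
import tempfile
import time
from typing import Tuple


def sequential_throughput_mb_s(directory: str, size_mb: int = 64,
                               block_kb: int = 1024) -> Tuple[float, float]:
    """Return (write MB/s, read MB/s) for a temporary file in directory."""
    block = os.urandom(block_kb * 1024)
    blocks = max(1, size_mb * 1024 // block_kb)
    written_mb = blocks * block_kb / 1024
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before stopping the clock
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_kb * 1024):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)
    return written_mb / write_s, written_mb / read_s


if __name__ == "__main__":
    write_mb_s, read_mb_s = sequential_throughput_mb_s(tempfile.gettempdir())
    print(f"write: {write_mb_s:.1f} MB/s, read: {read_mb_s:.1f} MB/s")
```

Run it against a folder on the array before the rebuild and again afterwards, with the same sizes, and keep both results.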
 

Author Comment

by:csialbany
ID: 34417490
Alright,

Sorry for the long delay.

So I took the hard drives out of RAID and did a clean install of Win 7 x64 (I didn't even use software RAID, just a regular single HD install). No issues with the OS install and initial driver installs. Ran Prime95 for 12 hours and memtest for 12 hours with no issues.

I delivered the machine to the client and a week later they started getting random blue screens again.

Checked the memory compatibility and found that ASUS only supports 3 DIMMs of the RAM model the client purchased (there were 6 installed). Pulled the 3 additional DIMMs and the blue screens persist.

In the event log there were several NTFS errors before the blue screens began. I will try running chkdsk and see if this has any effect.

Any other ideas are greatly appreciated.
 
LVL 47

Accepted Solution

by:
dlethe earned 500 total points
ID: 34418137
Don't ever lose sight of the fact that there could have been several issues ...

Sometimes people buy crappy RAM where the chips don't meet the specs, so are the modules the actual manufacturer and part number that ASUS supports?  If not, then that is someplace to look.  (But give agengler11 credit if that is it, as he suggested it first).

If 12+12 hours is more than long enough for it to have crashed before, then you have to consider that perhaps your customer has a power problem.  I've seen similar things when end-users have equipment like arc welders on the same circuit that nearly fry any electronics that are nearby.
 
LVL 27

Expert Comment

by:Tolomir
ID: 34699724
This question has been classified as abandoned and is being closed as part of the Cleanup Program. See my comment at the end of the question for more details.