hypercube

Diagnosing reason for Windows 7 PC freezing / stopping - only when in production.

I have a number of production workstations running Windows 7 Pro.  One started freezing up a few times each day and I've been trying to troubleshoot it.
(Please don't suggest a new computer or a new operating system as these are definitely worthy of consideration but are *not* the purpose of this question).
"Freezing" in this case means it requires a hard reboot to recover.  Nothing is responsive.  I presume that the remaining display comes from the monitor's buffer.

Any number of things have been done:
cleaned up accumulated dust
Refreshed the CPU heatsink compound.
sfc
dism
driver updates
new power supply
HD testing and monitoring
RAM testing

Last week, I took the machine into the lab and ran it for 4 days straight without any indication of freezing - including stress tests.
Then, figuring that hardware and some software were eliminated, returned it to production.
The freezing while in production resumed and I've witnessed it / experienced it.
Event Monitor shows nothing leading up to the reboot.

Now I have it back in the lab along with its charge card machine, wireless mouse, wired keyboard.  Other than networked printer, everything should be here.
It's been running now for 15 hours nonstop and I have to be planning the next steps.
This question is as much about considering approaches as about fixing this immediate issue!!

We have a machine that freezes in production and does not freeze in the lab.  What's different?
- Environmental: Does the room it's in make a difference?
     RF interference re the wireless mouse?
     Power conditioning or lack thereof
    (Ambient temperature is about the same)
- Software:
     Actual software apps that are being used in production.  

Diagnosis:
- If the machine would fail in the lab then we could try things like trimming startups in msconfig.  But this isn't the case.
- Work with the User to determine more finely-honed software environments: what's being started manually? etc.

I'm well aware that you can't fix what you can't observe.  So, the challenge here seems to be to generate some better observations and data.

Q: How might we generate better observations and data?
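One low-cost way to get better data on a hard freeze is a heartbeat log: a small script appends a timestamp to a file every few seconds, so after a hard reboot the last entry before the gap shows roughly when the machine stopped. The following is a minimal sketch only, not something from this thread; the file path, the 10-second interval, and the function names are all assumptions chosen for illustration.

```python
import os
import time
from datetime import datetime, timedelta


def write_heartbeats(path, interval, count=None):
    """Append an ISO timestamp to `path` every `interval` seconds.

    Each line is flushed and synced to disk so it survives a hard freeze.
    `count=None` runs until the process is stopped (hypothetical default).
    """
    written = 0
    with open(path, "a", encoding="utf-8") as log:
        while count is None or written < count:
            log.write(datetime.now().isoformat(timespec="seconds") + "\n")
            log.flush()
            os.fsync(log.fileno())
            written += 1
            if count is None or written < count:
                time.sleep(interval)


def read_heartbeats(path):
    """Return the logged timestamps as datetime objects, oldest first."""
    with open(path, encoding="utf-8") as log:
        return [datetime.fromisoformat(line.strip()) for line in log if line.strip()]


def find_gaps(timestamps, interval, slack=2.0):
    """Return (last_seen, next_seen) pairs where the gap exceeds interval * slack."""
    limit = timedelta(seconds=interval * slack)
    return [
        (earlier, later)
        for earlier, later in zip(timestamps, timestamps[1:])
        if later - earlier > limit
    ]
```

Something like `write_heartbeats("heartbeat.log", 10)` could be started at logon; after a freeze and reboot, `find_gaps(read_heartbeats("heartbeat.log"), 10)` lists each window in which the machine was down. The start of each gap can then be compared with anything else that is time-stamped: Event Viewer entries, software activity, or equipment in the building switching on.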
John

Drivers:  In addition to your list, update BIOS (many since 1/1/2018) and Chipset.

Windows Updates:  Run these (new ones this week).

Users:  Try a new, test, different Windows User Profile (Account).
Hi Fred,

Thanks for posting that exhaustive list of what you've tried - saves a lot of to and fro questions. In this case, I'd be suspecting a hardware problem and testing for that.

An excellent tool for checking all hardware components in a machine, one I've been using for many years, is BurnInTest by PassMark.

The trial only allows 15 minutes of continuous testing, which often isn't enough time for it to fault a component, but it can be run multiple times in trial mode. Alternatively, a one time purchase to unlock continuous testing is quite affordable in my view. Grab a copy from here:

https://www.passmark.com.au/download/bit_download.htm

If you need help in configuring its tests, sing out and I'll be happy to help.

Regards, Andrew
Now I have it back in the lab along with its charge card machine, wireless mouse, wired keyboard.  Other than networked printer, everything should be here.
Swap out the charge card machine with another unit and see if the other unit fails.

The freezing while in production resumed and I've witnessed it / experienced it.
Were you actually working on it when it froze, or were you just nearby and called over to take a look? Can you determine what input was going on when it froze? Maybe from the timestamps in the log files around the time it froze. Running alone it sounds fine; it sounds like it may freeze when it's receiving input or calculating data. You mentioned a card reader: does it fail after a card is scanned? That may narrow it down to the card reader or the card reader driver.
One other thing I just thought of as well Fred. Have you checked Windows Event Viewer for any Application or System errors at the time it happens? If it's freezing up solid, there likely may not be anything there, but worth a quick look.
hypercube

ASKER

There is no correlation between freezing and operator actions that I can tell.  I was doing something pretty simple yesterday while it was in the production environment - when it froze up - (like moving the mouse from A to B across the screen with no clicks).

The charge card machine, the mouse, etc. have all been swapped out at various times in order to pinpoint the cause of freezing.

At first, I thought it was the accumulation of dust because inadequate cooling *will* cause this to happen without a doubt.
But that appears to have been completely eliminated.
I have been using Passmark's BurnInTest program (where I said "including stress tests").
So, I too was leaning toward hardware initially.
But this doesn't explain how it would run for 4 days without incident in the lab and yet show around 4 incidents during 8 hours while in the production environment (office).
So now I'm leaning toward software......

Event Viewer was in the original question but I had called it Event Monitor.
Dr. Klahn

If you have multiple systems identical in hardware but with different Windows installations, swap the drives between the afflicted machine and one that does not have problems.

If the problem follows the machine, it's a hardware issue.  If the problem moves to the other machine, it's a software issue.  From there you can decide how to chase the problem further.
Fred, is it possible there is a sudden "spike" in the CPU temperature when this occurs? Also, are you running the system through a UPS to ensure there aren't any surges occurring? Here's a good article on Basic CPU Temperature Monitoring if you want to try that.

Just trying to think of all possibilities here. Intermittent problems like this can be a royal PITA to troubleshoot.
If it's crashing while the mouse is moving, it could be an IRQ issue. Search "system info" and check for IRQ conflicts. It could also be a sign the CPU, motherboard, or memory is going bad. For troubleshooting, swap the memory out first since it's the easiest and fastest to replace. Memory that's intermittent and still passes tests can act this way. If you don't have memory to swap, run a full test, the one that takes 24 or so hours to really burn in the memory while testing.
At a minimum reset the memory, it's been well documented resetting the memory can take care of small "leaks" in the cells that can cause intermittent issues like this. Pull the memory and keep it plugged in and press the power button. This grounds the machine to the building. If you don't feel comfortable turning it on without the memory in then place a grounding strip from the chassis to a grounded point in the building, unplug it from the wall and ground it this way.
In order of importance:
1) I'd like to make it fail here in the lab so I might figure out the cause and figure out how to fix it.
2) I'd rather not "try it" again in production as that's a bit disruptive for the users.
3) I may have to.  A new User profile would be one of those cases - under the assumption that it will still continue to work fine in the lab.

I wouldn't say this is exactly "intermittent" when it won't fail in the lab......
But, yes, it's intermittently consistent in failing in production.
I think that's a pretty good reason to doubt hardware issues - but surely willing to listen to reason.
Only suggestions.

Bad power (see Andrew Leniart's post).
Interference eg neons, high power wiring (elevator shafts, machinery etc)
Radio or radar systems doing testing in the area (but this is stretching it a little)
Is it under a load in the lab? Do you move the mouse around and enter data the same way it does in production? Or does it just sit in the lab and run?

You said you were using the mouse when it failed. This is IRQ related and (maybe) at the processor, motherboard, memory level.

What load is it under in the lab and how close is it to simulating the load while in production?
I'm still inclined to think there may be an on-site cause. I recall an instance with one of my clients once where I removed a machine to my workplace to test and couldn't get it to fault, yet it would fault at his offices. A month or so later, he had an electrical safety test done and about 8 surge powerboards failed testing.

We installed a voltage monitoring device at his office suites only to find out that the expected "maximum" voltage of 230 was spiking at times to 260+ and was the cause of burning out his surge protection boards. That's why I made the suggestion to run the machine behind a UPS on location to see if that cured the problem. I personally would expect this to be a lot easier to be reproduced if it was software related.
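If the voltage monitor can export its readings, a few lines of script can pull out the over-voltage events so they can be lined up against the freeze times. This is a rough sketch under stated assumptions: the CSV layout with `timestamp` and `volts` columns is hypothetical (real loggers differ), and the 230 V limit is simply the expected maximum quoted above.

```python
import csv
import io
from datetime import datetime

# 230 V is the expected maximum quoted in the post above; adjust for your mains.
EXPECTED_MAX_VOLTS = 230.0


def parse_voltage_log(text):
    """Parse CSV text with 'timestamp' (ISO 8601) and 'volts' columns.

    The column names are an assumption; rename them to match your logger.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        (datetime.fromisoformat(row["timestamp"]), float(row["volts"]))
        for row in reader
    ]


def find_overvoltage(samples, limit=EXPECTED_MAX_VOLTS):
    """Return the (timestamp, volts) samples that exceed `limit`."""
    return [(when, volts) for when, volts in samples if volts > limit]
```

Feeding the exported text to `parse_voltage_log` and then `find_overvoltage` gives a short list of spike times. If those cluster around the moments the PC froze, that points toward the site power rather than the machine.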

Regards, Andrew
It can also be bad AC wiring, or grounding, or shielding.
Put the system on a UPS to eliminate power issues.
One of the possibilities is certainly the environment, as said. If the UPS does not help, put the system in a closed, grounded metal box, leaving only inlet and outlet vents for air circulation.
nobus:  Thank you!
I fully understand the idea but I can just imagine this "closed metal box" in a dentist's reception area!!  OMG!
I guess one would run the power, keyboard, charge card machine, mouse ... USB devices through one of the vents, eh?
:-)
I think we're zeroing in on a likely approach re: the environment (i.e. without the box).

The other (software) approach would appear to be a close observation of just what the user turns on and leaves running during the day.
I can't duplicate all that 100% in the lab.

WORKS2011: I didn't say that moving the mouse caused it to freeze.  I tried to say that it froze once while the mouse cursor was moving - as evidence of my being present and alert.  I am told that often it is frozen when the user returns to use it.  And, this system had been working for a long time (it *is* Windows 7) and this problem just showed up a few weeks ago.  
I wonder if an IRQ problem would freeze the entire system?

(I'm wondering if they plugged in a heater? and/or are bringing in the power through a skinny wire?  re Andrew's comment).
Well, a closed metal box is not really closed; it may have holes and grilles. The idea is a Faraday cage: https://en.wikipedia.org/wiki/Faraday_cage, and you can make it with as much bling bling as you like!
As for your Q "I wonder if an IRQ problem would freeze the entire system?": you won't know for sure until you have found the cause, so until then ALL is possible.
I would be leaning more towards what the users in the dental reception may have been doing on the computer immediately prior to the freeze-ups, and I wonder how forthright the reception staff would be about this.  We all know that uninformed computer users do things that they see as so innocuous that they are not worth mentioning.

You mentioned that while you were at the dental reception it froze when you did something simple.  Do you know, and would the staff have told you, if they had just been checking their personal GMail, browsing eBay in multiple browser tabs, or had their smartphone plugged into the USB port to charge?  Browser processes can take a while to fully quit after the browser windows are closed, especially with a bunch of add-ons running, and you occasionally get hang-ups if you try to do anything while the processes are being terminated.  An additional USB device could place additional demands on an already burdened computer tower.

Some standalone tools that might help:
https://www.nirsoft.net/utils/what_is_hang.html
https://www.nirsoft.net/utils/computer_activity_view.html
https://www.nirsoft.net/utils/full_event_log_view.html
https://www.nirsoft.net/utils/usb_devices_view.html
Once again, there have been no freeze-ups in the lab.
So, tomorrow I'm going to install the computer back in their office and:
1) Check the power connection / external wiring / high gauge or overloaded?
2) Look for obvious things like heaters on the same source.
3) Install a UPS just for this computer.
We'll see....

I'm fairly confident that the one staff person using this computer isn't doing odd things.
But I agree that it's a good question.

There aren't a bunch of addons.  The main application for scheduling and patient data is a web-based service.  Pretty typical these days and lends itself to lower computer loads than in the past.
I found that the power source was being run through a crowded surge suppressor.
So, I installed a UPS directly to the wall.

The problem isn't yet solved.
The only user is sensitive to what's running - and it's not much at all.
It was frozen up this morning when she arrived - so this rather lets out much in the way of operator interaction.

I'm planning on doing a repair install over the weekend.
Be that as it may, this is starting to feel like hardware again!
I'm planning on doing a repair install over the weekend.
Be that as it may, this is starting to feel like hardware again!

That's been my gut feeling from the start as well Fred. The UPS was an excellent addition and should rule out an on-site power issue too. Please do keep us updated.
I don't think I mentioned it but I *did* inspect the motherboard capacitors on this one.
So that leaves me wondering how to best address hardware.
I do have a diagnostic PCI board but I rather think it only deals with the boot sequence to indicate where it stops .. not sure.
BillDL thanks for the links!
I'd still go with a power issue.  If you installed a UPS with auto voltage regulation one would think it should be a non-issue, though not all UPS models are as proactive as we'd like to think.  I have seen this situation twice with similar behavior and once even through a UPS.  

One time we ran across this, as it happened (luckily) my coworker who handled the onsites noticed a refrigerator unit down the hall kicking on when the freeze would occur.   Took 3 or 4 onsites to figure that one and a lot of bench testing down the drain.  Also a lot of customer frustration, and free labor.  He found another wall plug nearby that must have been on a different circuit and all was well.  

The second time we ran into a situation like this, we ended up replacing the PC. My best guess was that it was a power issue, and that the PC in question (on a cheaper UPS) was more sensitive to it than the replacement unit.  The client was happier to get it over with and discontinue the back and forth between us, and the down time.  Plus we had a new bench PC out of the deal.  

Some things to consider.  You mentioned motherboard caps. Bad power supply caps are far more rare, but you probably can't see them at all without forcing your way into the casing.  They could contribute to this behavior, in addition to possible power-related issues in the onsite environment that you don't see in the shop or on the bench.  It's a long shot but I figured I would mention it.  

You might also want to make sure you have all the peripherals used onsite in your bench testing, anything connected directly to the PC with a cable, customer flash drive even, anything.  Of course you would be starting up the software the customer uses, etc. etc. you know this.  

You mentioned multiple machines deployed. I wonder what would happen if you switched this PC with another that doesn't have the issue.  Would the switched-in PC develop the issue, and would the swapped-out PC continue to have the issue in a new location, or would neither PC have the issue at all?  Is one PC more sensitive on that particular wall outlet than the other?  Since your environment back in the shop is somehow different and there are no lockups on the bench, would the same be true in a different location onsite?  May be worth it, may not be.  If it's an issue of user data, it may be an option to swap hard drives in the PCs you switch around onsite.  Just a thought.
It appears that the culprit was an update program installed by an app provider.  Removing that solved the problem.  I can't say why this would not have been an issue when the computer was running in the lab.  We found it by matching the date of when the trouble started with the date of software installation and then just guessed and removed it.
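The date-matching idea can be sketched in a few lines for anyone who wants to shortlist candidates rather than eyeball the list. This is only an illustration under assumptions, not what was actually run here: it takes (name, install date) pairs, with dates as `YYYYMMDD` strings (the form commonly found in the `InstallDate` value of the Windows uninstall registry entries, though not every installer fills it in), and returns anything installed within a chosen window before the trouble started. The 14-day window and the function names are arbitrary choices.

```python
from datetime import datetime, timedelta


def parse_install_date(value):
    """Convert a 'YYYYMMDD' string to a date, or None if missing or malformed."""
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except (TypeError, ValueError):
        return None


def installed_near(programs, onset, days_before=14):
    """Return (install_date, name) pairs installed in the window before `onset`.

    `programs` is an iterable of (name, 'YYYYMMDD') pairs; `onset` is the date
    the trouble was first noticed. Results are sorted newest first.
    """
    window_start = onset - timedelta(days=days_before)
    hits = []
    for name, raw_date in programs:
        installed = parse_install_date(raw_date)
        if installed is not None and window_start <= installed <= onset:
            hits.append((installed, name))
    return sorted(hits, reverse=True)
```

Anything the function returns is only a suspect, not a proven cause; the confirmation is still removing it and seeing whether the freezes stop. Entries with no recorded install date are skipped, so they need to be checked by hand.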
Thank you Fred.  I'm glad you managed to find what was causing this issue.