Windows 2003 SQL Server loss of network connectivity

Posted on 2007-10-14
Last Modified: 2008-01-09
We have an interesting problem.
First the background.
3 machines (I'll call them APP01, SQL01, and WEB01), all running Windows 2003 SP1 with up-to-date AV, Microsoft patches, etc.
WEB01 is the only front-facing machine. It's the front end for the web application. All traffic to get to APP01 or SQL01 goes through WEB01.
SQL Server 2000 runs on SQL01.
APP01 runs the application software.
WEB01 runs a .NET web application front end via IIS and tomcat.

4:30 Wednesday 10th October: SQL01 BSODs.
4:34 Wednesday 10th October: APP01 BSODs.
Machines came back up fine and all seemed good in the world (barring the question of why they BSOD'd).
Thursday morning SQL01 began not responding to WEB01 or APP01 trying to access the SQL databases.
A Windows service on SQL01 that checks a UNC drive on APP01 every 3 minutes for new files fell over, saying it couldn't find the directory.
SQL01 degraded over time (coming up with insufficient resource messages) to the point where running Event Viewer came up with a DLL fault and Windows telling you it couldn't run.
Trying to map a network drive during the degradation was fine; UNC paths worked great.
Copying a 720k file from a local drive to a network drive on APP01 failed with "insufficient system resources".
MOM alerts also came up at about the same times for not being able to contact the MOM server.
The machine, however, replied to pings fine.

Rebooting SQL01 brought it back into being and behaving.
SQL01 lasted between 10 minutes and 2 hours before doing the same thing.
If left alone, SQL01 would "fix" itself after 30-45 minutes (the network connectivity problem, that is; the Windows problems remained), only to do the same thing 10 minutes to 1 hour later.
At no time did the network card actually drop its connection. I was RDP'd into it via the -console option and also used other remote control methods in case RDP was part of the problem, and at no point was network connectivity interrupted.

Diagnosis/fixes done so far
Microsoft have analysed the dumps and applied numerous patches not sent via Windows Update, no difference.
Dell have replaced the disk controller for the local disks, no difference.
Removed any remote control software and used Dell's hardware (DRAC) remote control, no difference.
Stopped AV realtime scanning, no difference.
Firmware and BIOS updated, no difference.
All access TO SQL01 was being done via IP address, NOT machine name.
The switch all 4 machines (SQL01, APP01, WEB01, replacement SQL01) are connected to has been checked for errors, re-transmits, resets on the interfaces, bad packets, you name it; nothing showing up at all.

Status now.
The SQL databases have been moved to another Windows 2003 SP1 server so diagnosis of this server can be done. APP01 and WEB01 have been pointed to the new server, and all is fine in the world and things are functioning fine.
SFC /scannow done on SQL01, nothing wrong.
Today in my testing I managed (by luck or otherwise), using some large SQL queries, to make the Windows service fail again with "Path not found". I cannot make it do it again. The only other thing of note happening at the same time was that the Windows Virtual Disk Manager had started itself up.
Perfmon has been told to collect data (we couldn't before because of the state of the machine). Hopefully it can catch the Windows service crashing again.
Noticed in Perfmon when I was setting it up that the MS Loopback Adapter is installed. Not a problem, however I did see traffic go through it, which I think is weird.
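
To line up the "Path not found" failures with the Perfmon counters, a small timestamped probe could run alongside the real service. This is only a rough sketch, not the service itself: the function names, the log format and the 180-second default (matching the 3-minute check described above) are assumptions for illustration.

```python
import datetime
import os
import time


def probe_path(path):
    """Try to list `path` once; return (timestamp, ok, detail)."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    try:
        count = len(os.listdir(path))
        return stamp, True, f"{count} entries"
    except OSError as exc:
        # Path-not-found and resource errors both surface as OSError.
        return stamp, False, f"{type(exc).__name__}: {exc}"


def run_probe(path, log_file, interval_seconds=180, max_checks=None):
    """Append one tab-separated line per check; max_checks=None runs forever."""
    done = 0
    while max_checks is None or done < max_checks:
        stamp, ok, detail = probe_path(path)
        with open(log_file, "a", encoding="utf-8") as log:
            log.write(f"{stamp}\t{'OK' if ok else 'FAIL'}\t{detail}\n")
        done += 1
        if max_checks is None or done < max_checks:
            time.sleep(interval_seconds)
```

Pointing `run_probe` at the UNC share on APP01 (the share name is whatever the real service watches) would give a log whose FAIL timestamps can be compared against the Perfmon capture.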

Theories so far.
Hardware - memory, to be specific. We all know the numerous faults bad memory can cause. Dell diagnostics don't show any problem. I might run memtest, but since there's 8 gig to test it could take a while.
Windows routing is stuffed up. (But then why does it only happen sometimes and other times it's fine? Is it possible it sometimes decides to route things not meant for the local machine through the loopback adapter, and if so why is remote control software not affected?)
The computer gods don't like that I've not sacrificed a machine lately and are going to take one by force.
Bill doesn't like me.

Any other suggestions/thoughts?
Thanks in advance,
Question by:qz8dsw
    LVL 8

    Expert Comment

    I'd definitely run the mem test. A couple of months ago we had a similar problem with our Dell 6850s; unfortunately for us the server had 32GB of RAM.
    They gave me this util to run
    The program will create a bootable USB stick for you to boot the server from, and then there is an option to do a quick test. In our case the quick test did actually pick up the problem.

    Hope that at least gives you something to look into!


    Dave J
    LVL 15

    Author Comment

    Hi Dave,

    Political foreplay (For want of a better description) has stopped me from moving forward on this.
    I'm "officially" an application specialist for this service, not an OS/system specialist, in essence.
    I'm going to step back until Friday my time and see what they find.
    If they don't have anything definitive by Friday my time, that's when I'll grab the machine in question for the weekend and run the diagnostics. (I think the engineer sent out on site was looking at the Dell OpenManage software and not running the offline diagnostics.)
    Monday (hopefully) is when they will re-install Windows.

    I find it interesting, however, that you had the same sort of trouble with the same model Dell. (Yes, we run a 6850 too.)
    And I find it more interesting that you brought up the same model, since I'd not mentioned the model number of the server at all.

    LVL 15

    Author Comment

    hokies oldhammbc
    Here's the exact failure. (These diagnostics were run from 2 floppy disks as the operators on site could not find a writable CD or USB drive we could use.)
    So much for Dell doing the diagnostics.

    Test Results : Fail
    Error code 2900:0221
    Msg: <System Boot>:power supply::power supply sensor (Status) failure detected.

    I'm investigating the error code further to figure out what it means, but let's face it: if the power supply voltages are fluctuating then it could cause all sorts of trouble.

    LVL 15

    Author Comment

    OK, dug around a bit more.
    First, that error is nothing. It's checking the machine's internal event log and picking up old power supply messages from when it was originally built. You have to use OpenManage to clear those.

    Memory checked out fine doing a pattern test.
    But here's the interesting bit: 3 warnings about IRQ sharing.
    Video, network and the internal disk controller are sharing IRQs.
    Now "normally" you can get away with a wee bit of IRQ sharing as long as the devices are not intensive.
    But video, network and disk controller????
    Those are VERY intensive and would generate a lot of interrupt requests.

    I "think" there is a very good possibility that, as network activity and disk IO went up because of increased usage, the shared interrupt got overloaded and the network and disk IO on that machine was stuffed, BUT the network card itself was still functioning fine; just the interrupt was overloaded.

    Any thoughts on my very large theory?

    LVL 15

    Accepted Solution

    OK. I got it, I got it, I got it!!!!!!

    I refer to

    Those definitely point to PAE being a real problem. (Supposedly one of them is fixed in SP1 of Win 2k3, which is applied to the server in question.)
    PAE in W2k3 is turned on by default (a change from W2k, where you had to turn it on).
    Our server has 8 gig of memory.
    Using /burnmemory=4096 (even with /pae still there) has made Windows only see 4 gig of memory and, guess what, the machine is stable. Take out that /burnmemory=4096 so Windows can see all 8 gig and it goes back to being very unstable, failing within 5 minutes of bootup when attaching to a UNC path while running some grunty SQL queries.
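
    For anyone checking the arithmetic, /burnmemory takes a value in megabytes and subtracts it from what Windows is allowed to use. A minimal sketch of that calculation follows; the helper name is hypothetical, and it assumes "8 gig" means 8192 MB and that only /burnmemory is limiting memory.

```python
import re


def visible_memory_mb(installed_mb, boot_switches):
    """Return the RAM in MB left for Windows after /burnmemory is applied."""
    match = re.search(r"/burnmemory=(\d+)", boot_switches, re.IGNORECASE)
    burned = int(match.group(1)) if match else 0
    # Never report a negative amount if the switch exceeds installed RAM.
    return max(installed_mb - burned, 0)


print(visible_memory_mb(8192, "/pae /burnmemory=4096"))  # 4096
print(visible_memory_mb(8192, "/pae"))                   # 8192
```

    So the stable configuration is the one that keeps Windows at or below the 4 gig line, even though PAE is still enabled.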

    The memory itself has been pattern tested and all 8 gig passes multiple read/write cycles.
    This seems to go back to Microsoft, pure and outright.

    LVL 1

    Expert Comment

    Closed, 500 points refunded.
    Community Support Moderator
