Windows 2003 SQL Server loss of network connectivity
Posted on 2007-10-14
We have an interesting problem.
First the background.
3 machines (I'll call them APP01, SQL01, and WEB01), all running Windows 2003 SP1, with up-to-date AV, Microsoft patches, etc.
WEB01 is the only front-facing machine. It's the front end for the web application. All traffic to APP01 or SQL01 goes through WEB01.
SQL Server 2000 runs on SQL01.
APP01 runs the application software.
WEB01 runs a .NET web application front end via IIS and Tomcat.
4:30 Wednesday 10th October: SQL01 BSODs.
4:34 Wednesday 10th October: APP01 BSODs.
The machines came back up fine and all seemed good in the world (barring the question of why they BSOD'd).
On Thursday morning SQL01 stopped responding to WEB01 and APP01 when they tried to access the SQL databases.
A Windows service on SQL01 that checks a UNC drive on APP01 every 3 minutes for new files falls over, saying it can't find the directory.
SQL01 degrades over time (coming up with insufficient resource messages) to the point where running Event Viewer produces a DLL fault and Windows tells you it can't run it.
Mapping a network drive during the degradation was fine, and UNC paths worked great.
Copying a 720k file from a local drive to a network drive on APP01 failed with "insufficient system resources".
MOM alerts also came up at about the same times for not being able to contact the MOM server.
The machine, however, replied to pings fine.
Rebooting SQL01 brought it back to life and behaving.
SQL01 then lasted between 10 minutes and 2 hours before doing the same thing again.
If left alone, SQL01 would "fix" itself after 30-45 minutes (the network connectivity problem, that is; the Windows problems remained), only to do the same thing 10 minutes to 1 hour later.
At no time did the network card actually drop its connection. I was RDP'd into it via the -console switch and also used other remote control methods in case RDP was part of the problem, and at no point was network connectivity interrupted.
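The symptoms above (pings and remote control keep working while share access fails) could be timestamped with a small probe. This is a minimal sketch, not the actual Windows service: the function names, the share path, and the choice of TCP port 445 (SMB) are assumptions for illustration, and the same logic would need porting to whatever scripting is available on the server itself.

```python
import os
import socket
import time
from datetime import datetime


def check_share(path):
    """Try to list a directory (e.g. a UNC path); return (ok, detail)."""
    try:
        entries = os.listdir(path)
        return True, "%d entries" % len(entries)
    except OSError as exc:
        # On Windows, exc.winerror holds the native error code, e.g.
        # 3 (path not found) or 1450 (insufficient system resources).
        code = getattr(exc, "winerror", None) or exc.errno
        return False, "error %s: %s" % (code, exc.strerror)


def check_tcp(host, port=445, timeout=5.0):
    """Try a plain TCP connect; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        return False, "error: %s" % exc


def probe_once(share_path, host, port=445):
    """Run both checks and return one timestamped log line."""
    stamp = datetime.now().isoformat(timespec="seconds")
    share_ok, share_detail = check_share(share_path)
    tcp_ok, tcp_detail = check_tcp(host, port)
    return "%s share_ok=%s (%s) tcp_ok=%s (%s)" % (
        stamp, share_ok, share_detail, tcp_ok, tcp_detail)


def run(share_path, host, rounds, interval_seconds, log_path):
    """Append one probe line per interval to a log file."""
    for _ in range(rounds):
        with open(log_path, "a") as log:
            log.write(probe_once(share_path, host) + "\n")
        time.sleep(interval_seconds)
```

For example, `run(r"\\APP01\incoming", "APP01", 480, 180, "probe.log")` would probe every 3 minutes for a day; the share name `incoming` is hypothetical. Lines where `tcp_ok` stays true while `share_ok` turns false would support the observation that the network link itself never dropped.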
Diagnosis/fixes done so far
Microsoft have analysed the dumps and applied numerous patches not sent via Windows Update, no difference.
Dell have replaced the disk controller for the local disks, no difference.
Removed any remote control software and used Dell's hardware (DRAC) remote control, no difference.
Stopped AV realtime scanning, no difference.
Firmware and BIOS updated, no difference.
All access TO SQL01 was being done via IP address, NOT machine name.
The switch that all 4 machines (SQL01, APP01, WEB01, and the replacement SQL01) connect to has been checked for errors, retransmits, interface resets, bad packets, you name it; nothing is showing up at all.
The SQL databases have been moved to another Windows 2003 SP1 server so that this server can be diagnosed. APP01 and WEB01 now point to the new server, and everything is functioning fine.
SFC /scannow done on SQL01, nothing wrong.
Today in my testing I managed (by luck or otherwise), using some large SQL queries, to make the Windows service fail again with "Path not found". I cannot make it do it again. The only other thing of note happening at the same time was that Windows Virtual Disk Manager had started itself up.
Perfmon has been told to collect data (we couldn't before because of the state of the machine). Hopefully it can catch the Windows service crashing again.
I noticed in Perfmon, while setting it up, that the MS Loopback adapter was installed. Not a problem in itself, but I did see traffic going through it, which I think is weird.
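Once Perfmon has captured a failure, the log could be scanned for counters that climbed over the run. This is a rough sketch that assumes the log was saved in Perfmon's CSV format (first column is the timestamp, one column per counter, blank samples for missed readings); the function names are made up for illustration.

```python
import csv


def summarize_counters(csv_path):
    """Return {counter_name: (first, last, max)} for each counter column."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    summary = {}
    for col in range(1, len(header)):  # column 0 is the timestamp
        values = []
        for row in data:
            try:
                values.append(float(row[col]))
            except (ValueError, IndexError):
                continue  # skip blank or missing samples
        if values:
            summary[header[col]] = (values[0], values[-1], max(values))
    return summary


def biggest_growth(summary, top=5):
    """Rank counters by (last - first), largest growth first."""
    ranked = sorted(summary.items(),
                    key=lambda item: item[1][1] - item[1][0],
                    reverse=True)
    return ranked[:top]
```

Running `biggest_growth(summarize_counters("sql01.csv"))` (the file name is hypothetical) lists the counters that grew most between the first and last sample, which is only a starting point for deciding what to look at around the time the service fails.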
Hardware, memory to be specific. We all know the numerous faults bad memory can cause. Dell diagnostics don't show any problem. I might run memtest, but since there's 8 GB to test it could take a while.
Windows routing is stuffed up. (But then why does it only happen sometimes, while other times it's fine? Is it possible it sometimes decides to route traffic not meant for the local machine through the loopback adapter, and if so, why is the remote control software not affected?)
The computer gods don't like that I've not sacrificed a machine lately and are going to take one by force.
Bill doesn't like me.
Any other suggestions/thoughts?
Thanks in advance,