
Windows 2003 SQL Server loss of network connectivity

We have an interesting problem.
First the background.
3 machines (I'll call them APP01, SQL01, and WEB01), all running Windows 2003 SP1 with up-to-date AV, Microsoft patches, etc.
WEB01 is the only outward-facing machine. It's the front end for the web application. All traffic to APP01 or SQL01 goes through WEB01.
SQL Server 2000 runs on SQL01.
APP01 runs the application software.
WEB01 runs a .NET web application front end via IIS and Tomcat.

4:30 Wednesday 10th October SQL01 BSODs.
4:34 Wednesday 10th October APP01 BSODs.
The machines came back up fine and all seemed good in the world (barring the question of why they BSOD'd).
Thursday morning SQL01 stopped responding to WEB01 and APP01 when they tried to access the SQL databases.
A Windows service on SQL01 that checks a UNC path on APP01 every 3 minutes for new files fell over, saying it couldn't find the directory.
SQL01 degrades over time (coming up with insufficient resource messages) to the point where running Event Viewer produces a DLL fault and Windows tells you it can't run it.
Trying to map a network drive during the degradation was fine; UNC paths worked great.
Copying a 720k file from a local drive to a network drive on APP01 failed with "insufficient system resources".
MOM alerts also came up at about the same times for not being able to contact the MOM server.
The machine, however, replied to pings fine.

Rebooting SQL01 brought it back into being and behaving.
SQL01 then lasted between 10 minutes and 2 hours before doing the same thing.
If left alone, SQL01 would "fix" itself after 30-45 minutes (the network connectivity problem only; the Windows problems remained), only to do the same thing 10 minutes to 1 hour later.
At no time did the network card actually drop its connection. I was RDP'd into it via the -console switch and also used other remote control methods in case RDP was part of the problem, and at no point was network connectivity interrupted.
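
One way to pin down the moment when file sharing fails while the box still answers on the network is to log both checks side by side. Below is a minimal Python sketch of that idea, not the actual Windows service: the function names are made up for illustration, the 180-second interval simply mirrors the 3-minute check described above, and port 445 assumes SMB over TCP is in use.

```python
import os
import socket
import time
from datetime import datetime


def tcp_reachable(host, port=445, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (445 = SMB)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_share(path, host, reachable=tcp_reachable):
    """List the directory and test the TCP port; return one result record."""
    try:
        entries = len(os.listdir(path))
        error = None
    except OSError as exc:
        # "Path not found" and "insufficient system resources" both land here.
        entries = None
        error = str(exc)
    return {
        "time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "path_ok": error is None,
        "entries": entries,
        "error": error,
        "tcp_ok": reachable(host),
    }


def watch(path, host, interval=180, rounds=3):
    """Probe every `interval` seconds and print one record per probe."""
    for _ in range(rounds):
        print(probe_share(path, host))
        time.sleep(interval)
```

Calling watch() with the UNC path and the host name of the file server would give a timeline of records where path_ok is False while tcp_ok is still True, which is the pattern described above.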

Diagnosis/fixes done so far
Microsoft have analysed the dumps and applied numerous patches not sent via Windows Update, no difference.
Dell have replaced the disk controller for the local disks, no difference.
Removed any remote control software and used Dell's hardware (DRAC) remote control, no difference.
Stopped AV realtime scanning, no difference.
Firmware and BIOS updated, no difference.
All access TO SQL01 was being done via IP address, NOT machine name.
The switch that all 4 machines (SQL01, APP01, WEB01, replacement SQL01) connect to has been checked for errors, retransmits, interface resets, bad packets, you name it; nothing is showing up at all.

Status now.
The SQL databases have been moved to another Windows 2003 SP1 server so this server can be diagnosed. APP01 and WEB01 now point to the new server, all is fine in the world and things are functioning fine.
SFC /scannow done on SQL01, nothing wrong.
Today in my testing I managed (by luck or otherwise), using some large SQL queries, to make the Windows service fail again with "Path not found". I cannot make it do it again. The only other thing of note happening at the same time was that the Windows Virtual Disk Manager had started itself up.
Perfmon has been told to collect data (we couldn't before because of the state of the machine). Hopefully it can catch the Windows service crashing again.
Noticed in Perfmon while setting it up that the MS Loopback Adapter is installed. Not a problem in itself; however, I did see traffic go through it, which I think is weird.

Theories
Hardware - memory to be specific. We all know the numerous faults bad memory can cause. Dell diagnostics don't show any problem. I might run memtest, but since there's 8 gig to test it could take a while.
Windows routing is stuffed up. (But then why does it only happen sometimes, while other times it's fine? Is it possible it sometimes decides to route things not meant for the local machine through the loopback adapter, and if so, why is remote control software not affected?)
The computer gods don't like that I've not sacrificed a machine lately and are going to take one by force.
Bill doesn't like me.

Any other suggestions/thoughts?
Thanks in advance,
Terry
 
oldhammbc Commented:
I'd definitely run the memory test. A couple of months ago we had a similar problem with our Dell 6850s; unfortunately for us, the server had 32GB of RAM.
They gave me this util to run http://support.euro.dell.com/support/downloads/download.aspx?c=uk&cs=ukbsdt1&l=en&s=bsd&releaseid=R148590&SystemID=PWE_FOS_XEO_2650&servicetag=&os=WNT5&osl=en&deviceid=196&devlib=0&typecnt=0&vercnt=10&catid=13&impid=-1&formatcnt=2&libid=13&fileid=197528
The program will create a bootable USB stick for you to boot the server from, and then there is an option to do a quick test. In our case the quick test did actually pick up the problem.

Hope that at least gives you something to look into!

Cheers

Dave J
 
qz8dsw (Author) Commented:
Hi Dave,

Political foreplay (for want of a better description) has stopped me from moving forward on this.
For this service I'm "officially" an application specialist, not an OS/system specialist.
I'm going to step back until Friday my time and see what they find.
If they don't have anything definitive by then, that's when I'll grab the machine in question for the weekend and run the diagnostics. (I think the engineer sent out on site was looking at the Dell OpenManage software and not running the offline diagnostics.)
Monday (hopefully) is when they will re-install Windows.

I find it interesting, however, that you had the same sort of trouble with the same model Dell. (Yes, we run a 6850 too.)
And I find it even more interesting that you brought up the same model, since I'd not mentioned the model number of the server at all.

Terry
 
qz8dsw (Author) Commented:
Okies oldhammbc,
Here's the exact failure. (These diagnostics were run from 2 floppy disks, as the operators on site could not find a writable CD or USB drive we could use.)
So much for Dell doing the diagnostics.

IPMI
Test Results : Fail
Error code 2900:0221
Msg: <System Boot>:power supply::power supply sensor (Status) failure detected.

I'm investigating the error code further to figure out what it means, but let's face it: if the power supply voltages are fluctuating, that could cause all sorts of trouble.

Terry
 
qz8dsw (Author) Commented:
OK, dug around a bit more.
First, that error is nothing. The test is checking the machine's internal event log and picking up old power supply messages from when it was originally built. You have to use OpenManage to clear those.

Memory checked out fine doing a pattern test.
But here's the interesting bit: 3 warnings about IRQ sharing.
Video, network and the internal disk controller are sharing IRQs.
Now "normally" you can get away with a wee bit of IRQ sharing as long as the devices are not intensive.
But video, network and disk controller????
Those are VERY intensive and generate a lot of interrupt requests.

I "think" there is a very good possibility that, as network activity and disk IO went up with increased usage, the shared interrupt got overloaded and the network and disk IO on that machine was stuffed, BUT the network card itself was still functioning fine; just the interrupt was overloaded.

Any thoughts on my very large theory?
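
To make the sharing explicit, the device-to-IRQ assignments can be grouped by IRQ number. A small Python sketch of that grouping follows. The IRQ numbers in it are made up for illustration (the diagnostics only reported that video, network and the disk controller share), so substitute the real values from Device Manager or the Dell report.

```python
from collections import defaultdict


def shared_irqs(assignments):
    """Return {irq: [devices]} for every IRQ used by more than one device."""
    by_irq = defaultdict(list)
    for device, irq in assignments.items():
        by_irq[irq].append(device)
    return {irq: sorted(devs) for irq, devs in by_irq.items() if len(devs) > 1}


# Hypothetical assignments, NOT taken from the real server.
example = {
    "video": 16,
    "network": 16,
    "disk controller": 16,
    "usb": 19,
}

print(shared_irqs(example))
```

Any IRQ that comes back with more than one busy device (network and disk in particular) is a candidate for moving a card to another slot or changing the IRQ assignment in the BIOS, if the hardware allows it.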

Terry
 
qz8dsw (Author) Commented:
OK. I got it, I got it, I got it!!!!!!

I refer to http://support.microsoft.com/kb/838765
http://support.microsoft.com/kb/834628

Those definitely point to PAE being a real problem. (Supposedly one of them is fixed in SP1 of Win 2k3, which is applied to the server in question.)
PAE in W2k3 is turned on by default (a change from W2k, where you had to turn it on).
Our server has 8 gig of memory.
Using /burnmemory=4096 (even with /pae still there) has made Windows see only 4 gig of memory and, guess what, the machine is stable. Take out /burnmemory=4096 so Windows can see all 8 gig and it goes back to being very unstable, failing to attach to a UNC path within 5 minutes of bootup while running some grunty SQL queries.
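
For anyone repeating the test, the arithmetic and the boot.ini line look roughly like the Python sketch below. It assumes /burnmemory takes a value in MB and withholds that much memory from Windows. The ARC path and the description string are placeholders, not copied from our server, and the helper names are made up for illustration.

```python
def visible_memory_mb(installed_mb, burn_mb=0):
    """Memory left for Windows after /burnmemory withholds burn_mb."""
    if burn_mb < 0 or burn_mb > installed_mb:
        raise ValueError("burn_mb must be between 0 and installed_mb")
    return installed_mb - burn_mb


def boot_ini_entry(arc_path, description, switches):
    """Build one [operating systems] line for boot.ini."""
    line = '{}="{}" {}'.format(arc_path, description, " ".join(switches))
    return line.rstrip()


# Placeholder ARC path and description; check your own boot.ini first.
entry = boot_ini_entry(
    r"multi(0)disk(0)rdisk(0)partition(1)\WINDOWS",
    "Windows Server 2003 (4 GB test)",
    ["/fastdetect", "/pae", "/burnmemory=4096"],
)
print(entry)
print(visible_memory_mb(8192, 4096))
```

Keeping the original entry in boot.ini and adding the test line as a second boot option makes it easy to switch between the 4 gig and 8 gig configurations from the boot menu.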

Memory itself has been pattern tested and all 8 gig passes multiple read/write passes.
This seems to go back to Microsoft, pure and outright.

Terry
 
Vee_Mod Commented:
Closed, 500 points refunded.
Vee_Mod
Community Support Moderator