Server intermittently does not respond to ping requests?

We have a few Servers in place 3 x DL380s and a couple of IBM servers.

SQL 2005 with Windows 2003 R2 is on one of the DL380s.

One NIC is configured on the 192.168.1.x/24 range and the other on the 192.168.5.x/24 range to separate different functions. The bulk of the work is done on the 1.x range, with the POS terminals connected to SQL via the 5.x range. The servers are all plugged into a ProCurve switch with Cat 6 cables.

We're having dropouts on the POS terminals and have to reconnect them at times.

I've run ping plotter on a non-POS machine against the 5.x range, and every now and then it times out for between 2 and 20 seconds. There are only about 25 terminals connected to the server on that NIC. We have HP ProCurve switches in place, a combination of 2848s and 2824s, with LC fibre modules connecting them all directly.

The server itself is showing about 1-2% utilisation on its NICs (not working very hard).

Even if I plug a laptop into the switch that all the servers are plugged into, I get timeouts with the same consistency.

I'm starting to feel the pressure on this as it's been happening for about 2 weeks now.

Is there anyone out there with any ideas at all as to how I can troubleshoot this better or have I missed something?

The SQL server doesn't appear to have any issues for desktop users who operate applications on the 1.x range. However, the odd test here and there has suggested it drops pings on occasion, though far less often than on the 5.x range.

When running ping plotter between the servers themselves there are no dropouts of significance, only the odd 3ms ping time.

Any ideas?
Asked: leonardrogan
 
kyleb84Commented:
Your packet loss should be below the 0.01% mark. That's one in 10,000 packets. It sounds like you're getting about 1 in 20 (5%).
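
The arithmetic behind those two figures can be checked with a short sketch. The function name `loss_percent` and the sample counts are illustrative only; they are not output from ping plotter or from the switches.

```python
def loss_percent(sent: int, lost: int) -> float:
    """Return packet loss as a percentage of the packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * lost / sent

# Target for a healthy LAN: about 1 lost packet in 10,000.
target = loss_percent(10_000, 1)

# Roughly what is being described in this thread: about 1 lost packet in 20.
observed = loss_percent(20, 1)

print(f"target   : {target:.2f}%")
print(f"observed : {observed:.2f}%")
```

Feeding in the sent and lost counts from a long ping run gives a number that can be compared directly against the 0.01% mark.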

Telnet into each switch and type "sh int". You'll get a list of all the ports and their error counts. Compare the error count to the total RX column; if the error count is a noticeable fraction of the total, replace the cable.

You can be more specific about suspect ports by typing "sh int ##", where ## is the port number (e.g. "sh int 23"). This gives you the full statistics for that port, which is a bit more info.

Let us know if you find any ports with a significant error ratio, and what their purpose is.

0
 
leonardroganAuthor Commented:
I've got one port with 25 errors but it's not the one the server is plugged into that is having the issue.

Thank you for your timely response.
0
 
leonardroganAuthor Commented:
I'm also going to update the network card drivers to see if that helps?
0
 
Fred MarshallPrincipalCommented:
It sounds like the switch is a common element, is that right?

Have you rebooted the switch?  Do that.

Consider replacing the switch.
0
 
ChiefITCommented:
I have lots of ideas:

Intermittent comms is my favorite to figure out.

1) Multihomed computers can cause this
2) A switch that is not configured with portfast can cause this
3) SP1 can cause this
4) If you are pinging by DNS name, a DNS misconfiguration can cause this
5) Switches and a router that don't have the same mode of operation configured can cause this

Where do you want to start???
0
 
leonardroganAuthor Commented:
I will restart the switch tomorrow morning and update the NIC drivers.
The server is not multihomed; it just has two NICs on separate subnets.
We're pinging by IP to remove DNS from the equation.
The network is local, so the router/firewall shouldn't be an issue?

What is portfast?

The switch is relatively new (5 months) and the servers have no issue when pinging each other, just clients that ping the SQL server even when on the same switch (and changing ports).
0
 
ChiefITCommented:
First off, we should determine if this is a managed switch. What make and model do you have?

Spanning tree and portfast

As I understand it, Spanning tree spans the switch and makes a solid connection to the node on that port. This takes 50 seconds to determine the path on that switch. Of course this times out XP and newer boxes, to include 2003 server.

Portfast strips the scanning of the tree and goes right to forwarding the packets.

There is a good article for determining if portfast is your problem. LOL, it is called "Is Portfast My problem?" Here you go:
http://tcpmag.com/qanda/article.asp?EditorialsID=277

This issue will NOT happen on smart switches, (meaning not managed switches).
______________________________________________________________________

Now, the server has two NICs on it. If both are enabled and not configured correctly, you can have this problem. I ONLY use two NICs for routing through the server or if I have over 250 nodes on my network. Even if the dual NICs are on different subnets, this could cause your issue.

________________________________________________________________________
Service Pack 1 has a bug in it that can also cause this issue. It has mistyped code where the MTU channels are improperly entered. You could experience intermittent comms with SP1 on that machine. So, if you are using SP1, consider updating to SP2. I can give you examples and show you more details if you wish.
____________________________________________________________________________
I have also seen this issue when multiple Cat 5 cables tie the switches together. So, let's say you have two 24-port switches and you run two Cat 5 or Cat 6 cables between them to make sure you have two connections. They will interfere with one another.
____________________________________________________________________
Last but not least, I have seen this issue where the mode of operation was causing the error. If these are managed switches, you may be on the wrong mode of operation. Let's say you have two switches and one is configured to auto-negotiate while the other is configured to 1000Mb/Full Duplex. You might think they would communicate, but they don't always do so, and you may see intermittent comms.

I hope these help. Let me know if you need a better explanation or more details. This is kind of a quick and dirty rundown.
0
 
leonardroganAuthor Commented:
Thank you very much for the info.  I'll read up on the portfast stuff.

Switches are HP Procurve 2824s x 2 and Procurve 2848 x 2.
They are default config, nothing has been touched on them.  So everything auto.

Fibre connects them all to the one main 2824 which has the servers connected to it.

Even laptop connected to the main switch with the servers connected to it has timeouts.

No collisions on the switch there to speak of.

Server is on SP2 (Windows 2003 R2).
0
 
ChiefITCommented:
Here is a site you can use to contact an HP tech rep or follow along on the comments for your switches. They are managed switches that come standard as point-to-point configured. They also use Spanning tree instead of portfast.

http://www.tek-tips.com/viewthread.cfm?qid=1221775&page=1

NOTE:
Spanning tree is good for Network hub to network hub device, like router to switch or switch to switch. That article I provided will tell you that. But, Spanning tree will not work for Windows XP or later boxes well.
0
 
leonardroganAuthor Commented:
Thank you.
0
 
leonardroganAuthor Commented:
Is there some software out there that I can demo that will test for any network loops and also allow me to flood the network to see if there are any holes in it?

I updated the server NIC drivers this morning and also reset the switches to no avail.
0
 
leonardroganAuthor Commented:
What are c-class LOMs and mezzanine network adapters? The new firmware for the NC373i's supports them and I have no idea what they are.
0
 
leonardroganAuthor Commented:
OK.  Had somewhat of a breakthrough I think.

Basically, I ran ping plotter against a network device (a wireless access point) on the 5.x range.

It hasn't dropped a single packet in over an hour. If I try a PC on that same range, it drops out five or six times, for 5-10 seconds each time.

Now, none of the PCs being tested are part of the domain.  Perhaps if I disable NetBios over TCP/IP that may help any broadcasting.  Thoughts?
0
 
kyleb84Commented:
Check your duplex settings for each device.

NBT has nothing to do with your dropouts.
0
 
leonardroganAuthor Commented:
What do you mean? Force the NIC to 100 MBPS Full duplex instead of Auto?
0
 
kyleb84Commented:
Yes.

A duplex mismatch can cause massive CRC errors and intermittent drop-outs as you've described. It's worth looking at.

0
 
leonardroganAuthor Commented:
OK.  Will do and will advise.  Thank you.
0
 
leonardroganAuthor Commented:
Still dropping out every 20 mins almost exactly. Server doesn't drop other connections at the same time.

0
 
ChiefITCommented:
Kyle:

These are GB switches. By changing the mode of operation to 100MB/Full, we are creating the mismatch.

I agree with you, the computers and all switches should be set to the same. Also, the difference in the mode of operation should produce an Amber LED on the switch itself, when in the problemed state.
0
 
kyleb84Commented:
"These are GB switches"

Ah, my bad.
0
 
leonardroganAuthor Commented:
So are you saying I should force the switches to 100 Base for all the ports as that is what the machines are primarily?

I agree with you, the computers and all switches should be set to the same. Also, the difference in the mode of operation should produce an Amber LED on the switch itself, when in the problemed state.

How come I have no trouble connecting to a device for over three hours but have trouble with Windows-based PCs?
0
 
ChiefITCommented:
If you set everything to Auto negotiate, including this problem child computer, it should work fine. I have been working on something similar where the recommended fix is to disable the nic and install a new Gb NIC card. These issues are hard to track down.
________________________________________________________________________
How come I have no trouble connecting to a device for over three hours but have trouble with Windows-based PCs?
This is usually a sign of a portfast issue, explained above. If portfast is not enabled, XP and newer clients may time out when trying to negotiate the spanned tree.

So, this is the way it should look:

Router <--> switches >> (spanning tree)
Switches <--> computers >> (Portfast)

Duplex settings:
Router and switches >> have to be the same (auto or 1000Mb/Full Duplex)
Computer NICs and switches >> have to be the same (auto or 1000Mb/Full Duplex)

And sometimes, certain NICs just are not capable of talking with routers and switches. I don't know why.
0
 
kyleb84Commented:
My concern was based on Cisco's issues with duplex mismatching. I have yet to experience this issue with HP ProCurve - but I thought I'd mention it anyway.

And the HP ProCurves don't do M/R/STP by default; it has to be configured. I'm also sceptical of portfast's presence in ProCurve, since it's a Cisco feature and not a standard.

I think the issue lies elsewhere.
0
 
leonardroganAuthor Commented:
I'm having trouble understanding why a machine, in fact all machines, seem to be dropping out every 20 minutes for 1 second now. Is Windows trying to do something, look for something, or broadcast for a WINS server, which may make it hiccup and drop off?

I can configure spanning tree in the switches, but where is the PortFast option in the HP ProCurves?
Is that the same as Flow Control?
0
 
leonardroganAuthor Commented:
Sorry, the timeouts are every 10 minutes, for 1 second each.

The server has no issue with timeouts pinging wireless access point.  Only windows machines have the issue.
0
 
kyleb84Commented:
I doubt portfast exists in HP ProCurve. Flow control is to do with Ethernet packet flow, not Spanning Tree.

Might I suggest disabling Spanning tree on everything?

------------

So everything at the same time is dropping packets every 20mins?

Or do different devices have the lapse of connectivity at different times?
0
 
leonardroganAuthor Commented:
Different terminals have lapses at different times but specifically every 10 minutes.
Server maintains connections with other terminals fine during this time.

Also, as I keep suggesting, non windows devices are fine, no dropouts.  Any clues there?

Span Tree is currently not enabled.
0
 
kyleb84Commented:
Just running this by ChiefIT:

ARP table expiry? 10 minutes, arp entry expires, a ping is dropped while ARP is resolving again....???
0
 
leonardroganAuthor Commented:
I'll take anything at this stage but that sounds good.  ARP table on Clients?
0
 
ChiefITCommented:
You Gents are hitting it. I just think a little clarification will go a long ways:

Spanning tree is the opposite of portfast. Either you have the routing packets or you don't. Spanning tree takes up to fifty seconds to figure out the defined route. That is saved on your server.
Where, I don't know (maybe in the ARP cache). Portfast just forwards the packets on. So, keep spanning tree on for the switches, and for the router when one is available. Then disable spanning tree on all other nodes on the network.
0
 
kyleb84Commented:
leonardrogan:
"Span Tree is currently not enabled."

Enabling spanning tree won't solve the problem, and it has nothing to do with ARP...
0
 
leonardroganAuthor Commented:
Just to add a little more to the equation.

I have a windows 2000 terminal server within the domain. (Broadcom Nextreme GigaBit Adapter)

The terminals can ping the terminal server on the 5x range using the same method, ping plotter, and the server pings the same terminals using ping plotter with no time outs.

So details are:

A) I have a Windows 2003 R2 SQL server that on occasion has timeouts with client terminals but does not suffer the same issue pinging, say, a wireless access point.

B) I have a Windows 2000 IBM Netfinity Terminal Server that does NOT have the associated issues for all of the above.

Where to from here?
0
 
leonardroganAuthor Commented:
Ok, I'll add some pingplotter images to express what my difficulty is.

Win2k03r2Server depicts a ping to 192.168.5.207. If you look at the timeouts, each one starts ten minutes after the previous timeout finished. So if a timeout ran from 5.42 to 5.43, the next one would occur at 5.53, not 5.52. You can see that in the picture.

Second image, Win2000Server.  Pings the same IP address which is on the same hardware without any issue whatsoever.  

Third image: Win2k03r2ServerV2 depicts the same line, but losing connection to the POS terminal from 7.12 PM to 7.14 PM (two whole minutes). Guaranteed, the next timeout will occur at 7.24 PM, not 7.22 PM.

Fourth image: attempting to confirm my suspicions.
Proved true: ten minutes after the last timeout completes at 7.14, we have a (smaller) timeout at 7.24 PM.
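
The pattern described here (the ten-minute clock restarts when a timeout ends, not when it begins) can be checked mechanically from a list of timeout windows. This is only a sketch: the helper names are made up for illustration, the 19:12 to 19:14 window and the 19:24 start come from the images above, and the 19:25 end time is an assumed placeholder.

```python
from datetime import datetime

def parse(hhmm: str) -> datetime:
    """Parse a 24-hour HH:MM clock time (the date part is irrelevant here)."""
    return datetime.strptime(hhmm, "%H:%M")

def end_to_start_minutes(outages):
    """Minutes from the end of each outage to the start of the next one."""
    return [(nxt[0] - prev[1]).total_seconds() / 60
            for prev, nxt in zip(outages, outages[1:])]

def start_to_start_minutes(outages):
    """Minutes from the start of each outage to the start of the next one."""
    return [(nxt[0] - prev[0]).total_seconds() / 60
            for prev, nxt in zip(outages, outages[1:])]

# (start, end) of each timeout window.
outages = [
    (parse("19:12"), parse("19:14")),
    (parse("19:24"), parse("19:25")),  # end time is an assumption
]

print("end -> next start  :", end_to_start_minutes(outages))    # 10 minutes
print("start -> next start:", start_to_start_minutes(outages))  # 12 minutes
```

A constant end-to-start gap suggests a timer that is reset when connectivity recovers, which would be consistent with some cached entry being re-learned and then ageing out again (the ARP idea raised earlier in the thread is one candidate). It does not prove that on its own.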

The Win2k03 Server is a SQL Server, it connects to another server realtime (unix) for membership points updates, could it be hanging on the reply from the unix server thus timing out the client for that response?
0
 
leonardroganAuthor Commented:
Ok, I've Had to use Firefox to upload the images as IE8 beta2 no good with it.

Just on the last question.  If the 2k03 SQL server times out from the client, why can the 2000 Server ping it OK?


Win2k03R2Server.jpg
Win2k03R2ServerV2.jpg
Win2k03R2ServerV3.jpg
Win2000Server.jpg
0
 
leonardroganAuthor Commented:
I think we're getting there.

This latest image depicts the 2k03 Server talking to (Cyberguard SG300) 192.168.1.55 and having 5% packet loss.  

The same image down the bottom depicts the Windows 2000 Server NOT having the same issue when pinging the 192.168.1.55 address.

On top of that you can see good results for the 2k03 Server for all other queries to other devices.

So is the SQL (2k03) Server trying to attack the Cyberguard and it's dropping the packets? And if so, why is it not dropping them from the Windows 2000 Server?


9.50AM-2k03ServerTop-MultiplePin.jpg
0
 
leonardroganAuthor Commented:
It's calmed down now so I'll report if anything changes.

10.30AM-2k03ServerTop-MultiplePi.jpg
0
 
ChiefITCommented:
If this continues to give you grief, you might try an MTU ping.

http://help.expedient.com/broadband/mtu_ping_test.shtml
and adjusting the MTU settings:
http://help.expedient.com/broadband/mtu.shtml

The MTU has to be the same on all nodes of the network, from what I understand. I think the router sets the tone.
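
As a rough sketch of what the MTU ping test does: send pings with the Don't Fragment flag set and search for the largest payload that still gets a reply. The helper names below are hypothetical. `windows_df_ping` assumes the Windows `ping` options `-f` (don't fragment), `-l` (payload size) and `-n` (count), and treats a reply containing "TTL=" as success. The search also assumes that if a given size gets through, every smaller size does too.

```python
import subprocess
from typing import Callable

IP_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header

def windows_df_ping(host: str, payload: int) -> bool:
    """Send one ping with Don't Fragment set; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-f", "-l", str(payload), "-n", "1", host],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and "TTL=" in result.stdout

def largest_payload(probe: Callable[[int], bool],
                    low: int = 0, high: int = 1472) -> int:
    """Binary-search the largest payload size for which probe() succeeds.

    Returns -1 if even the smallest size fails.
    """
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always shrinks
        if probe(mid):
            low = mid                 # mid got through, try larger
        else:
            high = mid - 1            # mid was too big, try smaller
    return low
```

To try it against a device, something like `largest_payload(lambda n: windows_df_ping("192.168.1.55", n)) + IP_ICMP_OVERHEAD` gives the apparent path MTU; 1472 + 28 = 1500 is the usual Ethernet value. Because these drops are intermittent, a single failed probe could skew the search, so it is worth repeating the run.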
0
 
leonardroganAuthor Commented:
I will try that thank you if it continues.  For now it's still looking good.  I adjusted the Sensitivity level on the Main Procurve switch to HIGH.  Which in reading it fixes up crap packets on the fly.
0
 
ChiefITCommented:
Fragmented data packets can come from the MTU channels being mis-set.
0
 
leonardroganAuthor Commented:
But if the terminals in question are hard coded IP wise without a gateway, where do they get their MTU from as the problem we're experiencing is Local to one site?
0
 
ChiefITCommented:
Maybe they don't have to be the same size. I am still looking into this to fill out my own knowledge:
Wiki explains MTU and how it may fragment your ICMP requests to adjust to a common MTU channel.
http://en.wikipedia.org/wiki/Maximum_transmission_unit

It's much like the days of the modem, where it negotiates the maximum packets that can be sent between two nodes and then negotiates the speed for the best transfer of packets.
0
 
leonardroganAuthor Commented:
This is starting to make some sense now.  Thank you.  Will advise.

I have to find out the MTU of the Cyberguard (192.168.1.55).  We don't have access to that device as it protects gaming machines.

0
 
leonardroganAuthor Commented:
Hi. Still no real updates, just to say that some machines keep dropping out. But when a session with the server drops out, the machine doesn't drop off the network completely, because you can still ping those machines from other machines.

Would it have anything to do with the broadcasts floating around the network?
This is the port status on one of the fibre ports going down to the main switch.

clear) :
  Bytes Rx        : 682,820,565      Bytes Tx        : 2,856,533,296
  Unicast Rx      : 61,566,539       Unicast Tx      : 74,183,710
  Bcast/Mcast Rx  : 4,206,874        Bcast/Mcast Tx  : 877,745

Errors (Since boot or last clear) :
  FCS Rx          : 0                Drops Rx        : 0
  Alignment Rx    : 0                Collisions Tx   : 0
  Runts Rx        : 0                Late Colln Tx   : 0
  Giants Rx       : 0                Excessive Colln : 0
  Total Rx Errors : 0                Deferred Tx     : 0
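
To put the broadcast question in numbers, the counters above can be turned into shares of total frames. This is a sketch under one assumption: that total frames on the port are simply unicast plus broadcast/multicast, as the status screen presents them. The variable and function names are illustrative.

```python
def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole (0.0 when whole is 0)."""
    return 100.0 * part / whole if whole else 0.0

# Frame counters copied from the port status above.
unicast_rx, bcast_mcast_rx = 61_566_539, 4_206_874
unicast_tx, bcast_mcast_tx = 74_183_710, 877_745
total_rx_errors = 0

# Assumption: total frames = unicast + broadcast/multicast.
frames_rx = unicast_rx + bcast_mcast_rx
frames_tx = unicast_tx + bcast_mcast_tx

rx_bcast_share = percent(bcast_mcast_rx, frames_rx)
tx_bcast_share = percent(bcast_mcast_tx, frames_tx)
rx_error_share = percent(total_rx_errors, frames_rx)

print(f"Bcast/Mcast share of Rx frames: {rx_bcast_share:.2f}%")
print(f"Bcast/Mcast share of Tx frames: {tx_bcast_share:.2f}%")
print(f"Rx error share                : {rx_error_share:.2f}%")
```

On these figures broadcast/multicast is a little over 6% of received frames and a little over 1% of transmitted frames, with no receive errors. These are averages since boot or the last clear, so they would not show a short broadcast burst that happens once every ten minutes.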

0
 
ChiefITCommented:
This is what I am currently thinking:

Part of your problem was portfast: That was the problem with the pings.

Since it didn't affect the 2000 server but affected XP and above computers (including 2003 servers), disabling spanning tree seemed to clear up many of the issues. As mentioned above, disabling spanning tree is the same as enabling portfast.

Before we go off on another tangent, I would like to know where we are at with the pinging and ping plots. Is there any RED in them at all any more? If not, I think we should move on to chasing other issues, like maybe DNS.
0
 
leonardroganAuthor Commented:
There are still red markers on some machines.  I have 5 perfect machines and many not so perfect machines.  When the machines are in use though, they tend to work better.  Timeouts typically occur when the machine is idle.  I've checked power management etc and compared working machine settings with those that are not working and they are the same.  I've got no real idea what is happening.
0
 
kyleb84Commented:
"disabling spanning tree is the same as enabling portfast."

Portfast is a Cisco extension of spanning tree, when a link is first plugged in, STP blocks the port while it re-evaluates the network. Portfast is a method of assuming that the link will not cause a loop.

Since the switches are HP, Portfast has _nothing_ to do with this issue.


0
 
ChiefITCommented:
Let me watch you at work for a bit Kyle:

0
 
leonardroganAuthor Commented:
Hi Guys,

Thanks for all of your efforts on this one. I've resigned myself to just putting the POS terminals on the 1.x range for the time being, as there are no connection timeouts from the application on that range for some reason.

Ta
Leonard
0
