developmentguru

How do I debug (and fix) an intermittent communications problem?

 I have a server running Windows Server 2003 R2.  This server is also running Exchange 2007 SP1.  The server runs fine for about 16 hours, then loses some of its communications abilities.

  It can still communicate with other servers on our network.  I can ping its default gateway and our VPN / firewall.  I can run a tracert to google.com and it works.

  Trying to use a web browser from this system fails, as does any attempt to send email out.

  When the system is in this state, simply rebooting it fixes the problem until it happens again.

  I am not a guru at getting under the hood in Windows and diagnosing this kind of thing by looking at log entries.  I believe that the issue is with Windows itself since other forms of communication are affected.

  What I would like is a set of tests I can run in order to determine what is causing the blockage, then what I need to do to avoid it happening in the future.
themightydude

So if you use a web browser or send emails, those fail, but you can still ping google.com and ping equipment on your network?

Is there anything at all in the event log, either in Application or System?

ASKER

I cannot ping google.com, it times out.  I can tracert it.

The application event log had the following error

Microsoft Exchange couldn't find a certificate that contains the domain name oa.polydeckscreen.com in the personal store on the local computer. Therefore, it is unable to support the STARTTLS SMTP verb for the connector Polydeck using polydecksceen.com with a FQDN parameter of oa.polydeckscreen.com. If the connector's FQDN is not specified, the computer's FQDN is used. Verify the connector configuration and the installed certificates to make sure that there is a certificate with a domain name for that FQDN. If this certificate exists, run Enable-ExchangeCertificate -Services SMTP to make sure that the Microsoft Exchange Transport service has access to the certificate key.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

As for the System log, the only warnings or errors are related to printers.
Hmm, so you can tracert google.com, but ping google.com times out. Does it resolve to an IP?

Network setup is:

Internet --> Firewall / VPN --> Switch --> Servers / computers?

What do you use for DNS servers?
Tracert does resolve to an IP address.

You are correct on the network setup.

We have two internal DNS servers, both Windows Server 2003 R2.

The error I posted has appeared again since I posted it, with no associated loss of communications.
How long has this been happening with not being able to use a web browser to get out?

Any recent changes / upgrades?

New DNS server entries etc etc?

Does ping resolve to an IP address?
--How long has this been happening with not being able to use a web browser to get out?--

We just found out about it in the last couple of days.  We have had instances where the server has acted this way over the last month or so, just not this frequently.

--Any recent changes / upgrades?--
  We did make a change to the server to activate a second NIC to tie it to our SAN.  We then moved some of our files from the server hard drives to the SAN.

--Does ping resolve to an IP address?--
Ping, from the server while it is in this state, times out.

We did a little digging and found out that our security software (Panda Security) had somehow been tied to the SAN's IP address.  I could see the constant activity being viewed as an attack and the security software shutting down communications.  We have since removed it from that IP address and the server has not shut down yet.  If it is still running, continuously, this time tomorrow then it is likely solved.

One thing you can still do to earn the points is to give me some tests to run (other than what I have mentioned).  Tests that would allow me to see if SMTP can get out, or any other protocols you can think of.  Tests that will show error results would be best.
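One way to script the kind of layered test asked for here is sketched below, assuming a Python 3 interpreter is available on the server (an assumption; nothing in the thread says one is installed). It separates name resolution from TCP connectivity so that a failure shows which layer broke. The function names and the port list (80 for HTTP, 25 for SMTP, 21 for FTP) are illustrative choices, not anything prescribed in the thread.

```python
import socket


def resolve(hostname):
    """Return (IPv4 addresses, status text) for a DNS lookup of hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [], "DNS lookup failed: {}".format(exc)
    return sorted({info[4][0] for info in infos}), "ok"


def check_tcp(host, port, timeout=5.0):
    """Attempt a TCP connection and return (succeeded, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all: packets are probably dropped or sent the wrong way.
        return False, "timed out"
    except ConnectionRefusedError:
        # The host answered, so the network path works; only the port is closed.
        return False, "refused"
    except OSError as exc:
        return False, "failed: {}".format(exc)


def run_checks(hostname, ports=(80, 25, 21)):
    """Resolve hostname, probe each port, and return a list of report lines."""
    addresses, status = resolve(hostname)
    report = ["DNS {}: {} {}".format(hostname, status, addresses)]
    if not addresses:
        return report
    for port in ports:
        ok, detail = check_tcp(addresses[0], port)
        report.append("TCP {}:{} {} ({})".format(
            addresses[0], port, "OK" if ok else "FAIL", detail))
    return report
```

Running `run_checks` against the same external hostname while the server is healthy and again while it is in the failed state gives two reports to compare. If every external port times out while internal hosts still connect, that would point at routing or filtering rather than at a single service.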
ASKER CERTIFIED SOLUTION
themightydude

This solution is only available to members.
One NIC goes to our network.  The other NIC goes to the switches (fabric) that only goes to our SAN.  The only one with the security software running now is the one that goes to our network (as it should).

Can you give me an example of an external site to try the SMTP telnet with?  This would only be used to verify that the communication is being passed.

Do you have a simple test you use to check FTP?  Sorry if I sound like a newb on all of this, for some things I am.
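A scripted stand-in for the telnet test is sketched below, again assuming Python 3 is available on the server. Both SMTP and FTP servers send a greeting line starting with `220` as soon as a client connects, so reading that first line from port 25 or port 21 shows whether the protocol conversation can start at all. The function names are hypothetical, and the host to test against is deliberately left to the reader: it should be a server you are permitted to probe.

```python
import socket


def read_greeting(host, port, timeout=10.0):
    """Connect to host:port and return the first line the server sends."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.settimeout(timeout)
        data = b""
        # Read until the first newline, the peer closes, or a sanity limit is hit.
        while b"\n" not in data and len(data) < 1024:
            chunk = conn.recv(256)
            if not chunk:
                break
            data += chunk
    first_line = data.split(b"\n", 1)[0]
    return first_line.decode("ascii", errors="replace").strip()


def describe(host, port, timeout=10.0):
    """Return a one-line result that includes the error text on failure."""
    try:
        banner = read_greeting(host, port, timeout)
    except OSError as exc:
        return "{}:{} FAILED: {}".format(host, port, exc)
    return "{}:{} greeted with: {}".format(host, port, banner)
```

Calling `describe` with port 25 checks SMTP and with port 21 checks FTP. A timeout here while the same call succeeds from another machine on the network suggests the problem is local to the server.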
SOLUTION
This solution is only available to members.
Thanks for some of the testing tools, here is the latest.  The last time the server got into this state I tried some of the tests.  Web requests would not go out on the server but worked well from any other system we tested.  I could ping internal addresses but not external (from the server in question).  Pinging external (or any of the other tests) worked fine from a Windows XP system on the network.  I was able to send myself an email from hotmail and receive it in house, but SMTP from the server would not function going out.  Tracert timed out.  FTP tests worked from other systems, not the server.  I could do the telnet SMTP test to our Exchange server in house and it worked.  Hopefully this info gives you a place to start...
I was also able to use Outlook Anywhere web access to get into emails.  It would allow me to send internally and queue anything I tried to send external.
Hmmmm...this is very strange.

It's going to be something specific to that server then, since I assume all your other workstations and whatnot use the same firewall and DNS servers as the server.

To sum up this problem: anything inside of your network is fine. You can talk to anything on your network from that server, but if you try to talk to a computer outside of your network from that server, you get nothing.

When it does this again, disable the security software on the network-facing NIC, just for a few minutes.

I assume this server has a static IP, correct?

You might also do a route print from the server before the problem happens, and then again when the problem is occurring.
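Comparing the two captures by eye is error-prone, so here is a minimal Python sketch for it. It assumes each `route print` output was saved to a text file (for example with `route print > before.txt`) and read into a string; the function name and file names are illustrative. It uses the standard `difflib` module and returns only the lines that were added or removed.

```python
import difflib


def diff_route_tables(before_text, after_text):
    """Return the lines that differ between two saved `route print` outputs.

    Lines removed since the first capture start with "-", new lines with "+".
    """
    before = [line.rstrip() for line in before_text.splitlines()]
    after = [line.rstrip() for line in after_text.splitlines()]
    diff = list(difflib.unified_diff(
        before, after, fromfile="before", tofile="after", lineterm="", n=0))
    # The first two entries are the "--- before" / "+++ after" headers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

An empty result means the routing table did not change between captures, which would shift suspicion away from routing. A changed gateway, interface, or metric on the `0.0.0.0` row is the thing to look for.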

Also, if none of the above helps when this happens, try disabling then re-enabling the NIC instead of rebooting the server. For the hell of it, you might try resetting the TCP/IP stack...

 netsh int ip reset c:\resetlog.txt

Is there anything in the firewall logs about blocking outbound traffic from that server IP?
Thanks for all of your advice, I will add this to my knowledgebase as you have given me some new tricks to try.

I had someone from Panda Security remote in and look around.  What we found is this:  I was right to suspect the other NIC but wrong as to why.  The second NIC (that runs directly to the SAN switches) had the default gateway set up.  For whatever mysterious MS reason this worked well for several weeks.  Just recently MS decided to try rerouting the network traffic through the SAN!  We removed the default gateway from Local Connection 2 and all went back to normal.

I will flag the posts you put on here that I found useful as the solution (it has worth to me in future similar situations).  I wrote this to be sure the fix was included for anyone trying to find it in the future.

Do not put a default gateway on a NIC unless you want traffic rerouted through it!  This is, I am sure, obvious to everyone who has been in networking any period of time.  It is not obvious to a programmer like myself.
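This condition can be checked for directly. When a machine has more than one default route, Windows chooses between them by metric (the lowest wins), so a second `0.0.0.0` entry on the SAN-facing NIC can take over outbound traffic if the metrics shift. The Python sketch below scans saved `route print` output for default routes. The sample table and its addresses are invented for illustration and are not taken from the server in this thread.

```python
# Hypothetical excerpt of `route print` output; all addresses are made up.
SAMPLE = """
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0      192.168.1.1    192.168.1.10      20
          0.0.0.0          0.0.0.0       10.10.0.1      10.10.0.20      10
        127.0.0.0        255.0.0.0        127.0.0.1       127.0.0.1       1
Default Gateway:         10.10.0.1
"""


def default_routes(route_print_text):
    """Return (gateway, interface, metric) for every 0.0.0.0/0.0.0.0 row."""
    routes = []
    for line in route_print_text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0] == "0.0.0.0" and fields[1] == "0.0.0.0":
            routes.append((fields[2], fields[3], fields[4]))
    return routes


def has_competing_gateways(route_print_text):
    """True when default routes exist on more than one interface."""
    interfaces = {interface for _, interface, _ in default_routes(route_print_text)}
    return len(interfaces) > 1
```

With the sample above, `has_competing_gateways(SAMPLE)` is true because default routes exist on two interfaces, which is the situation described in this post.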
Glad you got it figured out.

That actually is new information to me as well...I would have assumed different default gateways on 2 or more NICs would not have affected anything. Especially since one is on one network, and the other on a different network.