Solved

Cluster Servers continue to lose connection to Storage on machine reboot.

Posted on 2008-06-25
Last Modified: 2013-11-14
I have a two-node cluster set up and connected to a Promise VTrak M300i storage unit by iSCSI. I really have two problems, but I am more concerned with the first one right now.

Several times during the day, the servers lose connection to each other, then drop their connection to the storage unit and cause a "delayed write failed" error. This is starting to happen with considerably more frequency. I have to shut down both servers, turn the storage unit off, turn it back on, and bring the servers up one at a time for them to reconnect to the storage unit. This is becoming a hassle. I don't know why they're dropping unless it has something to do with the cluster nodes continually losing connection to each other. I thought that if one server went down the other should take up the slack, but on ours, when one goes down the other pretty much goes down as well. The only time I can tell it actually works is when I initiate failover or move groups from one node to the other.

The second problem is that if one machine goes down on its own, it does not reconnect to the storage unit automatically when it comes back up. Do I always have to take everything down and then bring up the storage unit before the servers so that they will connect? That doesn't seem very user friendly to me.
Question by:pcgeek1981
20 Comments
 
LVL 16

Accepted Solution

by:
Gerald Connolly earned 250 total points
ID: 21872296
Firstly, check your network for problems: look at the error logs and counters on your switches.
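
For anyone wanting to make that counter check repeatable, here is a minimal sketch. It assumes you have already copied the cumulative per-port counters off the switch twice, a few minutes apart, by whatever means the switch offers. The port names, counter names, and numbers below are hypothetical, not taken from this thread.

```python
from typing import Dict, List

# Hypothetical snapshot shape: port name -> {counter name -> cumulative value}.
Snapshot = Dict[str, Dict[str, int]]


def ports_with_new_errors(before: Snapshot, after: Snapshot, threshold: int = 0) -> List[str]:
    """Return ports whose error counters grew by more than `threshold` between snapshots."""
    flagged = []
    for port, counters in after.items():
        earlier = before.get(port, {})
        # Sum the growth of every counter; ignore counters that were reset (negative delta).
        growth = sum(max(value - earlier.get(name, 0), 0) for name, value in counters.items())
        if growth > threshold:
            flagged.append(port)
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    before = {"A1": {"drops": 10, "collisions": 4}, "A2": {"drops": 0, "collisions": 0}}
    after = {"A1": {"drops": 250, "collisions": 90}, "A2": {"drops": 0, "collisions": 0}}
    print(ports_with_new_errors(before, after))  # prints ['A1']
```

Comparing deltas rather than raw totals matters because the counters are cumulative: a port that had errors last month but is clean now should not be flagged.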
0
 
LVL 55

Expert Comment

by:andyalder
ID: 21873039
You have to make the cluster service depend on the iSCSI initiator or else cluster starts before storage is available. http://support.microsoft.com/kb/883397/en-us
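
A rough sketch of how that dependency could be checked, assuming the iSCSI initiator service is named MSiSCSI and that `sc qc clussvc` prints its DEPENDENCIES field one name per line. The sample output below is a hypothetical, trimmed example rather than output from the asker's servers, and the helper names are made up for illustration. Follow the KB article for the actual change.

```python
from typing import List

# Hypothetical, trimmed `sc qc clussvc` output used only to exercise the parser.
SAMPLE_SC_QC = """\
SERVICE_NAME: clussvc
        DISPLAY_NAME       : Cluster Service
        DEPENDENCIES       : ClusNet
                           : RpcSs
                           : W32Time
                           : NetMan
        SERVICE_START_NAME : example-account
"""


def parse_dependencies(sc_qc_output: str) -> List[str]:
    """Collect the service names listed under the DEPENDENCIES field."""
    deps: List[str] = []
    in_block = False
    for line in sc_qc_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("DEPENDENCIES"):
            in_block = True
            value = stripped.split(":", 1)[1].strip() if ":" in stripped else ""
        elif in_block and stripped.startswith(":"):
            # Continuation lines carry one more dependency each.
            value = stripped[1:].strip()
        else:
            in_block = False
            continue
        if value:
            deps.append(value)
    return deps


def depend_argument(current: List[str], required: str = "MSiSCSI") -> str:
    """Build the `depend=` argument for `sc config`, keeping existing entries."""
    names = list(current)
    if required.lower() not in (name.lower() for name in names):
        names.append(required)
    return "depend= " + "/".join(names)


if __name__ == "__main__":
    deps = parse_dependencies(SAMPLE_SC_QC)
    print("MSiSCSI" in deps)      # prints False for the sample above
    print(depend_argument(deps))  # prints depend= ClusNet/RpcSs/W32Time/NetMan/MSiSCSI
```

The existing entries are kept because `sc config ... depend=` replaces the whole dependency list rather than appending to it.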
0
 

Author Comment

by:pcgeek1981
ID: 21877157
Okay, I checked all of my ProCurves and found a large number of dropped packets and collisions. I went to the two main switches that the rest of the plant feeds off and found a poorly made crossover cable that I think might have been causing some of my problems. It's been almost forty-five minutes since I put a new cable in, and there are no errors on any switch yet.

I have gone and set the iSCSI initiator settings like the support article said. Will not know for sure until I can reboot the machines and see if it works, but I cannot do that until after work hours this evening. But since I changed the cable I have not had any complaints about being booted out of any programs or loss of email. Fingers are crossed for the moment.
0
 

Author Comment

by:pcgeek1981
ID: 21877608
No, spoke too soon. Started getting high collision/drop rate errors again. I don't know if maybe there is too much data trying to go across the line or if the switch is going bad, because I am only getting the errors at one end of the cable, in one switch and not in the other.
0
 
LVL 55

Expert Comment

by:andyalder
ID: 21879318
Why do you have the switch that is running iSCSI connected to the LAN switches? It is of course valid to run iSCSI on your main LAN, but it's much more SAN-like to keep them separate. With FC SANs you normally have two redundant switches and you don't connect them together, in case of fabric failure (a mad admin is unlikely to screw up both configs).
0
 

Author Comment

by:pcgeek1981
ID: 21879828
The iSCSI cards in the servers are connected directly to the Promise VTrak storage unit by Cat5e cables. The servers are connected by their second NICs to an 8-port 1000 Mbps Belkin switch, which is connected to one of the ProCurve switches by a 100/1000 module. The servers also have a third NIC each, and those are connected directly to each other by a crossover cable for the heartbeat connection.
0
 
LVL 16

Expert Comment

by:Gerald Connolly
ID: 21881347
Have you replaced the cable between the servers yet?
0
 

Author Comment

by:pcgeek1981
ID: 21898770
Yes, replaced it this morning. Am looking to see if that works. Sorry it took so long to get back on here; someone had made some changes to the registry of the cluster servers and it took me most of the weekend to figure it out without re-installing the OS.
0
 

Author Comment

by:pcgeek1981
ID: 21909684
After replacing the cable it is still going down. These are some of the errors I keep getting, roughly every 30 minutes.

Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername1" on network "public"

Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername2" on network "public"

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername2" on network "public" is unreachable by at least one other cluster node attached to the network, the server cluster was not able to determine the locations of the failure.

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername1" on network "public" is unreachable by at least one other cluster node attached to the network, the server cluster was not able to determine the locations of the failure.
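
To see at a glance which cluster network the ClusSvc events name, the pasted entries can be tallied. This is a minimal sketch that assumes the events are saved as plain text in the layout quoted above, with blank lines between entries; the sample text is shortened from the messages in this post, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Sample shortened from the events quoted in this post.
SAMPLE_EVENTS = """\
Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername1" on network "public"

Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername2" on network "public"

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername2" on network "public" is unreachable

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername1" on network "public" is unreachable
"""


def tally_by_network(log_text: str) -> Counter:
    """Count events per (network name, event ID) pair."""
    counts: Counter = Counter()
    # Entries are separated by blank lines, so handle one entry at a time.
    for block in log_text.strip().split("\n\n"):
        event = re.search(r"Event ID:\s*(\d+)", block)
        network = re.search(r'on network "([^"]+)"', block)
        if event and network:
            counts[(network.group(1), int(event.group(1)))] += 1
    return counts


if __name__ == "__main__":
    for (network, event_id), count in sorted(tally_by_network(SAMPLE_EVENTS).items()):
        print(network, event_id, count)
```

On the sample, every entry lands under "public" and none under any other network name.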

0
 
LVL 55

Assisted Solution

by:andyalder
andyalder earned 250 total points
ID: 21910378
Check it is configured as per http://support.microsoft.com/kb/258750 .

Either you have the heartbeat priority set to public instead of private, or private is flaky.

You'll probably find that if you turn your cluster off completely there is periodic congestion on the public LAN from some other source and it's a red herring as far as MSCS goes.
0

 
LVL 16

Expert Comment

by:Gerald Connolly
ID: 21910485
Looks like you have the cluster traffic going over a network called public, which I guess is not the private wire between the two cluster nodes.
0
 

Author Comment

by:pcgeek1981
ID: 21912165
Private must be flaky, because it is set up just like the help article suggests. The error messages always say it is losing connection on the public interface, though; I never get anything about the private network at all. It doesn't make any sense to me.
0
 
LVL 55

Expert Comment

by:andyalder
ID: 21914085
Force failures on private by pulling the crossover lead out to test. If nothing is logged, that confirms private isn't connected at all, since the cluster service doesn't log anything for a "still not working" condition.
0
 
LVL 16

Expert Comment

by:Gerald Connolly
ID: 21915738
The key part of the message is that it comes from Cluster Services!

It's not just about wiring it up as per the book; you have to make sure that the cluster is set up to use the direct connection rather than the public interface. From memory, as it's been a while since I had to do this: right-click on the network interfaces in the Cluster GUI and take the appropriate action, i.e. select the private wire as the preferred interface and the public one as not preferred.

G
0
 
LVL 55

Expert Comment

by:andyalder
ID: 21917682
That's what the doc is about connollyg, it goes through the bindings setup under networking and the priority under cluster setup.
0
 
LVL 16

Expert Comment

by:Gerald Connolly
ID: 21920323
Andy,

That's obvious!

But why is Cluster Services running over the public network? That seems to be the real problem!

A secondary issue is why the public network is so unreliable!

G
0
 

Author Comment

by:pcgeek1981
ID: 21920480
It shouldn't be running over the public network, I went in and checked the network priority tab under properties of the cluster and it shows that only my private network is being used. It doesn't even list the public network. And the properties of my private network show it to be set to only use internal communication.
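
One possible reading of that Network Priority tab, sketched under assumptions: as far as I know, each cluster network has a Role value (0 = not used by the cluster, 1 = internal communication only, 2 = client access only, 3 = all communications), and only networks that allow internal communication are listed for priority ordering. The role assigned to "public" below is a guess for illustration, not something confirmed in this thread, and the function name is hypothetical.

```python
from enum import IntEnum
from typing import Dict, List


class NetworkRole(IntEnum):
    """Cluster network Role values as I understand them."""
    NONE = 0
    INTERNAL_ONLY = 1
    CLIENT_ONLY = 2
    MIXED = 3


def internal_candidates(roles: Dict[str, NetworkRole]) -> List[str]:
    """Networks that can carry internal cluster traffic, in the order given."""
    allowed = (NetworkRole.INTERNAL_ONLY, NetworkRole.MIXED)
    return [name for name, role in roles.items() if role in allowed]


if __name__ == "__main__":
    # Assumed configuration: private is internal-only, public is client-only.
    roles = {"private": NetworkRole.INTERNAL_ONLY, "public": NetworkRole.CLIENT_ONLY}
    print(internal_candidates(roles))  # prints ['private']
```

If that reading is right, a client-only public network would be missing from the priority list, which matches what the tab shows here, and the heartbeat would have no second path if private fails.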
0
 
LVL 16

Expert Comment

by:Gerald Connolly
ID: 21921002
So is there stuff in the event logs at boot time?
0
 

Author Comment

by:pcgeek1981
ID: 21921394
I am putting in a new ProCurve switch tomorrow afternoon. I think that maybe our ProCurve is going bad, even though it's not showing a fault error. The internet connection is slower in my building, where this older switch is being used, than in any other office in the whole plant. People have also been complaining lately that some of their programs are getting slower and then lose connection altogether, while their internet is still fast. It wouldn't normally be on my mind, but the internet comes into my office and feeds the rest of the plant, so I would think our office should be faster than any of the others.
0
 
LVL 55

Expert Comment

by:andyalder
ID: 21923639
I'm unsubscribing. The obvious thing to do is to take it right off the public LAN and get the private link working but you've got 3 faults here all logged under one ticket. At least we've sorted the iSCSI dependency problem.
0
