pcgeek1981

asked on

Cluster Servers continue to lose connection to Storage on machine reboot.

I have a two-node cluster set up and connected to a Promise VTrak M300i storage unit via iSCSI. I really have two problems, but am more concerned with the first one right now.

Several times during the day, the servers will lose connection between them and then drop the connection to the storage unit, causing delayed write failed errors. This is starting to happen with considerably more frequency. I have to shut down both servers, turn the storage unit off and back on, and then bring the servers up one at a time for them to reconnect to the storage unit. This is becoming a hassle. I don't know why they're dropping, unless it has something to do with the cluster continuing to lose connection to each node all the time. I thought that if one server went down the other should take up the slack, but on ours, when one goes down the other pretty much goes down as well. The only time I can tell failover actually works is when I initiate it or move groups from one node to the other.

The second problem is that if one machine goes down on its own, when it comes back up it does not connect to the storage unit automatically. Do I always have to take everything down and then bring up the storage unit first, before the servers, so that they will connect? That doesn't seem very user friendly to me.
ASKER CERTIFIED SOLUTION
Gerald Connolly

This solution is only available to Experts Exchange members.
Member_2_231077

You have to make the Cluster service depend on the iSCSI initiator, or else the cluster starts before the storage is available. http://support.microsoft.com/kb/883397/en-us
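A minimal sketch of that dependency change with sc.exe, assuming the default service names ClusSvc (Cluster service) and MSiSCSI (Microsoft iSCSI Initiator service); note that "depend=" replaces the whole dependency list, so re-specify anything that sc qc shows as already there:

    rem Show the current configuration, including any existing dependencies
    sc qc ClusSvc

    rem Set the dependency list again with MSiSCSI added (entries are separated by "/")
    sc config ClusSvc depend= <existing dependencies>/MSiSCSI

Restart the Cluster service (or reboot) for the new start order to take effect.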
pcgeek1981

ASKER

Okay, I checked all of my ProCurves and found a large number of dropped packets and collisions. I went to my two main switches, which the rest of the plant feeds off of, and found a poorly made crossover cable that I think might have been causing some of my problems. It's been almost forty-five minutes since I put a new cable in and no errors on any switch yet.

I have gone and set the iSCSI initiator settings like the support article said. I won't know for sure until I can reboot the machines and see if it works, but I cannot do that until after work hours this evening. Since I changed the cable, though, I have not had any complaints about being booted out of any programs or loss of email. Fingers are crossed for the moment.
No, spoke too soon. I started getting high collision/drop rate errors again. I don't know if maybe there is too much data trying to go across the line or if the switch is going bad, because I am only getting the errors at one end of the cable, in one switch and not in the other.
Why do you have the switch that is running iSCSI connected to the LAN switches? It is of course valid to run iSCSI over your main LAN, but it's much more SAN-like to keep them separate. With FC SANs you normally have two redundant switches and you don't connect them together, in case of fabric failure (a mad admin is unlikely to screw up both configs).
The iSCSI cards in the servers are connected directly to the Promise VTrak storage unit by Cat5e cables. The servers are connected by their second NICs to an 8-port gigabit Belkin switch, which is connected to one of the ProCurve switches by a 100/1000 module. And then each server has a third NIC, connected directly to the other server by a crossover cable for the heartbeat connection.
Have you replaced the cable between the servers yet?
Yes, I replaced it this morning and am watching to see if that works. Sorry it took so long to get back on here; someone had made some changes to the registry of the cluster servers and it took me most of the weekend to figure it out without reinstalling the OS.
After replacing the cable it is still going down. These are some of the errors I keep getting, roughly every 30 minutes:

Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername1" on network "public".

Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername2" on network "public".

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername2" on network "public" is unreachable by at least one other cluster node attached to the network. The server cluster was not able to determine the location of the failure.

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername1" on network "public" is unreachable by at least one other cluster node attached to the network. The server cluster was not able to determine the location of the failure.

SOLUTION

This solution is only available to Experts Exchange members.
Looks like you have the cluster traffic going over a network called "public", which I guess is not the private wire between the two cluster nodes.
The private network must be flaky then, because it is set up just like the help article suggests. The error messages always say it is losing connection on the public interface, though; I never get anything about the private network at all. It doesn't make any sense to me.
Force failures on the private network by pulling the crossover lead out, to test. If it doesn't log anything, that confirms it isn't connected at all, since the cluster doesn't log anything for a "still not working" condition.
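While testing, the cluster's own view of each network can be checked with cluster.exe; this is a sketch only, and the network names "Private" and "Public" are assumptions, so substitute whatever they are called in Cluster Administrator:

    cluster network "Private" /status
    cluster network "Public" /status

If the private network shows as down or partitioned even with the crossover cable plugged in, that points at the cable, the NICs, or the IP configuration on that link rather than at the public LAN.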
The key part of the message is that it comes from Cluster Services!

It's not just about wiring it up as per the book; you have to make sure that the cluster is set up to use the direct connection rather than the public interface. From memory, as it has been a while since I had to do this: right-click on the network interfaces in the Cluster GUI and take the appropriate action, i.e. select the private wire to be the preferred interface and select the public one as not preferred.

G
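For reference, the network role can also be set from the command line with cluster.exe; this is a sketch from memory, the names "Private" and "Public" are assumptions, and on Server 2003 Role=1 means internal (cluster-only) communication while Role=3 means all communications:

    cluster network "Private" /prop Role=1
    cluster network "Public" /prop Role=3

    rem Check the result
    cluster network "Private" /prop
    cluster network "Public" /prop

The priority order for node-to-node traffic itself is set on the Network Priority tab of the cluster's properties in Cluster Administrator, with the private network at the top.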
That's what the doc is about, connollyg; it goes through the bindings setup under networking and the priority under cluster setup.
Andy,

That's obvious!

But why is the Cluster service running over the public network? That seems to be the real problem!

A secondary issue is why the public network is so unreliable!

G
It shouldn't be running over the public network. I went in and checked the Network Priority tab under the properties of the cluster, and it shows that only my private network is being used. It doesn't even list the public network. And the properties of my private network show it is set to only use internal communication.
So is there stuff in the event logs at boot time?
I am putting in a new ProCurve switch tomorrow afternoon. I think that maybe our ProCurve, even though it's not showing a fault error, is going bad. The internet connection is slower in my building, where this older switch is being used, than in any other office in the whole plant. And people have been complaining that some of their programs have been getting slower lately and then just lose connection altogether, even though their internet is still fast. It wouldn't normally be on my mind, but the internet comes into my office and feeds the rest of the plant, so I would think our office should be faster than any of the others.
I'm unsubscribing. The obvious thing to do is to take it right off the public LAN and get the private link working, but you've got three faults here all logged under one ticket. At least we've sorted the iSCSI dependency problem.