pcgeek1981

asked on

Cluster Servers continue to lose connection to Storage on machine reboot.

I have a two-node cluster set up and connected to a Promise VTrak M300i storage unit via iSCSI. I really have two problems, but am more concerned with the first one right now.

Several times during the day, the servers will lose connection between them and then drop the connection to the storage unit, causing delayed write failed errors. This is starting to happen with considerably more frequency. I have to shut down both servers, turn the storage unit off and back on, and then bring the servers up one at a time for them to reconnect to the storage unit. This is becoming a hassle. I don't know why they're dropping, unless it has something to do with the cluster continuing to lose connection to each node all the time. I thought that if one server went down the other should take up the slack, but on ours, when one goes down the other pretty much goes down as well. The only time I can tell failover actually works is when I initiate it or move groups from one node to the other.

The second problem is that if one machine goes down on its own, when it comes back up it does not connect to the storage unit automatically. Do I always have to take everything down and then bring up the storage unit first, before the servers, so that they will connect? That doesn't seem very user friendly to me.
ASKER CERTIFIED SOLUTION
Gerald Connolly

This solution is only available to Experts Exchange members.
Member_2_231077

You have to make the Cluster service depend on the iSCSI initiator, or else the cluster starts before the storage is available. http://support.microsoft.com/kb/883397/en-us
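A minimal sketch of that dependency change with sc.exe, assuming the default service names ClusSvc (Cluster service) and MSiSCSI (Microsoft iSCSI Initiator service); note that "depend=" replaces the whole dependency list, so re-specify anything that sc qc shows as already there:

    rem Show the current configuration, including any existing dependencies
    sc qc ClusSvc

    rem Set the dependency list again with MSiSCSI added (entries are separated by "/")
    sc config ClusSvc depend= <existing dependencies>/MSiSCSI

Restart the Cluster service (or reboot) for the new start order to take effect.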
pcgeek1981

ASKER

Okay, I checked all of my ProCurves and found a large number of dropped packets and collisions. I went to my two main switches, which the rest of the plant feeds off of, and found a poorly made crossover cable that I think might have been causing some of my problems. It's been almost forty-five minutes since I put a new cable in and no errors on any switch yet.

I have gone and set the iSCSI initiator settings like the support article said. I won't know for sure until I can reboot the machines and see if it works, but I cannot do that until after work hours this evening. Since I changed the cable, though, I have not had any complaints about being booted out of any programs or loss of email. Fingers are crossed for the moment.
No, spoke too soon. I started getting high collision/drop rate errors again. I don't know if maybe there is too much data trying to go across the line or if the switch is going bad, because I am only getting the errors at one end of the cable, in one switch and not in the other.
Why do you have the switch that is running iSCSI connected to the LAN switches? It is of course valid to run iSCSI over your main LAN, but it's much more SAN-like to keep them separate. With FC SANs you normally have two redundant switches and you don't connect them together, in case of fabric failure (a mad admin is unlikely to screw up both configs).
The iSCSI cards in the servers are connected directly to the Promise VTrak storage unit by Cat5e cables. The servers are connected by their second NICs to an 8-port gigabit Belkin switch, which is connected to one of the ProCurve switches by a 100/1000 module. And then each server has a third NIC, connected directly to the other server by a crossover cable for the heartbeat connection.
Have you replaced the cable between the servers yet?
Yes, I replaced it this morning and am watching to see if that works. Sorry it took so long to get back on here; someone had made some changes to the registry of the cluster servers and it took me most of the weekend to figure it out without reinstalling the OS.
After replacing the cable it is still going down. These are some of the errors I keep getting, roughly every 30 minutes:

Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername1" on network "public".

Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername2" on network "public".

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername2" on network "public" is unreachable by at least one other cluster node attached to the network. The server cluster was not able to determine the location of the failure.

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername1" on network "public" is unreachable by at least one other cluster node attached to the network. The server cluster was not able to determine the location of the failure.

SOLUTION

This solution is only available to Experts Exchange members.
Looks like you have the cluster traffic going over a network called "public", which I guess is not the private wire between the two cluster nodes.
The private network must be flaky then, because it is set up just like the help article suggests. The error messages always say it is losing connection on the public interface, though; I never get anything about the private network at all. It doesn't make any sense to me.
Force failures on the private network by pulling the crossover lead out, to test. If it doesn't log anything, that confirms it isn't connected at all, since the cluster doesn't log anything for a "still not working" condition.
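While testing, the cluster's own view of each network can be checked with cluster.exe; this is a sketch only, and the network names "Private" and "Public" are assumptions, so substitute whatever they are called in Cluster Administrator:

    cluster network "Private" /status
    cluster network "Public" /status

If the private network shows as down or partitioned even with the crossover cable plugged in, that points at the cable, the NICs, or the IP configuration on that link rather than at the public LAN.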
The key part of the message is that it comes from Cluster Services!

It's not just about wiring it up as per the book; you have to make sure that the cluster is set up to use the direct connection rather than the public interface. From memory, as it has been a while since I had to do this: right-click on the network interfaces in the Cluster GUI and take the appropriate action, i.e. select the private wire to be the preferred interface and select the public one as not preferred.

G
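For reference, the network role can also be set from the command line with cluster.exe; this is a sketch from memory, the names "Private" and "Public" are assumptions, and on Server 2003 Role=1 means internal (cluster-only) communication while Role=3 means all communications:

    cluster network "Private" /prop Role=1
    cluster network "Public" /prop Role=3

    rem Check the result
    cluster network "Private" /prop
    cluster network "Public" /prop

The priority order for node-to-node traffic itself is set on the Network Priority tab of the cluster's properties in Cluster Administrator, with the private network at the top.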
That's what the doc is about, connollyg; it goes through the bindings setup under networking and the priority under cluster setup.
Andy,

That's obvious!

But why is the Cluster service running over the public network? That seems to be the real problem!

A secondary issue is why the public network is so unreliable!

G
It shouldn't be running over the public network. I went in and checked the Network Priority tab under the properties of the cluster, and it shows that only my private network is being used. It doesn't even list the public network. And the properties of my private network show it is set to only use internal communication.
So is there stuff in the event logs at boot time?
I am putting in a new ProCurve switch tomorrow afternoon. I think that maybe our ProCurve, even though it's not showing a fault error, is going bad. The internet connection is slower in my building, where this older switch is being used, than in any other office in the whole plant. And people have been complaining that some of their programs have been getting slower lately and then just lose connection altogether, even though their internet is still fast. It wouldn't normally be on my mind, but the internet comes into my office and feeds the rest of the plant, so I would think our office should be faster than any of the others.
I'm unsubscribing. The obvious thing to do is to take it right off the public LAN and get the private link working, but you've got three faults here all logged under one ticket. At least we've sorted the iSCSI dependency problem.