Solved

Cluster Servers continue to lose connection to Storage on machine reboot.

Posted on 2008-06-25
Last Modified: 2013-11-14
I have a two-node cluster set up and connected to a Promise VTrak M300i storage unit via iSCSI. I really have two problems, but I am more concerned with the first one right now.

Several times a day, the servers lose connection to each other, then drop their connection to the storage unit, which causes a "Delayed Write Failed" error. This is happening considerably more often. To get them reconnected, I have to shut down both servers, power-cycle the storage unit, and bring the servers up one at a time. This is becoming a hassle. I don't know why they're dropping, unless it has something to do with the cluster nodes continually losing connection to each other. I thought that if one server went down the other should take up the slack, but on ours, when one goes down the other pretty much goes down as well. The only time I can tell it actually works is when I initiate failover or move groups from one node to the other.

The second problem is that if one machine goes down on its own, it does not reconnect to the storage unit automatically when it comes back up. Do I always have to take everything down and then bring up the storage unit before the servers so that they will connect? That doesn't seem very user-friendly to me.
Question by:pcgeek1981
20 Comments
 
LVL 17

Accepted Solution

by:
Gerald Connolly earned 250 total points
ID: 21872296
Firstly, check your network for problems: check the error logs and counters on your switches.
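As a rough illustration of what "checking the counters" can look like in practice, here is a minimal sketch that compares two snapshots of per-port counters and flags the ports whose error counts are still climbing. The port names, counter names and numbers are all made up for the example, and the sketch does not talk to a switch; you would fill in the snapshots by hand from the switch's counter display.

```python
# Hypothetical counter snapshots: real values would be copied from the
# switch's port-counter display; this sketch does not query any device.
def rising_error_ports(before, after, threshold=0):
    """Return {port: {counter: delta}} for counters that grew by more than threshold."""
    flagged = {}
    for port, counters in after.items():
        previous = before.get(port, {})
        deltas = {
            name: value - previous.get(name, 0)
            for name, value in counters.items()
            if value - previous.get(name, 0) > threshold
        }
        if deltas:
            flagged[port] = deltas
    return flagged


# Two snapshots taken some minutes apart (illustrative numbers only).
before = {"A1": {"drops": 120, "collisions": 40}, "A2": {"drops": 3, "collisions": 0}}
after = {"A1": {"drops": 560, "collisions": 95}, "A2": {"drops": 3, "collisions": 0}}

print(rising_error_ports(before, after))
```

Looking at the change between snapshots rather than the raw totals matters because the totals include every error since the counters were last cleared, so an old fault can look like a current one.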
 
LVL 55

Expert Comment

by:andyalder
ID: 21873039
You have to make the cluster service depend on the iSCSI initiator service, or else the cluster starts before the storage is available. http://support.microsoft.com/kb/883397/en-us
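A minimal sketch of how that dependency change might be scripted, assuming the usual service names `ClusSvc` (Cluster service) and `MSiSCSI` (Microsoft iSCSI Initiator service). The existing dependency list below is an assumption for illustration; read the real one first with `sc qc ClusSvc`, because `sc config ... depend=` replaces the whole list rather than appending to it. The helper name `clussvc_depend_command` is hypothetical, and the sketch only builds the command, it does not run it.

```python
def clussvc_depend_command(existing, extra="MSiSCSI", service="ClusSvc"):
    """Build the argument list for `sc config <service> depend= a/b/c`.

    `existing` must be the service's current dependencies, since
    `depend=` overwrites the full list instead of adding to it.
    """
    deps = list(existing)
    # Add the iSCSI initiator service only if it is not already listed.
    if extra.lower() not in (d.lower() for d in deps):
        deps.append(extra)
    # sc.exe expects "depend=" and the slash-separated value as separate tokens.
    return ["sc", "config", service, "depend=", "/".join(deps)]


# Assumed current dependencies; confirm with `sc qc ClusSvc` on your own node.
current = ["ClusNet", "RpcSs", "W32Time", "NetMan"]
args = clussvc_depend_command(current)
print(" ".join(args))
```

The resulting argument list could be passed to `subprocess.run(args, check=True)` on each node, after checking it against the article linked above.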
 

Author Comment

by:pcgeek1981
ID: 21877157
Okay, I checked all of my ProCurves and found a large number of dropped packets and collisions. I went to my two main switches, which the rest of the plant feeds off, and found a poorly made crossover cable that I think might have been causing some of my problems. It's been almost forty-five minutes since I put a new cable in, and there are no errors on any switch yet.

I have gone and set the iSCSI initiator settings like the support article said. Will not know for sure until I can reboot the machines and see if it works, but I cannot do that until after work hours this evening. But since I changed the cable I have not had any complaints about being booted out of any programs or loss of email. Fingers are crossed for the moment.

 

Author Comment

by:pcgeek1981
ID: 21877608
No, spoke too soon. I started getting high collision/drop rate errors again. I don't know if maybe there is too much data trying to go across the line or if the switch is going bad, because I am only getting the errors at one end of the cable, in one switch and not in the other.
 
LVL 55

Expert Comment

by:andyalder
ID: 21879318
Why do you have the switch that is running iSCSI connected to the LAN switches? It is of course valid to run iSCSI over your main LAN, but it's much more SAN-like to keep them separate. With FC SANs you normally have two redundant switches and you don't connect them together in case of fabric failure (a mad admin is unlikely to screw up both configs).
 

Author Comment

by:pcgeek1981
ID: 21879828
The iSCSI cards in the servers are connected directly to the Promise VTrak storage unit by Cat5e cables. The servers are connected by their second NICs to an 8-port 1000 Mbps Belkin switch, which is connected to one of the ProCurve switches by a 100/1000 module. Each server also has a third NIC, connected directly to the other server by a crossover cable for the heartbeat connection.
 
LVL 17

Expert Comment

by:Gerald Connolly
ID: 21881347
Have you replaced the cable between the servers yet?
 

Author Comment

by:pcgeek1981
ID: 21898770
Yes, replaced it this morning. I am waiting to see if that works. Sorry it took so long to get back on here; someone had made some changes to the registry of the cluster servers and it took me most of the weekend to figure it out without reinstalling the OS.
 

Author Comment

by:pcgeek1981
ID: 21909684
After replacing the cable it is still going down. These are some of the errors I keep getting roughly every 30 minutes.

Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername1" on network "public"

Source: ClusSvc
Event ID: 1123
The node lost communication with cluster node "servername2" on network "public"

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername2" on network "public" is unreachable by at least one other cluster node attached to the network, the server cluster was not able to determine the locations of the failure.

Source: ClusSvc
Event ID: 1126
The interface for cluster node "servername1" on network "public" is unreachable by at least one other cluster node attached to the network, the server cluster was not able to determine the locations of the failure.
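
To see at a glance which network the losses are being reported on, the pasted messages can be tallied by node and network. This is only a sketch: the pattern assumes the message wording shown in this post (including the stray single quote), and the sample text is abbreviated from the entries above, not a real log export.

```python
import re
from collections import Counter

# Matches: cluster node "<name>" on network "<name>" (closing quote may be " or ').
LOST = re.compile(r'cluster node "([^"]+)" on network "([^"\']+)["\']')


def tally_losses(log_text):
    """Count (node, network) pairs mentioned in the pasted ClusSvc messages."""
    return Counter(LOST.findall(log_text))


# Abbreviated sample based on the event text quoted in this post.
sample = '''\
The node lost communication with cluster node "servername1" on network "public"
The node lost communication with cluster node "servername2" on network "public"
The interface for cluster node "servername2" on network "public" is unreachable
'''

for (node, network), count in sorted(tally_losses(sample).items()):
    print(node, network, count)
```

If every pair in the tally names the public network and none names the private one, that matches what I am seeing in the event log.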

 
LVL 55

Assisted Solution

by:andyalder
andyalder earned 250 total points
ID: 21910378
Check it is configured as per http://support.microsoft.com/kb/258750 .

Either you have the heartbeat priority set to public instead of private, or private is flaky.

You'll probably find that if you turn your cluster off completely there is periodic congestion on the public LAN from some other source and it's a red herring as far as MSCS goes.
 
LVL 17

Expert Comment

by:Gerald Connolly
ID: 21910485
Looks like you have the cluster traffic going over a network called public, which I guess is not the private wire between the two cluster nodes.
 

Author Comment

by:pcgeek1981
ID: 21912165
Private must be flaky, because it is set up just like the help article suggests. The error messages always say it is losing connection on the public interface, though; I never get anything about the private network at all. It doesn't make any sense to me.
 
LVL 55

Expert Comment

by:andyalder
ID: 21914085
Force failures on private by pulling the crossover lead out to test. If it doesn't log anything it confirms it isn't connected at all since it doesn't log anything for a "still not working" condition.
 
LVL 17

Expert Comment

by:Gerald Connolly
ID: 21915738
The key part of the message is that it comes from Cluster Services!

It's not just about wiring it up as per the book; you have to make sure that the cluster is set up to use the direct connection rather than the public interface. From memory, as it's been a while since I had to do this: right-click on the network interfaces in the Cluster GUI and take the appropriate action, i.e. select the private wire to be the preferred interface and select the public one as not preferred.

G
 
LVL 55

Expert Comment

by:andyalder
ID: 21917682
That's what the doc is about connollyg, it goes through the bindings setup under networking and the priority under cluster setup.
 
LVL 17

Expert Comment

by:Gerald Connolly
ID: 21920323
Andy,

That's obvious!

But why is cluster services running over the public network? That seems to be the real problem!

A secondary issue is why the public network is so unreliable!

G
 

Author Comment

by:pcgeek1981
ID: 21920480
It shouldn't be running over the public network, I went in and checked the network priority tab under properties of the cluster and it shows that only my private network is being used. It doesn't even list the public network. And the properties of my private network show it to be set to only use internal communication.
 
LVL 17

Expert Comment

by:Gerald Connolly
ID: 21921002
So is there stuff in the event logs at boot time?
 

Author Comment

by:pcgeek1981
ID: 21921394
I am putting in a new ProCurve switch tomorrow afternoon. I think that maybe our ProCurve, even though it's not showing a fault, is going bad. The internet connection is slower in my building, where this older switch is being used, than in any other office in the whole plant. People here have also been complaining lately that some of their programs are getting slower and then lose connection altogether, while their internet is still fast. It wouldn't normally be on my mind, but the internet comes into my office and feeds the rest of the plant, so I would think our office should be faster than any of the others.
 
LVL 55

Expert Comment

by:andyalder
ID: 21923639
I'm unsubscribing. The obvious thing to do is to take it right off the public LAN and get the private link working but you've got 3 faults here all logged under one ticket. At least we've sorted the iSCSI dependency problem.