One node is down - so evicted it but unable to add it back again

RAMKUMAR DABBEERU
Issue in Brief: Unable to add node back to the cluster.

Details of the issue: We noticed that one of the nodes in the two-node cluster (Node and Disk Majority) shows as 'down' in Failover Cluster Manager (FCM). However, we're able to ping it from the other node and also RDP into it.

Troubleshooting Performed:
-Tried to evict and re-add the problematic node to the cluster, but unsuccessful in doing so, both via PowerShell and via the GUI.
-Also rebooted the node and performed the above operation. Still no luck.
  Error: Unable to successfully cleanup.
-Ran "Clear-ClusterNode" and it was successful.
-Then tried to re-add the node, but received the same error through both the GUI and PowerShell.
-Rebooted the problematic node; still getting the same error.
-Upon further investigation, we noticed that the "CLUSDB" was missing from the registry of the problematic node.
-Copied the database to the problematic node (from the other node) and tried to load it in the registry, without success.
-Then copied the "CLUSDB.blf" file to the location, renamed it to "CLUSDB" and tried to add the node again. Still getting the same error.

-Looked into AD, and found the appropriate CNO for the cluster, and VCOs of its corresponding nodes.

-Noticed below error in the cluster logs as part of our investigation.

"New join with n2: stage: 'Authenticate Initial Connection' status HrError(0x80090301) reason: '[SV] Authentication failed'"

-Did a "portquery" on the other node, from the problematic node and it returned with an error.
TCP port 3343(ms-cluster -net service): LISTENING
UDP port 3343(ms-cluster -net service): NOT LISTENING
portqry.exe -n 192.168.5.90 -e 3343 -p BOTH exits with return code 0x00000001

Kindly suggest! It's very critical now that we get it added back to the existing cluster so as to fail them over.

Attachment: Unable-to-successfully-cleanup.jpg

-Also gave full permissions to the CNO, without success.
-Checked "services.msc" and found that the 'cluster' service was in a disabled state. Enabled it and started the service, but it failed.
  Error: Windows couldn't start the cluster service on Local Computer
-Checked 'System Events' and found the below events at the same time:
  Event ID 7024: The Cluster Service terminated with the following service specific error - The system can't find the file specified.
  Event ID 1090: The Cluster Service cannot be started. An attempt to read configuration data from the Windows registry failed with error '2'. Please use the Failover Cluster Management snap-in to ensure that this machine is a member of a cluster. If you intend to add this machine to an existing cluster use the Add Node Wizard. Alternatively, if this machine has been configured as a member of a cluster, it will be necessary to restore the missing configuration data that is necessary for the Cluster Service to identify that it is a member of a cluster. Perform a System State Restore of this machine in order to restore the configuration data.
-Checked for certificates on both servers but couldn't find cluster-related certificates on either of them.
-Removed the 'Failover Clustering' role from the node 'ndb2012b' and added it back, but no success.
Ben Personick (Previously QCubed), Lead SaaS Infrastructure Engineer

Commented:

Set your cluster logging level higher and generate the error again:


PS: (2012+ Only)
Set-ClusterLog -Level 5

CMD:
Cluster log /loglevel:5


Gather your cluster logs:

PS: (2012+ Only)
Get-ClusterLog -Destination "\\SomeServer\SomeShare\SomeFolder\"


CMD:
Cluster.exe log /gen /copy:"\\SomeServer\SomeShare\SomeFolder\"


Download, install and run the MS Cluster Diagnostics tool:

https://www.microsoft.com/en-us/download/details.aspx?id=18479

Review the reports from the cluster logs and the Diagnostics tool to pinpoint the issue; we can assist.
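
Once the logs are gathered, one quick way to narrow them down is to search for the authentication failure quoted in the question. This is only a sketch: it assumes the logs landed in the placeholder share used in the commands above and that the generated files end in .log.

```powershell
# Folder the cluster logs were copied to (placeholder path, same as in the commands above)
$logFolder = '\\SomeServer\SomeShare\SomeFolder'

# Find join/authentication failures and show three lines of context around each hit
Select-String -Path (Join-Path $logFolder '*.log') `
              -Pattern 'Authentication failed', 'HrError\(0x' `
              -Context 3, 3 |
    Select-Object -First 20
```

The surrounding lines usually say which node logged the failure and what was attempted just before it, which is more useful than the single line on its own.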

Commented:
Do this:

1. Remove the Failover Clustering/Hyper-V features completely
2. Reboot
3. Add the Failover Clustering/Hyper-V features back
4. Try to add the node back into the cluster

Additional things to do:

Restore the registry back to how it was before you made changes (assuming you backed it up beforehand)
If you updated that server before it messed up, remove the updates
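
The numbered steps above could be sketched in PowerShell roughly as follows. Assumptions: the node is Windows Server 2012 or later, 'YourClusterName' is a placeholder because the thread never gives the cluster name, and only the Failover Clustering feature is handled since the thread does not confirm that Hyper-V is installed. The commands are meant to be run one step at a time, not as a single script.

```powershell
# Run on the problematic node (ndb2012b in the question), in an elevated session.

# Steps 1-2: remove the feature, then reboot
Uninstall-WindowsFeature -Name Failover-Clustering
Restart-Computer

# Step 3: after the reboot, add the feature back together with its management tools
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools

# Clear any leftover cluster configuration on this node before retrying
Clear-ClusterNode -Force

# Step 4: try the join again ('YourClusterName' is a placeholder)
Add-ClusterNode -Cluster 'YourClusterName' -Name 'ndb2012b'
```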
RAMKUMAR DABBEERU, Support Engineer - Storage & High Availability

Author

Commented:
Hello Ben,

Thank you very much for the advice.

However, I've already mentioned a couple of errors/observations in cluster logs:
"New join with n2: stage: 'Authenticate Initial Connection' status HrError(0x80090301) reason: '[SV] Authentication failed'"

Also, did a "portquery" on the other node, from the problematic node and it returned with an error.
TCP port 3343(ms-cluster -net service): LISTENING
UDP port 3343(ms-cluster -net service): NOT LISTENING
portqry.exe -n 192.168.5.90 -e 3343 -p BOTH exits with return code 0x00000001

Please advise.

Commented:
Did any policies change on that server?

For instance, the CLIUSR account is used for joining nodes to the cluster. This is a local account; you can find it under Local Users and Groups.

Now go to the local policy on the problematic server, open "User Rights Assignment" and find "Deny access to this computer from the network". If the local account is listed there, that would prevent the node from being added back to the cluster. Remove that account, run "gpupdate /force" in a command prompt and try to add the node back.
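
The same check can be done from PowerShell. A minimal sketch, assuming an elevated session on the problematic node; C:\Temp\user-rights.inf is a hypothetical output path, and SeDenyNetworkLogonRight is the internal name of the "Deny access to this computer from the network" right.

```powershell
# Confirm the local CLIUSR account exists on the problematic node
Get-WmiObject -Class Win32_UserAccount -Filter "LocalAccount=True AND Name='CLIUSR'" |
    Select-Object Name, Disabled, Lockout

# Export the user rights assignments and look for the deny-network-logon entry.
# C:\Temp\user-rights.inf is a hypothetical path chosen for this example.
New-Item -ItemType Directory -Path 'C:\Temp' -Force | Out-Null
secedit /export /areas USER_RIGHTS /cfg 'C:\Temp\user-rights.inf' | Out-Null
Select-String -Path 'C:\Temp\user-rights.inf' -Pattern 'SeDenyNetworkLogonRight'

# After removing the account from the policy, refresh policy before retrying the join
gpupdate /force
```

Entries on that line may appear as SIDs rather than names; S-1-5-113 (NT AUTHORITY\Local account), for example, covers every local account, CLIUSR included.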
Ben Personick (Previously QCubed), Lead SaaS Infrastructure Engineer

Commented:
Yeah, I missed that it came from the cluster logs and not the event logs, where the others came from, and that you had already run the port query. My apologies for that.

However:

UDP port 3343(ms-cluster -net service): NOT LISTENING
portqry.exe -n 192.168.5.90 -e 3343 -p BOTH exits with return code 0x00000001

Return code 0x00000001 just means that a queried port returned "NOT LISTENING", so that isn't much use.

It isn't listening because the cluster service isn't running; that isn't a mystery. The service fails every time you start it, and you can't run it manually if it hasn't been configured through the interface.

Why the cluster service is not starting is the crux of the issue.
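
To see why the service will not start, it may help to look at the service state, the cluster registry hive and the two events from the question together. A rough sketch, assuming default install paths and an elevated session on the problematic node:

```powershell
# Service state and start mode ('ClusSvc' is the Cluster Service's service name)
Get-WmiObject -Class Win32_Service -Filter "Name='ClusSvc'" |
    Select-Object Name, State, StartMode

# Event 1090 in the question reports registry error 2 (file not found),
# so check whether the cluster hive is loaded and whether its backing file exists
Test-Path 'HKLM:\Cluster'
Test-Path (Join-Path $env:SystemRoot 'Cluster\CLUSDB')

# Most recent occurrences of the two event IDs quoted in the question
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 7024, 1090 } -MaxEvents 20 |
    Format-List TimeCreated, Id, ProviderName, Message
```

On a node that has been evicted and cleaned up, both Test-Path checks returning False is expected, and the service is not supposed to start until the node has joined a cluster again.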

Joining the cluster through the config is generating this:

"New join with n2: stage: 'Authenticate Initial Connection' status HrError(0x80090301) reason: '[SV] Authentication failed'"

The cluster error you posted could actually be a secure channel error with this node, but it's hard to say, since you're only posting a single line of your cluster log, and it's about the join process. That seems to indicate you are getting the cluster service running but that the cluster isn't authenticating to the other node. Your other errors, however, are more along the lines of the cluster service not running correctly at all and failing, so I am not clear on which is coming from which node.

One thing that springs to mind from the issue is that you need to make sure that the WinRM service is running properly to join a cluster correctly.

I assume that you are not running Windows Firewall or similar; if you are, turn it off for now. Make sure the WinRM service is running properly, and then add the role fresh. Also make sure DNS is configured correctly on both nodes, along with routes, and that your nodes are talking to Active Directory properly.

If the secure channel is broken, you can unjoin the node from the domain and re-join it to resolve that.
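
These checks could be run from the problematic node along the following lines. This is a sketch under assumptions: 'OtherNode' is a placeholder for the healthy node's name, and Resolve-DnsName requires Windows Server 2012 or later.

```powershell
# WinRM: service state locally, and whether the other node answers WS-Management requests
Get-Service -Name WinRM
Test-WSMan -ComputerName 'OtherNode'    # 'OtherNode' is a placeholder

# DNS: confirm the other node's name resolves from this node
Resolve-DnsName -Name 'OtherNode'

# Secure channel: returns True when this machine's trust with the domain is healthy
Test-ComputerSecureChannel -Verbose

# If it returns False, a repair can be attempted before resorting to unjoin/re-join
Test-ComputerSecureChannel -Repair -Credential (Get-Credential)
```

The repair needs credentials that are allowed to reset the computer account password; if it does not succeed, unjoining and re-joining the domain remains the fallback.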

If this were still a member node of the cluster instead of having been removed, I would suggest trying a /fixquorum, as that could resolve missing entries in the nodes. But since you uninstalled the role, it's more about making sure all the old entries for the node are gone, and troubleshooting connectivity to AD and the new node.

What is this cluster used for? If you had any DFS shares running on the system that you completely removed, oh god, DFS can make life harder.

Also, I want to warn you that a corrupted cluster DB should be fixed using the cluster command, and NOT by copying the registry from one node to another.

Although YMMV: I have done that once as part of the process with Microsoft on a deeper cluster issue, because at the time we thought it might resolve the issue, though it did not work. In that scenario, however, the cluster node had not been evicted from the cluster. Replicating cluster DB config between nodes that are not in the cluster would be of little value, because the nodes need to replicate that between themselves on initially joining the cluster. So now you have perhaps muddied the waters a bit too much.

The cluster name and node membership are kept in Active Directory, and if you have a secure channel issue, that might be the actual cause of being unable to re-configure the node.

Having full cluster Diagnostic logs, and working through the scenario with you on what actions were taken to cause the issue in the first place, and what features you're running in the cluster for reference, would help.
