shatarie

Clustering

The cluster service on Server 2016 causes the server to hard lock upon start.  I cannot seem to get anywhere with any of the articles I read.
Brian Murphy

Cluster Services for what purpose?

Clustered Shared Services?  Microsoft SQL Server? Hyper-V?

Does it happen on all nodes in the cluster?

Is any third-party software installed that we should be aware of?
shatarie

ASKER

It is a Hyper-V environment. We have 3 blade servers running 13 VMs.  It does happen on the other two as well.  I should have been more specific.
No problem.  Just curious regarding symptoms.  I believe you stated that the clustering service causes the 2016 server to hard lock but can you provide any evidence relative to logging?  

Does this happen on reboot?  When you stop the service and restart the service?  Does it happen immediately or after some other process runs?

Trying to narrow down the scope of troubleshooting.
When I perform a selective startup and do not allow the Cluster service to start, everything starts.  When I manually start the service, the server hard locks, requiring a physical shutdown.  Also, please see the attached event log.  Thank you so much for your help BTW.
I don't think the event log came through but I was going to ask if you could run this PS command on one node?

Use the -TimeSpan option to limit the log to the most recent data; two days should be sufficient.  The value is given in minutes, so two days is 2880.

md C:\cslogs
Get-ClusterLog -TimeSpan 2880 -UseLocalTime -Destination C:\cslogs

Please send just one node's log for the time being.
Sadly I am not there.  Here are the logs .....

The description for Event ID 5398 from source Microsoft-Windows-FailoverClustering cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

2
1
Quorum Disk DBLADE1 DBLADE2




The description for Event ID 1653 from source Microsoft-Windows-FailoverClustering cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

DBLADE2


The description for Event ID 1573 from source Microsoft-Windows-FailoverClustering cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

DBLADE2
When is the next opportunity to run that cluster logs command?  

This will provide more details.  The information above is too generic to really get anywhere given the codes.
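
The pasted events lost their descriptions because they were viewed on a computer without the clustering component installed. A minimal sketch for pulling the same three event IDs on a cluster node itself, where the message text should render, is shown below. It assumes the events are written to the System log, and `New-ClusterEventFilter` is a hypothetical helper name; the provider name and IDs are taken from the output above.

```powershell
# Build a Get-WinEvent filter for the event IDs quoted in the thread.
# Assumption: FailoverClustering events are recorded in the System log.
function New-ClusterEventFilter {
    param([int]$Days = 2)
    @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'
        Id           = 5398, 1653, 1573
        StartTime    = (Get-Date).AddDays(-$Days)
    }
}

# Only query when Get-WinEvent is available (Windows).
if (Get-Command Get-WinEvent -ErrorAction SilentlyContinue) {
    Get-WinEvent -FilterHashtable (New-ClusterEventFilter) -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, Id, Message |
        Format-List
}
```

Running this on the node that raised the events avoids the "description cannot be found" placeholder seen in the exported copy.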

It does make me curious about the quorum, however. Errors like these sometimes correspond to a communication problem, but you generally see them while configuring the cluster service, CSVs, the cluster name, and so on.

So, where is the actual quorum disk located?  What type of storage?  Storage Spaces Direct (S2D)?  iSCSI?
Tomorrow morning.  Will you be available?  It will be about 9 AM Central.
Yes, I'll be online tomorrow, most of the day probably.  I'm in CST as well.
thank you so much Brian.
Brian,

Here is the output

    Directory: C:\


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----         1/5/2019   8:56 AM                cslogs


PS C:\Users\administrator.NEWDIANA> get-clusterlog -timespan 2 -uselocaltime -destination c:\cslogs
get-clusterlog : Unable to connect to DBLADE2 via WMI.  This may be due to networking issues or firewall configuration
on DBLADE2.
    The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
At line:1 char:1
+ get-clusterlog -timespan 2 -uselocaltime -destination c:\cslogs
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Get-ClusterLog], ClusterCmdletException
    + FullyQualifiedErrorId : Get-ClusterLog,Microsoft.FailoverClusters.PowerShell.GetClusterLogCommand
get-clusterlog : Unable to connect to DBLADE3 via WMI.  This may be due to networking issues or firewall configuration
on DBLADE3.
    The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
At line:1 char:1
+ get-clusterlog -timespan 2 -uselocaltime -destination c:\cslogs
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Get-ClusterLog], ClusterCmdletException
    + FullyQualifiedErrorId : Get-ClusterLog,Microsoft.FailoverClusters.PowerShell.GetClusterLogCommand

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----         1/5/2019   8:57 AM            638 DBLADE1_cluster.log
get-clusterlog : Object reference not set to an instance of an object.
At line:1 char:1
+ get-clusterlog -timespan 2 -uselocaltime -destination c:\cslogs
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Get-ClusterLog], ClusterCmdletException
    + FullyQualifiedErrorId : Get-ClusterLog,Microsoft.FailoverClusters.PowerShell.GetClusterLogCommand
get-clusterlog : Object reference not set to an instance of an object.
At line:1 char:1
+ get-clusterlog -timespan 2 -uselocaltime -destination c:\cslogs
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Get-ClusterLog], ClusterCmdletException
    + FullyQualifiedErrorId : Get-ClusterLog,Microsoft.FailoverClusters.PowerShell.GetClusterLogCommand
Okay, as I stated earlier it does appear to be a communication issue.  Give me some time to research further and come back with best recommendations.
Yes, did not think it was.  Seems to be almost a conflict for some reason with the local service.  Thank you for your assistance Brian.
Okay, I need to rule a few things out, because this error could be a "false positive" caused by how PowerShell gathers the information rather than by the actual problem.  Technically, it could be anything from a firewall blocking RPC, even with an any/any rule in place, to a simple DNS issue, although it is rarely that easy.

The changes I'm proposing are temporary, for now, to get past this WMI error and obtain correct cluster log output.  Some of these things might already be set correctly, but I've seen cases where they were disabled by an upper-level GPO or as part of the server build.  We need to confirm they are set as needed and then run the cluster log command again for new output.  If the settings are coming from a GPO, I'm going to suggest moving these clustered servers to a separate OU with a separate, "blank slate" GPO hosted in the Central Repository, because an upper-level GPO could be causing both the WMI problem and the clustering issue.

I would start by opening a CMD prompt on each server and capturing the applied GPO settings with "gpresult" as follows (example):
Open CMD Prompt
Type cd\
Type md c:\gpresult
Type cd gpresult
Type gpresult /H node_1.html
Repeat for Node 2

This is for your records and to verify whether the following settings can be made without first moving these servers to their own Active Directory OU, disabling inheritance, and creating a "blank" GPO.  We can start with this, but there will be other steps. I prefer to troubleshoot in increments and rule things out one at a time, because I have no insight into how everything is configured, so you're getting my "best guess".  We could start with WBEMTest, but it would only lead us back to the required steps below.

Must be completed on both nodes:

Step 1: Verify that the following services are configured and started.

Open services.msc and verify that each of the following services is set to "Automatic" and is started for now:
TCP/IP NetBIOS Helper
Remote Procedure Call (RPC)
Remote Procedure Call (RPC) Locator
Windows Management Instrumentation
Remote Access Auto Connection Manager
Remote Access Connection Manager
Remote Registry
Windows Firewall (yes, start this service if it is disabled; stopping or disabling it causes issues with functionality)
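
The service checklist above can also be spot-checked from PowerShell instead of clicking through services.msc. This is only a sketch: `Select-ServiceGap` is a hypothetical helper name, the display names are copied from the list above, and "Windows Firewall" is assumed to be the display name used on Server 2016.

```powershell
# Display names taken from the checklist above.
$required = @(
    'TCP/IP NetBIOS Helper',
    'Remote Procedure Call (RPC)',
    'Remote Procedure Call (RPC) Locator',
    'Windows Management Instrumentation',
    'Remote Access Auto Connection Manager',
    'Remote Access Connection Manager',
    'Remote Registry',
    'Windows Firewall'
)

# Pass through only services that are not running or not set to Automatic.
function Select-ServiceGap {
    param([Parameter(ValueFromPipeline)]$Service)
    process {
        if ("$($Service.Status)" -ne 'Running' -or "$($Service.StartType)" -ne 'Automatic') {
            $Service
        }
    }
}

# Only query when Get-Service is available (Windows).
if (Get-Command Get-Service -ErrorAction SilentlyContinue) {
    Get-Service -DisplayName $required -ErrorAction SilentlyContinue |
        Select-ServiceGap |
        Format-Table DisplayName, Status, StartType -AutoSize
}
```

An empty result means every listed service that was found is running and set to Automatic; a display name that does not exist on the node is silently skipped, so compare the output against the list.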

Open the Group Policy Object Editor snap-in (gpedit.msc) to modify local GPO - Windows Firewall assumed running.

Step 2: Open Computer Configuration, open Administrative Templates, open Network, open Network Connections, open Windows Firewall, and then open first Domain Profile then Standard Profile and enable the following exception: “Allow Remote Administration Exception” and “Allow File and Printer Sharing Exception“.

This also assumes that upper-level GPO's have not removed Authenticated Users from "Access this computer from network" policy under Computer Security.

Now, please run clusterlog on both nodes this time:
Get-ClusterLog -TimeSpan 2880 -UseLocalTime -Destination C:\cslogs

Let me know.  I have to step away for about 1.5 hours so my next response will be delayed.
How did this turn out?  These should have cleared up the WMI and RPC communication errors and give us a good starting point.
I only had one node up at the time and the cluster service was disabled. The other two nodes were offline. I have since brought these nodes online as well, but I had to disable the cluster service on them as well so they would not lock up. When I brought the other two online, and then ran the command you gave me last night to run, it runs with no issues save for the logs themselves being empty. The error messages earlier were no doubt because the machines were not online. I do not get those errors anymore when running "Get-ClusterLog -TimeSpan 2 -UseLocalTime -Destination C:\cslogs" command. In fact, I can run that command on all three server nodes and I get the exact same output: three different logs (one for each server) and each of them is just the headers; no data. I have all three servers up running now, but the cluster service is disabled. If I re-enable the service, the server will lock-up. I have verified that the other services are running as requested. I do not have Windows Firewall enabled on these machines; I disabled that the other day when I was troubleshooting the issue.
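
A possible contributing factor to the header-only logs, in addition to the cluster service being stopped: per the Get-ClusterLog documentation, -TimeSpan is expressed in minutes, so the "-TimeSpan 2" quoted above covers only the last two minutes. The sketch below converts days to minutes and requests the log from a single node. `ConvertTo-LogMinutes` is a hypothetical helper name, and DBLADE1 is simply the first node named in the earlier output.

```powershell
# Convert a number of days to the minute count that -TimeSpan expects.
function ConvertTo-LogMinutes {
    param([Parameter(Mandatory)][int]$Days)
    $Days * 24 * 60
}

# Only run on a machine that has the FailoverClusters module.
if (Get-Command Get-ClusterLog -ErrorAction SilentlyContinue) {
    $minutes = ConvertTo-LogMinutes -Days 2
    Get-ClusterLog -Node DBLADE1 -TimeSpan $minutes -UseLocalTime -Destination C:\cslogs

    # A file of only a few hundred bytes is likely headers with no data.
    Get-ChildItem C:\cslogs | Select-Object Name, Length, LastWriteTime
}
```

If the log is still empty with a two-day window, that points back to the service never having run long enough to write anything.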
That is odd, because the original error you posted refers to two nodes but the second refers to three. Regardless, we still need to verify that the RPC and WMI services are functioning correctly before starting the cluster service, because this can cause the issue you described.  To troubleshoot properly, I would also need the firewall turned back on, since turning it off causes instability, and then verification that the following services are in fact running on all three nodes.

Remote Access Auto Connection Manager
Remote Access Connection Manager
Remote Procedure Call (RPC)
Remote Procedure Call (RPC) Locator
Windows Management Instrumentation

The communication problem can also be caused by any of the following (please answer yes or no for each):

1. Multiple physical or virtual networks configured on the servers, where you might have two NICs on the same subnet.
2. Any of those NICs configured for DHCP, where it could be registering in DNS dynamically.
3. Static routes in the route table (run "route print" from a CMD prompt to verify).

Bring these services online, including Windows Firewall with the exceptions I stated above. Turning the firewall off can cause the issue just as easily as turning it back on without those exceptions.

We are essentially starting from the beginning being I simply do not know the history of this environment from a build perspective or ongoing change management perspective.

Next, please perform a nslookup for all the hostnames (assumes the servernames have not changed since adding to domain).  
Does it come back with multiple IP addresses?
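
The nslookup check can be repeated for all three nodes in one pass. The sketch below assumes the Resolve-DnsName cmdlet is available on the node and uses the node names from the earlier output; `Measure-HostAddress` is a hypothetical helper name that groups A records by host and flags any name with more than one address.

```powershell
# Group A records by host name and flag names with more than one address.
function Measure-HostAddress {
    param([Parameter(ValueFromPipeline)]$Record)
    begin { $all = @() }
    process { $all += $Record }
    end {
        $all | Group-Object Name | ForEach-Object {
            $ips = @($_.Group.IPAddress | Sort-Object -Unique)
            [pscustomobject]@{
                Name      = $_.Name
                Addresses = $ips
                Multiple  = ($ips.Count -gt 1)
            }
        }
    }
}

# Only query when Resolve-DnsName is available (Windows DnsClient module).
if (Get-Command Resolve-DnsName -ErrorAction SilentlyContinue) {
    'DBLADE1', 'DBLADE2', 'DBLADE3' |
        ForEach-Object { Resolve-DnsName -Name $_ -Type A -DnsOnly -ErrorAction SilentlyContinue } |
        Where-Object { $_.Type -eq 'A' } |
        Measure-HostAddress
}
```

Any row with Multiple set to True is a name that DNS returns with more than one IPv4 address, which is the condition being asked about here.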

With the cluster service disabled, we are limited in what information we can get. Unless we can expand logging to the event log, we might simply have to start the service on one node so that something useful gets written to the event log.  This could very well be a DNS or quorum drive issue (see my earlier questions).

All these steps are leading up to something.  I think of it as Occam's razor approach, it works.  The cluster servers should really be in a dedicated OU as well to set these policies higher up, enforce down, then run "gpupdate /force".  Being this is not currently in "production" mode I'm comfortable suggesting these changes.

After the above is completed please try this command and see if it returns something like the example given: (have not tried this with service disabled but curious)

From Powershell:
(get-cluster).EnabledEventLogs

Sample Output:
Microsoft-Windows-Hyper-V-VmSwitch-Diagnostic,4,0xFFFFFFFD
Microsoft-Windows-SMBDirect/Debug,4
Microsoft-Windows-SMBServer/Analytic
Microsoft-Windows-Kernel-LiveDump/Analytic

At some point, we have to start the service on one node but not before the above suggestions are validated.

If we find anomalies with DNS returning multiple IP's we need to start there.

We also need to evaluate the quorum configuration, although you should have received errors when creating the cluster and from the validation wizard afterwards.  I'm assuming you're using the CSV to host the VMs.

 Let me know.
Also, if you agree, we can turn on debug logging and start the service on one of the servers.

On NODE-A
Verify that nslookup resolves the hostname of each of the three nodes to a single IP address in AD

Open Event Viewer (eventvwr.msc)
Click View then “Show Analytic and Debug Logs”
Browse down to Applications and Services Logs \ Microsoft \ Windows \ FailoverClustering-Client \ Diagnostic
Right-click on Diagnostic and select “Enable Log”
In services.msc, set the Cluster service on all three servers to "Manual"
Start the Cluster service on NODE-A
Right-click on Diagnostic and select “Disable Log” - debug tracing is written to the Diagnostic channel and is viewable only after you disable logging.
(If the server locks, reboot)
Left-click on Diagnostic to view the logging captured.

This will at least give something to go on, hopefully.  I've yet to see starting the service actually lock a server up completely.  It resembles a type of driver problem without the blue screen.  Something would have had to change after the initial configuration, assuming it was working correctly prior?

But, for this to have a chance to work I recommend the above steps being we cannot run the Cluster Validation Tool or Powershell command with the service started.  The service is not going to start unless all the services stated above are started prior.

The service might not start if the quorum drive is not configured correctly or the storage area network relative to virtual or physical network configuration is incorrect or... DNS issues.  

I've yet to see that lock a server up completely, however..... very odd but yet to come across something that cannot be resolved using methodical troubleshooting process.

Also, there is always a chance it will lock again while we try to capture the debug output, but at least the event log will be there after the reboot for examination. We need something to go on; otherwise it is a deep dive into the exact configuration of the servers, the quorum drive, and the storage (iSCSI, FC, NAS/SMB, Storage Spaces, and so forth).
Brian,

I had to leave but the manager is currently working on it.  I will send her your comments and thank you again!
Good deal.
She said she may end up calling MS.  If she does and it is resolved, I will let you know what they did.
Oh, also
FailoverClustering/Diagnostic and FailoverClustering/DiagnosticVerbose

Logs.  I believe those are enabled by default but good to double check and possibly there is additional information in those now worth posting or uploading.
Highly likely they will walk you through the exact steps but resolution is resolution.  I find this particular scenario very interesting being I have not seen cluster service hard lock a server much less three of them and with someone on the phone it can possibly expedite the process.

They are probably going to ask for the data under C:\ProgramData\Microsoft\Windows\WER\ReportArchive

Need to download this tool: https://www.microsoft.com/en-us/download/details.aspx?id=44226

It would be interesting to see if this provides any more details despite the fact the cluster service is hard locking the system.

Just need to verify those other logging diagnostics are turned on that I mentioned just prior.
I agree, I like your advice.  The thought behind MS is speed to resolution and accountability. You seem interested, so when it is resolved I will convey what they tell me.  Thanks again Brian.
I appreciate that, and my pleasure.  Plus, Microsoft can easily shadow the session while they're on the phone. Eventually it will get resolved; you just might need to request an escalation engineer because this one is somewhat challenging, IMO.
Brian,

She just got off the phone with Microsoft. It seems the culprit was our anti-virus software. The engineer she spoke with said that, in the cases he has encountered, a clustering issue affecting more than one node is often due to third-party software. The anti-virus software runs at the kernel level whereas the cluster service runs at the user level, so it appeared that a resource conflict was causing the service to hang and, as a result, the system to lock up. After she uninstalled the anti-virus, rebooted, and restarted the cluster service, she could immediately see the cluster volume, and all of the VMs swapped to that server and resumed normal operations.

This was a very strange issue. I wanted to share with you what we found that resolved the issue, and I again wanted to thank you for the assistance you offered during this time.
Wow, glad to hear it is resolved and hopefully this series of posts will help someone else in the future with similar issue.
Agreed, and the culprit specifically was ESET.