Windows Server 2016 cluster: nodes sometimes cannot connect to the cluster in Failover Cluster Manager

Hi,

I set up 3 x Windows Server 2016 Datacenter nodes and created a WSFC. Weeks later I found something strange: one of the nodes can no longer connect to the cluster, and the error message says that one of the dependent resources failed to start.

I restarted all 3 nodes, and then the failure to connect to the WSFC moved to other nodes, with the same error message.

Any idea on this?

Also, how do I check which dependent resource it is, and how do I start it?
marrowyungSenior Technical architecture (Data)Asked:

65tdRetiredCommented:
What is the quorum configuration: file share or dedicated disk?

Review the system event logs; failing that, use PowerShell commands to review the cluster log:
https://technet.microsoft.com/en-us/library/ee461045.aspx
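
As a minimal sketch of that log review, assuming an elevated Windows PowerShell session on one of the nodes and a hypothetical output folder `C:\Temp\ClusterLogs`, the cluster log can be generated and scanned for error lines like this:

```powershell
# Run in an elevated PowerShell session on one of the cluster nodes.
Import-Module FailoverClusters

# Hypothetical output folder; any local path works.
$dest = 'C:\Temp\ClusterLogs'
New-Item -ItemType Directory -Path $dest -Force | Out-Null

# Write one cluster log per node, covering the last 60 minutes, in local time.
Get-ClusterLog -Destination $dest -TimeSpan 60 -UseLocalTime

# Pull out error lines for a first look.
Select-String -Path (Join-Path $dest '*.log') -Pattern ' ERR ' |
    Select-Object -First 50
```

The 60-minute window is only an example; widen it if the failure happened earlier.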
0
Lawrence TsePrinciple ConsultantCommented:
Is the "disconnected node" always alternating between the same 2 of the 3 nodes, with the 3rd node never having a problem?  If so, it might be a communication problem between the two alternating nodes.  Please start your diagnostics from "partitioned cluster network".

Also, since you have 3 nodes, you should have "node majority" as your quorum configuration.
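
One way to double-check the current quorum configuration from PowerShell is sketched below, assuming the FailoverClusters module is available on the node you run it from:

```powershell
Import-Module FailoverClusters

# Show the current quorum configuration (quorum type and witness resource, if any).
Get-ClusterQuorum | Format-List *

# Show each node's state and vote. With node majority and 3 nodes,
# each node would normally hold one vote.
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight
```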
0
marrowyungSenior Technical architecture (Data)Author Commented:
Lawrence Tse,

"Also, since you have 3 node, you should have "node majority" as your quorum configuration."

anyway I can double check it ?

"Is that the "disconnected node" always jumping between 2 out of 3, with the 3rd node always have no problem?  I"

in random order.

65td,

"What is the quorum configuration, share, dedicated disk?"

I set it up for a SQL Server 2016 AOG configuration, so it should be a shared-nothing structure. Is there any way for me to double-check it?

I checked the cluster event log; from time to time it says:

The Cluster service failed to bring clustered role 'Cluster Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.



How can I check whether a resource is online or offline? How can I start it manually in Windows Server 2016?

Also, this is one of the errors I see often:

Cluster resource 'Cluster Name' of type 'Network Name' in clustered role 'Cluster Group' failed. The error code was '0x42c' ('The dependency service or group failed to start.').

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.



any idea? can't just check using PS. how to know why I can't bring it online?


how about this error, any hints and idea?

[System] 0000166c.000068b0::2017/12/01-16:14:38.284 ERR   Network Name resource 'Cluster Name' (with associated network name '<name>') has Kerberos Authentication support enabled. Failed to add required credentials to the LSA - the associated error code is '1068'.
[System] 000008c0.00003f74::2017/12/01-16:14:38.285 ERR   Cluster resource 'Cluster Name' of type 'Network Name' in clustered role 'Cluster Group' failed. The error code was '0x42c' ('The dependency service or group failed to start.').


0
Lawrence TsePrinciple ConsultantCommented:
The only dependency of the ‘Cluster Name’ network name resource is the ‘IP Address’ resource.  Can you check whether the ‘IP Address’ resource of the ‘Cluster Group’ can be brought online successfully?
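
A rough sketch of how to inspect this from PowerShell, assuming the default group name 'Cluster Group' and the resource name 'Cluster Name' that appear in the error messages earlier in the thread:

```powershell
Import-Module FailoverClusters

# List the core cluster resources and their current state.
Get-ClusterGroup -Name 'Cluster Group' | Get-ClusterResource |
    Format-Table Name, ResourceType, State, OwnerNode

# Show what the 'Cluster Name' resource depends on.
Get-ClusterResource -Name 'Cluster Name' | Get-ClusterResourceDependency
```

If the dependency expression comes back empty, that would match a dependency report that shows no dependencies.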
0
marrowyungSenior Technical architecture (Data)Author Commented:
How can I check the "IP Address" resource? Where do I find it, and how do I bring it online?
0
marrowyungSenior Technical architecture (Data)Author Commented:
When I try a failover in the WSFC manager I see this. Any idea what is missing?

[Screenshot: failover error]

I can't see why the dependency report of my WSFC seems to show that it depends on nothing, not even a network. Is that possible?

[Screenshot: dependency report]

I found that this might be the cause, but I can't make the change because I don't see a dependency!
0
marrowyungSenior Technical architecture (Data)Author Commented:
hi all,

Any comment?

After I create a WSFC using the wizard, it seems something else needs to be done before I can fail over? Do new resources need to be added?
0
Lawrence TsePrinciple ConsultantCommented:
Sorry, I needed to get to work to show you a screen capture.

In Failover Cluster Manager, select your cluster name in the left-hand pane; in the right-hand detail pane you will see several "collapsible" boxes.  Expand them all until you see "Cluster Core Resources", then expand your cluster name, and you will be able to see your cluster IP address.
[Screenshot: 1.png]
0
Lawrence TsePrinciple ConsultantCommented:
You can also use the following PowerShell command (need to run in Administrator context) to start a resource immediately.

[Screenshot: 1.png]
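
The command from the screenshot did not survive as text. A minimal sketch of starting a resource from an elevated session follows; the resource names 'Cluster IP Address' and 'Cluster Name' are assumptions based on the usual defaults, so list the real names with `Get-ClusterResource` first:

```powershell
Import-Module FailoverClusters

# Bring the IP address resource online first, then the network name that depends on it.
# Both resource names are assumptions; list the real ones with Get-ClusterResource.
Start-ClusterResource -Name 'Cluster IP Address'
Start-ClusterResource -Name 'Cluster Name'

# Confirm the resulting state.
Get-ClusterResource | Format-Table Name, State, OwnerGroup
```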
0
Lawrence TsePrinciple ConsultantCommented:
From your screen capture, I can see you are building a SQL Server cluster, but the icon shown looks like a Generic Application.  Was it installed by the SQL Server Installation Wizard, or did you just add the SQL Server service as a generic clustered application?
0
marrowyungSenior Technical architecture (Data)Author Commented:
"From your screen capture, I can see you are doing a SQL server cluster.  But the icon shown seems like a Generic Application?"

I am setting up a SQL Server 2016 AOG; that is all, I don't need more.

" Was it installed by SQL Server Installation Wizard, or you just add SQL server service as an generic clustered application?"
I just set up my WSFC and then the SQL 2016 AOG.

I followed this:

https://www.mssqltips.com/sqlservertip/2519/sql-server-alwayson-availability-groups--part-1-configuration/

Anything missed?
0
marrowyungSenior Technical architecture (Data)Author Commented:
hi,

I did try to bring the resource online:

[Screenshot: bring online failed]

The error is:

Cluster resource '<cluster name>' of type 'SQL Server Availability Group' in clustered role '<cluster name>' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.


0
Lawrence TsePrinciple ConsultantCommented:
You can always refer to Microsoft's official guide when one is available, though in very rare cases it can be slightly incorrect as well.

But the SQL AAG guide you linked above is quite accurate.

It now seems that your AAG listener name is not able to fail over; or, to narrow it down, when the AAG listener name fails over to another node, it cannot be brought online by that node.

There are a few possibilities:
1. The cluster virtual computer object does not have permission to bring up the SQL AAG listener name.  But if that were the case, you would not have been able to start the name on day one, no matter on which node.

2. The AAG listener name is unable to bind to a specific computer. That can be because the cluster IP is invalid for that specific node (for example, it lies on a different subnet), or because the computer cannot do DNS registration after binding the name to the cluster IP address.

You can check the following:

1. In AD, find the AAG listener name (this is a computer object). Make sure you have enabled "Advanced Features" in AD Users and Computers so that you can see the "Security" tab.  Make sure the cluster virtual name computer object (e.g. CLUSTER1) has "Full control" permission on the AAG listener name computer object; otherwise your cluster won't be able to bring up the AAG listener computer name resource in Failover Cluster Manager.

2. In the SQL Server AAG properties, make sure the IP address tied to the AAG listener name can contact AD and DNS and is reachable from all nodes within the same segment without routing.  That means the following will work:

- AD + DNS: AD1/192.168.1.254/24
- Node 1: SRV1/192.168.1.11/24
- Node 2: SRV2/192.168.1.12/24
- Node 3: SRV3/192.168.1.13/24
- Cluster: CLUSTER1/192.168.1.101/24
- AAG listener: AAG/192.168.1.102/24

The following will not work (by default, but can work after some advanced configuration):

- AD + DNS: AD1/192.168.1.254/25
- Node 1: SRV1/192.168.1.11/25
- Node 2: SRV2/192.168.1.129/25 (<- cannot failover to this node under this configuration)
- Node 3: SRV3/192.168.1.13/25
- Cluster: CLUSTER1/192.168.1.101/25
- AAG listener: AAG/192.168.1.102/25

3. Is it possible that some node in the cluster cannot contact AD or AD DNS?

Hope this helps.
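
The subnet and name-resolution checks above can be sketched as follows. Run it on each node; `CLUSTER1` and `AAG` are the example names from the lists above, not real hosts, so substitute your own:

```powershell
# List IPv4 addresses with their prefix length, to confirm every node sits
# in the same subnet as the cluster and listener IP addresses.
Get-NetIPAddress -AddressFamily IPv4 |
    Where-Object { $_.IPAddress -ne '127.0.0.1' } |
    Format-Table InterfaceAlias, IPAddress, PrefixLength

# Check that the cluster and listener names resolve from this node.
Resolve-DnsName -Name 'CLUSTER1' -ErrorAction SilentlyContinue
Resolve-DnsName -Name 'AAG' -ErrorAction SilentlyContinue
```

In the second address list above, 192.168.1.129/25 falls in a different /25 network from the other hosts, which is the kind of mismatch the first command makes visible.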
0
marrowyungSenior Technical architecture (Data)Author Commented:
"3. Is that some node in cluster cannot contact AD or AD-DNS?:"

This is a standalone cluster. Windows Server 2016 can create a workgroup cluster, so I didn't join a domain. Is that a problem?
0
Lawrence TsePrinciple ConsultantCommented:
SQL Server workloads are fully supported on an AD Detached Cluster, according to Microsoft.  In this case, please check:

1. The common local admin account is still active on all nodes
2. The computer name DNS suffix is set correctly on all nodes and matches the one in your DNS
3. The DNS records for the AAG listener and cluster name exist
4. There should be no cluster name or AAG listener name computer object in AD
5. You're using SQL Server authentication instead of Windows authentication
6. On each node of the cluster, make sure the DNS suffix in the TCP/IP settings is specified correctly
0
marrowyungSenior Technical architecture (Data)Author Commented:
"SQL server workload is totally supported on AD Detached Cluster as said by Microsoft."

It has been supported since Windows Server 2016 and SQL Server 2016. But I found that creating the SQL 2016 AOG works and it can fail over, but I can't create the AOG listener!
0
Lawrence TsePrinciple ConsultantCommented:
So did you create the DNS record for the listener?
0
marrowyungSenior Technical architecture (Data)Author Commented:
sir,

"1. The common local admin account is still active on all node
2. The computer name DNS suffix are all correctly set on all node, and matching that on your DNS
3. The DNS records for AAG listener and cluster name exist"

I use the local hosts file for that, as the SQL Servers are not joined to a domain.

"4. There should have no cluster name or AAG listener name computer object in AD
5. You're using SQL Sever authentication instead of Windows authentication
6. In each node of cluster make sure TCPIP setting DNS suffix is correctly specified"

I can ping them correctly! Why is 4 needed? If there is no cluster name or AAG name computer object in AD then it doesn't work, right?

We are using mixed authentication.
0
marrowyungSenior Technical architecture (Data)Author Commented:
any update for me?
0
65tdRetiredCommented:
Is the cluster not running?
To me, the cluster group configuration looks like it's missing the quorum.

You can also use cluster /? in an elevated command window to move groups or review the status of the cluster (very similar to the PowerShell commands, but built in).
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Is the cluster not running?"

It is running, as I created it. If there were an error, the cluster wizard would report it, right?

Then I created the SQL AOG on top of it! I see the error when I create the SQL listener, and it brings me to that message!

The latest SQL Server 2016 and 2017 offer the workgroup AOG; no AD is needed any more!
0
marrowyungSenior Technical architecture (Data)Author Commented:
I just want to set up a workgroup SQL 2016 AOG, which no longer requires AD! There are 3 nodes on which I formed the WSFC, with SQL installed on top of it. Please share what kind of quorum I should create, with step-by-step instructions if you can.
0
marrowyungSenior Technical architecture (Data)Author Commented:
Following this:

https://social.technet.microsoft.com/wiki/contents/articles/36143.sql-server-2016-step-by-step-creating-alwayson-availability-group.aspx

I used a file share as the quorum, ran the quorum configuration wizard, and got this:

Cluster Managed Voting
Enabled
Witness Type
File Share Witness
Witness Resource
\\10.xxx.xx.xxx\Quorum
Errors
 *  Could not grant the cluster access to the file share '\\10.xxx.xx.xxx\Quorum'.

There was an error granting the cluster access to the selected file share '\\10.xxx.xx.xxx\Quorum'.

Failed to grant permissions for the cluster 'SWxxxxSQLPoC' to access the share 'Quorum'.

An error occurred looking up the security ID of the cluster name object for 'SWxxxxSQLPoC'.

No mapping between account names and security IDs was done



What does this mean? I already gave Everyone full control on that folder.
0
65tdRetiredCommented:
Try adding the cluster computer account to the share and the NTFS permissions for the quorum.
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Try adding the cluster computer account to the share and the NTFS permissions for the quorum."

What is the cluster computer account? I see a CLIUSR account in the local Computer Management console and I gave it full control. It still fails! Should I add the CLIUSR accounts from the rest of the nodes?
0
Lawrence TsePrinciple ConsultantCommented:
Hi,

FYI: for a workgroup cluster, a file share witness is not supported.  You will need to use a disk witness (shared disk) or a cloud witness.
0
marrowyungSenior Technical architecture (Data)Author Commented:
hi,

"FYI.  For workgroup cluster, file share witness quorum is not supported.  You will need to use shared disk quorum or cloud witness
"

But I use 3 nodes! So must I use a file share?
0
65tdRetiredCommented:
Please review this document:
https://blogs.msdn.microsoft.com/clustering/2015/08/17/workgroup-and-multi-domain-clusters-in-windows-server-2016/

Note Quorum section as stated by TSE:
Quorum Configuration
The witness type recommended for Workgroup clusters and Multi-domain clusters is a Cloud Witness or Disk Witness.  File Share Witness (FSW) is not supported with a Workgroup or Multi-domain cluster.

You need to use the Node and Disk Majority quorum mode.
The Node and Disk Majority quorum mode is explained in the doc below:
https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/cc770830%28v%3dws.10%29
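
For reference, a hedged sketch of setting either supported witness type from PowerShell. The disk resource name 'Cluster Disk 1' and the storage account placeholders are assumptions for illustration; a disk witness needs a small shared disk that has already been added to the cluster:

```powershell
Import-Module FailoverClusters

# Option 1: disk witness. 'Cluster Disk 1' is a hypothetical name for a small
# shared disk that is already a cluster resource.
Set-ClusterQuorum -DiskWitness 'Cluster Disk 1'

# Option 2: cloud witness (Windows Server 2016 and later). The account name and
# key are placeholders for an Azure storage account you own.
# Set-ClusterQuorum -CloudWitness -AccountName '<storage-account>' -AccessKey '<access-key>'

# Verify the result.
Get-ClusterQuorum
```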
0
marrowyungSenior Technical architecture (Data)Author Commented:
hi,

by this link:

https://www.mssqltips.com/sqlservertip/4951/deploy-a-windows-server-2016-failover-cluster-without-active-directory-part-1/

How can I download DNS Manager to administer the DNS?

I think I need to add the record to DNS, but I am using the local hosts file for it.
0
marrowyungSenior Technical architecture (Data)Author Commented:
hi,

"Note Quorum section as stated by TSE:
Quorum Configuration
The witness type recommended for Workgroup clusters and Multi-domain clusters is a Cloud Witness or Disk Witness.  File Share Witness (FSW) is not supported with a Workgroup or Multi-domain cluster."

What if I have now created a domain and joined it? Is it still a disk witness if we use 3 nodes?

I found that when I destroy the WSFC and try to recreate it, it doesn't seem to work. Once I try to add a node in the WSFC wizard, the add-nodes page disappears. Have you seen anything like this before?
0
65tdRetiredCommented:
Do you have DNS admin rights?
RSAT is required to manage DNS; it also includes other tools, such as the AD tools.
RSAT install link:
https://support.microsoft.com/en-ca/help/2693643/remote-server-administration-tools-rsat-for-windows-operating-systems

Have you tried the Node Majority quorum mode?

I'm not sure I understand the adding-nodes issue. Is it that after the primary node is created, no more nodes can be added?

If so, on the secondary nodes that are still to be added, run as admin:  cluster node <computername> /forcecleanup
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Do you DNS admin rights?"

I created the AD and DNS using the local administrator account and formed a new AD, so that local administrator became the domain admin. So the answer is yes!

"Have you tried using the following quorum type: Node Majority mode?

I can't select this one, as it is not offered as an option. Any reason?

"Not sure I understand the adding nodes issue, is it after he primary node is created no more nodes can be added?"

it is before that!


When I create a cluster:

[Screenshot: creating a WSFC cluster using Failover Cluster Manager]

and I click the Browse button:

[Screenshot: nothing happened]

When I click Browse, the UI disappears!

from the log:

sqlservr (3552) An attempt to open the file "C:\Windows\system32\LogFiles\Sum\SystemIdentity.mdb" for read / write access failed with system error 5 (0x00000005): "Access is denied. ".  The open file operation will fail with error -1032 (0xfffffbf8).
0
marrowyungSenior Technical architecture (Data)Author Commented:
I forgot to mention: once I made one of the SQL Server nodes the AD server, destroyed the cluster and tried to recreate it, I saw this!

Now all 3 SQL Server nodes experience the same issue! I can run Failover Cluster Manager, but I can't even browse to add the first node to the cluster.
Is there a failover cluster installation program?

Is running "Destroy Cluster" and then recreating it the best way to recreate the cluster?
0
65tdRetiredCommented:
What version of SQL and edition of SQL is used?

Are the SQL databases backed up?

At this point it seems destroying the cluster would be the next step.
To review:
SQL will be running on a 3 node cluster.
OS version
Where is the AD on one of the 3 nodes?

Still want a non domain cluster?
0
marrowyungSenior Technical architecture (Data)Author Commented:
"What version of SQL and edition of SQL is used?"

SQL 2016. I tried SQL 2017 too and it also doesn't work! The difference is that with SQL 2017, once I create a test AD and join all nodes to it, SQL 2017 allows me to delete the AOG, but SQL 2016 does not.

"Are the SQL databases backed up?"

yes !

"At this point it seems destroying the cluster would be the next step."

As I said, I destroyed the cluster but I can't create it any more! It seems some nodes still see a cluster after I destroyed it. Is there any way to make sure the cluster is totally removed?

"
Still want a non domain cluster?"

no!

"SQL will be running on a 3 node cluster.
OS version
Where is the AD on one of the 3 nodes?
"
AD is on one of the nodes.

The OS is Windows Server 2016 Datacenter edition.

SQL is 2016, and I installed SQL 2017 on it too.
0
65tdRetiredCommented:
Is the SQL edition Standard or Enterprise?

From an elevated command prompt (run as admin), run:  cluster node <computername> /forcecleanup
where computername = server 1, 2 and 3.

The cluster name will be in DNS and AD; remove those entries.
Then install the cluster again, just the basic cluster.
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Is the SQL edition standard or enterprise?"

SQL Enterprise! But that doesn't seem related to WSFC creation, right?
0
marrowyungSenior Technical architecture (Data)Author Commented:
When I ran the command, it said:



"Get-Cluster : A positional parameter cannot be found that accepts argument '<server name>'.
At line:1 char:1
+ cluster node <server name> /forcecleanup
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [Get-Cluster], ParameterBindingException
    + FullyQualifiedErrorId : PositionalParameterNotFound,Microsoft.FailoverClusters.PowerShell.GetClusterCommand
'

why ?

I read similar thing: https://blogs.technet.microsoft.com/canitpro/2012/02/14/how-to-clean-up-cluster-nodes-from-destroyed-clusters/

It also doesn't work.

I also tried removing the Failover Clustering role/feature from the node and adding it back again; that doesn't work either. It seems that some PowerShell commands don't work because the Cluster service can't start, and the error from the log is:

The Cluster Service service terminated with the following service-specific error: 
The system cannot find the file specified.




How can I repair files on the server?
0
marrowyungSenior Technical architecture (Data)Author Commented:
Previously, these were the messages when I was creating the file share quorum:

1) [Screenshot 1]
2) [Screenshot 2]
3) I saw an error:

[Screenshot 3]

Any reason for it, and what should I do?
0
65tdRetiredCommented:
1. You are correct, MSSQL does not affect the cluster build.  I just wanted to make sure that it was not Standard edition, which only supports a two-node cluster.

2. The command-line cleanup tools are run in an elevated command window (cmd, Run as administrator).
    cluster.exe command: cluster node <server name> /forcecleanup
    PowerShell command: Clear-ClusterNode -Name node4 -Force
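
A note on the error quoted earlier in this thread: when PowerShell cannot find a command named `cluster`, it retries the name with `Get-` prepended, which is why the failure was reported by `Get-Cluster`. In other words, cluster.exe was not found in that session either. On Windows Server 2016, cluster.exe is part of the optional Failover Cluster Command Interface feature. A minimal cleanup sketch, assuming an elevated session and a placeholder node name `node2`:

```powershell
# Run in an elevated PowerShell session.
Import-Module FailoverClusters

# Remove leftover cluster configuration from the local node.
# Only do this on a node whose cluster has been destroyed or that was evicted.
Clear-ClusterNode -Force

# Or clean a remote node; 'node2' is a placeholder name.
# Clear-ClusterNode -Name 'node2' -Force

# Optional: install cluster.exe if the legacy command line is wanted.
# Install-WindowsFeature -Name RSAT-Clustering-CmdInterface
```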

Q1.  Is the file share on a different server, not on any of the cluster nodes?

Q2.  What share and file permissions are set for the cluster witness share?

See the link for a related issue, note the info on share and file permissions:

https://social.technet.microsoft.com/Forums/windowsserver/en-US/a3129f10-da98-4d23-a21c-e873ea208723/configuring-quorum-file-share-witness-access-denied?forum=winserverClustering
0
marrowyungSenior Technical architecture (Data)Author Commented:
"cluster.exe commands are : cluster node <server name> /forcecleanup "


In a CMD command prompt it says this command is not found, but PowerShell can run it.

"Q1.  is the file share on a different server not on any of the cluster nodes?"

It is on the first server on which I set up the cluster. I just want to make sure it works, as the exercise is to test SQL Server 2016 AOG, not to play around with the cluster.

"Q2.  What share and file permissions are set for the cluster witness share?"

everyone read/write
0
65tdRetiredCommented:
OK, if the cluster software is installed, then running cluster.exe /? on a node in an elevated command prompt should return a list of commands.

The file share cannot be on any of the nodes. Try another quorum type, such as:
 For an odd number of nodes, use Node Majority.
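
If the wizard does not offer this choice, the same setting can be applied from PowerShell. A minimal sketch, assuming an elevated session on a node of a running cluster:

```powershell
Import-Module FailoverClusters

# Node majority means no witness at all; it suits an odd number of nodes.
# On Windows Server 2016 the switch is -NoWitness (older releases call it -NodeMajority).
Set-ClusterQuorum -NoWitness

# Verify the change.
Get-ClusterQuorum
```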
0
marrowyungSenior Technical architecture (Data)Author Commented:
" Odd number of nodes  use Node Majority."

It is not allowed on my server, as you can see in the screenshot. I can't see why! The failover cluster wizard suggests a file share or cloud witness!
0
65tdRetiredCommented:
Try any quorum type just to see whether the install completes; you can change it later.

You could start with Node and Disk Majority, which is for an even number of nodes (but not a multi-site cluster).
0
marrowyungSenior Technical architecture (Data)Author Commented:
"could start with: Node and Disk Majority"

If the UI doesn't allow me to do this, can a PowerShell script do it?

How?
0
marrowyungSenior Technical architecture (Data)Author Commented:
hi,

" cluster.exe commands are : cluster node <server name> /forcecleanup "

It said the cluster command is not found.
"
    powershell commands are: Clear-ClusterNode -Name node4 -Force
"
This one seems to work on the node that was reported as joined to another domain.

When I try to add the node back to the cluster, this result message comes up when I validate the cluster:

* The servers do not all have the same domain role.
* The servers are not all in the same Organizational Unit (OU) in Active Directory. It is recommended that all nodes be in the same OU.
* Node node1 is reachable from Node node1 by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.
* Node node1 is reachable from Node node2 by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.
* The cluster is not configured with a quorum witness. As a best practice, configure a quorum witness to help achieve the highest availability of the cluster.
* The cluster network name <cluster name> does not have Create Computer Objects permissions on the Organizational Unit OU=Domain Controllers,DC=UAT,DC=LOCAL. This can result in issues during the creation of additional network names in this OU.




  this one:

" The servers are not all in the same Organizational Unit (OU) in Active Directory. It is recommended that all nodes be in the same OU.

Why is an OU necessary?

and this one:

"The cluster network name <cluster name> does not have Create Computer Objects permissions on the Organizational Unit OU=Domain Controllers,DC=UAT,DC=LOCAL. This can result in issues during the creation of additional network names in this OU.
"

How can I fix it?
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Q2.  What share and file permissions are set for the cluster witness share?"

This is it:

[Screenshot: share permission]
[Screenshot: permission]
[Screenshot: permission]

Do you see anything abnormal here?

"Odd number of nodes  use Node Majority."

"could start with: Node and Disk Majority for Even number of nodes (but not a multi-site cluster)"

I have 3 nodes and the cluster wizard only allows me to choose these as the quorum:

[Screenshot: when choosing quorum]
0
marrowyungSenior Technical architecture (Data)Author Commented:
I am reading this:

http://www.howtonetworking.com/server/cluster12.htm

"2. Assign the cluster computer name read/write permissions to the shared folder at both the Share level and NTFS level permissions. Note: you should find the cluster computer under the Computer objects of ADUC."

Is the cluster computer name just the virtual name we give the cluster when we create it?

And I really don't have that quorum configuration option at all; I only have the 3 choices I showed you. Any idea?

This link: https://social.technet.microsoft.com/wiki/contents/articles/36143.sql-server-2016-step-by-step-creating-alwayson-availability-group.aspx shows a quorum configuration page very close to my version! It is not the same as in the first link!

I am using Windows Server 2016.
0
65tdRetiredCommented:
cluster computer name is just the virtual name we give cluster when we create it?
 Correct

Will the cluster build continue if the first selection "Configure a Disk Witness"?
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Will the cluster build continue if the first selection "Configure a Disk Witness"?"

Hi, what does that mean?

Do you know why the cluster quorum wizard is so different? Is it because of Windows Server 2016?
0
65tdRetiredCommented:
Will the cluster install continue?

No I do not, progress I guess?
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Will the cluster install continue?"

When I go for it, it doesn't allow me to select a disk. What should I do? Define it first? How?

[Screenshot: disk witness]

When I create a disk:

[Screenshot: when creating]

I got this:

[Screenshot: error comes out]
0
65tdRetiredCommented:
Hi
Yes, a dedicated drive (or LUN) of 500 MB that the cluster can see needs to be configured; then it should be happy.
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Yes a dedicated drive (or LUN) of 500MB needs be configured that the cluster can see, then it should be happy.
"

but that one is for shared SAN storage, right?

I only have 3 x standalone VMs for it.
0
65tdRetiredCommented:
No, the quorum is its own disk on the SAN, like the large shared storage.
0
marrowyungSenior Technical architecture (Data)Author Commented:
You mean a dedicated disk on the SAN? We don't have that in this configuration.
0
65tdRetiredCommented:
Oh.
Do the cluster nodes have access to a shared disk?

You could make a two-node cluster and use the 3rd node as the file share.

[Attachment: 2-node-trad-cluster.odg]
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Does the cluster nodes  have access to a shared disk?
"

No, they are all standalone machines! The goal is just to set up a SQL 2016 AOG and test whether it works; we have already done the MySQL multi-master test.

"Could make a two node cluster and use the 3rd node as the file share"

Using a file share witness? Is that the only one that can work now?
0
marrowyungSenior Technical architecture (Data)Author Commented:
What is the .odg file about? I can't open it.
0
65tdRetiredCommented:
To set up an MSSQL 2016 AOG for testing:
Does each of the virtual nodes have additional drives?

For a two-node MSSQL 2016 AOG, one could install clustering and use a share on the 3rd VM.
No other LUNs would be required for the cluster; then MSSQL can be installed and tested.

[Image: node traditional cluster]
0

marrowyungSenior Technical architecture (Data)Author Commented:
"to setup a MSSQL 2016 AOG for testing:
Do each of the virtual nodes have addition drives?
"

I am not sure what additional drives means. All nodes have C:, E: and S:

"For a two node  MSSQL 2016 AOG, one could install clustering and use a share on the 3 vm.

you mean only use 2x nodes to form a WSFC and the 3rd VM nodes as the files share ?
0
65tdRetiredCommented:
you mean only use 2x nodes to form a WSFC and the 3rd VM nodes as the files share ?
       Correct
0
marrowyungSenior Technical architecture (Data)Author Commented:
I thought about that before, and I am asking for one more server as the AD; I will use that AD server to share out one folder as the file share witness.

The point is that I want to use all 3 servers for the SQL Server AOG, as that is the minimum recommended number of PoC machines.

I also want to test load balancing of read operations, so more than one secondary replica is needed, hence a total of 3.

So is using the AD server for the file share witness OK?
0
65tdRetiredCommented:
yes.
0
marrowyungSenior Technical architecture (Data)Author Commented:
OK. I will go in this direction: I will set up AD and set up the file share on it.
0
marrowyungSenior Technical architecture (Data)Author Commented:
Thanks all, I might come back later if I still have quorum and WSFC problems!

I have a direction already, as I also read a lot about WSFC this week! I really believe that I missed a quorum.
0
65tdRetiredCommented:
Good luck, you sound like you are on the right track.
0
marrowyungSenior Technical architecture (Data)Author Commented:
I can't understand this. Once I:

1) create the WSFC in workgroup mode and let SQL 2016 turn on AOG mode (it can change to AOG mode and fail over within SQL Server),
2) create AD,
3) make all nodes join the AD,
4) uninstall SQL Server 2016,
5) destroy the WSFC,
6) recreate the WSFC based on the new UAT AD,
7) install SQL 2016 and enable SQL 2016 AOG,

the result is that this time the SQL Server configuration UI says the quorum is missing! Funny! Why didn't it show that earlier?
0
65tdRetiredCommented:
Is the WSFC working correctly with the 3 nodes and file share quorum?
0
marrowyungSenior Technical architecture (Data)Author Commented:
I am waiting for the VMs to be rebuilt.
0
marrowyungSenior Technical architecture (Data)Author Commented:
Hi, I have now reinstalled the whole thing with a SEPARATE AD server and freshly reinstalled SQL nodes. I use a file share witness as the quorum, and the process of creating the quorum works now.

But I still see a red cross in Failover Cluster Manager:

[Screenshot: failover-manager-redcross.jpg]

I set the Remote Registry service to start, as I think I know why the UI died when I added a node: this service was stopped.

On the share permissions of the quorum share, I gave all the nodes' computer objects read/write. That should be fine.

But I still see a similar message, like group resource failed to come online, dependency failed.

For this exercise, I disabled the firewall on all nodes.

The cluster validation report says:

* An error occurred while executing the test.
Unable to connect to SWVD02DSQLPoC via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Failed to initialize the Configuration tests.

Unable to connect to SWVD02DSQLPoC via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Failed to initialize the Configuration tests.

Unable to connect to SWVD02DSQLPoC via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
There was an error initializing the network tests.

There was an error creating the server side agent (CPrepSrv).

Retrieving the COM class factory for remote component with CLSID {E1568352-586D-43E4-933F-8E6DC4DE317A} from machine SWVD02DSQLPoC.test.local failed due to the following error: 80070005 SWVD02DSQLPoC.test.local.
* An error occurred while executing the test.
An error occurred getting the cluster node state for 'SWVD02DSQLPoC.test.local'.

Access is denied
* An error occurred while executing the test.
There was an error initializing the network tests.

There was an error creating the server side agent (CPrepSrv).

Retrieving the COM class factory for remote component with CLSID {E1568352-586D-43E4-933F-8E6DC4DE317A} from machine SWVD02DSQLPoC.test.local failed due to the following error: 80070005 SWVD02DSQLPoC.test.local.
* An error occurred while executing the test.
Unable to connect to SWVD02DSQLPoC.test.local via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Failed to initialize the Configuration tests.

Unable to connect to SWVD02DSQLPoC.test.local via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Failed to initialize the Configuration tests.

Unable to connect to SWVD02DSQLPoC.test.local via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Failed to initialize the Configuration tests.

Unable to connect to SWVD02DSQLPoC.test.local via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Failed to initialize the Configuration tests.

Unable to connect to SWVD02DSQLPoC.test.local via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Failed to initialize the Configuration tests.

Unable to connect to SWVD02DSQLPoC.test.local via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Failed to initialize the Configuration tests.

Unable to connect to SWVD02DSQLPoC.test.local via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Failed to initialize the Configuration tests.

Unable to connect to SWVD02DSQLPoC.test.local via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
* An error occurred while executing the test.
Unable to connect to SWVD02DSQLPoC.test.local via WMI.  This may be due to networking issues or firewall configuration on SWVD02DSQLPoC.test.local.

The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)

0
65tdRetiredCommented:
Is the name SWVD02DSQLPoC.test.local the admin of the cluster or the virtual SQL cluster name?
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Is the name SWVD02DSQLPoC.test.local the admin of the cluster"

Sorry, I am confused. That one is a server name; how could I make this server name/host the admin of the cluster? The cluster has another name.

SWVD02DSQLPoC.test.local is just one of the nodes.

Now it has this error:

Cluster resource 'Cluster Name' of type 'Network Name' in clustered role 'Cluster Group' failed. The error code was '0x42c' ('The dependency service or group failed to start.').

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.


This is the dependency report:

cluster service dependency
From the cluster event log within the WSFC manager, I saw this:

Network Name resource 'Cluster Name' (with associated network name '<Cluster management connection string>') has Kerberos Authentication support enabled. Failed to add required credentials to the LSA - the associated error code is '1068'.


any hints ?
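
The error numbers quoted so far are easier to compare once they are decoded. The following is a minimal sketch in Python, assuming the standard HRESULT bit layout (failure bit 31, facility in bits 16-26, code in bits 0-15); the helper name `decode_hresult` is made up for illustration. It only does arithmetic on the numbers pasted in this thread and does not query Windows.

```python
# Sketch: decode the error numbers quoted in this thread.
# decode_hresult is a hypothetical helper, not part of any Windows tooling.

def decode_hresult(value: int) -> dict:
    """Split an HRESULT into failure flag, facility and 16-bit code."""
    return {
        "failure": bool((value >> 31) & 1),   # bit 31: severity
        "facility": (value >> 16) & 0x7FF,    # bits 16-26: facility (7 = Win32)
        "code": value & 0xFFFF,               # bits 0-15: underlying error code
    }

# Values copied from the validation report (messages as quoted in the report).
for hresult, message in [
    (0x800706BA, "The RPC server is unavailable."),
    (0x80070005, "Access is denied"),
]:
    fields = decode_hresult(hresult)
    print(hex(hresult), fields["facility"], fields["code"], message)

# The resource failure reports 0x42c and the LSA event reports 1068:
# they are the same number, written in hex and in decimal.
print(0x42C == 1068)
```

Read this way, the "Cluster Name" failure (0x42c) and the LSA credential failure (1068) point at the same "dependency service or group failed to start" condition, which suggests looking for a stopped or disabled Windows service rather than two separate faults.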
0
marrowyungSenior Technical architecture (Data)Author Commented:
OK, I think this link goes a long way towards explaining why the cluster resource shows a cross:

https://social.technet.microsoft.com/Forums/ie/en-US/f68817ad-3d9e-4052-b4d6-b1b7b064f39b/1068-error-when-bringing-cluster-name-resource-online?forum=winserverClustering

It seems there are 2 services needed by WSFC:

1) Network List Service
2) Network Location Awareness service

1) depends on 2), and 2) depends on the DHCP Client service.

Our DHCP Client service had been disabled for another reason!
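
The dependency chain described above can be sketched as a small Python model. This is only an illustration: the service names are plain labels taken from the post (the "network location" service is labelled here by what I believe is its Windows display name, Network Location Awareness), and the functions `start_order` and `blocked_by` are hypothetical helpers, not Windows APIs.

```python
# Sketch: model the service dependency chain from the post.
# Nothing here talks to Windows; the dict is an assumption based on the post.

DEPENDS_ON = {
    "Network List Service": ["Network Location Awareness"],
    "Network Location Awareness": ["DHCP Client"],
    "DHCP Client": [],
}

def start_order(service, depends_on, ordered=None):
    """Return the services that must be running, dependencies first."""
    if ordered is None:
        ordered = []
    for dependency in depends_on.get(service, []):
        start_order(dependency, depends_on, ordered)
    if service not in ordered:
        ordered.append(service)
    return ordered

def blocked_by(disabled, depends_on):
    """Return every service that cannot start while `disabled` is off."""
    return [
        service
        for service in depends_on
        if service != disabled and disabled in start_order(service, depends_on)
    ]

print(start_order("Network List Service", DEPENDS_ON))
print(blocked_by("DHCP Client", DEPENDS_ON))
```

With DHCP Client disabled, both services above it in the chain are blocked, which matches the "dependency service or group failed to start" error seen on the Cluster Name resource.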

It is just like the Remote Registry service, which caused the add-node UI in WSFC to drop out! I have figured out now why I couldn't add a node via the cluster creation wizard.

Once I could start 1) and 2), the WSFC manager now shows:

WSFC error
Much better, right? Does this mean it is all done and I can go ahead and set up SQL Server 2016 now?


And the WSFC validation report result is:


 * Node SWVD03DSQLPoC.test.local is reachable from Node SWVD02DSQLPoC.test.local by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.
 * Node SWVD04DSQLPoC.test.local is reachable from Node SWVD02DSQLPoC.test.local by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.
 * Node SWVD02DSQLPoC.test.local is reachable from Node SWVD03DSQLPoC.test.local by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.
 * Node SWVD04DSQLPoC.test.local is reachable from Node SWVD03DSQLPoC.test.local by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.
 * Node SWVD02DSQLPoC.test.local is reachable from Node SWVD04DSQLPoC.test.local by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.
 * Node SWVD03DSQLPoC.test.local is reachable from Node SWVD04DSQLPoC.test.local by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.
 * Unable to determine whether the Windows Firewall on node SWVD02DSQLPoC.test.local is configured to allow cluster network communication. The Windows Firewall service has been stopped or disabled. You may ignore this warning if the Windows Firewall will not be running during normal cluster operation.

There are no more endpoints available from the endpoint mapper. (Exception from HRESULT: 0x800706D9)

 * Unable to determine whether the Windows Firewall on node SWVD03DSQLPoC.test.local is configured to allow cluster network communication. The Windows Firewall service has been stopped or disabled. You may ignore this warning if the Windows Firewall will not be running during normal cluster operation.

There are no more endpoints available from the endpoint mapper. (Exception from HRESULT: 0x800706D9)

 * Unable to determine whether the Windows Firewall on node SWVD04DSQLPoC.test.local is configured to allow cluster network communication. The Windows Firewall service has been stopped or disabled. You may ignore this warning if the Windows Firewall will not be running during normal cluster operation.

There are no more endpoints available from the endpoint mapper. (Exception from HRESULT: 0x800706D9)

0
65tdRetiredCommented:
Looks way better.
Is the Windows firewall enabled?
0
marrowyungSenior Technical architecture (Data)Author Commented:
Hi,

It should be disabled, right? I did that for debugging purposes, as I don't want anything to block this test. Makes sense, right?

Should I ignore the warnings in the validation report? I am preparing to install SQL Server on each node today.

What does this mean: There are no more endpoints available from the endpoint mapper. (Exception from HRESULT: 0x800706D9)?

According to these:
1) https://serverfault.com/questions/568917/there-are-no-more-endpoints-available-from-the-endpoint-mapper.

2) https://johnyassa.com/2012/12/12/there-are-no-more-endpoints-available-from-the-endpoint-mapper-exception-from-hresult-0x800706d9/

this means the firewall should be enabled, and that's why you asked?


Thanks.
0
65tdRetiredCommented:
At least for testing, turn the Windows firewall off:
"Using Control Panel:
Open the Firewall control panel item (firewall.cpl)
Click "Turn Windows Firewall on or off"
Turn off the firewall for the current or all profiles"
Then rerun the validation and, hopefully, continue with the SQL install...
0
marrowyungSenior Technical architecture (Data)Author Commented:
"At least for testing, turn the Windows firewall off"

What I mean is, I already turned off the firewall before I created the cluster, as I saw the firewall message, even though it says it can't detect the firewall.

And the log I uploaded is the log from AFTER the firewall was turned off.
0
65tdRetiredCommented:
Is the firewall disabled or set to off?
Are 2 virtual network cards being used by each virtual machine, one public and one for the heartbeat or cluster network?
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Is the firewall disabled or set to off?"

I disabled it manually for now.

"Are 2 virtual network cards being used by each virtual machine, one public and one for the heartbeat or cluster network?"

Just one.
0
marrowyungSenior Technical architecture (Data)Author Commented:
And now I have found out something strange: WSFC shows errors in a different way.

Days later, all the warnings went away by themselves! Amazing, and I didn't do anything!!

I think as long as WSFC failover is OK, everything is OK.
0
65tdRetiredCommented:
Sounds good.
Have you failed the cluster group over to any of the other nodes?

Good luck with the rest of your testing.
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Have you failed the cluster group over to any of the other nodes?"

No problems: I failed over many times and everything is fine!

This can be very misleading.
0
65tdRetiredCommented:
Sounds like the cluster is ok then.
0
marrowyungSenior Technical architecture (Data)Author Commented:
WSFC is very misleading!

My AOG group is set up and running well. Now I will set up a distributed AOG group and a load-balanced read-only group.

Have you tried that before? Do they really work as expected?
0
marrowyungSenior Technical architecture (Data)Author Commented:
One thing: since quorum can't be lost, and the file share witness is a share somewhere else, what if that share ALSO dies? The cluster is gone too!

Can a Distributed File System (DFS) file share solve this problem?
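
As a rough way to reason about this worry, here is a minimal Python sketch of the plain majority rule, assuming one vote per node plus one vote for the file share witness. It deliberately ignores the dynamic quorum and dynamic witness features that recent Windows Server versions add, so treat it as a simplified model rather than a description of exactly how WSFC behaves; `has_quorum` is a hypothetical helper.

```python
# Sketch: static majority voting for a 3-node cluster with a file share witness.
# Assumption: every node and the witness carry exactly one vote, and vote
# counts are never adjusted at runtime (no dynamic quorum).

def has_quorum(total_votes: int, votes_online: int) -> bool:
    """Quorum holds when more than half of all configured votes are online."""
    return votes_online > total_votes // 2

NODES = 3
WITNESS = 1
TOTAL = NODES + WITNESS  # 4 votes configured

# Witness share is down, all 3 nodes are up: 3 of 4 votes.
print(has_quorum(TOTAL, 3))   # True

# One node is down, witness is up: 2 nodes + witness = 3 of 4 votes.
print(has_quorum(TOTAL, 3))   # True

# Witness share AND one node are down: 2 of 4 votes.
print(has_quorum(TOTAL, 2))   # False
```

Under this simplified rule, losing only the witness share does not by itself take down a 3-node cluster; quorum is lost only when a node fails as well while the witness is still unavailable.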
0
65tdRetiredCommented:
In a two-node cluster, yes. With 3 nodes, I'm not sure.
If this cluster is for testing, aren't you getting ahead of yourself?
One could try a DFS share; in theory it should work.
0
marrowyungSenior Technical architecture (Data)Author Commented:
"If this cluster is for testing, aren't you getting ahead of yourself?"

Yes! That is what I am doing.

"One could try a DFS share; in theory it should work."

The reason I ask is that it seems MS says failover clusters don't support DFS.

I thought every Windows admin would have this problem, right? What if the server hosting the file share dies? The quorum is gone!
0
65tdRetiredCommented:
I have never considered using DFS for a quorum share; if MS says it's not supported, then so be it.
One can set up a second, manual quorum failover share on another file server.

In general one is concerned with a share failure as well as a network switch failure, power outage, cooling system failure, server hardware failure and OS corruption, etc.
0
marrowyungSenior Technical architecture (Data)Author Commented:
Thanks.

"I have never considered using DFS for a quorum share"

I think very few people would do it.

"One can set up a second, manual quorum failover share on another file server."

In WSFC, can the quorum fail over? I thought only resources can fail over.

My concern is: if the quorum dies, WSFC dies, and the SQL AOG dies too.
0
65tdRetiredCommented:
Without testing the cluster and SQL, I am unsure of what would happen; I only have experience with two-node SQL clusters.
I recall losing the quorum and SQL staying up, but I'm not 100% sure anymore.
I guess your testing will produce the answers!
0
marrowyungSenior Technical architecture (Data)Author Commented:
How do we usually rebuild the quorum and the file share witness?
0
65tdRetiredCommented:
0
marrowyungSenior Technical architecture (Data)Author Commented:
Oh, it seems that the only problem is that the cluster can't start, and then we have to use the Quorum Configuration Wizard to temporarily change the mode from Node and File Share Majority to Node Majority, remove the old file share, and create the new one. Is that it?
0
65tdRetiredCommented:
Yes, that should be about it. In a 2-node cluster we used a second, manual quorum file share so that maintenance could be performed on the main file share.
0
marrowyungSenior Technical architecture (Data)Author Commented:
"In a 2-node cluster we used a second, manual quorum file share so that maintenance could be performed on the main file share."

I don't understand. Does this mean we define one more file share witness in WSFC as a standby?

Is there no DR file system or distributed file system for this kind of file share witness?
0
65tdRetiredCommented:
No, one cannot, but one can prepare a "standby" quorum share should you need it.
Not that I'm aware of.
Shares are usually stable.
0
marrowyungSenior Technical architecture (Data)Author Commented:
"Shares are usually stable."

Yes, but that is not enough if I want to plan for it.

"No, one cannot, but one can prepare a "standby" quorum share should you need it."

Does standby mean it is not defined in WSFC at all?

And we only define it when the WSFC quorum is dead?
0
65tdRetiredCommented:
If one needs to do maintenance on the file server hosting the quorum share, then the quorum can be moved to another share (once that share is set up with the right share and NTFS permissions).
The cluster will recreate the needed quorum files.
0
marrowyungSenior Technical architecture (Data)Author Commented:
"If one needs to do maintenance on the file server hosting the quorum share, t"

It is not about maintenance. It is about what happens if one day the file server dies for some reason (that only someone else knows), which kills the quorum, and then we find out that nobody backed it up either!

What can I do?

"then the quorum can be moved to another share (once that share is set up with the right share and NTFS permissions).
The cluster will recreate the needed quorum files."

How do I move the quorum?
0

Question has a verified solution.
