How to recover from FWS and DAG member failure in a 2-node DAG
November 21, 2017 at 12:28 pm
Hi Folks
Recently one of our customers was hit by malware, and most of their servers became unusable. The File Witness Server (FWS, hosted on a domain controller) and one of the Exchange nodes in their 2-node DAG environment became unstable.
After studying the impact, we decided to do the following:
- Remove the failed node from the DAG, rebuild it from scratch, and attach it to the DAG again.
- Change the FWS to another server.
Unfortunately, we could not proceed as planned, because the cluster service on the remaining node could not bring the cluster online. With both the other node and the FWS unavailable, the surviving node held only one of the three quorum votes. When I opened Failover Cluster Manager, I could not connect to the DAG cluster, as the node could not reach the quorum witness (in our case, the FWS). The command below confirmed this:
- cluster node
This shows the failed node as Down and the surviving DAG node in the Joining state.
To overcome the problem, restart the cluster service with quorum forced (the /fq switch). To do that, type the commands below on the surviving Exchange server:
net stop clussvc
net start clussvc /fq
Boom... Windows clustering on the remaining node returned to normal (you can verify it with the same command, cluster node), and I was able to connect to the DAG cluster in Failover Cluster Manager.
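For anyone who prefers PowerShell over net.exe and cluster.exe, the same forced-quorum restart and state check can be sketched with the FailoverClusters module. This is only a sketch, assuming the module is installed on the surviving DAG member and the session is elevated. To my knowledge, -FixQuorum is the older parameter name and newer Windows Server versions accept it as an alias of -ForceQuorum.

```powershell
# Sketch only: run in an elevated session on the surviving DAG member.
Import-Module FailoverClusters

# Stop the cluster service (same intent as: net stop clussvc).
Stop-Service -Name ClusSvc

# Start the local node with quorum forced (same intent as: net start clussvc /fq).
Start-ClusterNode -FixQuorum

# List node states; the surviving node should report Up and the failed node Down.
Get-ClusterNode | Format-Table Name, State -AutoSize
```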
With the cluster restored, I had to move the FWS to another server, so I ran the command below to set the new FWS (Source: https://practical365.com/exchange-server/recovering-a-failed-exchange-2016-database-availability-group-member/):
Set-DatabaseAvailabilityGroup -Identity "DAG-Name" -WitnessDirectory C:\FWS -WitnessServer "New Server Name"
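To confirm that the DAG picked up the new witness, a quick check along these lines may help. It is a minimal sketch, run from the Exchange Management Shell, and "DAG-Name" is a placeholder for your own DAG name.

```powershell
# Sketch only: "DAG-Name" is a placeholder for your DAG's name.
# -Status makes the cmdlet query live cluster data, so WitnessShareInUse is populated.
Get-DatabaseAvailabilityGroup -Identity "DAG-Name" -Status |
    Format-List Name, WitnessServer, WitnessDirectory, WitnessShareInUse
```

WitnessServer and WitnessDirectory should show the values you just set.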
Now we were able to proceed with the remaining steps:
- Remove the mailbox database copies from the failed server
- Move the active mailbox databases from the failed server to the surviving server
The commands I used are:
- Get-MailboxDatabaseCopyStatus -Server "Failed Exchange Server Name" | Remove-MailboxDatabaseCopy -Confirm:$false
- Move-ActiveMailboxDatabase "Mailbox Database Name" -ActivateOnServer "Exchange Server Name" -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BestEffort
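Before moving on, it is worth checking that the databases are mounted on the surviving server. The snippet below is a rough sketch for the Exchange Management Shell. The server name is a placeholder, and the columns shown are just the ones I would look at first.

```powershell
# Sketch only: replace the placeholder with the name of the surviving Exchange server.
$survivor = "Exchange Server Name"

# Copy status on the surviving server; active copies should report Mounted.
Get-MailboxDatabaseCopyStatus -Server $survivor |
    Format-Table Name, Status, CopyQueueLength, ContentIndexState -AutoSize

# Which server currently hosts each database, and whether it is mounted.
Get-MailboxDatabase -Status | Format-Table Name, Server, Mounted -AutoSize
```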
After that, you can proceed with the remaining steps below.
To remove the failed server from the DAG (the -ConfigurationOnly switch runs the command without trying to contact the failed server):
- Remove-DatabaseAvailabilityGroupServer -Identity "DAG Name" -MailboxServer "Failed Exchange Server Name" -ConfigurationOnly
Next, you need to remove (evict) the failed server from the cluster. To do that:
- Get-ClusterNode "Failed Exchange Server Name" | Remove-ClusterNode
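A simple way to confirm that the failed server is gone from both the DAG and the cluster could look like the sketch below. "DAG Name" is a placeholder, and it assumes the Exchange Management Shell with the FailoverClusters module available.

```powershell
# Sketch only: "DAG Name" is a placeholder for your DAG's name.
# The Servers list should no longer contain the failed Exchange server.
Get-DatabaseAvailabilityGroup -Identity "DAG Name" | Format-List Name, Servers

# The cluster should now list only the surviving node.
Get-ClusterNode | Format-Table Name, State -AutoSize
```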
Once you have completed all of these steps, the only thing left is to rejoin the failed Exchange server to the same DAG (refer to the article: https://practical365.com/exchange-server/recovering-a-failed-exchange-2016-database-availability-group-member/).
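As a rough outline of what the rejoin usually involves, the sketch below adds the rebuilt server back to the DAG and reseeds a database copy. It assumes Exchange is already reinstalled and healthy on the rebuilt server, and all the names are placeholders. Follow the referenced article for the full procedure.

```powershell
# Sketch only: all names below are placeholders.
$dag     = "DAG Name"
$rebuilt = "Rebuilt Exchange Server Name"

# Add the rebuilt server back to the DAG (this also joins it to the cluster).
Add-DatabaseAvailabilityGroupServer -Identity $dag -MailboxServer $rebuilt

# Create a passive copy of a database on the rebuilt server; repeat per database.
Add-MailboxDatabaseCopy -Identity "Mailbox Database Name" -MailboxServer $rebuilt

# Watch the new copies seed and become Healthy.
Get-MailboxDatabaseCopyStatus -Server $rebuilt
```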
Hope this will help someone in a similar situation.
Good Luck
Muralee
Entry filed under: Exchange and O365. Tags: dag, Exchange 2010, exchange 2013, exchange 2016, FWS, node failure, quorum, Windows Clustering Manager.