A split brain situation is caused by a temporary failure of the network link between the cluster nodes, resulting in both nodes switching to the active (primary) role while disconnected. This might cause new data (for example, log messages) to be created on both nodes without being replicated to the other node. Thus, it is likely in this situation that two diverging sets of data have been created, which cannot be trivially merged.
Caution: Hazard of data loss. In a split brain situation, valuable log messages might be available on both SSB nodes, so special care must be taken to avoid data loss.
The nodes of the SSB cluster automatically recognize the split brain situation once the connection between the nodes is re-established, and do not perform any data synchronization to prevent data loss. When a split brain situation is detected, it is visible on the SSB system monitor, in the system logs (Split-Brain detected, dropping connection!), on the Basic Settings > High Availability page, and SSB sends an alert as well.
NOTE: After the connection between the nodes has been restored, the split brain situation will persist. The core firmware will be active on one of the nodes, while it will not start on the other node.
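If you have shell access to one of the nodes, you can also confirm the split brain state by checking the DRBD status directly. The following check is only an illustration, and assumes that a boot or core shell is available on the node:
cat /proc/drbd
drbdadm cstate all
In a split brain situation, the connection state typically shows StandAlone or WFConnection instead of Connected.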
Once the network connection between the nodes has been re-established, one of the nodes will become the primary node, while the other one will be the secondary node. There are two ways to find out which node is the primary node:
Locally: Log in to each SSB locally, and wait for the console menu to come up. The console menu only appears on the primary node.
Remotely: Try connecting to each SSB using SSH. Only the primary node can be accessed directly via SSH. The secondary node cannot be accessed externally, only via SSH from the primary node.
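For example, assuming 10.0.0.1 and 10.0.0.2 are placeholder addresses of the two nodes, try:
ssh root@10.0.0.1
ssh root@10.0.0.2
The node that accepts the connection and presents the console menu is the primary node.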
To recover an SSB cluster from a split brain situation, complete the procedures described in Data recovery and HA state recovery.
In the procedure described here, data will be saved from the host currently acting as the secondary node. This is required because the data on this host will later be overwritten by the data available on the current primary node.
NOTE: During data recovery, there will be no service provided by SSB.
Log in to the primary node as root locally (or remotely using SSH) to access the console menu. If no menu appears after login, this is the secondary node. Try the other node.
Select Shells > Boot Shell.
Enter /usr/share/heartbeat/hb_standby. This will change the current secondary node to primary and the current primary node to secondary (HA failover).
Exit the console.
Wait a few seconds for the HA failover to complete.
Log in to the other host. If no console menu appears, the HA failover has not completed yet. Wait a few seconds and try logging in again.
Select Shells > Core Shell.
Issue the systemctl stop syslog-ng.service command to disable all traffic going through SSB.
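To confirm that the service has stopped before you start saving data, you can check its state (an optional, illustrative check):
systemctl is-active syslog-ng.service
The command is expected to report inactive.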
Save the files from /opt/ssb/var/logspace/ that you want to keep. Use scp or rsync to copy the data to your remote host (see the example after the tip below).
TIP: To find the files modified in the last n*24 hours, use find . -mtime -n. To find the files modified in the last n minutes, use find . -mmin -n.
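For example, to list the logspace files changed in the last two days and copy the directory to a remote backup host (the user name, host name, and destination path below are placeholders, adjust them to your environment):
find /opt/ssb/var/logspace/ -mtime -2
rsync -av /opt/ssb/var/logspace/ backup-user@backup-host:/backup/ssb-logspace/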
Exit the console.
Log in again, and select Shells > Boot Shell.
Enter /usr/share/heartbeat/hb_standby. This will change the current secondary node to primary and the current primary node to secondary (HA failover).
Exit the console.
Wait a few minutes for the failover to complete, after which the node you were using becomes the secondary node and the other node becomes the primary node.
The nodes are still in a split brain state, but you now have all the data from the secondary node backed up, and you can synchronize the data from the primary node to the secondary node, which will turn the HA state from "Split-brain" to "HA". For details on how to do that, see HA state recovery.
In the procedure described here, the "Split-brain" state will be changed to the "HA" state.
Caution: Keep in mind that the data on the current primary node will be copied to the current secondary node, and data that is available only on the secondary node will be lost (as that data will be overwritten).
NOTE: If you completed the procedure described in Data recovery, you do not have to swap the nodes. You can proceed to the steps about data synchronization.
If you want to swap the two nodes to make the primary node the secondary node and the secondary node the primary node, perform the following steps.
Log in to the primary node as root locally (or remotely using SSH) to access the console menu. If no menu appears after login, this is the secondary node. Try the other node.
Select Shells > Boot Shell.
Enter /usr/share/heartbeat/hb_standby. This will output:
Going standby [all]
Exit the console.
Wait a few minutes for the failover to complete, after which the node you were using becomes the secondary node and the other node becomes the primary node.
To initialize data synchronization, complete the following steps.
Log in to the secondary node as root locally (or remotely using SSH). If the console menu appears after login, this is the primary node. Try logging in to the other node.
Note that you are now in the boot shell, because only the boot shell is available on the secondary node.
Invalidate the DRBD. Issue the following commands:
drbdadm disconnect all
drbdadm invalidate all
drbdadm connect all
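Before rebooting, you can optionally verify that the invalidation took effect (an illustrative check, assuming drbdadm reports the standard DRBD states):
drbdadm dstate all
The disk state of the invalidated secondary node is expected to show Inconsistent.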
Reboot the secondary node.
Following this step, the DRBD connection state of the primary node will be StandAlone, while that of the secondary node will be WFConnection.
The console will display an Inconsistent (10) message. This is normal behavior, and it is safe to ignore this message.
Reboot the primary node. The SSB cluster will now be functional, accepting traffic as before.
After both nodes have rebooted, the cluster should be in the Degraded Sync state, with the primary node being SyncSource and the secondary node being SyncTarget. The primary node will start synchronizing its data to the secondary node. Depending on the amount of data, this can take a long time.
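To follow the synchronization progress from the shell of either node, you can check the DRBD status (an optional, illustrative check):
cat /proc/drbd
While the synchronization is running, the output includes a progress indicator and an estimated time to finish; when it completes, both nodes report the UpToDate disk state.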
Enable all incoming traffic on the primary node. Navigate to Basic Settings > System > Service control > Syslog traffic, indexing & search: and click Enable.
If the web interface is not accessible or unstable, complete the following steps on the active SSB:
Log in to SSB as root locally (or remotely using SSH) to access the console menu.
Select Shells > Core Shell, and issue the systemctl start syslog-ng.service command.
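To verify that the service started successfully, you can check its status (an optional, illustrative check):
systemctl status syslog-ng.service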
Issue the date command, and check the system date and time. If it is incorrect (for example, it displays 2000 January), replace the system battery. For details, see the hardware manual of the appliance.