A split brain situation is caused by a temporary failure of the network link between the cluster nodes, resulting in both nodes switching to the active (that is, primary node) role while disconnected. This might cause new data (for example, audit trails) to be created on both nodes without being replicated to the other node. Thus, it is likely in this situation that two diverging sets of data have been created, which cannot be trivially merged.
Caution:
Hazard of data loss! In a split brain situation, valuable audit trails might be available on both One Identity Safeguard for Privileged Sessions (SPS) nodes, so special care must be taken to avoid data loss.
Once the connection between the nodes is reestablished, the nodes of the SPS cluster automatically recognize the split brain situation and, to prevent data loss, do not perform any data synchronization. A detected split brain situation is shown on the SPS system monitor, in the system logs (Split-Brain detected, dropping connection!), and on the Basic Settings > High Availability page. SPS also sends an alert.
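If you have an exported copy of the system logs, a quick way to confirm the detection is to search for the log message quoted above. The following is a minimal sketch: the has_split_brain helper name is hypothetical, and the log file path is an assumption that you must supply yourself.

```shell
# Hypothetical helper: succeed if the given log file contains the
# split-brain message quoted in this section.
# The log file location is NOT specified by this guide; pass your own path.
has_split_brain() {
    log_file="$1"
    # -F treats the pattern as a fixed string, -q suppresses output.
    grep -qF 'Split-Brain detected, dropping connection!' "$log_file"
}
```

For example, has_split_brain ./exported-system.log && echo "split brain was logged" prints the message only when the line is present in that file.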
Once the network connection between the nodes has been re-established, one of the nodes will become the active (that is, primary) node, while the other one will be the backup (that is, secondary) node. This means that one node provides services as in normal operation, and the other one is kept passive (as a backup) to avoid network interference. Note that there is no synchronization between the nodes at this stage.
To recover an SPS cluster from a split brain situation, complete the following steps.
Caution:
Do NOT shut down the nodes.
Data recovery
In the procedure described here, data will be saved from the host currently acting as the secondary node. This is required because data on this host will later be overwritten by the data available on the current primary node.
NOTE: During data recovery, there will be no service provided by SPS.
To recover from a split brain situation
-
Log in to the primary node. If no Console menu is showing up after login, then this is the secondary node. In this case, try the other node.
-
Select Shells > Boot Shell.
-
Enter /usr/share/heartbeat/hb_standby. This will change the current secondary node to primary node and the current primary node to secondary node (HA failover).
-
Exit the console.
-
Wait a few seconds for the HA failover to complete.
-
Log in on the other host. If no Console menu is showing up, the HA failover has not completed yet. Wait a few seconds and try logging in again.
-
Select Shells > Core Shell.
-
Issue the systemctl stop zorp-core.service command to disable all traffic going through SPS.
-
Save the files from /var/lib/zorp/audit that you want to keep. Use scp or rsync to copy data to your remote host.
TIP: To find the files modified in the last n*24 hours, run find . -mtime -n.
To find the files modified in the last n minutes, run find . -mmin -n.
-
Enter:
pg_dump -U scb -f /root/database.sql
Back up the /root/database.sql file.
-
Exit the console.
-
Log in again, and select Shells > Boot Shell.
-
To change the current secondary node to the primary node, and the current primary node to the secondary node (HA failover), run /usr/share/heartbeat/hb_standby.
-
Exit the console.
-
Wait a few minutes to let the failover happen, so the node you were using will become the secondary node and the other node will become the primary node.
The nodes are still in a split-brain state, but you now have all the data backed up from the secondary node. You can therefore synchronize the data from the primary node to the secondary node, which will turn the HA state from "Split-brain" to "HA". For details on how to do that, see HA state recovery.
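The find tip in the backup step above can be wrapped in a small helper to list the audit files worth copying before they are overwritten. This is only a sketch: the recent_audit_files name is hypothetical, and the remote host and target directory in the commented example are placeholders, not values from this guide.

```shell
# Hypothetical helper: list regular files changed within the last N*24 hours.
# The directory defaults to the audit path named in the procedure.
recent_audit_files() {
    days="$1"
    audit_dir="${2:-/var/lib/zorp/audit}"
    find "$audit_dir" -type f -mtime "-$days"
}

# Example (commented out): copy files from the last 2 days to a remote host.
# "backup-host" and "/srv/sps-audit/" are placeholders for your own values.
# cd /var/lib/zorp/audit &&
#   find . -type f -mtime -2 -print0 |
#   rsync -a --from0 --files-from=- . backup-host:/srv/sps-audit/
```

Review the list that recent_audit_files prints before copying, so that you know exactly which audit trails exist only on this node.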
HA state recovery
In the procedure described here, the "Split-brain" state will be turned to the "HA" state. Keep in mind that the data on the current primary node will be copied to the current secondary node and data that is available only on the secondary node will be lost (as that data will be overwritten).
Swapping the nodes (optional)
NOTE: If you completed the procedure described in Data recovery, you do not have to swap the nodes. You can proceed to the steps about data synchronization.
To swap the roles of the two nodes, so that the primary node becomes the secondary node and the secondary node becomes the primary node
-
Log in to the primary node. If no Console menu is showing up after login, then this is the secondary node. In this case, try the other node.
-
Select Shells > Boot Shell.
-
Enter /usr/share/heartbeat/hb_standby. This will output:
Going standby [all]
-
Exit the console.
-
Wait a few minutes to let the failover happen, so the node you were using will become the secondary node and the other node will be the primary node.
Initializing data synchronization
To initialize data synchronization
-
Log in to the secondary node. If the Console menu is showing up, then this is the primary node. In this case, try logging in to the other node.
-
Enter the following commands. These commands will make the secondary node discard the data available only here, on this node.
drbdadm secondary r0
drbdadm connect --discard-my-data r0
-
Log out of the secondary node.
-
Log in to the primary node.
-
Select Shells > Boot Shell.
-
Enter:
drbdadm connect r0
-
Exit the console.
-
Check the High Availability state on the web interface of SPS, in the Basic Settings > High Availability > Status field. During synchronization, the status will say Degraded Sync, and after the synchronization completes, it will say HA.
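Because the --discard-my-data option is destructive, it can help to keep the per-node command sequences apart. The sketch below only prints the commands from the steps above for a given role and does not run them. The sync_commands helper name is hypothetical.

```shell
# Hypothetical checklist helper: print (do not execute) the DRBD commands
# that the procedure above lists for each node role.
sync_commands() {
    case "$1" in
        secondary)
            # Run on the node whose unique data will be discarded.
            echo "drbdadm secondary r0"
            echo "drbdadm connect --discard-my-data r0"
            ;;
        primary)
            # Run on the node whose data will be kept.
            echo "drbdadm connect r0"
            ;;
        *)
            echo "usage: sync_commands primary|secondary" >&2
            return 1
            ;;
    esac
}
```

Running sync_commands secondary lists the two commands for the secondary node, and sync_commands primary lists the single command for the primary node. Verify which node you are logged in to before entering any of them.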