Recovering from a Split-Brain situation (4303029)

Title

Recovering from a Split-Brain situation
Description

A split-brain situation is caused by a temporary failure of the network link between the cluster nodes, which results in both nodes switching to the active (master) role while disconnected. New data (for example, audit trails) might then be created on both nodes without being replicated to the other node. As a result, the two nodes are likely to hold diverging sets of data that cannot be trivially merged.

Caution:
Hazard of data loss! In a split-brain situation, valuable audit trails might be available on both SPS nodes, so special care must be taken to avoid data loss.

The nodes of the SPS cluster automatically recognize the split-brain situation once the connection between the nodes is re-established and, to prevent data loss, do not perform any data synchronization. A detected split-brain situation is shown on the SPS system monitor, in the system logs (Split-Brain detected, dropping connection!), and on the Basic Settings > High Availability page. SPS also sends an alert.
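
The log message quoted above can be searched for directly. The following is a minimal sketch, assuming the system log is available as a plain-text file. The helper name has_split_brain and the example log path are illustrative assumptions, not taken from the SPS documentation.

```shell
# Hypothetical helper: succeeds (exit status 0) if the given log file
# contains the split-brain message quoted in this article.
has_split_brain() {
    # -q: print nothing, -F: treat the message as a fixed string, not a regex
    grep -qF 'Split-Brain detected, dropping connection!' "$1"
}

# Example call (not run here); /var/log/messages is an assumed location.
# has_split_brain /var/log/messages && echo "split brain was logged"
```

Matching the message as a fixed string avoids any surprises from regex characters in the log text.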

Once the network connection between the nodes has been re-established, one of the nodes becomes the active (master) node, while the other becomes the passive (slave) node. The master node provides services as in normal operation, and the slave node is kept passive to avoid network interference. Note that there is no synchronization between the nodes at this stage.

To recover an SPS cluster from a split-brain situation, complete the following steps.

Caution:
Do NOT shut down the nodes.
Resolution
Data recovery

Purpose:

This procedure saves the data from the host currently acting as the slave host. This is required because the data on this host will later be overwritten by the data available on the current master.

NOTE:
SPS does not provide any service during data recovery.

Steps:
1. Log in to the master node. If no Console menu appears after login, you are on the slave node. Log in to the other node instead.
2. Select Shells > Boot Shell.
3. Enter /usr/share/heartbeat/hb_standby. This will change the current slave node to master and the current master node to slave (HA failover).
4. Exit the console.
5. Wait a few seconds for the HA failover to complete.
6. Log in to the other host. If no Console menu appears, the HA failover has not completed yet. Wait a few seconds and try logging in again.
7. Select Shells > Core Shell.
8. Issue the systemctl stop zorp-core.service command to disable all traffic going through SPS.
9. Save the files from /var/lib/zorp/audit that you want to keep. Use scp or rsync to copy data to your remote host.
  
  TIP:
  To find the files modified in the last n*24 hours, use find . -mtime -n.
  
  To find the files modified in the last n minutes, use find . -mmin -n.
10. Enter:
```
pg_dump -U scb -f /root/database.sql
```
  Back up the /root/database.sql file.
11. Exit the console.
12. Log in again, and select Shells > Boot Shell.
13. Enter /usr/share/heartbeat/hb_standby. This will change the current slave node to master and the current master node to slave (HA failover).
14. Exit the console.
15. Wait a few minutes to let the failover happen, so the node you were using will become the slave node and the other node will become the master node.
  
  The nodes are still in a split-brain state but now you have all the data backed up from the slave node, and you can synchronize the data from the master node to the slave node, which will turn the HA state from "Split-brain" to "HA". For details on how to do that, see HA state recovery.
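
The file selection from the TIP in step 9 can be scripted before copying. The following is a minimal sketch, assuming a find implementation that supports -mmin (as the TIP does). The helper name recent_files, the list file /root/recent.txt, and the destination backup@remote.example:/backup/ are hypothetical placeholders.

```shell
# Hypothetical helper: print the regular files under directory $1 that were
# modified within the last $2 minutes, one relative path per line.
recent_files() {
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$1" && find . -type f -mmin "-$2" )
}

# Example (not run here): copy only the audit trails of the last 24 hours.
# recent_files /var/lib/zorp/audit 1440 > /root/recent.txt
# rsync -a --files-from=/root/recent.txt /var/lib/zorp/audit/ backup@remote.example:/backup/
```

Printing relative paths lets the same list be passed to rsync with --files-from, which resolves each entry against the source directory.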
HA state recovery

Purpose:

This procedure turns the "Split-brain" state into the "HA" state. Keep in mind that the data on the current master node will be copied to the current slave node, so any data that exists only on the slave node will be overwritten and lost.

Steps: Swapping the nodes (optional):

NOTE:
If you completed the procedure described in Data recovery, you do not have to swap the nodes. You can proceed to the steps about data synchronization.

If you want to swap the two nodes to make the master node the slave node and the slave node the master node, perform the following steps:
1. Log in to the master node. If no Console menu appears after login, you are on the slave node. Log in to the other node instead.
2. Select Shells > Boot Shell.
3. Enter /usr/share/heartbeat/hb_standby. This will output:
```
Going standby [all]
```
4. Exit the console.
5. Wait a few minutes to let the failover happen, so the node you were using will become the slave node and the other node will be the master node.
Steps: Initializing data synchronization:

To initialize data synchronization, complete the following steps:
1. Log in to the slave node. If the Console menu appears, you are on the master node. Log in to the other node instead.
2. Enter the following commands. They make the slave node discard the data that exists only on this node.
```
drbdadm secondary r0
drbdadm connect --discard-my-data r0
```
3. Log out of the slave node.
4. Log in to the master node.
5. Select Shells > Boot Shell.
6. Enter:
```
drbdadm connect r0
```
7. Exit the console.
8. Check the High Availability state on the web interface of SPS, in the Basic Settings > High Availability > Status field. During synchronization, the status will say Degraded Sync, and after the synchronization completes, it will say HA.
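
Synchronization progress can also be checked from the shell. The following is a minimal sketch, assuming that the drbdadm cstate subcommand of DRBD 8.x is available on the node, which this article does not confirm. The helper name describe_cstate is hypothetical, and the state names are DRBD 8.x connection states, not the status strings shown on the SPS web interface.

```shell
# Hypothetical helper: translate a DRBD 8.x connection state into a short
# human-readable description.
describe_cstate() {
    case "$1" in
        StandAlone)            echo "disconnected: no replication" ;;
        WFConnection)          echo "waiting for the peer to connect" ;;
        SyncSource|SyncTarget) echo "synchronization in progress" ;;
        Connected)             echo "connected: no synchronization running" ;;
        *)                     echo "other state: $1" ;;
    esac
}

# Example (not run here), using the r0 resource named in the steps above:
# describe_cstate "$(drbdadm cstate r0)"
```

The web interface remains the authoritative place to confirm that the status has returned to HA.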

Product(s): One Identity Safeguard for Privileged Sessions
7.5, 7.0.5 LTS, 7.0.4 LTS, 7.0.3.1 LTS, 7.0.3 LTS, 7.0.2.1 LTS, 7.0.2 LTS, 7.0.1.1 LTS, 7.0.1 LTS, 7.0 LTS, 6.0.12, 6.0.11, 6.9.4, 6.9.3, 6.9.2, 6.7.0, 6.0.10, 6.0.9, 6.0.7, 6.0.6, 6.0.5, 6.0.4, 6.0.3, 6.0.2, 6.0.1, 6.0 LTS

Topic(s): How To

Article history: Created: 12/18/2018
Last updated: 7/30/2024
