The following sections help you to solve problems related to high availability clusters.
For a description of the possible statuses of the One Identity Safeguard for Privileged Sessions (SPS) cluster and its nodes, the DRBD data storage system, and the heartbeat interfaces (if configured), see Understanding One Identity Safeguard for Privileged Sessions (SPS) cluster statuses.
To recover a cluster that has broken down, see Recovering One Identity Safeguard for Privileged Sessions (SPS) if both nodes broke down.
To resolve a split brain situation in which the nodes of the cluster were simultaneously active for a time, see Recovering from a split brain situation.
To replace a broken node with a new appliance, see Replacing a HA node in a One Identity Safeguard for Privileged Sessions (SPS) cluster.
This section explains the possible statuses of the One Identity Safeguard for Privileged Sessions (SPS) cluster and its nodes, the DRBD data storage system, and the heartbeat interfaces (if configured). SPS displays this information on the Basic Settings > High Availability page.
The Status field indicates whether the SPS nodes recognize each other properly and whether they are configured to operate in high availability mode. The status of the individual SPS nodes is indicated in the Node HA state field of each node. The following statuses can occur:
Standalone: There is only one SPS unit running in standalone mode, or the units have not been converted to a cluster (the Node HA state of both nodes is standalone). Click Convert to Cluster to enable High Availability mode.
HA: The two SPS nodes are running in High Availability mode. Node HA state is HA on both nodes, and the Node HA UUID is the same on both nodes.
Half: High Availability mode is not configured properly: one node is in standalone mode, the other one is in HA mode. Connect to the node in HA mode, and click Join HA to enable High Availability mode.
Broken: The two SPS nodes are running in High Availability mode. Node HA state is HA on both nodes, but the Node HA UUID is different. For assistance, contact our Support Team.
Degraded: SPS was running in high availability mode, but one of the nodes has disappeared (for example, it has broken down or has been removed from the network). Power on, reconnect, or repair the missing node.
Degraded (Disk Failure): A hard disk of the secondary node is not functioning properly and must be replaced. To request a replacement hard disk and for details on replacing the hard disk, contact our Support Team.
Degraded Sync: Two SPS units were joined to High Availability mode, and the first-time synchronization of the disks is currently in progress. Wait for the synchronization to complete. Note that in case of large disks with lots of stored data, synchronizing the disks can take several hours.
Split brain: The two nodes lost the connection to each other, with the possibility of both nodes being active nodes (that is, primary nodes) for a time.
Caution:
Hazard of data loss. In this case, valuable audit trails might be available on both SPS nodes, so special care must be taken to avoid data loss. For details on solving this problem, see Recovering from a split brain situation. Do NOT reboot or shut down the nodes.
Invalidated: The data on one of the nodes is considered out-of-sync and should be updated with data from the other node. This state usually occurs during the recovery of a split-brain situation when the DRBD is manually invalidated.
Converted: The nodes have been converted to a cluster (by clicking Convert to Cluster) or High Availability mode has been enabled (by clicking Join HA), but the node(s) have not been rebooted yet.
NOTE:
If you experience problems because the nodes of the HA cluster do not find each other during system startup, navigate to Basic Settings > High Availability and select HA (Fix current). That way the IP address of the HA interfaces of the nodes will be fixed, which helps if the HA connection between the nodes is slow.
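The first four statuses above depend only on the Node HA state and Node HA UUID of the two nodes, so the rules can be sketched as a small shell function. This is an illustration only, not part of SPS: the function name `classify_cluster_status`, the argument order, and the lowercase `standalone`/`ha` state spellings are assumptions. The remaining statuses (Degraded, Split brain, and so on) depend on runtime information that is not modeled here.

```shell
#!/bin/sh
# Hypothetical helper (not an SPS tool): derive the cluster Status from the
# Node HA state and Node HA UUID of both nodes, following the list above.
# Usage: classify_cluster_status STATE_A UUID_A STATE_B UUID_B
# STATE_* is assumed to be either "standalone" or "ha".
classify_cluster_status() {
    state_a=$1; uuid_a=$2; state_b=$3; uuid_b=$4
    case "$state_a/$state_b" in
        standalone/standalone)
            echo "Standalone" ;;
        ha/ha)
            # Both nodes are in HA mode: the UUIDs decide between HA and Broken.
            if [ "$uuid_a" = "$uuid_b" ]; then
                echo "HA"
            else
                echo "Broken"
            fi ;;
        ha/standalone|standalone/ha)
            echo "Half" ;;
        *)
            # Anything else is outside the four rules modeled here.
            echo "Unknown" ;;
    esac
}
```

For example, `classify_cluster_status ha 1234 ha 5678` reports Broken, because both nodes are in HA mode but their UUIDs differ.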
The DRBD status field indicates whether the latest data (including SPS configuration,
The DRBD status field also indicates the connection between the disk system of the SPS nodes. The following statuses are possible:
Connected: Both nodes are functioning properly.
Connected (Disk Failure): A hard disk of the secondary node is not functioning properly and must be replaced. To request a replacement hard disk and for details on replacing the hard disk, contact our Support Team.
Invalidated: The data on one of the nodes is considered out-of-sync and should be updated with data from the other node. This state usually occurs during the recovery of a split-brain situation when the DRBD is manually invalidated.
Sync source or Sync target: One node (Sync target) is downloading data from the other node (Sync source).
When synchronizing data, the progress and the remaining time are displayed in the System monitor.
Caution:
When the two nodes are synchronizing data, do not reboot or shut down the primary node. If you absolutely must shut down the primary node during synchronization, shut down the secondary node first, and then the primary node.
Split brain: The two nodes lost the connection to each other, with the possibility of both nodes being active nodes (that is, primary nodes) for a time.
Caution:
Hazard of data loss. In this case, valuable audit trails might be available on both SPS nodes, so special care must be taken to avoid data loss. For details on solving this problem, see Recovering from a split brain situation.
WFConnection: One node is waiting for the other node; the connection between the nodes has not been established yet.
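As a quick reference, the DRBD statuses above can be mapped to the operator action that the descriptions suggest. The following is a minimal sketch: the function name `drbd_status_action` and the action labels it prints are made up for this illustration, and the status strings are assumed to be spelled exactly as in the list above.

```shell
#!/bin/sh
# Hypothetical helper (not an SPS tool): print a short action label for a
# DRBD status string as it appears in the list above.
drbd_status_action() {
    case "$1" in
        "Connected")
            echo "none" ;;
        "Connected (Disk Failure)")
            # A hard disk of the secondary node must be replaced.
            echo "contact-support" ;;
        "Invalidated")
            # The out-of-sync node must be updated from the other node.
            echo "resync-from-peer" ;;
        "Sync source"|"Sync target")
            # Synchronization is running: do not reboot the primary node.
            echo "wait-do-not-reboot-primary" ;;
        "Split brain")
            # Follow the split brain recovery procedure; do not reboot.
            echo "recover-split-brain-do-not-reboot" ;;
        "WFConnection")
            echo "wait-for-peer" ;;
        *)
            echo "unknown" ;;
    esac
}
```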
If a redundant heartbeat interface is configured, its status is displayed in the Redundant Heartbeat status field and in the HA > Redundant field of the System monitor. For a description of redundant heartbeat interfaces, see Redundant heartbeat interfaces.
The possible status messages are explained below.
NOT USED: There are no redundant heartbeat interfaces configured.
OK: Normal operation, every redundant heartbeat interface is working properly.
DEGRADED-WORKING: Two or more redundant heartbeat interfaces are configured, and at least one of them is functioning properly. This status is also displayed when a new redundant heartbeat interface has been configured, but the nodes of the SPS cluster have not been restarted yet.
DEGRADED: The connection between the redundant heartbeat interfaces has been lost. Investigate the problem to restore the connection.
INVALID: An error occurred with the redundant heartbeat interfaces. For assistance, contact our Support Team.
It can happen that both nodes break down simultaneously (for example because of a power failure), or the secondary node breaks down before the original primary node recovers.
NOTE:
As of One Identity Safeguard for Privileged Sessions (SPS) version 2.0.2, when both nodes of a cluster boot up in parallel, the node with the 1.2.4.1 HA IP address will become the primary node.
To properly recover SPS
Power off both nodes by pressing and releasing the power button.
Caution:
Hazard of data loss. If SPS does not shut off, press and hold the power button for approximately 4 seconds. This method terminates connections passing SPS and might result in data loss.
Power on the node that was the primary node before SPS broke down. Consult the system logs to find out which node was the primary node before the incident: when a node boots as primary node, or when a takeover occurs, SPS sends a log message identifying the primary node.
TIP:
Configure remote logging to send the log messages of SPS to a remote server where the messages are available even if the logs stored on SPS become inaccessible. For details on configuring remote logging, see System logging, SNMP and e-mail alerts.
Wait until this node finishes the boot process.
Power on the other node.
A split brain situation is caused by a temporary failure of the network link between the cluster nodes, resulting in both nodes switching to the active (that is, primary node) role while disconnected. This might cause new data (for example, audit trails) to be created on both nodes without being replicated to the other node. Thus, it is likely in this situation that two diverging sets of data have been created, which cannot be trivially merged.
Caution:
Hazard of data loss. In a split brain situation, valuable audit trails might be available on both One Identity Safeguard for Privileged Sessions (SPS) nodes, so special care must be taken to avoid data loss.
The nodes of the SPS cluster automatically recognize the split brain situation once the connection between the nodes is reestablished, and do not perform any data synchronization to prevent data loss. When a split brain situation is detected, it is visible on the SPS system monitor, in the system logs (Split-Brain detected, dropping connection!), on the Basic Settings > High Availability page, and SPS sends an alert as well.
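If the SPS log messages are also forwarded to a remote log server, the split brain message quoted above can be searched for there. The following is a minimal sketch: the function name `log_has_split_brain` is made up for this illustration, and the log file path in the usage example is an assumption that depends on how your log server stores the messages.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the given log file
# contains the split brain message that SPS writes to the system logs.
log_has_split_brain() {
    # -q: print nothing, only set the exit status
    # -F: match the message as a fixed string, not as a regular expression
    grep -qF 'Split-Brain detected, dropping connection!' "$1"
}

# Example usage (the path is an assumption):
#   log_has_split_brain /var/log/remote/sps.log && echo "split brain detected"
```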
Once the network connection between the nodes has been re-established, one of the nodes will become the active (that is, primary) node, while the other one will be the backup node (that is, the secondary node). This means that one node is providing services similar to normal operation, and the other one is kept passive (as a backup) to avoid network interference. Note that there is no synchronization between the nodes at this stage.
To recover an SPS cluster from a split brain situation, complete the following steps.
Caution:
Do NOT shut down the nodes.
In the procedure described here, data will be saved from the host currently acting as the secondary node. This is required because data on this host will later be overwritten by the data available on the current primary node.
NOTE:
During data recovery, there will be no service provided by SPS.
To recover from a split brain situation
Log in to the primary node. If no Console menu is showing up after login, then this is the secondary node. In this case, try the other node.
Select Shells > Boot Shell.
Enter /usr/share/heartbeat/hb_standby. This will change the current secondary node to primary node and the current primary node to secondary node (HA failover).
Exit the console.
Wait a few seconds for the HA failover to complete.
Log in on the other host. If no Console menu is showing up, the HA failover has not completed yet. Wait a few seconds and try logging in again.
Select Shells > Core Shell.
Issue the systemctl stop zorp-core.service command to disable all traffic going through SPS.
Save the files from /var/lib/zorp/audit that you want to keep. Use scp or rsync to copy data to your remote host.
TIP:
To find the files modified in the last n*24 hours, use find . -mtime -n. To find the files modified in the last n minutes, use find . -mmin -n.
Enter:
pg_dump -U scb -f /root/database.sql
Back up the /root/database.sql file.
Exit the console.
Log in again, and select Shells > Boot Shell.
Enter /usr/share/heartbeat/hb_standby. This will change the current secondary node to primary node and the current primary node to secondary node (HA failover).
Exit the console.
Wait a few minutes to let the failover happen, so the node you were using will become the secondary node and the other node will become the primary node.
The nodes are still in a split-brain state but now you have all the data backed up from the secondary node, and you can synchronize the data from the primary node to the secondary node, which will turn the HA state from "Split-brain" to "HA". For details on how to do that, see HA state recovery.
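The file selection step of the procedure above can be sketched as follows. This is an illustration under stated assumptions: the function name `recent_audit_files` is made up, and the remote user, host, and target directory in the usage example are placeholders that you must replace with your own values.

```shell
#!/bin/sh
# Hypothetical helper: list the regular files under DIR that were modified
# within the last MINUTES minutes, using the find -mmin test from the tip.
# Usage: recent_audit_files DIR MINUTES
recent_audit_files() {
    dir=$1; minutes=$2
    find "$dir" -type f -mmin "-$minutes"
}

# Example usage in the Core Shell (host and target path are placeholders):
#   recent_audit_files /var/lib/zorp/audit 120
#   rsync -a /var/lib/zorp/audit/ backup@backup.example.com:/srv/sps-audit/
```

Listing the files first lets you check what the split brain period produced before you copy anything to the remote host.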
In the procedure described here, the "Split-brain" state will be turned to the "HA" state. Keep in mind that the data on the current primary node will be copied to the current secondary node and data that is available only on the secondary node will be lost (as that data will be overwritten).
NOTE:
If you completed the procedure described in Data recovery, you do not have to swap the nodes. You can proceed to the steps about data synchronization.
If you want to swap the two nodes to make the primary node the secondary node and the secondary node the primary node, perform the following steps:
Log in to the primary node. If no Console menu is showing up after login, then this is the secondary node. In this case, try the other node.
Select Shells > Boot Shell.
Enter /usr/share/heartbeat/hb_standby. This will output:
Going standby [all]
Exit the console.
Wait a few minutes to let the failover happen, so the node you were using will become the secondary node and the other node will be the primary node.
To initialize data synchronization, complete the following steps:
Log in to the secondary node. If the Console menu is showing up, then this is the primary node. In this case, try logging in to the other node.
Enter the following commands. These commands will make the secondary node discard the data available only here, on this node.
drbdadm secondary r0
drbdadm connect --discard-my-data r0
Log out of the secondary node.
Log in to the primary node.
Select Shells > Boot Shell.
Enter:
drbdadm connect r0
Exit the console.
Check the High Availability state on the web interface of SPS, in the Basic Settings > High Availability > Status field. During synchronization, the status will say Degraded Sync, and after the synchronization completes, it will say HA.
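The synchronization steps above can be collected into two small shell functions, one per node. This is a sketch only: the function names and the `DRY_RUN` switch are assumptions added for this illustration, while the `drbdadm` commands are the ones listed in the procedure. With `DRY_RUN=1`, the commands are printed instead of executed, which is useful for reviewing them before running anything on a node.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the SECONDARY node first.
# Caution: this discards the data that is available only on this node.
resync_on_secondary() {
    run drbdadm secondary r0 &&
    run drbdadm connect --discard-my-data r0
}

# Run afterwards in the Boot Shell of the PRIMARY node.
resync_on_primary() {
    run drbdadm connect r0
}
```

For example, `DRY_RUN=1 resync_on_secondary` only prints the two commands that would run on the secondary node.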