One Identity Safeguard for Privileged Sessions 5.9.0 - Administration Guide

Troubleshooting an SPS cluster

The following sections help you to solve problems related to high availability clusters.

Understanding SPS cluster statuses

This section explains the possible statuses of the SPS cluster and its nodes, the DRBD data storage system, and the heartbeat interfaces (if configured). SPS displays this information on the Basic Settings > High Availability page.

The Status field indicates whether the SPS nodes recognize each other properly and whether they are configured to operate in high availability mode. The status of the individual SPS nodes is indicated in the Node HA state field of each node. The following statuses can occur:

  • Standalone: There is only one SPS unit running in standalone mode, or the units have not been converted to a cluster (the Node HA state of both nodes is standalone). Click Convert to Cluster to enable High Availability mode.

  • HA: The two SPS nodes are running in High Availability mode. Node HA state is HA on both nodes, and the Node HA UUID is the same on both nodes.

  • Half: High Availability mode is not configured properly: one node is in standalone mode, the other one is in HA mode. Connect to the node in HA mode, and click Join HA to enable High Availability mode.

  • Broken: The two SPS nodes are running in High Availability mode. Node HA state is HA on both nodes, but the Node HA UUID is different. Contact the One Identity Support Team for help. For contact details, see About us.

  • Degraded: SPS was running in high availability mode, but one of the nodes has disappeared (for example, it broke down or was removed from the network). Power on, reconnect, or repair the missing node.

  • Degraded (Disk Failure): A hard disk of the slave node is not functioning properly and must be replaced. To request a replacement hard disk and for details on replacing the hard disk, contact our Support Team.

  • Degraded Sync: Two SPS units were joined to High Availability mode, and the first-time synchronization of the disks is currently in progress. Wait for the synchronization to complete. Note that in case of large disks with lots of stored data, synchronizing the disks can take several hours.

  • Split brain: The two nodes lost the connection to each other, with the possibility of both nodes being active (master) for a time.

    Caution:

    Hazard of data loss! In this case, valuable audit trails might be available on both SPS nodes, so special care must be taken to avoid data loss. For details on solving this problem, see Recovering from a split brain situation.

    Do NOT reboot or shut down the nodes.

  • Invalidated: The data on one of the nodes is considered out-of-sync and should be updated with data from the other node. This state usually occurs during the recovery of a split-brain situation when the DRBD is manually invalidated.

  • Converted: The nodes have been converted to a cluster (by clicking Convert to Cluster) or High Availability mode has been enabled (by clicking Join HA), but the node(s) have not been rebooted yet.
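
The status list above can be condensed into a lookup for runbooks or monitoring scripts. The following is only a minimal sketch and is not part of SPS: the function name ha_status_action is hypothetical, and it assumes that you pass in the status text exactly as it is shown in the Status field. The action given for Converted is inferred from the description of that status.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SPS): maps the text of the
# Status field to the action described in the list above.
ha_status_action() {
    case "$1" in
        "Standalone")              echo "Click Convert to Cluster to enable High Availability mode" ;;
        "HA")                      echo "No action needed" ;;
        "Half")                    echo "Connect to the node in HA mode and click Join HA" ;;
        "Broken")                  echo "Contact the One Identity Support Team" ;;
        "Degraded")                echo "Power on, reconnect, or repair the missing node" ;;
        "Degraded (Disk Failure)") echo "Contact Support to replace the hard disk of the slave node" ;;
        "Degraded Sync")           echo "Wait for the first-time synchronization to complete" ;;
        "Split brain")             echo "Do NOT reboot or shut down the nodes; follow the split brain recovery procedure" ;;
        "Invalidated")             echo "Update the out-of-sync node with data from the other node" ;;
        "Converted")               echo "Reboot of the node(s) is still pending" ;;
        *)                         echo "Unknown status: $1"; return 1 ;;
    esac
}

# Example call with one of the documented statuses.
ha_status_action "Degraded Sync"
```

An unknown status makes the function return a non-zero exit code, so a calling script can treat unexpected values as an error instead of silently ignoring them.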

NOTE:

If you experience problems because the nodes of the HA cluster do not find each other during system startup, navigate to Basic Settings > High Availability and select HA (Fix current). That way the IP addresses of the HA interfaces of the nodes will be fixed, which helps if the HA connection between the nodes is slow.

The DRBD status field indicates whether the latest data (including SPS configuration, audit trails, log files, and so on) is available on both SPS nodes. The master node (this node) must always be in consistent status to prevent data loss. Inconsistent status means that the data on the node is not up-to-date, and should be synchronized from the node having the latest data.

The DRBD status field also indicates the connection between the disk system of the SPS nodes. The following statuses are possible:

  • Connected: Both nodes are functioning properly.

  • Connected (Disk Failure): A hard disk of the slave node is not functioning properly and must be replaced. To request a replacement hard disk and for details on replacing the hard disk, contact our Support Team.

  • Invalidated: The data on one of the nodes is considered out-of-sync and should be updated with data from the other node. This state usually occurs during the recovery of a split-brain situation when the DRBD is manually invalidated.

  • Sync source or Sync target: One node (Sync target) is downloading data from the other node (Sync source).

    When synchronizing data, the progress and the remaining time are displayed in the System monitor.

    Caution:

    When the two nodes are synchronizing data, do not reboot or shut down the master node. If you absolutely must shut down the master node during synchronization, shut down the slave node first, and then the master node.

  • Split brain: The two nodes lost the connection to each other, with the possibility of both nodes being active (master) for a time.

    Caution:

    Hazard of data loss! In this case, valuable audit trails might be available on both SPS nodes, so special care must be taken to avoid data loss. For details on solving this problem, see Recovering from a split brain situation.

  • WFConnection: One node is waiting for the other node; the connection between the nodes has not been established yet.
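
The cautions in the list above can be turned into a simple guard that is checked before a planned reboot of the master node. This is a minimal sketch under stated assumptions: the function name master_reboot_allowed is hypothetical, the DRBD status text is passed in as it is shown on the Basic Settings > High Availability page, and treating every status other than Connected as blocking is a deliberately conservative choice of this example, not a rule stated by SPS.

```shell
#!/bin/sh
# Hypothetical guard (not shipped with SPS): succeeds (exit code 0) only
# when the DRBD status is "Connected". Every other status is refused,
# which is a conservative assumption made for this sketch.
master_reboot_allowed() {
    case "$1" in
        "Connected")
            return 0 ;;
        "Sync source"|"Sync target")
            echo "Synchronization in progress: do not reboot or shut down the master node" >&2
            return 1 ;;
        "Split brain")
            echo "Split brain: do NOT reboot or shut down the nodes" >&2
            return 1 ;;
        *)
            echo "DRBD status '$1' needs attention before a reboot" >&2
            return 1 ;;
    esac
}

# Example: a reboot is refused while a node is the synchronization target.
if master_reboot_allowed "Sync target"; then
    echo "OK to reboot"
else
    echo "Reboot blocked"
fi
```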

If a redundant heartbeat interface is configured, its status is also displayed in the Redundant Heartbeat status field, and also in the HA > Redundant field of the System monitor. For a description of redundant heartbeat interfaces, see Redundant heartbeat interfaces.

The possible status messages are explained below.

  • NOT USED: There are no redundant heartbeat interfaces configured.

  • OK: Normal operation, every redundant heartbeat interface is working properly.

  • DEGRADED-WORKING: Two or more redundant heartbeat interfaces are configured, and at least one of them is functioning properly. This status is also displayed when a new redundant heartbeat interface has been configured, but the nodes of the SPS cluster have not been restarted yet.

  • DEGRADED: The connection between the redundant heartbeat interfaces has been lost. Investigate the problem to restore the connection.

  • INVALID: An error occurred with the redundant heartbeat interfaces. Contact the One Identity Support Team for help. For contact details, see About us.

Recovering SPS if both nodes broke down

Purpose:

It can happen that both nodes break down simultaneously (for example because of a power failure), or the slave node breaks down before the original master node recovers. To properly recover SPS, complete the following steps:

NOTE:

As of SPS version 2.0.2, when both nodes of a cluster boot up in parallel, the node with the 1.2.4.1 HA IP address will become the master node.

Steps:
  1. Power off both nodes by pressing and releasing the power button.

    Caution:

    Hazard of data loss! If SPS does not shut off, press and hold the power button for approximately 4 seconds. This method terminates connections passing SPS and might result in data loss.

  2. Power on the node that was the master before SPS broke down. Consult the system logs to find out which node was the master before the incident: when a node boots as master, or when a takeover occurs, SPS sends a log message identifying the master node.

    TIP:

    Configure remote logging to send the log messages of SPS to a remote server where the messages are available even if the logs stored on SPS become inaccessible. For details on configuring remote logging, see System logging, SNMP and e-mail alerts.

  3. Wait until this node finishes the boot process.

  4. Power on the other node.

Recovering from a split brain situation

A split brain situation is caused by a temporary failure of the network link between the cluster nodes, resulting in both nodes switching to the active (master) role while disconnected. This might cause new data (for example, audit trails) to be created on both nodes without being replicated to the other node. Thus, it is likely in this situation that two diverging sets of data have been created, which cannot be trivially merged.

Caution:

Hazard of data loss! In a split brain situation, valuable audit trails might be available on both SPS nodes, so special care must be taken to avoid data loss.

The nodes of the SPS cluster automatically recognize the split brain situation once the connection between the nodes is reestablished and, to prevent data loss, do not perform any data synchronization. A detected split brain situation is shown on the SPS system monitor, on the Basic Settings > High Availability page, and in the system logs (Split-Brain detected, dropping connection!). SPS also sends an alert.
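
If the SPS log messages are forwarded to a remote log server, the log message quoted above can be searched for with standard tools. The following is a minimal sketch: the function name has_split_brain is hypothetical, and the location of the log file depends entirely on your log server, so it is passed in as a parameter.

```shell
#!/bin/sh
# Succeeds if the given log file contains the split brain message quoted
# in the documentation. The path of the log file is an assumption: pass
# the file to which your log server writes the SPS messages.
has_split_brain() {
    grep -qF 'Split-Brain detected, dropping connection!' "$1"
}

# Example with a throwaway file. The layout of this log line is made up
# for the demonstration; only the quoted message text comes from the
# documentation.
log=$(mktemp)
printf '%s\n' 'example-host drbd: Split-Brain detected, dropping connection!' > "$log"
if has_split_brain "$log"; then
    echo "split brain reported"
fi
rm -f "$log"
```

The -F option makes grep treat the message as a fixed string, so the punctuation in the message is not interpreted as a regular expression.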

Once the network connection between the nodes has been re-established, one of the nodes will become the active (master) node, while the other one will be passive (the slave node). This means that one node is providing services similar to normal operation, and the other one is kept passive to avoid network interferences. Note that there is no synchronization between the nodes at this stage.

To recover an SPS cluster from a split brain situation, complete the following steps.

Caution:

Do NOT shut down the nodes.

Data recovery
Purpose:

In the procedure described here, data will be saved from the host currently acting as the slave host. This is required because data on this host will later be overwritten by the data available on the current master.

NOTE:

During data recovery, there will be no service provided by SPS.

Steps:
  1. Log in to the master node. If no Console menu is showing up after login, then this is the slave node. Try the other node.

  2. Select Shells > Boot Shell.

  3. Enter /usr/share/heartbeat/hb_standby. This will change the current slave node to master and the current master node to slave (HA failover).

  4. Exit the console.

  5. Wait a few seconds for the HA failover to complete.

  6. Log in on the other host. If no Console menu is showing up, the HA failover has not completed yet. Wait a few seconds and try logging in again.

  7. Select Shells > Core Shell.

  8. Issue the systemctl stop zorp-core.service command to disable all traffic going through SPS.

  9. Save the files from /var/lib/zorp/audit that you want to keep. Use scp or rsync to copy data to your remote host.

    TIP:

    To find the files modified in the last n*24 hours, use find . -mtime -n.

    To find the files modified in the last n minutes, use find . -mmin -n.

  10. Enter:

    pg_dump -U scb -f /root/database.sql

    Back up the /root/database.sql file.

  11. Exit the console.

  12. Log in again, and select Shells > Boot Shell.

  13. Enter /usr/share/heartbeat/hb_standby. This will change the current slave node to master and the current master node to slave (HA failover).

  14. Exit the console.

  15. Wait a few minutes to let the failover happen, so the node you were using will become the slave node and the other node will become the master node.

    The nodes are still in a split-brain state but now you have all the data backed up from the slave node, and you can synchronize the data from the master node to the slave node, which will turn the HA state from "Split-brain" to "HA". For details on how to do that, see HA state recovery.
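
The file selection described in the tip of the audit trail backup step can be wrapped in a small function. This is a minimal sketch, not an official SPS tool: the function name recent_files and the demonstration file names are hypothetical, and the example runs on a temporary directory instead of /var/lib/zorp/audit.

```shell
#!/bin/sh
# Lists the regular files under directory $1 that were modified in the
# last $2 * 24 hours, using the find -mtime test from the tip above.
# The paths are printed relative to the directory.
recent_files() {
    ( cd "$1" && find . -type f -mtime -"$2" )
}

# Example on a throwaway directory (the file names are made up).
demo=$(mktemp -d)
touch "$demo/new.audit"
touch -t 200001010000 "$demo/old.audit"   # pretend this file is from 2000
recent_files "$demo" 1                     # lists only ./new.audit
rm -rf "$demo"
```

On the node, the resulting list could be passed to rsync, for example recent_files /var/lib/zorp/audit 2 | rsync -a --files-from=- /var/lib/zorp/audit/ user@backup.example.com:/backup/audit/, where the user, host, and target path are placeholders for your own remote host.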

HA state recovery
Purpose:

In the procedure described here, the "Split-brain" state will be turned to the "HA" state. Keep in mind that the data on the current master node will be copied to the current slave node and data that is available only on the slave node will be lost (as that data will be overwritten).

Steps: Swapping the nodes (optional):

NOTE:

If you completed the procedure described in Data recovery, you do not have to swap the nodes. You can proceed to the steps about data synchronization.

If you want to swap the two nodes to make the master node the slave node and the slave node the master node, perform the following steps:

  1. Log in to the master node. If no Console menu is showing up after login, then this is the slave node. Try the other node.

  2. Select Shells > Boot Shell.

  3. Enter /usr/share/heartbeat/hb_standby. This will output:

    Going standby [all]

  4. Exit the console.

  5. Wait a few minutes to let the failover happen, so the node you were using will become the slave node and the other node will be the master node.

Steps: Initializing data synchronization:

To initialize data synchronization, complete the following steps:

  1. Log in to the slave node. If the Console menu is showing up, then this is the master node. Try logging in to the other node.

  2. Enter the following commands. These commands will make the slave node discard the data available only here, on this node.

    drbdadm secondary r0
    drbdadm connect --discard-my-data r0

  3. Log out of the slave node.

  4. Log in to the master node.

  5. Select Shells > Boot Shell.

  6. Enter:

    drbdadm connect r0

  7. Exit the console.

  8. Check the High Availability state on the web interface of SPS, in the Basic Settings > High Availability > Status field. During synchronization, the status will say Degraded Sync, and after the synchronization completes, it will say HA.
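
The web interface is the documented place to check the result. As a supplementary sketch only, the DRBD connection state can also be read from the shell on systems where DRBD 8.x exposes its state in /proc/drbd. Whether this file is available in the SPS shells is an assumption that this guide does not confirm. The function name drbd_cstate is hypothetical, and the sample status line below is illustrative.

```shell
#!/bin/sh
# Prints the connection state (the "cs:" field) of each DRBD resource
# listed in a DRBD 8.x style status file. Reads /proc/drbd by default;
# a different file can be passed as the first argument.
drbd_cstate() {
    sed -n 's/^ *[0-9][0-9]*: cs:\([A-Za-z]*\) .*/\1/p' "${1:-/proc/drbd}"
}

# Example with an illustrative status line written to a throwaway file.
sample=$(mktemp)
printf '%s\n' ' 0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----' > "$sample"
drbd_cstate "$sample"
rm -f "$sample"
```

In DRBD terms, SyncSource or SyncTarget indicates that synchronization is still running, and Connected indicates that it has completed. The Status field of the web interface remains the reference for the HA state of the cluster.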
