
One Identity Safeguard for Privileged Sessions 6.7.2 - Scalability and High Availability in Safeguard

Scalability in joint SPP and SPS deployments

For SPP-initiated workflows, SPS scalability clusters can be assigned to Managed Networks. The SPP cluster periodically checks the load on the members of the SPS cluster and assigns new connections to the best available appliance.
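The selection logic can be pictured with a short conceptual sketch (an illustration only, not the actual SPP implementation; the SpsAppliance fields and the load metric are assumptions):

    # Conceptual sketch of how an SPP cluster might pick an SPS appliance for a new
    # session. Illustration only, not the actual Safeguard implementation; the
    # SpsAppliance fields and the load metric are assumptions.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class SpsAppliance:
        name: str
        managed_network: str   # Managed Network this appliance is assigned to
        online: bool           # result of the periodic health check
        load: float            # 0.0 (idle) .. 1.0 (fully loaded)

    def pick_sps_for_session(appliances: List[SpsAppliance], network: str) -> Optional[SpsAppliance]:
        """Return the least-loaded online SPS appliance assigned to the target network."""
        candidates = [a for a in appliances if a.online and a.managed_network == network]
        if not candidates:
            return None  # no appliance available: the session cannot be brokered right now
        return min(candidates, key=lambda a: a.load)

A new connection request for a given Managed Network is simply routed to the appliance such a selection returns.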

For SPS-initiated workflows, SPS appliances always target the primary appliance of the SPP cluster, but these queries do not usually require scaling out to multiple SPP appliances.

Disaster Scenarios

The following sections describe how disaster scenarios are handled in the Safeguard product line.

Disaster Scenarios in One Identity Safeguard for Privileged Passwords (SPP)

Failure of a replica node

The cluster recognizes that the node is out of circulation and automatically redirects traffic. All vital data is replicated between the nodes, so no data is lost.

It is recommended that managed networks contain more than one appliance; if they do, the remaining nodes take over the tasks of the failed node. Managed networks can also be reconfigured or disabled to provide continuity of service.

Failure of the primary node

Configuration changes are not possible, but normal operation continues. The primary role can be manually moved to any of the replica nodes.

Failure of more than half of the cluster

The cluster switches into a read-only mode in which configuration changes are not possible and password check and change tasks are paused. The Offline Workflow Mode setting can be used to manually or automatically restore the access request workflow. If the majority of the appliances have failed completely, a cluster reset operation can be used to promote a new primary without consensus. A backup restore is only necessary if all of the appliances in the cluster are lost.
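The quorum rule behind this behavior can be pictured with a minimal sketch (a conceptual model, not SPP code; the offline_workflow_enabled flag is an assumption standing in for the Offline Workflow Mode setting):

    # Conceptual model of the quorum rule described above. Not the actual SPP logic;
    # the offline_workflow_enabled flag is an assumption standing in for the
    # Offline Workflow Mode setting.
    def has_quorum(online_nodes: int, total_nodes: int) -> bool:
        """A cluster has consensus only while more than half of its members are reachable."""
        return online_nodes > total_nodes // 2

    def cluster_mode(online_nodes: int, total_nodes: int, offline_workflow_enabled: bool) -> str:
        if has_quorum(online_nodes, total_nodes):
            return "normal"            # configuration changes and password tasks allowed
        if offline_workflow_enabled:
            return "offline-workflow"  # access request workflow restored without consensus
        return "read-only"             # configuration changes and password check/change paused

For example, in a five-node cluster where only two members remain reachable and Offline Workflow Mode is disabled, the sketch returns "read-only", matching the behavior described above.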

Losing connectivity between appliances

The part of the cluster that becomes isolated and sees less than half of the original cluster switches into read-only mode, whereas the larger part remains active. Whether the isolated nodes continue serving access requests is configurable with the Offline Workflow Mode setting. When the connection is re-established, the appliance state is automatically synchronized.

Disaster Scenarios in One Identity Safeguard for Privileged Sessions (SPS)

Failure of a node in an HA pair

If the failed node was the master, the hot-spare automatically takes over the IP address and all traffic. Ongoing connections are disconnected. No data is lost, as everything is replicated within the HA pair. After the failed node is replaced, a resynchronization is required, which might take up to 24 hours. In all failure scenarios below, if the failed node has an HA pair, the pair takes over all functionality automatically and the same recovery steps apply as listed above.

Failure of a managed node (non-master appliance) in the scalability cluster

If the failed node did not have an HA pair, traffic going through that node stops. It is up to the network configuration to handle the outage and redirect traffic to another appliance in the cluster. For SPP-initiated workflows, SPP tries to redirect the traffic towards a different SPS once the SPS configuration master becomes aware of the outage. If central search was enabled, searches remain possible, but video-like playback of sessions will not be available.

Failure of the configuration master in a scalability cluster

If the failed node did not have an HA pair, it is not possible to make any configuration changes in the cluster, but the functioning nodes keep serving connections. The configuration master role cannot be moved to a different appliance; the failed node must be restored from a backup.

Failure of the search master in a scalability cluster

If the failed node did not have an HA pair, it is not possible to search in audit information, but all other functionality remains unaffected. The other nodes buffer audit information until the search master node becomes available again. They can survive approximately 24 hours of downtime when operating at full capacity; after that, they stop accepting new connections.
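The buffering behavior can be pictured roughly as follows (a simplified sketch; the 24-hour window constant and all class and method names are assumptions, not SPS internals):

    # Rough sketch of the buffering behavior described above. Purely illustrative;
    # the 24-hour window and all names here are assumptions, not SPS internals.
    import time
    from collections import deque

    BUFFER_WINDOW_SECONDS = 24 * 60 * 60  # roughly how long a node can buffer at full load

    class AuditBuffer:
        def __init__(self) -> None:
            self.records: deque = deque()  # (timestamp, record) pairs awaiting indexing

        def add(self, record: bytes) -> None:
            self.records.append((time.time(), record))

        def accepting_new_connections(self) -> bool:
            """Refuse new sessions once the oldest unindexed record is about 24 hours old."""
            if not self.records:
                return True
            oldest_ts, _ = self.records[0]
            return (time.time() - oldest_ts) < BUFFER_WINDOW_SECONDS

        def flush_to_search_master(self, send) -> None:
            """Drain the buffer once the search master is reachable again."""
            while self.records:
                _, record = self.records.popleft()
                send(record)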

Losing connectivity between HA pairs

Both appliances check whether they can see the outside network; if they can (and it is only the other node that they have lost), both of them assume that they need to operate as master nodes. This leads to a split-brain situation that causes a service outage and must be recovered manually. It is highly recommended to configure redundant HA links between the nodes, as illustrated in the sketch below.

For more information, see "Redundant heartbeat interfaces" in the Administration Guide.
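The decision each node makes can be sketched as follows (a conceptual model of the behavior described above; the function names are assumptions, not the SPS heartbeat implementation):

    # Conceptual model of the takeover decision described above. Illustrative only;
    # the function names are assumptions, not the SPS heartbeat implementation.
    from typing import List

    def peer_reachable(ha_link_status: List[bool]) -> bool:
        """The peer counts as alive if a heartbeat arrives on at least one HA link."""
        return any(ha_link_status)

    def should_become_master(ha_link_status: List[bool], outside_network_reachable: bool) -> bool:
        # A node promotes itself when it has lost its peer on every heartbeat link but
        # still sees the outside network. With a single HA link, one broken cable makes
        # both nodes take this branch at the same time, producing the split-brain
        # situation described above; redundant heartbeat interfaces make a false
        # "peer lost" verdict far less likely.
        return not peer_reachable(ha_link_status) and outside_network_reachable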

Losing connectivity between nodes in a scalability cluster

Some functionality (such as making configuration changes or searching in new audit information) is lost until the outage is resolved, but the individual nodes continue serving new connections.
