Maintain and diagnose cluster members from Cluster Management:
- web client: Navigate to Cluster > Cluster Management.
When a node is selected in the Cluster view, the right side of the pane displays details about the selected appliance. From this pane, you can run the following maintenance and diagnostic tasks against the selected appliance.
To fix more serious issues with a cluster, you can perform additional operations depending on the state of the cluster members. Some such operations include:
To ensure password and SSH key consistency and individual accountability for privileged accounts, when an appliance loses consensus in the cluster, access requests are disabled. In the event of an extended network partition, the Appliance Administrator can either automatically or manually place an appliance in Offline Workflow Mode to run access request workflow on that appliance in isolation from the rest of the cluster. When the network issues are resolved and connectivity is reestablished, the Appliance Administrator can either automatically or manually resume online operations to merge audit logs, drop any in-flight access requests, and return the appliance to full participation in the cluster.
Offline workflow considerations
- In Offline Workflow Mode, an appliance functions apart from the other members of the cluster. Users can request passwords and sessions.
- Settings for Offline Workflow are set on an individual appliance.
- Suspend/Restore account does not work in Offline Workflow Mode.
Passwords and SSH keys in Offline Workflow Mode
- In Offline Workflow Mode, the appliance is enabled to request, approve, and release passwords, SSH keys, and sessions without a quorum, using cached policy data.
- In Offline Workflow Mode, when policy requires a change after check-in, the requirement is bypassed to allow for subsequent check-out. In this case, an Access Request Password or SSH Key Reset By-passed Event is generated, stating: An access request subsequent check out is available as password [or SSH key] reset was by-passed.
- Password and SSH key changes are rescheduled and may complete once network connectivity is restored, even while the appliance is still in Offline Workflow Mode.
- Users may still request a password or SSH key from the primary or another replica in the part of the cluster that has consensus; password and SSH key check and change work as usual. As a result, passwords or SSH keys may get out of sync on the appliance running Offline Workflow Mode. This is expected behavior, and the password and SSH key remain out of sync until the partition is healed.
- During a network partition in which one or more appliances are in Offline Workflow Mode, two individuals can have the same password and SSH key at the same time. In that case, actions cannot be tied back to a single responsible individual. It is still possible to identify each person who had access to the password and SSH key at the time.
Policies in Offline Workflow Mode
- Policy is enforced as it existed at the time the appliance, now in Offline Workflow Mode, lost network connectivity to the rest of the cluster.
- Policy requiring a password and SSH key change after check-in is bypassed, and subsequent check-out from the appliance in Offline Workflow Mode is allowed.
- Policy is read-only. Therefore, update and delete configuration operations are not allowed on the appliance in Offline Workflow Mode.
- Policy changes are only allowed if directed at an online primary within the cluster. Policy changes on the online primary do not affect the appliance in Offline Workflow Mode. Once the offline workflow appliance has resumed online operations, the policy changes are distributed to it.
Workflow in Offline Workflow Mode
User experience: Enable Offline Workflow Mode
Users who are requesting a password and SSH key in Safeguard are returned to the Home page. Password and SSH key requests made prior to the switch to Offline Workflow Mode are not displayed.
- When the switch to Offline Workflow Mode starts, this message displays: Safeguard is switching to Offline Workflow Mode. Please wait until this process is complete before proceeding with any current work. The bottom of the Home page displays this information: (Switching to Offline Workflow Mode...) and Disconnected. If the user clicks Refresh, the banner is replaced with: The service is unavailable.
- When the switch to Offline Workflow Mode is complete, a banner with this information is displayed: Safeguard is currently in Offline Workflow Mode. Previous access requests are temporarily unavailable. You may submit new requests to continue working in Offline Workflow Mode. The bottom of the Home page displays these messages: (Offline Workflow Mode) and the connection status: Connecting then Connected.
Administrators can view the workflow status on the Cluster View pane where a message like this displays: Offline Workflow Enabled (This appliance is running access workflow in isolation from the cluster.) For more information, see Cluster Management.
User experience: Resume Online Operations
When the switch to Resume Online Operations has begun, this message displays: Safeguard is returning to normal operations. Please wait until this process is complete before proceeding with any current work. The bottom of the Home page displays this information: (Returning to normal operations) and Disconnected.
Once online operations are restored, the bottom of the Home page displays this information: Connected.
Notifications
- The Appliance Administrator is notified when an appliance has lost consensus (quorum) via the ApplianceStateChangedEvent.
- The following events can be configured for email notifications and are written to the audit log:
- ClusterPrimaryQuorumLostEvent
- ClusterPrimaryQuorumRestoredEvent
- ClusterReplicaQuorumLostEvent
- ClusterReplicaQuorumRestoredEvent
- All access request notifications are still generated.
- The Notification service identifies whether access workflow is available on an appliance via the IsPasswordRequestAvailable, IsSSHKeyRequestAvailable, and IsSessionsRequestAvailable properties. The following API endpoint can be used to make this determination:
https://<hostname or IP>/service/notification/v4/Status/Availability
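As a rough illustration of how a monitoring script might use this endpoint, the Python sketch below fetches the status document and reduces it to one flag per workflow type. The property names are taken from the text and should be confirmed against an actual response, since their exact spelling is an assumption here. Whether the endpoint requires authentication or a specific certificate trust setup is not stated in this section; the sketch assumes a plain GET that returns a JSON object. The function names are hypothetical.

```python
import json
import urllib.request

# Property names as listed in the text; treat the spelling as an assumption
# and confirm it against the response from your own appliance.
WORKFLOW_PROPERTIES = (
    "IsPasswordRequestAvailable",
    "IsSSHKeyRequestAvailable",
    "IsSessionsRequestAvailable",
)


def summarize_availability(status: dict) -> dict:
    """Return one boolean per workflow property; a missing property counts as False."""
    return {name: bool(status.get(name, False)) for name in WORKFLOW_PROPERTIES}


def fetch_availability(host: str, timeout: float = 10.0) -> dict:
    """GET the availability endpoint named in the text and parse the JSON body.

    Assumes no authentication is needed and that the appliance certificate
    is trusted by the client; adjust for your environment.
    """
    url = f"https://{host}/service/notification/v4/Status/Availability"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Offline demonstration with a hand-written sample instead of a live call.
    sample = {"IsPasswordRequestAvailable": True, "IsSessionsRequestAvailable": False}
    print(summarize_availability(sample))
```

Keeping the parsing step separate from the network call makes it possible to check the logic against a saved response before pointing the script at a live appliance.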
Audit logs in Offline Workflow Mode
Avoid modifications to the cluster configuration
- It is recommended that no changes to cluster membership are made while an appliance is in Offline Workflow Mode. The online operations must be automatically or manually resumed before adding or removing other nodes to ensure the appliance can seamlessly reintegrate with the cluster.
The Appliance Administrator is advised to resume the online operations as soon as possible for individual password or SSH key accountability, policy adherence, and audit integrity.
Cluster patching is not allowed
During a cluster patch, Offline Workflow Mode cannot be triggered manually or automatically on any of the clustered appliances.
Considerations to resume online operations
- The network partition must be corrected before resuming online operations with full functionality.
- You can resume online operations of an appliance in Offline Workflow Mode without a quorum. However, before resuming online operations, it is highly recommended that network connectivity is restored between a majority of the cluster members, including the member in Offline Workflow Mode.
- When resuming online operations, any access requests that are in flight on the appliance that is running in Offline Workflow Mode will be dropped.
- While it is possible to resume online operations when the appliance is not connected to the cluster, access requests will no longer be available on that appliance.
Automatic versus manual workflow
The Appliance Administrator can manually control Offline Workflow Mode using the following steps. Manual intervention is possible when automatic Offline Workflow Mode is enabled. For more information, see Offline Workflow (automatic).
To manually enable Offline Workflow Mode
- Go to Cluster Management:
- web client: Navigate to Cluster > Cluster Management.
- In the cluster view (left pane) of the offline appliance, click the member of the cluster that is offline.
- In the appliance details and cluster health pane (right pane), review the errors and warnings to verify the appliance has lost consensus.
- On the offline appliance, click Enable Offline Workflow. (This option is only available when the appliance has lost consensus with the cluster.)
A message like the following displays:
This appliance will run access workflow in isolation from the cluster to work around loss of consensus with the cluster. Users will be able to request, approve and release passwords, SSH key, and sessions via this appliance using cached data. When connectivity is restored, you should resume online operations to reintegrate this appliance with the cluster and merge audit logs.
Type 'Enable Offline Workflow' in the box below to confirm.
See KB263580 for more information.
- In the dialog, type Enable Offline Workflow and click Enter. The appliance is in Offline Workflow Mode and enters maintenance. In the Activity Center, the Event for the appliance goes from Enable Offline Workflow Started to Enable Offline Workflow Completed.
- You can verify that new requests are enabled and view the following health checks on the Cluster Management window:
- If there is communication to the other members in the cluster while connected to the member in Offline Workflow Mode, a message like this displays at the top of the messages: Cluster connectivity detected. When communication is reestablished, you can manually resume online operations to the appliance.
- A warning icon displays next to an appliance in Offline Workflow Mode. An error icon is displayed if viewed from any other member in the cluster if the member is unable to communicate with the member in Offline Workflow Mode. At any time, you can click Check Health to update the information.
- A warning message like the following will display: Request Workflow: Access workflow on this appliance is operating in offline isolation from the cluster. This warning will persist until online operations are resumed by an Appliance Administrator.
To manually resume online operations
Before resuming online operations, see Considerations to resume online operations.
- Go to Cluster Management:
- web client: Navigate to Cluster > Cluster Management.
- In the cluster view (left pane), click the member of the cluster that is offline.
- On the appliance in Offline Workflow Mode, click Resume Online Operations. (This operation is only available when the appliance is in Offline Workflow Mode.)
A message like the following displays:
The appliance will be reconfigured for online operations. The appliance will attempt to reintegrate with the cluster and merge audit logs. Refer to the Admin Guide for more information.
Type 'Resume Online Operations' in the box below to confirm.
- In the dialog, type in Resume Online Operations and click Enter.
- When maintenance is complete, click Restart. The appliance is returned to Maintenance mode.
- You can verify health checks on the Cluster Management window. If a warning icon still displays next to the appliance, select the appliance and click Check Health to rerun the cluster health check and display the most up-to-date health information.
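The preconditions stated in the two manual procedures can be summarized as a small state model. The Python sketch below is only an illustration of those rules as described in this section (Enable Offline Workflow requires lost consensus and is blocked during a cluster patch; Resume Online Operations requires Offline Workflow Mode). The `Mode` enum, the function names, and the `has_consensus` and `cluster_patch_running` parameters are hypothetical and do not correspond to any product API.

```python
from enum import Enum


class Mode(Enum):
    ONLINE = "online"                      # full participation in the cluster
    LOST_CONSENSUS = "lost consensus"      # access requests are disabled
    OFFLINE_WORKFLOW = "offline workflow"  # access workflow runs in isolation


def enable_offline_workflow(mode: Mode, cluster_patch_running: bool = False) -> Mode:
    """Model of Enable Offline Workflow and its stated preconditions."""
    if cluster_patch_running:
        raise ValueError("Offline Workflow Mode cannot be triggered during a cluster patch")
    if mode is not Mode.LOST_CONSENSUS:
        raise ValueError("Enable Offline Workflow requires an appliance that has lost consensus")
    return Mode.OFFLINE_WORKFLOW


def resume_online_operations(mode: Mode, has_consensus: bool) -> Mode:
    """Model of Resume Online Operations.

    Resuming is allowed without a quorum, but the appliance then returns to
    the lost-consensus state, where access requests are unavailable.
    """
    if mode is not Mode.OFFLINE_WORKFLOW:
        raise ValueError("Resume Online Operations requires Offline Workflow Mode")
    return Mode.ONLINE if has_consensus else Mode.LOST_CONSENSUS


if __name__ == "__main__":
    mode = enable_offline_workflow(Mode.LOST_CONSENSUS)
    print(mode.value)
    print(resume_online_operations(mode, has_consensus=True).value)
```

The second function makes explicit why restoring connectivity first is recommended: resuming without consensus leaves the appliance unable to serve access requests at all.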
SPP allows you to fail over to a replica appliance by promoting it to be the new primary.
NOTE: You can promote a replica to be the new primary anytime the cluster has consensus (that is, the majority of the cluster nodes are online and able to communicate). If you have a quorum failure (that is, the majority of the cluster members do not achieve consensus), you must perform a cluster reset instead. For more information, see Resetting a cluster that has lost consensus.
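The rule in this note can be written as a short check. The Python sketch below assumes that "majority" means strictly more than half of all cluster members can communicate; the function names and the returned labels are hypothetical and only mirror the two operations named in the note.

```python
def has_consensus(reachable_members: int, total_members: int) -> bool:
    """True when strictly more than half of the cluster members can communicate."""
    if total_members < 1 or not 0 <= reachable_members <= total_members:
        raise ValueError("member counts are out of range")
    return reachable_members * 2 > total_members


def recovery_operation(reachable_members: int, total_members: int) -> str:
    """Pick the operation the note describes for the current cluster state."""
    if has_consensus(reachable_members, total_members):
        return "failover"
    return "cluster reset"


if __name__ == "__main__":
    # Two of three members reachable is a majority; two of four is not.
    print(recovery_operation(2, 3))
    print(recovery_operation(2, 4))
```

Under this assumption, an even-sized cluster split exactly in half has no side with consensus, so neither half could promote a replica.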
To promote a replica to be the new primary in a cluster
- Log in to a healthy cluster member as an Appliance Administrator.
- Go to Cluster Management:
- web client: Navigate to Cluster > Cluster Management.
- In the cluster view (left pane), select the replica node that is to become the new primary.
- Click Failover.
- In the Failover confirmation dialog, enter the word Failover and click OK to proceed.
During the failover operation, all of the appliances in the cluster are placed in Maintenance mode.
Once the failover operation completes, the selected replica appliance appears as the primary with a state of online. All other appliances (including the "old" primary) in the cluster appear as replicas with a state of online.