It can happen that both nodes break down simultaneously (for example, because of a power failure), or that the slave node breaks down before the original master node recovers. The following procedure describes how to properly recover syslog-ng Store Box (SSB).
NOTE: When both nodes of a cluster boot up in parallel, the node with the 1.2.4.1 HA IP address will become the master node.
To properly recover SSB
-
Power off both nodes by pressing and releasing the power button.
Caution:
Hazard of data loss. If SSB does not shut off, press and hold the power button for approximately 4 seconds. This method terminates connections passing through SSB and might result in data loss.
-
Power on the node that was the master before SSB broke down. Consult the system logs to find out which node was the master before the incident: when a node boots as master, or when a takeover occurs, SSB sends a log message identifying the master node.
TIP: Configure remote logging to send the log messages of SSB to a remote server, where the messages remain available even if the logs stored on SSB become inaccessible. For details on configuring remote logging, see SNMP and e-mail alerts.
-
Wait until this node finishes the boot process.
-
Power on the other node.
A split brain situation is caused by a temporary failure of the network link between the cluster nodes, resulting in both nodes switching to the active (master) role while disconnected. This might cause new data (for example, log messages) to be created on both nodes without being replicated to the other node. Thus, it is likely in this situation that two diverging sets of data have been created, which cannot be easily merged.
Caution:
Hazard of data loss. In a split brain situation, valuable log messages might be available on both syslog-ng Store Box (SSB) nodes, so special care must be taken to avoid data loss.
The nodes of the SSB cluster automatically recognize the split brain situation once the connection between the nodes is re-established, and do not perform any data synchronization, to prevent data loss. When a split brain situation is detected, it is indicated on the SSB system monitor, in the system logs (Split-Brain detected, dropping connection!), and on the Basic Settings > High Availability page, and SSB also sends an alert.
NOTE: After the connection between the nodes has been restored, the split brain situation persists. The core firmware will be active on one of the nodes, but will not start on the other.
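If you forward SSB's logs to a remote server (as recommended above), you can confirm the event by searching the collected logs for the detection message. A minimal sketch, assuming the logs are stored under /var/log/ on the log server (adjust the path to your setup):
# Search the collected logs for the split-brain detection message
grep -r "Split-Brain detected" /var/log/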
Once the network connection between the nodes has been re-established, one of the nodes becomes the master node, while the other becomes the slave node. There are two ways to identify the master node:
-
Locally: Log in to each SSB locally, and wait for the console menu to come up. The console menu only appears on the master node.
-
Remotely: Try connecting to each SSB using SSH. Only the master node can be accessed directly via SSH. The slave node cannot be reached externally, only via SSH from the master node.
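For example, a minimal check from a remote workstation, assuming you know the management IP addresses of both nodes (the addresses below are placeholders):
# Only the master node accepts a direct SSH connection as root
ssh root@<ip-of-node-1>   # succeeds only if node 1 is the master
ssh root@<ip-of-node-2>   # refused or fails if node 2 is the slave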
To recover an SSB cluster from a split brain situation, complete the procedures described in Data recovery and HA state recovery.
Caution:
Do NOT shut down the nodes.
Data recovery
In the procedure described here, data will be saved from the host currently acting as the slave host. This is required because data on this host will later be overwritten by the data available on the current master.
NOTE: During data recovery, there will be no service provided by SSB.
To configure recovering from a split brain situation
-
Log in to the master node as root locally (or remotely using SSH) to access the Console menu. If no menu appears after login, then this is the slave node. Try the other node.
-
Select Shells > Boot Shell.
-
Enter /usr/share/heartbeat/hb_standby. This will change the current slave node to master and the current master node to slave (HA failover).
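If the failover is initiated successfully, hb_standby prints a confirmation similar to the following (the same output is shown later in this document):
Going standby [all]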
-
Exit the console.
-
Wait a few seconds for the HA failover to complete.
-
Log in on the other host. If the Console menu does not appear, the HA failover has not completed yet. Wait a few seconds and try logging in again.
-
Select Shells > Core Shell.
-
Issue the systemctl stop syslog-ng.service command to disable all traffic going through SSB.
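To verify that the service has actually stopped before you start copying data, you can query its state with a standard systemd command:
systemctl is-active syslog-ng.service
# expected output once the service is stopped: inactive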
-
Save the files from /opt/ssb/var/logspace/ that you want to keep. Use scp or rsync to copy data to your remote host.
TIP: To find the files modified in the last n*24 hours, use find . -mtime -n.
To find the files modified in the last n minutes, use find . -mmin -n.
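For example, the following sketch copies every file modified in the last two days to a remote backup host using rsync (the host name and target directory are placeholders, adjust them to your environment):
cd /opt/ssb/var/logspace/
# Select recently modified files and copy them, preserving attributes
find . -mtime -2 -type f -print0 | rsync -av --files-from=- --from0 . root@backup.example.com:/backup/ssb-logspace/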
-
Exit the console.
-
Log in again, and select Shells > Boot Shell.
-
Enter /usr/share/heartbeat/hb_standby. This will change the current slave node to master and the current master node to slave (HA failover).
-
Exit the console.
-
Wait a few minutes to let the failover happen, so the node you were using will become the slave node and the other node will become the master node.
The nodes are still in a split-brain state, but now you have all the data backed up from the slave node, and you can synchronize the data from the master node to the slave node, which will turn the HA state from "Split-brain" to "HA". For details on how to do that, see HA state recovery.
HA state recovery
In the procedure described here, the "Split-brain" state will be turned to the "HA" state.
Caution:
Keep in mind that the data on the current master node will be copied to the current slave node, and data that is available only on the slave node will be lost (as that data will be overwritten).
To swap the nodes (optional)
NOTE: If you completed the procedure described in Data recovery, you do not have to swap the nodes. You can proceed to the steps about data synchronization.
If you want to swap the two nodes to make the master node the slave node and the slave node the master node, perform the following steps.
-
Log in to the master node as root locally (or remotely using SSH) to access the Console menu. If no menu appears after login, then this is the slave node. Try the other node.
-
Select Shells > Boot Shell.
-
Enter /usr/share/heartbeat/hb_standby. This will output:
Going standby [all]
-
Exit the console.
-
Wait a few minutes to let the failover happen, so the node you were using will become the slave node and the other node will be the master node.
To initialize data synchronization
-
Log in to the slave node as root locally (or remotely using SSH) to access the Console menu. If the Console menu appears, then this is the master node: try logging in to the other node.
Note that you are now in the boot shell, because on the slave node only the boot shell is available.
-
Invalidate the DRBD. Issue the following commands:
# On the slave node: demote the node and discard its diverged data
drbdadm secondary r0
drbdadm connect --discard-my-data r0
# Log in to the other (master) node over the HA link
ssh ssb-other
# On the master node: reconnect the DRBD resource
drbdadm connect r0
-
Reboot the slave node.
Following this step, the master will be in Standalone state, while the slave's DRBD status will be WFConnection.
The console will display an Inconsistent (10) message. This is normal behavior, and it is safe to ignore this message.
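To confirm these states from the shell of each node, you can query DRBD directly (a hedged check: the exact state strings, such as StandAlone and WFConnection, may vary between DRBD versions):
# On the master node (expected: StandAlone):
drbdadm cstate r0
# On the slave node (expected: WFConnection):
drbdadm cstate r0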
-
Reboot the master node. The SSB cluster will now be functional, accepting traffic as before.
-
After both nodes reboot, the cluster should be in Degraded Sync state, the master being SyncSource and the slave being SyncTarget. The master node should start synchronizing its data to the slave node. Depending on the amount of data, this can take a long time. To adjust the speed of synchronization, see Adjusting the synchronization speed.
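To follow the synchronization progress from the shell, you can watch the DRBD status. On DRBD 8.x, /proc/drbd shows the connection roles and a completion percentage (a convenience sketch, not required for the recovery):
watch -n 10 cat /proc/drbd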
-
Enable all incoming traffic on the master node. Navigate to Basic Settings > System > Service control > Syslog traffic, indexing & search: and click Enable.
If the web interface is not accessible or unstable, complete the following steps on the active SSB:
-
Log in to SSB as root locally (or remotely using SSH) to access the console menu.
-
Select Shells > Core Shell, and issue the systemctl start syslog-ng.service command.
-
Issue the date command, and check the system date and time. If it is incorrect (for example, it displays January 2000), replace the system battery. For details, see the hardware manual of the appliance.
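For example (the output below is a hypothetical illustration of a clock that was reset by a dead battery):
date
# Sat Jan  1 00:03:12 UTC 2000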
This section describes how to replace a unit in a syslog-ng Store Box (SSB) cluster with a new appliance.
To replace a unit in an SSB cluster with a new appliance
-
Verify the HA status on the working node. Select Basic Settings > High Availability. If one of the nodes has broken down or is missing, the Status field displays DEGRADED.
-
Note down the IP addresses of the Heartbeat and the Next hop monitoring interfaces.
-
Perform a full system backup. Before replacing the node, create a complete system backup of the working node. For details, see Data and configuration backups.
-
Check which firmware version is running on the working node. Select Basic Settings > System > Version details and write down the exact version numbers.
-
Log in to your support portal account and download the CD ISO for the same SSB version that is running on your working node.
-
Without connecting the replacement unit to the network, install the replacement unit from the ISO file. Use the IPMI interface if needed.
-
When the installation is finished, connect the two SSB units with an Ethernet cable via the Ethernet connectors labeled as 4 (or HA).
-
Reboot the replacement unit and wait until it finishes booting.
-
Log in to the working node and verify the HA state. Select Basic Settings > High Availability. The Status field should display HALF.
-
Reconfigure the IP addresses of the Heartbeat and the Next hop monitoring interfaces.
-
Click Other node > Join HA.
-
Click Other node > Reboot.
-
The replacement unit will reboot and start synchronizing data from the working node. The Basic Settings > High Availability > Status field will display DEGRADED SYNC until the synchronization finishes. Depending on the size of the hard disks and the amount of data stored, this can take several hours.
-
After the synchronization is finished, connect the other Ethernet cables to their respective interfaces (external to 1 or EXT, management to 2 or MGMT) as needed for your environment.
Expected result:
A node of the SSB cluster is replaced with a new appliance.
The IP addresses of the HA interfaces connecting the two nodes are detected automatically during boot. When a node comes online, it attempts to connect to the IP address 1.2.4.1. If no other node responds before the timeout expires, the node sets the IP address of its HA interface to 1.2.4.1; otherwise (if there is a responding node on 1.2.4.1), it sets its own HA interface to 1.2.4.2.
Replaced nodes do not yet know the HA configuration (or any other HA settings), and attempt to negotiate it automatically in the same way. If the network is, for any reason, too slow to connect the nodes in time, the replacement node boots with the IP address 1.2.4.1, which causes an IP conflict if the other node has previously set its own IP to that same address. In this case, the replacement node cannot join the HA cluster.
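The negotiation logic is roughly equivalent to the following sketch (illustrative only, not the actual SSB implementation; the interface name ha0, the netmask, and the timeout values are placeholders):
# Probe for an existing node on the well-known HA address
if ping -c 3 -W 5 1.2.4.1 > /dev/null 2>&1; then
    ip addr add 1.2.4.2/24 dev ha0   # another node answered: take the second address
else
    ip addr add 1.2.4.1/24 dev ha0   # no answer before the timeout: take the first address
fi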
To manually assign the correct IP address to the HA interface of a node, perform the following steps:
-
Log in to the node using the IPMI interface or the physical console.
Configuration changes have not been synced to the new (replacement) node, as it could not join the HA cluster. Use the default root password of syslog-ng Store Box (SSB); for details, see "Installing the SSB hardware" in the Installation Guide.
-
From the console menu, choose option 6, HA address.
Figure 224: The console menu
-
Choose the IP address of the node.
Figure 225: The console menu
-
Reboot the node.