ha (High Availability) Release Notes

The ha probe allows you to manage queues, probes, and the NAS AutoOperator in a High Availability setup. The probe runs on the standby Hub. If it loses contact with the primary Hub, it initiates a failover after a defined interval. When the primary Hub comes back online, the probe reverses the failover (failback).
Revision History
This section describes the history of the revisions for this probe.
 
Version / Description / State / Date
1.47
Fixed Defect:
  • Fixed an issue in which the UMP failover script sometimes did not start when ha 1.46 initiated the failover. To resolve this issue, a new configuration parameter, NAS_AO_first_if_failover, is now available. To use this parameter, add it to the ha.cfg file. This parameter lets the ha probe decide the order of the steps executed when the failover starts. Depending on its value, the parameter behaves as follows (Support Case: 01324021):
    • NAS_AO_first_if_failover = 0 (default value; this behavior is the same as in ha 1.46 and earlier). When the failover starts, the ha probe performs the actions in the following sequence:
      1. Issues the "Initiating failover from remote Hub XXXXXX" alarm.
      2. Activates the queues configured in [Queues to enable].
      3. Activates the nas Auto Operator.
      4. Activates the probes configured in [Probes to enable].
    • NAS_AO_first_if_failover = 1. When the failover starts, the ha probe performs the actions in the following sequence:
      1. Activates the nas Auto Operator.
      2. Issues the "Initiating failover from remote Hub XXXXXX" alarm.
      3. Activates the queues configured in [Queues to enable].
      4. Activates the probes configured in [Probes to enable].
    Note: In this scenario, the UMP failover script starts on arrival of the "Initiating failover from remote Hub XXXXXX" alarm because the ha probe has already activated the nas Auto Operator in Step 1.
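For reference, the parameter can be added to ha.cfg as follows. Placing it in the setup section is an assumption; the release note only says to add it to ha.cfg, so verify the placement in your environment:

```
<setup>
NAS_AO_first_if_failover = 1
</setup>
```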
GA
July 2019
1.46
 
Fixed Issue:

  • Fixed an issue in which the last queue (configured in Queues to enable) was not disabled during failback. (Support Cases: 245787 and 246794)
GA
October 2018
1.45
 
What's New:
 
  • Added support for a tunnel between the primary and secondary hub (the ha probe resides on the secondary hub).
  • Removed support for 32-bit operating systems.
  • The HA probe now defaults to 'queue_activate_method=queue_active' to enable and disable queues using hub callbacks.
GA
June 2014
1.44
Added support for a wait interval before the probe begins failback after re-establishing communication with the primary hub.
Added Admin Console GUI.
GA
March 2014
1.41
Fixed startup sequence so it checks if a state change is required in the initial run.
GA
March 2011
1.40
Added support for internationalization.
Added support for reading alarm tokens from cfg.
GA
December 2010
1.30
Added NIS(TNT2) Changes.
GA
September 2010
1.25
Probe now caches the IP and port of the remote address to avoid repeated lookups on this "static" data.
The cache is refreshed every hour.
Fixes a problem where a busy hub times out on the name lookup, in the worst case causing an incorrect failover.
The configuration tool now maps the hub address to the spooler address automatically, as this provides a better alive status.
GA 
September 2009
1.23
Fixed bug where subsystem id was ignored for alarm messages.
Fixed heartbeat message timing issue. Changed default subsystem id to 1.2.3.8.
GA
June 2008
1.20
Changed how queues are activated/deactivated to avoid potential problems with Hub restarting in the middle of the operation.
Added option to take a probe down when failing over with the new section "probes_down".
Fixed minor memory leak when restarting the probe.
Added configuration tool.
Changed name of section from "queues" to "queues_up".
Added section "queues_down" for queues which need to be deactivated when a failover occurs. This is useful where the secondary hub has a post queue to the primary hub, for example for QoS data. To avoid duplicate entries, this queue must be deactivated; it is reactivated after the primary hub comes back online.
Port to Linux, Solaris 8 (sparc) and AIX 5. No functional changes.
Changed control mechanism to active heartbeat checks. Queue is no longer required.
Initial Release.
GA 
April 2008
Hardware Requirements
This probe has no additional hardware requirements.
Software Requirements
When installing on a 64-bit Linux platform, these 32-bit libraries are required:
  • Debian/Ubuntu -- ia32-libs
  • Redhat/CentOS -- glibc-2.12
Considerations
Installation Considerations
The probe must be installed on the standby Hub.
The probe is not activated after distribution. It must be configured, then activated manually.
If your nas does not have the subsystem ID 1.2.3.8 defined, add it to the subsystems list in the nas, or change the message configurations to use the string "HA" in place of the subsystem ID.
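As a sketch, the subsystem mapping in the nas configuration might look like the following; the section name and layout are assumptions and should be verified against your own nas.cfg (or set via the nas GUI):

```
<subsystems>
1.2.3.8 = HA
</subsystems>
```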
Upgrade Considerations
When updating to version 1.20, the old "queues" section is renamed to "queues_up".
To take advantage of the spooler address change, the configuration must be saved from the configuration tool after probe update.
General Use Considerations
Ensure that the ha probe is installed on the secondary hub, not the primary hub.
In the setup section, these keys are the most relevant:
  • remote_hub - This is the primary Hub's full Nimsoft address in the form /Domain/Hub/Robot/hub
  • hb_seconds - This is the number of seconds between heartbeat messages. Minimum value is 5 seconds to avoid "denial of service" on the primary Hub.
  • wait_seconds - This is the number of seconds the probe should wait before initiating a failover. The failover is ended immediately when the primary Hub comes back online.
  • reset_nas_ao - This allows you to specify whether or not to (de)activate the nas AutoOperator on the failover system. Specify 'yes' or 'no'. The default is 'yes'.
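A minimal sketch of a setup section using the keys above; the domain, hub, and robot names are placeholders, and the interval values shown are illustrative, not defaults:

```
<setup>
remote_hub = /MyDomain/PrimaryHub/primaryrobot/hub
hb_seconds = 5
wait_seconds = 60
reset_nas_ao = yes
</setup>
```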
In the probes_up section, you can specify a list of probes that are to be activated on the local Hub when a failover occurs. When the remote_hub comes back online, these probes are deactivated again. The keys are of the form probe_0, probe_1, and so on, while the values are the names of the probes to be started/stopped.
In the queues_up section, you should specify the queues which are to be started during a failover. The same queue definitions must be set on both the primary and secondary Hubs. The keys are of the form queue_0, queue_1, and so on, while the values are the names of the queues to be started/stopped.
In the queues_down section, you should specify the queues which are to be stopped during a failover. The keys are of the form queue_0, queue_1, and so on, while the values are the names of the queues to be started/stopped.
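Taken together, the two queue sections can be sketched as follows; the queue names here are placeholders for queue definitions that already exist on both Hubs:

```
<queues_up>
queue_0 = attach_alarms
queue_1 = attach_qos
</queues_up>
<queues_down>
queue_0 = post_to_primary
</queues_down>
```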
In the Messages section, you can change the alarm messages and their severities that are sent when a problem occurs. The severities are numeric values from 0 (clear) through 5 (critical).
 
Best Practices and sequence of probes in the secondary hub for failover

To enable the ems probe for HA, keep in mind the following points for the probes_up section of the ha probe configuration:
Probes MUST be listed in the correct order, based on probe startup order and prerequisite probe dependencies.
If you do not respect the correct order, some probes will not start, for example because a prerequisite probe that must start before them is not fully activated yet. If the data_engine does not start, e.g., because the distsrv was not left activated/running, then many other probes that depend on the data_engine will not start and will appear red.
Here is an example of the probes_up section of the ha.cfg. Notice that the data_engine is listed first. Before you test failover, check the ha.cfg file to ensure the probes/queues are in the correct order.
<probes_up>
probe_0 = data_engine
probe_1 = ace
probe_2 = baseline_engine
probe_3 = prediction_engine
probe_4 = emailgtw
probe_5 = discovery_server
probe_6 = udm_manager
probe_7 = ems
probe_8 = trellis
probe_9 = maintenance_mode
probe_10 = mon_config_service
probe_11 = qos_processor
probe_12 = sla_engine
probe_13 = spectrumgtw
probe_14 = nis_server
probe_15 = wasp
</probes_up>
 
Key probe startup/operational dependencies for HA
 
Here are some of the key probe dependencies to keep in mind when deciding on the probe startup sequence.
  • distsrv MUST start before data_engine (keep the distsrv up/running and NOT in the HA.cfg)
  • data_engine MUST start before service_host/wasp to avoid logging invalid errors
  • alarm_enrichment MUST start before nas
  • baseline_engine MUST start before prediction_engine
  • admin_console is a web app run by service_host, so it does not appear in the list of probes to enable, and neither does mps or uimserver_home
get_running_state callback behavior:
A new callback, get_running_state, has been added to the ha probe. It returns its result as PDS data, with expected return states 0, 1, and 2.
The behavior, based on the return state, is as follows:
  • 0: The probe is running normally; the cache is not deleted.
  • 1: Status changed from secondary to primary; the ems db cache on the primary hub is deleted.
  • 2: Status changed from primary to secondary; the ems db cache on the secondary hub is deleted.