Establish Fault Tolerance

Establishing Fault Tolerance
You can set up a fault-tolerant environment when you first install CA Spectrum, before any models have been created, or after you install CA Spectrum.
The following procedure describes how to set up two SpectroSERVERs: a primary and a secondary. You can also set up a tertiary SpectroSERVER by taking the same steps. However, assign the tertiary SpectroSERVER a higher precedence number than the secondary SpectroSERVER.
To establish fault tolerance in an environment with a Southbound Gateway integration, see the Southbound Gateway Toolkit documentation.
Follow these steps:
  1. Install the same version of CA Spectrum with the same modeling catalog on both the primary SpectroSERVER and the secondary SpectroSERVER. Each server requires the same landscape handle.
  2. Verify that both the primary and secondary SpectroSERVERs have entries in their .hostrc files that give the SpectroSERVERs mutual access permissions (sample entries appear after this procedure).
    If you are specifying secure users for the secondary SpectroSERVER in the .hostrc file on the primary SpectroSERVER, and the secondary SpectroSERVER is running in the Windows environment, include the user SYSTEM in the secure user list.
  3. Verify that the MAIN_LOCATION_HOST_NAME parameter in the .locrc file on the secondary SpectroSERVER points to the same system name as the .locrc file on the primary SpectroSERVER. Otherwise, synchronization fails.
  4. Configure the primary and secondary SpectroSERVERs so that the user running each SpectroSERVER is the same. If the users are not the same, the secondary SpectroSERVER fails or does not run properly after an Online Backup.
  5. Make a copy of the primary SpectroSERVER database by running Online Backup. Or, if the SpectroSERVER is shut down, use the SSdbsave utility with the -cm argument (to save the modeling catalog and any new models).
  6. Verify that the save file that you created is available to the server that hosts the secondary SpectroSERVER. Copy the file to the server if necessary.
  7. On the secondary server, with the SpectroSERVER shut down, navigate to the CA Spectrum SS directory and load the save file using the following command (a worked example follows this procedure):
    ../SS-Tools/SSdbload -il -add precedence savefile
    • precedence
      Specifies a numeric value greater than the primary server default value of 10 (20 is recommended).
    • savefile
      Specifies the name of the save file that was previously created.
  8. (Optional) Add the line 'secondary_polling=yes' to the .vnmrc file to let the secondary SpectroSERVER function as a hot backup.
  9. Start the primary SpectroSERVER, if it is not already running.
  10. Start the secondary SpectroSERVER.
  11. To verify the setup, use the MapUpdate command with the view argument to display the current landscape map.
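The following entries illustrate steps 2 and 3. The host names are placeholders, and the snippets assume the one-entry-per-line .hostrc format and the name=value resource-file format that .vnmrc also uses; the exact syntax for appending a secure-user list to a .hostrc entry can vary, so confirm it against the comments in your installed resource files before editing.
.hostrc on the primary SpectroSERVER (grants the secondary access; include SYSTEM in the secure user list if the secondary runs on Windows):
  ss-secondary.example.com
.hostrc on the secondary SpectroSERVER (grants the primary access):
  ss-primary.example.com
.locrc on the secondary SpectroSERVER (MAIN_LOCATION_HOST_NAME must name the same system as the corresponding entry in the primary's .locrc):
  MAIN_LOCATION_HOST_NAME=ss-primary.example.com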
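The command sequence below is a sketch of steps 5 through 11, assuming a save file named db_save and the recommended secondary precedence of 20; the file name is a placeholder, and the tools are run from the SS directory as shown in the procedure.
On the primary host, with the SpectroSERVER shut down:
  ../SS-Tools/SSdbsave -cm db_save
Copy db_save to the secondary host, then load it there with the SpectroSERVER shut down:
  ../SS-Tools/SSdbload -il -add 20 db_save
Start the primary and then the secondary SpectroSERVER, and verify the landscape map with the view argument (shown here as -view):
  ../SS-Tools/MapUpdate -view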
The secondary SpectroSERVER is now available to take over automatically if the primary SpectroSERVER fails. If you previously activated secondary polling, the secondary SpectroSERVER is available immediately. Otherwise, polling begins when the server detects that it has lost contact with the primary SpectroSERVER.
When service switches from the primary SpectroSERVER to the secondary SpectroSERVER, the Connection Status icon displays yellow. To view the connection status of all servers in a landscape, click the Connection Status icon. In the Connection Status dialog, the Connection Status icon for each server in the landscape displays yellow to indicate the “switched” condition.
When the primary SpectroSERVER comes back online, the secondary SpectroSERVER stops polling (unless you have set secondary_polling to 'yes'). All the applications switch back to the primary SpectroSERVER. However, any edits that you make to the secondary SpectroSERVER while it is active are not automatically replicated to the primary SpectroSERVER. Manually recreate these modifications on the primary SpectroSERVER.
When you restart the primary SpectroSERVER, connections are accepted when all models are loaded, but before all models are activated. The models can take some time to activate. Because the secondary SpectroSERVER stops polling when the primary SpectroSERVER is restarted, a gap in your network management coverage can result.
To avoid this situation, edit the .vnmrc file on the primary SpectroSERVER so that the wait_active resource is set to 'yes'. This parameter causes the server to wait until all of the models are activated before accepting any connections. The message area in the CA Spectrum Control Panel also dynamically displays the percentage of models that are activated. The SpectroSERVER can appear to take longer to come up. However, when all the models are activated, the SpectroSERVER is ready to manage the network.
You can also set the wait_active resource to 'yes' on the secondary SpectroSERVER. During a planned shutdown of the primary SpectroSERVER, you can then verify in the CA Spectrum Control Panel that the secondary SpectroSERVER is ready to take over.
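As a minimal illustration, and using the same name=value format as the secondary_polling line in the setup procedure, the relevant .vnmrc entries would look like the following. The SpectroSERVER reads .vnmrc at startup, so restart it after making the change.
In the .vnmrc file on the primary SpectroSERVER:
  wait_active=yes
In the .vnmrc file on the secondary SpectroSERVER:
  secondary_polling=yes
  wait_active=yes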
Validate Fault Tolerance Configuration
After you have set up fault tolerance in a distributed SpectroSERVER deployment, verify that the OneClick server has access to both the primary and secondary SpectroSERVERs. Without connectivity to both servers, the OneClick server cannot fail over to the secondary SpectroSERVER.
Follow these steps:
  1. Access the OneClick Administration, Landscapes web page.
  2. Check the ‘Secondary Status’ column. Verify that OneClick has established contact with the secondary SpectroSERVER.
    The status also indicates whether Fault Tolerance is ready for failover.
    The Fault Tolerance configuration is validated.
Test Fault Tolerance
During an initial installation, the secondary SpectroSERVER might not have access to all the devices to which the primary SpectroSERVER has access. This situation causes the secondary SpectroSERVER to generate false alarms. To avoid false alarms, verify that the secondary SpectroSERVER can manage your network devices by testing fault tolerance.
Test fault tolerance whenever new devices are added to the primary SpectroSERVER.
Follow these steps:
  1. With both the primary and secondary SpectroSERVERs up and running, bring down the primary SpectroSERVER.
    The Connection Status icon is yellow to indicate the "switched" condition.
    A red connector indicates that the OneClick server was not able to contact the secondary SpectroSERVER.
  2. Wait 15 to 20 minutes for the secondary SpectroSERVER to run.
  3. Verify the following conditions:
    • The Connection Status icon does not display red.
    • All device models and pingable models maintain SNMP or ICMP contact.
      If this contact is lost, verify that the secondary SpectroSERVER has access to your devices. Contact a Network Administrator to resolve this problem, if applicable.
    • CA Spectrum is managing all devices that have an established contact state. Verify the status by checking for device contact or management contact loss alarms from any of the device models.
  4. Restart the primary SpectroSERVER.
    The Connection Status icon displays green to indicate a normal contact state.
Fault-Tolerant Recovery
Following are the two possible failure scenarios:
  • The primary SpectroSERVER stops. The secondary SpectroSERVER then forwards event and statistical information to the primary Archive Manager that is running on the server that hosts the primary SpectroSERVER. When the primary SpectroSERVER restarts, no event or statistical data has been lost.
  • The computer where the primary SpectroSERVER and the primary Archive Manager are running stops operating completely. The secondary SpectroSERVER then caches event and statistical data in its database until the primary computer comes back online. If a secondary Archive Manager is running, historical and real-time information is available in OneClick, but the information is still cached for transfer to the primary Archive Manager.
Restart both the primary Archive Manager and the primary SpectroSERVER if their server goes down, or if the primary SpectroSERVER stops operating.
It is no longer necessary to start the Archive Manager before the SpectroSERVER; the cached events from the secondary SpectroSERVER can be transferred at any time, even after the primary SpectroSERVER has started logging new events.
Follow these steps:
  1. Start the SPECTRUM Control Panel on the primary SpectroSERVER host.
  2. To start the SpectroSERVER, click Start on the SPECTRUM Control Panel.
    When the primary Archive Manager is again operational, the secondary SpectroSERVER connects and transfers its cached event data to the primary Archive Manager.
Change the Host Names of the Primary and Secondary SpectroSERVERs
SpectroSERVERs in a fault-tolerant environment use a precedence value that is associated with their host names to recognize their relationship to one another. Therefore, to preserve the fault-tolerant relationship, use SSdbsave and SSdbload to change the host name of your primary SpectroSERVER.
Follow these steps:
  1. Save the database using SSdbsave with the -cm option.
  2. Change the host name.
  3. Reload the database with the save file that you created in the first step. Run SSdbload with the -il option and the -replace option:
    SSdbload -il -replace precedence savefile
    This command causes the database to associate the new host name with the precedence value (10) that designates a primary SpectroSERVER.
    The change in the host name is communicated to any warm or hot standby SpectroSERVERs the next time that the databases are synchronized as a result of Online Backup being run.
    In the meantime, however, the host name change prevents the standby SpectroSERVERs from detecting that the primary SpectroSERVER is running. As a result, any SpectroSERVER that is configured as a warm standby starts polling.
  4. Load the save file on the warm standby using SSdbload with the -il and -replace options, and specify a higher precedence value (for example, 20) that designates it as a standby.
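As an illustration of the preceding steps, assuming a save file named primary_save, the default primary precedence of 10, and a standby precedence of 20, the sequence is roughly as follows; the file name is a placeholder, and the tools are run from the SS directory as earlier in this section.
On the primary host:
  SSdbsave -cm primary_save
  (change the host name of the system, then reload the database)
  SSdbload -il -replace 10 primary_save
On each warm standby:
  SSdbload -il -replace 20 primary_save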
Now you can change the host name of the secondary SpectroSERVER.
Follow these steps:
  1. Save the database using SSdbsave with the -cm option.
  2. Make the change to the host name.
  3. Reload the database with the save file that you created in the first step. Run SSdbload with the -il option and the -replace option:
    SSdbload -il -replace precedence savefile
    This command causes the database to associate the new host name with the precedence value (20) that designates a secondary SpectroSERVER.
    When you restart the secondary SpectroSERVER, the server communicates the new host name and precedence to the primary SpectroSERVER.
 