About SpectroSERVER Fault Tolerance
Fault tolerance requires more than one SpectroSERVER to manage a given landscape. A copy of the database for that landscape is loaded on each SpectroSERVER. However, only a single copy is active at any time. The SpectroSERVER with the active database is known as the primary. The inactive database runs on a standby SpectroSERVER, which is the secondary SpectroSERVER. You can also install another inactive copy of the database on a tertiary SpectroSERVER.
If the primary SpectroSERVER fails, the database on the secondary SpectroSERVER becomes active, and the secondary starts managing the network. Applications that are connected to the primary SpectroSERVER are automatically switched to the secondary SpectroSERVER. When the primary SpectroSERVER returns to service, the applications automatically switch back to it, and the secondary SpectroSERVER becomes inactive again.
Not all applications can exercise the full range of their capabilities when they are being run from a secondary SpectroSERVER. The main reason to set up a fault-tolerant environment is to ensure continuous monitoring of the network, not to create a full copy of
Precedence in a Fault Tolerant Environment
Primary, secondary, and tertiary SpectroSERVERs that manage the same landscape must all have the same landscape handle and the same modeling catalog. The servers are distinguished from one another with a numeric precedence value. The lowest number indicates the primary SpectroSERVER. SpectroSERVERs are installed with a default precedence value of 10. To designate a SpectroSERVER as a secondary server, assign it a higher precedence number, such as 20. Likewise, a tertiary SpectroSERVER would have a higher precedence value than the secondary, for example, 30.
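The precedence rule can be illustrated with a short Python sketch. This is a simplified model, not CA Spectrum code: the `Server` class, its fields, and the `active_server` function are hypothetical names introduced only for illustration. It assumes that the running server with the lowest precedence value holds the active database.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str             # hypothetical label, not a CA Spectrum identifier
    precedence: int       # lower number means higher priority
    running: bool = True


def active_server(servers):
    """Return the running server with the lowest precedence value, or None."""
    candidates = [s for s in servers if s.running]
    return min(candidates, key=lambda s: s.precedence, default=None)


servers = [
    Server("primary", 10),     # default precedence value from the text
    Server("secondary", 20),
    Server("tertiary", 30),
]

print(active_server(servers).name)   # primary
servers[0].running = False           # the primary fails
print(active_server(servers).name)   # secondary
servers[0].running = True            # the primary returns to service
print(active_server(servers).name)   # primary
```

The sketch only models which server is active. It does not model how applications reconnect or how the databases are kept in sync.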
When you first set up a fault tolerant environment, you can assign precedence values when you load database copies on any standby SpectroSERVERs using the SSdbload utility.
To change precedence values later, you can use the Loaded Landscapes subview. Access this subview by selecting a local landscape in the Navigation panel, and then selecting the Information tab in the Component Detail panel.
The Loaded Landscapes subview is different from the Control subview. Access the Control subview by selecting the VNM in the Navigation panel and then selecting the Information tab in the Component Detail panel.
A single database is active at any given time in a fault tolerant CA Spectrum environment. Therefore, the other databases must be updated periodically to reflect new models and changes to attribute values in the active database. This synchronization of data is accomplished through the CA Spectrum Online Backup feature. You can run Online Backup on demand or at regularly scheduled intervals. When you run Online Backup against the primary SpectroSERVER, it creates a backup copy of the current database. Online Backup automatically loads the copy onto each designated secondary SpectroSERVER.
As in any DSS environment, each of the SpectroSERVERs in a fault tolerant environment must have the same modeling catalog installed. Online Backup copies the current modeling catalog. However, it does not copy all the .i files or other elements that are associated with individual management modules. Therefore, if you install any new management modules on your primary SpectroSERVER, also install the same new management modules on any secondary SpectroSERVERs.
The EventDisp and Alertmap files that are defined in the <$SPECROOT>/custom/Events directory are propagated to fault-tolerant servers when the secondary SpectroSERVER polls the primary SpectroSERVER for status information.
Support for Fault-Tolerant Archive Manager
You can run the Archive Manager on the secondary host in a fault-tolerant environment. This secondary Archive Manager provides visibility to events in OneClick when the primary Archive Manager is down.
The primary or secondary SpectroSERVER stores events locally in the following two scenarios:
- When the primary Archive Manager is down and the primary SpectroSERVER is running. In this case, the primary SpectroSERVER stores events locally as they are created until the primary Archive Manager is up.
- When the primary host itself is down. In this case, the secondary SpectroSERVER stores events locally as they are created until the primary Archive Manager is up.
You can start the secondary Archive Manager on the secondary host to provide visibility to not only events as they are created when the primary Archive Manager is down, but also historical events.
When you start the secondary Archive Manager, it acts as a client to the primary SpectroSERVER to receive and log events as they are created. This behavior does not affect the normal connection between the primary SpectroSERVER and the primary Archive Manager. When the primary Archive Manager goes down, OneClick fails over to the secondary Archive Manager to provide event data.
When the primary host itself goes down, the secondary SpectroSERVER stores events locally and also forwards them to the secondary Archive Manager. When the primary Archive Manager comes up, the secondary SpectroSERVER transfers all the locally stored events to it.
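The scenarios above can be summarized as a small decision function. The following Python sketch is only a reading aid built from the behavior described in this section. The function name, parameter names, and destination labels are hypothetical and do not correspond to any CA Spectrum API.

```python
def event_destinations(primary_host_up, primary_am_up, secondary_am_up):
    """Return the set of places where a newly created event is written.

    primary_host_up: the primary host (and its SpectroSERVER) is running.
    primary_am_up:   the primary Archive Manager is running
                     (ignored when the primary host is down).
    secondary_am_up: the secondary Archive Manager is running.
    """
    destinations = set()
    if primary_host_up:
        if primary_am_up:
            destinations.add("primary Archive Manager")
        else:
            # Held locally until the primary Archive Manager is up again.
            destinations.add("primary SpectroSERVER local store")
    else:
        # Primary host down: the secondary SpectroSERVER holds the events
        # until the primary Archive Manager is up again.
        destinations.add("secondary SpectroSERVER local store")
    if secondary_am_up:
        # A running secondary Archive Manager receives a copy of each event.
        destinations.add("secondary Archive Manager")
    return destinations


for state in [(True, True, True), (True, False, True), (False, False, True)]:
    print(state, sorted(event_destinations(*state)))
```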
Archive Manager Data Synchronization
The secondary Archive Manager provides a best-effort synchronization of events; no event synchronization occurs between the primary Archive Manager and the secondary Archive Manager. When the secondary Archive Manager is running and connected to a SpectroSERVER, it receives a copy of all events as they are generated. Whenever the secondary Archive Manager is down, events are not stored on the secondary. This functionality is distinctly different from the functionality of the primary Archive Manager, where the SpectroSERVER stores the events for later transfer to the primary Archive Manager.
This means that when the secondary Archive Manager is started for the first time, its DDM database does not contain any events, and no attempt is made to synchronize with the primary. Once the secondary Archive Manager has been running for the number of days configured as MAX_EVENT_DAYS in the .configrc file, it is generally in sync with the primary Archive Manager database.
Generate an Alarm If the Secondary Is Not Restarted
When a primary SpectroSERVER synchronizes its database with the secondary SpectroSERVER, a Contact Lost to Secondary Server (0x00010c0e) event and alarm are generated. They are generated because the secondary SpectroSERVER has been brought down to load the new database from the primary SpectroSERVER.
You can set up a rule to process this alarm so that the alarm is generated only if the secondary SpectroSERVER is not restarted.
The EventPair rule lets you specify that a new event is generated if the Contact Lost to Secondary Server event occurs and a Contact Established to Secondary Server (0x00010c0f) event does not follow within a specified time period. You can then specify that this new event creates an event and an alarm indicating that the secondary SpectroSERVER is still down.
Follow these steps:
- Open the EventDisp file with a text editor. The EventDisp file is located in the <$SPECROOT>/SS/CsVendor/Cabletron directory.
- Find the line that reads 0x00010c0e E 50 A 2, 0x00010c0e and change this line to the following:
  0x00010c0e R Aprisma.EventPair, 0x00010c0f, <numberofsecondstowait> <generatedeventcode>
  <generatedeventcode> is the event code to generate if the secondary SpectroSERVER does not come up within the time specified in <numberofsecondstowait>.
- Add the following line to the EventDisp file:
  <generatedeventcode> E 50 A 2, <generatedalarmcode>
  <generatedeventcode> is the event code generated in Step 2 if the secondary SpectroSERVER did not come up. E 50 indicates that the event is logged and has a severity value of 50. A 2 indicates that a major alarm is created. <generatedalarmcode> is the alarm code to generate based on this event.
- Create a Probable Cause file for this alarm that indicates that contact with the secondary SpectroSERVER has not been reestablished after data synchronization.
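The pairing logic that this procedure relies on can be sketched in Python. This is a conceptual model of the behavior described above, not the CA Spectrum implementation of the EventPair rule. The `event_pair` function, its parameters, and the `GENERATED_EVENT` value are hypothetical placeholders; only the two contact event codes come from the text.

```python
CONTACT_LOST = 0x00010C0E         # Contact Lost to Secondary Server
CONTACT_ESTABLISHED = 0x00010C0F  # Contact Established to Secondary Server
GENERATED_EVENT = 0xFFFF0001      # hypothetical stand-in for <generatedeventcode>


def event_pair(events, wait_seconds, now):
    """Return (time, code) tuples for each unanswered Contact Lost event.

    events:       list of (timestamp_in_seconds, event_code), sorted by time.
    wait_seconds: stand-in for <numberofsecondstowait>.
    now:          current time; windows that are still open produce nothing.
    """
    generated = []
    for i, (lost_at, code) in enumerate(events):
        if code != CONTACT_LOST:
            continue
        deadline = lost_at + wait_seconds
        # Did the paired event arrive before the window closed?
        followed = any(
            later_code == CONTACT_ESTABLISHED and later_at <= deadline
            for later_at, later_code in events[i + 1:]
        )
        if not followed and now >= deadline:
            generated.append((deadline, GENERATED_EVENT))
    return generated


# Secondary restarts 120 seconds after contact is lost: nothing is generated.
quick = [(0, CONTACT_LOST), (120, CONTACT_ESTABLISHED)]
print(event_pair(quick, wait_seconds=300, now=1000))   # []

# Secondary restarts after 400 seconds: the 300-second window has expired.
slow = [(0, CONTACT_LOST), (400, CONTACT_ESTABLISHED)]
print(len(event_pair(slow, wait_seconds=300, now=1000)))   # 1
```

When choosing a wait time, a value somewhat longer than a typical database load on the secondary keeps routine synchronizations from raising the alarm.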
Secondary SpectroSERVER Readiness Levels
A secondary SpectroSERVER is considered to be at one of three different levels of readiness. Readiness depends on server configuration and status. The readiness levels are defined as follows:
- Hot: The secondary SpectroSERVER is running and is available to take over immediately upon failure of the primary SpectroSERVER because it is already polling. To configure a secondary SpectroSERVER for this level of readiness, add the following line to the .vnmrc file: secondary_polling=yes. This statement causes the standby SpectroSERVER to commence polling and processing traps whenever it starts, regardless of its connection status with the primary SpectroSERVER.
- Warm: The secondary SpectroSERVER is running, but the server can take a short time to become fully available. The secondary SpectroSERVER has not been configured to start polling until it loses contact with the primary SpectroSERVER. For example, it has no secondary_polling entry in the .vnmrc file, or the entry is set to no. In either case, the secondary SpectroSERVER does not process traps while in standby mode.
- Cold: The secondary SpectroSERVER is not running and must be started when there is a failure of the primary SpectroSERVER. In this case, it is irrelevant whether the secondary SpectroSERVER is configured for secondary polling.
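The three levels reduce to two inputs: whether the secondary is running, and the value of the secondary_polling entry in the .vnmrc file. The following Python sketch classifies a secondary on that basis. The function names are hypothetical, and the parsing assumes simple key=value lines with the exact value yes; the real .vnmrc format may allow more than this sketch handles.

```python
def secondary_polling_enabled(vnmrc_text):
    """Return True if the text contains a secondary_polling=yes entry."""
    for line in vnmrc_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "secondary_polling":
            return value.strip() == "yes"
    return False   # no entry is treated the same as secondary_polling=no


def readiness(running, vnmrc_text):
    """Classify a secondary as Hot, Warm, or Cold."""
    if not running:
        return "Cold"   # polling configuration is irrelevant when stopped
    return "Hot" if secondary_polling_enabled(vnmrc_text) else "Warm"


print(readiness(True, "secondary_polling=yes"))    # Hot
print(readiness(True, "secondary_polling=no"))     # Warm
print(readiness(True, ""))                         # Warm
print(readiness(False, "secondary_polling=yes"))   # Cold
```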