Fault Isolation

Fault management is one of the key requirements of network management. A fault differs from an error in that it is an abnormal condition that requires management attention and repair. Faults can be caused by bad firmware, bad hardware, or a bad network, and each of these problems requires a different response from the network manager. The goal is therefore to determine the exact location of the fault and to get the attention of the network administrators as quickly as possible.
CA Spectrum intelligence is capable of isolating a network problem to the most probable faulty component. To speed up fault isolation and to reduce unnecessary traffic, two actions occur:
  • Are-You-Down Action
    Upon losing contact with the device it represents, a model sends the Are-You-Down action to all of its neighbors to determine its own condition. If all of the neighbors return a response of TRUE, the condition color of the model turns gray (meaning “my device might be down, but it is impossible to tell because all the neighbors are down”). However, if any of the neighbors return a response of FALSE, the condition color of the model turns red (meaning “my device must be down, because one of the neighbors is up”).
  • Are-You-Up Action
    Upon re-establishing contact with the device it represents, a model sends the Are-You-Up action to its neighbors to speed up the fault isolation. Upon receiving this action, each neighbor returns TRUE if it has an established contact status. If the contact status of the neighbor model is lost and its next-time-to-poll is more than 60 seconds away, the neighbor model pings its device for quicker fault isolation.
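The Are-You-Down decision described above can be sketched in Python as follows. This is an illustrative sketch, not CA Spectrum code: the function name `condition_after_contact_loss` and the representation of neighbor responses as booleans are assumptions made for the example. The handling of a model with no neighbors follows the condition color table in this section, which lists significant devices with a zero connector count as turning red.

```python
from typing import Iterable

GRAY = "gray"
RED = "red"


def condition_after_contact_loss(neighbor_responses: Iterable[bool]) -> str:
    """Pick the condition color of a model that has just lost contact.

    Each item is one neighbor's answer to the Are-You-Down action:
    True means "I am down", False means "I am up".
    (Hypothetical encoding, chosen for this sketch.)
    """
    responses = list(neighbor_responses)
    if not responses:
        # No connections to other models (zero connector count): red.
        return RED
    if all(responses):
        # Every neighbor is down, so the fault cannot be pinned on this device.
        return GRAY
    # At least one neighbor is up, so this device must be the one that is down.
    return RED
```

For example, `condition_after_contact_loss([True, True])` gives gray, while a single FALSE response among the neighbors is enough to give red.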
Every time the status of the model changes, or the information available to CA Spectrum changes, a new assessment occurs.
CA Spectrum intelligence keeps the topology presentation as current and as accurate as possible, but it depends on correct modeling to accurately assess contact status and determine device failures on the network. Correct modeling includes placing your VNM model in proper relation to the other models that represent your network; it must have a resolved connection in the Topology view of a model that represents a device to which the VNM host is connected. When the VNM model is properly connected and CA Spectrum loses contact with a model, the icon representing that model displays a condition color of Gray, Orange, or Red, which helps the network administrators to locate the faults immediately.
Improved Fanout Performance
With Spectrum 10.3.1, Fanout performance has been improved by propagating the 'Are_You_Down' action only to the SNMP capable neighbors of a Fanout. To enable the Fanout performance enhancement, set the parameter 'improve_fanout_performance' to True in the $Specroot\SS\.vnmrc file and restart the SpectroSERVER. By default, this parameter is set to False. When all the non-SNMP significant device models are down and any SNMP capable device is up, the Fanout condition turns red and the non-SNMP device models are suppressed (shown in Fig.1). To change a non-SNMP insignificant device model into a significant device model, refer to the section How Model Category Affects Contact Status.
Fig.1
If a non-SNMP significant device model is in maintenance mode, the model is considered to be down (shown in Fig.1.1).
Fig.1.1
How Model Category Affects Contact Status
Each fault is associated with a particular condition, which is represented by a particular color that displays on the icon representing the model where the fault occurs. The condition color reflects both the contact status and the alarm status of the model. However, the contact status and condition color asserted for a model also depend on which of the following categories the model belongs to. The following list summarizes how the categories to which a model and its neighbors belong influence its contact status and condition color.
  • Significant Device Models
    Any device that requires the attention of the administrator for the smooth operation of the network is called a significant device. To change an insignificant model into a significant model, change the value of the attribute Value_When_Red (0x1000e) to 7.
  • Insignificant Device Models
    An insignificant device, such as an end-user PC, toggles between Blue and Green contact states and does not generate alarms or event messages to get the attention of the administrator. To change a significant model into an insignificant model, change the value of the attribute Value_When_Red (0x1000e) to 0.
  • Inferred Connectors
    These are dumb models that do not poll but that keep track of a list of their Data Relay neighbors. Examples of inferred connectors include WA_Segment and Fanout.
    CA Spectrum automatically enables Live Pipes for all ports that are connected to a WA_Segment.
    CA Spectrum intelligence does not expect Fanout models to be connected to each other; thus this configuration results in inaccurate contact status displays. If two Fanouts are connected to each other and each of them is in turn connected to a device with a green contact status, the Fanouts nonetheless turn red. If two Fanouts are connected to each other with no other devices connected to either one, both Fanouts turn gray.
  • Shared Media Link
    The Shared Media Link is a specialized inferred connector. These models are similar to Fanouts, but the fault management works differently. Unlike a Fanout model, the Shared Media Link model condition is based on configured threshold values.
    Example:
    If the critical threshold is set to 80, the Shared Media Link turns red when it loses contact with 80 percent of the downstream models.
  • Composite and Discrete Topology Models
    The contact status of a LAN, LAN 802.3, LAN 802.5, or similar model is determined by the contact status of its collected children. A LAN model with lost contact status turns either red or gray, depending on the condition of its collected models.
  • Wide Area Links
    Wide Area Links (WA_Links) are modeled with wide area segment (WA_Segment) models. This allows for proper rollup of the Wide Area Link condition. WA_Link models can only represent point-to-point connections, such as T1 and T3 lines, and there can be no more than two devices that are connected to it at a time. Also, you must connect the WA_Segment model to the correct port of the device models.
    WA_Link models can accommodate only one WA_Segment model. If you attempt to paste more than one WA_Segment model into a WA_Link model’s Topology view, the second one is destroyed immediately and an alarm is generated.
  • Wide Area Segments
    WA_Segments poll the InternalPortLinkStatus (IPLS) attribute of each interface model which Connects_To the WA_Segment. This is an active poll, meaning that the IPLS of each connected interface is read at every polling interval rather than simply watched for a change in the attribute. Therefore, CA Spectrum does not have to lose contact with one of the connected routers for a fault isolation alarm to be generated on a WA_Link.
    The polling of the connected ports’ IPLS is regulated by the WA_Link model’s Polling_Interval and PollingStatus attributes. When the Polling_Interval changes to zero (0) or PollingStatus goes to FALSE, polling of the connected port’s IPLS is stopped.
    If one of the connected interfaces has an IPLS of BAD (for example, Admin Status is ON, but Oper Status is OFF), then the WA_Segment’s Contact_Status is set to ‘lost’ and the WA_Segment turns gray. The WA_Link turns red.
    If one of the connected interfaces has an IPLS of ‘disabled’ (for example, Admin Status is OFF), then the WA_Segment’s Contact_Status is set to ‘lost’ and the WA_Segment turns gray. The WA_Link turns orange. This is because the alarm must be severe enough to be viewed in the Alarms tab, but it is not a “Contact Lost” alarm.
    If the DISABLED interface causes CA Spectrum to lose contact with the remote router, then the WA_Link turns red. This is the regular InferConnector-type fault isolation working.
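The Wide Area Segment polling rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not CA Spectrum code: the function names, the lowercase string values `"good"`, `"bad"`, and `"disabled"` for IPLS, and the choice to check BAD before DISABLED are all assumptions made for illustration. The green result for the all-good case is likewise an assumption drawn from the condition color rules elsewhere in this section.

```python
from typing import Iterable, Optional, Tuple


def polling_active(polling_interval: int, polling_status: bool) -> bool:
    """IPLS polling stops when Polling_Interval is 0 or PollingStatus is FALSE."""
    return polling_interval != 0 and polling_status


def assess_wa_link(
    ipls_values: Iterable[str],
    polling_interval: int,
    polling_status: bool,
) -> Optional[Tuple[str, str]]:
    """Return (WA_Segment color, WA_Link color), or None when polling is stopped.

    ipls_values holds the IPLS of each connected interface
    (hypothetical string encoding: "good", "bad", "disabled").
    """
    if not polling_active(polling_interval, polling_status):
        return None
    values = [value.lower() for value in ipls_values]
    if "bad" in values:
        # Contact_Status is set to 'lost': segment gray, link red.
        return ("gray", "red")
    if "disabled" in values:
        # Segment still turns gray, but the link is only orange.
        return ("gray", "orange")
    # Assumption: with every interface good, both models stay green.
    return ("green", "green")
```

Because the poll is active, calling `assess_wa_link` at every polling interval would surface a BAD or disabled interface even while both routers remain reachable.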
| Model Category | Connected Models (Neighbors) | Condition Color |
|---|---|---|
| Significant Devices (Modeling Hub-types only) | connected to a VNM... | turn Red after losing contact. |
| Significant Devices | with no connections to other models (a zero connector count)... | turn Red after losing contact. |
| Significant Devices | connected to an established Data Relay neighbor... | turn Red after losing contact. |
| Composite and Discrete Topologies | in which all of the collected children have a lost contact status and at least one of those collected children is Red... | turn Red after losing contact. |
| Inferred Connectors | where the Fanout model has lost contact but one of its neighbors is good and the associated port has bad port link status, then it... | turn Red after losing contact. |
| Significant Devices, Inferred Connectors, and WA_Links | where all neighbors have also lost contact status... | turn Gray after losing contact. |
| Composite and Discrete Topologies | in which all of the collected children have a lost contact status and none of those collected children are Red... | turn Gray after losing contact. |
| Significant Devices (Modeling Hub-types only) | connected to an end-point neighbor (such as a PC) that has established contact status... | turn Orange after losing contact. |
| WA_Links | WA_Segment (or Fanout) is good and one of the routers is lost then... | turn Orange after losing contact. |
| Significant Devices | connected to a model with an Established contact status... | turn Green. |
| Composite/Discrete Topologies and WA_Links | in which any of the collected children has established contact status, then the LAN will also... | turn Green. |
| Inferred Connectors | connected to a model with an established contact status where at least one of its neighbors is Good and its associated port (port that is connected to the Fanout) status is Good... | turn Green. |
| Significant and Insignificant Devices | not yet connected to other devices... | turn Blue. |
| Composite/Discrete Topologies and WA_Links | when all collected children of a LAN have initial contact status, then the LAN will also have the initial contact status... | turn Blue. |
Fault Isolation Examples
The following examples illustrate how CA Spectrum fault isolation operates with various network configurations and problem scenarios.
Example: Proactive Fault Isolation
This example demonstrates that fault isolation is a proactive mechanism which does not depend upon polling all of the connected models.
Consider a simple network topology as shown in the following diagram. The device H1 is connected to the VNM model. Devices H1, H2, and H3 poll every 3 minutes. H4 polls every 5 minutes. The PC polls every 30 minutes.
spec--faultisolation1
Assume that H2 is BAD. As a result H2 turns red, H4 turns gray, PC (insignificant model) turns blue, while H1 and H3 remain green.
Fault isolation is initiated when H2, H4, or PC polls. If H4 is lost, it sends an Are-You-Down action to H2. If H2 is already lost by then, it returns TRUE to H4; otherwise it pings its own device and then returns the result to H4. This causes H4 to turn gray.
Now H2 is lost, and it sends an Are-You-Down action to H1. Because H1 is established, H2 has to decide between the orange and red conditions. H2 pings PC. Because PC cannot respond, H2 turns red. The ping from H2 puts PC in a lost state. Because PC is an insignificant device, it turns blue.
Example: Modeling a Fanout
This example demonstrates fault isolation when modeling a Fanout.
Assume that the Fanout is red and D2, D3, and PC are gray. The following diagram illustrates this scenario.
spec--faultisolation2
The Fanout registers a watch on D1's contact status. If D1 goes down, the Fanout turns Gray as a result of the watch trigger.
When D3 eventually polls successfully, D3 has an established contact status and turns Green. D3 then sends an Are-You-Up action to the Fanout. The Fanout reads the internal link port status of P3 (D3’s port connection to the Fanout). Assuming the port has a good status, the watch is cleared, and the Fanout turns Green with an established contact status. This means that as long as P1 (D1’s port connection to the Fanout) has good internal link port status, the contact status of the inferred connector remains good.
What if D2 goes bad? D2 loses its contact status and sends an Are-You-Down action to the Fanout. The Fanout pings D1, and finds D1 to be good. The intelligence then examines the status of P1. Assuming Link-Status of P1 is good, the Fanout returns FALSE to model D2. This causes D2 to turn Red.
What if P1 is bad? This is the same case as disconnecting the network connection to the Fanout. If D3 polls first, it loses its contact status and sends an Are-You-Down action to the Fanout. The Fanout pings D1 and finds it to be a good neighbor. The Fanout then reads the internal-port-link-status of port P1. Because P1 is bad, the Fanout loses its contact status and turns Red. The Fanout returns TRUE to the model D3. This causes D3 to turn Gray. D2 also turns Gray in the same way as D3. PC, being an insignificant device, turns Blue immediately after losing its contact status.
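The way the Fanout answers an Are-You-Down action in the two "what if" cases above can be sketched in Python as follows. This is an illustrative sketch, not CA Spectrum code: `FanoutResult` and `fanout_handle_are_you_down` are hypothetical names, and the requester color assumes that the Fanout is the requesting model's only neighbor, as it is for D2 and D3 in this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FanoutResult:
    response: bool        # answer to Are-You-Down: True means "I am down"
    fanout_color: str     # resulting condition color of the Fanout
    requester_color: str  # resulting color of the model that asked


def fanout_handle_are_you_down(good_neighbor_found: bool,
                               port_link_good: bool) -> FanoutResult:
    """Decide the Fanout's answer after it pings for a good neighbor.

    good_neighbor_found: the ping found an established neighbor (D1 here).
    port_link_good: internal port link status of that neighbor's port (P1).
    """
    if good_neighbor_found and port_link_good:
        # Fanout is up, so the requester's own device must be down.
        return FanoutResult(False, "green", "red")
    if good_neighbor_found:
        # Neighbor is good but its port link is bad: the Fanout is the fault.
        return FanoutResult(True, "red", "gray")
    # No good neighbor at all: the Fanout cannot tell, so it turns gray.
    return FanoutResult(True, "gray", "gray")
```

With P1 good, D2 going bad yields `FanoutResult(False, "green", "red")`; with P1 bad, D3 receives TRUE and turns gray while the Fanout turns red.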
Example: Redundant Paths Fault Isolation
This example shows how CA Spectrum manages devices using redundant paths if a link is shut down administratively (i.e., admin-status equals down).
The following diagram depicts a network with redundant WA Links. Here VNM manages Rtr3 through link WL-1 and Rtr2 using link WL-2. Assume that the network administrator shuts down the WL-1 link. This causes WL-1 to turn gray. Rtr3 turns red because VNM cannot talk to it through WL-1. The redundancy intelligence of Rtr3 modifies its agent address, so that VNM can talk to it using links WL-2 and WL-3. This causes Rtr3 to turn green again. The link WL-1 still has the gray condition.
spec--faultisolation3
Example: Inferred Connector Fault Isolation
This example demonstrates that fault isolation for an Inferred Connector requires specific modeling. Assume that two routing devices, Rtr1 and Rtr2, are connected at both ends of the WA_Link and that their ports are P1 and P2 respectively.
WA_Link models need to be associated with a WA_Segment (or Fanout) model through the Collects relation to enable the proper rollup of the WA_Link condition. The devices at either end of the WA_Link need to be connected to the WA_Segment collected by the WA_Link model. You do this by navigating into the device’s Device Topology view and resolving the WA_Segment off-page reference icon to the appropriate port. You can view the connections by navigating into the WA_Segment’s view.
This cross-connection is important for fault isolation to work, as shown in the following diagram.
spec--faultisolation4
Assume that P1 is the port on Rtr1 and P2 is the port on Rtr2. The routers that are connected to the WA_Segment cause it to behave as described in the following table. Note that the port link status becomes important in determining the status of the WA_Link only when both routers are “contact established.”
| Rtr1 | Rtr2 | WA_Link |
|---|---|---|
| Initial | Initial | Blue |
| Established | Lost | Red |
| Lost | Lost | Gray |
| Established | Established | Check Port States* |

* If both Rtr1 and Rtr2 have a contact status of established, then the port status of P1 and P2 determines the condition of the WA_Link. If any port is BAD, the WA_Link turns RED. If any port is DISABLED, the WA_Link turns ORANGE. Otherwise, the WA_Link turns GREEN.
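The WA_Link table and its footnote can be expressed as a small lookup, sketched here in Python. The function name `wa_link_condition` and the lowercase string encodings are assumptions for illustration only. The sketch treats "Established / Lost" as symmetric in Rtr1 and Rtr2, which the table does not state explicitly, and it raises an error for combinations the table does not cover instead of guessing a color.

```python
def wa_link_condition(rtr1: str, rtr2: str,
                      p1: str = "good", p2: str = "good") -> str:
    """Return the WA_Link condition color for the given router and port states.

    rtr1, rtr2: contact status, one of "initial", "established", "lost".
    p1, p2: port link status, one of "good", "bad", "disabled".
    """
    statuses = {rtr1.lower(), rtr2.lower()}
    if statuses == {"initial"}:
        return "blue"
    if statuses == {"lost"}:
        return "gray"
    if statuses == {"established", "lost"}:
        # Assumption: the same result applies whichever router is lost.
        return "red"
    if statuses == {"established"}:
        # Port states matter only when both routers are established.
        ports = {p1.lower(), p2.lower()}
        if "bad" in ports:
            return "red"
        if "disabled" in ports:
            return "orange"
        return "green"
    raise ValueError("combination not covered by the table")
```

For example, `wa_link_condition("established", "established", "good", "disabled")` returns orange, while the port arguments are ignored as soon as either router is not established.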