Manage Alarms with Centralized Alarm Policies

An alarm policy defines a set of metrics and alarm conditions in a centralized location, so that monitoring administrators can view and manage alarm reporting easily. Administrators can also create alarm policies in response to new conditions and needs. They can manage all aspects of alarm behavior in an alarm policy; for example, manage the alarm thresholds, timing, and messages configured for alarms. The Alarm Policies feature lets you perform the following actions:
  • View a list of alarm policies.
  • Add alarm policies.
  • Add and delete conditions that trigger an alarm. 
  • Add alarm conditions to monitor individual devices, a group of devices, or a specific monitoring technology (such as Docker).
  • Configure Time Over Threshold alarming to reduce alarm noise to an actionable level.
  • Customize alarm messages to provide the information you need.
Prerequisites
The following are the prerequisites for creating an alarm policy:
  • Ensure that the robot version is 7.96 or later.
    • If your robot version is 9.31 or 9.31S, ensure that the MCS version is also 9.31. If this compatibility is not maintained, MCS profiles and alarm policies will not work.
  • Ensure that the profile is an enhanced monitoring profile and it is collecting metrics.
  • Ensure that Monitoring Configuration Service (MCS) is already configured.
Create an Alarm Policy
You create an alarm policy entirely in Operator Console (OC). First, create an enhanced profile in OC with metrics collection enabled. You can create an alarm policy in Operator Console only after metrics collection starts.
Follow these steps:
  1. Log in to OC.
  2. Create an enhanced monitoring profile with metrics collection enabled. The following screenshot shows an enhanced monitoring profile in OC.
  3. In Operator Console, click Settings in the left pane.
  4. Click the Alarm Policies card.
    The Alarm Policies page opens.
  5. Click the plus icon at the bottom of the page.
  6. Enter a policy name in the Alarm policy name field.
    Enter a policy name that helps you distinguish one policy from other policies. If you are creating an alarm policy for a device or group, you can include the device name or IP address or the group name. Include key words in the name to make it easier to search for a specific policy.
  7. Click Add condition (Add Condition icon).
    The Set Condition dialog opens. This dialog lets you define alarm conditions.
    An alarm condition defines what is monitored. You can set alarm conditions for a group (device and container), a specific device, or a monitoring technology.
  8. Select the type of alarm condition on the Set Condition dialog:
    • Device
      Monitors the state or performance metrics for a device component.
      To configure an alarm condition for a device, select a device name, the metric, and the component that you want to monitor. The following example screenshot shows the settings for the Device type:
      filter_device.png
      When multiple hosts collect the same metric, the list of monitoring hosts is also displayed in the Set Condition dialog. You can select only one host at a time when you create an alarm condition. To collect the metric on another host, create another condition. The following example screenshot shows the Select a Monitoring Host section that appears when multiple hosts collect the same metric:
      select_monitoring_host.png
    • Monitoring technology
      Monitors metrics associated with a specific monitoring technology.
      To configure an alarm condition for a monitoring technology, select a monitoring technology, a configuration profile, and a metric. The following example screenshot shows the settings for the Monitoring Technology type:
      Monitoring_Technology_Set_Condition.png
    • Group
      Monitors the state or performance metrics for a group (container or device). Add alarm conditions that apply to all devices in a group. Groups are displayed as a navigation tree; container groups followed by subgroups. Expand a container group to select a subgroup. When you create the condition at the container group, all subgroups (child container groups and device groups) in that container group inherit it. Support for container group is helpful in scenarios where you want a single threshold for each metric on a device. That threshold policy can be rolled down from the container group to the device group and then to the device.
      To enable the alarm policy functionality for container groups, use the MCS raw configuration to set the value of the enable_container_support_for_alarm_policy parameter in the timed section to true. By default, the value is false.
      To configure an alarm condition for a group, select the group name and the metric that you want to monitor on all devices in a group. You can also specify whether you want to generate alarms on all the components or for some specific components. By default, alarms are generated on all the components. To generate alarms on specific components, use a regular expression to filter the components. Select one of the following options depending on your requirements:
      • All Components
        Lets you generate alarms on all the components of all devices in a group.
      • RegEx
        Lets you filter the components with a regular expression so that alarms are generated only on the matching components. Use metacharacters such as * and ? to construct the pattern. RegEx supports regular expressions written in Perl syntax. For example, to generate alarms on the CPU Usage of CPU-11, CPU-12, and CPU-13 on all the devices in a group, define the regular expression as CPU-1[1-3]. You can also use simple text with wildcard operators to match the target string. For example, the CPU* expression matches all the CPUs on the system (CPU-0, CPU-1, and so on through CPU-15). Some regular expressions are subject to limitations; see "What are the limitations for regular expressions usage?" in the FAQs.
      The following screenshot shows settings for the Group type:
      Group_Set_Condition.png
  9. Click OK to save the condition information.
  10. Specify an appropriate priority in the Priority field to evaluate the metric condition for the alarm policy at the group level. The condition that has the highest priority is used for generating alarms on the device. The priority value ranges from 0 through 10000. You can specify the priority value only for an alarm policy at the group level, not at the device level or monitoring technology level. At the device level, the priority of the condition is set to the highest value and takes precedence over other condition priorities for the same metric on that device. At the monitoring technology level, the UI does not show the condition priority, but CA UIM internally sets the value to 100, which cannot be changed. The default priority value is 100 at the group level and monitoring technology level.
    For more information about specific use cases, see the condition priority scenarios in the FAQs.
    The following screenshot shows the priority of a condition for a device-level alarm policy. Note that the priority is set to Highest and the value cannot be changed:
    Device-level-priority.jpg
    The following screenshot shows the priority of a condition for a group-level alarm policy. Note that the priority field shows the default priority of 100; you can change the value in this case:
    group-priority.jpg
  11. Set the alarm threshold by entering the alarm severity, threshold type (static or dynamic), operator, threshold value, and alarm timing (Immediate or Time over threshold), as needed.
    If you select Time over threshold, enter the number of minutes, hours, or days for which the metric must violate the threshold value. For example, when Time over threshold is 3 hours in 4 hours, Infrastructure Management generates an alarm when the threshold is violated for three consecutive hours within a four-hour period.
    The following screenshot shows the alarm condition with condition priority as 100 (default value), alarm severity as Critical, threshold type as static, operator as greater than, threshold value as 80, and alarm creation timing as Immediate:
    Group_Alarm_Policy_With_Condition.PNG
  12. Click the arrow next to the Alarm messages section to review the default alarm messages. You can also customize the alarm messages to contain additional information.
  13. Click Save (in the lower right corner) to create an alarm policy with one or more alarm conditions.
    This alarm policy generates alarms with the default alarm messages when the configured thresholds are violated.
The following example screenshot shows a created alarm policy:
Created_Alarm_Policy.PNG
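The Time over threshold timing described in step 11 can be illustrated with a small sketch. The following Python is not the Infrastructure Management implementation. It assumes evenly spaced samples, a "greater than" operator, and hypothetical names (time_over_threshold, sample_minutes, required_minutes, window_minutes), and it only checks whether the most recent window contains a long enough run of consecutive violations.

```python
from typing import Sequence


def time_over_threshold(
    values: Sequence[float],
    threshold: float,
    sample_minutes: int,
    required_minutes: int,
    window_minutes: int,
) -> bool:
    """Return True if the most recent window contains a run of consecutive
    samples above `threshold` that lasts at least `required_minutes`.

    Illustrative only: assumes evenly spaced samples and a '>' operator.
    """
    if sample_minutes <= 0:
        raise ValueError("sample_minutes must be positive")
    if required_minutes > window_minutes:
        raise ValueError("required time cannot exceed the window")

    # Convert both durations into a number of samples.
    window_samples = max(1, window_minutes // sample_minutes)
    required_samples = max(1, required_minutes // sample_minutes)

    # Only the most recent window is evaluated.
    recent = list(values)[-window_samples:]

    run = 0  # length of the current run of consecutive violations
    for value in recent:
        run = run + 1 if value > threshold else 0
        if run >= required_samples:
            return True
    return False
```

With hourly samples and a setting of 3 hours in 4 hours, time_over_threshold([70, 85, 90, 95], 80, 60, 180, 240) reports a violation, whereas [85, 70, 90, 95] does not, because the run of violations is interrupted.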
When you create an enhanced profile in OC and the probe template includes default threshold values, a default alarm policy is created in the Operator Console for this enhanced profile. The creator of the default alarm policy is displayed as CA default policy in the Operator Console. Additionally, when you convert a non-enhanced profile to an enhanced profile, a corresponding alarm policy is created in the Operator Console for the converted profile. Creating this alarm policy adds the threshold values that are present in the non-enhanced profile to the spooler metric (plugin_metric) section. The creator of this alarm policy is displayed as CA profile migration in the Operator Console.
Export/Import Alarm Policies
With UIM 20.3.0, the policy management API is enhanced to support export and import of alarm policies from one domain to another. To perform these operations, you must have the Policy Management ACL permission.
You can export the alarm policies based on the following:
  • Alarm policy identifiers
  • Group identifier
  • Device identifier
  • Technology – probe name
The supported export formats are XML and JSON.
To import alarm policies, map the groups, devices, and profiles to the GROUP, DEVICE, and TECHNOLOGY target types respectively.
A sample mapping file is shown below:
[ { "sourcePolicyTargetId": 0, "sourcePolicyTargetType": "DEVICE", "targetPolicyTargetId": 0 } ]
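As an illustration of the mapping format, the following Python sketch builds a target mapping file from a list of tuples. The function name build_target_mapping and the identifier values in the usage note are hypothetical. The field names and the three target types come from the sample above, and treating the identifiers as integers is an assumption based on that sample.

```python
import json

# Target types named in this section.
VALID_TARGET_TYPES = {"GROUP", "DEVICE", "TECHNOLOGY"}


def build_target_mapping(entries):
    """Return JSON text for a policy target map.

    `entries` is an iterable of (source_id, target_type, target_id) tuples,
    where the IDs identify the target in the source and destination domains.
    """
    mapping = []
    for source_id, target_type, target_id in entries:
        if target_type not in VALID_TARGET_TYPES:
            raise ValueError(f"unsupported target type: {target_type}")
        mapping.append(
            {
                "sourcePolicyTargetId": source_id,
                "sourcePolicyTargetType": target_type,
                "targetPolicyTargetId": target_id,
            }
        )
    return json.dumps(mapping, indent=2)
```

For example, build_target_mapping([(12, "DEVICE", 34), (5, "GROUP", 9)]) produces a two-entry map that could be saved to a file and supplied when importing.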
Newly Added APIs
POST /v0/policy/export
Input parameters:
  • policyIds - List of identifiers of the alarm policies to export.
  • groupId - Identifier of the group whose alarm policies are retrieved.
  • deviceId - Identifier of the device whose alarm policies are retrieved.
  • probe - Technology (probe name) whose alarm policies are retrieved.
  • policyFileType - Format of the exported file: JSON (default) or XML.
Returns the alarm policies file, in XML or JSON format, that can then be imported.
In Swagger, you can download the exported alarm policies by clicking the link in the response body.
ExportedPolicy.PNG
POST /v0/policy/import
Input parameters:
  • targetMappingFile - Policy target map in JSON format that maps each device, group, or technology in the source file to the corresponding target in the destination environment when alarm policies are imported.
    Example:
    [ { "sourcePolicyTargetId": 0, "sourcePolicyTargetType": "DEVICE", "targetPolicyTargetId": 0 } ]
  • policiesFile - File that contains the alarm policies to import.
Returns the list of alarm policies imported.
Policy Management in High Availability Mode
When the policy_management_ws probe is deployed on multiple wasp nodes, you must ensure that only one node processes the policies at any time. You can do this either by configuring the nodes manually or by running the policy_management_ws probe in High Availability (HA) mode. Perform the following configuration on the adminconsoleapp that is running on the primary hub (under the <adminconsole> tag).
To run the policy_management_ws probe in HA mode, follow these steps:
  1. In wasp.cfg, go to the folder: webapps/adminconsoleapp/custom/uncrypted
  2. Update the attribute ha_mode.
    Allowed values are: HA or MANUAL (default).
    • When set to HA, all the policy management nodes work in coordination with the adminconsoleapp running on the primary hub. The controller component that runs as part of adminconsoleapp decides which node processes the policies and makes sure that only one node processes them at a time.
    • When not set or set to MANUAL, each node reads the policy_processing flag from its own wasp.cfg file and processes the policies if the value is set to true.
  3. Update the additional attributes heartbeat_interval_min and no_failed_attempts.
    • heartbeat_interval_min : Defines the time interval that specifies how often the policy_management_ws nodes send the heartbeat to the controller running as part of adminconsoleapp. The default value is 5 minutes.
    • no_failed_attempts : Defines the number of failed attempts to send heartbeat before stopping the policy processing. The default value is 3. With the default configuration, the policy processing on a node stops in 15 minutes in case of communication issues between the controller and the node. After 20 minutes, a new node becomes the policy processing node.
  4. Click Save.
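The 15-minute figure quoted for the default configuration is consistent with multiplying the two settings. The short Python sketch below makes that arithmetic explicit. The function name is hypothetical, and the multiplication is an inference from the stated defaults rather than documented behavior. The 20-minute takeover figure is not derived here.

```python
def minutes_until_processing_stops(
    heartbeat_interval_min: int = 5,
    no_failed_attempts: int = 3,
) -> int:
    """Estimate how long a node keeps processing after heartbeats start failing.

    Assumption: the node stops once `no_failed_attempts` consecutive heartbeats,
    sent every `heartbeat_interval_min` minutes, have failed.
    """
    if heartbeat_interval_min <= 0 or no_failed_attempts <= 0:
        raise ValueError("both settings must be positive")
    return heartbeat_interval_min * no_failed_attempts
```

With the defaults, minutes_until_processing_stops() returns 15. Under the same assumption, lowering heartbeat_interval_min to 2 would give 6 minutes.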
Manual Configuration
If ha_mode is set to MANUAL, policy management works in manual mode. In this mode, administrators manually choose the failover node if the primary node fails.
To enable this option:
  1. Deploy the policy_management_ws probe on all the OC servers and set policy_processing to true on one of the nodes (the primary node), as shown in the following screenshot:
    The policy_processing = true entry is not present by default. You must add it to the wasp.cfg file on the OC server on which you want to process the alarm policies.
Centralized Threshold Management for Technologies Monitored Remotely
(From UIM 20.3.1) The alarm policy functionality provides a centralized threshold management for technologies that are monitored remotely. For remote probes, alarm policies are not tied with the robot, which implies that the same policies are not applied to all the devices that a remote probe manages. This ability lets you define separate thresholds for different devices or groups that are monitored through the same remote probe.
Therefore, for devices or groups that a remote probe manages, alarm policies are now applied only to those devices for which they are created. This ensures that alarms are generated only for the relevant devices, allowing you to manage your policies and alarms in a more efficient manner.
This functionality applies only to remote policies that are created after you upgrade to UIM 20.3.1. Note that UIM 20.3.1 is a patch release. The UIM 20.3.1 patch does not include an upgrade installer for the UIM Server. Instead, the patch includes separate standalone artifacts that you can use to upgrade the respective components. For more information about the artifacts that are available as part of the UIM 20.3.1 patch release, see the UIM 20.3.1 article.
Review the following example to understand how the enhanced functionality works.
Example
The example setup contains two groups: Group A and Group B. The first group includes two devices: vm1 and vm2. The second group also includes two devices: vm3 and vm4. The computer 12vm4 is acting as a monitoring host and is managing both the groups. The Network Connectivity MCS profile (net_connect probe) is created on this monitoring host.
The following screenshot shows the target devices under Group A and Group B, monitoring host (12vm4), and the MCS profile:
GroupA_Profile is the profile deployed on Group A (vm1 and vm2). This profile uses 12vm4 as its monitoring host in the profile configuration, and Ping Response Time is the metric that it collects. Group B has the same configuration: GroupB_Profile is the profile, with snw12vm4 as its monitoring host and Ping Response Time as the metric.
The following screenshot shows the configuration for Group A:
The alarm policy GroupA_AP is created on Group A, and the other alarm policy GroupB_AP is created on Group B. The following screenshot shows the two alarm policies:
If you check the alarms, you find that each policy creates alarms only on the devices for which it was created. The following screenshot shows that the GroupA_AP policy creates alarms on Group A (vm1 and vm2) and the GroupB_AP policy creates alarms on Group B (vm3 and vm4):
In this example scenario, prior to 20.3.1, a policy that was created on Group A was also applied to the Group B devices. With the enhanced functionality, each alarm policy is applied only to the devices in its associated group.
FAQs
This section provides more information on some specific areas related to alarm policy.
How do I create a new alarm policy in disabled state?
When you create an alarm policy in disabled state, the alarm policy is created successfully but it is not enforced by default. This ability gives you the option to evaluate your alarm policy before you enable it to receive alarms.
Follow these steps:
  1. Click Settings (Settings icon).
  2. Select the Alarm Policies card.
    A list of existing alarm policies appears.
  3. Click the plus icon at the bottom of the page.
    The new policy screen appears.
  4. Enter a name in the Alarm Policy Name field.
  5. Click Add condition (Add Condition icon).
  6. Select the type of alarm condition on the Set Condition dialog.
  7. Select options that apply to the type of alarm condition.
  8. Click OK to save the condition information.
  9. Set the alarm threshold. Modify the alarm severity, threshold type (static or dynamic), and alarm timing, as needed.
  10. Click the Save and Disable button.
    The alarm policy is created in the disabled state, and the status tag for the newly created alarm policy displays Disabled on the Alarm Policies page.
How do I disable (or enable) an existing alarm policy?
Disabling an existing alarm policy stops all alarms from that policy without deleting it. When you want to receive alarms from the policy again, enable it; you do not need to create a new alarm policy.
Follow these steps:
  1. Click Settings (Settings icon).
  2. Select the Alarm Policies card.
    A list of existing alarm policies appears.
  3. Click the required alarm policy.
  4. Toggle the option in the lower left corner to Disabled (or Enabled).
    The alarm policy is disabled (or enabled) and a relevant confirmation message is displayed. For example, the Disabled tag appears against the disabled alarm policy in the list of policies on the Alarm Policies page.
    The following screenshot shows an example where an existing alarm policy is disabled:
    Alarm_Policy_Disabled.PNG
To delete an existing alarm policy, click the Delete button (in the lower left corner).
How do I disable an alarm condition?
You can disable a specific alarm condition in an alarm policy. When a policy contains multiple conditions, disabling one condition does not affect the others: alarms are no longer generated for the disabled condition, while the conditions that are still enabled continue to generate alarms. For example, you have created an alarm policy for a device that the File and Directory Scan (dirscan) probe monitors. For the same metric, you have created two separate conditions with different threshold values. You now want to disable one of the conditions.
Follow these steps:
  1. Click Settings (Settings icon).
  2. Select the Alarm Policies card.
    A list of existing alarm policies appears.
  3. Click the required alarm policy.
  4. Scroll to the alarm condition that you want to disable.
  5. Select the Inline Menu (inline menu icon), and then select Disable condition.
  6. Select Save.
    The condition is disabled and alarms are no longer generated for it. The Disabled tag is displayed next to the condition. The following screenshot shows an example:
    Disable_Condition.PNG
    Enable_Condition.PNG
To enable the condition, select Enable condition, and click Save. The status of the condition changes and the Disabled tag no longer appears.
What are the limitations for regular expressions usage?
The following regular expressions cannot filter components for a group:
  • Incorrect regular expression: CPU-(0|1)
    Workaround: Use the regular expression CPU-[0-1]
    Matches the components: CPU-0 and CPU-1
  • Incorrect regular expression: CPU.11
    Workaround: Use the regular expression /CPU.11/
    Matches the component: CPU-11
  • Incorrect regular expression: total/i
    Workaround: Use the regular expression /[tT][oO][tT][aA][lL]/
    Matches all occurrences of the string total irrespective of the case. That is, the expression matches total, Total, tOtal, toTal, TotAl, TOTAL, and so on.
The following regular expression has limitations on how it searches for the components:
  • tmp1|tmp2: Matches all the directories that start with tmp1 (such as tmp1, tmp11, tmp14, tmp156, and tmp1.x) but only the exact name tmp2.
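Before you enter a pattern in the RegEx field, it can help to try it against a list of component names. The sketch below uses Python's re module, not the Perl-syntax engine that the product uses. For simple character-class patterns such as the workarounds above, the two behave the same. The helper name filter_components is hypothetical, the sample component names are made up, and matching each name in full is an assumption about how the filter is applied.

```python
import re


def filter_components(pattern: str, names):
    """Return the names that `pattern` matches in full (assumed behavior)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


cpus = ["CPU-0", "CPU-1", "CPU-2", "CPU-11", "CPU-12", "CPU-13", "CPU-14"]

# Character class from the RegEx example in the procedure.
print(filter_components(r"CPU-1[1-3]", cpus))   # ['CPU-11', 'CPU-12', 'CPU-13']

# Workaround for CPU-(0|1): use a character class instead of alternation.
print(filter_components(r"CPU-[0-1]", cpus))    # ['CPU-0', 'CPU-1']

# Workaround for total/i: spell out both cases of each letter.
labels = ["total", "Total", "TOTAL", "subtotal"]
print(filter_components(r"[tT][oO][tT][aA][lL]", labels))  # ['total', 'Total', 'TOTAL']
```

The slash delimiters shown in the workarounds (for example, /CPU.11/) belong to the product's RegEx field and are not part of the Python pattern.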
Which configuration file includes alarm policy-related information?
When an alarm policy is created, all alarm policy-related information is written to the plugin_metric configuration file (..\Nimsoft\plugins\plugin_metric\plugin_metric.cfg). MCS deploys the alarm policy to the spooler. The spooler reads the configuration and generates alarms based on the conditions. plugin_metric.cfg is the central place for all the alarm policies related to all the probes of a robot. The following plugin_metric.cfg snippet shows the information about an alarm policy for the dirscan probe:
policy_metric.png
Alarm policy logs are available under ..\Nimsoft\probes\service\wasp. The name of the log file is policy_management.log.
How do I correct the plugin_metric file?
When you create an alarm policy or an enhanced profile, its configuration information is written to the plugin_metric file. In robot versions prior to the secure versions, this information is sometimes not written properly. For example, you create an alarm policy, but its configuration is not deployed properly; the corresponding information is then not updated correctly in the plugin_metric file, which causes issues. Similarly, when you delete a child profile from the OC UI, the same information is not deleted from the plugin_metric file. This issue has been fixed in the robot versions released after CA UIM 9.2.0.
To resolve such issues in your environment, you can use the plugin_metric_correction callback that is available for the mon_config_service probe. This callback re-deploys enhanced profiles and alarm policies based on your input.
Follow these steps:
  1. Ensure that you do not create any MCS profiles or alarm policies when you are performing this operation.
  2. (Optional) Open the mon_config_service raw configuration and increase the thread count to 10 in the timed section for each of the following parameters:
    • device_processing_threads
    • config_deployment_threads
    We recommend that you increase the thread count so that the process completes quickly. After you complete the process, change the settings back to the original values.
  3. Access the probe utility (pu) for the mon_config_service probe.
  4. Locate and select the plugin_metric_correction callback from the drop-down list.
  5. Enter the appropriate information for the following parameters, as required:
    • process_all_devices_flag
      Enter true if you want to re-deploy enhanced profiles or alarm policies on all the devices. If you set this parameter to true, the remaining parameters are not required.
    • robot_names
      Enter the specific robot name on which you want to re-deploy the enhanced profiles or alarm policies. If you want to use more than one entry, enter a comma-separated list.
    • computer_system_ids
      Enter the specific computer system ID (cs_id) on which you want to re-deploy the enhanced profiles or alarm policies. If you want to use more than one entry, enter a comma-separated list.
    • cm_group_ids
      Enter the specific group ID on which you want to re-deploy the enhanced profiles or alarm policies. All the devices that are part of that group are considered for re-deployment. If you want to use more than one entry, enter a comma-separated list.
    Note: You can use any combination of robot_names, computer_system_ids, and cm_group_ids.
  6. Run the callback.
    A message appears in the right pane stating that the process has started for the devices. However, note that no completion message is displayed. The process completes all related tasks in the background. If you want to check the status, you need to verify the database.
  7. Verify the status by running the following queries:
    • select * from ssrv2policytargetstatus where cs_id in (<ID>);
    • select * from ssrv2profile where cs_id in (<ID>);
    The status OK means that the re-deployment has occurred without any issue.
  8. Similarly, to find whether any error has occurred, run the following query:
    • select * from ssrv2audittrail where userid like 'plugin_correction%';
    From the result of this query, note the object IDs (the failed computer system IDs), review the error messages, resolve them, and then run the callback again for the failed devices.
You have successfully repaired the plugin_metric file.
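When many devices are involved, the verification queries in step 7 can be generated rather than typed by hand. The following Python sketch only formats the two query strings from a list of computer system IDs. The helper name build_status_queries is hypothetical, treating cs_id values as integers is an assumption, and running the queries against the UIM database is left to your usual SQL client.

```python
def build_status_queries(cs_ids):
    """Return the two status queries for the given computer system IDs."""
    # int() rejects anything that is not a whole number, so arbitrary text
    # cannot end up inside the generated SQL.
    ids = [int(cs_id) for cs_id in cs_ids]
    if not ids:
        raise ValueError("at least one cs_id is required")
    id_list = ", ".join(str(i) for i in ids)
    return [
        f"select * from ssrv2policytargetstatus where cs_id in ({id_list});",
        f"select * from ssrv2profile where cs_id in ({id_list});",
    ]
```

For example, build_status_queries([101, 102]) returns the two queries with "101, 102" substituted for <ID>.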
What are the condition priority-related scenarios for alarm policies?
Consider the following sample hierarchy to understand various scenarios:
Priority Condition for Alarm Policy
  • This sample hierarchy includes a root container group (C1).
  • The root container group includes child container groups (C2, C3, C4, C5, and C6).
  • Two child container groups (C3 and C6) contain device groups (G1 in C3 and G2 in C6).
  • These device groups include certain devices (D1 in G1 and D1, D2 in G2). The device D1 is part of the two device groups G1 and G2.
  • An alarm policy condition (PC1, PC2, PC3, PG1, PC4, PC5, PC6, and PG2) is created for each group. The alarm policy conditions PG1 and PG2 are for device groups; all other alarm policy conditions are for container groups.
The following use cases show how alarm policies are applied to the device D1 in the context of the above hierarchy:
Use Case 1: Alarm policy with the condition having the same metric and the same priority
If a device is part of multiple groups where conditions have the same metric and the same priority, then all the conditions are applied to the device. For example, if the metrics and priorities are as follows, then all alarm policy conditions PC1, PC2, PC3, PG1, PC4, PC5, PC6, and PG2 are applied and corresponding alarms are generated. In this example, the metric M1 is present in all conditions, and all conditions have the same priority of 100. Therefore, eight alarms are generated in this case.
  • PC1
    Metric: M1, Priority: 100
  • PC2
    Metric: M1, Priority: 100
  • PC3
    Metric: M1, Priority: 100
  • PG1
    Metric: M1, Priority: 100
  • PC4
    Metric: M1, Priority: 100
  • PC5
    Metric: M1, Priority: 100
  • PC6
    Metric: M1, Priority: 100
  • PG2
    Metric: M1, Priority: 100
Use Case 2: Alarm policy with the condition having the same metric and different priorities
If a device is part of multiple groups where conditions have the same metric and different priorities, then the highest priority determines which alarms are generated. CA UIM verifies whether the conditions for the device contain different priorities for the same metric. If so, only the conditions with the highest priority are used. For example, if the metrics and priorities are as follows, then PC2 and PC4 have the highest priority of 200 for the same metric M1. In this case, only two alarms are generated, for the conditions PC2 and PC4, because they have the highest priority of all the conditions:
  • PC1
    Metric: M1, Priority: 100
  • PC2
    Metric: M1, Priority: 200
  • PC3
    Metric: M1, Priority: 100
  • PG1
    Metric: M1, Priority: 100
  • PC4
    Metric: M1, Priority: 200
  • PC5
    Metric: M1, Priority: 100
  • PC6
    Metric: M1, Priority: 100
  • PG2
    Metric: M1, Priority: 100
Use Case 3: Alarm policy with the condition having multiple metrics and the same priority
If a device is part of multiple groups where conditions have multiple metrics and the same priority, then all the metrics are applied to the device. For example, if the metrics and priorities are as follows, then two alarms are generated for the metric M1, two for M2, one for M3, one for M4, one for M5, and one for M6:
  • PC1
    Metric: M1, Priority: 100
  • PC2
    Metric: M1, Priority: 100
  • PC3
    Metric: M2, Priority: 100
  • PG1
    Metric: M3, Priority: 100
  • PC4
    Metric: M4, Priority: 100
  • PC5
    Metric: M5, Priority: 100
  • PC6
    Metric: M6, Priority: 100
  • PG2
    Metric: M2, Priority: 100
Use Case 4: Alarm policy with the condition having multiple metrics and different priorities
If a device is part of multiple groups where conditions have multiple metrics and different priorities, then the highest priority is taken into consideration and the corresponding metric is applied. For example, if the metrics and priorities are as follows, then two alarms are generated for the metric M1 because PC2 and PC4 have the highest priority (200):
  • PC1
    Metric: M1, Priority: 100
  • PC2
    Metric: M1, Priority: 200
  • PC3
    Metric: M2, Priority: 100
  • PG1
    Metric: M5, Priority: 100
  • PC4
    Metric: M1, Priority: 200
  • PC5
    Metric: M1, Priority: 100
  • PC6
    Metric: M3, Priority: 100
  • PG2
    Metric: M2, Priority: 100
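The four use cases can be summarized as one rule: for each metric, the conditions with the highest priority generate alarms. The Python sketch below encodes that reading. It is an interpretation of the use cases, not CA UIM code, and the function name effective_conditions is hypothetical. For Use Case 4 the text states the outcome only for M1, so resolving the other metrics independently is an assumption.

```python
from collections import defaultdict


def effective_conditions(conditions):
    """Return, per metric, the names of the conditions that generate alarms.

    `conditions` is an iterable of (name, metric, priority) tuples that all
    apply to the same device. Assumed rule: within each metric, only the
    conditions that share the highest priority are used.
    """
    by_metric = defaultdict(list)
    for name, metric, priority in conditions:
        by_metric[metric].append((name, priority))

    result = {}
    for metric, entries in by_metric.items():
        top = max(priority for _, priority in entries)
        result[metric] = [name for name, priority in entries if priority == top]
    return result


# Use Case 2: same metric, different priorities.
use_case_2 = [
    ("PC1", "M1", 100), ("PC2", "M1", 200), ("PC3", "M1", 100), ("PG1", "M1", 100),
    ("PC4", "M1", 200), ("PC5", "M1", 100), ("PC6", "M1", 100), ("PG2", "M1", 100),
]
print(effective_conditions(use_case_2))  # {'M1': ['PC2', 'PC4']}
```

Applied to the Use Case 1 data, the same function returns all eight conditions for M1. Applied to the Use Case 3 data, it returns two conditions each for M1 and M2 and one each for M3 through M6, which matches the alarm counts given above.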
Upgrade/Migrate Scenarios
While upgrading/migrating from a previous version to 9.2.0, the following scenarios are considered:
  • When you upgrade an existing alarm policy (created in 9.0.2) to 9.2.0, the priority of the condition for the upgraded alarm policy is set to 100 at the group level and monitoring technology level and to the highest value at the device level. The behavior of the upgraded alarm policy is the same as explained in the above-mentioned use cases (Use Case 1, Use Case 2, Use Case 3, and Use Case 4).
  • When you migrate a device-level legacy profile to an enhanced profile, the priority of the condition for the device-level alarm policy always gets the highest priority.
  • When you migrate a group-level legacy profile to an enhanced profile, the priority of the condition for the group-level alarm policy takes the same priority as that of the profile.
Additional Considerations
Review the following considerations:
  • The metric_precedence parameter in the plugin_metric.cfg file is updated with the condition priority.
  • When a new container is added to the hierarchy or an existing one is deleted from the hierarchy, the alarm policy is applied based on the new hierarchy. And, if the condition priority is the same, all the alarm policies in the hierarchy are applied to the device.
  • When an alarm policy is deleted from the hierarchy, all related entries are removed from the database and the plugin_metric.cfg file.
  • For two different alarm policy conditions for the same device and the same metric, alarms are generated from both the conditions as the priority remains the same for both of them.
  • If an alarm policy has multiple conditions and you make any update to the alarm policy, the priority of the conditions changes accordingly.
How do I determine if an alarm policy needs to be updated?
You should observe the existing alarms in the
Alarms
( Alarms View Icon ) view. You might need to update a policy if too many alarms are generated for a metric, if the performance levels you want to monitor are outside the industry norm, or if you want to differentiate monitoring for regional and global locations to account for localized issues. Once you develop a monitoring strategy, you can change alarm behavior by opening the alarm policy that generates the alarms and updating, adding, or deleting the alarm thresholds. See the next topic for information about accessing a specific alarm policy.
How do I access alarm policies?
Follow these steps:
  1. Click
    Settings
    ( Settings Icon ).
  2. Select the
    Alarm Policies
    card.
    A list of existing alarm policies appears.
  3. From the
    Alarm Policies
    view, click a policy name to view the configuration. Use the "Custom filter" field to quickly search for a specific policy. Click the column headings to sort policies alphabetically by technology, policy name, or creator.
The following information is provided in the policy list to help you locate a specific alarm policy.
  • Monitor
    - Displays the monitoring technology for an alarm policy.
  • Alarm policy
    - Provides the policy name and the metrics that are configured in the policy.
    The alarm policy name is either the name of the monitoring profile from which the alarm policy was generated, or the name you entered when you created the policy. Mouse over the metrics under the policy name to see a complete list of metrics configured in the policy.
  • Applies to
    - Shows the device, group, component, or combination of components monitored by a policy, and the type of target being monitored.
  • Creator
    - Displays the username of the user who created the alarm policy, or
    CA default policy
    if Infrastructure Management generated the alarm policy automatically. The date reflects the policy creation date or the date the policy was last updated.
Can I create several alarm conditions for the same metric?
You can configure several alarm conditions for the same metric. In the same alarm policy, you can configure the same alarm condition for the same metric but apply the metric thresholds to different groups. This approach provides consistent monitoring across the devices in various groups.
Example:
A monitoring administrator monitors Windows devices for the San Francisco, Chicago, and Boston business units. The Windows devices are grouped by business unit. Because alarm policies can contain alarm threshold configuration for more than one device, group, or technology, the monitoring administrator creates a single alarm policy to apply to the devices in the three business units individually.
One way to configure the alarm policy is to create an alarm condition for each group and each metric to be monitored. The following table shows an alarm condition created for the Boston and Chicago groups:
| Condition | Group | Metric | Monitoring probe | Component | Priority | Thresholds |
|---|---|---|---|---|---|---|
| Generate an alarm when the configured thresholds are violated. | Boston | Up time | cdm | All components | 100 | Critical, static, greater than, 80, Immediate |
| Generate an alarm when the configured thresholds are violated. | Chicago | Up time | cdm | All components | 100 | Critical, static, greater than, 80, Immediate |
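The per-group approach in this example can be pictured as data. The following Python sketch is illustrative only: the `AlarmCondition` class, its field names, and the `conditions_for_groups` helper are hypothetical and do not correspond to a product API. It builds one identical condition per group, using the metric, probe, priority, and threshold values from the table, and adds the San Francisco group from the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmCondition:
    group: str
    metric: str
    probe: str
    component: str
    priority: int
    thresholds: tuple  # each entry: (severity, type, operator, value, timing)


# Threshold values taken from the table in this example.
UPTIME_THRESHOLD = ("Critical", "static", "greater than", 80, "Immediate")


def conditions_for_groups(groups, metric="Up time", probe="cdm"):
    """Create one identical condition per group so monitoring stays consistent."""
    return [
        AlarmCondition(
            group=group,
            metric=metric,
            probe=probe,
            component="All components",
            priority=100,
            thresholds=(UPTIME_THRESHOLD,),
        )
        for group in groups
    ]


policy_conditions = conditions_for_groups(["Boston", "Chicago", "San Francisco"])
for cond in policy_conditions:
    print(cond.group, cond.metric, cond.thresholds[0])
```

Keeping the threshold definition in one place and repeating only the group mirrors the idea of applying the same thresholds to different groups within a single policy.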
Why would I change alarm thresholds?
Configured alarm thresholds are carried over from a monitoring profile during the one-time alarm policy generation process. You might want to change the threshold settings for the following reasons:
  • The alarm severity is too high or low.
  • Instead of receiving persistent (
    immediate
    ) alarms, you want to receive alarms only after successive alarm threshold violations have occurred within a configured window of time (
    Time over threshold
    ).
  • You want different performance thresholds for regional groups of computers, or for older versus new devices and servers.
How do I modify, add, or delete alarm thresholds?
Generated alarm policies provide alarms based on predefined, best-practice monitoring. Update the threshold settings to reflect your monitoring needs.
Follow these steps:
  1. In an alarm policy, scroll to the desired alarm condition.
  2. Click
    Expand
    (v) to view the configured thresholds.
  3. Modify the configured alarm severity, threshold type (static or dynamic), operator, or threshold value as needed.
  4. Modify the configured alarm creation timing.
    If you select
    Time over threshold
    , enter the number of minutes, hours, or days the metric needs to violate the threshold value. Next, enter the number of minutes, hours, or days to specify the total window of time. For example, when the
    Time over threshold
    is
    three hours in 4 hours
    , Infrastructure Management generates an alarm when there is a consecutive threshold violation for three hours within a four-hour time period.
  5. Click
    Add
    ( Add Icon ) or
    Delete
    ( Delete Icon ) to add or delete thresholds for a metric.
  6. Click
    Save
    (lower right corner) to save your change to the alarm policy.
    Note:
    You cannot save updates to an alarm policy until you have entered the required information for each threshold configured in an alarm condition.
    If you delete a threshold, alarms that were previously generated remain in the system until the close alarm rule time frame is reached.
Can I configure more than one threshold for a metric?
You can configure more than one threshold for a metric to track different severities. The following scenario describes a case in which several thresholds for a metric alert an administrator to perform different actions to address performance issues.
Use Case
To help you keep track of the user experience or determine when to upgrade equipment, you could configure different thresholds for CPU Usage. For example, you could configure the following three thresholds to generate alarms for different purposes:
  • To help you determine when equipment should be updated or replaced, configure a threshold that generates a critical alarm when CPU usage is at 95 percent for 24 hours within a 36-hour window (time over threshold alarming).
  • Configure a second threshold to generate a major alarm any time CPU usage exceeds 90 percent (immediate alarm). This alarm could help you track processing jobs that should be scheduled to run after hours.
  • Generate a minor alarm when CPU usage is greater than 60 percent for 4 days within a 5-day window of time (time over threshold alarming). This alarm would let you know that users are experiencing data processing delays.
The following screenshot shows several thresholds that are configured for a single metric.
Several thresholds configured for a single metric
How do I edit an alarm condition?
For any alarm condition, you can modify what is being monitored, the selected metric, and the threshold. You can also monitor the same metric for a device or group, or configure an alarm condition for a technology. When you configure alarm conditions for a technology, the alarm condition is applied to any device with that technology in your environment.
Follow these steps:
  1. Within an alarm policy, scroll to the
    Condition
    that you want to change.
  2. Click
    Edit
    .
  3. Modify any option on the
    Set condition
    dialog.
    1. Expand (v) Type, Device, Metric, Component, Monitoring technology, or Group.
    2. Select the desired setting.
    3. If you change the type of condition, ensure that all the options are configured.
    4. Click
      OK
      to save your updates.
  4. Expand (v)
    Thresholds
    .
  5. Modify existing alarm thresholds, if needed.
  6. Click
    Add threshold
    ( Add Threshold Icon ) to configure another threshold.
    1. Select an alarm severity, the type of threshold, an operator, and enter a threshold value.
    2. Next, select the timing for an alarm.
  7. Click
    Remove threshold
    ( Remove Threshold Icon ) to delete a configured threshold.
  8. Click
    Save
    (lower right corner) to save the updates to the alarm policy.
How do I delete an alarm condition?
When you delete an alarm condition from a policy, alarms are no longer generated for the metric. If the metric is enabled, CA UIM continues to generate metric data. CA UIM saves the alarm history for the configured period of time.
Follow these steps:
  1. Scroll to the alarm condition you want to delete.
  2. Click the
    Inline Menu
    ( Inline Menu Button Icon ), and then select
    Delete condition
    .
    Alarms are no longer generated for the deleted alarm condition.
How do I customize alarm messages?
Each alarm policy can have up to three predefined alarm messages: a general message, a Time Over Threshold message, and a close alarm message. These predefined messages provide sufficient information to help you troubleshoot an issue. However, you can customize the alarm messages to contain additional information. For each type of predefined message, there is a list of supported variables that you can use in a message to indicate the exact device and threshold violation details. A general message and a close alarm message appear for each alarm policy. The Time Over Threshold violation alarm message appears after a Time Over Threshold alarm is configured.
The default alarm violation messages and variables are:
  • Immediate threshold violation message
    ${metric_name} on ${component_name} for ${device_name} is at ${metric_value} ${metric_unit}.
    Example: CPU monitor on C:/ for test_system is at 90 percent.
  • Time over Threshold violation message
    ${metric_name} on ${component_name} for ${device_name} is at ${metric_value} ${metric_unit}. It has violated the threshold for at least ${tot_slider} ${tot_slider_unit} out of ${tot_time_frame} ${tot_time_frame_unit}.
    Example: CPU monitor on C:/ for test_system is at 90%. It has violated the threshold for at least 1 minute out of 5 minutes.
  • Close alarm message
    ${metric_name} on ${component_name} for ${device_name} is OK.
    Example: CPU monitor on C:/ for test_system is OK.
You can customize any of the default alarm violation messages to provide information that is relevant to your environment. You can enter text that describes a business location, or add the variables that provide the information you want. For a complete list of supported variables, see the Alarm Message Variables topic.
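To preview how a message reads once its variables are filled in, you can expand it outside the product. The following Python sketch is one way to do that, not a description of how Infrastructure Management expands variables: Python's `string.Template` happens to accept the same `${name}` placeholder syntax. The sample values and the "Boston data center" prefix are hypothetical.

```python
from string import Template

# Default immediate threshold violation message.
DEFAULT_MESSAGE = Template(
    "${metric_name} on ${component_name} for ${device_name} "
    "is at ${metric_value} ${metric_unit}."
)

# A customized message that adds free text describing a business location.
CUSTOM_MESSAGE = Template(
    "[Boston data center] ${metric_name} on ${device_name} "
    "is at ${metric_value} ${metric_unit}."
)

# Hypothetical sample values for one threshold violation.
values = {
    "metric_name": "CPU monitor",
    "component_name": "C:/",
    "device_name": "test_system",
    "metric_value": 90,
    "metric_unit": "percent",
}

default_text = DEFAULT_MESSAGE.substitute(values)
custom_text = CUSTOM_MESSAGE.substitute(values)
print(default_text)  # CPU monitor on C:/ for test_system is at 90 percent.
print(custom_text)   # [Boston data center] CPU monitor on test_system is at 90 percent.
```

`substitute` raises a `KeyError` when a placeholder has no value, which makes a misspelled variable name easy to catch before you paste the message into the Alarm Messages dialog.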
Follow these steps:
  1. Within an alarm policy, scroll to the Alarm messages section.
  2. Click the
    Inline Menu
    ( Inline Menu Button Icon ) for the message you want to change.
    The Alarm Messages dialog displays the alarm message and the available variables.
  3. Enter text and additional variables to modify the message.
  4. At any time, you can click
    Reset to Default
    to return the modified message to the predefined default settings.
  5. Click
    Save
    to update the message with your changes.
What do I need to know about alarm thresholds?
The alarm threshold settings determine when an alarm is generated. An alarm threshold consists of three elements:
  • Alarm Severity
    : The severity of an alarm.
    Alarms can be critical, major, minor, warning, or informational.
  • Threshold
    : Identifies how threshold violations are handled.
    A threshold is composed of a threshold type (static or dynamic), an operator, and a value.
    • Threshold type
      : For static alarms, violations are determined based on an absolute value that is collected for a metric. Dynamic alarms are generated when the calculated average trend is a configured percentage equal to, above, or below the calculated baseline for a metric.
    • Operator
      and Threshold Value
      : Identifies the acceptable state or level of performance.
      An alarm is generated when a sample, collected for a metric at a configured interval, violates the threshold value.
  • Alarm Creation Timing
    : Indicates how soon after a threshold violation an alarm is generated.
    Infrastructure Management can generate an alarm
    immediately
    after a threshold violation occurs or after a certain number of threshold violations occur within a configured time period (
    Time over threshold
    ).
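The three elements of a threshold can be modeled as a small data structure. The following Python sketch is a minimal illustration, assuming hypothetical field names and operator labels; it evaluates a single sample against a static threshold only and does not represent the product's internal logic.

```python
import operator
from dataclasses import dataclass

# Hypothetical operator labels mapped to Python comparisons.
OPERATORS = {
    "greater than": operator.gt,
    "greater than or equal to": operator.ge,
    "less than": operator.lt,
    "less than or equal to": operator.le,
    "equal to": operator.eq,
}


@dataclass
class Threshold:
    severity: str        # critical, major, minor, warning, or informational
    threshold_type: str  # "static" or "dynamic"
    op: str              # a key in OPERATORS
    value: float
    timing: str          # "immediate" or "time over threshold"

    def is_violated(self, sample: float) -> bool:
        """Check one collected sample against a static threshold value."""
        if self.threshold_type != "static":
            raise ValueError("this sketch only evaluates static thresholds")
        return OPERATORS[self.op](sample, self.value)


cpu_major = Threshold("major", "static", "greater than", 90, "immediate")
print(cpu_major.is_violated(93))  # True
print(cpu_major.is_violated(90))  # False
```

With immediate timing, each violating sample would lead to an alarm; with Time Over Threshold timing, the violations would first be accumulated over the configured window.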
What are alarm thresholds tied to?
An alarm threshold is tied to a single metric. You can configure alarm thresholds for a device, a monitoring technology, or a group.
What is the difference between a static and a dynamic alarm?
There are two types of alarms: static and dynamic. A static alarm is generated when a metric reaches a configured threshold value. For example, when CPU Usage on a target device reaches 95%, the policy generates a critical alarm. When you are monitoring a device that has persistent issues, consider configuring a static alarm.
Dynamic alarms are generated based on the moving average of the baseline data that was collected over the previous 28 days. When you specify a threshold value for a dynamic alarm, an alarm is generated when the calculated average of the data reaches the configured percentage above or below the average trend. The calculated average trend can change over time as the collected baseline data changes. If you enter a dynamic threshold of >10% for CPU Usage, and the average trend of CPU Usage for the last 28 days is 85, an alarm is generated when the CPU Usage goes above 95%.
When you are monitoring a healthy, stable device whose resources are used in a consistent manner, configure a dynamic alarm.
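The arithmetic of the dynamic example can be sketched as follows. This Python snippet is an assumption-laden illustration: the `dynamic_trigger_level` function is hypothetical, the baseline is reduced to a plain average of a few made-up samples, and the configured percentage is added to the average trend as percentage points because that reproduces the 85 to 95 example in this topic. The product's actual baseline calculation is not described here.

```python
from statistics import mean


def dynamic_trigger_level(baseline_samples, percent_above):
    """Return the level above which a dynamic 'greater than' alarm would fire.

    Assumption: the configured percentage is added to the average trend as
    percentage points, matching the 85 -> 95 example in this topic.
    """
    average_trend = mean(baseline_samples)
    return average_trend + percent_above


# Hypothetical baseline: CPU Usage samples whose average is 85.
baseline = [80, 85, 90, 85]
level = dynamic_trigger_level(baseline, 10)
sample_triggers_alarm = 97 > level
print(level, sample_triggers_alarm)
```

Because the trigger level is derived from the baseline, it moves as the collected data changes, whereas a static threshold stays at the value you configured.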
What is the difference between immediate and time over threshold alarming?
Infrastructure Management can generate an alarm
immediately
after a threshold violation occurs, or after a certain number of threshold violations occur within a configured time period (
Time over threshold
). Time Over Threshold is an event processing rule that reduces the number of alarms that are generated when threshold violation events occur. You can use Time Over Threshold to filter out data spikes and monitor problematic metrics over a set period. Instead of sending an alarm immediately after a threshold violation occurs, the Time Over Threshold function:
  • Monitors the events that occur during a user-defined sliding time window.
  • Tracks the length of time that the metric is at each alarm severity.
  • Raises an alarm if the cumulative time the metric is in violation during the sliding window reaches the set Time Over Threshold.
For example, you could configure a static or dynamic alarm that is generated when the threshold has been continuously violated for 5 minutes in a 15-minute sliding time period. The following figure shows when the alarm is generated.
Time Over Threshold Alarm
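The sliding-window behavior can be approximated with a short Python sketch. It is a simplified model, not the product's event processing rule: the `first_tot_alarm` function is a hypothetical name, samples are assumed to arrive once per minute, a violation means a value above a static threshold, and violation time is counted cumulatively within the window.

```python
from collections import deque


def first_tot_alarm(samples, threshold, required, window, interval=1):
    """Return the first timestamp at which a Time Over Threshold alarm is raised.

    samples   -- iterable of (timestamp_minutes, value), evenly spaced by `interval`
    threshold -- static value; a sample above it counts as a violation
    required  -- minutes in violation needed within the window
    window    -- length of the sliding window in minutes
    """
    recent = deque()  # (timestamp, violated) pairs inside the current window
    for timestamp, value in samples:
        recent.append((timestamp, value > threshold))
        # Drop samples that have slid out of the window.
        while recent[0][0] <= timestamp - window:
            recent.popleft()
        violation_minutes = sum(interval for _, violated in recent if violated)
        if violation_minutes >= required:
            return timestamp
    return None


# Violations at minutes 3, 4, 8, 9, and 10 add up to 5 minutes within 15 minutes.
samples = [(minute, 95 if minute in {3, 4, 8, 9, 10} else 50) for minute in range(20)]
alarm_minute = first_tot_alarm(samples, threshold=90, required=5, window=15)
print(alarm_minute)  # 10

# A single one-minute spike never reaches the required 5 minutes.
spike = [(minute, 95 if minute == 3 else 50) for minute in range(20)]
spike_result = first_tot_alarm(spike, threshold=90, required=5, window=15)
print(spike_result)  # None
```

The second call shows how the rule filters out a short data spike that an immediate alarm would have reported.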
Can I change the name of a monitoring profile after a corresponding alarm policy is generated?
Do not change the name of a monitoring profile after it is used to generate an alarm policy. Alarm policies are dependent on monitoring profiles. If you change the monitoring profile name or the corresponding alarm policy name, CA UIM stops generating alarms for the devices, groups, or technologies monitored by the alarm policy. Other than the lack of alarms, there is no indication or error message that a profile has been deleted.
Can I change the name of an alarm policy that was generated from a monitoring profile?
Do not change the name of an alarm policy generated from a monitoring profile. Alarm policies are dependent on monitoring profiles. If you change the monitoring profile name or the corresponding alarm policy name, CA UIM stops generating alarms for the devices, groups, or technologies monitored by the alarm policy. Other than the lack of alarms, there is no indication or error message that a profile has been deleted.
Can I delete the monitoring profile after the alarm policy is generated?
Do not delete a monitoring profile associated with an alarm policy. Alarm policies are dependent on monitoring profiles. If you inadvertently delete a monitoring profile, CA UIM stops generating alarms for the devices, groups, or technologies monitored by the associated alarm policy. Other than the lack of alarms, there is no indication or error message that a profile has been deleted.
How do I search for an alarm policy?
Click
Settings
( Settings Icon ), and then select the
Alarm Policies
card. A filtering mechanism is available in the top left corner of the alarm policies list. Enter a technology, an alarm policy name, a metric name, or a creator to search for a specific alarm policy.
How many alarm thresholds can I configure for a metric?
For a single metric, you can configure as many thresholds as you need to monitor a target device.
My alarms are chatty or I'm seeing alarm flapping. What can I do?
Consider adjusting the alarm threshold setting. If you created a monitoring configuration profile using the predefined threshold settings, these settings might not be appropriate for your environment. If you are seeing alarm flapping (an alarm is generated, quickly closed, and generated again within a short time period), consider configuring the Time Over Threshold timing option for an alarm. When you configure the Time Over Threshold (TOT) option, an alarm is generated only when the TOT threshold is reached the configured number of times during the configured sliding window.
How can I reset an alarm message to default settings?
You can return a customized alarm message to the predefined alarm message at any time.
Follow these steps:
  1. Click
    Inline Action
    ( Inline Menu Button Icon ) next to the desired alarm message.
  2. On the Alarm message dialog, click
    Reset to Default
    .
    The predefined message appears in the Alarm Messages panel. The next alarm that is generated displays the predefined alarm message.