The Time Over Threshold Event Rule

Prerequisites
To use Time Over Threshold, you must have the following probe versions installed at each hub level where Time Over Threshold functionality is desired:
  • alarm_enrichment 4.40 or later
  • baseline_engine 2.34 or later
  • nas 4.40 or later
  • Probe Provisioning Manager (PPM) 2.38 or later
  • prediction_engine 1.01 or later
Time Over Threshold and Secondary Hubs
The prerequisites for Time Over Threshold apply to any secondary hub in which you want Time Over Threshold functionality active. Once you have deployed the required probes and configured Time Over Threshold on your secondary hubs, you can forward your alarms to the primary hub using nas forwarding and replication.
TOT replication flow
(CA UIM 9.0.2) Prerequisites for Time Over Threshold Configuration in MCS
For more information about the prerequisites for Time Over Threshold configuration in MCS, see Configuring Alarm Thresholds in MCS.
Overview
Time Over Threshold (TOT) is an event processing rule that reduces the number of alarms that are generated when threshold violation events occur. You can use Time Over Threshold to filter out data spikes and monitor problematic metrics over a set period. Instead of sending an alarm immediately after a threshold violation has occurred, Time Over Threshold:
  • Monitors the events that occur during a user-defined sliding time window.
  • Tracks the length of time that the metric is at each alarm severity.
  • Raises an alarm if the cumulative time the metric is in violation during the sliding window reaches the set Time Over Threshold.
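The behavior described above can be illustrated with a minimal Python sketch. It is not the nas implementation. The function names, the representation of violations as (start, end) intervals in minutes, and the use of "greater than or equal" for "reaches" are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: each tuple is a period (in minutes)
# during which the metric was in violation.
Interval = Tuple[float, float]


def violation_minutes_in_window(intervals: List[Interval], now: float, window: float) -> float:
    """Sum the time in violation that falls inside [now - window, now]."""
    window_start = now - window
    total = 0.0
    for start, end in intervals:
        # Only the part of the interval that overlaps the window counts.
        overlap = min(end, now) - max(start, window_start)
        if overlap > 0:
            total += overlap
    return total


def tot_alarm_due(intervals: List[Interval], now: float, window: float, tot: float) -> bool:
    """Return True when cumulative violation time reaches the Time Over Threshold."""
    return violation_minutes_in_window(intervals, now, window) >= tot
```

With a 30-minute sliding window and a 10-minute Time Over Threshold, three separate violation periods of 4, 3, and 3 minutes, such as `[(2, 6), (12, 15), (20, 23)]`, reach the threshold at minute 23 in this sketch, while a single 2-minute spike never does.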
Example: Time Over Threshold in a Consecutive Block
This example uses the following settings:
  • Sliding Window:
    30 minutes.
  • Time Over Threshold:
    10 minutes.
  • Auto-Clear:
    Not set.
  • Alarm Severities:
    Clear, Information, Warning, Minor, Major, and Critical alarm thresholds are set in the probe GUI.
2320254.png
The time over threshold does not have to be consecutive within a sliding time window. All of the time that the metric spends in violation during the sliding window is counted toward the Time Over Threshold.
Example: Time Over Threshold in a Nonconsecutive Block
This example uses the following settings:
  • Sliding Window:
    30 minutes.
  • Time Over Threshold:
    10 minutes.
  • Auto-Clear:
    Not set.
  • Alarm Severities:
    Clear, Information, Warning, Minor, and Major alarm thresholds are set in the probe GUI.
2320634.png
Time Over Threshold Workflow
tot_probe_flow
  1. The baseline_engine probe evaluates QoS metrics from probes against static and dynamic threshold definitions.
  2. The baseline_engine probe generates threshold violation messages when thresholds are crossed.
  3. The nas probe implements the Time Over Threshold event processing rule to filter out data spikes. This event processing produces a more accurate reflection of threshold violation behavior.
(CA UIM 9.0.2) Configure Time Over Threshold in MCS
For more information about how to configure Time Over Threshold in MCS, see Configuring Alarm Thresholds in MCS.
Alarm Suppression During Time Over Threshold
After a metric reaches a Time Over Threshold state, an alarm is generated for each additional threshold violation. By default, these duplicate alarms increase the suppression count for the alarm but are otherwise not visible. If suppression is turned off, the duplicate alarms are treated as new alarms and are visible in USM or the nas GUI.
If the alarm is deleted (acknowledged) in UMP or Infrastructure Manager, the time window is not reset. The threshold breach at the probe must stay cleared long enough for the cumulative time in violation within the sliding window to fall below the Time Over Threshold before alarms are suppressed again.
Alarm Clear Conditions Using Time Over Threshold
Auto-clear is an optional setting that clears a Time Over Threshold alarm when there are no new threshold violation events for the defined time period. If auto-clear is turned on, a timer begins after a clear event is received. If no subsequent threshold violation events arrive in the auto-clear window after the clear event is received, the alarm is automatically cleared (set to level 0). The arrival of a threshold violation event resets the clear rule, which waits for the next clear event to arrive before the timer starts again.
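The auto-clear rule can be pictured as a small timer state machine. The following Python sketch is illustrative only. The class and method names are hypothetical, times are plain numbers in minutes, and it assumes that a repeated clear event does not restart a timer that is already running, which the text above does not specify.

```python
from typing import Optional


class AutoClearTimer:
    """Illustrative model of the auto-clear rule (not the nas implementation)."""

    def __init__(self, clear_delay: float) -> None:
        self.clear_delay = clear_delay
        self.deadline: Optional[float] = None  # None means the timer is not running

    def on_clear_event(self, now: float) -> None:
        # A clear event starts the timer (assumption: only if it is not already running).
        if self.deadline is None:
            self.deadline = now + self.clear_delay

    def on_violation_event(self, now: float) -> None:
        # A violation resets the rule; the timer waits for the next clear event.
        self.deadline = None

    def should_clear(self, now: float) -> bool:
        # The alarm is cleared once the auto-clear window passes without a violation.
        return self.deadline is not None and now >= self.deadline
```

For example, with a 5-minute clear delay, a clear event at minute 10 followed by a violation at minute 12 cancels the timer, and the alarm is cleared only after a later clear event is followed by five violation-free minutes.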
An auto-cleared Time Over Threshold alarm can be automatically acknowledged (and closed) using the Accept automatic 'acknowledgment' of alarm option in the nas probe GUI, which is enabled by default. If this option has been disabled, alarms remain in the alarm history with Clear (green) severity and must be manually acknowledged.
Auto-clear times are retained when the alarm_enrichment probe is not active. If the alarm_enrichment probe stops and is then reactivated, any running Auto-clear timers are restarted with either:
  • The time of the original Auto-clear, if it is still in the future.
  • One minute, if the original Auto-clear time is in the past.
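The restart rule above amounts to a single comparison. The following Python sketch shows one reading of it. The function name and the numeric time representation (minutes) are hypothetical, and it assumes that "one minute" means one minute from the moment the probe is reactivated.

```python
def restarted_auto_clear_deadline(original_deadline: float, now: float) -> float:
    """Return the Auto-clear deadline to use after alarm_enrichment is reactivated.

    Both arguments are times in minutes on the same clock (an assumption
    made for illustration).
    """
    if original_deadline > now:
        # The original Auto-clear time is still in the future: keep it.
        return original_deadline
    # The original time has already passed: clear one minute from now.
    return now + 1.0
```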
Example: Time Over Threshold Using Auto-Clear
This example uses the following settings:
  • Sliding Window:
    30 minutes.
  • Time Over Threshold:
    10 minutes.
  • Auto-Clear:
    5 minutes.
  • Alarm Severities:
    Clear, Information, Warning, Minor, and Major alarm thresholds are set in the probe GUI.
2320865.png
Alarm Severity Changes During Time Over Threshold
Time Over Threshold is evaluated at each user-defined event severity. This means that a metric must be at an elevated alarm severity for the defined Time Over Threshold before the severity changes. The new alarm severity level is then set to match cumulative event severity in the Time Over Threshold Window.
Each time a threshold violation event arrives, the Time Over Threshold alarm severity is determined as follows:
  1. The cumulative time of the threshold violation events within the sliding window with Critical severity is calculated. If that time exceeds the defined Time Over Threshold, the alarm severity is set to Critical and rule processing is complete.
  2. The cumulative time of threshold violation events within the sliding window with a severity that is Major or greater is calculated. If that time exceeds the defined Time Over Threshold, the alarm severity is set to Major and rule processing is complete.
  3. The cumulative time of threshold violation events within the sliding window with a severity that is Minor or greater is calculated. If that time exceeds the defined Time Over Threshold, the alarm severity is set to Minor and rule processing is complete. Otherwise, the algorithm continues in this pattern for the remaining severity levels.
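The evaluation order above can be sketched in Python as follows. This is an illustrative model, not the product code. The function name, the (start, end, severity) interval representation in minutes, and the severity numbering (1 = Information through 5 = Critical) are assumptions, and the sketch treats reaching the Time Over Threshold as sufficient.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical representation: (start_minute, end_minute, severity)
SeverityInterval = Tuple[float, float, int]


def tot_severity(
    intervals: List[SeverityInterval],
    now: float,
    window: float,
    tot: float,
    levels: Iterable[int] = (1, 2, 3, 4, 5),
) -> Optional[int]:
    """Return the Time Over Threshold alarm severity, or None if no alarm is due.

    `levels` lists only the severities that are configured for the metric.
    """
    window_start = now - window
    # Check the highest configured severity first and stop at the first match.
    for level in sorted(levels, reverse=True):
        total = 0.0
        for start, end, severity in intervals:
            if severity >= level:
                # Count only the part of the interval inside the sliding window.
                total += max(0.0, min(end, now) - max(start, window_start))
        if total >= tot:
            return level
    return None
```

Because each check counts events at the given severity or greater, time spent at a higher severity also contributes toward the lower severity levels.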
Example: Time Over Threshold with Increasing Severity
This example uses the following settings:
  • Sliding Window:
    20 minutes.
  • Time Over Threshold:
    10 minutes.
  • Auto-Clear:
    Not set.
  • Alarm Severities:
    Clear, Information, Warning, Minor, and Major alarm thresholds are set in the probe GUI.
  • Alarm Suppression:
    On.
2320864.png
In this example:
  1. Time 20
    - A Time Over Threshold alarm is raised after ten minutes of Time Over Threshold event time is accumulated. The alarm severity is set to 1, because the first Time Over Threshold rule condition that matches is 'event severity is 1 or greater'.
  2. Time 25
    - The severity is elevated to 2 because the Time Over Threshold rule condition 'event severity is 2 or greater' is now true.
  3. Time 30
    - The severity is elevated to 3 because the Time Over Threshold rule condition 'event severity is 3 or greater' is now true.
Time Over Threshold only evaluates on alarm severity levels that are set in the probe configuration GUI.
Example: Time Over Threshold with Two Set Severities
This example uses the following settings:
  • Sliding Window:
    30 minutes.
  • Time Over Threshold:
    10 minutes.
  • Auto-Clear:
    Not set.
  • Alarm Severities:
    Minor and Major alarm thresholds are set in the probe GUI.
2321566.png
In this example:
  1. Time 30
    - A Time Over Threshold alarm is raised after ten minutes of Time Over Threshold event time is accumulated. The Time Over Threshold alarm severity is set to 3, because the first Time Over Threshold rule condition that matches is 'event severity is 3 or greater'.
Example: Time Over Threshold With Multiple Severities
This example uses the following settings:
  • Sliding Window:
    8 minutes.
  • Time Over Threshold:
    4 minutes.
  • Auto-Clear:
    4 minutes.
  • Alarm Severities:
    Clear, Information, Warning, Minor, and Major alarm thresholds are set in the probe GUI.
  • Alarm Suppression:
    On.
2321247.png
In this example:
  1. Time 8
    - A Time Over Threshold alarm is raised after four minutes of Time Over Threshold event time is accumulated. The alarm severity is set to 1, because the first Time Over Threshold rule condition that matches is 'event severity is 1 or greater'.
  2. Time 10
    - The severity is elevated to 2 because the Time Over Threshold rule condition 'event severity is 2 or greater' is now true.
  3. Time 16
    - The severity is elevated to 3 because the Time Over Threshold rule condition 'event severity is 3 or greater' is now true.
  4. Time 21
    - The alarm severity decreases to 2 because there are no longer 4 minutes or more of severity 3 or greater within the 8-minute sliding window, but there are 4 minutes or more of severity 2 or greater.
  5. Time 25
    - The alarm severity decreases to 1 because there are no longer 4 minutes or more of severity 2 or greater within the 8-minute sliding window, but there are 4 minutes or more of severity 1 or greater.
  6. Time 30
    - The alarm is cleared because no new violations occur for four minutes and the auto-clear condition is met.
Supported Threshold Types
The static and dynamic threshold limit types are currently supported with Time Over Threshold. See Configuring Alarm Thresholds or Configuring Alarm Thresholds in MCS for more information.
The types of thresholds available vary by probe and by UI. Not all threshold types are supported by all probes in all UIs. If a threshold type is not configurable in a probe configuration UI, or in an MCS template, either the probe or MCS does not support that threshold type.
Additional Time Over Threshold Scenarios
The following examples show extra Time Over Threshold scenarios using specific probe metrics.
Example: URL_response Probe Metric Time to First Byte
This example uses the following settings:
  • Sliding Window:
    5 minutes.
  • Time Over Threshold:
    3 minutes.
  • Auto-Clear:
    Not set.
  • Alarm Severities:
    • Alarm Severity 2 is set to 100 ms.
    • Alarm Severity 3 is set to 300 ms.
    • Alarm Severity 4 is set to 700 ms.
    • Alarm Severity 5 is set to 1,000 ms.
  • Alarm Suppression:
    On.
2321586.png
In this example:
  1. Time 8
    - Three minutes of time to first byte of 100 ms or greater is observed in the sliding window and an alarm of severity 2 is sent.
  2. Time 14
    - Three minutes of time to first byte of 300 ms or greater is observed. The alarm increases to severity 3.
  3. Time 20
    - Three minutes of time to first byte of 700 ms or greater is observed. The alarm increases to severity 4.
  4. Time 25
    - Three minutes of time to first byte of 1,000 ms or greater is observed. The alarm increases to severity 5.
Example: CDM Probe Metric Disk Usage
This example uses the following settings:
  • Sliding Window:
    45 minutes.
  • Time Over Threshold:
    5 minutes.
  • Auto-Clear:
    Not set.
  • Alarm Severities:
    The Critical alarm threshold is set to 80% in the probe GUI.
2321585.png
In this example:
  1. The metric is over threshold for only four minutes, which is less than the five-minute Time Over Threshold, so no alarm is sent.
Example: CDM Probe Metric Disk Usage (Modified to Send a Time Over Threshold Alarm)
This example uses the following settings:
  • Sliding Window:
    15 minutes.
  • Time Over Threshold:
    5 minutes.
  • Auto-Clear:
    5 minutes.
  • Alarm Severities:
    The Critical alarm threshold is set to 80% in the probe GUI.
2321837.png
In this example:
  1. Time 15
    - Five minutes of disk usage at 80% or greater is observed in the sliding window and an alarm of severity 5 is sent.
  2. Time 21
    - The alarm is cleared after five minutes of time below the set severity level.
Best Practices for Time Over Threshold
Observe the following best practices when using Time Over Threshold:
  • Set the Time Over Threshold to a longer interval than the sample period for the QoS metric. Setting a smaller Time Over Threshold produces the same results as leaving the Time Over Threshold rule disabled.
  • Evaluate your monitored system and determine the appropriate values for both the sliding window and Time Over Threshold. Values that are too large for your system can result in the suppression of alarms you may need to be aware of.
Setting a smaller Auto-clear window may result in an excessive number of alarms as well as cause other unexpected alarm results.
The Clear Delay Time (TC) value MUST NOT be less than the Time Over Threshold (TOT) interval value for automatically clearing alarms.
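The best practices above can be checked before a configuration is saved. The following Python sketch is a hypothetical helper, not part of any probe or MCS API. All values are assumed to be in the same unit. The window check is an added inference: cumulative violation time inside a window can never exceed the window length.

```python
from typing import List, Optional


def check_tot_settings(
    sample_period: float,
    tot: float,
    window: float,
    clear_delay: Optional[float] = None,
) -> List[str]:
    """Return warnings for settings that conflict with the best practices."""
    warnings: List[str] = []
    if tot <= sample_period:
        warnings.append("Time Over Threshold should be longer than the QoS sample period.")
    if tot > window:
        # Inferred check: such a rule could never raise an alarm.
        warnings.append("Time Over Threshold is longer than the sliding window.")
    if clear_delay is not None and clear_delay < tot:
        warnings.append("Clear Delay Time (TC) must not be less than Time Over Threshold (TOT).")
    return warnings
```

An empty list means that none of these checks failed. It does not mean that the values are appropriate for the monitored system.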
Configure Time Over Threshold
Any alarms generated from a secondary nas must be passed to the primary nas using replication.
Time Over Threshold is configured using the individual probe UIs in Admin Console or by using the relevant templates in MCS.
The types of thresholds available vary by probe and by UI. Not all threshold types are supported by all probes in all UIs. If a threshold type is not configurable in a probe configuration UI, or in an MCS template, either the probe or MCS does not support that threshold type.
The following example shows the Time Over Threshold settings for the cdm probe Disk Usage metric:
screen.png
Follow these steps:
  1. In the probe GUI, select a node in the tree to view any associated monitors and QoS metrics.
  2. Select the monitor that you want to modify from the available list.
  3. Select the Publish Data, Publish Alarms, and Compute Baseline check boxes.
  4. The cdm probe only supports dynamic time over threshold calculations. Select the Dynamic Alarm check box.
  5. Configure the dynamic alarm settings. For more information, see the appropriate section in the Configuring Alarm Thresholds article.
  6. Select the Enable Dynamic Time Over Threshold check box.
  7. Enter values for the following fields:
    • Time Over Threshold <TOT>
      - The cumulative length of time within the sliding window that a metric must be over threshold before an alarm is sent.
    • Sliding Time Window <TW>
      - The length of the sliding window in which metrics are monitored for threshold violations.
    • Time Units for <TOT> and <TW>
      - The unit of measurement used by the Time Over Threshold and Sliding Time Window parameters. Limited to minutes, hours, or days.
    • Automatically Clear Alarm
      - Enables the Auto-clear functionality.
    • Clear Delay Time
      - The length of time used in the Auto-clear timer. If no alarms are sent in the set time period, the alarm is automatically cleared. If no clear delay time is set, alarms are never cleared.
    • Time Units for <TC>
      - The unit of measurement used by the Auto-clear timer. Limited to minutes, hours, or days.
  8. Save your changes.
Post-Rule Configuration Updates
After configuring Time Over Threshold, the following changes will take effect immediately:
  • New Time Over Threshold rules.
  • Changes to the Clear Delay Time parameter.
  • Changes to the Time Over Threshold active state.
Also, after your configuration is saved, the ppm probe most local to the probe you are configuring creates a bus message with the subject TOT_RULE_CONFIG. An associated queue on the hub named tot_rule_config is subscribed to the TOT_RULE_CONFIG message subject. The alarm_enrichment probe processes these messages and writes to a local file named rule_config.xml, which is stored in the directory <UIM_install>\probes\service\nas\alarm_enrichment. The following is an example rule_config.xml file that contains two rules.
image2017-9-22 16:2:49.png
When the alarm_enrichment probe starts, it reads rule_config.xml into memory. When an alarm is processed through the alarm_enrichment probe with a Met_id that matches one in the rule_config.xml file, the alarm is not posted to the alarm2 subject. This action means that the alarm is ignored during the Time Over Threshold period.
The following changes will take effect at the next received alarm:
  • Changes to the Time Over Threshold parameter.
  • Changes to the Sliding Time Window parameter.
Troubleshooting Time Over Threshold
I See Errors Regarding alarm_enrichment
Symptoms:
  • I have received a Critical alarm stating that the alarm_enrichment probe version is incorrect, or that the alarm_enrichment probe must be activated.
  • I see the following error message in the Admin Console probe configuration GUI:
    "Time over threshold is not available. Unable to read or write the configuration from the alarm_enrichment probe."
Solution:
  • Verify that alarm_enrichment probe version 4.40 or later is installed and activated at the hub level.
The Time Over Threshold Configuration Parameters are Unavailable
Symptoms:
  • I do not see the Time Over Threshold configuration parameters in the Admin Console GUI of my probe.
  • I do see the Dynamic Threshold configuration parameters.
  • I have received no additional error messages or alarms.
Solution:
  • Verify that the correct versions of nas, ppm, and prediction_engine (see Prerequisites) are installed and activated at the hub level.