How Threshold and State Definitions Work

Several alert mechanisms and actions are available. You can set threshold or state definitions to raise an exception or an alert when resource usage is outside a defined range or when a resource is not in the desired state.
Thresholds and states are different:
  • Threshold definitions assign numeric warning and problem limit values to a metric or resource.
    Thresholds are always numeric.
  • State definitions assign a status to a specific condition of a metric or resource.
    States always have character values.
Some examples of states include:
  • ACTIVE
  • INACTIVE
  • ONLINE
  • OFFLINE
The following sections describe the threshold and state rule types.
Threshold Rule Types
Threshold rule types define the algorithm that is used to determine when a collected metric has exceeded the allowable resource usage.
Multiple rule types or checks can be defined for a single metric:
  • AUTO
  • UPPER
  • LOWER
  • CHANGE
Rule Type: AUTO
A threshold rule type of AUTO uses the factory default of UPPER, LOWER, or CHANGE defined for the metric. The default value is displayed in the Rule field on the VARS command display.
Rule Type: UPPER
An UPPER limit rule type defines the upper warning and problem limit values for a collected metric. If the metric value reaches or exceeds the warning or problem limit value, the actions that are associated with the definition can be triggered.
Algorithm:
IF value >= problemlimit THEN
    status = PROBLEM
ELSE
    IF value >= warninglimit THEN
        status = WARNING
    ELSE
        status = NORMAL
    END
END
 
Example:
 The CPU percent busy is 93%.
Defined limit values:
  •    Warning - 80%
  •    Problem - 90%
Using the algorithm above, the status that is assigned to the metric is PROBLEM. The value (93%) is greater than the defined problem limit value of 90%.
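The UPPER check can be sketched in a few lines of Python. This is only an illustration of the algorithm above, not product code; the function name upper_status and its parameter names are hypothetical.

```python
def upper_status(value, warning_limit, problem_limit):
    """Evaluate an UPPER rule, where higher values are worse."""
    # Check the problem limit first so that it takes precedence.
    if value >= problem_limit:
        return "PROBLEM"
    if value >= warning_limit:
        return "WARNING"
    return "NORMAL"

# CPU percent busy example: value 93, warning 80, problem 90.
print(upper_status(93, warning_limit=80, problem_limit=90))  # PROBLEM
```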
Rule Type: LOWER
A LOWER limit rule type defines the lower warning and problem limit values for a collected metric. If the metric value falls to or below the warning or problem limit value, the actions that are associated with the definition can be triggered.
Algorithm:
IF value <= problemlimit THEN
    status = PROBLEM
ELSE
    IF value <= warninglimit THEN
        status = WARNING
    ELSE
        status = NORMAL
    END
END
 
Example:
The amount of free Common Storage is 128K.
Defined limit values:
  • Warning - 256K
  • Problem - 64K
Using the algorithm above, the status that is assigned to the metric is WARNING. The value (128K) is less than the defined warning limit value (256K), but is not less than the problem limit (64K).
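The LOWER check mirrors the UPPER check with the comparisons reversed. The following Python sketch illustrates the algorithm above; the function name lower_status is hypothetical, and values are assumed to be expressed in the same unit (here, K).

```python
def lower_status(value, warning_limit, problem_limit):
    """Evaluate a LOWER rule, where lower values are worse."""
    # Check the problem limit first so that it takes precedence.
    if value <= problem_limit:
        return "PROBLEM"
    if value <= warning_limit:
        return "WARNING"
    return "NORMAL"

# Free Common Storage example: value 128K, warning 256K, problem 64K.
print(lower_status(128, warning_limit=256, problem_limit=64))  # WARNING
```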
Rule Type: CHANGE
Workload and resource consumption can be very unpredictable, which makes it difficult to create a "baseline" of expected values for resource consumption.
A baseline value for a metric is simply the expected value of resource consumption for a known time period. For the baselining approach to work, workload must be predictable or must be a recurring event. Workload is not as predictable as it was in the past, largely due to workload arriving from internet and global applications. Because workload is unpredictable, a smarter detection method for analyzing exceptions is needed.
A CHANGE limit rule type defines the amount that a metric can change, in either an upward or a downward direction. The CHANGE rule does not require you to specify fixed values for a metric. Instead, the change is measured as the number of standard deviations from the historical average.
If the change in the metric exceeds the defined warning or problem limit value, the actions that are associated with the definition can be triggered.
Within the CHANGE rule type definition, the warning and problem limit values are defined as the number of standard deviations that must be exceeded for the metric to be assigned the status of WARNING or PROBLEM.
Generally, a data value that is within one standard deviation of the mean is considered NORMAL.
Standard Deviation (stddev)
Standard deviation is a measure of the variability or distribution of a set of data values. A low standard deviation indicates that the data values are close to the average or mean of the set. A high standard deviation indicates that the data is spread out over a wide range of values.
Mean (mean)
The average value for the entire set of data.
Duration Average (duravg)
The average value for the most current "n" intervals of data where "n" is the interval or duration in minutes.
Duration Change (durchg)
The difference between the duration average and the mean.
durchg = duravg - mean
Warning Limit Value (warninglimit)
The actual warning limit value is calculated dynamically during the threshold checking or evaluation process. The warning limit value must be calculated dynamically because it is based on the mean and standard deviation of the data.
warninglimit = warning * stddev
Problem Limit Value (problemlimit)
The actual problem limit value is calculated dynamically during the threshold checking or evaluation process. The problem limit value must be calculated dynamically because it is based on the mean and standard deviation of the data.
problemlimit = problem * stddev
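The terms defined above can be computed from a history of samples, as in the following Python sketch. It rests on assumptions that the source does not state: the function name change_terms is hypothetical, samples are assumed to be one value per minute with the oldest first, and the population standard deviation (pstdev) is used because the exact form is not specified.

```python
from statistics import mean, pstdev

def change_terms(samples, duration, warning, problem):
    """Compute the CHANGE rule terms from a list of samples.

    Assumptions: one sample per minute, oldest first, and a
    population standard deviation.
    """
    overall_mean = mean(samples)
    stddev = pstdev(samples)
    # Average of the most recent `duration` intervals.
    duravg = mean(samples[-duration:])
    return {
        "mean": overall_mean,
        "stddev": stddev,
        "duravg": duravg,
        "durchg": duravg - overall_mean,
        "warninglimit": warning * stddev,
        "problemlimit": problem * stddev,
    }

terms = change_terms([10, 20, 30, 40], duration=2, warning=1.0, problem=2.0)
print(terms["mean"], terms["duravg"], terms["durchg"])
```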
Algorithm:
IF durchg >= problemlimit THEN
    status = PROBLEM
ELSE
    IF durchg >= warninglimit THEN
        status = WARNING
    ELSE
        status = NORMAL
    END
END
 
Example:
 
      Metric:                      CPU percent busy
      The 60-minute mean:          40%
      The standard deviation:      12
      The duration average:        55%
      The duration change:         15        (duravg - mean)
      The duration change stddev:  1.250     (durchg / stddev)
A simple analysis of the mean and standard deviation indicates that the CPU percent busy average has typically been in the range of 28% to 52%, that is, within one standard deviation of the mean.
      -1 stddev    mean    +1 stddev
      ---------    ----    ---------
         28%       40%        52%
Defined limit values (number of standard deviations):
    • Warning - 1.000
    • Problem - 2.000
The warning and problem limit values that are used during the check process are calculated dynamically as follows:
warninglimit = warning * stddev = 1.000 * 12 = 12
problemlimit = problem * stddev = 2.000 * 12 = 24
The defined limit values have the following meaning:
    • If the duration change value (15) is greater than or equal to 24, the assigned status is PROBLEM.
    • Otherwise, if the duration change value (15) is greater than or equal to 12, the assigned status is WARNING.
Using the algorithm above, the status assigned to the metric is WARNING. The duration change value (15) is greater than the warning limit value (12), but is less than the problem limit value (24).
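The CHANGE evaluation for this example can be reproduced with a short Python sketch. The function name change_status is hypothetical. Taking the absolute value of the duration change is an assumption, based on the statement that the change can be upward or downward; the algorithm as written compares durchg directly.

```python
def change_status(durchg, stddev, warning, problem):
    """Evaluate a CHANGE rule from the duration change and stddev."""
    # Limits are computed dynamically from the standard deviation.
    warning_limit = warning * stddev
    problem_limit = problem * stddev
    # Assumption: treat upward and downward changes alike.
    magnitude = abs(durchg)
    if magnitude >= problem_limit:
        return "PROBLEM"
    if magnitude >= warning_limit:
        return "WARNING"
    return "NORMAL"

# CPU percent busy example: mean 40, duravg 55, stddev 12.
print(change_status(durchg=55 - 40, stddev=12, warning=1.0, problem=2.0))  # WARNING
```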
State Rule Types
States do not have multiple rule types. State exception processing assigns a status to the current state of a metric or resource.
Example: The current state of a CPU process is OFFLINE.
DEFINE CPUSTAT RSCE = STATE ONLINE STATUS NORMAL
DEFINE CPUSTAT RSCE = STATE OFFLINE STATUS WARNING
DEFINE CPUSTAT RSCE = STATE PARKED STATUS HIGH
Based on the example definitions, the status WARNING is assigned to a CPU in the state OFFLINE.
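Conceptually, state exception processing is a lookup from state to status. The following Python sketch models the three example definitions as a dictionary. The names STATE_STATUS and state_status are hypothetical, and returning None for a state that has no definition is an assumption, because the source does not describe that case.

```python
# Mirrors the three example DEFINE statements above.
STATE_STATUS = {
    "ONLINE": "NORMAL",
    "OFFLINE": "WARNING",
    "PARKED": "HIGH",
}

def state_status(state):
    """Return the status defined for a state, or None if undefined."""
    return STATE_STATUS.get(state.upper())

print(state_status("OFFLINE"))  # WARNING
```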
Notifications and Actions
Multiple notification methods and actions are available when an exception occurs:
  • Write a message to the log
  • Write a message to the console
  • Send an SNMP alert trap
  • Send an event notification to CA OPS/MVS
  • Request an Event Capture
  • Execute a pre-defined REXX EXEC or IMOD
  • Cancel a CICS transaction