Assisted Triage and Analysts

Assisted Triage is an engine and story generator. Assisted Triage identifies the most meaningful events that occurred in your busy systems and provides contextualized information (stories) about these events. These stories appear as problems and anomalies with headlines. The reliable and intelligent nature of the stories that Assisted Triage generates keeps you fully apprised of the state of your monitoring domain.
 
 
How Assisted Triage Works
Assisted Triage creates problems and anomalies about events in your system. Assisted Triage reacts to the following types of events:
  • Stalls
  • Errors
  • Alerts
  • Unstable response times
Problems and anomalies explain aspects of one or more events. For example, the aspects include:
  • WHAT summarizes the event including any suspected causes (the WHY). This information appears as a headline for a problem or anomaly in the Experience View and Analysis Notebook.
  • WHERE locates an event occurrence, typically by host and agent name. WHERE can include more details when they are available.
  • WHO identifies the transactions that are affected or might be impacted. This aspect also determines how many transactions are affected.
  • WHEN records the time of an event occurrence, typically the start and end of a stall event, an error event, or an instability.
  • WHY explains an event occurrence. For example, the following statement explains a High Call Ratio problem:
    Potential high call ratio from ViewOrders|service to 138.0.0.1_7080|getService 2 in the order of 214980
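The five aspects can be pictured as fields of a single record. The following Python sketch is illustrative only: the Story class, its field names, the host and agent values, the transaction names, and the timestamps are all hypothetical and do not reflect the actual Assisted Triage data model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Story:
    """Hypothetical record holding the aspects of one problem or anomaly."""
    what: str               # headline, including any suspected cause
    where: Dict[str, str]   # for example, host and agent name
    who: List[str]          # transactions that are affected or might be impacted
    when_start: datetime    # start of the stall, error, or instability
    when_end: datetime      # end of the stall, error, or instability
    why: str                # explanation of the event occurrence

    def affected_count(self) -> int:
        """WHO also determines how many transactions are affected."""
        return len(self.who)


# All values below are made up for illustration.
story = Story(
    what="Potential high call ratio from ViewOrders|service to "
         "138.0.0.1_7080|getService 2 in the order of 214980",
    where={"host": "host-01", "agent": "agent-01"},
    who=["ViewOrders", "PlaceOrder"],
    when_start=datetime(2024, 1, 1, 10, 0),
    when_end=datetime(2024, 1, 1, 10, 15),
    why="ViewOrders|service calls a backend an unusual number of times",
)
```

In this sketch, the WHAT field doubles as the headline that would appear in the Experience View and Analysis Notebook.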
The following diagram and corresponding steps describe how Assisted Triage works:
Assisted Triage Architecture
  1. Events in your APM system occur as variance intensity, errors, stalls, APM alerts, and so on. An event contains a possible suspect for causing the problem.
  2. An event generator gathers event data from different sources and sends the data to the event processor.
  3. The event contextualizer receives the events from generators across a cluster, processes the events, and gathers any related events into a context. The context information includes the potential impact of the leftmost component, and all the transactions that flow through the component.
    The contextualizer passes this context information to the editor.
  4. The editor tracks different contexts and assigns one reporter per specific event context for further analysis.
  5. Reporters know the different types of analysts that are available in the system, and run the context through each analyst. Analysts analyze the context for event types, patterns, and potential impact, and then each analyst creates a statement. Analysts work together to record evidence or create stories from the statements, and then store the data in the APM database. Stories are purged from the database when they are older than 62 days.
  6. The stories appear as problems or anomalies on the Experience View and Analysis Notebook.
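The steps above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the product implementation: the function names, the event dictionaries, the "Login" transaction, and the rule that events are related when they share a transaction are all hypothetical. Only the 62-day retention period comes from the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta

RETENTION = timedelta(days=62)  # from the text: stories older than 62 days are purged


def contextualize(events):
    """Gather related events into contexts.

    Assumption: events are related when they share a transaction.
    """
    contexts = defaultdict(list)
    for event in events:
        contexts[event["transaction"]].append(event)
    return dict(contexts)


def error_analyst(context):
    """Hypothetical event analyst: creates a statement about error events."""
    count = sum(1 for e in context if e["type"] == "error")
    return f"{count} error event(s)" if count else None


def stall_analyst(context):
    """Hypothetical event analyst: creates a statement about stall events."""
    count = sum(1 for e in context if e["type"] == "stall")
    return f"{count} stall event(s)" if count else None


ANALYSTS = [error_analyst, stall_analyst]


def report(context_id, context, now):
    """A reporter runs one context through every analyst and builds a story."""
    statements = [s for s in (analyst(context) for analyst in ANALYSTS) if s]
    return {"context": context_id, "statements": statements, "created": now}


def purge(stories, now):
    """Keep only the stories that are not older than the retention period."""
    return [s for s in stories if now - s["created"] <= RETENTION]


events = [
    {"type": "error", "transaction": "ViewOrders"},
    {"type": "stall", "transaction": "ViewOrders"},
    {"type": "error", "transaction": "Login"},
]
now = datetime(2024, 1, 1)
# The editor role is reduced here to assigning one report() call per context.
stories = [report(cid, ctx, now) for cid, ctx in contextualize(events).items()]
```

Keeping each analyst as an independent function mirrors the idea that a reporter runs the same context through every available analyst and collects one statement from each.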
 
Note: The Enterprise Manager generates and collects metrics about the Assisted Triage components. These supportability metrics are useful in assessing the Enterprise Manager health.
Analysts
Analysts are like medical specialists who know how to diagnose specific classes of illnesses. Assisted Triage uses the following main types of analysts. Each type of analyst includes other specific analysts.
 
Event analysts look for certain event types and create event statements that serve as evidence. Examples of event analysts include:
  • A Differential Analysis analyst checks for variance intensity
  • An error analyst checks for error events in contexts
  • A resource event analyst monitors alert events on system resources
 
Pattern analysts look for certain patterns in the context and create pattern statements. These statements are one part of a story summary. Examples of pattern analysts include:
  • A default analyst determines the deepest component in a context (per the relationship map). The default analyst is also referred to as the Zone Identifier.
  • A High Call Ratio analyst looks for the deepest component in the given context (per the relationship map). This analyst sees if the component calls any backend type nodes an unusual number of times.
The statements from the analysts form a story summary.
Story Example: Default Analyst (Zone Identifier)
This example explains a default analyst (or Zone Identifier) story. This analyst always runs, whether or not other specific analysts find patterns. The default analyst identifies a probable zone. The zone can be a frontend, a backend, or an internal component between them. For example, a statement from the default analyst looks like this headline:
Problem isolated to {type} {component}
{type} can be a frontend, business transaction, internal component, or backend.
{component} is the component name involved in the zone.
For example, consider the following components in the system:
  • Frontend F
  • Backend B
  • Internal Component M
All these components are related because they belong to the same business transaction: F->M->B.
The following sequence of events occurs in the transaction flow:
  1. Events occur which are only related to Frontend F.
    The default analyst story reports an event that is isolated to Frontend F.
  2. An event occurs for Internal Component M.
    The default analyst relates these two events because they are in the same transaction flow. The analyst states: Problem isolated to internal component M
  3. An event occurs for Backend B.
    The default analyst combines all three events and states: Problem isolated to backend B 
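The sequence above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the default analyst simply reports the deepest component in the flow that has an event. The FLOW and TYPES tables and the function name are hypothetical.

```python
# Transaction flow from the example: frontend -> internal component -> backend.
FLOW = ["F", "M", "B"]
TYPES = {"F": "frontend", "M": "internal component", "B": "backend"}


def default_analyst_headline(components_with_events):
    """Return the headline for the deepest component in the flow that has an event.

    Assumption: "deepest" means the last affected component in FLOW order.
    """
    affected = [c for c in FLOW if c in components_with_events]
    if not affected:
        return None
    deepest = affected[-1]
    return f"Problem isolated to {TYPES[deepest]} {deepest}"


# Replay the three steps: events arrive for F, then M, then B.
seen = set()
headlines = []
for component in ["F", "M", "B"]:
    seen.add(component)
    headlines.append(default_analyst_headline(seen))
```

After each new event, the headline moves one step deeper into the flow, which matches the three statements in the sequence.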
Browse the anomalies and problems on the Experience View and Analysis Notebook for a headline that includes a type and component, for example:
Problem isolated to internal component AxisServlet|service
This headline describes a Default Analyst story. For example, the details might describe a problem in the zone between the frontend and backend transactions for ACME app.
Story Example: High Call Ratio
This example explains how Assisted Triage reports a High Call Ratio story. A High Call Ratio story occurs when a client component issues too many transactions of its own, bogging down the overlying transaction that initiated it. In other words, the ratio of caller to callee has a low number for the caller and a high number for the callee, for example, 1:20. This ratio shows that one call to the caller results in 20 calls to the callee. The pattern analyst reports High Call Ratio stories for backend nodes or components such as databases or web services clients.
The following symptoms can indicate a High Call Ratio problem:
  • High latency "before" a component in the call stack, combined with low latency for the component itself, indicates High Call Ratio use at or before that component.
  • Long latency transactions with "bar code" patterns: Component A calls component B numerous times in a short interval. This behavior typically results in normal latency for B but high latency for A.
Browse the anomalies and problems on the Experience View and Analysis Notebook to identify a High Call Ratio headline. For example, the anomalies and problems show the following headline:
Potential high call ratio from {culprit.name} to {calledComp.name} in the order of {ratio}
This headline describes a High Call Ratio story. For example, the details might describe a latency problem for a client connection to a database in New York.
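The ratio and the headline template can be illustrated with a short Python sketch. The function names and the call counts are hypothetical, and the way the product formats the {ratio} value is not documented here, so the rounding below is an assumption.

```python
def call_ratio(caller_calls: int, callee_calls: int) -> float:
    """Return the number of callee calls per caller call (1:20 is returned as 20.0)."""
    if caller_calls <= 0:
        raise ValueError("caller_calls must be positive")
    return callee_calls / caller_calls


def high_call_ratio_headline(culprit: str, called: str, ratio: float) -> str:
    """Fill in the headline template from the text.

    Assumption: the ratio is shown as a rounded whole number.
    """
    return (
        f"Potential high call ratio from {culprit} to {called} "
        f"in the order of {round(ratio)}"
    )


# Made-up counts: 5 calls to the caller result in 100 calls to the callee (1:20).
ratio = call_ratio(5, 100)
headline = high_call_ratio_headline(
    "ViewOrders|service", "138.0.0.1_7080|getService 2", ratio
)
```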
Example: How an Analyst Determines the Deepest Component in a High Call Ratio Story
The following example shows how a pattern analyst looks for the deepest component in a High Call Ratio story:
  1. Differential Analysis triggers an alert--a transaction has slowed and it is now part of a story.
  2. An event occurred on the call path from the transaction. The analyst searches for the deepest component in the context.
  3. One component is a dead end. The component does not call a backend, so the analyst ignores it.
  4. One component calls a backend. Using historical data, the analyst compares the number of Responses Per Interval at the calling component to the number of Responses Per Interval for the backend calls. If the ratio is high (for example, greater than 1:50), the transaction has an abnormally high call ratio that could be adversely affecting the performance of the app.
    Other components can also have a high ratio of calls to the database. The analyst diagnoses a high ratio only for components that are on the call path from a frontend. The analyst is not concerned with the entire app; rather, it acts as a detective for an identified story.
Deepest Component Example High Call Ratio Story
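The search described in this example can be sketched as a walk over the call path. This is a simplified illustration, not the analyst's actual algorithm: the call graph, the component names, and the Responses Per Interval values are made up, and the threshold of 50 is taken from the "1:50" example rather than from a documented setting.

```python
THRESHOLD = 50  # assumed from the 1:50 example; the real threshold is not documented here

# Hypothetical call graph: component -> components that it calls.
CALLS = {
    "Frontend": ["ServiceA", "ServiceB"],
    "ServiceA": [],             # dead end: does not call a backend
    "ServiceB": ["Database"],
    "BatchJob": ["Database"],   # high ratio, but not on the call path from the frontend
}
BACKENDS = {"Database"}

# Hypothetical historical Responses Per Interval values.
RESPONSES_PER_INTERVAL = {
    "Frontend": 10, "ServiceA": 10, "ServiceB": 10, "BatchJob": 1, "Database": 600,
}


def find_high_call_ratios(frontend):
    """Walk the call path from a frontend and flag high caller-to-backend ratios."""
    findings = []
    stack, visited = [frontend], set()
    while stack:
        component = stack.pop()
        if component in visited:
            continue
        visited.add(component)
        for callee in CALLS.get(component, []):
            if callee not in BACKENDS:
                stack.append(callee)  # keep descending toward deeper components
                continue
            caller_rpi = RESPONSES_PER_INTERVAL[component]
            if caller_rpi <= 0:
                continue  # no caller traffic, so no meaningful ratio
            ratio = RESPONSES_PER_INTERVAL[callee] / caller_rpi
            if ratio > THRESHOLD:
                findings.append((component, callee, ratio))
    return findings


findings = find_high_call_ratios("Frontend")
```

In this sketch, BatchJob also has a high ratio of calls to the database, but it is never visited because it is not on the call path from the frontend.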
Resource Event Analyst Support
Assisted Triage uses a resource event analyst to monitor alerts on resource events like CPU, memory, and so on as follows:
  1. An application experiences problems together with, or because of, system resource issues.
  2. The resource events are listed as suspects for the given problem or anomaly.
Resource analysts support agents for CA APM and CA APM Infrastructure Agent. Assisted Triage provides context for the infrastructure information that Infrastructure Agent reports for an application. Alerts that Infrastructure Agent sends are incorporated into Assisted Triage stories as evidence. For example:
  1. A CPU is running high on a server.
  2. Infrastructure Agent reports this problem.
  3. Assisted Triage associates this resource problem with the impacted application.
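The association in these steps can be pictured with a short Python sketch. The correlation map, the host name, the second application name, and the dictionary layout are all hypothetical. They only illustrate how a resource alert might be listed as a suspect for the stories of the impacted application.

```python
# Hypothetical correlation map: infrastructure component -> applications that use it.
CORRELATIONS = {"host-01": ["ACME app"]}


def attach_resource_alerts(stories, alerts):
    """List each resource alert as a suspect on the stories of the impacted application."""
    for alert in alerts:
        impacted_apps = CORRELATIONS.get(alert["host"], [])
        for story in stories:
            if story["application"] in impacted_apps:
                story["suspects"].append(f"{alert['resource']} alert on {alert['host']}")
    return stories


stories = [
    {"application": "ACME app", "suspects": []},
    {"application": "Other app", "suspects": []},
]
alerts = [{"host": "host-01", "resource": "CPU"}]
attach_resource_alerts(stories, alerts)
```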
The following prerequisite steps apply to the Resource Event Analyst:
  1. Ensure that CA APM Infrastructure Agent monitoring is enabled and alerts are mapped to infrastructure components.
  2. In the Map View, select Application Layer to see application components.
  3. Click an application component on the Map.
    Correlation values should exist, including the corresponding infrastructure component.
 