Root Cause Analysis and Alarm Correlation

A large banking company with a vast infrastructure must stay online at all times to maintain the continuity of its business processes.

The following scenario explains how Sarah (Business Owner) checks the health of the business-critical MobileBanking service and works with Quan (IT Operations Owner) to identify the root cause of an ailing service.
  1. Sarah logs into the CA Digital Operational Intelligence Console.  
    The Service Analytics Overview page displays all the available services and their overall health.
  2. Sarah identifies that the MobileBanking service is in an unhealthy state. The alarm and availability indicators for the service are poor, while the user count is high.
  3. Sarah clicks the MobileBanking service on the Overview page and is redirected to the MobileBanking summary page.
  4. Sarah views the detailed information about the service and the Key Performance Indicator (KPI) trends, and observes that the ServiceAlarms are in a critical state.
  5. Sarah contacts Quan to troubleshoot this issue. 
  6. Quan logs into the CA Digital Operational Intelligence Console.  
  7. Quan clicks the Alarms Overview widget on the MobileBanking summary page and is redirected to the Alarm Analytics page in the context of the service.
  8. Quan finds all the alarms that are associated with the devices and mapped to the MobileBanking service. 
    The Alarm Analytics page displays the service alarms for the MobileBanking service. Service alarms are smart alarms that the analytics engine generates to help Quan identify the issue faster. Quan can also view all alarms. For more information about service alarms, see Alarm Analytics.
  9. To verify the memory utilization, Quan clicks the service alarm for the device (db-server-pc-01).
    The Overview, Affected metric, and Impacted services tabs are displayed in the Alarm details view. Quan observes that the metric chart in the Affected metric tab shows the anomaly alarm overlaid on the chart.
    An anomaly alarm is generated by the Data Science Engine for the configured metrics. This alarm is generated when the value of a configured metric crosses a threshold. For more information about anomaly alarms, see Alarm Analytics.
  10. Quan wants to view the behavior of other metrics to troubleshoot the issue. Quan clicks the Correlated metrics link, which redirects to the Performance Analytics page in the context of the device, metric, and time.
  11. From Available metrics, Quan adds the Paging and CPU Usage metrics. Quan observes no spike on this chart for the db-server-pc-01 device.
  12. Quan decides to add metrics for another device (app-server-pc-2). Quan observes that the memory utilization of the app-server-pc-2 device shows a similar trend.
  13. Quan verifies the number of users on the MobileBanking summary page and decides to verify the user traffic. Quan adds the Aggregated Traffic metric for both devices.
  14. Quan observes that the aggregated traffic shows a spike at the same time for both db-server-pc-01 and app-server-pc-2. 
Result:
Quan concludes that the MobileBanking service might be experiencing organic growth in traffic. To support this scale, the memory allocation must be increased so that the availability of the service is not impacted.
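
The reasoning in this scenario (a metric crosses a threshold, and metrics that spike at the same time are compared) can be illustrated with a minimal Python sketch. This is not how the Data Science Engine or the Performance Analytics page is implemented. The function names, the sample values, and the threshold of 80 are all hypothetical, and a plain Pearson correlation is used here only as a simple stand-in for whatever correlation logic the product applies.

```python
from math import sqrt


def threshold_crossings(values, threshold):
    """Return the sample indices where a metric exceeds the threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same eight points in time.
memory_db = [52, 54, 53, 55, 88, 91, 90, 56]        # db-server-pc-01 memory %
memory_app = [48, 49, 50, 51, 84, 87, 85, 52]       # app-server-pc-2 memory %
cpu_db = [30, 31, 29, 30, 31, 30, 29, 30]           # db-server-pc-01 CPU %
traffic = [120, 125, 118, 130, 410, 450, 430, 135]  # aggregated traffic

if __name__ == "__main__":
    # Assumed threshold of 80% for the memory metric.
    print("memory samples above threshold:", threshold_crossings(memory_db, 80))
    print("memory (db) vs traffic:", round(pearson(memory_db, traffic), 3))
    print("memory (app) vs traffic:", round(pearson(memory_app, traffic), 3))
    print("cpu (db) vs traffic:", round(pearson(cpu_db, traffic), 3))
```

With these made-up samples, memory utilization on both devices moves closely with aggregated traffic while CPU usage does not, which mirrors the observation that leads Quan to suspect traffic growth rather than a fault on a single device.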