Fault Tolerance

Fault tolerance enables your CA Performance Management environment to continue operating properly when a hardware failure or network issue occurs. In an environment where fault tolerance is configured, a secondary inactive Data Aggregator automatically becomes active. The newly active Data Aggregator takes over to organize and feed data to Performance Center and the Data Repository. The newly active Data Aggregator retains all state information from the previously active Data Aggregator. When the host with the network issue or hardware failure is available again, the host is automatically available for failover. For information about viewing the health of your system, see View the Health of the System.
If failover occurs but the ActiveMQ process is still running, manually stop ActiveMQ with the following command:
 
service activemq stop
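To confirm that the broker is down, you can query the service status. This is a minimal check that assumes the ActiveMQ init script supports the standard status action; if it does not, check for the process directly instead:
service activemq status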
 
 
Fault Tolerant Architecture
The following diagram shows the system architecture of a fault tolerant environment:
Træfik is a modern HTTP reverse proxy and load balancer that is made to deploy microservices with ease. Consul is a tool that is used to manage services in the CA Performance Management deployment.
High Availability (diagram)
Hardware Requirements
The following additional hardware is required for a fault tolerant environment:
  • One additional Data Aggregator server
  • A proxy server
  • A new shared data directory (for example, /DASharedRepo) that both Data Aggregator hosts can access with the same user ID. Data from whichever Data Aggregator is active is stored in this directory.
    For information about the sizing requirements, see the CA Performance Management Sizing Tool.
    If you are using NFS, only NFS 4 and higher is supported because of the ActiveMQ Kaha locking requirements.
    The shared data directory must be accessible at all times. If the shared data directory is down and inaccessible, no data is loaded and data loss occurs.
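For illustration, the following is a minimal NFS 4 mount sketch for the shared directory. The server name and export path (nfs-server:/exports/DASharedRepo) are placeholders, not values from your environment:
# Mount the shared directory over NFS 4 on each Data Aggregator host.
mount -t nfs4 nfs-server:/exports/DASharedRepo /DASharedRepo
# Or make the mount persistent with an /etc/fstab entry:
# nfs-server:/exports/DASharedRepo  /DASharedRepo  nfs4  defaults  0 0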
Data Loss Comparison
In a fault tolerant environment, some data loss can still occur when a hardware failure or network issue occurs. However, the amount of data loss is less than in an environment without fault tolerance configured. The following comparison shows the data loss from a hardware failure and from a network outage, with and without fault tolerance configured:
What happens to rollups?
  • Hardware failure, fault tolerance not configured: Pending rollups are lost and never recovered.
  • Hardware failure, fault tolerance configured: The other available Data Aggregator consumes the pending rollups when it becomes active.
  • Network outage, fault tolerance not configured: Pending rollups are consumed when the network is restored.
  • Network outage, fault tolerance configured: The other available Data Aggregator consumes the pending rollups when it becomes active.
What is lost in memory?
  • All scenarios: For 10K polls in memory at scale, loss should not exceed 1 poll cycle. The maximum loss is 10K items per metric family.
What happens to DTO files?
  • Hardware failure, fault tolerance not configured: If the hardware failure is the disk, all files are lost. Otherwise, whole DTO files are consumed when the hardware is restarted after repair. Incomplete files are discarded.
  • Hardware failure, fault tolerance configured: Whole DTO files are processed and partially written DTO files are discarded. A DTO file is 1 metric family over 1 poll cycle.
  • Network outage, fault tolerance not configured: Whole DTO files are processed and partially written DTO files are discarded. The Data Aggregator attempts to shut down gracefully and close any DTO file in flight.
  • Network outage, fault tolerance configured: Whole DTO files are processed and partially written DTO files are discarded. A DTO file is 1 metric family over 1 poll cycle.
What happens with the ActiveMQ Broker?
  • All scenarios: For a 600-MB cache in memory and an average message size of 1.3 KB, approximately 470K messages could be lost.
What happens with thresholding?
  • All scenarios: Data loss does not exceed 1 poll cycle.
Configure the Failover Settings
During failover, the inactive Data Aggregator has 45 minutes to start by default. If the Data Aggregator does not start within 45 minutes, the fault tolerant environment tries to start the other host. This process repeats for each host every 45 minutes until one of the hosts starts.
We recommend that you observe how much time passes between when the command to start the Data Aggregator is issued and when the Data Aggregator REST service is available. Adjust the startwait parameter as appropriate before configuring fault tolerance.
Ensure that you configure enough time. Do not set the configurable start time to less than 45 minutes. A start time that is too low can result in data loss or system malfunction.
If starting a Data Aggregator consistently takes longer than 20 to 30 minutes, the hardware might be under-resourced. If the hardware is under-resourced, CA Performance Management stops functioning. For information about the sizing requirements, see the CA Performance Management Sizing Tool.
A configurable failover wait time is set to 5 minutes by default. Failover occurs only when the active Data Aggregator is unresponsive to the fault tolerance heartbeat for longer than the configured time (default: 5 minutes). If you have limited network availability with periodic network outages, or system thrashing that can last several minutes, you can increase the failover wait time.
Do not set the configurable failover wait time to less than 5 minutes. A failover wait time of less than 5 minutes can result in data corruption or data loss.
 
Follow these steps:
 
  1. Edit the config.json file in the following directory:
     Data_Aggregator_Install_Directory/consul-ext/conf/
  2. Edit the startwait and failwait parameters (s = seconds, m = minutes, h = hours). Example values follow these steps.
  3. Save your changes.
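For reference, the following is a minimal sketch of how the two values might look inside config.json. The parameter names follow the steps above, and the surrounding structure of the file is not shown, so change only the values of the existing entries rather than pasting this fragment in:
"startwait": "45m",
"failwait": "5m"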
Configure a Fault Tolerant Environment
When you first install or upgrade the CA Performance Management components to the 3.5 release or higher, you are prompted to configure a fault tolerant environment. After the initial installation or upgrade to a fault tolerant environment, the responses to the fault tolerant environment prompts are saved and the prompts do not appear during future upgrades of the fault tolerant environment. A fault tolerant environment requires a new shared directory (for example, /DASharedRepo) to help limit data loss. The shared drive stores customized metric families, DTO files, and the ActiveMQ Kaha database. When a hardware failure or network issue occurs, the newly active Data Aggregator accesses the shared drive and picks up where the now inactive Data Aggregator left off. The user ID that the shared drive is created with must be synced to both Data Aggregators so that both Data Aggregators have read and write permissions to that directory.
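The following is a minimal sketch of aligning ownership on the shared directory. The account name (dadmin) and the UID/GID value (2001) are placeholders for whatever account runs the Data Aggregator in your environment:
# Create the same account with the same UID/GID on both Data Aggregator hosts.
groupadd -g 2001 dadmin
useradd -u 2001 -g 2001 dadmin
# Give that account read and write access to the shared directory.
chown -R dadmin:dadmin /DASharedRepo
chmod -R u+rwX /DASharedRepo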
 
Follow these steps:
 
  1. Follow the installation or upgrade procedure for the Data Repository:
  2. Ensure that you have a new shared data directory (example: 
    /DASharedRepo
    ) and that the same user ID is shared between Data Aggregator hosts. Data from whichever Data Aggregator is active is stored in this directory.
    For information about the sizing requirements, see the CA Performance Management Sizing Tool.
     If you are using NFS, only NFS 4 and higher is supported because of the ActiveMQ Kaha locking requirements.
     The shared data directory must be accessible at all times. If the shared data directory is down and is inaccessible, no data is loaded and data loss occurs.
  3. Follow the installation or upgrade procedure for the active Data Aggregator:
    As you proceed through the Data Aggregator install or upgrade, you are prompted about configuring fault tolerance.
  4. Complete the following prompts:
     The entries to the following prompts must match for both Data Aggregators.
     • Configure Data Aggregator For Fault Tolerance
       Specify 2 to configure fault tolerance.
       Default: 1
       The default is for a non-fault tolerant environment.
     • Data Aggregator Proxy Host
       Specify the host name/IP address of the proxy server.
     • Consul HTTP port
       Specify the port for communication with Consul.
       Default: 8500
     • Choose host IP address for Consul
       This prompt appears only when multiple public IP addresses are configured.
       Specify the bind address that the Consul agents use to communicate with each other. The Consul agents include the proxy host and both Data Aggregators in the cluster. If prompted for an address, specify an address that the other two hosts in the Consul cluster can reach.
  5. Install the secondary inactive Data Aggregator.
    One of the two available Data Aggregators becomes the active Data Aggregator. The other Data Aggregator is available for failover.
  6. Follow the installation or upgrade procedure for each Data Collector:
    As you proceed through the Data Collector install or upgrade, you are prompted for a failover location for fault tolerance. The Data Collector installer prompts for the inactive Data Aggregator host if fault tolerance is configured.
  7. Follow the installation or upgrade procedure for Performance Center:
     As you proceed through the Performance Center upgrade, you are prompted about fault tolerance. Follow the prompts to migrate the Data Aggregator data source from the original Data Aggregator to the proxy host.
Verify Communication Ports
Open the following ports to allow communications to function properly in a fault tolerant environment:
  • TCP 8300
    Enables communication between the proxy server and the Data Aggregators.
  • TCP/UDP 8301
    Enables LAN communication between the proxy server and the Data Aggregators.
  • TCP 8500
    Enables communication from the proxy server and the Data Aggregators to the Consul HTTP API.
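If the hosts run firewalld, the following is a minimal sketch for opening these ports; run it on the proxy server and on both Data Aggregators, and adapt it if your environment uses a different firewall:
firewall-cmd --permanent --add-port=8300/tcp
firewall-cmd --permanent --add-port=8301/tcp
firewall-cmd --permanent --add-port=8301/udp
firewall-cmd --permanent --add-port=8500/tcp
firewall-cmd --reload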
Verify the Fault Tolerant Environment Configuration
After you install each Data Aggregator and add it as a data source, the System Status page provides the overall health status of your Data Aggregators. For more information, see View the Health of the System.
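As an additional check, you can confirm that the Consul cluster sees all three hosts. This sketch assumes the consul binary that ships under Data_Aggregator_Install_Directory/consul-ext is on the path of the host where you run it; the exact location can differ in your installation:
consul members
# The output should list the proxy server and both Data Aggregator hosts with a status of alive.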