Fault Tolerance

Fault tolerance enables DX NetOps Performance Management to continue operating properly when a hardware failure or network issue occurs. In fault-tolerant environments, a secondary inactive data aggregator automatically becomes active. The newly active data aggregator takes over to organize and feed data to NetOps Portal and to the data repository, and it retains state information from the previously active data aggregator. After the data aggregator host that had the network issue or hardware failure recovers, it is available for failover.
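The failover behavior described above can be pictured with a small model. This is only an illustrative sketch, not how DX NetOps Performance Management implements failover: the class names, the health flag, and the shared_state dictionary are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Aggregator:
    """Hypothetical stand-in for one data aggregator host."""
    name: str
    healthy: bool = True


@dataclass
class FaultTolerantPair:
    """Two aggregators; exactly one is active at any time."""
    active: Aggregator
    standby: Aggregator
    # Stands in for the state information that survives a failover.
    shared_state: dict = field(default_factory=dict)

    def fail_over(self) -> bool:
        """Promote the standby when the active host is unhealthy.

        Returns True if the roles were swapped. The shared state is left
        untouched, so the newly active aggregator continues from it.
        """
        if self.active.healthy or not self.standby.healthy:
            return False
        self.active, self.standby = self.standby, self.active
        return True


pair = FaultTolerantPair(Aggregator("da1"), Aggregator("da2"))
pair.shared_state["last_poll_cycle"] = 42

pair.active.healthy = False   # hardware failure or network issue on da1
switched = pair.fail_over()   # da2 becomes the active aggregator
pair.standby.healthy = True   # da1 recovers and is available for failover again
```

In this model the failed host does not reclaim the active role when it recovers. It simply becomes the standby, which matches the description above of the recovered host being available for failover.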
Use the following process to configure a fault-tolerant environment:
The System Architecture of Fault-Tolerant Environments
The following diagram shows the system architecture of a fault-tolerant environment:
[Diagram: High Availability]
Træfik is a modern HTTP reverse proxy and load balancer that you can use to deploy microservices with ease. Consul is a tool that you can use to manage services in the DX NetOps Performance Management deployment.
Hardware Requirements for Fault-Tolerant Environments
The following extra hardware is required for a fault-tolerant environment:
  • An additional data aggregator server.
  • A proxy server. The proxy server works as the third node of the service management cluster for fault tolerance. The service management cluster includes the proxy server, the active data aggregator, and the inactive data aggregator.
  • A new shared data directory (for example, /DASharedRepo) and the same user ID on both data aggregator hosts. Data from whichever data aggregator is active is stored in this directory.
    For more information about the sizing requirements, see the DX NetOps Performance Management Sizing Tool.
    If you are using Network File System (NFS), DX NetOps Performance Management supports only NFSv4 and higher because of the ActiveMQ Kaha locking requirements.
    To avoid data loss and data-loading failures, the shared data directory must be accessible at all times.
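The shared data directory requirements above lend themselves to a quick pre-installation check. The following is a minimal sketch and not a tool that ships with the product. The helper names nfs_version_ok and shared_dir_problems are hypothetical, and the parsing assumes the Linux /proc/mounts layout (device, mount point, file system type, options). It does not verify that the user ID matches across hosts.

```python
import os


def nfs_version_ok(mounts_text: str, mount_point: str) -> bool:
    """Return False only if mount_point is an NFS mount older than NFSv4.

    mounts_text is expected in /proc/mounts format. A directory that lives
    below an NFS mount point, rather than being one itself, is not detected.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        fs_type, options = fields[2], fields[3].split(",")
        if not fs_type.startswith("nfs"):
            return True  # not NFS, so the NFS version rule does not apply
        for opt in options:
            if opt.startswith("vers="):
                major = int(opt.split("=", 1)[1].split(".")[0])
                return major >= 4
        return fs_type == "nfs4"  # no explicit version option: rely on the type
    return True  # mount point not listed as a separate mount


def shared_dir_problems(path: str, mounts_text: str) -> list:
    """Collect human-readable problems with the shared data directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append("not a directory or not reachable")
    elif not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        problems.append("not readable and writable by the current user")
    if not nfs_version_ok(mounts_text, path):
        problems.append("NFS mount older than NFSv4")
    return problems
```

On a Linux data aggregator host, the check could be run on each host as the shared user by reading the contents of /proc/mounts and passing them in together with the directory path, for example /DASharedRepo. An empty list means that none of the checked conditions failed.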
Data Loss Comparison
Some data loss might occur in a fault-tolerant environment when a hardware failure or network issue occurs. However, the amount of data loss is less than in a non-fault-tolerant environment. The following table compares the data loss from a hardware failure or network outage:
| Question | Hardware failure, fault tolerance not configured | Hardware failure, fault tolerance configured | Network outage, fault tolerance not configured | Network outage, fault tolerance configured |
| --- | --- | --- | --- | --- |
| What happens to rollups? | Pending rollups are lost and never recovered. | The other available data aggregator consumes the pending rollups when it becomes active. | Pending rollups are consumed when the network is restored. | The other available data aggregator consumes the pending rollups when it becomes active. |
| What is lost in memory? | For 10K polls in memory at scale, loss should not exceed 1 poll cycle. The maximum loss is 10K items per metric family. | For 10K polls in memory at scale, loss should not exceed 1 poll cycle. The maximum loss is 10K items per metric family. | For 10K polls in memory at scale, loss should not exceed 1 poll cycle. The maximum loss is 10K items per metric family. | For 10K polls in memory at scale, loss should not exceed 1 poll cycle. The maximum loss is 10K items per metric family. |
| What happens to data transfer object (DTO) files? | If the hardware failure is the disk, all files are lost. Otherwise, whole DTO files are consumed when the hardware is restarted after repair. Incomplete files are discarded. | Whole DTO files are processed and partially written DTO files are discarded. A DTO file is 1 metric family over 1 poll cycle. | Whole DTO files are processed and partially written DTO files are discarded. The data aggregator attempts to shut down gracefully and close any DTO file in flight. | Whole DTO files are processed and partially written DTO files are discarded. A DTO file is 1 metric family over 1 poll cycle. |
| What happens with the ActiveMQ Broker? | For a 600-MB cache in memory and an average message size of 1.3 KB, approximately 470K messages could be lost. | For a 600-MB cache in memory and an average message size of 1.3 KB, approximately 470K messages could be lost. | For a 600-MB cache in memory and an average message size of 1.3 KB, approximately 470K messages could be lost. | For a 600-MB cache in memory and an average message size of 1.3 KB, approximately 470K messages could be lost. |
| What happens with thresholding? | Data loss does not exceed 1 poll cycle. | Data loss does not exceed 1 poll cycle. | Data loss does not exceed 1 poll cycle. | Data loss does not exceed 1 poll cycle. |
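The ActiveMQ figure in the comparison can be reproduced with simple arithmetic, which is useful for estimating the exposure with a different cache size. This sketch assumes that the message size is in kilobytes and that binary units apply (1 MB = 1024 KB). Both are interpretations of the stated figures, and the function name is hypothetical.

```python
def estimated_messages_at_risk(cache_mb: float, avg_message_kb: float) -> int:
    """Number of average-sized messages that fit in the in-memory cache.

    Assumes 1 MB = 1024 KB, which is an interpretation of the published
    figures rather than something the documentation states.
    """
    return int(cache_mb * 1024 / avg_message_kb)


estimate = estimated_messages_at_risk(600, 1.3)
print(estimate)  # 472615, consistent with "approximately 470K"
```

The estimate scales linearly, so under the same assumptions, doubling the cache size or halving the average message size doubles the number of messages that could be lost.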