Disaster Recovery

If a large-scale disaster occurs, you can switch over to a recovery system using a disaster recovery plan for DX NetOps Performance Management. This plan involves provisioning a secondary system as a recovery environment and regularly transferring data from the primary system.
This plan is not a temporary measure. The recovery system completely replaces the primary system.
To reverse this process and return monitoring to the original site, go through the same process with the recovery and primary hosts switched. If new hardware is deployed due to the disaster, start by reinstalling the DX NetOps Performance Management components on the new hardware. If the original hardware is available and no upgrades have occurred on the active recovery system, start by configuring incremental data transfer.
The following diagram shows the primary and recovery systems, and the files that are copied regularly:
Disaster Recovery Architecture
Install Components for the Recovery System
The recovery system is a secondary system that contains all the components for DX NetOps Performance Management. Under normal operations, the recovery system is offline. The recovery system has the same requirements as the primary system.
If you upgrade your primary system, upgrade the components in the recovery system. Each recovery component must be running the same version of the product as the primary system.
Install the Data Repository
Before you install Vertica on the recovery cluster, prepare the environment for the installation.
For more information about how to prepare the recovery cluster, see Prepare to Install the Data Repository.
Ensure that the recovery cluster has the same configuration as the primary cluster for the following settings:
  • Database version
  • Node names
    To get the node name, issue the following Vertica admintools command on each node:
    /opt/vertica/bin/admintools -t list_allnodes
    This command also returns the installed Vertica version and the database name.
  • Database name
  • Database administrator
  • Database user
  • Catalog directory
    To get the catalog directory configuration, issue the following Vertica admintools command on each node:
    /opt/vertica/bin/admintools -t list_db -d <database name>
  • Data directory
The recovery cluster has the following requirements:
  • It is accessible from the primary cluster.
  • It has the same number of nodes as the primary cluster.
  • The Database Administrator account (default: dradmin) has passwordless SSH access in both directions: from each host in the primary cluster to each host in the recovery cluster, and from each host in the recovery cluster back to each host in the primary cluster.
    To configure passwordless SSH, issue the following command from each host in the primary cluster to each node in the recovery cluster. Then issue the command from each node in the recovery cluster to each node in the primary cluster:
    ssh-copy-id -i dradmin@target_host
  • For each node pair, you have issued the following Vertica admintools command (see the verification sketch after this list):
    ssh dradmin@<paired-node> '/opt/vertica/bin/admintools -t list_allnodes'
    If you are prompted for a password, the SSH setup for the Database Administrator account (default: dradmin) is incomplete, and copy cluster will fail.
  • Port 50000 is open between all the data repository nodes and disaster recovery hosts.
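The following is a minimal verification sketch of these requirements, assuming it is run from one node in the primary cluster as the database administrator (default: dradmin). The recovery hostnames are placeholders for the hosts in your recovery cluster; repeat the check from a recovery node toward the primary nodes.
#!/bin/bash
# Hypothetical verification loop: confirm passwordless SSH from a primary node
# to each recovery node by running admintools remotely.
for host in recovery-host01 recovery-host02 recovery-host03; do
  # Must return the node list without prompting for a password;
  # a password prompt means the passwordless SSH setup is incomplete.
  ssh "dradmin@${host}" '/opt/vertica/bin/admintools -t list_allnodes'
done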
The installation process is the same as a normal data repository installation. For more information, see Install the Data Repository.
Use the same configuration for the new cluster as for the source cluster. For example, the Vertica version, node count, database name, administrator, user, catalog directory, and data directory must be the same as the original data repository.
Install the Data Aggregator
Prepare the host and install the data aggregator for the recovery system.
During installation, use the details for the data repository for the recovery system.
Install the Data Collectors
For each data collector in the primary system, install a data collector in the recovery system. The DCM ID identifies the data collector to the system.
This scenario assumes that the data collectors are centrally located with the other components. Some deployments use remote data collectors, which are deployed close to the monitored infrastructure in other data centers or geographical locations. To continue using a remote data collector if a disaster occurs, update it to communicate to the recovery data aggregator.
For more information, see Configure Data Collector When the Data Aggregator IP Address Changes.
Use the following process:
  1. Prepare the hosts for the recovery system data collectors.
    For more information, see Prepare to Install the Data Collectors.
  2. Record the DCM ID values for each data collector in the primary system:
    1. Go to the following URL:
      https://primary_da_host:8581/rest/dcms
    2. For each data collector, note the value in the <DcmID> tag.
      The following XML shows an example:
      <DataCollectionMgrInfoList>
      <DataCollectionMgrInfo version="1.0.0">
      <ID>4077</ID>
      <DcmID>primary-dc:69658898-a48c-44a6-9cba-963bb9c09684</DcmID>
      <Enabled>true</Enabled>
      <IPAddress>10.237.1.67</IPAddress>
      <RelatedDeviceItem>4078</RelatedDeviceItem>
      ...
  3. For each recovery data collector, export the DCM ID for the corresponding primary data collector, and install the component in the recovery system:
    1. On the data collector host for the recovery system, export the DCM ID for the primary system data collector:
      export DCM_ID=DATA_COLLECTOR_DCM_ID
      Example:
      export DCM_ID=primary-dc:69658898-a48c-44a6-9cba-963bb9c09684
    2. From the same session, install the data collector.
      During installation, specify the details for the data aggregator for the recovery system.
      For more information, see Install the Data Collectors.
  4. To verify the installation, check the DCM ID on the recovery data collector:
    1. Open the following file:
      /opt/IMDataCollector/broker/apache-activemq-<version>/conf/activemq.xml
    2. Find the broker name property and verify the DCM ID.
      The following example shows the section of the activemq.xml file that includes the broker name:
      ...
      <broker
        xmlns="http://activemq.apache.org/schema/core"
        brokerName="dc_broker_69658898-a48c-44a6-9cba-963bb9c09684"
        dataDirectory="${activemq.data}"
        useShutdownHook="false"
        useJmx="true">
      ...
      If the broker name does not contain the UUID from the DCM ID of the originating data collector, update the file with the correct UUID.
  5. Stop the data collector services:
    service dcmd stop
    service activemq stop
    For RHEL 7.x or OL, service invokes systemctl. You can use systemctl instead.
  6. Stop the data aggregator services:
    1. Log in to the data aggregator host for the recovery system.
    2. Do one of the following steps:
      • Stop the data aggregator and ActiveMQ services:
        service dadaemon stop
        service activemq stop
      • (Fault-tolerant environments) If the local data aggregator is running, issue one of the following commands to shut it down and prevent it from restarting until maintenance is complete:
        • RHEL 6.x:
          service dadaemon maintenance
        • RHEL 7.x, SLES, or OL:
          <installation_directory>/scripts/dadaemon maintenance
Install NetOps Portal
Prepare the host and install NetOps Portal for the recovery system.
Do not configure LDAP integration or HTTPS on the recovery system. The settings are inherited from the primary system.
Configure Incremental Data Transfer
Copying data regularly provides the recovery system with everything that is required to continue operation when the primary system is down. Devise a transfer schedule with an interval that is frequent enough to duplicate the required data. Use the same frequency for all components, and start the transfer from each component at the same time. We recommend a daily transfer for all components.
The data aggregator and NetOps Portal require regular file copies between the primary system and the recovery system. You can use any file copy and scheduling method as required in your system. In our lab environment, we configured SSH between the primary and recovery systems and used crontab to invoke the scp command from the recovery system.
Example:
The following crontab example is configured on the secondary data aggregator host to copy a backup directory from the primary data aggregator. The copy occurs daily at 12:30 AM:
30 0 * * * scp -r <user>@Primary_DA:/tmp/backup /tmp/backup
Configure Data Transfer for the Data Repository
For the data repository, use the vbr script with the --copycluster option to duplicate the primary database to the recovery database. Copy cluster is an incremental backup that copies all updates to the database. Because this data transfer is the longest transfer, the frequency of transfers to the recovery system is limited by the runtime of the copy cluster. Issue the command multiple times before you schedule a regular transfer to verify the runtime. Ensure that the time between transfers is at least twice the runtime of the copy cluster.
For existing large databases, the first copy cluster takes as long as a full backup to complete. To minimize the performance impact on the system, restore a backup of the primary system to the recovery system, then configure and issue the vbr script with the --copycluster option.
For a large database, an incremental copy cluster for one day takes about one hour. Run an incremental copy cluster at least daily.
Follow these steps:
  1. Create a configuration file for copy cluster in the /home/dradmin directory on a host in the primary data repository cluster.
    Use the example as a model to create the configuration file.
    Example:
    The following example configuration file is set up to copy a database on a three node cluster (v_drdata_node0001, v_drdata_node0002, and v_drdata_node0003) to another cluster consisting of nodes (recovery-host01, recovery-host02, and recovery-host03):
    The dbName parameter is case-sensitive.
    [Misc]
    snapshotName = Copydrdata
    restorePointLimit = 5
    tempDir = /tmp/vbr
    retryCount = 5
    retryDelay = 1
    [Database]
    dbName = drdata
    dbUser = dradmin
    dbPassword = password
    dbPromptForPassword = False
    [Transmission]
    encrypt = False
    checksum = False
    port_rsync = 50000
    [Mapping]
    ; backupDir is not used for cluster copy
    v_drdata_node0001 = recovery-host01:/data
    v_drdata_node0002 = recovery-host02:/data
    v_drdata_node0003 = recovery-host03:/data
  2. Stop the database in the recovery system:
    1. Log in to the recovery database cluster as the database admin user.
    2. Open the Vertica admintools utility by issuing the following command:
      /opt/vertica/bin/adminTools
    3. Select option 4 (Stop Database).
    4. Wait for the shutdown to complete.
  3. Copy historical data:
    1. Log in to the primary cluster as the database administrator account.
    2. Issue the following command:
      vbr.py --task copycluster --config-file /home/dradmin/CopyClusterConfigurationFile.ini
      The historical data for the database is copied and the following message is displayed:
      > vbr.py --config-file /home/dradmin/CopyClusterConfigurationFile.ini --task copycluster
      Preparing...
      Copying...
      1871652633 out of 1871652633, 100%
      All child processes terminated successfully.
      copycluster done!
  4. Create a cron job to schedule copy cluster from the primary system on a regular interval (see the example crontab entry after these steps). The following command initiates the transfer:
    vbr.py --task copycluster --config-file /home/dradmin/CopyClusterConfigurationFile.ini
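For example, a crontab entry for the database administrator on a primary data repository node might look like the following sketch. The schedule, log file, and full path to vbr.py are assumptions; adjust them for your environment:
# Hypothetical crontab entry: run copy cluster daily at 1:00 AM and log the output
0 1 * * * /opt/vertica/bin/vbr.py --task copycluster --config-file /home/dradmin/CopyClusterConfigurationFile.ini >> /home/dradmin/copycluster.log 2>&1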
Configure Data Transfer for the Data Aggregator
For the data aggregator, schedule a regular copy of the following files from the primary system to the recovery system (see the example copy script after this list):
  • <installation_directory>/IMDataAggregator/apache-karaf-*/deploy/*.xml
    Do not copy the local-jms-broker.xml file from this directory. This directory might not initially contain other files.
    • installation_directory
      The installation directory for the data aggregator.
      Default: /opt
    • apache-karaf-*
      The installation directory for Apache Karaf.
      Example: apache-karaf-4.2.6
  • <installation_directory>/IMDataAggregator/apache-karaf-*/etc/org.ops4j.pax.web.cfg
    • installation_directory
      The installation directory for the data aggregator.
      Default: /opt
    • apache-karaf-*
      The installation directory for Apache Karaf.
      Example: apache-karaf-4.2.6
  • <installation_directory>/data/custom/devicetypes/DeviceTypes.xml
    In a fault-tolerant environment, a shared directory (for example, /DASharedRepo) is defined to help limit data loss. Therefore, in fault-tolerant environments, the file is located in the following directory:
    DASharedRepo/custom/devicetypes
    For more information, see Fault Tolerance.
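As one possible approach to these copies, the following sketch runs on the recovery data aggregator host. It assumes a default /opt installation, Apache Karaf 4.2.6, a standalone (non-fault-tolerant) environment, and passwordless SSH to the primary host; primary_da_host and the paths are placeholders to adjust for your environment:
#!/bin/bash
# Hypothetical copy script for the recovery data aggregator host; invoke it
# from cron on the same schedule as the other component transfers.
PRIMARY=primary_da_host
KARAF=/opt/IMDataAggregator/apache-karaf-4.2.6

# Deployment XML files only; skip local-jms-broker.xml, which must not be copied.
rsync -a --exclude 'local-jms-broker.xml' --include '*.xml' --exclude '*' "${PRIMARY}:${KARAF}/deploy/" "${KARAF}/deploy/"

# Web server configuration.
scp "${PRIMARY}:${KARAF}/etc/org.ops4j.pax.web.cfg" "${KARAF}/etc/"

# Custom device types.
scp "${PRIMARY}:/opt/data/custom/devicetypes/DeviceTypes.xml" /opt/data/custom/devicetypes/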
Configure Data Transfer for the Data Collectors
The data collectors do not require a regular backup. All relevant information is stored on the data aggregator and the data repository.
If the primary data collectors include custom memory settings, configure the recovery data collectors as required.
Configure Data Transfer for NetOps Portal
For NetOps Portal, create a database dump of the netqosportal and em databases, and back up custom settings.
For more information, see Back Up NetOps Portal.
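For example, the database dumps might look like the following sketch. The MySQL path, account, and output directory are assumptions; use the backup procedure and credentials documented for your installation:
# Hypothetical dump commands on the primary NetOps Portal host
/opt/CA/MySql/bin/mysqldump -u root -p netqosportal > /tmp/backup/netqosportal.sql
/opt/CA/MySql/bin/mysqldump -u root -p em > /tmp/backup/em.sql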
Prepare the Disaster Recovery Scripts
The disaster recovery scripts replace hostname and IP address references to match the components in the recovery system.
For each script, create a copy (for example, as shown below), and provide the relevant information for your system.
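For example, for the data repository script, a copy command might look like the following; the <Version> suffix in the directory name depends on your installation, and the your_ prefix is simply the naming convention used later in this section:
# Hypothetical example: edit a copy so that the shipped script stays unchanged
cp /opt/CA/IMDataRepository_vertica<Version>/update_da_dc_database_references.sh \
   /opt/CA/IMDataRepository_vertica<Version>/your_update_da_dc_database_references.sh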
Data Repository Disaster Recovery Script
Location: /opt/CA/IMDataRepository_vertica<Version>/update_da_dc_database_references.sh
On the data repository host in the recovery system, update the following variables in the script to match your system:
##############################################################
# UPDATE DAUSER/DAPASS BELOW TO REFLECT THE NON-ADMIN
# VERTICA USERNAME/PASSWORD FOR THIS SYSTEM
##############################################################
DAUSER=dauser
DAPASS=dapass
#######################################################################
# UPDATE TO REFLECT THE NEW/RECOVERY DATA AGGREGATOR'S IP ADDRESS BELOW
#######################################################################
RECOVERY_DA_IP_ADDRESS="<Recovery/New IP Address for the Data Aggregator>"
#####################################################################
# UPDATE TO REFLECT THE NEW/RECOVERY DATA AGGREGATOR'S HOSTNAME BELOW
#####################################################################
SOURCE_DA_HOSTNAME="<Source/Original Hostname for the Data Aggregator>"
RECOVERY_DA_HOSTNAME="<Recovery/New Hostname for the Data Aggregator>"
#####################################################################################################
#
# UPDATE THE FOLLOWING ARRAYS TO REFLECT THE SOURCE DATA
# COLLECTOR HOSTNAMES, NEW RECOVERY HOSTNAMES, AND NEW RECOVERY
# IP ADDRESSES RESPECTIVELY.
#
# IMPORTANT: THE ORDER OF THE ENTRIES BELOW IS CRITICAL FOR
# MAPPING PURPOSES. IN ADDITION, PLEASE NOTE THAT IF MULTIPLE VALUES
# ARE REQUIRED, PLEASE SEPARATE VALUES WITH A SINGLE SPACE.
#
#####################################################################################################
declare -a SOURCE_DC_HOSTNAMES=(<Source/Original DC Hostname 1> <Source/Original DC Hostname 2>)
declare -a RECOVERY_DC_HOSTNAMES=(<New/Recovery DC Hostname 1> <New/Recovery DC Hostname 2>)
declare -a RECOVERY_DC_IP_ADDRESSES=("<New/Recovery DC Hostname 1 IP Address>" "<New/Recovery DC Hostname 2 IP Address>")
Ensure that the order of the data collectors for the source system and the recovery system is the same. The script uses the order of the list to map the primary system components to the recovery system.
NetOps Portal Disaster Recovery Script
Location: /opt/CA/PerformanceCenter/Tools/bin/update_pc_da_database_references.sh
On the NetOps Portal host in the recovery system, update the following variables in the script to match your system:
...
##################################################################
# UPDATE THE FOLLOWING PC/DA VARIABLES TO REFLECT NEW ENVIRONMENT
##################################################################
NEW_PC_IP_ADDRESS="<Recovery/New PC IP Address>"
NEW_PC_HOSTNAME="<Recovery/New PC Hostname>"
NEW_PC_EVENT_PRODUCER_PORT=8181
NEW_PC_EVENT_PRODUCER_PROTOCOL="http" # change to "https" if using SSL
NEW_DA_IP_ADDRESS="<Recovery/New DA IP Address>"
NEW_DA_HOSTNAME="<Recovery/New DA Hostname>"
NEW_DA_PORT_NUMBER=8581
...
Activate the Recovery System
If a large-scale disaster occurs, and the primary system is unavailable, start the recovery system.
Starting the recovery system takes about the same amount of time as starting the data aggregator.
Start the Data Repository
Follow these steps:
  1. Log in to the recovery database cluster as the database admin user.
  2. Open the Vertica admintools utility by issuing the following command:
    /opt/vertica/bin/adminTools
  3. Select option 3 (Start Database).
  4. Press the Space bar next to the database name, select OK, and then press the Enter key on your keyboard.
    You are prompted for the database password.
  5. Enter the database password, and then press the Enter key on your keyboard.
    The data repository starts.
  6. Select Exit, and then press the Enter key.
  7. Run the data repository disaster recovery script:
    /opt/CA/IMDataRepository_vertica/your_update_da_dc_database_references.sh
Start the Data Aggregator
Do one of the following steps:
  • Start the ActiveMQ and data aggregator services:
    service activemq start
    service dadaemon start
  • (Fault-tolerant environments) Issue one of the following commands to enable the fault-tolerant data aggregator so that it can start when necessary:
    • RHEL 6.x:
      service dadaemon activate
    • RHEL 7.x, SLES, or OL:
      <installation_directory>/scripts/dadaemon activate
The data aggregator starts. If the data repository is unavailable, the data aggregator shuts down.
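To confirm that the recovery data aggregator is responding before you continue, a quick check such as the following sketch can help. The REST path is the one shown earlier in this section; the curl flags and the expectation of an HTTP 200 response are assumptions:
# Hypothetical check from another host in the recovery system;
# an HTTP 200 status indicates that the data aggregator REST service is up.
curl -sk -o /dev/null -w "%{http_code}\n" https://recovery_da_host:8581/rest/dcms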
Start NetOps Portal
Follow these steps:
  1. Restore the NetOps Portal backups.
    For more information, see Restore NetOps Portal.
  2. Run the NetOps Portal disaster recovery script:
    /opt/CA/PerformanceCenter/Tools/bin/your_update_pc_da_database_references.sh
  3. Start the SSO service:
    service caperfcenter_sso start
  4. Wait one minute, then start the event manager and device manager:
    service caperfcenter_eventmanager start
    service caperfcenter_devicemanager start
  5. Wait one minute, then start the console service:
    service caperfcenter_console start
Start the Data Collectors
Issue the following command to start the data collector service:
service dcmd start
The data collector restarts. If the data aggregator is unavailable, the data collector shuts down.
For remote data collectors, update them to connect to the recovery data aggregator. For more information, see Configure Data Collector When the Data Aggregator IP Address Changes.
(Optional) Test the Recovery System
You can test the recovery system manually.
Follow these steps:
  1. Pause the incremental data transfer.
  2. Start the data repository:
    1. Log in to the recovery database cluster as the database admin user.
    2. Open the Vertica admintools utility by issuing the following command:
      /opt/vertica/bin/adminTools
    3. Select option 3 (Start Database).
    4. Press the Space bar next to the database name, select OK, and then press the Enter key on your keyboard.
      You are prompted for the database password.
    5. Enter the database password, and then press the Enter key on your keyboard.
      The data repository starts.
    6. Select Exit, and then press the Enter key on your keyboard.
    7. Run the data repository disaster recovery script:
      /opt/CA/IMDataRepository_vertica/your_update_da_dc_database_references.sh
  3. Start the data aggregator:
    1. Do one of the following steps:
      • Start the ActiveMQ and data aggregator services by issuing the following commands:
        service activemq start
        service dadaemon start
      • (Fault-tolerant environments) Enable the fault-tolerant data aggregator so that it can start when necessary by issuing one of the following commands:
        • RHEL 6.x:
          service dadaemon activate
        • RHEL 7.x, SLES, or OL:
          DA_Install_Directory/scripts/dadaemon activate
      The data aggregator starts. If the data repository is unavailable, the data aggregator shuts down.
  4. Start NetOps Portal:
    1. Restore the NetOps Portal backups.
      For more information, see Restore NetOps Portal.
    2. Run the NetOps Portal disaster recovery script:
      /opt/CA/PerformanceCenter/Tools/bin/your_update_pc_da_database_references.sh
    3. Start the SSO service:
      service caperfcenter_sso start
    4. Wait one minute, then start the event manager and device manager:
      service caperfcenter_eventmanager start
      service caperfcenter_devicemanager start
    5. Wait one minute, then start the console service:
      service caperfcenter_console start
  5. Log in to the recovery NetOps Portal component and run reports against the recovery data repository and the data aggregator.
  6. Verify that the data is available.
  7. (Optional) If you have a set of recovery data collectors that you can double-poll for testing, start one or more data collectors. Verify that polling occurs and that the data is stored in the database.
    To prevent the recovery system from issuing duplicate notifications and reports during testing, disable them beforehand.
    1. Issue the following command to start the data collector service:
      service dcmd start
      The data collector restarts. If the data aggregator is unavailable, the data collector shuts down.
      For remote data collectors, update them to connect to the recovery data aggregator.
      For more information, see Configure Data Collector When the Data Aggregator IP Address Changes.
  8. Shut down each component.
    After the next incremental data transfer, the database is in sync with the primary system again. To fail over or test again, repeat these steps.