Disaster Recovery

If a large-scale disaster occurs, the disaster recovery plan for CA Performance Management enables a switchover to a recovery system. The disaster recovery plan involves provisioning a secondary system as a recovery environment and regularly transferring data from the primary system. This scenario is not a temporary measure. The recovery system completely replaces the primary system.
To reverse this process and return monitoring to the original site, go through the same process with the recovery and primary hosts switched. If new hardware is deployed due to the disaster, start by reinstalling the CA Performance Management components on the new hardware. If the original hardware is available and no upgrades have occurred on the active recovery system, start by configuring incremental data transfer.
The following video introduces how CA Performance Management establishes a detailed disaster recovery procedure meant to re-establish normal operations in the event of a major disruption, such as a hurricane or fire.

The following diagram shows the primary and recovery systems, and the files that are copied regularly:
Disaster Recovery Architecture
Install Components for the Recovery System
The recovery system is a secondary system that contains all the components for CA Performance Management. Under normal operations, the recovery system is offline. The recovery system has the same requirements as the primary system. For more information, see Review Installation Requirements and Considerations.
If you upgrade your primary system, upgrade the components in the recovery system. Each recovery component must be running the same version of the product as the primary system.
Install the Data Repository
Before you install Vertica on the recovery cluster, prepare the environment for the installation. For more information about preparing the recovery cluster, see Prepare to Install the Data Repository.
Ensure that the recovery cluster has the same configuration as the primary cluster for the following settings:
  • Database version 
  • Node names
    To get the node name, run the following command on each node:
    /opt/vertica/bin/admintools -t list_allnodes
    This command also returns the installed Vertica version and the database name.
  • Database name
  • Database administrator
  • Database user
  • Catalog directory
    To get the catalog directory configuration, run the following command on each node:
    /opt/vertica/bin/admintools -t list_db -d <database name>
  • Data directory
The recovery cluster has the following requirements:
  • Accessible from the primary cluster
  • The same number of nodes as the primary cluster
  • Passwordless SSH access for the Database Administrator account (default: dradmin) between the primary and recovery clusters.
    Provide passwordless SSH access for the dradmin account from each host in the primary cluster to each host in the recovery cluster, and from each host in the recovery cluster back to each host in the primary cluster.
    To configure passwordless SSH, run the following command from each host in the primary cluster to each node in the recovery cluster. Then run the command from each node in the recovery cluster to each node in the primary cluster:
    ssh-copy-id -i dradmin@target_host
  • For each node pair, run the following command:
    ssh dradmin@<paired-node> '/opt/vertica/bin/admintools -t list_allnodes'
    If you are prompted for a password, the SSH setup for the Database Administrator account (default: dradmin) is incomplete, and copy cluster will fail.
  • Port 50000 must be open between all the Data Repository nodes and disaster recovery hosts.
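A quick way to verify the passwordless SSH setup and the rsync port from a primary node is a short shell loop. The following is a sketch that assumes the nc utility is installed; replace the hypothetical recovery hostnames with your own:
# Run as dradmin on a primary Data Repository node
for host in recovery-host01 recovery-host02 recovery-host03; do
  ssh -o BatchMode=yes dradmin@$host hostname   # must succeed without a password prompt
  nc -zv $host 50000                            # verifies that the rsync port is reachable
done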
The installation process is the same as a normal Data Repository installation. For more information, see Install the Data Repository.
Use the same configuration for the new cluster as for the source cluster. For example, the Vertica version, node count, database name, administrator, user, catalog directory, and data directory must be the same as the original Data Repository.
Install the Data Aggregator
Prepare the host and install the Data Aggregator for the recovery system.
During installation, use the details for the Data Repository for the recovery system.
Install the Data Collectors
For each Data Collector in the primary system, install a Data Collector in the recovery system. The DCM ID identifies the Data Collector to the system.
This scenario assumes that the Data Collectors are centrally located with the other components. Some deployments use remote Data Collectors, which are deployed close to the monitored infrastructure in other data centers or geographical locations. To continue using a remote Data Collector if a disaster occurs, update it to communicate to the recovery Data Aggregator. For more information, see Configure Data Collector When the Data Aggregator IP Address Changes.
To install the Data Collectors, use the following process:
  1. Prepare the hosts for the recovery system Data Collectors. For more information, see Prepare to Install the Data Collectors.
  2. Record the DCM ID values for each Data Collector in the primary system:
    1. Go to the following URL:
      https://primary_da_host:8581/rest/dcms
    2. For each Data Collector, note the value in the <DcmID> tag. (For a scripted way to list these values, see the example after this procedure.)
      The following XML shows an example:
      <DataCollectionMgrInfoList>
         <DataCollectionMgrInfo version="1.0.0">
            <ID>4077</ID>
            <DcmID>primary-dc:69658898-a48c-44a6-9cba-963bb9c09684</DcmID>
            <Enabled>true</Enabled>
            <IPAddress>10.237.1.67</IPAddress>
            <RelatedDeviceItem>4078</RelatedDeviceItem>
      ...
  3. For each recovery Data Collector, export the DCM ID for the corresponding primary Data Collector, and install the component in the recovery system:
    1. On the Data Collector host for the recovery system, export the DCM ID for the primary system Data Collector:
      export DCM_ID=DATA_COLLECTOR_DCM_ID
      Example:
      export DCM_ID=primary-dc:69658898-a48c-44a6-9cba-963bb9c09684
    2. From the same session, install the Data Collector.
      During installation, specify the details for the Data Aggregator for the recovery system.
      For more information, see Install the Data Collectors.
  4. To verify the installation, look at the DCM ID on the recovery Data Collector:
    1. Open the following file:
      /opt/IMDataCollector/broker/apache-activemq-version/conf/activemq.xml
    2. Find the broker name property and verify the DCM ID.
      The following example shows the section of the activemq.xml file that includes the broker name:
      ...
          <broker
              xmlns="http://activemq.apache.org/schema/core"
              brokerName="dc_broker_69658898-a48c-44a6-9cba-963bb9c09684"
              dataDirectory="${activemq.data}"
              useShutdownHook="false"
              useJmx="true">
      ...
      If the broker name does not match the expected DCM ID, update the file with the correct value. Use only the UUID portion of the DCM ID (the part after the hostname and colon).
  5. Stop the Data Collector services:
    service dcmd stop
    For RHEL 7.x or OL, service invokes systemctl. You can use systemctl instead.
    service activemq stop
  6. Stop the Data Aggregator services:
    1. Log in to the Data Aggregator host for the recovery system.
    2. Do one of the following steps:
      • Stop the Data Aggregator and ActiveMQ services:
        service dadaemon stop
        service activemq stop
      • (Fault tolerant environment) If the local Data Aggregator is running, run one of the following commands to shut it down and prevent it from restarting until maintenance is complete:
        • RHEL 6.x:
          service dadaemon maintenance
        • RHEL 7.x, SLES, or OL:
          DA_Install_Directory/scripts/dadaemon maintenance
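The DCM ID lookup in step 2 can also be scripted instead of reading the XML in a browser. The following is a minimal sketch, assuming curl is available on the host and the Data Aggregator REST interface is reachable at the URL shown in step 2; add the curl -k option if the host uses a self-signed certificate:
# Print only the <DcmID> values returned by the primary Data Aggregator
curl -s https://primary_da_host:8581/rest/dcms | grep -o '<DcmID>[^<]*</DcmID>'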
Install Performance Center
Prepare the host and install Performance Center for the recovery system.
Do not configure LDAP integration or HTTPS on the recovery system. The settings are inherited from the primary system.
Configure Incremental Data Transfer
Copying data regularly gives the recovery system everything that is required to continue operation when the primary system is down. Devise a transfer plan with a regular interval that is frequent enough to duplicate the required data. Use the same frequency for all components and start the transfer from each component simultaneously. We recommend a daily transfer for all components.
The Data Aggregator and Performance Center components require regular file copies between the primary system and the recovery system. You can use any file copy and scheduling method that your system requires. In our lab environment, we configured SSH between the primary and recovery systems and used crontab to invoke the scp command from the recovery system.
Example:
The following crontab example is configured on the secondary Data Aggregator host to copy a backup directory from the primary Data Aggregator. The copy occurs daily at 12:30 AM:
30 0 * * * scp -r user@primary_DA:/tmp/backup /tmp/backup
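The user account in this example is a placeholder; use whichever account you configured for SSH between the systems. If you prefer an incremental transfer, rsync over SSH is an alternative to scp. The following is a minimal sketch under the same assumptions (cron on the recovery host, SSH access to the primary Data Aggregator host):
# Mirror the primary backup directory daily at 12:30 AM, transferring only changed files
30 0 * * * rsync -az --delete user@primary_DA:/tmp/backup/ /tmp/backup/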
Data Repository
For the Data Repository, use copy cluster to duplicate the primary database to the recovery database. Copy cluster is an incremental backup that copies all updates to the database. Because this data transfer is the longest transfer, the backup frequency to the recovery system is limited by the runtime of the copy cluster command. Run the command multiple times before you schedule a regular transfer to verify the runtime. Ensure that the interval between transfers is at least twice the runtime of the copy cluster command.
For existing large databases, the copy cluster command takes as long as a full backup to complete. To minimize the performance impact to the system, restore a backup of the primary system to the recovery system, then configure and run copy cluster. 
For a large database, an incremental copy cluster command for one day takes about one hour. We recommend you run an incremental copy cluster at least daily.
Follow these steps:
  1. Create a configuration file for copy cluster on a host in the primary Data Repository cluster in the following directory:
    /home/dradmin
  2. Use the example as a model to create the configuration file.
    Example:
    The following example configuration file is set up to copy a database on a three-node cluster (v_drdata_node0001, v_drdata_node0002, and v_drdata_node0003) to another cluster consisting of the nodes recovery-host01, recovery-host02, and recovery-host03:
    The dbName parameter is case-sensitive.
    [Misc]
    snapshotName = Copydrdata
    restorePointLimit = 5
    tempDir = /tmp/vbr
    retryCount = 5
    retryDelay = 1
    [Database]
    dbName = drdata
    dbUser = dradmin
    dbPassword = password
    dbPromptForPassword = False
    [Transmission]
    encrypt = False
    checksum = False
    port_rsync = 50000
    [Mapping]
    ; backupDir is not used for cluster copy
    v_drdata_node0001 = recovery-host01:/data
    v_drdata_node0002 = recovery-host02:/data
    v_drdata_node0003 = recovery-host03:/data
  3. Stop the database in the recovery system:
    1. Log in to the recovery database cluster as the database admin user.
    2. Open Vertica admin Tools:
      /opt/vertica/bin/adminTools
    3. Select (4) Stop Database. Wait for the shutdown to complete before you run copy cluster.
  4. Copy historical data:
    1. Log in to the primary cluster as the database administrator account.
    2. Run the copy cluster command:
      vbr.py --task copycluster --config-file /home/dradmin/CopyClusterConfigurationFile.ini
      The command copies the historical data for the database and displays the following message:
      > vbr.py --config-file /home/dradmin/CopyClusterConfigurationFile.ini --task copycluster
      Preparing...
      Copying...
      1871652633 out of 1871652633, 100%
      All child processes terminated successfully.
      copycluster done!
  5. Create a cron job to schedule copy cluster from the primary system on a regular interval. The following command initiates the transfer:
    vbr.py --task copycluster --config-file /home/dradmin/CopyClusterConfigurationFile.ini
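For example, a cron entry on a node in the primary cluster (as the dradmin user) could run the transfer daily at 1:00 AM. The schedule and the vbr.py path are assumptions; adjust them to your Vertica installation and to the runtime guidance above:
# Run copy cluster from the primary cluster daily at 1:00 AM
0 1 * * * /opt/vertica/bin/vbr.py --task copycluster --config-file /home/dradmin/CopyClusterConfigurationFile.ini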
Data Aggregator
For the Data Aggregator, schedule a regular copy of the following files from the primary system to the recovery system:
  • DA_install_directory/apache-karaf-version/deploy/*.xml
    Do not copy the following file from this directory:
    local-jms-broker.xml
    This directory might not initially contain other files.
  • DA_install_directory/apache-karaf-version/etc/org.ops4j.pax.web.cfg
  • DA_install_directory/data/custom/devicetypes/DeviceTypes.xml
    In a fault tolerant environment, a shared directory (example: /DASharedRepo) is defined to help limit data loss. Therefore, in a fault tolerant environment, the file is located in the following directory:
    DASharedRepo/custom/devicetypes/DeviceTypes.xml
    For more information, see Fault Tolerance.
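As with the earlier scp example, these files can be pulled on a schedule from the recovery Data Aggregator host. The sketch below is illustrative only: the user account, the staging directories under /tmp/da_backup (which must already exist), the default install directory /opt/IMDataAggregator, and the literal version placeholder are all assumptions to adjust for your deployment. Remember not to apply local-jms-broker.xml from the copied deploy files:
# Daily at 12:40 AM: pull the Data Aggregator configuration files from the primary host
40 0 * * * scp user@primary_DA:'/opt/IMDataAggregator/apache-karaf-version/deploy/*.xml' /tmp/da_backup/deploy/
40 0 * * * scp user@primary_DA:/opt/IMDataAggregator/apache-karaf-version/etc/org.ops4j.pax.web.cfg /tmp/da_backup/etc/
40 0 * * * scp user@primary_DA:/opt/IMDataAggregator/data/custom/devicetypes/DeviceTypes.xml /tmp/da_backup/custom/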
Data Collectors
The Data Collectors do not require a regular backup. All relevant information is stored on the Data Aggregator and Data Repository.
If the primary Data Collectors include custom memory settings, configure the recovery Data Collectors as required.
Performance Center
For Performance Center, create a database dump of the netqosportal and em databases, and back up custom settings. For more information, see Back Up and Restore the CA Performance Management Database.
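A minimal dump sketch follows, assuming the mysqldump client is on the PATH of the Performance Center host and that the default database names are in use; the MySQL account and output directory are placeholders:
# Dump both Performance Center databases to a staging directory for transfer
mysqldump -u root -p --databases netqosportal > /tmp/backup/netqosportal.sql
mysqldump -u root -p --databases em > /tmp/backup/em.sql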
Prepare the Disaster Recovery Scripts
The disaster recovery scripts replace hostname and IP address references to match the components in the recovery system.  
For each script, create a copy, and provide the relevant information for your system.
Data Repository Disaster Recovery Script
Location:
/opt/CA/IMDataRepository_verticaVersion/update_da_dc_database_references.sh
On the Data Repository host in the recovery system, update the following values in the script to match your system:
The disaster recovery script changed for CA Performance Management releases 3.6.1 and higher.
...
##############################################################
# UPDATE DAUSER/DAPASS BELOW TO REFLECT THE NON-ADMIN 
# VERTICA USERNAME/PASSWORD FOR THIS SYSTEM
##############################################################
DAUSER=dauser
DAPASS=dapass
##############################################################
# UPDATE TO REFLECT THE NEW/RECOVERY DA'S IP ADDRESS BELOW
##############################################################
NEW_DA_IP_ADDRESS="<Recovery/New IP Address for the Data Aggregator>"
#####################################################################################################
#
# UPDATE THE FOLLOWING ARRAYS TO REFLECT THE SOURCE DATA
# COLLECTOR HOSTNAMES, NEW RECOVERY HOSTNAMES, AND NEW RECOVERY 
# IP ADDRESSES RESPECTIVELY.
#
# IMPORTANT: THE ORDER OF THE ENTRIES BELOW IS CRITICAL FOR
# MAPPING PURPOSES.  
#
#####################################################################################################
declare -a SOURCE_DC_HOSTNAMES=(<Source/Original DC Hostname 1> <Source/Original DC Hostname 2>)
declare -a RECOVERY_DC_HOSTNAMES=(<New/Recovery DC Hostname 1> <New/Recovery DC Hostname 2>)
declare -a RECOVERY_DC_IP_ADDRESSES=("<New/Recovery DC Hostname 1 IP Address>" "<New/Recovery DC Hostname 2 IP Address>")
...
Ensure that the order of the Data Collectors for the source system and the recovery system is the same. The script uses the order of the list to map the primary system components to the recovery system.
3.6.1 and higher:
The disaster recovery script changed for CA Performance Management releases 3.6.1 and higher.
##############################################################
# UPDATE DAUSER/DAPASS BELOW TO REFLECT THE NON-ADMIN
# VERTICA USERNAME/PASSWORD FOR THIS SYSTEM
##############################################################
DAUSER=dauser
DAPASS=dapass
 
#######################################################################
# UPDATE TO REFLECT THE NEW/RECOVERY DATA AGGREGATOR'S IP ADDRESS BELOW
#######################################################################
RECOVERY_DA_IP_ADDRESS="<Recovery/New IP Address for the Data Aggregator>"
 
#####################################################################
# UPDATE TO REFLECT THE NEW/RECOVERY DATA AGGREGATOR'S HOSTNAME BELOW
#####################################################################
SOURCE_DA_HOSTNAME="<Source/Original Hostname for the Data Aggregator>"
RECOVERY_DA_HOSTNAME="<Recovery/New Hostname for the Data Aggregator>"
 
#####################################################################################################
#
# UPDATE THE FOLLOWING ARRAYS TO REFLECT THE SOURCE DATA
# COLLECTOR HOSTNAMES, NEW RECOVERY HOSTNAMES, AND NEW RECOVERY
# IP ADDRESSES RESPECTIVELY.
#
# IMPORTANT: THE ORDER OF THE ENTRIES BELOW IS CRITICAL FOR
# MAPPING PURPOSES.  IN ADDITION, PLEASE NOTE THAT IF MULTIPLE VALUES
# ARE REQUIRED, PLEASE SEPARATE VALUES WITH A SINGLE SPACE.
#
#####################################################################################################
declare -a SOURCE_DC_HOSTNAMES=(<Source/Original DC Hostname 1> <Source/Original DC Hostname 2>)
declare -a RECOVERY_DC_HOSTNAMES=(<New/Recovery DC Hostname 1> <New/Recovery DC Hostname 2>)
declare -a RECOVERY_DC_IP_ADDRESSES=("<New/Recovery DC Hostname 1 IP Address>" "<New/Recovery DC Hostname 2 IP Address>")
Performance Center Disaster Recovery Script
Location:
/opt/CA/PerformanceCenter/Tools/bin/update_pc_da_database_references.sh
On the Performance Center host in the recovery system, update the following values in the script to match your system:
...
##################################################################
# UPDATE THE FOLLOWING PC/DA VARIABLES TO REFLECT NEW ENVIRONMENT
##################################################################
NEW_PC_IP_ADDRESS="<Recovery/New PC IP Address>"
NEW_PC_HOSTNAME="<Recovery/New PC Hostname>"
NEW_PC_EVENT_PRODUCER_PORT=8181
NEW_PC_EVENT_PRODUCER_PROTOCOL="http" # change to "https" if using SSL
NEW_DA_IP_ADDRESS="<Recovery/New DA IP Address>"
NEW_DA_HOSTNAME="<Recovery/New DA Hostname>"
NEW_DA_PORT_NUMBER=8581
...
Activate the Recovery System
If a large-scale disaster occurs, and the primary system is unavailable, start the recovery system.
Startup time for the recovery system is approximately the time that is required to start the Data Aggregator.
Start the Data Repository
Follow these steps:
  1. Log in to the recovery database cluster as the database admin user.
  2. Open Vertica admin Tools:
    /opt/vertica/bin/adminTools
  3. Select (3) Start Database.
  4. Press the Space bar next to the database name, select OK, and press Enter.
    You are prompted for the database password.
  5. Enter the database password and press Enter.
    Data Repository starts.
  6. Select Exit, and press Enter.
  7. Run the Data Repository disaster recovery script:
    /opt/CA/IMDataRepository_vertica/your_update_da_dc_database_references.sh
Start the Data Aggregator
Do one of the following steps:
  • Start the ActiveMQ and Data Aggregator services:
    service activemq start
    service dadaemon start
  • (Fault tolerant environment) Run one of the following commands to enable the fault tolerant Data Aggregator so that it can start when necessary:
    • RHEL 6.x:
      service dadaemon activate
    • RHEL 7.x, SLES, or OL:
      DA_Install_Directory/scripts/dadaemon activate
Data Aggregator starts. If the Data Repository is unavailable, the Data Aggregator shuts down.
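A quick way to confirm that the Data Aggregator came up is to check the service and its REST port. This is a sketch under assumptions: the dadaemon init script supports the status action, and ss is available (RHEL 7.x or later; use netstat -ltn on RHEL 6.x):
service dadaemon status
ss -ltn | grep 8581    # the Data Aggregator REST port should be listening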
Start Performance Center
Follow these steps:
  1. Restore the Performance Center backups. For more information, see Restore Performance Center.
  2. Run the Performance Center disaster recovery script:
    /opt/CA/PerformanceCenter/Tools/bin/your_update_pc_da_database_references.sh
  3. Start the SSO service:
    service caperfcenter_sso start
  4. Wait one minute, then start the event manager and device manager:
    service caperfcenter_eventmanager start
    service caperfcenter_devicemanager start
  5. Wait one minute, then start the console service:
    service caperfcenter_console start
Start the Data Collectors
Use the following command to start the Data Collector service:
service dcmd start
The Data Collector restarts. If the Data Aggregator is unavailable, the Data Collector shuts down.
For remote Data Collectors, update them to connect to the recovery Data Aggregator. For more information, see Configure Data Collector When the Data Aggregator IP Address Changes.
(Optional) Test the Recovery System
You can optionally test the recovery system manually.
Follow these steps:
  1. Pause the incremental data transfer.
  2. Start the Data Repository:
    1. Log in to the recovery database cluster as the database admin user.
    2. Open Vertica admin Tools:
      /opt/vertica/bin/adminTools
    3. Select (3) Start Database.
    4. Press the Space bar next to the database name, select OK, and press Enter.
      You are prompted for the database password.
    5. Enter the database password and press Enter.
      Data Repository starts.
    6. Select Exit, and press Enter.
    7. Run the Data Repository disaster recovery script:
      /opt/CA/IMDataRepository_vertica/your_update_da_dc_database_references.sh
  3. Start the Data Aggregator:
    1. Do one of the following steps:
      • Start the ActiveMQ and Data Aggregator services:
        service activemq start
        service dadaemon start
      • (Fault tolerant environment) Run one of the following commands to enable the fault tolerant Data Aggregator so that it can start when necessary:
        • RHEL 6.x:
          service dadaemon activate
        • RHEL 7.x, SLES, or OL:
          DA_Install_Directory/scripts/dadaemon activate
      Data Aggregator starts. If the Data Repository is unavailable, the Data Aggregator shuts down.
  4. Start Performance Center:
    1. Restore the Performance Center backups. For more information, see Restore Performance Center.
    2. Run the Performance Center disaster recovery script:
      /opt/CA/PerformanceCenter/Tools/bin/your_update_pc_da_database_references.sh
    3. Start the SSO service:
      service caperfcenter_sso start
    4. Wait one minute, then start the event manager and device manager:
      service caperfcenter_eventmanager start
      service caperfcenter_devicemanager start
    5. Wait one minute, then start the console service:
      service caperfcenter_console start
  5. Log in to the recovery Performance Center component and run reports against the recovery Data Repository and Data Aggregator.
  6. Verify that the data is available.
  7. (Optional) If you have a set of recovery Data Collectors that you can double-poll for testing, start one or more Data Collectors. Verify polling occurs and the data is stored in the database.
    To prevent the recovery system from issuing duplicate notifications and reports during testing, disable them beforehand.
    1. Use the following command to start the Data Collector service:
      service dcmd start
      The Data Collector restarts. If the Data Aggregator is unavailable, the Data Collector shuts down.
      For remote Data Collectors, update them to connect to the recovery Data Aggregator. For more information, see Configure Data Collector When the Data Aggregator IP Address Changes.
  8. Shut down each component.
    After the next incremental data transfer, the database is in sync with the primary system again. To fail over or test again, repeat these steps.