Topology Sample and Disaster Recovery

Topology Sample
The following diagram shows a sample topology with three multiwrite groups, each containing three multiwrite DSAs. All DSAs would have the same prefix.
Each group has a configured multiwrite hub (shown in red). An example of the write-precedence for each router DSA is also shown.
Sample topology 14.1
Disaster Recovery
While MW-DISP recovery ensures that data remains consistent across all replicating peers during outages, sometimes a DSA must be rebuilt, typically when the DSA must be rebuilt from backup after a hardware or disk failure or database corruption, or when the DSA does not start due to grid-related errors. In such cases, a disaster recovery procedure ensures that all DSAs are up and running with consistent data. Adding new multiwrite groups or DSAs can follow a similar process to synchronize data between peers. Also consider following a disaster recovery procedure after an extended outage: using a recovery procedure is more efficient than leaving MW-DISP to reconcile a large backlog of changes.
As with any disaster recovery procedure, use it in a test environment first so that deployment-specific steps can be documented.
When using multiwrite group hubs, there are two disaster recovery scenarios. The following steps ensure that the remaining DSAs stay active during recovery.
The recovery steps use the Sample Topology as a reference and can be customized for a specific deployment.
Multiwrite Group Peer Recovery
A DSA in a multiwrite group requires resynchronization to bring it back in line with the multiwrite hub servicing the group.
Disaster scenario:
DSA US2 fails to start with a grid-related error after the computer where it is running stops due to a kernel panic.
Step 1:
Ensure that the recovering DSA is in a stopped state.
  • dxserver stop US2
Step 2:
On the DSAs that replicate to US2, set the time that US2 was last updated to now, before taking the data snapshot (online dump) from US1. This step ensures that the DSAs replicating to US2 only send recovery updates made after the data snapshot was taken.
  • Host F: Run dxdisp US2 (This command sets the time US2 was last updated by US3)
  • Host D: Run dxdisp US2 (This command sets the time US2 was last updated by US1)
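For illustration, the two calls might look as follows (a minimal sketch; it assumes the dsa user's environment, including $DXHOME, is loaded on each host):
    # Host F (runs US3): record "now" as the time US3 last updated US2
    dxdisp US2

    # Host D (runs US1, the hub): record "now" as the time US1 last updated US2
    dxdisp US2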
Step 3:
When feasible, perform an online dump from the hub DSA (US1). The sooner the snapshot is taken after running dxdisp, the smaller the number of updates that are reapplied to US2 during recovery. Thus, the recovery is more efficient.
  • Host D: Telnet to the DSA console of US1 (hub) and run the “dump dxgrid-db;” command to begin an online dump. Then use the “logout;” command to exit the DSA console.
  • Host D: Check the US1 warn log to see when the dump has started and, more importantly, when it has completed.
  • Host D: Once the dump has completed, a file named $DXHOME/data/US1.zdb is created. Copy this file to Host E. For example, copy to Host E: /tmp/US1.zdb.
Compress files before copying between machines as most grid files compress well. Check that the timestamp is recent to ensure that the online backup command created the file being copied.
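As a sketch, this step on Host D might look like the following. The console port 10509, the warn log path under $DXHOME/logs, and the host name hostE are assumptions for this example; substitute the console-port configured for US1 and your own host names and log paths:
    # Host D: open the US1 DSA console and start the online dump
    telnet localhost 10509
        dump dxgrid-db;
        logout;

    # Host D: watch for the dump start and completion messages (log path assumed)
    tail -f $DXHOME/logs/US1_warn.log

    # Host D: once the dump has completed, compress the snapshot and copy it to Host E
    gzip -c $DXHOME/data/US1.zdb > /tmp/US1.zdb.gz
    scp /tmp/US1.zdb.gz dsa@hostE:/tmp/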
Step 4:
Prevent US2 from replaying updates back to the hub and peers.
  • Host E: dxdisp US1
  • Host E: dxdisp US3
Step 5:
Now that a snapshot from the hub has been taken, this information can be copied into place.
  • Host E: Remove the old transaction log (if enabled) - remove $DXHOME/data/US2.tx
  • Host E: Copy (and uncompress) the backup grid file that is generated in Step 3, for example, copy /tmp/US1.zdb $DXHOME/data/US2.db
  • Host E: dxserver start US2
  • Host E: After a short period, US2 is back in sync with US1. The progress of MW-DISP recovery can be followed in the alarm-log for US2.
US2 does not allow binds from routers or applications until recovery is complete. Recovery can take longer if a large volume of updates occurs in parallel.
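A sketch of the restore on Host E, assuming the snapshot was copied across as the compressed file /tmp/US1.zdb.gz (as in the sketch in Step 3):
    # Host E: remove the old transaction log (if transaction logging is enabled)
    rm -f $DXHOME/data/US2.tx

    # Host E: uncompress the US1 snapshot and install it as the US2 grid file
    gunzip -c /tmp/US1.zdb.gz > $DXHOME/data/US2.db

    # Host E: restart the recovering DSA; MW-DISP recovery progress appears in the US2 alarm log
    dxserver start US2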
Multiwrite Group Hub Recovery
When the hub DSA of a multiwrite group requires resynchronization, the scenario is a little more complicated. Because of how updates flow in this style of network topology, all the DSAs in the group that the hub services also require synchronization.
Disaster scenario:
DSA US1 fails to start with a grid-related error, after the computer where it is running stops due to a kernel panic.
Step 1:
Ensure the recovering group of DSAs are in a stopped state.
  • dxserver stop US1
  • dxserver stop US2
  • dxserver stop US3
Step 2:
On each hub, set the time that US1 was last updated to now, before taking the data snapshot (online dump) from one of the other hubs. This step ensures that the DSAs replicating to US1 only send recovery updates made after the data snapshot was taken.
  • Ensure that replication between AU3 and UK1 has the status OK. This status can be checked by issuing a “get dsp;” command on the console of hub AU3. This check ensures that when taking a snapshot from UK1, the data contains updates from AU3. After dxdisp is performed, AU3 is responsible for recovering these updates directly.
  • Host C: Run dxdisp US1 (this command sets the time US1 was last updated by *hub* AU3).
  • Host G: Run dxdisp US1 (this command sets the time US1 was last updated by *hub* UK1).
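For illustration, this step might be carried out as follows (a sketch; the AU3 console port 10511 is an assumption, substitute the console-port configured for AU3):
    # Host C: confirm that replication from AU3 to UK1 shows status OK
    telnet localhost 10511
        get dsp;
        logout;

    # Host C: record "now" as the time AU3 last updated US1
    dxdisp US1

    # Host G: record "now" as the time UK1 last updated US1
    dxdisp US1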
Step 3:
When feasible, perform an online dump from hub (UK1).
  • Host G: Telnet to the DSA console of UK1 (hub) and run the “dump dxgrid-db;” command to begin an online dump. Then use the “logout;” command to exit the DSA console.
  • Host G: Check the UK1 warn log to see when the dump has started and, more importantly, when it has completed.
  • Host G: Once the dump has completed, a file named $DXHOME/data/UK1.zdb is created.
  • Host G: Copy this file to Host D. For example, copy to Host D: /tmp/UK1.zdb
  • Host G: Copy this file to Host E. For example, copy to Host E: /tmp/UK1.zdb
  • Host G: Copy this file to Host F. For example, copy to Host F: /tmp/UK1.zdb
Compress files before copying between machines as most grid files compress well. Check that the timestamp is recent to ensure that the online backup command created the file being copied.
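A sketch of compressing the snapshot once and copying it to each host in the US group (the host names hostD, hostE, and hostF are placeholders for this example):
    # Host G: compress the UK1 snapshot, then copy it to each US-group host
    gzip -c $DXHOME/data/UK1.zdb > /tmp/UK1.zdb.gz
    for h in hostD hostE hostF; do
        scp /tmp/UK1.zdb.gz dsa@$h:/tmp/
    done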
Step 4:
Prevent US1 from replaying updates back to the hubs. Also prevent US2 and US3 from replaying updates back to hub US1 and to each other.
  • Host D: dxdisp AU3
  • Host D: dxdisp UK1
  • Host D: dxdisp US2
  • Host D: dxdisp US3
  • Host E: dxdisp US1
  • Host E: dxdisp US3
  • Host F: dxdisp US1
  • Host F: dxdisp US2
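For illustration, the same calls could be scripted per host (a minimal sketch; run each block as the dsa user on the host indicated):
    # Host D (runs US1): stop US1 replaying updates to the hubs and its group peers
    for peer in AU3 UK1 US2 US3; do dxdisp $peer; done

    # Host E (runs US2): stop US2 replaying updates into the recovering group
    dxdisp US1
    dxdisp US3

    # Host F (runs US3): stop US3 replaying updates into the recovering group
    dxdisp US1
    dxdisp US2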
Step 5:
Instate the snapshot from UK1 on each DSA in the US multiwrite group.
  • Host D: Remove the old transaction log (if enabled) - remove $DXHOME/data/US1.tx
  • Host D: Copy (and uncompress) the backup grid file that is generated in Step 3. For example, copy /tmp/UK1.zdb $DXHOME/data/US1.db.
  • Host D: dxserver start US1
  • Host E: Remove the old transaction log (if enabled) - remove $DXHOME/data/US2.tx
  • Host E: Copy (and uncompress) the backup grid file that is generated in Step 3. For example, copy /tmp/UK1.zdb $DXHOME/data/US2.db.
  • Host E: dxserver start US2
  • Host F: Remove the old transaction log (if enabled) - remove $DXHOME/data/US3.tx
  • Host F: Copy (and uncompress) the backup grid file that is generated in Step 3. For example, copy /tmp/UK1.zdb $DXHOME/data/US3.db.
  • Host F: dxserver start US3
  • Host D: After a period, US1 is back in sync with AU3 and UK1. The progress of MW-DISP recovery can be followed in the alarm-log for US1. The resynchronization of US1 also includes US2 and US3. The recovery process can be monitored by using the “get dsp;” command on the consoles of AU3, UK1, US1, US2, and US3 to ensure that replication is functioning as expected.
US1, US2, and US3 do not allow binds from routers or applications until recovery is complete. Recovery can take longer if a large volume of updates occurs in parallel.
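One way to monitor the recovery (a sketch; the alarm log path and console ports are assumptions for this example):
    # Host D: follow MW-DISP recovery progress for US1 in its alarm log (path assumed)
    tail -f $DXHOME/logs/US1_alarm.log

    # On each of AU3, UK1, US1, US2, and US3: check replication state from the DSA console
    telnet localhost <console-port>
        get dsp;
        logout;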
Notes
  • For Windows, the path to the grid files is %DXHOME%\data
  • Do not copy or modify the .dp files.
  • On UNIX, perform these steps as the DSA user, that is, dsa.
  • Do not copy the grid file before the dump is complete. Doing so can result in a corrupt or incomplete copy. In such a case, repeat the process.
  • Check the timestamp of the .zdb file to ensure that it was written recently and an older backup is not accidentally used.