Gateway Disaster Recovery System
Disaster Recovery (DR) is a critical component of a highly available environment. A typical API Gateway cluster is configured in a single data center. Should disaster strike the data center (for example, an earthquake, a flood, or a human-caused catastrophe), there must be a process to bring the Gateways back online as quickly as possible.
There are several possible solutions to a DR configuration, for example:
- Keep a spare non-running Gateway with recent backups that are manually restored.
- Maintain a fully functional Gateway in a remote location, ready to take over from the primary Gateway cluster at a moment's notice.
This chapter outlines the second option: how to configure a DR system with a single node in a remote location. You will learn how to create a database node that replicates the database from the secondary node of the cluster (known as "chain replication"). The DR node is kept disabled so that it cannot write to the database and cause collisions. Activating the DR node is a manual process that requires several steps.
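Because the DR node only stays current while replication from the secondary node is running, it can help to check replication health before relying on it. The following is a minimal sketch, not part of the Gateway product. It assumes the database reports MySQL-style replication status fields (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master) and that those values have already been read into a dictionary. The function name and the 60-second threshold are hypothetical.

```python
from typing import Mapping, Optional


def replica_is_current(status: Mapping[str, Optional[str]],
                       max_lag_seconds: int = 60) -> bool:
    """Return True if the replica threads are running and lag is acceptable.

    `status` is assumed to hold MySQL-style replication status fields.
    The 60-second default is an arbitrary illustration, not a recommendation.
    """
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")

    # Lag is reported as NULL when replication is not running.
    if not (io_running and sql_running) or lag in (None, "NULL"):
        return False
    return int(lag) <= max_lag_seconds


if __name__ == "__main__":
    dr_status = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": "5",
    }
    print(replica_is_current(dr_status))
```

In a chain, the same check would apply at each hop: the secondary node replicating from the primary, and the DR node replicating from the secondary.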
Before invoking a Disaster Recovery system, ensure that:
- The DR system is in a “warm ready” state with a most-recent-possible copy of the configuration. If your DR system can tolerate a stale configuration, then using a non-replicated database may be a better option.
- An operating API Gateway cluster is configured, with two database nodes.
- All systems are mapped in the /etc/hosts files, if they are not configured in the DNS.
- All ancillary systems also have a redundant configuration in the DR environment for the Gateway to access. These systems must have the same mappings as the live cluster, including application servers, LDAP, JMS, JDBC, SNMP, SMTP, etc.
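The hosts-file prerequisite above can be checked mechanically. The following is a minimal sketch that parses the text of a hosts file and reports which required host names have no mapping. The host names and addresses in the example are hypothetical placeholders, and the helper names are not part of any Gateway tooling.

```python
from typing import Iterable, List, Set


def hostnames_in_hosts_text(hosts_text: str) -> Set[str]:
    """Collect every host name and alias found in hosts-file text."""
    names: Set[str] = set()
    for line in hosts_text.splitlines():
        # Drop comments, then split into fields: IP address first, names after.
        fields = line.split("#", 1)[0].split()
        names.update(name.lower() for name in fields[1:])
    return names


def missing_mappings(required: Iterable[str], hosts_text: str) -> List[str]:
    """Return the required host names that the hosts-file text does not map."""
    known = hostnames_in_hosts_text(hosts_text)
    return sorted(host for host in required if host.lower() not in known)


if __name__ == "__main__":
    # Hypothetical sample; on a real node, read the text from /etc/hosts.
    sample = """
    127.0.0.1   localhost
    10.0.0.11   gw-primary.example.com gw-primary   # primary node
    10.0.0.12   gw-secondary.example.com
    """
    required_hosts = ["gw-primary", "gw-dr.example.com", "ldap.example.com"]
    print(missing_mappings(required_hosts, sample))
```

This only covers hosts-file entries. Systems that are resolved through DNS would need a separate lookup.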
Configuring a Disaster Recovery node as described here has the advantage of being fully up to date and ready to go live with minimal effort. Impact on the production cluster nodes is limited to replication reading from the secondary database node.
Consider the issue of load capacity when using a single DR node. If normal traffic is greater than what a single node can handle, you have two options:
- Apply some form of traffic shaping in the DR networking infrastructure.
- Configure the DR system as a cluster itself.
A DR cluster needs to be limited to a single database node for initial takeover, with both processing nodes in a disabled state until activated. If there is a chance that the DR cluster will run for an extended period, it is possible to configure the DR cluster with a replicated database.
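The load-capacity question above comes down to simple arithmetic. The following is a rough sketch, assuming that traffic and node capacity can each be expressed as a single requests-per-second figure. The function names and the sample numbers are hypothetical, and real sizing would need measured figures.

```python
import math


def dr_nodes_needed(peak_rps: float, node_capacity_rps: float) -> int:
    """Number of processing nodes needed to carry the given traffic."""
    if node_capacity_rps <= 0:
        raise ValueError("node_capacity_rps must be positive")
    return max(1, math.ceil(peak_rps / node_capacity_rps))


def fraction_to_shed(peak_rps: float, node_capacity_rps: float,
                     nodes: int = 1) -> float:
    """Fraction of traffic that shaping must remove to fit the DR capacity."""
    capacity = node_capacity_rps * nodes
    if peak_rps <= capacity:
        return 0.0
    return 1.0 - capacity / peak_rps


if __name__ == "__main__":
    # Hypothetical figures: 1500 requests/s normal traffic, 1000 requests/s per node.
    print(dr_nodes_needed(1500, 1000))
    print(fraction_to_shed(2000, 1000))
```

If the shed fraction for a single node is zero, one DR node is enough. Otherwise, either that fraction of traffic must be shaped away or the DR system must be sized as a cluster.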
Disaster Recovery Alternatives
If you do not create a formal disaster recovery plan, the alternatives are more ad hoc and less effective. One option is to periodically retrieve a backup image from the primary node using wget and then run it through ssgrestore.sh on an automated basis. The major disadvantage is that the data may be out of date by several hours or more. Retrieving a full backup image also has a larger performance impact on the primary database node. You also need to address the implications of OS-level settings (such as IP addresses).
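The staleness of this ad hoc approach can be estimated up front. The following is a minimal sketch, assuming that backups are retrieved at a fixed interval and that transfer plus restore takes a known amount of time. The function names, the sample schedule, and the tolerance are hypothetical.

```python
from datetime import timedelta


def worst_case_data_age(retrieval_interval: timedelta,
                        transfer_and_restore: timedelta) -> timedelta:
    """Oldest the restored data can be just before the next restore completes."""
    return retrieval_interval + transfer_and_restore


def within_tolerance(retrieval_interval: timedelta,
                     transfer_and_restore: timedelta,
                     tolerance: timedelta) -> bool:
    """True if the worst-case data age stays within the tolerated staleness."""
    age = worst_case_data_age(retrieval_interval, transfer_and_restore)
    return age <= tolerance


if __name__ == "__main__":
    # Hypothetical schedule: retrieve every 4 hours, 30 minutes to transfer and restore.
    interval = timedelta(hours=4)
    restore = timedelta(minutes=30)
    print(worst_case_data_age(interval, restore))
    print(within_tolerance(interval, restore, tolerance=timedelta(hours=2)))
```

Shortening the interval reduces staleness but increases how often the primary database node bears the cost of producing a full backup image.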