Gateway Disaster Recovery System

Disaster Recovery (DR) is a critical component of a highly available environment. A typical API Gateway cluster is configured in a single data center. Should disaster strike the data center (for example, an earthquake, a flood, or a human-caused catastrophe), there must be a process to bring the Gateways back online as quickly as possible.
There are several possible solutions to a DR configuration, for example:
  • Keep a spare, non-running Gateway with recent backups that are restored manually.
  • Keep a fully functional Gateway in a remote location, ready to take over from the primary Gateway cluster at a moment's notice.
This chapter outlines the second option: how to configure a DR system with a single node in a remote location. You will learn how to create a database node that replicates the database from the secondary node of the cluster (known as "chain replication"). The DR node is kept disabled to prevent it from writing to the database and causing collisions. Activating the DR node is a manual process that requires several steps.
Assumptions
Before invoking a Disaster Recovery system, ensure that:
  • The DR system is in a “warm ready” state with a most-recent-possible copy of the configuration. If your DR system can tolerate a stale configuration, then using a non-replicated database may be a better option.
  • An operating API Gateway cluster is configured, with two database nodes.
  • All systems are mapped in the /etc/hosts files, if they are not configured in the DNS.
  • All ancillary systems also have a redundant configuration in the DR environment for the Gateway to access. These systems must have the same mappings as the live cluster, including application servers, LDAP, JMS, JDBC, SNMP, SMTP, etc.
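The host-mapping assumption above can be checked before a failover is ever needed. The following is a minimal sketch in Python, not part of the Gateway tooling: it parses the text of a hosts file and reports which required names have no mapping. The host names and addresses in the sample data are hypothetical placeholders.

```python
def parse_hosts(text):
    """Return a dict mapping each host name or alias to its IP address."""
    mapping = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        fields = entry.split()
        if len(fields) < 2:
            continue  # an address with no names maps nothing
        ip, names = fields[0], fields[1:]
        for name in names:
            # Keep the first mapping seen for a given name.
            mapping.setdefault(name.lower(), ip)
    return mapping


def missing_hosts(text, required):
    """Return the required names that have no entry in the hosts text."""
    mapping = parse_hosts(text)
    return [name for name in required if name.lower() not in mapping]


# Hypothetical sample data; substitute the names used in your environment.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
10.0.1.10   gateway1.example.com gateway1   # primary node
10.0.1.11   gateway2.example.com gateway2   # secondary node
10.9.1.10   gatewaydr.example.com gatewaydr # DR node
"""

if __name__ == "__main__":
    required = ["gateway1.example.com", "gateway2.example.com",
                "gatewaydr.example.com", "ldap.example.com"]
    # Lists the required names that still need a mapping.
    print(missing_hosts(SAMPLE_HOSTS, required))
```

To check a real node, read the contents of /etc/hosts into a string and pass it in place of `SAMPLE_HOSTS`. Names that are resolved through DNS rather than the hosts file would be reported as missing by this sketch, so treat the result as a prompt for review rather than a definitive failure.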
Advantages
Configuring a Disaster Recovery node as described here has the advantage of being fully up to date and ready to go live with minimal effort. Impact on the production cluster nodes is limited to replication reading from the secondary database node.
Consider load capacity when using a single DR node. If normal traffic is greater than what a single node can handle, you have two options:
  • Apply some form of traffic shaping in the DR networking infrastructure.
  • Configure the DR system as a cluster itself.
A DR cluster needs to be limited to a single database node for initial takeover, with both processing nodes in a disabled state until activated. If there is a chance that the DR cluster will run for an extended period, it is possible to configure the DR cluster with a replicated database.
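The capacity question above comes down to simple arithmetic. The sketch below, in Python, estimates how many DR processing nodes would be needed for a given peak load. The function name, the request rates, and the 20% headroom default are all illustrative assumptions; measure the throughput of your own nodes before relying on any such figure.

```python
import math


def dr_nodes_needed(peak_rate, node_capacity, headroom=0.2):
    """Estimate the number of DR nodes needed to carry peak_rate.

    peak_rate      expected peak traffic (for example, requests per second)
    node_capacity  sustained traffic a single node can handle, same unit
    headroom       fraction of each node's capacity held in reserve (assumed)
    """
    if peak_rate < 0:
        raise ValueError("peak_rate must not be negative")
    if node_capacity <= 0:
        raise ValueError("node_capacity must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable = node_capacity * (1 - headroom)
    # At least one node is always required, even for negligible traffic.
    return max(1, math.ceil(peak_rate / usable))


if __name__ == "__main__":
    # Hypothetical figures: each node sustains 500 requests per second.
    for peak in (300, 1000):
        nodes = dr_nodes_needed(peak, 500)
        plan = "single DR node" if nodes == 1 else "DR cluster or traffic shaping"
        print(f"peak {peak}/s -> {nodes} node(s): {plan}")
```

A result of 1 suggests that a single DR node is sufficient; anything higher points to one of the two options listed above.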
Disaster Recovery Alternatives
If you do not create a formal disaster recovery plan, the alternatives are more ad hoc and less effective. One option is to periodically retrieve a backup image from the primary node using wget and then run it through ssgrestore.sh on an automated basis. The major disadvantage is that the data may be out of date by several hours or more. Retrieving a full backup image also has a larger performance impact on the primary database node. You also need to address the implications of OS-level settings (such as IP addresses).