High Availability / Failover Support on Windows

This document explains how to enable and test failover support (High Availability) on CA XCOM™ Data Transport® for Windows (CA XCOM).
It is provided without any warranty whatsoever. In particular, the Windows configurations are provided as is, and no support is available for Windows-related questions.
High Availability
A Windows failover cluster can provide High Availability for CA XCOM. A failover cluster is a group of computers or nodes that work together to increase the availability of CA XCOM file transfers. With failover enabled, the nodes are proactively monitored for proper operation. If a node fails, either it is restarted, or its services are moved to another node so that users experience minimum disruption of services.
The choice of Windows Server 2019 for the failover cluster in this document is arbitrary, as is the decision to use a domain; a failover cluster can also be built without a domain, for example. Distributed File System (DFS) Replication is just one way to preserve data integrity when a computer fails. CA XCOM is not limited to this failover setup.
A computer can fail in many different ways: for example, through a software or hardware error, with or without file system corruption. For this failover configuration, the test consists of rebooting the active server 1 to 2 minutes into a 15-minute file transfer, for transfers initiated both locally (Windows side) and remotely (Linux side). The parameters that allow a successful completion are discussed in this document; they depend heavily on the file size, the network performance, and the PC hardware. It is more important to show which parameters influence which behavior than to give particular values, which would prove inappropriate at different network speeds, for example.
For the tests discussed here, the components that are key to the successful completion of a file transfer fall into three parts:
  1. The CA XCOM service, registry entries, and network-related resources
  2. The CA XCOM work files (Q directory and configuration files)
  3. The file being transferred
To successfully resume a file transfer when the active machine fails, all three items above must be covered.
Failover Clustering handles item 1 correctly.
DFS Replication addresses failures for item 2.
However, DFS Replication does not correctly replicate a file (item 3) while XCOM is writing to it. During a transfer, the file size shows as 0 bytes because the file stays open, without interruption, for the entire duration of the transfer. At each checkpoint, CA XCOM on Windows flushes the file, but that is not enough to allow replication. As a result, DFS Replication does not even create a temporary file on the backup machine, and a restart that uses checkpointing fails the transfer. As a workaround, checkpointing can be turned off, in which case the restart succeeds, albeit at the cost of retransmitting the entire file. A RAID array would be another solution.
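The interaction between checkpointing and replication described above can be summarized as a small decision model. The following Python sketch only illustrates the reasoning; it is not CA XCOM logic, and the function name `plan_restart` and its parameters are hypothetical.

```python
def plan_restart(checkpoint_count, bytes_at_last_checkpoint, bytes_on_backup_node):
    """Model the outcome of restarting a transfer on the backup node.

    Hypothetical helper: it only mirrors the behavior observed in the tests.
    """
    if checkpoint_count == 0:
        # Checkpointing is off: the restart retransmits the entire file.
        return "retransmit from byte 0"
    if bytes_on_backup_node < bytes_at_last_checkpoint:
        # The open file was not replicated, so the data that the checkpoint
        # refers to does not exist on the backup node.
        return "restart fails"
    # The data is available on the backup node (shared or RAID storage, say).
    return f"resume from byte {bytes_at_last_checkpoint}"
```

With DFS Replication, the backup node holds 0 bytes of the file during the transfer, so only the first branch (checkpointing off) leads to a successful restart.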
Implementing Failover Support
Implementation Process
This section describes how to implement CA XCOM in the failover cluster.
Once the failover cluster and the DFS Replication configuration used in these tests are defined, the process for implementing failover support is as follows:
  1. Deploy CA XCOM for Windows
  2. Configure the CA XCOM service for High Availability
Prerequisites
This particular setup requires a failover cluster. This document does not list all prerequisites; for those, review the Microsoft webpage on failover clustering. Servers xxxxxx66 and xxxxxx67 constitute the failover cluster in the tests.
Data availability is implemented with DFS (Distributed File System) Replication, which needs DFS Namespaces. For more information, refer to the webpages DFS Namespaces and DFS Replication.
The tests use the following DFS Replication configuration:
Replication Group Name: XCOMRPLC
Replication Group Description: XCOM_HOME installation files and transfer data replication
Topology type: Full mesh
List of connections (2):
  xxxxxx67 -> xxxxxx66
  xxxxxx66 -> xxxxxx67
Default Connection Schedule: Replicate continuously with Full bandwidth
Primary Member: xxxxxx66
Replicated Folder Name: D
  Member: xxxxxx66  Path: D:\  Status: Enabled
  Member: xxxxxx67  Path: D:\  Status: Enabled
NTFS Permission: From primary
Deploy CA XCOM
This section describes how to deploy CA XCOM in a failover cluster. On both machines, the D: drive will be used for XCOM. First, create the D:\XCOM_HOME directory on one machine and check that it has replicated on the other. On the other machine, create D:\install for the software to be installed and verify that it has replicated to the first machine.
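The replication check described above can be automated by polling for the directory on the second machine. The following Python sketch is one possible way to do it; the helper name `wait_for_replica` and the timeout values are assumptions, not part of CA XCOM or DFS.

```python
import os
import time


def wait_for_replica(path, timeout_s=300.0, poll_s=5.0):
    """Return True once `path` exists, or False after `timeout_s` seconds."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, after creating D:\XCOM_HOME on the first machine, running `wait_for_replica(r"D:\XCOM_HOME")` on the second machine reports whether the directory has arrived within five minutes.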
Install an up-to-date Java 8 version, then install XCOM 11.6 SP3 plus patch on both machines into the D:\XCOM_HOME directory. Installing on both machines overwrites the D:\XCOM_HOME directory, but it is necessary to create the registry entries and the XCOMD service on each machine.
Set the Log On account of the XCOMD service to the administrator.
Test that you can start the XCOMD service on each machine, but not simultaneously on both.
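The service check can be scripted around the Windows `sc query` command. The following Python sketch assumes that the service short name is XCOMD, as used in this document, and that `sc query` prints its usual `STATE : <number> <name>` line; the helper names are hypothetical.

```python
import re
import subprocess


def parse_service_state(sc_output):
    """Extract the state name (for example RUNNING) from `sc query` output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else None


def query_service_state(service="XCOMD", host=None):
    """Run `sc query` locally, or against a remote host when one is given."""
    cmd = ["sc"] + ([rf"\\{host}"] if host else []) + ["query", service]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return parse_service_state(result.stdout)


def at_most_one_running(states):
    """Check that the service is not running on more than one node."""
    return sum(1 for state in states.values() if state == "RUNNING") <= 1
```

A possible use is to collect `query_service_state(host=name)` for xxxxxx66 and xxxxxx67 into a dictionary and pass it to `at_most_one_running`.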
Configure the CA XCOM Service for High Availability
This section explains how to configure the XCOMD CA XCOM Scheduler Service for High Availability.
Follow these steps:
  1. In the Failover Cluster Manager tree, right-click the cluster that you want to configure and select Configure Role.
  2. Choose Select Role and select Generic Service from the list of roles.
  3. Choose Select Service and select XCOMD CA XCOM Scheduler Service from the list of Windows services.
  4. Choose Client Access Point and enter the name that the clients will use when accessing the clustered role.
  5. Choose Replicate Registry Settings and add the following two registry entries:
     SOFTWARE\ComputerAssociates\TraceLogFacility
     SOFTWARE\ComputerAssociates\XCOM
  6. Click Next twice and Finish.
  7. In the Failover Cluster Manager, select Roles. The right pane shows the cluster application.
If DHCP is not available, the clustered role may be missing an IP address. In that case, obtain a static IP address in the same subnet and configure it for the role.
Your XCOM failover system is ready.
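The wizard steps above can presumably also be scripted with the `Add-ClusterGenericServiceRole` cmdlet of the FailoverClusters PowerShell module, which accepts `-ServiceName`, `-Name`, `-CheckpointKey`, and `-StaticAddress`. This alternative was not part of the tests in this document. The following Python sketch only assembles the command text; the service short name XCOMD and the role name XCOMROLE are assumptions.

```python
def build_generic_role_command(service, role_name, checkpoint_keys, static_address=None):
    """Return a PowerShell command line that creates a Generic Service role."""
    parts = [
        "Add-ClusterGenericServiceRole",
        f'-ServiceName "{service}"',
        f'-Name "{role_name}"',
        # Registry keys to replicate, relative to HKEY_LOCAL_MACHINE.
        "-CheckpointKey " + ",".join(f'"{key}"' for key in checkpoint_keys),
    ]
    if static_address:
        # Only needed when DHCP cannot supply an address for the role.
        parts.append(f'-StaticAddress "{static_address}"')
    return " ".join(parts)


command = build_generic_role_command(
    "XCOMD",     # assumed short name of the XCOMD CA XCOM Scheduler Service
    "XCOMROLE",  # hypothetical client access point name
    [
        r"SOFTWARE\ComputerAssociates\TraceLogFacility",
        r"SOFTWARE\ComputerAssociates\XCOM",
    ],
)
print(command)
```

Review the generated command against the Microsoft documentation before running it on a cluster node.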
Test Failover Support
This section explains how to test CA XCOM failover support.
Test Environment
The figure that follows shows the cluster components. XCOMHACluster is the name of the cluster. xxxxxx66 is the primary node and xxxxxx67 is the other node in the cluster. Disk D: is replicated on the two nodes.
The node status is displayed in the Failover Cluster Manager. The following screen shows that both nodes in the cluster are active.
The details of the cluster components are as follows:
Function                 IP Address
Cluster IP address       X.Y.136.126
xxxxxx66                 X.Y.137.1
xxxxxx67                 X.Y.137.2
Prepare Two Locally (Windows) Initiated and Two Remote (Linux) Transfers.
Adapt the xcom.ses files so that more than one transfer can run concurrently.
config\xcom.ses file on Windows:
# connection_profile = number_of_sessions_allowed
# MVS=2
Y.Z.72.82=6
config/xcom.ses file on Linux:
# connection_profile = number_of_sessions_allowed
# MvsTS222=2
# Cluster IP address
X.Y.136.126=6
# xxxxxx66
X.Y.137.1=6
# xxxxxx67
X.Y.137.2=6
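To check that each partner address allows enough concurrent sessions, the xcom.ses entries above can be read with a few lines of Python. This sketch assumes one `profile=count` entry per line and `#` for comment lines, as in the files shown; the function name `parse_session_limits` is hypothetical.

```python
def parse_session_limits(text):
    """Map each connection profile in xcom.ses text to its session limit."""
    limits = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        profile, sep, count = line.partition("=")
        if sep and count.strip().isdigit():
            limits[profile.strip()] = int(count.strip())
    return limits
```

For the Linux file above, the result contains a limit of 6 for the cluster address and for each of the two node addresses, which is enough for the four concurrent test transfers.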
On each system, Windows and Linux, configure the following transfers (swap the local and remote file names on the other system):
# 14 July 2020, user, XCOM Windows 11.6 SP3
# Failover test, send file
QUEUE=YES
USERID=user
#!ENCRYPT
PASSWORD.ENCRYPTED=74 4a 7a 25 fe af f6 ….
REMOTE_SYSTEM=X.Y.136.126
PORT=8044
# Send file (3) Receive file (4)
TRANSFER_TYPE=3
LOCAL_FILE=D:\transfer\1GB.dat
REMOTE_FILE=/tmp/win_send_1GB.dat
FILE_OPTION=REPLACE
CODE_FLAG=BINARY
CARRIAGE_FLAG=NO
COMPRESS=NO
# Set a checkpoint at every 1MB
MAXRECLEN=1024
CHECKPOINT_COUNT=1000
# Retry 10 times every minute
NUMBER_OF_RETRIES=10
RETRY_TIME=60
# continued receive file
CONTROL=NEWXFER
QUEUE=YES
USERID=user
#!ENCRYPT
PASSWORD.ENCRYPTED=74 4a 7a 25 fe af f6 ….
REMOTE_SYSTEM_RF=X.Y.136.126
PORT=8044
# Send file (3) Receive file (4)
TRANSFER_TYPE=4
REMOTE_FILE_RF=/tmp/1GB.dat
LOCAL_FILE_RF=D:\transfer\win_recv_1GB.dat
FILE_OPTION_RF=REPLACE
CODE_FLAG=BINARY
CARRIAGE_FLAG=NO
COMPRESS=NO
# Set a checkpoint at every 1MB
MAXRECLEN=1024
CHECKPOINT_COUNT=1000
# Retry 10 times every minute
NUMBER_OF_RETRIES=10
RETRY_TIME=60
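The checkpoint and retry settings above translate into a checkpoint interval and a retry window. The following Python sketch works through the arithmetic, assuming that each record is MAXRECLEN bytes long (as the comment in the configuration implies), that RETRY_TIME is in seconds, and that the test file is 1073741824 bytes.

```python
MAXRECLEN = 1024          # bytes per record, from the configuration above
CHECKPOINT_COUNT = 1000   # records between checkpoints
NUMBER_OF_RETRIES = 10
RETRY_TIME = 60           # seconds between retries
FILE_SIZE = 1073741824    # size of the 1 GB test file, in bytes

bytes_per_checkpoint = MAXRECLEN * CHECKPOINT_COUNT
full_checkpoints = FILE_SIZE // bytes_per_checkpoint
retry_window_s = NUMBER_OF_RETRIES * RETRY_TIME

print(bytes_per_checkpoint)  # 1024000
print(full_checkpoints)      # 1048
print(retry_window_s)        # 600
```

Under these assumptions, the partner keeps retrying for roughly 10 minutes, so the failover has to complete within that window for the transfer to resume.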
Perform the Test in a Given Chronological Order
In our test environment, the four transfers running concurrently complete in about 10 minutes. To test the failover configuration, we reboot the active server about 1 minute 30 seconds into the transfers:
Start stopwatch:
$ xcomtcp -f /opt/CA/XCOM/config/cluster_transfers_1GB.cnf
2020/07/15 13:08:27 TID=000711 [/tmp/1GB_linux.dat --> D:\transfer\linux_send_1GB_linux.dat at X.Y.136.126] XCOMU0024I Transfer scheduled for future execution.
2020/07/15 13:08:27 TID=000712 [/tmp/linux_recv_1GB.dat <-- D:\transfer\1GB.dat at X.Y.136.126] XCOMU0024I Transfer scheduled for future execution.
user2 000711 - ACTIVE LOCAL XCOMU0029I Locally initiated transfer started.
user2 000712 - ACTIVE LOCAL XCOMU0029I Locally initiated transfer started.
After 30 seconds, on Windows:
D:\XCOM_HOME> xcomtcp -f D:\XCOM_HOME\Config\linux_bglor_send_recv_1GB.cnf
Copyright (c) 2012 CA. All rights reserved.
2020/07/15 17:09:58 TID=000061 [D:\transfer\1GB.dat --> /tmp/win_send_1GB.dat at Y.Z.72.82] XCOMN0024I Transfer scheduled for future execution.
2020/07/15 17:09:58 TID=000062 [D:\transfer\win_recv_1GB_linux.dat <-- /tmp/1GB_linux.dat at Y.Z.72.82] XCOMN0024I Transfer scheduled for future execution.
D:\XCOM_HOME> xcomqm -La
xcomuser 000059 - ACTIVE REMOTE XCOMN0026I Remotely initiated first try.
xcomuser 000060 - ACTIVE REMOTE XCOMN0026I Remotely initiated first try.
administrator 000061 - ACTIVE LOCAL XCOMN0029I Locally initiated transfer started.
administrator 000062 - ACTIVE LOCAL XCOMN0029I Locally initiated transfer started.
xcomqm displayed a total
After 1 minute 30 seconds, reboot the active server.
On the other Windows server we see:
D:\XCOM_HOME> xcomqm -La
xcomuser 000059 - WAITING REMOTE RESTARTABLE XCOMN0436E TP ended abnormally.
xcomuser 000060 - WAITING REMOTE RESTARTABLE XCOMN0436E TP ended abnormally.
administrator 000061 - PENDING LOCAL RESTARTABLE XCOMN0436E TP ended abnormally.
administrator 000062 - PENDING LOCAL RESTARTABLE XCOMN0436E TP ended abnormally.
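When the test is repeated many times, it can help to read the `xcomqm -La` queue listing programmatically. The following Python sketch parses one entry per line; the column layout (user, transfer ID, status, origin, optional result, message code, text) is inferred from the sample output in this document rather than from the CA XCOM documentation, and the names are hypothetical.

```python
import re

QUEUE_LINE = re.compile(
    r"^(?P<user>\S+)\s+(?P<tid>\d{6})\s+-\s+(?P<status>[A-Z]+)\s+"
    r"(?P<origin>LOCAL|REMOTE)\s+(?:(?P<result>[A-Z]+)\s+)?"
    r"#?(?P<code>XCOM[A-Z]\d{4}[A-Z])\s+(?P<text>.*)$"
)


def parse_queue_line(line):
    """Return the fields of one queue entry as a dict, or None if no match."""
    match = QUEUE_LINE.match(line.strip())
    return match.groupdict() if match else None


def restartable_tids(lines):
    """List the transfer IDs that the queue reports as RESTARTABLE."""
    records = (parse_queue_line(line) for line in lines)
    return [r["tid"] for r in records if r and r["result"] == "RESTARTABLE"]
```

Applied to the listing above, `restartable_tids` returns the four transfer IDs 000059 through 000062.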
Linux:
user2 000711 - DONE LOCAL RESTARTABLE #XCOMN0417E Error opening temporary file name
user2 000712 - ACTIVE LOCAL RESTARTED XCOMU0047I Local transfer restarted .
xcomuser 000713 - ACTIVE REMOTE RESTARTED XCOMU0027I Remotely initiated restart.
xcomuser 000714 - DONE REMOTE RESTARTABLE #XCOMU0505E Received a signal from TCP/IP.
# Windows transfer 000062 changed IP address, hence a new request is generated for Linux transfer 000714:
xcomuser 000715 - ACTIVE REMOTE XCOMU0026I Remotely initiated first try.
After 15 minutes:
Windows:
xcomuser 000067 - DONE REMOTE SUCCESSFUL XCOMN0011I Transfer ended; 1052689 records (1073741824 bytes) transmitted in 680 seconds (1579032 bytes/second)
xcomuser 000068 - DONE REMOTE SUCCESSFUL XCOMN0011I Transfer ended; 104858 records (1073741824 bytes) transmitted in 225 seconds (4772185 bytes/second)
administrator 000069 - DONE LOCAL SUCCESSFUL XCOMN0011I Transfer ended; 104858 records (1073741824 bytes) transmitted in 495 seconds (2169175 bytes/second)
administrator 000070 - DONE LOCAL SUCCESSFUL XCOMN0011I Transfer ended; 1056833 records (1073741824 bytes) transmitted in 488 seconds (2200290 bytes/second)
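The throughput figures in the XCOMN0011I messages can be cross-checked against the byte counts and durations. The following Python sketch extracts the three numbers from each message and recomputes the rate with integer division; the function name `check_throughput` is hypothetical, and the message layout is taken from the output above.

```python
import re

RESULT = re.compile(
    r"\((\d+) bytes\) transmitted in (\d+) seconds \((\d+) bytes/second\)"
)


def check_throughput(text):
    """Return (reported, recomputed) bytes/second for each result in `text`."""
    pairs = []
    for size, seconds, reported in RESULT.findall(text):
        pairs.append((int(reported), int(size) // int(seconds)))
    return pairs
```

For the four Windows results above, the reported and recomputed values agree, which suggests that the rate is the byte count divided by the elapsed seconds, truncated to a whole number.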
Linux:
user2 000724 - DONE LOCAL SUCCESSFUL XCOMU0011I Transfer ended; 1052689 blocks
xcomuser 000725 - DONE REMOTE SUCCESSFUL XCOMU0011I Transfer ended; 104858 blocks
user2 000723 - DONE LOCAL SUCCESSFUL XCOMU0011I Transfer ended; 104858 blocks
xcomuser 000726 - DONE REMOTE FAILED XCOMN0416E Error writing output file: Permission denied
xcomuser 000727 - DONE REMOTE SUCCESSFUL XCOMU0011I Transfer ended; 1056833 blocks
# Windows transfer 000070 changed IP address, hence a new request is generated for Linux transfer 000726/7.
Conclusions:
The following conclusions can be drawn from enabling and testing failover support (High Availability) on CA XCOM™ Data Transport® for Windows:
  1. It is important to test extensively to have a reliable setup.
  2. Both the Generic Service role of the cluster and the DFS replication of the XCOM configuration files and Q directory are robust.
  3. For locally (Windows) initiated transfers, the use of the server IP address rather than the common cluster address results in a new transfer request being generated on the remote (Linux) system.
  4. Files that XCOM is transferring cannot be replicated reliably by DFS Replication. Because the files are open for the entire duration of the transfer, DFS Replication only kicks in at the end of the transfer. The solution is to set CHECKPOINT_COUNT=0 for receive-file transfers initiated by the Windows cluster and for remotely initiated send-file transfers, that is, for the transfers that write a file on the Windows cluster. Alternatively, use another means, such as RAID, to achieve high availability.