Fast Data Masker Best Practices
This article focuses on the best practices that are used to mask tables. Use the Fast Data Masker functions to mask tables, seed tables, work with columns, and for cross-reference mapping.
This article focuses on the best practices that are used to mask tables. Use the Fast Data Masker functions to mask tables, seed tables, work with columns, and for cross-reference mapping.
These best practices make the Fast Data Masker tasks easier to understand and to modify.
Manage Seed Tables
The seedtables subdirectory contains seed data in editable text files. The subdirectory is located in the install directory; the following line is an example of the directory path.
Each file contains a list of values that you can use to mask the columns.
The following text file shows the contents of the sample seed file
Amanda Amanda Angela Anita Ann Anna Anne Anne Ashley Beate Betty Beverly Bonnie Carla Carla Carol Carol Carolyn Cary Casey Cathy Chris Christina Christine Christine Christine Cindy Cindy Claire Clarise Collette Connie Cory Courtney Darsha Edith
To use the seed file to mask database columns, enter the seed file name with the appropriate masking function. The following example shows how to mask the column, FIRST_NAME of the table PEOPLE:
The Fast Data Masker component uses parallel processes to ensure the highest possible performance. To do so, determine the amount of work that is required to allocate to each parallel thread.
To automatically assign each table its own thread when you mask multiple tables in one masking run, set the
PARALLEL=option in the
Optionstab. Similarly, when you mask a partitioned table in Oracle, the partitions in the table are automatically assigned to separate threads.
To mask large tables, split the row to be masked for a given thread. The simplest way to split the table is to use the SQL WHERE clauses. When you click on a table in the first tab and center list more than once, the
Add Maskingdialog opens.
Yes, and multiple tabs are provided for the same table. In this way, you can apply the same mask multiple times but with different WHERE clauses. Fast Data Masker can split the mask work into logical chunks, and can assign a thread for each.
Note:For more information, see Use Parallel Threads to Mask Data. Additionally, when masking Microsoft SQL Server tables, use the PARALLEL option only when the table has a primary key or unique index. If the table does not have a primary key or unique index and you use the PARALLEL option, masking is either slow or does not work.
The following table shows an example that is displayed on the
Summarytab after the split:
where id < 10000
where id > 10000
To use the cross-reference functionality, create a table in one of your connections that has the following structure:
CREATE TABLE gtsrc_xref (rx_ref_id varchar(254) NOT NULL,rx_old_value varchar(254) NOT NULL,rx_new_value varchar(254) );ALTER TABLE gtsrc_xrefADD CONSTRAINT gtsrc_xref_pk PRIMARY KEY (rx_ref_id ,rx_old_value );
If you clear the table down, the cross-reference that is mapped is rebuilt during the next mask run.
Note:Using cross-reference to mask large volumes is time consuming. Each row that is to be masked has to perform a lookup in the cross-reference table. To optimize this functionality, see the FASTXREF scramble option. To maintain consistency across different tables and databases when you use seed values to mask, we recommend that you use the HASHLOV function without cross-reference. The HASHLOV function consistently converts an existing value to an integer value in a given range. For example, if you mask a person LAST_NAME, and use the seed data category LASTNAME (which has 2000 entries), the existing LAST_NAME value is assigned a number from 1 through 2000. This value is used to look up the corresponding value in the seed list. For example, the LAST_NAME Jones is always masked to value Burton.
Add WHERE Clause to a Table
The WHERE clause allows you to restrict your masking to only certain rows in the table. This allows you to mask, for example, male names and female names differently based on the GENDER column.
As an example, consider a scenario where you want to mask the employee first names (FIRST_NAME) in the table EMPLOYEE for those rows where the value in the EMPLOYEE ID column is lesser than 5 (EMPLOYEE_ID<5). To perform the mask in the table EMPLOYEE, you use the WHERE CONDITION field to specify the condition (EMPLOYEE_ID<5), select the FIRST_NAME for masking, and choose the appropriate masking options. The following table shows the example:
The following table shows another example. In this example, all the HELP TEXT columns are masked with random values from companies.txt. The rows with an MSP_CODE starting with 'HSD' has the value huw assigned to CREATED BY. The rows with an MSP_CODE starting with 'OFG' has the value fred assigned to CREATED BY:
MSP_CODE LIKE 'HSD%'
MSP_CODE LIKE 'OFG%'
Review the following points:
- To execute the WHERE clause, the WHERE condition for a map is required before subsequent masks for the same table. In other words, the WHERE clause applies to the mask for a table that follows in the CSV file. The other rows for the table specify the specific mask that is applied to the table.
- WHERE clauses are required to contain ANSI standard SQL. You cannot use special characters like; or /.
- Fixed width records cannot have zero length. Therefore, WHERE COL = "" does not work because the column value is padded out to its fixed width size.
- Fast Data Masker currently supports the following operators: <,<=,=, =>, >, IN and LIKE (LIKE is not supported for flat files. It is only used for database masking).
Apply Multiple Functions to the Same Column
When you apply multiple functions to substrings in the same column, list the functions in consecutive rows in the mapping CSV; for example, as shown in the following table:
No limit is set to the number of mask functions that are performed on each column. However, set KeepNulls the same for all the functions.
Note: All functions are required to have "
substr start" and "
substr length" set. Substr specifications for different functions can refer to the same character positions. Also, if you are using cross-references, they (substr specifications) can only be set for the last function that is used on each column as shown in the following example table:
In the preceding example, the mapping CSV shows that a mask of column ACCOUNT_NUMBER in the table PAYMENT_OPTIONS with a zero padded sequence in positions 7 through 16 is expressed as follows:
NOTE: SEQNUMBER returns a left-aligned non-padded number. The sequence is started at 1000000001 and the lead digit is then set to 0.
In the audit report, the function names are suffixed with start:length parameters and concatenated together. For example, "SEQNUMBER(7:10) FIXED(7:1)".
If you specify cross-referencing, the value that is used to look up and update the cross-reference table is a substring of the value, not the whole column value.
Manage Primary and Foreign Keys
To mask primary keys, unique indexes, or columns that match with foreign keys, you might have to drop them before you mask. You can reapply them once the mask is complete.
If you have the DBUPDATES=P option when you run the mask, no masking takes place. But, Fast Data Masker checks if any FK constraints or triggers exist for the tables to be masked. Pre- and post-step scripts are produced to disable or drop the constraints and re-enable or create them after the mask.
It is easy to create duplicate values when you mask primary key columns. For example, you have ID values of 2,5,7,8,100, and apply a SEQNUMBER masking function starting at 100, the value 2 is updated to the value 100 which creates a duplicate.
If you have any questions before you begin, contact CA Support.
Manage Memory Usage
To mask large or multiple tables and seed tables, we recommend that you increase the memory that is allocated to the Fast Data Masker executable.
To increase the allocated memory, specify the appropriate values in the
Start Memory (Mb)and
Max Memory (Mb)fields in the
Summarytab while defining the masking. When you save the defined masking information, Fast Data Masker creates a batch file. This batch file includes the values that you specify for the memory in the
Summarytab. The batch file also includes other required information (for example, location of the connection file, masking file, and options file). The following screenshot shows the increased memory values:
The default values are 100 MB and 1000 MB. In this screenshot, the values are increased to 1000 MB and 10000 MB.
The following snippet shows the example content included in the batch file that is generated after you save the masking. Review that the parameters
Xmx10000Mrepresent the increased memory values:
java -Djava.util.logging.config.file="C:/Program Files/Grid-Tools/FastDataMasker/logging.properties" -Xms1000M -Xmx10000M -jar "C:/Program Files/Grid-Tools/FastDataMasker/Fastdatamasker.jar" "C:/Program Files/Grid-Tools/FastDataMasker/doc/connectSQLSERVER.txt" "C:\Users\
Very Large Table Masking using Large Table Split Feature
Using Large Table Split feature, you can split a very large table to be masked into blocks. Each block is processed by an independent thread. These blocks are processed by the multiple threads running in parallel.
Database Type Support
Large Table Split functionality is only supported for SQL Server, DB2 on LUW and Oracle databases.
Masking Very Large Tables
From TDM 4.9, when masking a very large table with a primary key or a unique index, performance can be improved using the following options:
- LARGETABLESPLITENABLEDEnables large tables processing by splitting into blocks. Set this parameter to Y to enable, and to N to disable.Default: N
- LARGETABLESPLITSIZEDefines the maximum size in rows for each block for Fast Data Masker to start using large table split processing.Default: 1000000
With this setting, Fast Data Masker processes large tables by generating several blocks, with each block containing
LARGETABLESPLITSIZErows for processing.
The existing option
PARALLELdefines the number of threads that can run concurrently to process the blocks. If the
PARALLELoption is not set, and you enable
PARALLELis set to 10 by default. If there are more blocks than threads, the extra blocks are queued for processing. These extra blocks are processed when the threads become available.
The Parallel option is used to define the maximum number of separate threads that the FDM instance utilizes. Each thread consumes additional CPU cycles.
Starting from FDM version 4.9.170, a new performance enhancement is added. The performance on very large tables is improved when the table uses a primary key or a unique index on a single column of numeric type. FDM uses this feature automatically without any input from the user as long as the primary key or unique index are of numeric type.
This new feature does NOT apply if the primary key or unique index is composite or the single column is not of numeric type.
You can now use the pseudo column called
rowidfor Oracle database when the primary key or unique index does not comply with the performance requirements.
(Oracle Only) This new option is called
ORACLEUSEROWIDwhich is set to N by default. When this option is enabled, Fast Data Masker is enforced to use the pseudo column
rowid. The pseudo column
rowidcan be used when the primary key or unique index column is not composite or of non-numeric type. Enabling this option allows FDM to use the performance improvement.
Write Logs to a Local Drive
As a best practice, we recommend that you write the logs to a local drive, not a network drive. Audit/application logs on remote drives significantly affect the masking performance.
Configure the Fast Data Masker Logs Location
By default, all Fast Data Masker logs are located in %AppData%\Grid-Tools\FastDataMasker\Logs (for example, C:\Users\
<user_name>\AppData\Roaming\Grid-Tools\FastDataMasker\logs). If you are unable to use this folder and want to change the location where Fast Data Masker logs are stored, you can do so in the Fast Data Masker Mapper.
Follow these steps:
- ClickStart, All Programs, FastDataMasker, FastDataMaskerto open the Fast Data Masker Mapper.
- Click theOptionstab.
- Locate the row that contains the LOGDIR option.
- In theValuecolumn, enter the file path where you want to save the the Fast Data Masker log files.
- Complete your masking rules and run a masking job.
- Access the file path that you specified in the LOGDIR option.Your log files are now available in this location.
Hash on the Value of a different XML tag when using HASHLOV for XML files
When masking XML files using XPATH elements, you may want to define a custom hash column so you can hash on a different XML element than the default, which is to hash on the current value being masked. For example, when you use an XML input file to update information of a pre-existing user in the system, the masked values in the file must match those in the database. If you use a member ID for hashing in the database which is also in the XML file, you want to use that ID as the hash column.
You can customize the
Relative XPATH to hash onat the step where you configure seed data for HASHLOV. Enter a tag name using XPATH syntax, for example,
Use Parallel Threads to Mask Data
Fast Data Masker lets you use parallel threads to mask large tables. The PARALLEL option enables you to run
To apply parallel threads, Fast Data Masker must split the work to be done into separate chunks. Fast Data Masker can manage this task in one of the following ways:
- A regular mask of multiple tables where none of the tables has a large amount of data. Fast Data Masker automatically sets a thread for each table.
- A large table that is not partitioned. In this case, split the table using thewhereclauses. For this to work, the following must apply:
- The masking CSV must only contain mask for a single table.
- Thewhereclauses must not overlap; for example, if two or more SQL Where clauses select the same rows in the table to be masked, then the mask causes row lock errors.
- For an Oracle-partitioned table, Fast Data Masker automatically assigns a thread to each underlying partition. This cannot be combined with thewhereclauses, and as in point two above, it is applicable to a mask for a single large table.
Note:The number of parallel threads that you can execute concurrently is constrained by the number of physical cores and/or processors available. If the parallel number specified in the options is greater than the number of cores, then some of the threads are held in a queue until resources become available.
To split a table with the
whereclause, use Fast Data Masker as follows:
- Access the Fast Data Masker UI.
- Use the required connection file to connect to the database.Note:When masking Microsoft SQL Server tables, use the PARALLEL option only when the table has a primary key or unique index. If the table does not have a primary key or unique index and you use the PARALLEL option, masking is either slow or does not work.
- Select theMaskingtab and perform the following steps:
- Select the table (for example,PERSONS) in which you want to mask the data.
- Select the columns (for example,FIRST_NAMEandLAST_NAME) that you want to mask.
- Provide appropriate information (for example,HASHLOVmasking type,FULL NAMEdata category, and so on) in the relevant fields.
- Select theOptionstab and specify the number of threads (for example, 4) for thePARALLELoption.
- Select theSummarytab and review the information. The following table shows the example information that is displayed before you split the table:TableColumnFunctionParm1Parm2PERSONSFIRST_NAMEHASHLOVFULL NAME3PERSONSLAST_NAMEHASHLOVFULL NAME4
- Click theSplit Tablebutton.TheSplit Tabledialog opens.
- Enter the following information and clickOK:
- Enter the number of threads; for example, 4.
- Select the numeric column (for example,PERSON_ID) that you want to use for the split.We recommend that you choose an indexed column or a column that is a primary key. That is, the column must contains unique values.
- Review the updated information in theSummarytab. The following table shows the example information after you split the table:TableColumnFunctionParm1Parm2PERSONSWHEREPERSON_ID < 153PERSONSFIRST_NAMEHASHLOVFULL NAME3PERSONSLAST_NAMEHASHLOVFULL NAME4PERSONSWHEREPERSON_ID BETWEEN 153 AND 205PERSONSFIRST_NAMEHASHLOVFULL NAME3PERSONSLAST_NAMEHASHLOVFULL NAME4PERSONSWHEREPERSON_ID BETWEEN 206 AND 258PERSONSFIRST_NAMEHASHLOVFULL NAME3PERSONSLAST_NAMEHASHLOVFULL NAME4PERSONSWHEREPERSON_ID BETWEEN 259 AND 313PERSONSFIRST_NAMEHASHLOVFULL NAME3PERSONSLAST_NAMEHASHLOVFULL NAME4
- Click theSave & Run Maskbutton.
- Enter the name for the masking file that includes all the mapping information and clickSave.
- Enter the name for the options file that includes the selected options information and clickSave.
- ClickOK.A dialog opens that displays the masking progress.
- Review the information and save or close the dialog when masking is done.
You have successfully used parallel threads to mask the data.
Run Fast Data Masker Scripts Remotely
When you save the defined masking information in Fast Data Masker, Fast Data Masker creates a batch file. This batch file includes information about the location of the connection file, masking file, options file, and other related information (for example, start memory). You can use this batch file to run from a remote location where Fast Data Masker is not installed. This is helpful in scenarios where you are facing performance issues on your server because of different components installed on the server. And, to improve the performance, you want to run the batch script from a different server.
To run this batch file from a remote location, edit the batch file and update the paths to the Fast Data Masker .jar, connection file, masking file, and options file. Ensure that they all point to valid locations. You can then run the file from that remote location.
The following example snippet shows the contents of a masking batch file; update the required file locations based on your requirement:
java -Djava.util.logging.config.file="C:/Program Files/Grid-Tools/FastDataMasker/logging.properties" -Xms1000M -Xmx10000M -jar "C:/Program Files/Grid-Tools/FastDataMasker/Fastdatamasker.jar" "C:/Program Files/Grid-Tools/FastDataMasker/doc/connectSQLSERVER.txt" "C:\Users\<username>\AppData\Roaming\Grid-Tools\Fastdatamasker\MyMask.csv" "C:\Users\<username>\AppData\Roaming\Grid-Tools\Fastdatamasker\MyMask_options.txt"
In this snippet, you can find the following example locations:
- C:/Program Files/Grid-Tools/FastDataMasker/Fastdatamasker.jar shows the location of the Fast Data Masker .jar file.
- C:/Program Files/Grid-Tools/FastDataMasker/doc/connectSQLSERVER.txt shows the location of the Microsoft SQL Server connection file.
- C:\Users\<username>\AppData\Roaming\Grid-Tools\Fastdatamasker\MyMask.csv shows the location of the file that contains masking information.
- C:\Users\<username>\AppData\Roaming\Grid-Tools\Fastdatamasker\MyMask_options.txt shows the location of the file that contains the applied options information.