Mask Data Stored in Hadoop

You can use Fast Data Masker masking functions as Hive user-defined functions (UDFs) to mask data stored in Hadoop. The stored data must be structured data and must have a defined schema.
CA TDM provides a JAR file that includes Hive UDFs, which are developed based on a standalone Java masking library. The Java masking library includes Fast Data Masker masking functions. When you execute these Hive UDFs (provided in the JAR file) in your Hadoop environment, they perform the defined masking operations and mask the structured data.
High-Level Architecture
The following illustration shows a simple representation of the interaction among different systems:
Hadoop_Hive_Masking
The details are as follows:
  • The digit 1 in the diagram shows a user executing Hive UDFs provided in the JAR file through the Hive query language and accessing the structured data that is stored in Hadoop. These Hive UDFs include Fast Data Masker masking functions, which are provided in a masking library.
  • The digit 2 in the diagram shows the updates that are made to the structured data in Hadoop as a result of executing Hive UDFs.
Mask Structured Data
The high-level process to mask structured data that is stored in Hadoop by using the provided JAR files includes the following steps:
  1. Review the files in the masking package.
  2. Review the supported masking functions.
  3. Deploy the required JAR files and register provided Hive UDFs on the system where Hive is already present.
  4. Execute the appropriate Hive UDFs using the Hive query language.
Review the Files in the Masking Package
CA TDM provides the following files for this masking use case. These files are included in a .zip file (MaskingSDK-<version>.zip), located in the root directory of your CA TDM installation media:
  • JAR files
    The following JAR files are available in the package to help you mask the structured data that is stored in Hadoop:
    • catdm-masker-hive-<version>.jar
      This JAR file includes Hive UDFs. These Hive UDFs include Fast Data Masker masking functions. You can use the Hive UDFs in your Hive queries to perform masking operations.
    • catdm-masker-library-<version>.jar
      This JAR file contains a library of all the supported Fast Data Masker masking functions. Hive UDFs in the catdm-masker-hive-<version>.jar file reference these masking functions.
    • commons-validator-<version>.jar
      This JAR file contains a library that is used by the HASHIBAN masking function.
  • The HQL file
    The catdm-masker-init.hql utility automates the following tasks in the Hive environment:
    1. Adds the JAR files to the Hive session.
    2. Adds defined Hive UDFs for all the supported Fast Data Masker masking functions to the Hive session.
  • seedtables-reference folder
    The seedtables-reference folder contains the seed tables that are packaged as part of the MaskingSDK. You can use these seed tables with the HASHLOV function.
  • ReadMe.txt
    This document explains how to use the seed tables and how to create custom seed tables for use with the HASHLOV function, with examples.
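The two tasks that the catdm-masker-init.hql utility automates can be pictured with a small Python sketch that only generates the text of such an initialization script. It is not the content of the shipped utility. The version strings and the Java class name are hypothetical placeholders; only the JAR file name patterns, the example directory, and the UDF name tdm_trim come from this article. The generated lines use the standard Hive ADD JAR and CREATE TEMPORARY FUNCTION statements.

```python
from pathlib import PurePosixPath

# The three JAR file name patterns listed in this article.
JAR_TEMPLATES = [
    "catdm-masker-hive-{version}.jar",
    "catdm-masker-library-{version}.jar",
    "commons-validator-{version}.jar",
]


def build_init_script(jar_dir, versions, udf_classes):
    """Return HQL text that adds the JARs and registers the UDFs.

    jar_dir     -- directory on the Hive host that holds the JAR files
    versions    -- one version string per entry in JAR_TEMPLATES
    udf_classes -- mapping of UDF name to implementing Java class
                   (the class names are hypothetical in this sketch)
    """
    lines = []
    # Task 1: add each JAR file to the Hive session.
    for template, version in zip(JAR_TEMPLATES, versions):
        jar_path = PurePosixPath(jar_dir) / template.format(version=version)
        lines.append(f"ADD JAR {jar_path};")
    # Task 2: register each UDF under its function name.
    for name, java_class in sorted(udf_classes.items()):
        lines.append(f"CREATE TEMPORARY FUNCTION {name} AS '{java_class}';")
    return "\n".join(lines) + "\n"


script = build_init_script(
    "/home/hadoopuser/camasking",
    ["1.0", "1.0", "1.6"],  # placeholder versions
    {"tdm_trim": "com.example.masking.TrimUdf"},  # hypothetical class name
)
print(script)
```

The real utility already contains the function definitions, so in practice you only edit the JAR paths in it.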
Review the Supported Masking Functions
For Hive UDFs, the first parameter is the original value to be masked. Use the desc function <function_name> statement to get more information about a Hive UDF, as described in this article.
For the supported masking functions and their parameters in Hadoop, see Supported Masking Functions in Hadoop.
Deploy the JAR Files and Register Hive UDFs
To deploy the JAR files and register defined Hive UDFs in your Hive environment, run the catdm-masker-init.hql utility as follows:
  1. Extract the MaskingSDK-<version>.zip file from the root directory of your CA TDM installation media to an appropriate location.
  2. Locate the utility (catdm-masker-init.hql) and copy it to a computer where Hive is available. Ensure that no special characters are added to the utility file name.
  3. Locate the JAR files (catdm-masker-hive-<version>.jar, catdm-masker-library-<version>.jar, and commons-validator-<version>.jar) and copy them to the same computer.
  4. Update the catdm-masker-init.hql utility with the paths of the JAR files.
  5. Run the following command in the Hive environment to execute the catdm-masker-init.hql utility:
    hive -i catdm-masker-init.hql
    The utility adds the JAR files and the defined Hive UDFs to the Hive session.
  6. Verify that the JAR files are added successfully by using the following Hive statement:
    list jars;
    Example response is as follows:
    /home/hadoopuser/camasking/catdm-masker-hive-<version>.jar
    /home/hadoopuser/camasking/catdm-masker-library-<version>.jar
    /home/hadoopuser/camasking/commons-validator-<version>.jar
  7. Verify that all the Hive UDFs present in the catdm-masker-init.hql utility are added to your Hive environment by using the following Hive statement:
    show functions;
    Example response is as follows; the list includes all the Hive UDFs that you have added:
    acos
    array
    tdm_add
    tdm_adddays
    tdm_addpercent
    ....
    ....
    tdm_trim
    tdm_visacard
    Note: In addition to the Hive UDFs that you have added, the list also displays other functions that are already present in the environment. For example, acos and array are two functions that are already present in the Hive environment.
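If you capture the output of the list jars and show functions statements as text, the two verification steps can be scripted. The following Python sketch is one possible check, not part of the MaskingSDK. It assumes that the output has one entry per line, as in the example responses above, and that the added UDFs share the tdm_ prefix seen in that example. The version numbers in the sample text are placeholders.

```python
# File name prefixes of the three JAR files named in this article.
EXPECTED_PREFIXES = (
    "catdm-masker-hive-",
    "catdm-masker-library-",
    "commons-validator-",
)


def missing_jars(list_jars_output):
    """Return the expected JAR prefixes that do not appear in the output."""
    # Keep only the file name part of each non-empty line.
    names = [line.strip().rsplit("/", 1)[-1]
             for line in list_jars_output.splitlines() if line.strip()]
    return [prefix for prefix in EXPECTED_PREFIXES
            if not any(n.startswith(prefix) and n.endswith(".jar") for n in names)]


def added_udfs(show_functions_output, prefix="tdm_"):
    """Return the function names that start with the assumed UDF prefix."""
    return [line.strip() for line in show_functions_output.splitlines()
            if line.strip().startswith(prefix)]


sample_jars = """\
/home/hadoopuser/camasking/catdm-masker-hive-1.0.jar
/home/hadoopuser/camasking/catdm-masker-library-1.0.jar
"""
sample_functions = "acos\narray\ntdm_add\ntdm_trim\n"

print(missing_jars(sample_jars))     # commons-validator is absent in this sample
print(added_udfs(sample_functions))  # built-in functions are filtered out
```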
In secured Hadoop clusters, adding JARs may result in an error message similar to the following:
insufficient privileges to execute add (state=42000, code=0)
One solution is to ask your Hadoop cluster administrator to add the JAR files through the Hive server configuration file, hive-site.xml, by defining the full paths of the JAR files in the hive.aux.jars.path property:
  1. Log on with Hadoop cluster admin privileges.
  2. Copy the JAR files (catdm-masker-library-<version>.jar, catdm-masker-hive-<version>.jar, and commons-validator-<version>.jar) to any directory on the Hive Server nodes.
    Example: /usr/catdm/maskingsdk/
  3. Edit the hive-site.xml file, define the hive.aux.jars.path property, and save the file.
    Example (hive-site.xml):
    <property>
      <name>hive.aux.jars.path</name>
      <value>file:///usr/catdm/maskingsdk/catdm-masker-library-<version>.jar,file:///usr/catdm/maskingsdk/catdm-masker-hive-<version>.jar,file:///usr/catdm/maskingsdk/commons-validator-<version>.jar</value>
    </property>
  4. Remove the add jar statements from the catdm-masker-init.hql file.
  5. Restart the Hive server for the configuration changes to take effect.
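To avoid typing errors in the long hive.aux.jars.path value, you can generate the property from the directory and the JAR file names. The following Python sketch shows one way to do that with the standard library. The version numbers are placeholders, and the helper name aux_jars_property is hypothetical. The sketch only prints the XML fragment; copying it into hive-site.xml remains a manual step for the cluster administrator.

```python
import xml.etree.ElementTree as ET


def aux_jars_property(jar_dir, jar_names):
    """Build a <property> element for hive.aux.jars.path."""
    # Each JAR is referenced by a file:// URI; entries are comma-separated.
    uris = ",".join(f"file://{jar_dir.rstrip('/')}/{name}" for name in jar_names)
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hive.aux.jars.path"
    ET.SubElement(prop, "value").text = uris
    return prop


prop = aux_jars_property(
    "/usr/catdm/maskingsdk/",
    [
        "catdm-masker-library-1.0.jar",  # placeholder versions
        "catdm-masker-hive-1.0.jar",
        "commons-validator-1.6.jar",
    ],
)
print(ET.tostring(prop, encoding="unicode"))
```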
Execute Hive UDFs
Use the Hive UDFs that the deployed JAR files include to perform the supported masking operations:
  1. View all Hive UDFs that are available in your Hive environment by using the following Hive statement:
    show functions;
    All Hive UDFs in the Hive environment are displayed.
  2. Note the Hive UDF name that you want to use for masking.
  3. Use the following Hive statements to learn more about the Hive UDF:
    desc function <function_name>;
    This statement provides information about the Hive UDF.
    desc function extended <function_name>;
    This statement provides an example of the Hive UDF usage.
    An appropriate description of the Hive UDF is displayed.
  4. View the schema of the table that you want to mask by using the following Hive statement:
    desc <table_name>;
    The table schema is displayed on the screen.
  5. View the data in the table that you want to mask by using the following Hive statement:
    select * from <table_name>;
    The existing data is displayed on the screen.
  6. Use the appropriate Hive UDF in a Hive "select" statement to preview how the structured data is masked in the database table:
    select <UDF_name_with_parameters> from <table_name>;
    The result of the Hive "select" statement shows how the structured data would be masked in the database if you use the Hive UDF in a Hive "insert" statement.
The MaskingSDK provides only the Hive UDFs that mask the data stored in Hadoop. To save the masked data into Hadoop, use an insert statement variant, such as insert overwrite or insert into, depending on the table configuration in Hive.
Example: Use the following insert statement to mask the data in the database:
insert overwrite table <table_name> select <Hive_UDF>(<table_column_1>), <table_column_2>, ...., <table_column_n> from <table_name>
The data is updated and is masked in the table.
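For wide tables, writing the column list by hand is error-prone, because every unmasked column must still appear in the select list in schema order. The following Python sketch builds the preview and the insert overwrite statements from a column list. The table name customers and its columns are hypothetical, and the single-argument call tdm_trim({col}) is an assumption; check the actual parameters with desc function before you use a UDF.

```python
def masked_select(table, columns, masks):
    """Return a select that applies a UDF expression to chosen columns.

    columns -- all column names of the table, in schema order
    masks   -- mapping of column name to a format string with a {col}
               placeholder, for example "tdm_trim({col})"
    """
    unknown = set(masks) - set(columns)
    if unknown:
        raise ValueError(f"columns not in table: {sorted(unknown)}")
    # Masked columns get the UDF expression; all others pass through.
    exprs = [masks[c].format(col=c) if c in masks else c for c in columns]
    return f"select {', '.join(exprs)} from {table}"


def preview_statement(table, columns, masks):
    """Statement that only displays the masked values."""
    return masked_select(table, columns, masks) + ";"


def overwrite_statement(table, columns, masks):
    """Statement that replaces the table contents with the masked rows."""
    return f"insert overwrite table {table} " + masked_select(table, columns, masks) + ";"


columns = ["id", "name", "city"]      # hypothetical schema
masks = {"name": "tdm_trim({col})"}   # assumed single-argument call

print(preview_statement("customers", columns, masks))
print(overwrite_statement("customers", columns, masks))
```

Running the preview statement first, as in step 6, lets you inspect the masked values before the overwrite changes the stored data.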