Mask Data Stored in Hadoop

You can use Fast Data Masker masking functions as Hive user-defined functions (UDFs) to mask data stored in Hadoop. The stored data must be structured data and must have a defined schema.
CA TDM provides a JAR file that includes Hive UDFs, which are built on a standalone Java masking library. This library contains the Fast Data Masker masking functions. When you execute these Hive UDFs in your Hadoop environment, they apply the defined masking operations to the structured data.
High-Level Architecture
The following illustration shows a simple representation of the interaction among different systems:
[Illustration: Hadoop_Hive_Masking]
The details are as follows:
  • The digit 1 in the diagram shows a user executing Hive UDFs provided in the JAR file through the Hive query language and accessing the structured data that is stored in Hadoop. These Hive UDFs include Fast Data Masker masking functions, which are provided in a masking library.
  • The digit 2 in the diagram shows the updates that are made to the structured data in Hadoop as a result of executing Hive UDFs.
Mask Structured Data
The high-level process to mask structured data stored in Hadoop by using the provided JAR files includes the following steps:
  1. Review the files in the masking package.
  2. Review the supported masking functions.
  3. Deploy the required JAR files and register the provided Hive UDFs on the system where Hive is already present.
  4. Execute the appropriate Hive UDFs using the Hive query language.
Review the Files in the Masking Package
CA TDM provides the following files for this masking use case. These files are included in a .zip file (MaskingSDK-<version>.zip), located in the root directory of your CA TDM installation media:
  • JAR files
    The following JAR files are available in the package to help you mask the structured data that is stored in Hadoop:
    • catdm-masker-hive-<version>.jar
      This JAR file includes Hive UDFs. These Hive UDFs include Fast Data Masker masking functions. You can use the Hive UDFs in your Hive queries to perform the masking operation.
    • catdm-masker-library-<version>.jar
      This JAR file contains a library of all the supported Fast Data Masker masking functions. Hive UDFs in the catdm-masker-hive-<version>.jar file reference these masking functions.
  • HQL file
    The catdm-masker-init.hql utility automates the following tasks in the Hive environment:
    1. Adds both JAR files to the Hive session.
    2. Adds defined Hive UDFs for all the supported Fast Data Masker masking functions to the Hive session.
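As a rough illustration of the two tasks above, an initialization script of this kind typically consists of Hive "add jar" and "create temporary function" statements. The following is only a sketch, not the contents of the shipped file: the directory is borrowed from the example response shown later in this topic, the use of temporary (session-scoped) functions is an assumption, and <class_for_tdm_add> and <class_for_tdm_adddays> are placeholders for class names that you must take from the provided catdm-masker-init.hql file.

```sql
-- Sketch only; the shipped catdm-masker-init.hql is the authoritative version.

-- Task 1: add both JAR files to the Hive session.
-- Replace the directory and <version> with the values for your environment.
add jar /home/hadoopuser/camasking/catdm-masker-hive-<version>.jar;
add jar /home/hadoopuser/camasking/catdm-masker-library-<version>.jar;

-- Task 2: register one function per masking UDF.
-- The class names are placeholders (assumption); copy the real ones from the shipped file.
create temporary function tdm_add as '<class_for_tdm_add>';
create temporary function tdm_adddays as '<class_for_tdm_adddays>';
-- ... one statement for each remaining Hive UDF ...
```

If the functions are registered as temporary functions, they exist only for the current Hive session, so each new session would need to run the script again.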
Review the Supported Masking Functions
The following table shows Fast Data Masker masking functions and their corresponding Hive UDF equivalents:
| Fast Data Masker Masking Function | Corresponding Hive UDF |
|-----------------------------------|------------------------|
| ADD                               | tdm_add                |
| ADDDAYS                           | tdm_adddays            |
| ADDPERCENT                        | tdm_addpercent         |
| ADDRANDOM                         | tdm_addrandom          |
| ADDRANDOMDAYS                     | tdm_addrandomdays      |
| ADDRANDOMHOURS                    | tdm_addrandomhours     |
| ADDRANDOMMINUTES                  | tdm_addrandomminutes   |
| ADDRANDOMSECONDS                  | tdm_addrandomseconds   |
| ADDRANDOMYEARS                    | tdm_addrandomyears     |
| AMEXCARD                          | tdm_amexcard           |
| CONCAT                            | tdm_concatenate        |
| DOB                               | tdm_dateofbirth        |
| DOD                               | tdm_dateofdeath        |
| DECRYPT                           | tdm_decrypt            |
| ENCRYPT                           | tdm_encrypt            |
| FILL                              | tdm_fill               |
| FIXEDDAY                          | tdm_fixedday           |
| GENCARD                           | tdm_generatecard       |
| GUID                              | tdm_guid               |
| HASH                              | tdm_hash               |
| INTRANGE                          | tdm_intrange           |
| LUHN                              | tdm_luhn               |
| MASTERCARD                        | tdm_mastercard         |
| NINO                              | tdm_nino               |
| NUMERICRANGE                      | tdm_numericrange       |
| PARTMASK                          | tdm_partialmask        |
| RANDOM                            | tdm_random             |
| RANDOMDATE                        | tdm_randomdate         |
| RANDOMDAYS                        | tdm_randomdays         |
| RANDEIN                           | tdm_randomein          |
| RANDHIC                           | tdm_randomhic          |
| RANDSSN                           | tdm_randomssn          |
| RANDOMTXT                         | tdm_randomtext         |
| TRANSLATE                         | tdm_translate          |
| TRANSPOSE                         | tdm_transpose          |
| TRIM                              | tdm_trim               |
| VISACARD                          | tdm_visacard           |
Note: For more information about the Fast Data Masker masking functions, see the Masking Functions and Parameters section.
Deploy the JAR Files and Register Hive UDFs
To deploy the JAR files and register defined Hive UDFs in your Hive environment, run the catdm-masker-init.hql utility as follows:
  1. Extract the MaskingSDK-<version>.zip file from the root directory of your CA TDM installation media to an appropriate location.
  2. Locate the utility (catdm-masker-init.hql) and copy it to a computer where Hive is available. Ensure that no special characters are added to the utility file name.
  3. Locate the JAR files (catdm-masker-hive-<version>.jar and catdm-masker-library-<version>.jar) and copy them to the same computer.
  4. Update the catdm-masker-init.hql utility with the paths of the JAR files.
  5. Run the following command in the Hive environment to execute the catdm-masker-init.hql utility:
    hive -i catdm-masker-init.hql
    The utility successfully adds the JAR files and defined Hive UDFs to the Hive session.
  6. Verify that the JAR files are added successfully by using the following Hive statement:
    list jars;
    Example response is as follows:
    /home/hadoopuser/camasking/catdm-masker-hive-<version>.jar
    /home/hadoopuser/camasking/catdm-masker-library-<version>.jar
  7. Verify that all the Hive UDFs present in the catdm-masker-init.hql utility are added to your Hive environment by using the following Hive statement:
    show functions;
    Example response is as follows; the list includes all the Hive UDFs that you have added:
    acos
    array
    tdm_add
    tdm_adddays
    tdm_addpercent
    ....
    ....
    tdm_trim
    tdm_visacard
    Note: In addition to the Hive UDFs that you have added, the list also displays any other functions that are already present in the environment. For example, acos and array are two functions that are already present in the Hive environment.
Execute Hive UDFs
Use Hive UDFs that the deployed JAR file includes to perform the supported masking operations.
  1. View all Hive UDFs that are available in your Hive environment by using the following Hive statement:
    show functions;
    All Hive UDFs in the Hive environment are displayed.
  2. Note the Hive UDF name that you want to use for masking.
  3. Use the following Hive statements to learn more about the Hive UDF:
    desc function <function_name>;
    This statement provides information about the Hive UDF.
    desc function extended <function_name>;
    This statement provides an example of the Hive UDF usage.
    The appropriate description of the Hive UDF is displayed.
  4. View the schema of the table that you want to mask by using the following Hive statement:
    desc <table_name>;
    The table schema is displayed on the screen.
  5. View the data in the table that you want to mask by using the following Hive statement:
    select * from <table_name>;
    The existing data is displayed on the screen.
  6. Use the appropriate Hive UDF in a Hive "select" statement to preview how the structured data is masked in the database table:
    select <UDF_name_with_parameters> from <table_name>;
    The result of the Hive "select" statement shows how the structured data would be masked in the database if you use the Hive UDF in a Hive "insert" statement.
  7. Use the insert statement to mask the data in the database:
    insert overwrite table <table_name> select <Hive_UDF>(<table_column_1>),<table_column_2>,....,<table_column_n> from <table_name>;
    The data is updated and is masked in the table.
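The preview and masking steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the customers table, its columns (id, name, ssn), and the customers_backup table are hypothetical, and calling tdm_hash with a single column argument is assumed rather than confirmed. Check the actual arguments with "desc function extended" before you run anything similar.

```sql
-- Hypothetical table and columns, for illustration only: customers(id, name, ssn).

-- Keep a copy first; "insert overwrite" replaces the existing table contents.
create table customers_backup as select * from customers;

-- Confirm which arguments the UDF expects.
desc function extended tdm_hash;

-- Preview the original and masked values side by side.
-- The single-argument call is an assumption.
select ssn, tdm_hash(ssn) as masked_ssn from customers limit 10;

-- Apply the masking. List every column in schema order and wrap only
-- the column that you want to mask.
insert overwrite table customers
select id, name, tdm_hash(ssn) from customers;
```

Because the "insert overwrite" statement rewrites the whole table, the select list has to include the unmasked columns as well, in the same order as the table schema.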