Define a Spark Job

Spark jobs let you run Spark applications on clusters and monitor their status.
Spark is an open-source cluster computing framework. CA Workload Automation DE supports Spark on the following cluster managers:
  • Standalone: A simple cluster manager included with Spark.
  • DC/OS Mesos: The Mesosphere implementation of Apache Mesos, a general cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN: The resource manager in Hadoop 2.
  • Local: Runs Spark locally on a single machine rather than on a cluster manager.
You can deploy an application on a cluster in either client or cluster mode. In cluster mode, the framework launches the driver in the cluster. In client mode, the submitter launches the driver outside of the cluster.
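For example, the following illustrative spark-submit invocations show the difference between the two modes (the master and the application path are assumptions, not values from your environment):
spark-submit --master yarn --deploy-mode cluster /path/to/app.py
spark-submit --master yarn --deploy-mode client /path/to/app.py
In the first command, the driver runs on a node inside the YARN cluster; in the second, it runs on the machine that issued the command.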
  • This job produces a spool file.
  • To run this job, your system requires CA WA Advanced Integration for Hadoop 12.0.00.03 or higher. For information about installing Advanced Integration for Hadoop and the supported Hadoop distributions, see CA WA Advanced Integration for Hadoop.
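The application that the job submits is ordinary Spark code, written in the language that you later select in the Program type field (Java, Scala, or Python). The following is a minimal sketch of a Python (PySpark) word-count application that such a job could run; the application name and the input and output paths are illustrative assumptions:

from pyspark.sql import SparkSession

# Create or reuse a Spark session; the application name is illustrative.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read input text, split it into words, and count occurrences of each word.
lines = spark.read.text("hdfs:///data/input.txt")   # input path is an assumption
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count()

# Write the results and stop the session; the output path is an assumption.
counts.write.csv("hdfs:///data/output")
spark.stop()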
Follow these steps:
  1. Open the Application that you want to add the job to in the Define view:
    1. Select Application from the Define menu.
      The Application workspace appears.
    2. Click APPLICATION SEARCH.
    3. Click the Application in the Application Search pane that appears.
      The Application appears in the workspace.
  2. Select Big Data from the drop-down above the Job Palette view.
    The jobs of the Big Data category appear in the Palette view.
  3. Select and drag the SPARK job from the Palette view to the workspace.
    The Spark icon appears in the workspace view.
  4. Select the Spark icon, and select Edit from the JOB ACTIONS drop-down.
    The Properties section of the Spark job appears.
  5. Complete the following required fields:
    • Name
      Defines the name of the job that you want to schedule.
      Limits:
       128 alphanumeric characters, plus the special characters commercial at (@), pound (#), dollar sign ($), underscore (_), square brackets ([]), brace brackets ({}), and percent sign (%) as a symbolic variable introducer character.
    • Advanced integration
      Specifies the name of the advanced integration that runs the job.
      The drop-down list displays only the advanced integrations that you have access permission for and that are defined in the CA WA Desktop Client Topology for the specified job type.
    • Spark connection profile
      Specifies the name of the Spark connection profile that you want to use for running Spark applications. Click DETAILS next to this field to view the details of the Spark connection profile.
      • The drop-down list displays only the connection profiles that you have access permission for and that are defined in the Admin view for the specified connection profile type.
      • For more information about creating a Spark connection profile, see Managing Connection Profiles.
    • Spark security profile
      Specifies the name of the user security profile to connect to Spark.
      • The drop-down list displays only the security profiles that you have access permission for and that are defined in the Admin view for the specified security profile type.
      • For more information about creating a Spark security profile, see Managing Security Profiles.
    • Program type
      Specifies the program type of the Spark application that the job runs. Options are: Java, Scala, and Python.
    • Deploy mode
      (Optional) Specifies whether to deploy the Spark driver on the worker nodes (cluster) or locally as an external client. Options are: Client and Cluster.
      This field is required only when the cluster type is YARN, LOCAL, or STANDALONE.
      In standalone mode, set the log4j.rootCategory property to INFO, console in the spark_install_dir/conf/log4j.properties file, where spark_install_dir is the Spark installation directory.
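      For example (illustrative), the relevant line in log4j.properties reads as follows:
      log4j.rootCategory=INFO, console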
    • Path to spark application
      Specifies the path to the Spark application that runs on the cluster. Click the search icon next to this field to browse and select the application from the computer where Spark is installed.
      If the path name contains spaces and special characters, enclose the path name with double quotes.
    • Properties file
      (Optional) Specifies the path to the file that contains Spark configuration properties. Click the search icon next to this field to browse and select the properties file from the computer where Spark is installed.
      If the path name contains spaces and special characters, enclose the path name with double quotes.
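      The file uses the same key-value format as Spark's conf/spark-defaults.conf. A minimal illustrative example (the property values are assumptions):
      spark.executor.memory   512m
      spark.executor.cores    1
      spark.eventLog.enabled  true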
    • Spark application arguments
      (Optional) Specifies the list of arguments you want to pass to the Spark application.
      If the path name contains spaces and special characters, enclose the path name with double quotes.
    • Spark options
      (Optional) Specifies the Spark configuration properties. Select one of the following options from the Key column and specify the corresponding value in the Value column:
      • --driver-cores
      • --total-executor-cores
      • --executor-cores
      • --queue
      • --num-executors
      • --archives
      • --principal
      • --keytab
      • --supervise
      • --class
      • If the path name contains spaces and special characters, enclose the path name with double quotes.
      • If the path name includes percent sign (%) as a symbolic variable introducer character, all the special characters that are specified after % are supported.
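      For example (illustrative), to run the application on a specific YARN queue, select --queue in the Key column and enter the queue name, such as analytics, in the Value column; to request three executors, select --num-executors and enter 3.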
    • Success Exit Code(s)
      Specifies the exit codes that the server uses to identify the job as a success.
      Default:
       0
    • Failure Exit Code(s)
      Specifies the exit codes that the server uses to identify the job as a failure.
      Default:
       1
    • The exit code can be a single exit code, a list of exit codes, or a range of exit codes indicated by a hyphen.
    • To specify multiple exit codes, press Enter after specifying each exit code.
    • If you specify multiple exit codes, enter the most specific codes first followed by the more general ranges.
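    • For example (illustrative), specifying the success exit codes 0 and 3-5 identifies the job as a success when it ends with exit code 0, 3, 4, or 5.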
    • Job success criteria
      (Optional) Defines a regular expression that is used to evaluate a return string. If the return string matches the regular expression, the job completes successfully. Otherwise, the job fails.
      The agent evaluates the job success criteria only when the job completes successfully based on the success exit code.
      • This field only applies to SQL queries that are SELECT statements.
      • Each return string includes the field name from the SELECT statement and its value, separated by an equal sign (=). For example, consider the query SELECT ORD_NUM FROM SALES. To match order number A2976, specify the regular expression ORD_NUM=A2976. Specifying the regular expression A2976 does not match any return string, causing the job to fail. You can also specify the regular expression .*A2976, which matches any return string that ends with A2976.
      • To compose a regular expression, follow the rules for the Java class java.util.regex.Pattern. You can find these rules in the Java API documentation for the Pattern class.
      • Some characters have a special meaning in regular expressions. To use these characters literally, precede them with a backslash (\). For example, to match the characters *.* literally, specify \*\.\* in your regular expression. The backslashes escape the characters' special meanings.
    • Additional parameters
      (Optional) Specifies additional parameters you want to pass to Spark. To specify multiple parameters, separate each parameter with a space.
      If the path name contains spaces and special characters, enclose the path name with double quotes.
      Example:
      --deploy-mode client --num-executors 3 --driver-memory 512m --executor-cores 1
    • Job status refresh
      (Optional) Specifies how often (in seconds) the status of the job is refreshed.
      Limits:
       1-86400
      Default:
       60
  6. (Optional) Click PREVIEW to review the Spark command that the Hadoop agent submits on the Spark computer.
    If the Spark command-specific fields are defined with global variables, symbolic variables, or built-in functions, the preview does not show those values as resolved.
    The preview feature applies only for Spark jobs.
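    For example, for the Additional parameters shown earlier, the previewed command might resemble the following (the application path and argument are illustrative assumptions; the exact command depends on your connection profile and field values):
    spark-submit --deploy-mode client --num-executors 3 --driver-memory 512m --executor-cores 1 /opt/apps/wordcount.py input.txt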
  7. Click SUBMIT.
    The Spark job is defined.