Skip to content

Instantly share code, notes, and snippets.

@arina-ielchiieva
Created December 6, 2016 12:47
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save arina-ielchiieva/a1c4cfa3890145c5ecb1b70a39cbff55 to your computer and use it in GitHub Desktop.
Save arina-ielchiieva/a1c4cfa3890145c5ecb1b70a39cbff55 to your computer and use it in GitHub Desktop.
Apache Drill: Dynamic UDFs Support

Dynamic UDFs Support

1 Motivation

Currently (starting from Drill version 1.7 and older) adding a new UDF requires two steps [1]:

  1. Coping the UDF jars (source and binary) to every drillbit.
  2. Restarting the drillbits so they read the UDF jars and register all found UDFs.

Jira DRILL-4726 was created to allow user register / unregister UDFs dynamically without drillbits restart.

2 Requirements

  1. User MUST upload UDF jars to well-known staging area.
  2. The API must provide SQL statement to register / unregister UDFs.

2.1 Use Cases

Company may have 100 nodes with drillbits running in a cluster and many users who need to upload UDFs. Current process requires administrators involvement (copy jars to all nodes, restart drillbits). Besides that all running queries will be killed during restart of drillbits. Company wants users to be able to register UDFs on their own without administrators help or restart of drillbits.

A separate Drill-on-YARN project will run Drill as a YARN process. YARN will copy Drill software to a private location on each node. The system admin may start and stop drillbits at any time depending on load. In this environment UDFs must still work. New nodes must have access to UDFs created before the drillbit started.

3 Background and Research

Before writing initial proposal on dynamic UDFs support, Hive UDF support model was taken into consideration [3, 4]. Unfortunately significant differences in UDF management between Hive and Drill prevented to build alike UDF support mechanism, though some key points were taken into account.

Dynamic UDFs registration task was split into two sub-tasks. First sub-task was jars distribution. During investigation phase different implementation approaches were considered, such as asking user to distribute jars manually, using RPC channels to distribute jars between drillbits etc. Though placing jars into some shared location was considered the best solution. As enhancement adding LIST command [5, 6] to monitor such location or even allowing jars uploading though web UI was proposed to be added in future releases.

The second sub-task was UDF registration itself. Since currently Drill is not planning to have metastore to store functions, statistics etc., and since function registry is too large to be kept in Zookeeper, was decided to store in Zookeeper only dynamically loaded UDFs, in other words to create dynamic UDFs registry. Dynamic UDFs registry was designed to keep list of all dynamically loaded UDFs and their jars. So if during query parsing stage or fragment execution UDF was not found in local function registry, drillbit was going check if such UDF existed in dynamic UDFs registry. If yes, then drillbit registered UDF locally and proceeded query parsing or fragment execution. This lazy-init approach was preferred since immediate UDF registration could lead to certain race conditions, clean up phases which could complicate the whole approach and still impose certain risks on leaving drillbits in inconsistent state [2].

Dynamic UDFs unregistration was challenged with class unloading at runtime problem. To prevent any classpath collisions was decided to ship each dynamically loaded jar with its own classloader. It’s was clearly stated that UDFs unloading was only allowed on jar level, build-in UDFs were not allowed to be unloaded at all.

4 Design Overview

As before user MUST prepare Java program in compliance with all rules (implement Drill’s simple or aggregate interface, add drill-module.conf, and prepare two jars - source and binary) [1]. Both jars MUST be copied to a known staging area. After user issues CREATE command jars will be validated, registered in dynamic UDF registry and copied to udf area. Staging and udf areas SHOULD be located on DFS.

Only when drillbits will receive query or query fragment which contains dynamically registered UDF, appropriate jars will be copied to drillbit local udf directory and registered in local function registry.

UDF unregistration will be done on the jar level. After user issues DROP command with jar name, all UDFs associated with the jar will be deleted from dynamic UDF registry, appropriate jars will be deleted from registry area; all drillbits will be signaled to start local unregistration process.

4.1 Configuration file

4.1.1 Required

Configuration file will contain the following defaults required for dynamic UDF support:

 drill.exec.udf: {
      retry-attempts: 5,	
      directory: {
        base: ${drill.exec.zk.root}"/udf",
        local: ${drill.exec.udf.directory.base}"/local",
        staging: ${drill.exec.udf.directory.base}"/staging",
        registry: ${drill.exec.udf.directory.base}"/registry",
        tmp: ${drill.exec.udf.directory.base}"/tmp"
        }
      }

${drill.exec.udf.retry-attempts} by default is set to 5. In order not to override changes in remote function registry done by another user before updating remote function registry, registry version is checked. If registry version has changed, functions are validated among updated registry one more time. If registry update was unsuccessful up to 5 times, error is returned to the user.

${drill.exec.udf.directory.base} is relative directory used to generated all udf directories (local and remote). ${drill.exec.zk.root property} is used to separate udf areas between clusters that use the same file system..

${drill.exec.udf.directory.local} is relative path concatenated to drill temporary directory to indicate local udf directory which is used as temporary for dynamic udf jars. This directory is cleaned up on drillbit exit. Drill temporary directory is set using environment variable $DRILL_TMP_DIR, if not set then using ${drill.tmp-dir}. If neither of them are set, generates random temporary directory by means of Guava, such directory will be deleted on drillbit exit. $DRILL_TMP_DIR is usually set in start up shell / cmd scripts. ${drill.tmp-dir} is not set by default, user may override it to have custom temporary directory. If ${drill.tmp-dir} is overridden, it will take precedence over $DRILL_TMP_DIR. Udf remote directories is relative path in the default file system connection. In cluster configuration default file system MUST be DFS. On startup Drill will check if staging, udf and tmp areas exist, if not, they will be created. Drill startup will fail if unable to create these directories or they are not writable.

4.1.2 Optional

${drill.exec.udf.directory.fs} - initially default file system is used. If user wants to change it to any other, he can modify this property. In multi-drillbit cluster DFS is a MUST.

${drill.exec.udf.directory.root} - user may can this property if needs custom root for remote udf areas. If this property is not set, user home directory will be used. For example: for Linux - /home/some_user, DFS - /user/some_user, Windows - /C:/User/some_user

4.2 Jars location

Staging, registry and tmp areas are used to store UDFs jars. If user is using more than one drillbit in cluster, these areas MUST be located on DFS.

4.2.1 Staging area

Staging area is location where user copies binary and source jars before registering UDF. Path to staging area will be set in configuration file. Upon successful registration both jars will be deleted from staging area. Upon failure jars will remain in staging area.

4.2.2 Registry area

Registry area is location where Drill copies jars after successful validation. Path to udf area will be set in configuration file. Drillbit copies jars from registry area to each drillbit local file system (local udf area) upon request issued during lazy-init stage. During UDFs unregistration, appropriate jars will be deleted from udf area. User MUST NOT delete jars from registry area to avoid inconsistencies in dynamic UDF registry where path to jars are stored.

4.2.3 Tmp area

Tmp area is location where Drill backups binary and source before starting registration process. Each binary and source will be placed in unique folder. In the end of registration both jars will be deleted from this area.

4.3 UDF registration

Registration will be triggered by the end user command. The command will have one option: CREATE - returns failure if jar with the same name already exists in udf area or if there are conflicting UDFs loaded from other jars. Upon successful registration a list of all the UDFs found in the jar will be returned to the user.

Output example:

    +---------------+-----------------------------------------------------------------------------------------------------------+
    | ok  	      | summary       	        	                                                                           |
    +---------------+-----------------------------------------------------------------------------------------------------------+
    | true          | The following UDFs in jar %s have been registered: %s                                                     |
    +---------------+-----------------------------------------------------------------------------------------------------------+

If jar doesn’t contain any UDFs, user will be notified with appropriate error message.

4.4 UDF unregistration

Unregistration will be triggered by the end user command. The command will have one option:

  1. DROP - unregisters all UDFs according to jar name and removes jars from udf area. Upon successful unregistration, list of unregistered UDFs will be returned to the user.

Output example:

    +---------------+-----------------------------------------------------------------------------------------------------------+
    | ok  	      | summary       	        	                                                                           |
    +---------------+-----------------------------------------------------------------------------------------------------------+
    | true          | The following UDFs in jar %s have been unregistered: %s                                                   |
    +---------------+-----------------------------------------------------------------------------------------------------------+

4.5 Enable / disable dynamic UDFs support

Dynamic UDFs support may impose certain security issues. Not all users SHOULD be allowed to register / unregister UDFs. Since Drill currently doesn’t have full authorization / authentication support, user must have a choice to allow or not dynamic UDFs support. To achieve this system option exec.udf.enable_dynamic_support will be added. By default dynamic UDFs support will be enabled.

4.6 Future enhancements

4.6.1 Upload jars

Jars upload process can be simplified so user wouldn’t have to know staging area actual path. This can be achieved by adding sqlline support to do the jar upload process (ex: !addjar sqlline command) or allowing upload through Web UI.

####4.6.2 List LIST command will have one option:

  1. LIST JARS – lists all jars registered in dynamic UDF registry.

4.6.3 Show

Show command will have 3 options:

SHOW FUNCTIONS [regular_expression] - lists all the user defined and builtin functions. Regular expression can be added to match certain functions.
SHOW UDFS - lists all the user defined functions.
SHOW UDFS IN JAR jar_name - lists all UDFs registered in dynamic UDF registry associated with indicated jar.

5 Implementation Details

5.1 Algorithms (and Data Structures)

5.1.1 Parse client command

  1. CREATE FUNCTION USING JAR ‘jar_name’:
    a. Registers UDFs using jar name in dynamic UDF registry.
    b. Copies source and binary jars to udf area.
  2. DROP FUNCTION USING JAR ‘jar_name’:
    a. Unregisters UDFs according to the jar name from dynamic UDF registry.
    b. Removes source and binary jars from udf area.
    c. Signals all drillbits to unregister UDFs according to the jar name from local function registry.

5.1.2 Zookeeper

First drillbit in cluster at startup creates three persistent znodes. Path is relative to the namespace zookeeper connection was established. Usually it is pre-defined by drill.exec.zk.root property.

  1. udf/registry - UDF registry znode where list of all dynamically UDFs and their jars will be stored. Drillbits will refer to this list of UDFs when they receive request to register udf or when they cannot find udf in local function registry.
  2. udf/unregister - UDF unregistration znode. Each drillbit will listen to changes on this znode. Wherever child (jar name) will be created, each drillbit will start local unregistration process.
  3. udf/jars - UDF jars znode. Before starting registration or unregistration process, ephemeral child path with jar name is created under jars znode to prevent registration or unregistration of the jar with the same name at the same time. If such path exists registration will fail indicating what action is being performed with indicated jar.

5.1.3 Registration

When foreman receives request to create UDFs, it:

  1. Creates child path with jar name under jars area.
  2. Checks that both source and binary jars are present in staging area.
  3. Backups jars to temporary area directory.
  4. Validates against local function registry:
    a. Copies binary jar to local temporary directory on local file system.
    b. Scans jar using temporary classloader.
    c. Checks if there are any duplicates in local function registry.
    d. Returns list of UDFs to be registered.
  5. Validates against dynamic UDF registry:
    a. Receives list of UDFs and theirs jars from dynamic function registry.
    b. Checks if there are no duplicates either by jar name or among UDFs.
  6. Finishes UDFs registration:
    a. Copies jars from DFS temporary location to DFS udf area.
    b. Updates dynamic UDF registry with list of newly registered UDFs if registry version has not changed. If yes, then repeats step 5 till update is complete. Number of repeats is controlled by configuration option.
  7. Performs cleanup:
    a. Removes jars from staging area.
    b. Removes jar from temporary directory on local file system.
    c. Removes jars from temporary area.
    d. Deletes child path with jar name from jars area.
  8. Returns list of registered UDFs to the user.

Registration can fail immediately and return appropriate error to the user if:

  1. Jar name path is present under jars area.
  2. Source or binary jar or both are missing in staging area. Both should have standard naming convention (ex: DrillUDF-1.0.jar, DrillUDF-1.0-sources.jar).
  3. Duplicates are found in local function registry.
  4. Duplicates are found in dynamic UDF registry.

5.1.4 Unregistration

Each dynamically loaded jar will have its own classloader to avoid complications during loading and unloading classes with the same name. When foreman receives DROP command, it:

  1. Creates child path with jar name under jars area.
  2. Sends unregistration signal to all drillbits:
    a. Creates ephemeral znode with jar name under UDF unregistration znode.
    b. Removes ephemeral znode /udf/unregister/jar_name.
  3. Checks if such jar is present in dynamic UDFs registry.
  4. Removes UDFs associated with the jar from dynamic UDFs registry.
  5. Updates registry if registry version has not changed. If yes, then repeats step 5 till update is complete. Number of repeats is controlled by configuration option.
  6. Removes jars from udf area.
  7. Removes child path with jar name under jars area.
  8. Returns list of unregistered UDFs to the user.

When drillbit is notified that child znode was created under UDF unregistration znode, it:

  1. Unregisters all UDFs associated with jar name from local function registry.
  2. Removes jars from local udf directory.

5.1.5 Lazy-initialization

During query parsing stage or fragment execution, if UDF is not present in local function registry, drillbit:

  1. Gets list of registered jars from remote function registry.
  2. Gets list of registered jars from local function registry.
  3. If there are any missing jars in local function registry, starts local registration:
    a. Copies source and binary jars to local udf directory from udf area.
    b. Registers UDFs present in jar in local function registry.
  4. If there were any missing jars, repeats parsing stage or fragment execution.

Lazy-initialization is enabled even if dynamic UDFs support is disabled to cover the case when user has registered some functions and then disabled dynamic UDFs support but still expects these functions will be available.

5.1.6 Drillbit startup / restart

On startup drillbit creates local udf directory if it does not exist, otherwise cleans up it. On drillbit exit local udf directory is cleaned up. On startup drillbit creates three udf areas (registry, staging, tmp) if they do not exist. Listener is set on UDF unregistration znode to start local unregistration process when new child znode is created. Child znode name and content MUST correspond to binary jar name.

5.2 APIs and Protocols

TBA

5.3 Performance

Queries with dynamically registered UDFs will be start-up slower on those drillbits where these UDFs haven’t been uploaded yet. Before executing query or fragment, drillbit will need to copy missing jars from udf area to local file system udf directory and register UDFs present in jars in local function registry.

5.4 Error and Failure handling

During UDFs registration in case of failure the following error messages can be returned:

  1. Jar with %s name is used. Action: REGISTRATION / UNREGISTRATION.
  2. Binary [%s] or source [%s] is absent in udf staging area [%s].
  3. Jar with %s name has been already registered
  4. Jar %s does not contain functions
  5. Found duplicated function in %s - %s.
  6. Failed to update remote function registry. Exceeded retry attempts limit.

During UDFs unregistration in case of failure the following error messages can be returned:

  1. Jar with %s name is used. Action: REGISTRATION / UNREGISTRATION.
  2. Jar %s is not registered in remote registry.
  3. Failed to update remote function registry. Exceeded retry attempts limit.

5.5 Deployment

Migration from dynamically loaded UDFs to build-ins can keep dynamic UDF registry smaller, as well as reduce occupied size of udf area. Though when dynamically loaded UDFs migrate to build-ins, user won’t be able to unregister them using DROP command. To perform such movement, user needs to do the following steps:

  1. Stop all drillbits.
  2. Move jars from udf registry area to each drillbits $DRILL_SITE/jars area (MUST be included in classpath).
  3. Remove remote function registry from zookeeper.
  4. Start all drillbits.

If user want to only certain jars to built-ins functions, user needs to do the following steps:

  1. Copy (not move) certain jars from udf registry area to each drillbits $DRILL_SITE/jars area (MUST be included in classpath).
  2. Execute drop command for each jar.
  3. Stop all drillbits.
  4. Start all drillbits.

5.6 Memory management

NA

5.7 Availability Implications

NA

5.8 Scalability Issues

NA

5.9 Backward Compatibility

Since dynamic UDFs support is a new feature, there will be no backward compatibility issues. All UDFs loaded using manual approach [1] can be loaded using dynamically and vice versa.

5.10 Security and Authentication impact

Before issuing CREATE command, user MUST copy source and binary jars to staging area. Access to this area can be restricted. If user won’t be allowed to copy jars in this area he won’t be able to dynamically register UDFs, as well.

Currently any user can create UDFs. Since Drill currently doesn’t have full authorization / authentication support, user needs to have a choice whether to allow or not dynamic UDFs support. This can be controlled by system option (exec.udf.enable_dynamic_support).

No matter under which user jars were copied to the staging area, subsequent movement of jar to remote or local registry areas will be done under the user who has started the drillbit even if impersonation support is enabled.

5.11 UI changes

NA

5.12 Options and metrics

New system option exec.udf.enable_dynamic_support will be added so user can chose whether to enable dynamic UDFs support or not. By default dynamic UDFs support will be enabled.

5.13 Debugging

NA

5.14 Testing implications

  1. Verify that after CREATE command execution, queries using new UDFs succeed.
  2. Verify that user receives appropriate errors when tries to register duplicated UDFs / jars.
  3. Verify that DROP command unregisters UDFs on all drillbits.
  4. Verify the behavior with concurrent execution of CREATE / DROP commands.
  5. Verify the behavior during drillbit startup / restart.
  6. Verify the lazy-init behaviour.

5.15 Tradeoffs and Limitations

When user runs DROP command, some queries that are using unregistered UDFs may fail during fragment execution if by that time UDF has been unregistered from drillbit local registry. Ideally, before submitting DROP command user SHOULD make sure, no one is running queries using UDFs that are going to be unregistered, otherwise the following error message can be returned - "No match found for function signature custom_upper()".

DROP command operates only on jar level name. User cannot unregister only one UDF from jar where several UDFs are present. To avoid this situation user MAY create one UDF per jar.

All udf ares (remote or local) are created are created at drillbit startup even if dynamic UDFs support is disabled. Drillbit startup will fail, if user who started the drillbit does not have write access to these directories. As future enhancement to it can be considered to create udf directories only when dynamic UDFs support is enabled but this imposes additional responsibility to check udfs areas existance at runtime which can be cumbersome. In particular, what we should do with running drillbit if after enabling dynamic UDFs support, user who started the drillbit can not create udf directories.

6 Implementation Plan

  1. Add parse commands.
  2. Integration with Zookeeper.
  3. Registration process.
  4. Unregistration process.
  5. Lazy-initialization integration.
  6. Add unit tests.

7 Open items

8 Notes

9 References

[1] https://drill.apache.org/docs/develop-custom-functions/
[2] Design document V0
[3] https://cwiki.apache.org/confluence/display/Hive/HivePlugins#HivePlugins-CreatingCustomUDFs
[4] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction
[5] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveResources
[6] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Commands

10 Document History

Date Author Version Description
2016-07-29 Arina Ielchiieva 0.1 Initial Draft
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment