@okumin
Last active October 19, 2023 15:11

Overview

In HIVE-12679, we have been trying to introduce a feature to make IMetaStoreClient pluggable. This document is a summary of the past discussions.

Problem statement

Apache Hive hardcodes the implementation of IMetaStoreClient, assuming it always talks to Hive Metastore (HMS). The vast majority of Hive users have no problem with this because they use HMS as their data catalog. However, some data platforms and their users use alternative services as data catalogs.

  • Amazon EMR provides an option to use AWS Glue Data Catalog
  • Treasure Data deploys Apache Hive integrated with their own in-house data catalog

External data catalogs typically provide protocols or authentication incompatible with HMS. It is costly to implement a Thrift server that is 100% compatible with HMS on top of such a tailored protocol. That's why we would like to have a way to implement an IMetaStoreClient that talks directly to such data catalogs.

Proposed options

Option 1: Add HiveMetaStoreClientFactory to instantiate their own IMetaStoreClient

This is the originally proposed option. The idea is simple. It would introduce a new interface, HiveMetaStoreClientFactory, and a new parameter, metastore.client.factory.class, to specify a concrete implementation of HiveMetaStoreClientFactory.

By default, Hive uses SessionHiveMetaStoreClientFactory, which preserves the current behavior. Users who want to integrate Hive with their own data catalog follow the steps below.

  1. Implement a concrete IMetaStoreClient that talks to their data catalog
  2. Implement a concrete HiveMetaStoreClientFactory that instantiates the custom IMetaStoreClient
  3. Set metastore.client.factory.class to the FQCN of the custom HiveMetaStoreClientFactory

apache/hive#4444 is the latest Pull Request to implement this option.
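
The three steps of Option 1 can be sketched as follows. This is only an illustration under stated assumptions, not the interface from the Pull Request: SimpleMetaStoreClient is a tiny stand-in for IMetaStoreClient, a plain Map stands in for the Hive configuration, and InHouseCatalogClient, MetaStoreClients, and the inhouse.catalog.endpoint key are hypothetical names. Only the factory class name, the method name createMetaStoreClient, and the parameter metastore.client.factory.class come from the discussion.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, tiny stand-in for Hive's IMetaStoreClient (the real interface is much larger).
interface SimpleMetaStoreClient {
    List<String> getAllDatabases();
}

// Assumed shape of the factory; the parameter list is illustrative only.
interface HiveMetaStoreClientFactory {
    SimpleMetaStoreClient createMetaStoreClient(Map<String, String> conf);
}

// Step 1: a client that talks to the custom catalog (stubbed with fixed data here).
class InHouseCatalogClient implements SimpleMetaStoreClient {
    private final String endpoint;

    InHouseCatalogClient(String endpoint) {
        this.endpoint = endpoint;
    }

    String endpoint() {
        return endpoint;
    }

    @Override
    public List<String> getAllDatabases() {
        return List.of("default", "sales");
    }
}

// Step 2: a factory with a no-argument constructor that builds the custom client.
class InHouseCatalogClientFactory implements HiveMetaStoreClientFactory {
    public InHouseCatalogClientFactory() {}

    @Override
    public SimpleMetaStoreClient createMetaStoreClient(Map<String, String> conf) {
        // "inhouse.catalog.endpoint" is a made-up key for this sketch.
        String endpoint = conf.getOrDefault("inhouse.catalog.endpoint", "http://localhost:8080");
        return new InHouseCatalogClient(endpoint);
    }
}

// Step 3, seen from the Hive side: load the configured factory reflectively and delegate to it.
final class MetaStoreClients {
    static SimpleMetaStoreClient create(Map<String, String> conf) {
        String factoryClass = conf.get("metastore.client.factory.class");
        try {
            HiveMetaStoreClientFactory factory = (HiveMetaStoreClientFactory)
                Class.forName(factoryClass).getDeclaredConstructor().newInstance();
            return factory.createMetaStoreClient(conf);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + factoryClass, e);
        }
    }
}
```

A caller would set metastore.client.factory.class to InHouseCatalogClientFactory.class.getName() and call MetaStoreClients.create(conf). A real implementation would also fall back to the default factory when the parameter is unset, which this sketch omits.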

Option 2: Configure IMetaStoreClient and InvocationHandler

This option was proposed in this comment.

With this option, users follow the steps below.

  1. Implement a concrete IMetaStoreClient that talks to another data catalog
  2. Set metastore.client.class to the FQCN of the custom IMetaStoreClient
  3. Implement a concrete InvocationHandler as a dynamic proxy for IMetaStoreClient
  4. Set metastore.client.proxy.class to the FQCN of the custom InvocationHandler

This option assumes we don't want to add an additional factory class. Instead, the parameter in the second step directly accepts the implementation of IMetaStoreClient.
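
A minimal sketch of how the two parameters of Option 2 could be wired together is shown below. It is not the code from the example commit. SimpleMetaStoreClient is a tiny stand-in for IMetaStoreClient, a plain Map stands in for the Hive configuration, and InHouseCatalogClient, PassThroughHandler, and Option2Wiring are hypothetical names. Two conventions are assumptions of this sketch: the handler has a one-argument constructor taking the client, and an unset metastore.client.proxy.class means no proxy is applied.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;
import java.util.Map;

// Hypothetical, tiny stand-in for Hive's IMetaStoreClient.
interface SimpleMetaStoreClient {
    List<String> getAllDatabases();
}

// Steps 1 and 2: the custom client named by metastore.client.class.
class InHouseCatalogClient implements SimpleMetaStoreClient {
    public InHouseCatalogClient() {}

    @Override
    public List<String> getAllDatabases() {
        return List.of("default");
    }
}

// Steps 3 and 4: the handler named by metastore.client.proxy.class.
// It forwards every call unchanged; a real handler would add retries, metrics, and so on.
class PassThroughHandler implements InvocationHandler {
    private final SimpleMetaStoreClient delegate;

    public PassThroughHandler(SimpleMetaStoreClient delegate) {
        this.delegate = delegate;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        try {
            return method.invoke(delegate, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // rethrow the client's own exception
        }
    }
}

final class Option2Wiring {
    static SimpleMetaStoreClient create(Map<String, String> conf) {
        try {
            SimpleMetaStoreClient client = (SimpleMetaStoreClient)
                Class.forName(conf.get("metastore.client.class")).getDeclaredConstructor().newInstance();
            String proxyClass = conf.get("metastore.client.proxy.class");
            if (proxyClass == null) {
                return client; // assumption: unset means "no proxy"
            }
            InvocationHandler handler = (InvocationHandler) Class.forName(proxyClass)
                .getDeclaredConstructor(SimpleMetaStoreClient.class).newInstance(client);
            return (SimpleMetaStoreClient) Proxy.newProxyInstance(
                SimpleMetaStoreClient.class.getClassLoader(),
                new Class<?>[] {SimpleMetaStoreClient.class},
                handler);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create the configured client", e);
        }
    }
}
```

Note that both classes are created reflectively here, so Hive has to fix a constructor convention for each of them.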

The third and fourth steps exist for a somewhat complicated reason. The current official implementation of IMetaStoreClient is SessionHiveMetaStoreClient. However, Hive doesn't use this implementation directly. In the primary path, it wraps SessionHiveMetaStoreClient with RetryingMetaStoreClient.

    if (conf.getBoolVar(ConfVars.METASTORE_FASTPATH)) {
      return new SessionHiveMetaStoreClient(conf, hookLoader, allowEmbedded);
    } else {
      return RetryingMetaStoreClient.getProxy(conf, hookLoader, metaCallTimeMap,
          SessionHiveMetaStoreClient.class.getName(), allowEmbedded);
    }

RetryingMetaStoreClient is a dynamic proxy implementing java.lang.reflect.InvocationHandler that adds behavior such as retries on Thrift-specific errors. RetryingMetaStoreClient is evidently tightly coupled with HMS, so we need a parameter to replace or disable it.
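
For readers unfamiliar with the pattern, the following sketch shows how a single InvocationHandler can add retries to every method of an interface. It is a simplified illustration and not the logic of RetryingMetaStoreClient: SimpleMetaStoreClient, TransientCatalogException, RetryingHandler, and FlakyClient are hypothetical names, and the retry condition (one unchecked exception type, no backoff) is an assumption made for brevity.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;

// Hypothetical, tiny stand-in for Hive's IMetaStoreClient.
interface SimpleMetaStoreClient {
    List<String> getAllDatabases();
}

// Hypothetical marker for errors that are worth retrying.
class TransientCatalogException extends RuntimeException {
    TransientCatalogException(String message) {
        super(message);
    }
}

// One handler covers every interface method, which is the appeal of the dynamic proxy.
class RetryingHandler implements InvocationHandler {
    private final Object delegate;
    private final int maxAttempts;

    RetryingHandler(Object delegate, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        Throwable last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                Throwable cause = e.getCause();
                if (!(cause instanceof TransientCatalogException)) {
                    throw cause; // not retriable: fail immediately
                }
                last = cause; // retriable: try again until attempts run out
            }
        }
        throw last;
    }

    static <T> T wrap(Class<T> iface, T delegate, int maxAttempts) {
        return iface.cast(Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[] {iface},
            new RetryingHandler(delegate, maxAttempts)));
    }
}

// Test double that fails twice before succeeding.
class FlakyClient implements SimpleMetaStoreClient {
    int calls = 0;

    @Override
    public List<String> getAllDatabases() {
        if (++calls < 3) {
            throw new TransientCatalogException("transient failure " + calls);
        }
        return List.of("default");
    }
}
```

Wrapping a FlakyClient with RetryingHandler.wrap(SimpleMetaStoreClient.class, client, 5) makes getAllDatabases succeed on the third underlying call. The handler knows which exceptions are retriable, which is where the coupling to a specific backend comes from.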

https://github.com/okumin/hive/commit/d185af2ca17d1cb7351a1320abf14be167579d2d is an example implementation. Note that it is just a sample, and I have not verified that it works as is.

Overall, I prefer the first option, for several reasons.

The dynamic proxy pattern doesn't seem to be so elegant in most cases

I'm not 100% sure why we introduced RetryingMetaStoreClient as an InvocationHandler. My guess is that we wanted to add retry capability to a large number of methods at once via reflection.

I think most custom clients don't need the reflective convenience of InvocationHandler. For example, if their catalog supports REST APIs and provides a client library, authentication and retries would be handled by the library.

If they really want to use an InvocationHandler to wrap their IMetaStoreClient, they can still do so in their HiveMetaStoreClientFactory, as we do for SessionHiveMetaStoreClientFactory in the PR. In my opinion, IMetaStoreClient shouldn't have a tight dependency on a single InvocationHandler.

The first option is explicitly flexible

This option uses reflection minimally, only to create a HiveMetaStoreClientFactory without any arguments. That is less surprising, and users can write any procedure in HiveMetaStoreClientFactory#createMetaStoreClient. For example, it lets us switch between multiple types of IMetaStoreClient (e.g. choose a V1 or V2 API based on a feature flag) or decorate IMetaStoreClient with multiple proxies (e.g. RetryingMetaStoreClient plus another proxy that measures the total duration per call).
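
To illustrate this flexibility, here is a hedged sketch of a factory that picks a client by feature flag and stacks two proxies on it. Everything in it is hypothetical: SimpleMetaStoreClient stands in for IMetaStoreClient, a plain Map stands in for the Hive configuration, the inhouse.catalog.use-v2 key is made up, and RetryHandler is a simplified stand-in rather than RetryingMetaStoreClient.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, tiny stand-in for Hive's IMetaStoreClient.
interface SimpleMetaStoreClient {
    String apiVersion();
}

class V1CatalogClient implements SimpleMetaStoreClient {
    @Override public String apiVersion() { return "v1"; }
}

class V2CatalogClient implements SimpleMetaStoreClient {
    @Override public String apiVersion() { return "v2"; }
}

// Simplified retry proxy: retries any failure up to maxAttempts times.
class RetryHandler implements InvocationHandler {
    private final Object delegate;
    private final int maxAttempts;

    RetryHandler(Object delegate, int maxAttempts) {
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        for (int attempt = 1; ; attempt++) {
            try {
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                if (attempt >= maxAttempts) throw e.getCause();
            }
        }
    }
}

// Accumulates the total elapsed time per method name, including retries underneath.
class TimingHandler implements InvocationHandler {
    private final Object delegate;
    private final Map<String, Long> totalNanos;

    TimingHandler(Object delegate, Map<String, Long> totalNanos) {
        this.delegate = delegate;
        this.totalNanos = totalNanos;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        long start = System.nanoTime();
        try {
            return method.invoke(delegate, args);
        } catch (InvocationTargetException e) {
            throw e.getCause();
        } finally {
            totalNanos.merge(method.getName(), System.nanoTime() - start, Long::sum);
        }
    }
}

// The body of createMetaStoreClient is ordinary code, so any composition is possible.
class FlexibleClientFactory {
    final Map<String, Long> totalNanos = new ConcurrentHashMap<>();

    public SimpleMetaStoreClient createMetaStoreClient(Map<String, String> conf) {
        boolean useV2 = Boolean.parseBoolean(conf.getOrDefault("inhouse.catalog.use-v2", "false"));
        SimpleMetaStoreClient client = useV2 ? new V2CatalogClient() : new V1CatalogClient();
        SimpleMetaStoreClient retrying = proxy(new RetryHandler(client, 3));
        return proxy(new TimingHandler(retrying, totalNanos));
    }

    private static SimpleMetaStoreClient proxy(InvocationHandler handler) {
        return (SimpleMetaStoreClient) Proxy.newProxyInstance(
            SimpleMetaStoreClient.class.getClassLoader(),
            new Class<?>[] {SimpleMetaStoreClient.class},
            handler);
    }
}
```

With Option 2, the same composition would require a single handler class that implements both concerns, or additional parameters.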

The first option already has actual users

The original owner of HIVE-12679 submitted the first patch in 2016. Unfortunately, the Hive community has not accepted the patch for 7 years.

However, some companies needed to resolve the issue and ported the patch on their own. Such services and their users already depend on metastore.client.factory.class or its alias, hive.metastore.client.factory.class.

If we choose another option, they may have to keep maintaining hive.metastore.client.factory.class even though it is never merged upstream. Otherwise, their users will be confused because they have to change their configurations or deployments when upgrading Hive. On the other hand, Option 2 doesn't have any existing users at this point.

I think most users who want HIVE-12679 have already ported the patch, following Option 1, and would prefer it. We don't know who would potentially be happier with Option 2.

Of course, I understand those companies ported the unmerged patch at their own risk, so they can't stop the community from choosing Option 2 or another approach.

Make the data catalog a backend of HMS using IDataConnectorProvider

Potential users of HIVE-12679 already have their own primary data catalog. So, we assume it is easier to connect to the catalog directly than to make it one of the remote sources of HMS, for the following reasons.

  • They need additional infra
  • HMS might not meet some requirements
  • They need to test scalability, performance, availability, and security of HMS
  • They can't skip testing the same properties of their own catalog anyway, since Hive is not its only client and they need to keep operating it