In HIVE-12679, we have been trying to introduce a feature to make IMetaStoreClient
pluggable. This document is a summary of the past discussions.
Apache Hive hardcodes the implementation of IMetaStoreClient, assuming it always talks to Hive Metastore (HMS). Most Hive users have no problem with this because they use HMS as their data catalog. However, some data platforms and their users use alternative services as data catalogs.
- Amazon EMR provides an option to use AWS Glue Data Catalog
- Treasure Data deploys Apache Hive integrated with their own in-house data catalog
External data catalogs typically provide protocols or authentication mechanisms that are incompatible with HMS. It is costly to implement a Thrift server that is 100% compatible with HMS on top of such a tailored protocol. That's why we would like to have a way to implement an IMetaStoreClient that talks directly to such data catalogs.
This is the originally proposed option. The idea is simple. It would introduce a new interface, HiveMetaStoreClientFactory, and a new parameter, metastore.client.factory.class, to specify a concrete implementation of HiveMetaStoreClientFactory.
By default, Hive uses SessionHiveMetaStoreClientFactory, which keeps the current behavior. Users follow the steps below if they want to integrate Hive with their own data catalogs.
- Implement a concrete IMetaStoreClient to talk to their data catalog
- Implement a concrete HiveMetaStoreClientFactory to instantiate the custom IMetaStoreClient
- Set metastore.client.factory.class to the FQCN of the custom HiveMetaStoreClientFactory
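The three steps above can be sketched as follows. This is a minimal, self-contained illustration and not the code in apache/hive#4444: CatalogClient stands in for IMetaStoreClient, ClientFactory stands in for HiveMetaStoreClientFactory, and the create(Map) signature and the FactoryLoader helper are assumptions made for the example. Only the parameter name metastore.client.factory.class comes from the proposal.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for IMetaStoreClient; the real interface has many more methods.
interface CatalogClient {
    List<String> getAllDatabases();
}

// Hypothetical stand-in for HiveMetaStoreClientFactory; the signature is assumed.
interface ClientFactory {
    CatalogClient create(Map<String, String> conf);
}

// Step 1: a client that talks to a custom data catalog (stubbed with fixed data here).
class InHouseCatalogClient implements CatalogClient {
    @Override
    public List<String> getAllDatabases() {
        return Arrays.asList("default", "sales");
    }
}

// Step 2: a factory that instantiates the custom client.
class InHouseClientFactory implements ClientFactory {
    @Override
    public CatalogClient create(Map<String, String> conf) {
        return new InHouseCatalogClient();
    }
}

// Step 3, seen from the caller's side: resolve the factory from configuration.
class FactoryLoader {
    static final String FACTORY_KEY = "metastore.client.factory.class";

    static CatalogClient load(Map<String, String> conf) {
        String fqcn = conf.get(FACTORY_KEY);
        if (fqcn == null) {
            throw new IllegalArgumentException(FACTORY_KEY + " is not set");
        }
        try {
            // Instantiate the factory through its no-argument constructor.
            ClientFactory factory = Class.forName(fqcn)
                .asSubclass(ClientFactory.class)
                .getDeclaredConstructor()
                .newInstance();
            return factory.create(conf);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + fqcn, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf =
            Collections.singletonMap(FACTORY_KEY, InHouseClientFactory.class.getName());
        System.out.println(load(conf).getAllDatabases());
    }
}
```

The factory indirection means Hive itself only needs to know one interface, while everything about how the client is constructed stays in user code.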
apache/hive#4444 is the latest Pull Request to implement this option.
This option was proposed in this comment.
In this option, users will follow the steps below.
- Implement a concrete IMetaStoreClient to talk to another data catalog
- Set metastore.client.class to the FQCN of the custom IMetaStoreClient
- Implement a concrete InvocationHandler as a dynamic proxy for IMetaStoreClient
- Set metastore.client.proxy.class to the FQCN of the custom InvocationHandler
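A minimal sketch of how these two parameters could be resolved is shown below. It is not taken from the sample commit: MetaClient stands in for IMetaStoreClient, ClientLoader is a hypothetical helper, and the constructor conventions (a no-argument client constructor, a handler constructor that takes the client) are assumptions. Treating an unset metastore.client.proxy.class as "no proxy" is also a choice made for this example; the proposal may pick a different default.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for IMetaStoreClient.
interface MetaClient {
    String getTableLocation(String db, String table);
}

// Step 1: a custom client for another data catalog (stubbed here).
class CustomCatalogClient implements MetaClient {
    @Override
    public String getTableLocation(String db, String table) {
        return "catalog://" + db + "/" + table;
    }
}

// Step 3: a custom dynamic proxy. This one only forwards calls to the delegate.
class PassThroughHandler implements InvocationHandler {
    private final MetaClient delegate;

    PassThroughHandler(MetaClient delegate) {
        this.delegate = delegate;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        try {
            return method.invoke(delegate, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // surface the client's own exception
        }
    }
}

// Steps 2 and 4, seen from the caller's side.
class ClientLoader {
    static final String CLIENT_KEY = "metastore.client.class";
    static final String PROXY_KEY = "metastore.client.proxy.class";

    static MetaClient load(Map<String, String> conf) {
        String clientClass = Objects.requireNonNull(conf.get(CLIENT_KEY), CLIENT_KEY);
        try {
            MetaClient client = Class.forName(clientClass)
                .asSubclass(MetaClient.class)
                .getDeclaredConstructor()
                .newInstance();
            String proxyClass = conf.get(PROXY_KEY);
            if (proxyClass == null) {
                return client; // assumption: no proxy configured means no wrapping
            }
            InvocationHandler handler = Class.forName(proxyClass)
                .asSubclass(InvocationHandler.class)
                .getDeclaredConstructor(MetaClient.class)
                .newInstance(client);
            return (MetaClient) Proxy.newProxyInstance(
                MetaClient.class.getClassLoader(),
                new Class<?>[] {MetaClient.class},
                handler);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate metastore client", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CLIENT_KEY, CustomCatalogClient.class.getName());
        conf.put(PROXY_KEY, PassThroughHandler.class.getName());
        System.out.println(load(conf).getTableLocation("db1", "t1"));
    }
}
```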
This option assumes we don't want to add an additional factory class. Instead, the parameter in the second step directly accepts the implementation of IMetaStoreClient.
The third and fourth steps exist for a somewhat complicated reason. The current official implementation of IMetaStoreClient is SessionHiveMetaStoreClient. However, Hive doesn't use this implementation directly. In the primary path, it wraps SessionHiveMetaStoreClient with RetryingMetaStoreClient.
if (conf.getBoolVar(ConfVars.METASTORE_FASTPATH)) {
  return new SessionHiveMetaStoreClient(conf, hookLoader, allowEmbedded);
} else {
  return RetryingMetaStoreClient.getProxy(conf, hookLoader, metaCallTimeMap,
      SessionHiveMetaStoreClient.class.getName(), allowEmbedded);
}
RetryingMetaStoreClient is a dynamic proxy implementing java.lang.reflect.InvocationHandler to add behavior such as retries on Thrift-specific errors. RetryingMetaStoreClient is therefore tightly coupled with HMS, and we need a parameter to replace or disable it.
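To illustrate why such a proxy ends up tied to one backend, here is a simplified sketch of a retrying InvocationHandler. It is not the actual RetryingMetaStoreClient: TableClient stands in for IMetaStoreClient, TransportError stands in for a Thrift-specific error, and the fixed attempt count is an assumption. The point is that the retry condition is written against one backend's error type, so a client for another catalog would need a different handler or none at all.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Hypothetical stand-in for a Thrift-specific transport error.
class TransportError extends RuntimeException {
    TransportError(String message) {
        super(message);
    }
}

// Hypothetical stand-in for IMetaStoreClient.
interface TableClient {
    String getTable(String name);
}

// A client whose first call fails, used to exercise the retry path.
class FlakyClient implements TableClient {
    int attempts = 0;

    @Override
    public String getTable(String name) {
        attempts++;
        if (attempts == 1) {
            throw new TransportError("connection reset");
        }
        return "table:" + name;
    }
}

class RetryingHandler implements InvocationHandler {
    private final Object delegate;
    private final int maxAttempts;

    RetryingHandler(Object delegate, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        TransportError last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                Throwable cause = e.getCause();
                if (!(cause instanceof TransportError)) {
                    throw cause; // only the backend-specific error is retried
                }
                last = (TransportError) cause;
            }
        }
        throw last; // all attempts failed with the retriable error
    }

    // Wrap a client so that every interface call goes through the retry loop.
    static TableClient wrap(TableClient client, int maxAttempts) {
        return (TableClient) Proxy.newProxyInstance(
            TableClient.class.getClassLoader(),
            new Class<?>[] {TableClient.class},
            new RetryingHandler(client, maxAttempts));
    }

    public static void main(String[] args) {
        TableClient client = wrap(new FlakyClient(), 3);
        System.out.println(client.getTable("t1"));
    }
}
```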
https://github.com/okumin/hive/commit/d185af2ca17d1cb7351a1320abf14be167579d2d is an example implementation. Note that it is just a sample, and I have not verified it works as it is.