Skip to content

Instantly share code, notes, and snippets.

@bkietz
Last active November 15, 2020 20:56
Show Gist options
  • Save bkietz/f701d32add6f5a2aeed89ce36a443d43 to your computer and use it in GitHub Desktop.
Save bkietz/f701d32add6f5a2aeed89ce36a443d43 to your computer and use it in GitHub Desktop.
A proposal for a KMS interaction API in parquet-cpp

MOVED DOCUMENT

Moved to: https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220

Heavily inluenced by the corresponding parquet-mr design: https://docs.google.com/document/d/1bEu903840yb95k9q2X-BlsYKuXoygE4VnMDl9xz_zhk

Initial PR to port the parquet-mr design to c++: apache/arrow#8023

format:

# name_of_file
- class in file
- class in file

<detail code block>

comments

properties_driven_crypto_factory.h

  • EncryptionConfiguration
  • DecryptionConfiguration
  • PropertiesDrivenCryptoFactory
class PropertiesDrivenCryptoFactory {
 public:
  explicit PropertiesDrivenCryptoFactory(
      std::shared_ptr<KmsClientFactory> kms_client_factory);

  std::shared_ptr<FileEncryptionProperties> GetFileEncryptionProperties(
      const KmsConnectionConfig& kms_connection_config,
      std::shared_ptr<EncryptionConfiguration> encryption_config);

  std::shared_ptr<FileDecryptionProperties> GetFileDecryptionProperties(
      const KmsConnectionConfig& kms_connection_config,
      std::shared_ptr<DecryptionConfiguration> decryption_config);

  void RemoveCacheEntriesForToken(const std::string& access_token);

  void RemoveCacheEntriesForAllTokens();
};

PropertiesDrivenCryptoFactory contains exactly one KmsClientFactory, so it's more idiomatic to require this on construction than have a post-construction setter method and runtime asserts.

KeyMaterialStore will eventually reside in this header and be added as a field of EncryptionConfiguration, but it should be omitted from the initial PR.

properties_driven_crypto_factory.cc

KeyToolkit, FileKeyWrapper, FileKeyUnwrapper should be moved to properties_driven_crypto_factory.cc and not appear in any header since they are internal implemenation details of PropertiesDrivenCryptoFactory. In fact I recommend using the private implemenation pattern in PropertiesDrivenCryptoFactory and inlining the other classes into that private implemenation.

To maintain consistency with the style of parquet-cpp, the above structures should not be explicitly synchronized with individual mutexes. In the case of a parquet::arrow::FileReader, the request to read a given selection of row groups and columns is issued from a single main thread. Properties (including FileDecryptionProperties) are immutable after construction and thus can be safely shared between threads with no further synchronization. Since PropertiesDrivenCryptoFactory is a factory for FileDecryptionProperties, it will not be referenced from multiple threads and internal synchronization is unnecessary.

kms_client.h:

  • KmsConnectionConfig
  • KmsClient
  • KmsClientFactory
struct KmsConnectionConfig {
  std::string kms_instance_id;
  std::string kms_instance_url;
  std::function<std::string()> access_token;
};

Defaults for fields of KmsConnectionConfig are not meaningful; it defers recognition and handling of a "known default value" to implemenations of KmsClientFactory without clarifying what expected behavior when given those defaults might be, so don't provide any.

access_token is now wrapped in std::function; invoking that function returns the current access token. This will be sufficient to capture arbitrary refresh semantics, including synchronization if that's necessary for an implementer. Moreover it relieves the burden on users of KmsConnectionConfig to detect that refresh is necessary.

custom_kms_conf can be introduced in a follow up when demand arises, but it need not be present in the initial PR.

class KmsClientFactory {
 public:
  virtual ~KmsClientFactory() = default;

  virtual std::shared_ptr<KmsClient> CreateKmsClient(
      KmsConnectionConfig kms_connection_config) = 0;
};

Whether or not KmsClients produced by a factory use local wrapping need not be exposed in the KmsClientFactory interface since an implementer of the interface will already be aware of what wrapping methods are used. In a follow up it may become worthwhile to add virtual bool KmsClientFactory::local_wrapping() const and similar properties so that implementers can report characteristics of their classes in code as well as documentation, but it need not be present in the initial PR.

class KmsClient {
 public:
  virtual ~KmsClient() = default;

  virtual ::arrow::util::Future<std::shared_ptr<::arrow::Buffer>> WrapKey(
      const ::arrow::Buffer& key,
      const std::string& master_key_identifier) = 0;

  virtual ::arrow::util::Future<std::shared_ptr<::arrow::Buffer>> UnwrapKey(
      const ::arrow::Buffer& wrapped_key,
      const std::string& master_key_identifier) = 0;
};

For blobs of arbitrary bytes arrow::Buffer is preferred over std::string.

Interaction with a KMS may be slow and this asynchronicity should be reflected in the KmsClient interface. Returning a Future will allow multiple requests to be issued (for example, for all keys necessary to read selected columns of a given file) then collected.

local_wrap_kms_client.h

  • LocalWrapKmsClientFactory
class LocalWrapKmsClientFactory : public KmsClientFactory {
 public:
  LocalWrapKmsClientFactory() = default;

  std::shared_ptr<KmsClient> CreateKmsClient(
      KmsConnectionConfig kms_connection_config) override;
};

The primary point of extension in this API is KmsClientFactory, so that is the interface which should be implemented. LocalKeyWrap and LocalWrapKmsClient do not need to appear in a header file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment