Hydra configuration loader for Kedro

⚠️ 2021-03-07: WIP / Unmaintained. This project is a proof of concept and is not recommended for general use.

Tested with Kedro 0.17.1 and Hydra 1.0.6.

Author: Martin Sotir



About Kedro

Kedro is a lightweight library and project template designed for fast collaborative prototyping of data-driven pipelines. Kedro formalizes project configuration (parameters, credentials, data catalog), pipeline definition (from Python functions) and execution, and provides an extensible, Makefile-like command-line entry point.

Kedro can be used from a notebook, using a KedroContext object to access the project configuration and data catalog, or from the command line, to run registered pipelines (each defined as a list of functions with specified inputs, outputs, and parameters).

Kedro encodes data science and data pipeline good practices as code in the most minimal way. Each feature is kept as simple and barebones as possible. Because of this "skeletal" shape, Kedro will not fit every research and production need, but it is very easy to appropriate and extend to your needs. For example, the Kedro pipeline feature is far less feature-rich than most production-ready solutions (Prefect, Dagster, Airflow, etc.), but as long as the core principles are respected (the separation of concerns between data loading, processing, and pipeline execution; the data engineering conventions; etc.), adding features or transitioning to another technology will not be a headache.

For these reasons, I think Kedro is a good first step before building a custom in-house ML system or switching to a more production-ready solution with all the bells and whistles (e.g., MLRun, Metaflow, Kubeflow, or any combination of the booming MLOps tool ecosystem). I feel that Kedro is particularly suited for R&D teams working on many industrial ML pipeline prototypes (e.g., in R&D service firms) that aim to reuse code, disseminate data science best practices, and facilitate collaboration with product development teams, while staying as agnostic as possible about the technology used in the end.

Configuration

A typical Kedro project configuration looks like this:

── conf
    ├── base
    │   ├── catalog.yml
    │   ├── logging.yml
    │   ├── parameters.yml
    │   └── experiment1
    │       └── parameters.yml
    └── local
        ├── catalog.yml
        ├── credentials.yml
        └── parameters.yml
  • Any number of environments can be created (though only two environments are loaded at the same time: base + the active environment).

  • Configuration files are loaded by a ConfigLoader class with a simple API: the configuration loader is first given a list of root configuration paths, usually the base and the currently active environment (local by default). Then the ConfigLoader has a single get(*patterns: List[str]) method, where patterns are used to match configuration files within the root directories (see the sketch after this list).

  • The minimal configuration structure is enforced by the KedroContext class. This class loads the catalog, credentials, logging, and parameters configuration files using semi-flexible file patterns. For instance, when loading parameters, the KedroContext class uses conf_loader.get("parameters*", "parameters*/**", "**/parameters*"): this matches parameters.yml but also parameters.json, experiment1/parameters.yml, parameters/model1.toml, etc. Files matched in a get query are merged: duplicated keys in the same environment raise an error, and keys in the active environment take precedence over keys in the base environment. (Note: directory paths are not taken into account when loading the configuration; every configuration file is loaded as root-level config.)

  • The catalog, credentials, logging, and parameters configurations are not merged; they are loaded at different times and often interpreted separately. The logging configuration is loaded when a KedroSession starts, the catalog and credentials configurations are loaded when the dataset catalog is required (to run a pipeline), and parameters are provided to pipeline nodes on demand (each node specifies which part of the parameter configuration it needs).

  • Usually, pipeline nodes do not have direct access to information in the catalog and credentials configuration: this enforces a "separation of concerns" between nodes (concern: how to transform inputs into outputs) and catalog "dataset" entries (concern: how and where to load/write data).

  • The configuration can be extended: Kedro applications and plugins have access to the ConfigLoader to load additional files. For instance, the kedro-mlflow plugin loads the MLflow configuration with conf_loader.get("mlflow*").

  • The Kedro config loader is based on anyconfig and can read .yaml, .json, .ini, .toml, .xml (and more) configuration files. Kedro also provides a TemplatedConfigLoader, extending the ConfigLoader with Jinja templating syntax (string interpolation, loops, etc.).

  • Parameters can be overridden from the command line when running a pipeline with the kedro run --params key=value syntax, or by passing a dict to the extra_params argument of session.create(...) (e.g., from a notebook).

  • Kedro project settings are not defined in the main conf directory: they live in the project's Python code (e.g., settings.py).
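As a minimal sketch of this API (the paths and patterns below are illustrative, following the default project layout):

from kedro.config import ConfigLoader

# Base environment + active environment (illustrative paths):
conf_loader = ConfigLoader(conf_paths=["conf/base", "conf/local"])

# Every file matching one of the patterns is loaded and merged into a single dict:
parameters = conf_loader.get("parameters*", "parameters*/**", "**/parameters*")
catalog = conf_loader.get("catalog*", "catalog*/**")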

Switching between parameter sets

In Kedro, there is no dedicated feature to easily switch between configuration options for "sub-modules" of the configuration (e.g., to switch between several predefined model or optimizer parameter sets). There are, however, several non-straightforward ways to achieve such a mechanism.

First, we can leverage Kedro's configuration environment feature. However, environments affect the whole configuration, and there is no support for nested/hierarchical environments in Kedro.

A more generic approach is usually to define mutually exclusive parameter sets in a dictionary or a list, in the same YAML file or spread across several YAML files (taking care not to create duplicated keys). An option is then selected dynamically at runtime, depending on another parameter from the configuration or on command-line arguments (handled by the kedro run command).

Example:

# File: parameters.yml
model: xgboost  # default, can be overridden by the --params argument of `kedro run`

# File: parameters/models/xgboost.yml
xgboost:
  n_estimators: 10000
  learning_rate: 0.01
  max_depth: 6

# File: parameters/models/randomforest.yml
randomforest:
  n_estimators: 200
  max_depth: 8
  max_features: "sqrt"

When instantiating a Pipeline (or within a notebook):

from kedro.framework.session import get_current_session
from kedro.pipeline import Pipeline, node

def make_pipeline():

    # Read the selected model name from the project parameters:
    ctx = get_current_session().load_context()
    model = ctx.params['model']

    return Pipeline([
        node(train,
             inputs={'dataset': 'train_set', 'model_config': f"params:{model}"},
             outputs=['report'])
    ])

def train(dataset, model_config):
    ...

(The params:<param_key> syntax is used to pass parameters to Kedro pipeline nodes.)

Another option would be to take advantage of the Jinja templating capabilities of the TemplatedConfigLoader.


About Hydra

With a powerful configuration loader and auto-generated program entry point, Hydra focuses on increasing researcher productivity while encouraging experiment reproducibility.

Hydra seems targeted at researchers, especially ML researchers, who may run thousands of trials for the same task, adjusting a set of hyperparameters each time. Hydra provides a powerful hierarchical configuration loader and automatically generates application entry points that allow scientists to override and discover hyperparameters.

Like Kedro, Hydra is extensible and offers functionality beyond configuration and entry points: remote runners, logging utilities, bridges to hyperparameter search tools, etc.

The scope of Hydra is narrower and quite different from Kedro's: Hydra has no notion of a data catalog, no pipelines, no credentials management, no imposed folder structure. This makes Hydra more flexible and less cumbersome than Kedro for projects where those features are not needed. Hydra seems an excellent tool for experimenting and developing ML models in 'isolation', e.g., for a specialized benchmark with stable and well-defined input/output data.

Hydra configuration (WIP)

Hydra configurations are hierarchical, with a single root configuration file for each application/entry point (config.yaml in the example below):

── conf
   ├── config.yaml
   ├── db
   │   ├── mysql.yaml
   │   └── postgresql.yaml
   ├── schema
   │   ├── school.yaml
   │   ├── support.yaml
   │   └── warehouse.yaml
   └── ui
       ├── full.yaml
       └── view.yaml

The Hydra configuration loader is built on OmegaConf (which shares its author with Hydra). OmegaConf extends the YAML format with conventions and features designed to simplify the definition of complex software configurations:

  • By default, OmegaConf loads YAML configurations into Python built-in types (dicts and lists), but it also provides experimental support for "structured" configs, where parts of the configuration are instantiated with user-defined Python classes.

  • OmegaConf provides variable references and value interpolation features (having a parameter value, or part of a parameter value, depend on another parameter value). These are very useful to avoid repeating values while keeping the configuration well structured (for instance, in deep learning, parameters like the batch size, number of epochs, and learning rate are often used by multiple entities: the training loop, optimizer, LR scheduler, etc.). String value interpolation is also useful to build experiment and run names from configuration parameters without any extra logic in the code. OmegaConf string interpolation can also retrieve system environment variables (see the sketch after this list).

  • Mandatory values (???) and read-only configuration nodes.
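A small sketch of these interpolation features (the key names and values here are made up for illustration):

import os
from omegaconf import OmegaConf

os.environ.setdefault("DATA_DIR", "/tmp/data")  # so the env interpolation below resolves

conf = OmegaConf.create("""
training:
  batch_size: 64
  epochs: 10
run_name: bs${training.batch_size}_ep${training.epochs}  # string interpolation
data_dir: ${env:DATA_DIR}                                 # environment variable lookup
""")

print(conf.run_name)  # -> bs64_ep10
print(conf.data_dir)  # -> /tmp/data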

On top of OmegaConf, Hydra adds the ability to compose a configuration from multiple sources (leveraging the merge utilities in OmegaConf):

  • Directories as mutually exclusive parameter sets (config groups)

  • Defaults list

  • Scopes

  • Load configuration in entry points, with experimental support for loading configuration programmatically (e.g., from a notebook)

  • A configurable syntax to override parameters from the command line (+ autocompletion)

  • Hydra settings, job configuration

  • Ability to generate parameter lists, search spaces

Switching between parameter sets in Hydra (WIP)
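A minimal sketch of the underlying mechanism (directory and value names are illustrative): assuming conf/model/xgboost.yaml and conf/model/randomforest.yaml each start with a # @package _group_ directive, and conf/config.yaml declares defaults: [model: xgboost], the active set can be switched with an override:

from pathlib import Path
from hydra.experimental import compose, initialize_config_dir

with initialize_config_dir(config_dir=str(Path("conf").absolute()), job_name="demo"):
    # Select the parameter set from the overrides list (or from the command line):
    cfg = compose(config_name="config", overrides=["model=randomforest"])

print(cfg.model)  # contents of conf/model/randomforest.yaml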


Bringing Hydra configuration to Kedro projects

Motivation

Beyond its capabilities, Hydra is interesting because of its attractiveness to researchers, whereas Kedro can be quite frustrating because of its imposed (but well-meaning) structure and limitations.

When experiments grow and get closer to industrial applications, when models become pipelines, it becomes useful to organise the configuration in a way that keeps complexity manageable.

In theory, bringing Hydra configuration to Kedro could:

  1. Increase Kedro usability for ML researchers.

  2. Facilitate the transition from Hydra projects to Kedro.

  3. Bring additional features to Kedro projects: an extended syntax for parameter overrides and parameter auto-completion.

Experiment 1: minimally invasive Hydra configuration loader

Approach

Override the ConfigLoader _load_config_file method so that each configuration file is loaded through the Hydra compose API:

from pathlib import Path
from typing import Any, Dict

from kedro.config import ConfigLoader

class HydraConfigLoaderMinimal(ConfigLoader):

    resolve_interpolation = True  # apply OmegaConf interpolation when loading

    def _load_config_file(self, config_file: Path) -> Dict[str, Any]:
        from hydra.experimental import initialize_config_dir, compose
        from omegaconf import OmegaConf

        with initialize_config_dir(config_dir=str(config_file.parent), job_name="app"):
            conf = compose(config_name=config_file.name, overrides=[])
            resolved_conf = OmegaConf.to_container(conf, resolve=self.resolve_interpolation)
            # Drop keys with a '_' prefix (Kedro ConfigLoader convention):
            return {k: v for k, v in resolved_conf.items() if not k.startswith("_")}

This code works but has been simplified; check the recommended implementation here: hydra_config_loader_minimal.py (reproduced at the end of this gist).

Usage

A minimal registration sketch (untested; the hook signature mirrors the one used for experiment 2 below):
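# hooks.py (hypothetical sketch; the hook signature mirrors the registration
# shown for experiment 2 below):
from typing import Any, Dict, Iterable

from kedro.config import ConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str], env: str,
                               extra_params: Dict[str, Any]) -> ConfigLoader:
        return HydraConfigLoaderMinimal(conf_paths)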

Support for Kedro templated configuration

If the Kedro configuration uses the string interpolation features of the TemplatedConfigLoader, we must make a few edits:

First, we import the global variables file (globals.yml) inside each configuration file using the Hydra defaults list directive:

defaults:
  - globals.yml # <-- import globals.yml variables

example_iris_data:
  type: pandas.CSVDataSet
  filepath: "${_directories.raw}/iris.csv" # String interpolation from imported variables

Note: if you are using the .yaml file extension instead of .yml, you need to remove the file extension from the globals.yml entry: just set - globals (Hydra prefers the .yaml file extension).

After this change, the Hydra configuration loader will raise a warning directing us to set the globals.yml package scope explicitly. To suppress this warning, we just need to add the directive # @package _global_ at the top of globals.yml:

# @package _global_
_directories: 
  raw: "./data/01_raw"
  interim: "/tmp/02_interim"
  processed: "/data/03_processed"

Note that we prefix global keys with an underscore to hide them from the final configuration structure (global variables are only used for string interpolation).

Features and drawbacks

✅ Can be used as a drop-in replacement for the Kedro ConfigLoader: most existing configurations should work unchanged.

✅ Mutually exclusive configuration groups: https://hydra.cc/docs/terminology#config-group

✅ Adds support for OmegaConf interpolation patterns (including access to environment variables and the Hydra configuration).

❌ No easy way to specify Hydra configuration overrides from the kedro run command-line tool. Parameters provided with the --params option will override the final configuration parameters but won't impact Hydra group choices (nor interpolation). This means that there is no way to change Hydra group choices apart from editing the defaults lists in the YAML files.

Last-resort hack: a workaround, if you really want to apply override commands to *one* particular file in your configuration (use at your own risk):
  1. Add a hydra_overrides parameters in cli.py:
@click.option(
    "--params", type=str, default="", help=PARAMS_ARG_HELP, callback=_split_params
)
+@click.argument('hydra_overrides', nargs=-1)
 def run(
     tag,
     env,
     parallel,
     runner,
     is_async,
     node_names,
     to_nodes,
     from_nodes,
     from_inputs,
     to_outputs,
     load_version,
     pipeline,
     config,
     params,
+    hydra_overrides,
):
  2. All CLI arguments are registered in the Kedro session object and can be retrieved within the HydraConfigLoaderMinimal:
       def _load_config_file(self, config_file: Path, overrides: List[str] = []) -> Dict[str, Any]:
 
         from hydra.experimental import compose, initialize_config_dir
         from omegaconf import OmegaConf
 
         overrides = overrides + self.global_overrides
 
+        override_path_pattern = Path(self.conf_paths[0]) / 'parameters.yml'
+        session = get_current_session(silent=True)
+        if session and (config_file.resolve() == override_path_pattern.resolve()):
+            overrides.extend(session.store['cli']['params']['hydra_overrides'])
 
         with initialize_config_dir(config_dir=str(config_file.parent), job_name=self.job_name):

Here the hydra overrides will be applied only to the base/parameters.yml file.

❌ Does not work well with Kedro config environments: each file is parsed by Hydra independently, so overrides and defaults cannot be coordinated across environments.

❌ Powerful but complex: with several environments, each of which can have several root configuration files, which are themselves parsed as Hydra configurations that can have nested parameter group files, we may have created a monster!

Experiment 2: embracing Hydra configuration

Approach

This time we load the whole Kedro configuration from a single Hydra root configuration file.

The trick to make the Hydra configuration work with the Kedro glob path patterns (the conf_loader.get("parameters/**") syntax) is to convert keys in the configuration into paths.

For instance, the configuration entry conf['catalog']['iris']['filepath'] is associated with the path catalog/iris/filepath and will be returned by the .get("catalog/**") call.

This approach should remain compatible with most Kedro features and plugins.
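As a quick illustration of this mapping, using the dict_to_paths helper defined in the full implementation at the end of this gist:

conf = {"catalog": {"iris": {"filepath": "data/iris.csv"}},
        "parameters": {"lr": 0.01}}

print(dict_to_paths(conf, sep='/', max_level=2))
# -> {'catalog/iris': {'filepath': 'data/iris.csv'}, 'parameters/lr': 0.01}
# Glob patterns such as "catalog/**" are then matched against these paths.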

Usage

We must first convert our Kedro configuration into a Hydra-compatible one:

  1. The catalog entries must be set inside a 'catalog' root dictionary in the Hydra configuration (the format remains the same as for Kedro: https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html). To keep a separate catalog.yaml file, we can leverage the Hydra defaults list feature (see the example below).

  2. The same applies to 'parameters', 'logging', 'credentials' and any other Kedro root configuration files.

  3. The Kedro configuration environment system is completely replaced by a Hydra parameter group 'env'. A configuration file must be defined for each environment (except 'base') in the 'env' directory. These files can override any setting from the other configuration files.

For instance:

  • conf/config.yaml (single root configuration file):

    defaults: # Ordered list of parameter groups (order is important for overrides).
      - catalog
      - parameters
      - _self_  # This file's own configuration
      - env: local # Last, loads environment overrides (local by default)
    
    directories:
      raw: "./data/01_raw"
      interim: "/tmp/02_interim"
      processed: "/data/03_processed"
  • conf/catalog.yaml (note how we set package scope with the # @package <scope> directive):

    # @package catalog
    
    example_iris_data:
      type: pandas.CSVDataSet
      filepath: "${directories.raw}/iris.csv"
  • conf/env/local.yaml (this time using the _global_ package scope):

    # @package _global_
    
    # Override parameters:
    parameters:
      example_test_data_ratio: 0.5
    
    # Override catalog entry configuration:
    catalog:
      example_iris_data2:
        filepath: "../data/iris.csv"

Notes:

  • Hydra works better with the ".yaml" file extension (rather than ".yml"). When using an extension other than ".yaml", the file extension must be set explicitly in defaults lists (otherwise Hydra throws "Error: Could not load <config>").

  • Patterns given to the get method can also match nested keys in the Hydra configuration. The depth of the lookup can be controlled with the lookup_depth parameter (= 1 by default).

  • Root keys are not included in the returned configuration dict (the 'parameters' and 'catalog' keys will not appear in the configuration; only sub-dictionary keys and values are returned). This means that any parameter defined in the root configuration file cannot be accessed directly using the .get(*patterns) method; however, these values can still be used for string interpolation within other configuration files. The root configuration file can then replace the globals.yml file from the Kedro TemplatedConfigLoader class.

Next, we just need to register the Hydra config loader in hooks.py, adding the env variable to the Hydra overrides list:

# In hooks.py, inside the project hooks class:
@hook_impl
def register_config_loader(self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any]) -> ConfigLoader:
    conf_root = Path(list(conf_paths)[0]).parent
    return FullHydraConfigLoader(conf_root=conf_root, overrides=[f'env={env}'])
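The loader can also be used programmatically, e.g. from a notebook (a sketch; the patterns follow the Kedro conventions described above):

from pathlib import Path

loader = FullHydraConfigLoader(conf_root=Path("conf"), overrides=["env=local"])
parameters = loader.get("parameters*", "parameters*/**")
catalog_conf = loader.get("catalog*", "catalog*/**")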

Autocompletion of parameter overrides

First make sure that autocompletion is enabled in your shell. For bash, we need to add the line eval "$(_KEDRO_COMPLETE=source kedro)" to .bashrc (refer to the Kedro documentation for other shells).

Next <>

Features and drawbacks

✅ Simpler configuration structure, without environment directories

✅ One unique Hydra configuration

✅ Minimal editing of the configuration; separation of concerns preserved

❌ Configuration not backward compatible with the Kedro ConfigLoader

❌ Error-prone translation of path patterns

"""
### Experiment 2: Full Hydra configuration loader ###
Copyright (c) 2021 Martin Sotir. All rights reserved.
This work is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
"""
from collections.abc import Mapping
from pathlib import Path, PurePosixPath
from typing import Any, Dict, List
from kedro.config import ConfigLoader
class FullHydraConfigLoader(ConfigLoader):
"""Load a Kedro configuration from a single hydra configuration.
Each call of the ``get`` method will reload the configuration and
return only paths in the dictionnary matching the given patterns.
This kedro config loader should be compatible with most kedro features
and plugins, however configuration files must be edited to follow Hydra
format:
1. The catalog entries must be set inside a 'catalog' root dictionnary
in the Hydra configuration (the format remains the same as for Kedro:
https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html )
To keep a seperate `catalog.yaml` file, we can leverage the Hydra
default list features (see the example below).
2. The same apply for 'parameters', 'logging', 'credentials' and any other
Kedro root configuration files.
3. The kedro configuration environment system is completly replaced by a Hydra
parameter group 'env'. A configuration file must be defined for each environment
(except 'base') in the 'env' directory. This files can override any setting
from other configuration files.
Example:
* `conf/config.yaml` (single root configuration file):
```yaml
defaults: # Ordered list of parameter group (order is important for overrides).
- catalog
- parameters
- catalog
- _self_ # This file configution
- env: local # Last, loads environment overrides (local by default)
directories:
raw: "./data/01_raw"
interim: "/tmp/02_interim"
processed: "/data/03_processed"
```
* `conf/catalog.yaml` (note how we set package scope with the `# @package <scope>` directive):
```yaml
# @package catalog
example_iris_data:
type: pandas.CSVDataSet
filepath: "${directories.raw}/iris.csv"
```
* `conf/env/local.yaml` (This time using a _global_ scope):
```
# @package _global_
# Override parameters:
parameters:
example_test_data_ratio: 0.5
# Overrides catalog entry configuration:
catalog:
example_iris_data2:
filepath: "../data/iris.csv"
```
Note: Hydra works better with ".yaml" extension files (rather than ".yml").
When using an extension other than ".yaml" the file extension must be explicitly
set in default lists (otherwise Hydra throws "Error: Could not load <config>").
The trick to make the Hydra configuration works with Kedro glob path patternes
lookup (`conf_loader.get("parameters/**")` syntax) is to convert keys in the
configuration into paths. For instance, the configuration entry
`conf['catalog']['iris']['filepath']` will is associated with the path
`catalog/iris/filepath`and will be returned by the `.get(`catalog/**)` call.
Note:
* Patterns given the ``get`` dictionary can also match nested keys in the
Hydra configuration. The depth of the lookup can be controlled with the
`lookup_depth` parameter (= 1 by default).
* Root keys are not included in the returned configuration dict ('parameters',
'catalog' keys will not appear in the configuration only sub-dictionnaries keys
and values are returned). This means that any parameter defined in the root
configuration file can not be accessed directly using the `.get(*patterns)`
method, however this values can still be used for string interpolation within
other confuraton files. The root configuration files can then replace the
`globals.yml` file from the Kedro TemplatedConfigLoader class.
"""
def __init__(self, conf_root: Path, conf_name: str = 'config', resolve_interpolation=True,
overrides: List[str] = None, job_name='app'):
"""
Args:
conf_root: Root directory of the Hydra configuration.
conf_name: Hydra configuration name (usally root configuration file name,
without extension if the extension is .yaml)
resolve_interpolation: Apply OmegaConf interpolation
overrides: List of Hydra overrides commands applied to the loaded configuration
(https://hydra.cc/docs/advanced/override_grammar/basic)
job_name: hydra job name (used in some hydra configuration fields)
Raises:
ValueError: If ``conf_paths`` is empty.
"""
self.conf_root = Path(conf_root).absolute()
self.conf_name = str(Path(conf_name).stem) if Path(conf_name).suffix == 'yaml' else conf_name
self.resolve_interpolation = resolve_interpolation
self.overrides = overrides if overrides is not None else []
self.job_name = job_name
def _load_configuration(self) -> Dict[str, Any]:
from hydra.experimental import compose, initialize_config_dir
from omegaconf import OmegaConf
with initialize_config_dir(config_dir=str(self.conf_root), job_name=self.job_name):
try:
conf = compose(config_name=self.conf_name, overrides=self.overrides, return_hydra_config=True)
except AssertionError as e:
if "Invalid loaded object type : NoneType" in e.args[0]:
# OmegaConf will raise an assertion error for empty files:
return dict()
else:
raise e
# Interpolation is not applied by default with the compose API:
return OmegaConf.to_container(conf, resolve=self.resolve_interpolation)
pass
def get(self, *patterns: str, include_hydra_conf=False, lookup_depth=1) -> Dict[str, Any]:
# Relaoad configuration each time:
config = self._load_configuration()
if not include_hydra_conf:
del config['hydra']
# convert nested dictionnary to single-level "path"-value dictionnary
config_paths = dict_to_paths(config, sep='/', max_level=lookup_depth)
# Filter paths using pathlib glob syntax and convert back the path dict to a nested dict:
def match_path(path):
path = PurePosixPath(path)
return any((path.match(p) for p in patterns))
filtered_conf = paths_to_dict({path: val for path, val in config_paths.items() if match_path(path)})
# Remove root config keys
filtered_conf = {sub_k: sub_v for k, v in filtered_conf.items() if isinstance(v, dict)
for sub_k, sub_v in v.items()}
# Remove key with '_' prefix (convention from Kedro ConfigLoader):
return {k: v for k, v in filtered_conf.items() if not k.startswith("_")}
def dict_to_paths(val, sep='/', sep_escape='###', max_level=float('inf'), prefix="") -> Dict[str, Any]:
"""
Flatten nested dictionnary into a single dictionnary.
Args:
sep: separator used when joining nested keys.
sep_escape: escape sequence replacing occurances of the separator in existing keys.
max_level: optional limit on the flattening (max_level=0 will return the orginal dict)
"""
if isinstance(val, Mapping) and max_level > 0:
return {
f"{prefix}{k.replace(sep, sep_escape)}{sub_k}": sub_v
for k, v in val.items()
for sub_k, sub_v in dict_to_paths(v, prefix=sep, sep=sep, max_level=max_level - 1).items()}
else:
return {"": val}
def nested_update(base_dict: dict, update: dict):
"""
Merge nested dictionnary. The `update` dict take precedence over `base_dict` values.
`base_dict` is modified inplace.
"""
for k, v, in update.items():
if (k in base_dict) and isinstance(base_dict[k], dict) and isinstance(v, dict):
base_dict[k] = nested_update(base_dict.get(k, {}), v)
else:
base_dict[k] = v
return base_dict
def paths_to_dict(val, sep='/', sep_escape="###"):
"""
Convert flatten back to a regular nested dictionnary (reverse `dict_to_paths` operation)
"""
base_dict = {}
for k, v in val.items():
elems = [elem.replace(sep_escape, sep) for elem in k.split(sep)]
update = dict()
last_update = update
for elem in elems[:-1]:
last_update[elem] = {}
last_update = last_update[elem]
last_update[elems[-1]] = v
base_dict = nested_update(base_dict, update)
return base_dict
"""
### Experiment 1: Miminaly invasive Hydra configuration loader ###
Copyright (c) 2021 Martin Sotir. All rights reserved.
This work is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
"""
from pathlib import Path
from typing import Any, Dict, Iterable, List, Union
from kedro.config import ConfigLoader
class HydraConfigLoaderMinimal(ConfigLoader):
"""Recursively scan the directories specified in ``conf_paths`` for
configuration files in the Hydra yaml format, load them,
and return them in the form of a config dictionary.
Each config file matching the pattern given in the ``.get()`` method
will be loaded in a separate Hydra environment and then be merged
exactly in the same way as the traditional Kedro ``ConfigLoader``.
This means that this loader will throw an exception if two configurations
contain duplicated keys.
Each loaded configuration file will be parsed using the Hydra Compose API
(https://hydra.cc/docs/next/experimental/compose_api).
Main features:
* Mutually exclusive configuration groups: https://hydra.cc/docs/terminology#config-group
* OmegaConf interpolation patterns (including access to environment variables):
https://omegaconf.readthedocs.io/en/latest/usage.html#variable-interpolation
* Configuration overrides synthax (trhoug the ``overrides`` parameter):
https://hydra.cc/docs/advanced/override_grammar/basic
Drawbacks:
* Structured configuration loading is not suported/tested.
* Most fields in the hydra configuration returned when ``return_hydra_config``
is set to `True` will not be meaningful (as hydra.main() is not called).
Hydra configurations (logger, job paths, multirun, sweeper, etc.) are
ignored by Kedro.
* If `return_hydra_config` is set to True, a `ValueError` will be raised (for
duplicated keys) if the the givent pattern in ``get()`` match more than one
file (within all paths in `conf_paths`). For this reason, it is not
recommended to set this parameter to true appart from debugging purposes.
If an hydra configuration parameter is required to run a kedro pipeline,
a dedicated configuraton variable can be set explicitly in the yaml
configuration using the interpolation syntax (see:
https://hydra.cc/docs/configure_hydra/intro#hydra ).
For instance: `job_dir: ${hydra.run.dir}`
* Ovverides will be applied indepedently, for each loaded files that match
cnfiguration file patterns. "Append" overrides may not work as expected or
may create duplicated entries.
* For now, there is no easy way to specify Hydra configuration overrides
from the `kedro run` command line tool. Parameters provided with the
`--params` option will ovverride the final configuration parameters
but won't impact Hydra group choices (nor interpolation). This means
that there is no way to change hydra group choices appart from editing
the `defaults` list in yaml files.
"""
def __init__(self, conf_paths: Union[str, Iterable[str]], resolve_interpolation=True,
return_hydra_config=False, global_overrides: List[str] = None, job_name='app'):
"""
Args:
conf_paths: Non-empty path or list of paths to configuration directories.
return_hydra_config: Export the full hydra configuration (not recommeded)
resolve_interpolation: Enable OmegaConf interpolation.
global_overrides: List of Hydra overrides commands applied to all loaded files
(https://hydra.cc/docs/advanced/override_grammar/basic)
job_name: hydra job name (used in some hydra configuration fields)
Raises:
ValueError: If ``conf_paths`` is empty.
"""
self.return_hydra_config = return_hydra_config
self.resolve_interpolation = resolve_interpolation
self.global_overrides = global_overrides if global_overrides is not None else []
self.job_name = job_name
super().__init__(conf_paths)
def _load_config_file(self, config_file: Path, overrides: List[str] = []) -> Dict[str, Any]:
from hydra.experimental import compose, initialize_config_dir
from omegaconf import OmegaConf
overrides = overrides + self.global_overrides
with initialize_config_dir(config_dir=str(config_file.parent), job_name=self.job_name):
try:
conf = compose(config_name=config_file.name, overrides=overrides, return_hydra_config=True)
except AssertionError as e:
if "Invalid loaded object type : NoneType" in e.args[0]:
# OmegaConf will raise an assertion error for empty files:
return dict()
else:
raise e
# Interpolation is not applied by default with the compose API:
resolved_conf = OmegaConf.to_container(conf, resolve=self.resolve_interpolation)
if not self.return_hydra_config:
del resolved_conf['hydra']
# Remove key with '_' prefix (convention from Kedro ConfigLoader):
return {k: v for k, v in resolved_conf.items() if not k.startswith("_")}