@cloudnull
Last active July 21, 2021 14:45

Unifying TripleO Orchestration with Task-Core and Directord

Include the URL of your launchpad blueprint: https://blueprints.launchpad.net/tripleo/+spec/unified-orchestration

The purpose of this spec is to introduce core concepts around Task-Core and Directord, explain their benefits, and cover why the project should consider using them.

TripleO has long been established as an enterprise deployment solution for OpenStack. While TripleO has been built to meet the needs of operators, it has never been built to be fast or concise. TripleO maintains many layers of abstraction which are effectively infinitely configurable, all at the expense of time. Over the past few cycles, the TripleO core team has been on a mission to simplify the stack, focusing on removing unnecessary services and minimizing or marginalizing other over-engineered components. These efforts have come to a head and are now approaching the point of diminishing returns. To further ease the time and complexity burdens within TripleO, the project must look deeper to achieve the next level of improvement; this is where Task-Core and Directord come in.

Task-Core:

A dependency management and inventory graph solution which allows operators to define tasks in simple terms with robust dominion over a given environment. Declarative dependencies will ensure that if a container/config is changed, only the necessary services are reloaded/restarted. Task-Core provides access to the right tools for a given job with provenance, allowing operators and developers to define outcomes confidently.

Directord:

A deployment framework built to manage the data center life cycle, which is both modular and fast. Directord focuses on consistently maintaining deployment expectations with a near real-time level of performance at almost any scale.

Problem Description

TripleO presently uses a collection of bespoke tools to achieve its orchestration goals. While the TripleO tool suite has worked and is likely to continue working should maintainers bear an increased burden, recent revelations around the apparatuses provide an inflection point. Because of the impending perfect storm spanning almost everything in the TripleO stack, the project is presented with a choice: stay the course or confidently course correct.

Staying the course:

The TripleO project increases the core team size and begins planning for long-term maintenance. The project focuses on developing individual maintainers for at-risk components (Ansible, Puppet, Heat); efforts to further simplify TripleO ostensibly come to an end. While new deployment models may be developed, reducing time complexity and improving scalability will no longer be a focus of the core team. The core team will ensure the TripleO project remains practical for the foreseeable future.

Course correcting:

Begin the systematic replacement of legacy core components with more tailored solutions to meet the project's actual needs. Tailoring the stack will simplify the maintenance required across life cycles. The corrective action necessary to provide TripleO with a quantum leap will be invasive; having said that, once complete, TripleO will exceed operator expectations and meet future scale requirements ensuring platform sovereignty all without breaking the user interface.

Upstream changes within applications like Ansible, which is fundamentally moving away from the TripleO use case, force TripleO maintainers to take on more ownership for no additional benefit. The TripleO use case is actively working against the future direction of Ansible. Secondly, while Puppet has remained stable over the years, the number of maintainers for Puppet modules within TripleO has reached an all-time low; these modules represent a significant amount of complexity in the stack and are becoming more of a risk to the project by the day. The cost of maintaining systems like Ansible and Puppet, and all of their corresponding overlapping functionality, especially as the project looks to support future OS versions, has a high likelihood of causing a significant disruption to the TripleO project. When an infinitely configurable interface powered by Heat is compounded by tightly coupled integrations across a set of increasingly brittle services, TripleO faces an existential crisis; the TripleO project needs to maintain less across the framework.

Presently, TripleO will see its objective end without a course correction, as there's no longer any meaningful performance, scale, or configurability that can be squeezed out of the current system. Additionally, as the project veers further off the path of leveraging supportable community tools, TripleO will see the time to deliver extend indefinitely as the project's value proposition invariably declines. To stem the tide, TripleO must greatly simplify the framework, enable developers to build intelligent tasks, and provide meaningful performance enhancements that scale to meet operators' expectations. If TripleO can capitalize on this moment, it will improve the quality of life for day one deployers and day two operations and upgrades.

Proposed Change

Dramatically enhance the TripleO developer, operator, and user experience by unifying the stack with tools built for TripleO by TripleO.

In some ways, the move toward Task-Core and Directord creates a General-Problem, as it's proposing the replacement of many bespoke tools, which are well known, with two new homegrown ones. Be that as it may, much attention has been given to the user experience, addressing many well-known pain points commonly associated with TripleO environments. Task-Core and Directord aim to remove problems at scale, drop the development barrier to entry, and open the flood gates of innovation. Teams surrounding TripleO will no longer worry about execution times and convoluted step processes. TripleO Deployers and developers of tomorrow will be empowered to run operations within an environment without dedicating weeks to a risky or otherwise error-prone process.

Overview

This specification consists of two parts that work together to achieve the project goals.

Task-Core:

Task-Core builds upon native OpenStack libraries to create a dependency graph and execute a compiled solution. With Task-Core, TripleO will be able to define a deployment instead of brute-forcing one. While powerful, Task-Core keeps development easy and consistent, reducing the time to deliver and allowing developers to focus on their actual deliverable, not the orchestration details. Task-Core also guarantees reproducible builds, runtime awareness, and the ability to resume when issues are encountered.

* Templates containing step-logic and ad-hoc tasks will be refactored into Task-Core definitions.

* Each component can have its own Task-Core purpose, providing resources and allowing other resources to depend on it.

* The invocation of Task-Core will be baked into the TripleO client, making its existence transparent to operators and deployers.

* Advanced users will be able to use Task-Core to meet their environment expectations without fully understanding the deployment nuance of multiple bespoke systems.
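
To make the "define a deployment instead of brute-forcing one" idea concrete, the following is a minimal sketch of dependency-ordered planning. It is not Task-Core's actual schema or API: the `SERVICES` mapping and the `plan` function are hypothetical names chosen for illustration, and the ordering relies on Python's standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical service definitions: each key lists the services it requires.
# These names are illustrative and are not Task-Core's real schema.
SERVICES = {
    "os-setup": set(),
    "mysql": {"os-setup"},
    "rabbitmq": {"os-setup"},
    "keystone": {"mysql", "rabbitmq"},
}


def plan(services):
    """Return batches of services; each batch depends only on earlier batches."""
    sorter = TopologicalSorter(services)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose requirements are met
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(plan(SERVICES), start=1):
        print(number, batch)
```

Services inside one batch have no dependency on each other, so an executor could run them in parallel; that is one way a graph-based plan can avoid the fixed, serial "step" structure described above.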

Directord:

Directord provides a modular execution platform that is environmentally aware. Because Directord leverages messaging, the platform can guarantee availability, transport, and performance. Directord has been built from the ground up, making use of industry-standard messaging protocols which ensure pseudo-real-time performance and limited resource utilization. The built-in DSL provides most of what the TripleO project will require out of the box. Because no solution is perfect, Directord utilizes a plugin system that will allow developers to create new functionality without compromise or needing to modify core components. Additionally, plugins are handled the same, allowing Directord to ensure the delivery and execution performance remain consistent.

* Directord is a single application that is ideally suited for containers while also providing native hooks into systems; this allows Directord to operate in heterogeneous environments. Because Directord is a simplified application, operators can choose how they want to run it and are not forced into a one-size-fits-all solution.

* Directord is platform-agnostic, allowing it to run across systems, versions, and network topologies while simultaneously guaranteeing it maintains the smallest possible footprint.

* Directord is built upon messaging, giving it the unique ability to span network topologies with varying latencies; messaging protocols compensate for high latency environments and will finally give TripleO the ability to address multiple data-centers and fully embrace "the edge."
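
The plugin idea described above, where new functionality is added without modifying core components and every plugin is handled the same way, can be sketched as a small verb registry. This is an illustration only and does not reproduce Directord's real plugin interface: `HANDLERS`, `register`, `dispatch`, and the `ECHO` and `UPPER` verbs are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a verb name to the function that handles it.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def register(verb):
    """Decorator that adds a handler to the registry without touching the core."""
    def decorator(func):
        HANDLERS[verb.upper()] = func
        return func
    return decorator


@register("ECHO")
def echo(argument):
    return argument


@register("UPPER")
def upper(argument):
    return argument.upper()


def dispatch(job):
    """Run a single-key job such as {"ECHO": "hello"} through its handler."""
    if len(job) != 1:
        raise ValueError("a job must contain exactly one verb")
    verb, argument = next(iter(job.items()))
    handler = HANDLERS.get(verb.upper())
    if handler is None:
        raise KeyError(f"no plugin registered for verb {verb!r}")
    return handler(argument)


if __name__ == "__main__":
    print(dispatch({"ECHO": "hello"}))
```

Because built-in and third-party verbs go through the same `dispatch` path in this sketch, any timing or delivery guarantee applied there would cover both equally.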

With Task-Core and Directord, TripleO will take a quantum leap in performance and configurability. TripleO will no longer force developers and deployers to run massive single-use systems to meet deployment goals. TripleO will have an intelligent dependency graph that is both easy to understand and extend. TripleO will now be environmentally aware, making it possible to run day two operations quickly and efficiently. TripleO will better fulfill its life cycle management through the use of cluster-aware orchestration. Finally, TripleO will dramatically shrink its maintenance burden by eliminating many bespoke systems running in unique and unsupported ways.

Alternatives

The TripleO core team grows and embraces the maintenance burden of the bespoke legacy tooling currently responsible for orchestration. Additionally, the TripleO project begins documenting the scale limitations and the boundaries that will never be addressed due to these limitations. Finally, TripleO effectively ends the multi-cycle simplification efforts and shifts focus to the required maintenance to maintain functional expectations long term.

Security Impact

While Task-Core and Directord are two new attack surfaces, their implementation will eventually remove the entirety of services like Ansible and Puppet, which are considerably more extensive in scope. A new security assessment will need to be performed to ensure the tooling exceeds the standard already set.

That said, steps have already been taken to ensure that systems are FIPS compatible, ensuring TripleO aims for a higher standard of operation from day one.

Upgrade Impact

Upgrades will hopefully be impacted in a very positive way. With the introduction of Task-Core, upgrade tasks will use well-defined dependencies and job-tailored actions. Therefore, upgrade jobs should be much more efficient, easier to understand, and effectively more straightforward, all of which make execution inherently faster. At present there's no possible way for TripleO to meet the expectation of being able to perform upgrade/update tasks rapidly; in the future, should this specification be implemented, TripleO will address updates and upgrades efficiently, with the aim to rein in maintenance windows so that TripleO is no longer synonymous with operations that take exorbitant amounts of time.

The introduction of Directord will necessitate a rewrite of much of the underlying functionality; however, upgrade tasks should be easily ported into the Directord orchestrations and will allow TripleO to begin writing upgrades that are based on the needs of a job, and allow us to massively simplify the task definitions.

Both Task-Core and Directord greatly improve the quality of life for operators and developers when considering upgrades and rollback operations. The TripleO project will finally realize roll-forward/backward capabilities on a per-application basis in a time-conscious way. No longer will a failed operation result in cluster-wide instability and obscurity. When planning activities, the Task-Core dependency graph will ensure only the actions required are included, without duplication or forcing deployers into multi-day maintenance scenarios. With Directord, operations are easily written and transparently executed. The combination of Task-Core and Directord will empower updates and upgrades in ways never thought possible.
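
As a rough illustration of "only the actions required are included, without duplication", the sketch below computes which services must be revisited when one of them changes: the changed service plus everything that transitively depends on it, each listed once and ordered so requirements come first. The `DEPENDS_ON` data and the `affected` function are hypothetical and are not taken from Task-Core.

```python
from collections import deque
from graphlib import TopologicalSorter

# Hypothetical dependency data: each key lists the services it requires.
DEPENDS_ON = {
    "os-setup": set(),
    "mysql": {"os-setup"},
    "rabbitmq": {"os-setup"},
    "keystone": {"mysql", "rabbitmq"},
    "glance": {"keystone", "mysql"},
}


def affected(depends_on, changed):
    """Return the changed service and its transitive dependents, in run order."""
    # Invert the edges so we can walk from a service to whatever relies on it.
    dependents = {name: set() for name in depends_on}
    for name, requirements in depends_on.items():
        for requirement in requirements:
            dependents[requirement].add(name)

    # Breadth-first walk; the "seen" set guarantees each service appears once.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)

    # Order only the affected subset so that requirements run before dependents.
    subgraph = {name: depends_on[name] & seen for name in seen}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(affected(DEPENDS_ON, "rabbitmq"))
```

In this sketch a change to `rabbitmq` never touches `mysql` or `os-setup`, which is the kind of scoping that would keep an upgrade from turning into a full redeploy.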

Other End User Impact

When following the happy path, end users, deployers, and operators will not interact with this change. The user interface will effectively remain the same. If an operator wishes to leverage the advanced capabilities of either Task-Core or Directord, the tooling will be documented and at their disposal.

It should be noted that there's a change in deployment architecture in that Directord follows a server/client model, albeit an ephemeral one. This change aims to be fully transparent; however, it is something that end users and deployers will need to be aware of.

Performance Impact

This specification, if implemented, will have a massive impact on performance. With Directord, the TripleO project will enjoy near-realtime execution without compromise.

  • Performance analysis has been done comparing configurability and runtime of Directord vs. Ansible, the TripleO default orchestration tool. This analysis highlights some of the performance gains this specification will provide; initial testing suggests that Task-Core and Directord are more than 10x faster than our current tool chain, representing a potential 90% time savings when executing a comparable workload.
  • One of the goals of this specification is to remove impediments in the time to work. Deployers should not be spending exorbitant time waiting for tools to do work; in some cases, waiting longer for a worker to be available than it would take to perform a task manually.
  • Deployers will no longer be required to run a massive server for medium-scale deployment. Regardless of size, the memory footprint and compute cores needed to execute a deployment will be significantly reduced.
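
The "10x faster" and "90% time savings" figures above are two views of the same number: a speedup of s leaves 1/s of the original runtime, so the fraction saved is 1 - 1/s. The short sketch below works that through. The function names are illustrative, and the 60-minute baseline is a made-up input, not a measurement from the analysis.

```python
def time_saved_fraction(speedup):
    """Fraction of the original runtime eliminated by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def projected_runtime(baseline_minutes, speedup):
    """Runtime after applying the speedup to a baseline duration."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_minutes / speedup


if __name__ == "__main__":
    # Hypothetical 60-minute baseline, used only to show the arithmetic.
    print(f"{time_saved_fraction(10):.0%} saved")
    print(f"{projected_runtime(60, 10):.1f} minutes")
```

The same formula shows why gains flatten out: going from 10x to 20x only moves the saving from 90% to 95%.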

Other Deployer Impact

Deployers are the primary focus of this specification, and the impact on them could be hugely positive. The time savings alone represent a massive quality of life improvement. The ability to configure deployments and debug problems is an unexpected bonus. If TripleO deployers are also considered developers, the ease of implementing new services will be a welcome addition. All that said, both Task-Core and Directord represent an unknown factor; as such, they are not battle-tested and will create uncertainty in an otherwise "stable" project.

Implementing both Task-Core and Directord promises a better tomorrow by fulfilling resolutions derived from the past. Extensive testing has been done; all known use-cases, from system-level configuration to container pod orchestration, have been covered, and automated tests have been created to ensure nothing breaks unexpectedly. Additionally, for the first time, these projects have expectations on performance, with tests backing up those claims, even at a large scale. This proposal aims to remove a mountain of technical debt while doing its best to create as little new debt as possible, all under the lens of improving the lives of deployers.

Should TripleO adopt Task-Core and Directord, new cloud topologies will open to deployers. At present, TripleO assumes SSH access between the Undercloud and Overcloud is always present. Additionally, TripleO believes the infrastructure is relatively static, making day two operations risky and potentially painful. Task-Core will reduce the computational burden when crafting action plans, and Directord will ensure actions are always performed against the functional hosts.
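
The spec does not describe how Directord decides which hosts are functional. Purely as an assumption for illustration, the sketch below treats a host as functional if it has reported in within a timeout window, and selects targets from that set rather than from a static inventory. The `functional_hosts` function, the heartbeat timestamps, and the 30-second default are all hypothetical.

```python
import time


def functional_hosts(last_seen, now=None, timeout=30.0):
    """Return hosts whose most recent check-in is within `timeout` seconds.

    `last_seen` maps a host name to the timestamp of its last check-in.
    This heartbeat model is an assumption, not Directord's documented behavior.
    """
    now = time.time() if now is None else now
    return sorted(
        host for host, seen in last_seen.items() if now - seen <= timeout
    )


if __name__ == "__main__":
    current = time.time()
    # Hypothetical check-in times: compute-1 is stale, the others are fresh.
    seen = {
        "controller-0": current - 2,
        "compute-0": current - 5,
        "compute-1": current - 600,
    }
    print(functional_hosts(seen, now=current))
```

Selecting targets at execution time, instead of trusting an inventory written at deployment time, is what would let day two operations tolerate hosts that have been added, removed, or gone offline.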

Another area this specification will improve is vendor integrations. Vendors will finally be able to provide meaningful task definitions which leverage an intelligent inventory and dependency system. No longer will TripleO require vendors to have in-depth knowledge of every deployment detail, even those outside of the scope of their deliverable. With easier job definitions, a simpler development process, and faster task execution, deployers will finally be able to develop solutions and test them with confidence, without needing to spend months embedding resources into TripleO and committing to huge capital expenditures associated with a minimally functional environment. Test clouds are still highly recommended sources of information; however, system requirements on the Undercloud will be reduced, meaning the cost of running test environments, in terms of both hardware and time, will be significantly lowered.

Developer Impact

Task-Core provides access to the right tool when required, meaning the implementation of Task-Core will not adversely impact developers, as they can presently write code in whatever format they want: Ansible, Puppet, and Directord are all perfectly viable options. Developers will need to change their focus on tasks and ensure their jobs use the new graphing capabilities. Because of the built-in dependency graph, the implementation of Task-Core should be a welcome one, without much in the way of negative developer impact. One hugely positive impact on developers can be found in the Task-Core interface validation. Task-Core will validate the input scheme, making the framework more intelligent, thereby removing errors caused by "free-form" input and correctly setting task expectations.
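
To show what validating input up front buys over "free-form" definitions, here is a minimal sketch of a task-definition check that reports every problem before anything runs. The field names (`id`, `requires`, `jobs`) and the `validate_task` function are assumptions made for this example and do not reflect Task-Core's actual input scheme.

```python
# Hypothetical required fields and their expected types.
REQUIRED_FIELDS = {"id": str, "requires": list, "jobs": list}


def validate_task(task):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in task:
            errors.append(f"missing field: {field}")
        elif not isinstance(task[field], expected):
            errors.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(task[field]).__name__}"
            )
    # Reject unknown keys so typos are caught instead of silently ignored.
    for name in sorted(set(task) - set(REQUIRED_FIELDS)):
        errors.append(f"unknown field: {name}")
    return errors


if __name__ == "__main__":
    print(validate_task({"id": "keystone", "requires": ["mysql"], "jobs": []}))
    print(validate_task({"id": 5, "extra": 1}))
```

Collecting all errors rather than stopping at the first one means a developer can fix a definition in a single pass instead of discovering problems one failed run at a time.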

To fully realize the benefits of this specification, Ansible tasks will need to be refactored into the Task-Core scheme. While Task-Core can run Ansible and Directord has a plugin system which easily allows developers to port legacy modules into Directord plugins, there will be a developer impact as the TripleO development methodology will change. It's fair to say that the potential developer impact will be huge, yet the shift isn't monumental. Much of the Ansible presently in TripleO is shell-oriented and, as such, is easily portable; as stated, compatibility layers exist, allowing the TripleO project to make the required shift gradually. That said, once the Ansible tasks are ported, the time saved in execution will be massive; this is on top of the fact that TripleO will no longer be plagued with errors in day two operations resulting from transient inventory.

Example Task-Core and Directord implementation for Keystone:

While this implementation example is fairly basic, it does result in a functional Keystone environment in roughly 5 minutes and includes services like MySQL, RabbitMQ, and Keystone, as well as ensuring that the operating system is set up and configured for a cloud execution environment. The most powerful aspect of this example is the inclusion of the graph dependency system, which will allow us to easily externalize services, such as in the case where deployers wish to offload applications into environments like OKD.

The implementation of Task-Core and Directord will not change the user interface when following a happy path; however, it will allow developers to bridge the TripleO to OKD gap more effectively. As mentioned, Directord is container-native. Images for Directord already exist on the Quay, Docker Hub, and GitHub registries; all of the appropriate metadata is available to support an OKD environment, and tests have been implemented to ensure Directord is functional from within pod environments. With Directord's ability to automagically support heterogeneous infrastructure, TripleO developers and deployers will now be able to implement solutions bridging container-native and physical infrastructure without relying on fragile interfaces or legacy transport models.

  • The use of advanced messaging protocols means TripleO will efficiently address deployments in local data centers or at the edge without transport stress.
  • The Directord server and storage can be easily offloaded, making it possible for the TripleO Client to be executed from simple environments without access to the overcloud network; imagine running a massive deployment from a laptop.
  • TripleO through the implementation of Task-Core and Directord will finally be able to compartmentalize systems.

Implementation

In terms of essential TripleO integration, most of the work will occur within the tripleoclient, with the following new workflow.

Execution Workflow:

┌────┐   ┌─────────────┐   ┌─────────┐   ┌─────────┬──────┐   ┌─────────┐
│USER├──►│TripleOclient├──►│Task-Core├──►│Directord│Server├──►│ Network │
└────┘   └─────────────┘   └─────────┘   └─────────┴──────┘   └─────────┘
                ▲                                    ▲             ▲
                │              ┌─────────┬───────┐   |             |
                └─────────────►│Directord│Storage│◄──┘             |
                               └─────────┴───────┘                 |
                                                                   |
                                         ┌─────────┬──────┐        |
                                         │Directord│Client│◄───────┘
                                         └─────────┴──────┘
  • Directord|Server - Task executor connecting to client.
  • Directord|Client - Client program running on remote hosts, connecting back to the Directord|Server.
  • Directord|Storage - An optional component, when not externalized, Directord will maintain the runtime storage internally. In this configuration Directord is ephemeral.

To enable a gradual transition, ansible-runner has been implemented within Task-Core, allowing the TripleO project to convert playbooks into tasks that rely upon strongly typed dependencies without requiring a complete rewrite. The initial implementation should be transparent. Once the Task-Core hooks are set within tripleoclient, functional groups can then convert their tripleo-ansible roles or ad-hoc Ansible tasks into Directord orchestrations. Teams will have the flexibility to transition code over time and are incentivized by a significantly improved user experience and shorter time to delivery.

Assignee(s)

Primary assignee:
  • Cloudnull - Kevin Carter
  • Mwhahaha - Alex Schultz
Other contributors:
  • Slagel - James Slagel
  • Odyssey4me - Jesse Pretorius

Work Items

  1. Package all of the Task-Core and Directord dependencies, should there be any.
  2. Package both Task-Core and Directord.
  3. Converge on a Directord deployment model (container, system, hybrid).
  4. Implement the Task-Core code path within TripleO client.
  5. Port in-template Ansible tasks to Directord orchestrations.
  6. Port Ansible roles into Directord orchestrations.

Dependencies

Both Task-Core and Directord are dependencies, as they're new projects. These dependencies may or may not be brought into the OpenStack namespace; regardless, both of these projects, and their associated dependencies, will need to be packaged and provided for by RDO.

Testing

If successful, the implementation of Task-Core and Directord will leave the existing testing infrastructure unchanged. TripleO will continue to function as it currently does through the use of the tripleoclient.

New tests will be created to ensure the Task-Core and Directord components remain functional and provide an SLA around performance and configurability expectations.

Documentation Impact

Documentation around Ansible will need to be refactored.

New documentation will need to be created to cover the advanced usage of Task-Core and Directord. Much of the client interactions from the "happy path" will remain unchanged.

References

@bshephar

https://gist.github.com/cloudnull/3daa86107162d05a8c698fed89141ca2#upgrade-impact
I think another pain point we have around upgrades is that a failure at any point leaves the entire cluster in an essentially unknown state. For example, if the upgrade fails after Pacemaker has been stopped but before it's restarted, you need to manually go and clean up the resources before starting the upgrade again. That assumes you have the prerequisite knowledge; otherwise you try the upgrade again and now it fails because Pacemaker isn't running.

You mention "cluster-aware orchestration", which, based on my understanding, should make it fairly easy to handle these kinds of issues during deployments and upgrades; that might be worth a brief sentence. Given TripleO's history with difficult-to-troubleshoot and ambiguous deployment errors, I think we should definitely ensure these pain points are addressed as part of this.

This is looking really good now.
