This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode
Thesis: Gratuitous complexity exists within the Designate code, and in the operation of Designate as a service. Making Designate a producer-worker type of project will vastly simplify development and operation, and align it more closely with the true nature of the service it provides (DNS).
designate-pool-manager does a reasonably good job at pushing Create/Delete changes out to nameservers, but the process gets a bit less shiny after that.
- Polling to see if state is live is done via asynchronous and synchronous RPC with another component.
- Cache usage is a mess (storing multiple keys, keeping one key around forever, storing a different key for each type of operation).
- Periodic Sync/Recovery are very unreliable as the number of changes grows.
- The update_status and logic for calculating consensus is heavy-handed, too eager, and too complex.
- The state machine is very foggy, and the logic for updating status that gets pushed into central is obfuscated.
- Pool Managers are tied to one pool.
designate-zone-manager does a good job at executing periodic timers for the zones it manages. However:
- One zone-manager process is responsible for a certain set of zones; if the operations for that set of zones get heavy, a single zone-manager process could become overwhelmed.
- We rely on tooz to manage the extremely delicate task of ensuring balance and coverage of all zones by zone-manager processes.
- Certain work (export) that's in the critical path of operations has already crept into a component that wasn't really meant for it. As a substitute for proper workers, the zone-manager is looking like the current answer.
designate-mdns is a DNS server written in Python. It works well for small amounts of traffic, but as traffic grows, we may realize that we need it to be more specialized, as a DNS server written in Python should be. The logic for sending NOTIFYs and polling for changes seems unlikely to belong in mdns in the future. If those bits were removed, designate-mdns could be rewritten to make use of a better tool for the problem.
A change to the underlying architecture for executing actual work on DNS servers and for running other tasks. Essentially: remove designate-pool-manager and designate-zone-manager, replace them with designate-worker and designate-producer (names up for debate), and remove certain logic from designate-mdns. All of the actual "work" would be put in the scalable designate-worker process, which has work produced for it by the API/Central and by designate-producer. designate-mdns gets back to its roots, and only answers AXFRs.
No changes to the API or Database are required, and only minimal changes to designate-central.
These are the services that would remain present:
- designate-api - To receive JSON and parse it for Designate
- designate-central - To do validation/storage of zone/record data
- designate-mdns - To only serve AXFRs from Designate's database
- designate-worker - To perform any and all tasks that Designate needs to produce state on nameservers
- designate-producer - To run periodic/timed jobs, and produce work for designate-worker that is outside the normal path of API operations. For example: periodic recovery.
Other necessary components:
- Queue - Usually RabbitMQ
- Database - Usually MySQL
- Cache - (encouraged, although not necessary) Memcached, MySQL, Redis
Services/components that are no longer required:
- designate-pool-manager
- designate-zone-manager
- Zookeeper - although certain elements of tooz-like activities remain, mostly distributed locking, which can easily be done without tooz using the cache.
The scope of designate-worker's duties is essentially any and all tasks that Designate needs to take action to perform. For example:
- Create, Update, and Delete zones on pool targets via backend plugins
- Poll that the change is live
- Update a cache with the serial number for a zone/target
- Emit zone-exists events for billing
- Flatten Alias Records
- Clean up deleted zones
- Importing/Exporting zones
- Many more
The service essentially exposes a vast RPCAPI that contains tasks.
An important difference from Designate's current model is that none of these tasks call back. They are all fire-and-forget tasks that are shoved onto a queue and await worker action. tasks are essentially functions that, given relatively simple input, make the desired outcome happen on either nameservers or the Designate database.
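To make the fire-and-forget dispatch concrete, here is a minimal sketch in the style of oslo.messaging; the WorkerAPI class and create_zone method are hypothetical names for illustration, not existing Designate code:

```python
# Sketch only: assumes oslo.messaging-style one-way "cast" semantics.
import oslo_messaging as messaging
from oslo_config import cfg

class WorkerAPI(object):
    """Hypothetical client-side RPC API used by API/Central/producer
    to put tasks on the queue for designate-worker processes."""

    def __init__(self):
        transport = messaging.get_rpc_transport(cfg.CONF)
        target = messaging.Target(topic='designate-worker', version='1.0')
        self.client = messaging.RPCClient(transport, target)

    def create_zone(self, context, zone):
        # cast() is one-way: nothing waits for a reply, the task
        # simply lands on the queue for any available worker.
        self.client.cast(context, 'create_zone', zone=zone)
```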
The cache performs a similar function to the current pool manager cache. It will store state for each different type of task, which a worker can use to decide whether it needs to continue with a task received from the queue, or simply drop it and move on to the next task.
This varies by task; some are relatively simple: whether to perform a zone update to a certain serial number is knowable by looking at the serial number of the zone on each target in a pool. For DNSSEC zone signing, a key would probably be placed to indicate that a certain worker is re-signing a zone, as that is a longer-running process.
In the absence of such a cache, each worker will act naively and try to complete each task it receives.
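As a rough illustration, a minimal sketch of the serial-number check, assuming a memcached-style client; the key format and object attributes are assumptions:

```python
# Sketch only: key naming and zone/target attributes are hypothetical.
def needs_zone_update(cache, zone, target):
    """Return True if this target is still behind the zone's serial."""
    key = 'zone-serial-%s-%s' % (zone.name, target.id)
    cached_serial = cache.get(key)
    if cached_serial is None:
        # No cached state: be conservative and do the work.
        return True
    return int(cached_serial) < zone.serial
```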
Each task will be idempotent, to the degree that is possible.
As mentioned in the Cache section, tasks will be able to check whether they need to continue working based on certain indicators in the cache.
But they should also make an effort not to duplicate work; for instance, if a task is trying to delete a zone that's already gone, it should interpret the zone's absence as a sign that the delete succeeded, and move on.
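For example, a minimal sketch of that idempotent delete behavior; the backend interface and exception name here are assumptions, not Designate's actual API:

```python
# Sketch only: ZoneNotFound and backend.delete_zone() are hypothetical.
class ZoneNotFound(Exception):
    """Raised by the backend when a zone does not exist."""

def delete_zone_task(backend, zone):
    try:
        backend.delete_zone(zone)
    except ZoneNotFound:
        # The zone is already gone: treat that as a successful
        # delete rather than an error, keeping the task idempotent.
        pass
```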
On the whole, these tasks would simply be lifted from where they currently exist in the code, and wouldn't change all that much.
One slight change: during the course of a task, we may recheck that the work being undertaken still needs to be done.
As an example: an API customer creates many recordsets very quickly. The work dispatched to designate-worker processes would go to a lot of different places, and one of the first updates to actually reach a nameserver might contain all the changes necessary to bring the zone up to date. The other tasks being worked should check that the state is still behind before they send their NOTIFY, and check again after they've sent their NOTIFY but before they've begun polling, so that they can cut down on unnecessary work for themselves and for the nameservers.
You could get even smarter about the markers that you drop in a cache for these tasks. For example, on a zone update, you could drop a key of the form zoneupdate-foo.com. in the cache, and if other zoneupdate tasks for the same zone see that key, they would know to throw away their job and move on.
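A minimal sketch of that marker idea, assuming a memcached-style client whose add() succeeds only when the key does not already exist; the key format is illustrative:

```python
# Sketch only: relies on an atomic add() as found in memcached-style clients.
def claim_zone_update(cache, zone_name, ttl=60):
    """Try to claim the update for a zone. A False return means
    another worker already holds an update for this zone."""
    key = 'zoneupdate-%s' % zone_name  # e.g. 'zoneupdate-foo.com.'
    return cache.add(key, 'claimed', ttl)
```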
The partitioning of certain elements that Designate had previously disappears. This service will send DNS queries, and it will do CPU-bound tasks, but it will be one place to scale. It should be possible to have an extremely robust Designate architecture simply by scaling these workers.
designate-mdns will have its entire RPCAPI transferred to designate-worker. This will vastly simplify the amount of work it needs to do while it sits in the critical path of providing zone transfers to the nameservers Designate manages.
As a side-note, this would make this service much easier to optimize, or even rewrite in a faster programming language.
designate-producer is the place where jobs that produce tasks outside of the normal path of API operations, and that operate on some kind of timer, live.
The key difference from the zone-manager service is that this service simply generates work to be done, rather than actually doing the work. designate-producer decides what needs to be done, and sends RPC messages on the queue to designate-worker to actually perform the work.
As Designate has grown, we've seen the need for this kind of work grow vastly, and we expect even more of it in the future:
- Deleted zone purging
- Refreshing Secondary Zones
- Emitting zone exists tasks and other billing events
- DNSSEC signing of zones
- Alias record flattening
We could move the periodic_sync and periodic_recovery tasks from the Pool Manager to this service.
The periodic_sync and periodic_recovery tasks in the Pool Manager have been a constant struggle to maintain and get right. This is due to a lot of factors.
Making the generation of tasks by periodic processes the job of only one Designate component simplifies the architecture, and allows us to solve the problems it presents one time, one way, and generally do one thing well.
This service would use the same cache as (or at least one similar to) the designate-worker service, and would behave similarly if the cache goes away. If there is no cache, every designate-producer process would assume that it is alone, and execute the periodic timer for the entire Designate database, or whichever set of resources the timer operates on (zones, records).
This service would essentially be a group of timers that wake up on a cadence and create work to be put on the queue for designate-worker processes to pick up.
The overhead here is relatively low, since we're not actually doing the work, just scheduling it. That lets us focus on the unexpectedly difficult problem of dividing up the production of work that these processes put on the queue.
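In rough terms, a producer timer might look like the following minimal sketch; the worker_api client, the storage query, and the interval are all hypothetical:

```python
# Sketch only: worker_api and storage interfaces are assumptions.
import time

def periodic_recovery_timer(worker_api, storage, interval=60):
    """Wake up on a cadence and queue recovery work. No work is
    performed here; it is only produced for designate-worker."""
    while True:
        for zone in storage.find_zones(status='ERROR'):
            # Fire-and-forget cast: a worker performs the recovery.
            worker_api.recover_zone(context=None, zone=zone)
        time.sleep(interval)
```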
To explain more clearly: the biggest problem we have in this service is making it fault-tolerant without duplicating the work we hand to designate-worker processes. This was solved before by tooz, using the zone shards in the Designate database. But that model is not perfect, because we had only one zone-manager process doing the work for an entire shard, and we relied on questionable tooz drivers to do the sharding.
designate-worker processes, as described above, will do a certain amount of optimization so that they don't duplicate work. But if we generate too much cruft, those processes will be bogged down just by checking whether they need to do the work. So we should work to minimize the amount of duplicate work we produce.
There are a lot of different ways you could do this, including using tooz and Zookeeper, leader election, and all that jazz. But I believe the following is a simpler and more robust implementation.
All designate-producer processes will have a number: a config value, something dynamic based on the number of designate-producer processes, whatever. That number tells them how many slices to divide any piece of work into.
The designate-producer will take that number, and compute slices of the shard value on the zones table. Each process will then eagerly grab those slices of shards in order, and do the necessary work for the zones in each slice, moving on sequentially when finished.
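As a rough sketch, slicing the shard range might look like this, assuming shard values span 0..4095 (an assumption about the zones table; adjust to the real range):

```python
# Sketch only: the 0..4095 shard range is an assumption.
def shard_slices(num_slices, max_shard=4095):
    """Divide the shard range into contiguous, near-equal slices.

    shard_slices(3) -> [(0, 1364), (1365, 2729), (2730, 4095)]
    """
    size = (max_shard + 1) // num_slices
    slices = []
    for i in range(num_slices):
        start = i * size
        # The last slice absorbs any remainder from integer division.
        end = max_shard if i == num_slices - 1 else start + size - 1
        slices.append((start, end))
    return slices
```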
As an example:
- I have two designate-producer processes, and I have configured work to be split into three slices.
- Periodic Recovery wakes up at roughly the same time on the two producers.
- producer-1 attempts to put a key periodic-recovery-1 in the cache with an expiry; it is successful. Having placed the key, it begins work on the first slice of zone shards, taking zones in ERROR state within the first 1/3 of zone shards, and calling the appropriate task in designate-worker.
- producer-2 tries to place the periodic-recovery-1 key in the cache, if it does not exist. It finds that the key does exist, so it concludes that the first slice of work has already been claimed, and moves on to the second slice. It successfully places periodic-recovery-2 in the cache.
- producer-1 finishes the first slice of work, and attempts to place the periodic-recovery-2 key in the cache. It finds that it already exists, and moves on to periodic-recovery-3, the final slice, and claims it.
- producer-2 finishes the second slice, is unable to claim the third slice, and, since the number of slices was three, it knows that periodic_recovery has been completed.
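A minimal sketch of that claiming protocol, assuming a memcached-style atomic add() with expiry; the key names and the run_slice callable are hypothetical:

```python
# Sketch only: cache.add() must be atomic and fail if the key exists.
def run_periodic_recovery(cache, num_slices, run_slice, ttl=300):
    """Claim slices in order and work the ones we win; when every
    key is taken, the whole periodic run is covered."""
    for i in range(1, num_slices + 1):
        key = 'periodic-recovery-%d' % i
        if cache.add(key, 'claimed', ttl):
            # We own this slice: do the work for its shard range.
            run_slice(i)
        # Otherwise another producer claimed it; move to the next.
```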
This logic for splitting up work can differ by periodic timer; some timers may not want to split the work into that many pieces, so this can be pluggable per task. But it should be a good general rule for splitting up work.
An alternate implementation could track the tasks created by designate-producer in the cache and, upon failure in designate-worker, have the worker put them back on the queue after a sleep, letting another worker try the job. If designate-producer sees that a job (probably hashed) is still in progress, it will not create another instance of the same task.
One potential complication of this implementation is that, as the number of timers and tasks outside Designate's critical path grows, they may get in the way of designate-worker processes doing the tasks that are most important, namely CRUD of zones and records.
We propose that the critical tasks, mostly those from designate-central like CRUD of zones, should sit on a "high priority" queue or exchange, and that the slightly less important tasks that come from designate-producer should sit on a "low priority" queue or exchange. This way, workers can focus on high-priority work first, ensuring the most important parts of Designate's job get done first.
Some workers may be configured to only work on the low-priority queue.
Or perhaps we could have queues/exchanges for each type of task; this would be an optimal way to monitor the health of different types of tasks, and would isolate the sometimes long-running tasks that periodic timers produce from the relatively quicker, more important CRUD operations.
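One way to express the priority split, sketched in the style of oslo.messaging; the topic names and helper are illustrative assumptions:

```python
# Sketch only: topics 'designate-worker-high/low' are hypothetical.
import oslo_messaging as messaging
from oslo_config import cfg

def get_worker_client(priority='high'):
    """Return an RPC client bound to a priority-specific topic, so
    workers can subscribe to only the queues they should serve."""
    transport = messaging.get_rpc_transport(cfg.CONF)
    target = messaging.Target(topic='designate-worker-%s' % priority,
                              version='1.0')
    return messaging.RPCClient(transport, target)

# designate-central would cast CRUD tasks via get_worker_client('high');
# designate-producer would cast periodic work via get_worker_client('low').
```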
- Stand up a designate-worker service
- Migrate CRUD zone operations to designate-worker, reworking the cache implementation
- Stand up a designate-producer service
- Migrate Pool Manager periodic tasks to designate-producer, with small modifications to ensure they simply generate work for designate-worker
- Move designate-mdns' NOTIFYing and polling to designate-worker
- Fix up the update_status logic in designate-central
- Migrate all tasks from zone-manager to a split of designate-worker and designate-producer, where producer creates the work on the queue and worker executes it, ensuring scalable logic for distributed work production using the cache or some other method in designate-producer
- Delete pool-manager and zone-manager
- Profit!!!
- Target Milestone for completion: Mitaka-3 (¯\_(ツ)_/¯)
Tim Simmons https://launchpad.net/~timsim Paul Glass https://launchpad.net/~pnglass Eric Larson https://launchpad.net/~eric-larson