ml2 driver database consistency

Neutron/Southbound Interface consistency

This document presents the problem and proposes a solution for the data consistency issue between Neutron and the Southbound Interface (SBI). For the purposes of this proposal, we define the SBI as the interface, from Neutron's standpoint, that is called in order to configure the network controller (e.g. in OpenDaylight it is ODL NeutronNorthbound; in Open Virtual Network (OVN) it is the Cloud Management System (CMS)).

The solution has been implemented and successfully validated in OVN.

Problem description

In a common Neutron deployment model there can be multiple Neutron API workers processing requests. For each request, the Neutron worker updates the Neutron database and then invokes the SBI to translate the information into its specific data model.

There are at least two situations that can lead to inconsistencies between the two, for example:

Problem 1: Neutron API workers race condition

In Neutron:
  with neutron_db_transaction:
       update_neutron_db()
       driver.update_port_precommit()
  driver.update_port_postcommit()

In the driver:
  def update_port_postcommit(self, context):
      port = neutron_db.get_port()
      update_port_in_southbound_controller(port)

Imagine the case where a port is updated twice and each request is handled by a different API worker. The method responsible for updating the resource in the SBI (update_port_postcommit) is not atomic and is invoked outside of the Neutron database transaction. This could lead to a situation where the order in which the updates are committed to the Neutron database differs from the order in which they are committed to the SBI, resulting in an inconsistency.

This problem has been reported for OVN at bug #1605089.

Problem 2: SBI failures

Another situation is when the changes are already committed in Neutron but an exception is raised upon trying to update the SBI (e.g. lost connectivity to the Southbound controller).

This problem affected OVN (before it adopted the solution this document proposes) and ODL (before it adopted journaling). Other drivers might be affected by this issue as well. Obviously, it would be possible to try to immediately roll back the changes in the Neutron database and raise an exception but the rollback itself is an operation that could also fail.

In addition, rollbacks are not very straightforward when it comes to updates or deletes. In a case where a VM is being torn down and the SBI fails to delete a port, re-creating that port in Neutron doesn't necessarily fix the problem. The decommissioning of a VM involves many other things; in fact, things could even be made worse by leaving dirty data around. This is a problem that is better dealt with by other methods.

Proposed change

Introduction

In order to fix the problems presented in the Problem description section, this document proposes a solution based on Neutron's revision_number attribute. In summary, every resource in Neutron has an attribute called revision_number which gets incremented on each update made to that resource. For example:

$ openstack port create --network nettest porttest
...
| revision_number | 2 |
...

$ openstack port set porttest --mac-address 11:22:33:44:55:66

$ mysql -e "use neutron; select standard_attr_id from ports where id=\"91c08021-ded3-4c5a-8d57-5b5c389f8e39\";"
+------------------+
| standard_attr_id |
+------------------+
|             1427 |
+------------------+

$ mysql -e "use neutron; SELECT revision_number FROM standardattributes WHERE id=1427;"
+-----------------+
| revision_number |
+-----------------+
|               3 |
+-----------------+

This document proposes a solution that will use the revision_number attribute for three things:

  1. Perform a compare-and-swap operation based on the resource version
  2. Guarantee the order of the updates (Problem 1)
  3. Detect when resources in Neutron and the SBI are out-of-sync

Implementation details

To achieve this, a new table schema would be created to allow each SBI to track and manage resources:

Column name       Type      Description
----------------  --------  ------------------------------------------------
standard_attr_id  Integer   The reference ID from the standardattributes
                            table in Neutron for that resource.
                            ONDELETE SET NULL.
resource_uuid     String    Primary key. The UUID of the resource.
resource_type     String    The type of the resource (e.g. Port, Router, ...).
resource_mgmt     String    The name of the SBI driver that manages the
                            resource.
revision_number   Integer   The version of the object present in the SBI.
created_at        DateTime  The time that the entry was created. For
                            troubleshooting purposes.
updated_at        DateTime  The time that the entry was updated. For
                            troubleshooting purposes.
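
As an illustration only, here is a hedged sketch of this schema as a SQLAlchemy model; the class name, table name, and column sizes are hypothetical, not part of this proposal:

# Hypothetical sketch of the proposed table as a SQLAlchemy model.
import sqlalchemy as sa
from neutron_lib.db import model_base

class SBIRevisionNumber(model_base.BASEV2):
    __tablename__ = 'sbi_revision_numbers'

    # Set to NULL by the ONDELETE SET NULL constraint when the
    # Neutron resource is deleted.
    standard_attr_id = sa.Column(
        sa.BigInteger,
        sa.ForeignKey('standardattributes.id', ondelete='SET NULL'),
        nullable=True)
    resource_uuid = sa.Column(sa.String(36), primary_key=True)
    resource_type = sa.Column(sa.String(36), nullable=False)
    resource_mgmt = sa.Column(sa.String(255), nullable=False)
    revision_number = sa.Column(sa.BigInteger, nullable=False, default=0)
    created_at = sa.Column(sa.DateTime, default=sa.func.now())
    updated_at = sa.Column(sa.DateTime, onupdate=sa.func.now())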

This table provides the support for the interface described hereafter.

Interface

Each public method of the main interface can be accessed in two ways: as a decorator and as a context manager.

As an example, usage with the decorator would look like this:

@sync_resource
def create_port_postcommit(self, context):
    # operations to create the port
    pass

Whereas with the context manager it would look like this:

def create_port_postcommit(self, context):
    port = context.current

    with sync_resource.using(context, port):
        # operations to create the port
        pass
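
For illustration, a minimal sketch of how a single helper could offer both access modes; all names here are hypothetical and the revision-handling logic is elided to a comment:

# Hypothetical sketch: one object usable as a decorator and, via
# .using(), as a context manager.
import contextlib
import functools

class _SyncResource(object):

    @contextlib.contextmanager
    def using(self, context, resource):
        # Here the real implementation would decide whether the SBI
        # call is needed and bump the revision_number on success.
        yield

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(driver, context):
            with self.using(context, context.current):
                return func(driver, context)
        return wrapper

sync_resource = _SyncResource()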

new_resource

Creates a new entry in the table with a placeholder value. This is meant to be done before the SBI is called, in order to indicate that a create action needs to be executed. In most cases this will be called from create*_precommit() methods.

The proposed placeholder value is -1.

Expected result:

  • On Success: The entry for the resource is created with revision_number = -1.
  • On Exception: The entry is not created.
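
A minimal sketch of what new_resource could do, reusing the hypothetical SBIRevisionNumber model sketched earlier:

# Hypothetical sketch: insert the placeholder row within the same
# Neutron transaction so a failed SBI create is detectable later.
def new_resource(context, resource, resource_type, driver_name):
    context.session.add(SBIRevisionNumber(
        standard_attr_id=resource.get('standard_attr_id'),
        resource_uuid=resource['id'],
        resource_type=resource_type,
        resource_mgmt=driver_name,
        revision_number=-1))  # placeholder: not yet created in the SBI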

sync_resource

This method handles deciding in which cases the SBI needs to be called, and bumps the revision number once the sync with the SBI backend has succeeded. In most cases this will be called in the *_postcommit() methods.

Cases where the SBI is not called:

  • When the revision_number for the resource is lower than the one from the table. In this case there is no need to call the SBI, as it has already synced to a newer version of the resource.
  • When the standard_attr_id is NULL, in which case the resource has been deleted in Neutron and therefore an update is no longer needed in the SBI.

Expected result:

  • On Success: it updates the revision_number for the resource to the current resource revision number.
  • On Exception: The entry for the resource is left untouched.
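
As a hedged sketch, the bump on success can be expressed as a guarded UPDATE so a stale worker never moves the counter backwards (names as in the earlier hypothetical model):

# Hypothetical sketch: only ever move the cached revision forward.
def bump_revision(session, resource):
    session.query(SBIRevisionNumber).filter(
        SBIRevisionNumber.resource_uuid == resource['id'],
        SBIRevisionNumber.revision_number < resource['revision_number'],
    ).update({'revision_number': resource['revision_number']})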

Note

There's no lock linking both database updates in the postcommit() methods. So it's true that the method bumping the revision_number column in the new table in the Neutron DB could still race but, that should be fine because the SBI should implement checks that allow it to react to race conditions, e.g. by recording the revision_number internally.

If so, the mechanism that detects and fixes out-of-sync resources should catch this inconsistency as well and, based on the revision_number in the SBI controller, decide whether to sync the resource or only bump the revision_number in the cache table (in case the resource is already at the right version).

delete_resource

Allows the deletion, from the aforementioned table, of a resource that has been deleted in the SBI.

Expected result:

  • On Success: it deletes the corresponding entry for the resource.
  • On Exception: The entry for the resource remains but, since the table is defined with an ONDELETE SET NULL constraint, the entry will have its standard_attr_id set to NULL.
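
Under the same hypothetical model, a minimal sketch:

# Hypothetical sketch: drop the tracking row once the SBI delete succeeds.
def delete_resource(context, resource_uuid):
    context.session.query(SBIRevisionNumber).filter_by(
        resource_uuid=resource_uuid).delete()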

Note

There's no lock linking both database updates in the postcommit() methods. However, as explained before, there is no need to have one.

Implementation examples

For the different actions (create, update, and delete) this interface can be used as follows:

1. Create:

In the create*_precommit() method, an entry is created in the table within the same Neutron transaction. The revision_number column for the new entry will have a placeholder value until the resource is successfully created by calling the SBI.

If the SBI call fails to create the resource (while the creation succeeds in Neutron), the entry remains logged in the new table and the problem can be detected by fetching all resources whose revision_number column equals the placeholder value.

The pseudo-code will look something like this:

@new_resource
def create_port_precommit(self, context):
    pass

def create_port_postcommit(self, context):
    port = context.current
    with sync_resource.using(context, port):
        sbi.create_port(port)

2. Update:

Updates are simpler: the revision number for the resource is bumped after the SBI has been successfully updated, in the update*_postcommit() method. That way, if an update fails to be applied to the controller, the inconsistencies can be detected by a JOIN between the new table and the standardattributes table where the revision_number columns do not match.

The pseudo-code will look something like this:

def update_port_postcommit(self, context):
    port = context.current
    with sync_resource.using(context, port):
        sbi.update_port(port)

3. Delete:

The standard_attr_id column in the new table is a foreign key with an ONDELETE SET NULL constraint. That means that, upon Neutron deleting a resource, the standard_attr_id column in the new table will be set to NULL.

If deleting a resource succeeds in Neutron but fails in the SDN controller, the inconsistency can be detected by looking at all resources whose standard_attr_id is NULL.

The pseudo-code will look something like this:

def delete_port_postcommit(self, context):
    port = context.current
    with delete_resource.using(context, port):
        sbi.delete_port(port)

With the above mechanisms in place, it's possible to create a periodic task that runs frequently to detect and fix the inconsistencies caused by random backend failures.

Prerequisites

These changes have the following prerequisites, to be taken into consideration on the SBI side:

#1 - Store the revision_number corresponding to a change in the SBI controller

To be able to compare the version of the resource in Neutron against the version in the SBI controller, there is a need to know which version the SBI controller is at.

In order to achieve this, each SBI that adopts this solution needs to track the revision internally, e.g. with a special column called external_ids which external systems can use to store information about their own resources corresponding to the entries in the driver.

Therefore, every time a resource is created or updated in the SBI, the Neutron revision_number corresponding to that change will be stored in the external_ids column of that resource. That allows the SBI to look at both databases and detect whether the version in the SBI controller is up to date with Neutron or not.
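
For illustration, a hedged sketch of how a driver might stamp the revision when calling the SBI; the sbi client, its update_port signature, and the external_ids key name are all assumptions:

# Hypothetical sketch: record the Neutron revision_number on the SBI
# resource so both databases can be compared later.
def update_port_in_sbi(sbi, port):
    external_ids = {'neutron:revision_number': str(port['revision_number'])}
    sbi.update_port(port, external_ids=external_ids)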

#2 - Ensure correctness when updating the SDN controller

As stated in Problem 1, simultaneous updates to a single resource will race and the order in which these updates are applied is not guaranteed to be correct. That means that, if two or more updates arrive, it would not be possible to prevent an older version of an update from being applied after a newer one.

This document proposes having a special SBI command that runs as part of the same transaction that is updating a resource in the SBI controller, to prevent changes with a lower revision_number from being applied when the resource in the controller is already at a higher revision_number.

This command basically needs to do two things:

1. Add a verify operation to the external_ids column so that if another client modifies that column mid-operation the transaction will be restarted.

2. Compare the revision_number from the update against what is presently stored in the SBI controller. If the version in the SBI controller is already higher than the version in the update, abort the transaction.
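
A hedged sketch of such a guard follows; the transaction object and its verify() call are assumptions modelled on OVSDB-style transactions, not a real API:

# Hypothetical sketch of the guard run inside the same SBI transaction
# as the update itself.
class RevisionConflict(Exception):
    """Raised to abort the transaction when the update is stale."""

def check_revision_number(txn, resource, candidate_revision):
    # 1. Verify external_ids: if another client modified it
    #    mid-operation, the transaction is restarted.
    txn.verify(resource, 'external_ids')
    current = int(resource.external_ids.get('neutron:revision_number', -1))
    # 2. Abort if the SBI already holds a newer version.
    if current > candidate_revision:
        raise RevisionConflict()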

So basically this new command is responsible for guarding the resource by not allowing old changes to be applied on top of new ones. Here's a scenario where two concurrent updates arrive in the wrong order and how the solution above deals with it:

Neutron worker 1 (NW-1): Updates a port with address A (revision_number: 2)

Neutron worker 2 (NW-2): Updates a port with address B (revision_number: 3)

TXN 1: NW-2 transaction is committed first and the SBI resource now has RN 3

TXN 2: NW-1 transaction detects the change in the external_ids column and is restarted.

TXN 2: NW-1's command now sees that the SBI resource is at RN 3, which is higher than the update's version (RN 2), and aborts the transaction.

A bit more is needed to ensure the above works:

  1. Consolidate changes to a resource in a single transaction.
  2. When doing partial updates, use the SBI controller as the source of comparison to create the deltas.

    Being able to do a partial update in a resource is important for performance reasons; it's a way to minimize the number of changes that will be performed in the database.

    Some of the update() methods could create deltas using the current and original parameters that are passed to them. The current parameter is, as the name says, the current version of the object present in the Neutron DB. The original parameter is the previous version (current - 1) of that object. The problem with creating the deltas by comparing these two objects is that only the data in the Neutron DB is used (see the sketch below).
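
A hedged sketch of delta creation against the SBI controller's copy (names hypothetical):

# Hypothetical sketch: compute the delta against what the SBI
# controller currently holds, not against Neutron's `original` object.
def make_delta(current, sbi_resource):
    return {key: value for key, value in current.items()
            if sbi_resource.get(key) != value}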

So in summary, to guarantee the correctness of the updates this document proposes to:

  1. Have an SBI command responsible for comparing revision numbers and aborting the transaction, when needed.
  2. Make sure that changes to a resource are consolidated in a single transaction.
  3. When doing partial updates, create the deltas based on the current version in the Neutron DB vs. the one present in the SBI controller.

#3 - Detect and fix out-of-sync resources

When things are working as expected, the above changes should ensure that the Neutron DB and the SBI controller are in sync. But what happens when things go bad? As per Problem 2, things like temporarily losing connectivity with the SBI could cause changes to fail to be committed and the databases to get out of sync. It is then mandatory to detect the resources that were affected by these failures and fix them.

To this end, this document proposes including utility functions that can be run periodically by a maintenance task to keep the system healthy.

Utility functions

get_inconsistent_resources

Get a list of inconsistent resources, i.e. those for which the revision number in the aforementioned table differs from the one in the standardattributes table.
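
A hedged sketch of the underlying query, reusing the hypothetical model from earlier; StandardAttribute stands in for Neutron's standardattributes model:

# Hypothetical sketch: rows whose cached revision lags Neutron's.
def get_inconsistent_resources(session):
    return (session.query(SBIRevisionNumber)
            .join(StandardAttribute,
                  SBIRevisionNumber.standard_attr_id == StandardAttribute.id)
            .filter(SBIRevisionNumber.revision_number !=
                    StandardAttribute.revision_number)
            .all())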

get_deleted_resources

Get a list of resources that have been deleted from Neutron but not from the SBI controller. Once a resource is deleted in Neutron, the standard_attr_id foreign key in the new table will be set to NULL. Upon successfully deleting the resource through the SBI, the entry in the table should also be deleted; but if something fails, the entry will be kept and returned in this list so the maintenance task can later fix it.
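
Under the same hypothetical model, a minimal sketch:

# Hypothetical sketch: rows orphaned by the ONDELETE SET NULL constraint.
def get_deleted_resources(session):
    return (session.query(SBIRevisionNumber)
            .filter(SBIRevisionNumber.standard_attr_id.is_(None))
            .all())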

Alternatives

Journaling

An alternative solution to this problem is journaling. The basic idea is to create another table in the Neutron database and log every operation (create, update and delete) instead of passing it directly to the SDN controller.

A separate thread (or multiple instances of it) is then responsible for reading this table and applying the operations to the SDN backend.

This approach has been used and validated by drivers such as networking-odl.

Some things to keep in mind about this approach:

  • The code can get quite complex, as this approach is not only about applying the changes to the SDN backend asynchronously. The dependencies between each resource, as well as their operations, also need to be computed. For example, before attempting to create a router port, the router that the port belongs to needs to be created. Or, before attempting to delete a network, all the dependent resources on it (subnets, ports, etc.) need to be processed first.
  • The number of journal threads running can cause problems. In my tests I had three controllers, each with 24 CPU cores (Intel Xeon E5-2620 with hyperthreading enabled) and 64GB of RAM. Running 1 journal thread per Neutron API worker caused ovsdb-server to misbehave when under heavy pressure [1]. Running multiple journal threads seems to be causing other types of problems in other drivers as well.
  • When under heavy pressure [2], I noticed that the journal threads could come to a halt (or be really slowed down) while the API workers were handling a lot of requests. This resulted in some operations taking more than a minute to be processed. This behaviour can be seen in this screenshot.
  • Given that the 1-journal-thread-per-Neutron-API-worker approach is problematic, determining the right number of journal threads is also difficult. In my tests, I noticed that 3 journal threads per controller worked better, but that number was purely based on trial & error. In production this number should probably be calculated based on the environment; perhaps something like TripleO (or any upper layer) would be in a better position to make that decision.
  • At least temporarily, the data in the Neutron database is duplicated between the normal tables and the journal one.
  • Some operations, like creating a new resource via Neutron's API, will return HTTP 201, which indicates that the resource has been created and is ready to be used; but as these resources are created asynchronously, one could argue that the HTTP codes are now misleading. As a note, the resource will be created in the Neutron database by the time the HTTP request returns, but it may not be present in the SDN backend yet.

Given all these considerations, this approach is still valid, and the fact that it's already been used by other ML2 drivers makes it more open for collaboration and code sharing.

Footnotes


  1. I ran the tests using Browbeat, which basically orchestrates OpenStack Rally and monitors the machine's resource usage.

  2. I ran the tests using Browbeat, which basically orchestrates OpenStack Rally and monitors the machine's resource usage.
