dpif, the DataPath InterFace.

In Open vSwitch terminology, a "datapath" is a flow-based software switch. A datapath has no intelligence of its own. Rather, it relies entirely on its client to set up flows. The datapath layer is core to the Open vSwitch software switch: one could say, without much exaggeration, that everything in ovs-vswitchd above dpif exists only to make the correct decisions interacting with dpif.

Typically, the client of a datapath is the software switch module in "ovs-vswitchd", but other clients can be written. The "ovs-dpctl" utility is also a (simple) client.

Overview

The terms written in quotes below are defined in later sections.

When a datapath "port" receives a packet, it extracts the headers (the "flow"). If the datapath's "flow table" contains a "flow entry" matching the packet, then it executes the "actions" in the flow entry and increments the flow's statistics. If there is no matching flow entry, the datapath instead appends the packet to an "upcall" queue.

Ports

A datapath has a set of ports that are analogous to the ports on an Ethernet switch. At the datapath level, each port has the following information associated with it:

  • A name, a short string that must be unique within the host. This is typically a name that would be familiar to the system administrator, e.g. "eth0" or "vif1.1", but it is otherwise arbitrary.

  • A 32-bit port number that must be unique within the datapath but is otherwise arbitrary. The port number is the most important identifier for a port in the datapath interface.

  • A type, a short string that identifies the kind of port. On a Linux host, typical types are "system" (for a network device such as eth0), "internal" (for a simulated port used to connect to the TCP/IP stack), and "gre" (for a GRE tunnel).

  • A Netlink PID for each upcall reading thread (see "Upcall Queuing and Ordering" below).

The dpif interface has functions for adding and deleting ports. When a datapath implements these (e.g. as the Linux and netdev datapaths do), then Open vSwitch's ovs-vswitchd daemon can directly control what ports are used for switching. Some datapaths might not implement them, or implement them with restrictions on the types of ports that can be added or removed (e.g. on ESX), on systems where port membership can only be changed by some external entity.
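
As a rough illustration of these functions, the sketch below adds a port and then removes it. It assumes an in-tree build (quoted OVS headers) and the dpif_port_add(), dpif_port_del(), and ODPP_NONE definitions of roughly the era this text was written; other releases may differ.

    /* Sketch only: signatures and ODPP_NONE are taken from OVS headers of
     * this era and may differ in other releases. */
    #include "dpif.h"
    #include "netdev.h"
    #include "odp-util.h"

    static int
    add_then_remove_port(struct dpif *dpif, const char *name, const char *type)
    {
        struct netdev *netdev;
        odp_port_t port_no = ODPP_NONE;            /* let the datapath pick a number */
        int error;

        error = netdev_open(name, type, &netdev);
        if (error) {
            return error;
        }

        error = dpif_port_add(dpif, netdev, &port_no);
        if (!error) {
            error = dpif_port_del(dpif, port_no);  /* remove it again */
        }
        netdev_close(netdev);
        return error;
    }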

Each datapath must have a port, sometimes called the "local port", whose name is the same as the datapath itself, with port number 0. The local port cannot be deleted.

Ports are available as "struct netdev"s. To obtain a "struct netdev *" for a port named 'name' with type 'port_type', in a datapath of type 'datapath_type', call netdev_open(name, dpif_port_open_type(datapath_type, port_type)). The netdev can be used to get and set important data related to the port (see the sketch after the list below), such as:

  • MTU (netdev_get_mtu(), netdev_set_mtu()).

  • Ethernet address (netdev_get_etheraddr(), netdev_set_etheraddr()).

  • Statistics such as the number of packets and bytes transmitted and received (netdev_get_stats()).

  • Carrier status (netdev_get_carrier()).

  • Speed (netdev_get_features()).

  • QoS queue configuration (netdev_get_queue(), netdev_set_queue(), and related functions).

  • Arbitrary port-specific configuration parameters (netdev_get_config(), netdev_set_config()). An example of such a parameter is the IP endpoint for a GRE tunnel.
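
A short example of the netdev_open() call described above, querying one of the listed attributes (the MTU). It assumes an in-tree build and the netdev_open()/netdev_get_mtu() signatures of this era.

    /* Sketch: open the netdev for a datapath port and print its MTU.
     * Assumes an in-tree build; signatures may differ in other releases. */
    #include <stdio.h>
    #include "dpif.h"
    #include "netdev.h"

    static int
    print_port_mtu(const char *name, const char *datapath_type,
                   const char *port_type)
    {
        struct netdev *netdev;
        int error = netdev_open(name,
                                dpif_port_open_type(datapath_type, port_type),
                                &netdev);
        if (!error) {
            int mtu;
            error = netdev_get_mtu(netdev, &mtu);
            if (!error) {
                printf("%s: MTU %d\n", name, mtu);
            }
            netdev_close(netdev);
        }
        return error;
    }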

Flow Table

The flow table is a collection of "flow entries". Each flow entry contains:

  • A "flow", that is, a summary of the headers in an Ethernet packet. The flow must be unique within the flow table. Flows are fine-grained entities that include L2, L3, and L4 headers. A single TCP connection consists of two flows, one in each direction.

    In Open vSwitch userspace, "struct flow" is the typical way to describe a flow, but the datapath interface uses a different data format to allow ABI forward- and backward-compatibility. datapath/README.md describes the rationale and design. Refer to OVS_KEY_ATTR_* and "struct ovs_key_*" in include/odp-netlink.h for details. lib/odp-util.h defines several functions for working with these flows.

  • A "mask" that, for each bit in the flow, specifies whether the datapath should consider the corresponding flow bit when deciding whether a given packet matches the flow entry. The original datapath design did not support matching: every flow entry was exact match. With the addition of a mask, the interface supports datapaths with a spectrum of wildcard matching capabilities, from those that only support exact matches to those that support bitwise wildcarding on the entire flow key, as well as datapaths with capabilities somewhere in between.

    Datapaths do not provide a way to query their wildcarding capabilities, nor is it expected that the client should attempt to probe for the details of their support. Instead, a client installs flows with masks that wildcard as many bits as acceptable. The datapath then actually wildcards as many of those bits as it can and changes the wildcard bits that it does not support into exact match bits. A datapath that can wildcard any bit, for example, would install the supplied mask, an exact-match only datapath would install an exact-match mask regardless of what mask the client supplied, and a datapath in the middle of the spectrum would selectively change some wildcard bits into exact match bits.

    Regardless of the requested or installed mask, the datapath retains the original flow supplied by the client. (It does not, for example, "zero out" the wildcarded bits.) This allows the client to unambiguously identify the flow entry in later flow table operations.

    The flow table does not have priorities; that is, all flow entries have equal priority. Detecting overlapping flow entries is expensive in general, so the datapath is not required to do it. It is primarily the client's responsibility not to install flow entries whose flow and mask combinations overlap.

  • A list of "actions" that tell the datapath what to do with packets within a flow. Some examples of actions are OVS_ACTION_ATTR_OUTPUT, which transmits the packet out a port, and OVS_ACTION_ATTR_SET, which modifies packet headers. Refer to OVS_ACTION_ATTR_* and "struct ovs_action_*" in include/odp-netlink.h for details. lib/odp-util.h defines several functions for working with datapath actions.

    The actions list may be empty. This indicates that nothing should be done to matching packets, that is, they should be dropped.

    (In case you are familiar with OpenFlow, datapath actions are analogous to OpenFlow actions.)

  • Statistics: the number of packets and bytes that the flow has processed, the last time that the flow processed a packet, and the union of all the TCP flags in packets processed by the flow. (The latter is 0 if the flow is not a TCP flow.)

The datapath's client manages the flow table, primarily in reaction to "upcalls" (see below).
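
The matching rule implied by the flow-and-mask description above can be stated compactly: a packet key matches a flow entry if and only if the key and the entry's flow agree on every bit the mask marks as significant. The sketch below is purely illustrative; real datapath keys are Netlink attribute lists (see lib/odp-util.h), not flat byte arrays.

    /* Illustrative masked comparison; an exact-match-only datapath behaves
     * as if 'mask' were all ones. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static bool
    masked_match(const uint8_t *packet_key, const uint8_t *flow,
                 const uint8_t *mask, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if ((packet_key[i] & mask[i]) != (flow[i] & mask[i])) {
                return false;                 /* a significant bit differs */
            }
        }
        return true;
    }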

Upcalls

A datapath sometimes needs to notify its client that a packet was received. The datapath mechanism to do this is called an "upcall".

Upcalls are used in two situations:

  • When a packet is received, but there is no matching flow entry in its flow table (a flow table "miss"), this causes an upcall of type DPIF_UC_MISS. These are called "miss" upcalls.

  • A datapath action of type OVS_ACTION_ATTR_USERSPACE causes an upcall of type DPIF_UC_ACTION. These are called "action" upcalls.

An upcall contains an entire packet. There is no attempt to, e.g., copy only as much of the packet as is normally needed to make a forwarding decision. Such an optimization is doable, but experimental prototypes showed it to be of little benefit because an upcall typically contains the first packet of a flow, which is usually short (e.g. a TCP SYN). Also, the entire packet can sometimes really be needed.

After a client reads a given upcall, the datapath is finished with it, that is, the datapath doesn't maintain any lingering state past that point.

The latency from the time that a packet arrives at a port to the time that it is received from dpif_recv() is critical in some benchmarks. For example, if this latency is 1 ms, then a netperf TCP_CRR test, which opens and closes TCP connections one at a time as quickly as it can, cannot possibly achieve more than 500 transactions per second, since every connection consists of two flows with 1-ms latency to set up each one.

To receive upcalls, a client has to enable them with dpif_recv_set(). A datapath should generally support being opened multiple times (e.g. so that one may run "ovs-dpctl show" or "ovs-dpctl dump-flows" while "ovs-vswitchd" is also running) but need not support more than one of these clients enabling upcalls at once.
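
A minimal sketch of the steps just described: open an existing datapath and enable upcall reception on it. It assumes an in-tree build and the dpif_open()/dpif_recv_set() signatures of this era.

    /* Sketch: open a datapath by name/type and enable upcalls on it. */
    #include <stdbool.h>
    #include "dpif.h"

    static int
    open_and_enable_upcalls(const char *dp_name, const char *dp_type,
                            struct dpif **dpifp)
    {
        int error = dpif_open(dp_name, dp_type, dpifp);  /* e.g. "ovs-system", "system" */
        if (!error) {
            error = dpif_recv_set(*dpifp, true);         /* start receiving upcalls */
            if (error) {
                dpif_close(*dpifp);
                *dpifp = NULL;
            }
        }
        return error;
    }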

Upcall Queuing and Ordering

The datapath's client reads upcalls one at a time by calling dpif_recv(). When more than one upcall is pending, the order in which the datapath presents upcalls to its client is important. The datapath's client does not directly control this order, so the datapath implementer must take care during design.

The minimal behavior, suitable for initial testing of a datapath implementation, is that all upcalls are appended to a single queue, which is delivered to the client in order.

The datapath should ensure that a high rate of upcalls from one particular port cannot cause upcalls from other sources to be dropped or unreasonably delayed. Otherwise, one port conducting a port scan or otherwise initiating high-rate traffic spanning many flows could suppress other traffic. Ideally, the datapath should present upcalls from each port in a "round robin" manner, to ensure fairness.

The client has no control over "miss" upcalls and no insight into the datapath's implementation, so the datapath is entirely responsible for queuing and delivering them. On the other hand, the datapath has considerable freedom of implementation. One good approach is to maintain a separate queue for each port, to prevent any given port's upcalls from interfering with other ports' upcalls. If this is impractical, then another reasonable choice is to maintain some fixed number of queues and assign each port to one of them. Ports assigned to the same queue can then interfere with each other, but not with ports assigned to different queues. Other approaches are also possible.
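
The per-port queue approach can be pictured as below. This is a conceptual sketch with hypothetical types, not the Open vSwitch implementation; it only shows how a round-robin scan keeps one busy port from starving the rest.

    /* Conceptual round-robin drain across per-port upcall queues. */
    #include <stddef.h>

    #define N_PORTS 64

    struct upcall;                                /* opaque in this sketch */
    struct upcall_queue { struct upcall *head; }; /* queue internals elided */

    /* Hypothetical: removes and returns the head of 'q', or NULL if empty. */
    struct upcall *queue_pop(struct upcall_queue *q);

    static struct upcall_queue port_queues[N_PORTS];

    struct upcall *
    next_upcall(void)
    {
        static size_t rr = 0;                     /* where the last scan stopped */

        for (size_t i = 0; i < N_PORTS; i++) {
            struct upcall_queue *q = &port_queues[rr];
            rr = (rr + 1) % N_PORTS;

            struct upcall *upcall = queue_pop(q);
            if (upcall) {
                return upcall;                    /* deliver from this port */
            }
        }
        return NULL;                              /* every queue is empty */
    }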

The client has some control over "action" upcalls: it can specify a 32-bit "Netlink PID" as part of the action. This terminology comes from the Linux datapath implementation, which uses a protocol called Netlink in which a PID designates a particular socket and the upcall data is delivered to the socket's receive queue. Generically, though, a Netlink PID identifies a queue for upcalls. The basic requirements on the datapath are:

  • The datapath must provide a Netlink PID associated with each port. The client can retrieve the PID with dpif_port_get_pid().

  • The datapath must provide a "special" Netlink PID not associated with any port. dpif_port_get_pid() also provides this PID. (ovs-vswitchd uses this PID to queue special packets that must not be lost even if a port is otherwise busy, such as packets used for tunnel monitoring.)

The minimal behavior of dpif_port_get_pid() and the treatment of the Netlink PID in "action" upcalls is that dpif_port_get_pid() returns a constant value and all upcalls are appended to a single queue.

The preferred behavior is:

  • Each port has a PID that identifies the queue used for "miss" upcalls on that port. (Thus, if each port has its own queue for "miss" upcalls, then each port has a different Netlink PID.)

  • "miss" upcalls for a given port and "action" upcalls that specify that port's Netlink PID add their upcalls to the same queue. The upcalls are delivered to the datapath's client in the order that the packets were received, regardless of whether the upcalls are "miss" or "action" upcalls.

  • Upcalls that specify the "special" Netlink PID are queued separately.

Multiple threads may want to read upcalls simultaneously from a single datapath. To support multiple threads well, the preferred behavior above is extended as follows:

  • Each port has multiple PIDs. The datapath distributes "miss" upcalls across the PIDs, ensuring that a given flow is mapped in a stable way to a single PID.

  • For "action" upcalls, the thread can specify its own Netlink PID or other threads' Netlink PID of the same port for offloading purpose (e.g. in a "round robin" manner).

Packet Format

The datapath interface works with packets in a particular form. This is the form taken by packets received via upcalls (i.e. by dpif_recv()). Packets supplied to the datapath for processing (i.e. to dpif_execute()) also take this form.

A VLAN tag is represented by an 802.1Q header. If the layer below the datapath interface uses another representation, then the datapath interface must perform conversion.
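
For reference, a single-tagged frame in this form has the layout below: the 802.1Q tag sits between the source MAC and the EtherType. The field names are illustrative; Open vSwitch defines its own packet headers in lib/packets.h.

    /* Illustrative layout of an Ethernet frame carrying an 802.1Q VLAN tag. */
    #include <stdint.h>

    struct vlan_eth_header {
        uint8_t  dst[6];          /* destination MAC */
        uint8_t  src[6];          /* source MAC */
        uint16_t tpid;            /* 0x8100, network byte order */
        uint16_t tci;             /* PCP (3 bits) | DEI (1 bit) | VLAN ID (12 bits) */
        uint16_t ethertype;       /* encapsulated protocol, e.g. 0x0800 for IPv4 */
    } __attribute__((packed));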

The datapath interface requires all packets to fit within the MTU. Some operating systems internally process packets larger than the MTU, with features such as TSO and UFO. When such a packet passes through the datapath interface, it must be broken into multiple packets of MTU size or smaller for presentation as upcalls. (This does not happen often, because an upcall typically contains the first packet of a flow, which is usually short.)

Some operating system TCP/IP stacks maintain packets in an unchecksummed or partially checksummed state until transmission. The datapath interface requires all host-generated packets to be fully checksummed (e.g. IP and TCP checksums must be correct). On such an OS, the datapath interface must fill in these checksums.
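
The checksums in question are the standard Internet checksum (RFC 1071) used by IPv4 and TCP; TCP and UDP additionally cover a pseudo-header, omitted here. A generic implementation is sketched below for reference; Open vSwitch has its own helpers for this.

    /* RFC 1071 Internet checksum over 'len' bytes; returns the host-order
     * result, which the caller stores in the header in network byte order. */
    #include <stddef.h>
    #include <stdint.h>

    static uint16_t
    internet_checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {
            sum += (uint32_t) p[0] << 8 | p[1];   /* 16-bit big-endian words */
            p += 2;
            len -= 2;
        }
        if (len) {
            sum += (uint32_t) p[0] << 8;          /* pad an odd trailing byte */
        }
        while (sum >> 16) {
            sum = (sum & 0xffff) + (sum >> 16);   /* fold carries back in */
        }
        return (uint16_t) ~sum;                   /* one's complement */
    }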

Packets passed through the datapath interface must be at least 14 bytes long, that is, they must have a complete Ethernet header. They are not required to be padded to the minimum Ethernet length.

Typical Usage

Typically, the client of a datapath begins by configuring the datapath with a set of ports. Afterward, the client runs in a loop polling for upcalls to arrive.

For each upcall received, the client examines the enclosed packet and figures out what should be done with it. For example, if the client implements a MAC-learning switch, then it searches the forwarding database for the packet's destination MAC and VLAN and determines the set of ports to which it should be sent. In any case, the client composes a set of datapath actions to properly dispatch the packet and then directs the datapath to execute those actions on the packet (e.g. with dpif_execute()).

Most of the time, the actions that the client executed on the packet apply to every packet with the same flow. For example, the flow includes both destination MAC and VLAN ID (and much more), so this is true for the MAC-learning switch example above. In such a case, the client can also direct the datapath to treat any further packets in the flow in the same way, using dpif_flow_put() to add a new flow entry.
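
In outline, the client loop described in the last three paragraphs has the shape below. Everything in the sketch is a hypothetical stand-in: the real entry points are dpif_recv(), dpif_execute(), and dpif_flow_put(), whose exact signatures vary between Open vSwitch releases and are therefore not spelled out here.

    /* Conceptual client loop; all types and helpers are hypothetical. */
    #include <stddef.h>

    struct dpif;                         /* opaque datapath handle */
    struct upcall;                       /* packet plus extracted flow */
    struct action_list;                  /* datapath actions to apply */

    /* Hypothetical wrappers around the dpif calls named above. */
    struct upcall *receive_upcall(struct dpif *);                      /* dpif_recv()       */
    void decide_actions(const struct upcall *, struct action_list *);  /* e.g. MAC learning */
    void execute_on_packet(struct dpif *, const struct upcall *,
                           const struct action_list *);                /* dpif_execute()    */
    void install_flow(struct dpif *, const struct upcall *,
                      const struct action_list *);                     /* dpif_flow_put()   */

    void
    client_loop(struct dpif *dpif, struct action_list *actions)
    {
        struct upcall *upcall;

        while ((upcall = receive_upcall(dpif)) != NULL) {
            decide_actions(upcall, actions);           /* pick output ports, rewrites, ... */
            execute_on_packet(dpif, upcall, actions);  /* handle this packet now */
            install_flow(dpif, upcall, actions);       /* and later packets of the same flow */
        }
    }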

Other tasks the client might need to perform, in addition to reacting to upcalls, include:

  • Periodically polling flow statistics, perhaps to supply to its own clients.

  • Deleting flow entries from the datapath that haven't been used recently, to save memory.

  • Updating flow entries whose actions should change. For example, if a MAC learning switch learns that a MAC has moved, then it must update the actions of flow entries that sent packets to the MAC at its old location.

  • Adding and removing ports to achieve a new configuration.

Thread-safety

Most of the dpif functions are fully thread-safe: they may be called from any number of threads on the same or different dpif objects. The exceptions are:

  • dpif_port_poll() and dpif_port_poll_wait() are conditionally thread-safe: they may be called from different threads only on different dpif objects.

  • dpif_flow_dump_next() is conditionally thread-safe: It may be called from different threads with the same 'struct dpif_flow_dump', but all other parameters must be different for each thread.

  • dpif_flow_dump_done() is conditionally thread-safe: All threads that share the same 'struct dpif_flow_dump' must have finished using it. This function must then be called exactly once for a particular dpif_flow_dump to finish the corresponding flow dump operation.

  • Functions that operate on 'struct dpif_port_dump' are conditionally thread-safe with respect to those objects. That is, one may dump ports from any number of threads at once, but each thread must use its own struct dpif_port_dump.
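
For example, the port-dump rule above means concurrent dumps look like the sketch below, where each thread declares its own struct dpif_port_dump on its stack. It assumes an in-tree build and the DPIF_PORT_FOR_EACH macro and struct dpif_port fields as they existed around this time.

    /* Sketch: dump ports with a thread-local 'struct dpif_port_dump'.
     * Assumes an in-tree build; macro and field names are from the OVS
     * headers of this era and may differ in other releases. */
    #include <stdio.h>
    #include "dpif.h"

    static void
    list_ports(struct dpif *dpif)
    {
        struct dpif_port_dump dump;       /* one per thread -- never shared */
        struct dpif_port port;

        DPIF_PORT_FOR_EACH (&port, &dump, dpif) {
            printf("%s (%s)\n", port.name, port.type);
        }
    }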
