The Programmable Data Plane Reading list has been permanently moved to its new home, programmabledataplane.review. This repo is no longer maintained; please direct your pull requests to the new GitHub repo.
This is a reading list for students, practitioners, and researchers
interested in the general area of programmable data plane devices. Topics
include SmartNICs, programmable middleboxes and software/hardware switches,
that is, everything that may underlie a software-defined network, with only
marginal attention to major application areas such as Network Function
Virtualization, service chaining, or 5G.
The reading list is organized into a rough hierarchy based on the major
topics of Abstractions, Architecture, Applications, and
Miscellanea; note that this hierarchy is somewhat arbitrary and its
purpose is merely to provide some organization. The individual papers
are tagged as “mustread”, “important”, and “interesting” (available only in
the HTML version), with the approximate meaning “read at least these papers
to get a good understanding of the area”, “papers for getting more familiar
with some sub-areas”, and “interesting contributions to the field”,
respectively. Just like the hierarchy, the tags are also pretty much
arbitrary and follow the subjective view of the authors; as always, your
mileage may vary.
Note: Some of the linked papers are behind paywalls. We double-checked
that all listed papers can be accessed freely by a moderate amount of
googling; we still provide the paywall links, as user-provided PDF links
often do not remain stable over time.
Abstractions
At the heart of programmable data planes lies the question of which
abstractions and programming interfaces to provide. We first review
literature on low-level APIs, including OpenFlow and P4, and then discuss
higher-level languages and compilers, including DevoFlow and the
Frenetic framework. Particular focus is put on stateful abstractions. We
then extend our review to literature on parser design as well as
scheduling, and in particular, the question of whether universal
packet scheduling algorithms exist.
Languages and Compilers
We start our survey with the seminal paper on OpenFlow, introducing a
standardized interface to manage flow table entries in data plane devices
via a standard control-plane–data-plane API. We then proceed by discussing
P4 and its alternatives and use cases. We also review, among others,
high-performance packet processing languages and make the case for
intermediate representations for programmable data planes.
We then proceed to higher-level languages and abstractions, discussing
programming languages such as Pyretic, systems such as Maple, and novel
switch designs like HARMLESS that seamlessly add SDN capability to legacy
network gear.
The OpenFlow whitepaper. The original idea in OpenFlow was to provide a way
for researchers to run experimental protocols in the networks they use
every day. OpenFlow is based on an Ethernet switch, with an internal
flow-table, and a standardized interface to add and remove flow entries.
Besides letting researchers evaluate their ideas in real-world traffic
settings, this allowed OpenFlow to serve as a useful campus component in
large-scale testbeds.
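The core idea can be sketched in a few lines of Python. This is a toy model, not the OpenFlow wire protocol: the field names and the `send_to_controller` miss behavior are simplified assumptions.

```python
# Toy sketch of an OpenFlow-style flow table: match/action entries installed
# by the controller, consulted by the switch for every packet.

class FlowTable:
    def __init__(self):
        # Entries are (priority, match_dict, action), checked in priority order.
        self.entries = []

    def add_entry(self, match, action, priority=0):
        self.entries.append((priority, match, action))
        self.entries.sort(key=lambda e: -e[0])  # highest priority first

    def lookup(self, packet):
        # A packet matches an entry if all specified fields agree;
        # unspecified fields act as wildcards.
        for _prio, match, action in self.entries:
            if all(packet.get(k) == v for k, v in match.items()):
                return action
        return "send_to_controller"  # a table miss goes to the control plane

table = FlowTable()
table.add_entry({"eth_dst": "aa:bb"}, "output:1")
table.add_entry({"ip_dst": "10.0.0.2", "tcp_dst": 80}, "drop", priority=10)

print(table.lookup({"eth_dst": "aa:bb", "ip_dst": "10.0.0.9"}))  # output:1
print(table.lookup({"ip_dst": "10.0.0.2", "tcp_dst": 80}))       # drop
print(table.lookup({"eth_dst": "cc:dd"}))                        # send_to_controller
```

The interesting part is what is *not* in the table: a miss triggers an interaction with the controller, which is exactly the overhead that later work such as DevoFlow tries to reduce.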
This paper describes a model that combines a parallel programming model
with heterogeneous multiprocessor implementations. The parallel packet
processing model uses coarse-grained SPMD parallelism to free users from
thread management and it requires the host system to locate protocol
headers in the packet before a parallel copy of the program executes. The
packetC language abstracts and encapsulates familiar packet processing data
sets and operations into new aggregate data types and operators, e.g., for
packets, databases and searchsets.
As an alternative to P4, Protocol-Oblivious Forwarding (POF) is presented
as a key enabler for highly flexible and programmable SDN. The goal is to
remove any dependency on protocol-specific configurations on the forwarding
elements and, in addition to P4’s stateless design, enhance the data-path
with new stateful instructions to support genuine software defined
networking behavior. A generic flow instruction set (FIS) is defined to
fulfill this purpose and both hardware-based and open source software-based
prototypes are shown to demonstrate the feasibility and advantages of POF.
This seminal paper introduces P4, a high-level language for programming
protocol-independent packet processors. P4 has three goals: (1)
reconfigurability, in that programmers can change the way switches process
packets once they are deployed, (2) protocol independence, in that switches
are not tied to any specific network protocols, and (3) target
independence, in that programmers can describe packet-processing
functionality independently of the specifics of the underlying
hardware. The paper demonstrates P4 by showing how to configure a switch to
add a new hierarchical label.
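Protocol independence can be illustrated with a short sketch, written in Python rather than P4 syntax; the `my_tunnel`/`ipv4_lite` header layout is invented for the example. The point is that the parser is driven by a user-supplied header description instead of hard-wired protocols.

```python
# Illustrative sketch (not P4): header formats are data, not code, so new
# protocols can be added without changing the parsing machinery.

HEADERS = {
    "my_tunnel": [("tag", 2), ("next_hdr", 1)],   # (field name, width in bytes)
    "ipv4_lite": [("src", 4), ("dst", 4)],
}

def parse(data, header_names):
    """Walk the byte string, extracting each declared header in order."""
    out, offset = {}, 0
    for name in header_names:
        fields = {}
        for field, width in HEADERS[name]:
            fields[field] = data[offset:offset + width]
            offset += width
        out[name] = fields
    return out

pkt = bytes(range(11))
parsed = parse(pkt, ["my_tunnel", "ipv4_lite"])
print(parsed["my_tunnel"]["tag"])  # b'\x00\x01'
```

A real P4 program pairs such a parser with match-action tables; a compiler then maps both onto the target hardware.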
The paper presents a case study that uses P4 to express the forwarding behavior
of a datacenter switch’s data plane. In the process, it identifies issues and
strengths of P4. Some of the lessons learned had, and are having, an impact on
the language evolution. For instance, and most notably, the
language-architecture separation that has been implemented in newer versions of
P4.
The paper introduces NetASM, an intermediate representation for
programmable data planes. NetASM is a device-independent language that is
expressive enough to act as the target language for compilers for
high-level languages, yet low-level enough to be efficiently assembled on
various device architectures. It enables conventional compiler optimization
techniques to significantly improve the performance and resource
utilization of custom packet-processing pipelines on a variety of targets.
The paper proposes an API for programming the generation of packets in
programmable switches, instead of forging network packets on the controller
side. The InSP API allows a programmer to define in-switch packet
generation operations, which include the specification of triggering
conditions, packet contents, and forwarding actions.
PVPP is a data-plane program compiler from P4, a data plane DSL based on
match-action tables, to the fd.io Vector Packet Processor (VPP) software
switch, based on the packet processing node graph model. PVPP compiles a
data plane program written in P4 to VPP’s internal graph
representation.
This paper is motivated by the observation that OpenFlow, in its original
design, imposes great overheads, involving the switch’s control-plane too
often. In order to meet the needs of high-performance networks, the
authors propose and evaluate DevoFlow, which provides less fine-grained
visibility, at significantly lower costs. In a case study, the authors show
that DevoFlow can load-balance data center traffic as well as fine-grained
solutions, but with much fewer flow table entries and using much fewer
control messages.
The paper introduces Pyretic, a novel programming language for writing
composable SDN applications using a set of high level topology and
packet-processing abstractions. Pyretic improves on Frenetic (an earlier
incarnation of a similar language) by adding support for sequential
composition, the use of topology abstractions to define what each module
can see and do with the network, and an abstract packet model that
introduces virtual fields into packets. Modular applications are written
using the static policy language NetCore, which provides primitive actions,
matching predicates, and query policies.
The paper presents Maple, a system that simplifies SDN programming by (1)
allowing a programmer to use a standard programming language to design an
arbitrary, centralized algorithm, to decide the behavior of an entire
network, and (2) providing an abstraction that the programmer-defined,
centralized policy runs on every packet entering a network, and hence is
oblivious to the challenge of translating a high-level policy into sets of
rules on distributed individual switches. To implement algorithmic policies
efficiently, Maple includes not only a highly-efficient multicore
scheduler, but more importantly a novel tracing runtime optimizer that can
automatically record reusable policy decisions, offload work to switches
when possible, and keep switch flow tables up-to-date by dynamically
tracing the dependency of policy decisions on packet contents as well as
the environment.
The paper contributes just what is stated in the title, a new network
programming language called NetKAT that is based on a solid
mathematical foundation. Formerly, the design of network dataplane
programming languages was largely ad hoc, driven more by the needs of
applications and the capabilities of network hardware than by
foundational principles. NetKAT solves this problem by (1) proposing
primitives for filtering, modifying, and transmitting packets,
operators for combining programs in parallel and in sequence, and a
Kleene star operator for iteration, and (2) presenting a series of
proofs that the language is sound and complete.
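The flavor of these primitives can be sketched in Python. This is a simplification of the NetKAT semantics (a policy here maps one packet to a set of packets), and the field names are arbitrary examples.

```python
# Sketch of NetKAT-style combinators: filters, modifications, and
# sequential/parallel composition of policies.

def filt(field, value):            # filter: pass the packet or drop it
    return lambda pkt: [pkt] if pkt.get(field) == value else []

def mod(field, value):             # modification: rewrite one field
    return lambda pkt: [dict(pkt, **{field: value})]

def seq(p, q):                     # sequential composition p ; q
    return lambda pkt: [out for mid in p(pkt) for out in q(mid)]

def par(p, q):                     # parallel composition p + q (union)
    return lambda pkt: p(pkt) + q(pkt)

# Forward port-1 traffic to port 2, and also mirror it to port 9.
policy = seq(filt("port", 1), par(mod("port", 2), mod("port", 9)))
print(policy({"port": 1}))  # [{'port': 2}, {'port': 9}]
print(policy({"port": 3}))  # []
```

NetKAT's contribution is that these operators come with a sound and complete equational theory, so such policies can be reasoned about algebraically, not just executed.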
The paper introduces PFQ-Lang, an extensible functional language to
process, analyze and forward packets, which allows easy development by
leveraging functional composition and allows to exploit multi-queue NICs
and multi-core architectures.
Schiff et al. show that standard OpenFlow can be exploited to implement
powerful functionality in the data plane, e.g., to reduce the number of
interactions with the control plane or to render the network more robust.
Example applications of such a SmartSouth include topology snapshot,
anycast, blackhole detection and critical node detection.
Seminal paper exploring the design of a compiler for programmable switching
chips, in particular how to map logical lookup tables to physical tables
while meeting data and control dependencies in the program. Both an Integer
Linear Programming (ILP) formulation and a greedy approach are presented to generate
solutions optimized for latency, pipeline occupancy, or power
consumption. The authors show benchmarks from real production networks to
two different programmable switch architectures: RMT and Intel’s FlexPipe.
The paper presents the Virtual Filtering Platform (VFP), a programmable
virtual switch that powers Microsoft Azure, a large public cloud. VFP
includes support for multiple independent network controllers, policy based
on connections rather than only on packets, efficient caching and
classification algorithms for performance, and efficient offload of flow
policy to programmable NICs. The paper presents the design of VFP and its
API, its flow language and compiler used for flow processing, performance
results, and experiences deploying and using VFP in Azure over several
years.
P4FPGA is a tool for developing and evaluating data plane applications. It
is both an open-source compiler and runtime; the compiler in turn extends
the P4.org reference compiler with a custom backend that generates FPGA
code. By combining high-level programming abstractions offered by P4 with a
flexible and powerful hardware target, P4FPGA may allow developers to
rapidly prototype and deploy new data plane applications.
The paper proposes HARMLESS, a new SDN switch design that seamlessly adds
SDN capability to legacy network gear, by emulating the OpenFlow switch OS
in a separate software switch component. This way, HARMLESS enables a quick
and easy leap into SDN, combining the rapid innovation and upgrade cycles
of software switches with the port density and cost-efficiency of
hardware-based appliances into a fully dataplane-transparent and
vendor-neutral solution. HARMLESS incurs an order of magnitude smaller
initial expenditure for an SDN deployment than existing turnkey vendor SDN
solutions, while yielding matching, or even better, data plane performance
for smaller enterprises.
The paper presents netdiff, an algorithm to check the equivalence of
two network dataplanes. Such an algorithm can be useful to verify and
compare the output of different dataplane compilers, to find new bugs
in existing network dataplanes, or to check the equivalence of FIB
updates in a production network. The evaluation shows that equivalence
is an easy way to find bugs, scales well to relatively large programs
and uncovers subtle issues otherwise difficult to find.
Abstractions for Embedded State
While OpenFlow match/action table abstractions are stateless, there are
many efforts toward devising a stateful data plane programming abstraction,
e.g., based on finite state machines, for supporting more dynamic
applications. We discuss such approaches as well as first workload
characterizations of stateful networking applications. We also review
literature on the challenge of consistent state migration and elastic
scaling, and discuss security implications.
This paper presents the first workload characterization of stateful
networking applications. The analysis emphasizes the study of data cache
behavior, but discusses branch prediction, instruction distribution,
etc. Another important contribution is the study of the state categories of
different networking applications.
The paper tackles the challenge to devise a stateful data plane programming
abstraction (versus the stateless OpenFlow match/action table abstraction)
which still entails high performance and remains consistent with vendors’
preference for closed platforms. The authors posit that a promising answer
revolves around the usage of extended finite state machines, as an
extension (super-set) of the OpenFlow match/action abstraction, turn the
proposed abstraction into an actual table-based API, and show how it can
be supported by (mostly) reusing core primitives already implemented in
OpenFlow devices.
The paper proposes FAST (Flow-level State Transitions) as a new switch
primitive for software-defined networks. With FAST, the controller simply
preinstalls a state machine and switches can automatically record flow
state transitions by matching incoming packets to installed filters. FAST
can support a variety of dynamic applications, and can be readily
implemented with commodity switch components and software switches.
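The FAST model can be sketched as follows. The two-step "port-knocking" state machine is an invented example for illustration, not one from the paper.

```python
# Sketch of the FAST idea: the controller preinstalls a state machine once;
# the switch then tracks per-flow state transitions on its own as packets
# arrive, with no further controller involvement.

TRANSITIONS = {
    # (current_state, observed_dst_port) -> next_state
    ("closed", 1111): "knocked",
    ("knocked", 2222): "open",
}

flow_state = {}  # per-flow state table kept on the switch

def on_packet(flow_id, dst_port):
    state = flow_state.get(flow_id, "closed")
    state = TRANSITIONS.get((state, dst_port), state)
    flow_state[flow_id] = state
    return "forward" if state == "open" else "drop"

print(on_packet("10.0.0.1", 1111))  # drop (first knock recorded)
print(on_packet("10.0.0.1", 2222))  # forward (sequence complete)
print(on_packet("10.0.0.2", 2222))  # drop (wrong order)
```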
SNAP offers a simpler “centralized” stateful programming model on top of
the simple match-action paradigm offered by OpenFlow. SNAP programs are
developed on a one-big-switch abstraction and may contain reads and writes
to global, persistent arrays, allowing programmers to implement a broad
range of stateful applications. The SNAP compiler then distributes, places,
and optimizes access to these stateful arrays, discovering read/write
dependencies and translating one-big-switch programs into an efficient
internal representation based on a novel variant of binary decision
diagrams.
Kinetic provides a formal way to program the network control plane using
finite state machines. The use of a formal language allows the system to
verify the correctness of the control program according to user-specified
temporal properties. The paper also reports on a user survey among students
of Coursera’s SDN course, who found the finite state machine abstraction
of Kinetic to be intuitive and easier to verify compared to other high-level
languages, such as Pyretic.
This paper shows how to program data-plane algorithms in a high-level
language and compile those programs into low-level microcode that can run
on programmable line-rate switching chips. The key challenge is that many
data-plane algorithms create and modify algorithmic state. To achieve
line-rate programmability for stateful algorithms, the paper introduces the
notion of a packet transaction: a sequential packet-processing code block
that is atomic and isolated from other such code blocks. The idea is
developed in Domino, a C-like imperative language to express data-plane
algorithms, and many examples are shown that can be run at line rate with
modest estimated chip-area overhead.
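A rough sketch of a packet transaction, using the classic flowlet-switching example. Domino programs are written in a C-like syntax and compiled to hardware; the Python below, with arbitrary constants, only conveys the idea of a stateful block executed atomically once per packet.

```python
# Sketch of a Domino-style packet transaction: a sequential code block that
# reads and updates switch state atomically for each packet. Here, flowlet
# switching: pick a fresh output port whenever a flow pauses long enough.

import random

FLOWLET_GAP = 5          # idle time that starts a new flowlet (arbitrary units)
last_time = {}           # per-flow state: timestamp of the previous packet
chosen_port = {}         # per-flow state: current output port

def transaction(flow, now, ports=(1, 2, 3)):
    """Runs atomically per packet: re-pick the port only at flowlet starts."""
    if now - last_time.get(flow, -10**9) > FLOWLET_GAP:
        chosen_port[flow] = random.choice(ports)
    last_time[flow] = now
    return chosen_port[flow]

p1 = transaction("f1", now=100)
p2 = transaction("f1", now=101)   # same flowlet, so the same port as p1
print(p1 == p2)                   # True
```

The hardware challenge Domino addresses is exactly the read-modify-write in this block: it must complete within one pipeline stage's time budget to sustain line rate.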
This paper aims at contributing to the debate on how to bring
programmability of stateful packet processing tasks inside the network
switches, while retaining platform independence. The proposed approach,
named “Open Packet Processor” (OPP), shows the viability of eXtended Finite
State Machines (XFSM) as low-level data plane programming
abstraction. Platform independence is accomplished by decoupling the
implementation of hardware primitives from their usage by an application
formally described via an abstract XFSM.
Migrating/cloning internal state in elastically scalable Network Functions
Virtualization (NFV) requires modifications to middlebox code to identify
needed state. The paper presents a framework-independent system,
StateAlyzr, that embodies novel algorithms adapted from program analysis to
provably and automatically identify all state that must be migrated/cloned
to ensure consistent middlebox output in the face of
redistribution. StateAlyzr reduces man-hours required for code modification
by nearly 20x.
The paper presents Swing State, a general state-management framework and
runtime system supporting consistent state migration in stateful data
planes. The key insight is to perform state migration entirely within the
data plane by piggybacking state updates on live traffic. To minimize the
overhead, Swing State only migrates the states that cannot be safely
reconstructed at the destination switch. A prototype of Swing State for P4
is also described.
The paper provides the reader with a background on stateful SDN data plane
proposals, focusing on the security implications that data plane
programmability brings about, identifies potential attack scenarios, and
highlights possible vulnerabilities specific to stateful in-switch
processing, including denial of service and saturation attacks.
The paper presents Stateless Network Functions, a new architecture for
network functions virtualization, where the existing design of network
functions is decomposed into a stateless processing component along with a
data-store layer. The StatelessNF processing instances are architected
around efficient pipelines utilizing DPDK for high performance network I/O,
packaged as Docker containers for easy deployment, and a data store
interface optimized based on the expected request patterns to efficiently
access a RAMCloud-based data store. A network-wide orchestrator monitors
the instances for load and failure, manages instances to scale and provide
resilience, and leverages an OpenFlow-based network to direct traffic to
instances.
Elastic scaling is a central promise of NFV but has been hard to realize in
practice, because most Network Functions (NFs) are stateful and this state
needs to be shared across NF instances. The paper presents S6, building on
the insight that a distributed shared state abstraction is well-suited to
the NFV context. State is organized as a distributed shared object (DSO)
space, extended with techniques designed to meet the need for elasticity
and high-performance in NFV workloads.
FlowBlaze is an open abstraction for building stateful packet
processing functions in hardware. The abstraction is based on Extended
Finite State Machines and introduces the explicit definition of flow
state, allowing FlowBlaze to leverage flow-level parallelism. The
paper presents an implementation of FlowBlaze on a NetFPGA SmartNIC
that achieves latency on the order of a few microseconds, consumes
relatively little power, can hold per-flow state for hundreds of
thousands of flows, and yields speeds of 40 Gbps.
Programmable Parsing and Scheduling
We start by reviewing design principles for packet parsers.
We then revisit the concept of a universal scheduler that
would handle all queuing strategies that may arise in a programmable
switch, and ask whether such a scheduling algorithm can really exist.
We conclude with a review of fair queuing on reconfigurable switches.
The paper presents an interesting view on parser design and the trade-offs
between different designs, asking whether it is better to design one fast
parser or several slow parsers, what are the costs of making the parser
reconfigurable in the field, and what design decisions most impact power
and area. The paper describes trade-offs in parser design, identifies
design principles for switch and router architects, and describes a parser
generator that outputs synthesizable Verilog that is available for
download.
The authors argue that, instead of going with a universal scheduler that
would handle all queuing strategies that may arise in a programmable
switch, Software-Defined Networking must be extended to control the
fast-path scheduling and queuing behavior of a switch. To this end, they
propose adding a small FPGA to switches, and synthesize, place, and route
hardware implementations for CoDel and RED.
The paper addresses a seemingly simple question: Is there a universal packet
scheduling algorithm? It turns out that in general the answer is “no”;
however, the authors manage to show that the classical Least Slack Time
First (LSTF) scheduling algorithm comes closest to being universal and it
can closely replay a wide range of scheduling algorithms. LSTF is evaluated
as to whether in practice it can meet various network-wide objectives; the
authors find that LSTF performs comparably to the state of the art for each
performance metric.
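In spirit, LSTF reduces to a priority queue keyed on slack. A minimal sketch, with arbitrary slack values:

```python
# Sketch of Least Slack Time First (LSTF): each packet carries a slack value
# (how long it can still afford to wait); the switch always transmits the
# packet with the smallest slack.

import heapq

queue = []

def enqueue(slack, pkt):
    heapq.heappush(queue, (slack, pkt))

def dequeue():
    return heapq.heappop(queue)[1]

enqueue(30, "bulk-transfer")
enqueue(5, "voip")
enqueue(12, "web")
print(dequeue())  # voip  (smallest slack goes first)
print(dequeue())  # web
```

The paper's "universality" claim is about replay: by choosing slack values appropriately, this one mechanism can approximate the schedules produced by many other algorithms.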
Similarly to the “Universal Packet Scheduling” paper, this paper presents
another design for a programmable packet scheduler, which allows scheduling
algorithms, potentially algorithms that are unknown today, to be programmed
into a switch without requiring hardware redesign. The design uses the
property that scheduling algorithms make two decisions, in what order to
schedule packets and when to schedule them, and exploits that in many
scheduling algorithms definitive decisions on these two questions can be
made when packets are enqueued. The resultant design uses a single
abstraction: the push-in first-out queue (PIFO), a priority queue that
maintains the scheduling order or time.
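A minimal software sketch of the PIFO abstraction (hardware PIFOs are implemented very differently; the ranks below are arbitrary example priorities):

```python
# Sketch of a PIFO (push-in first-out) queue: elements may be pushed in at
# any position, determined by a programmable rank, but are only ever
# dequeued from the head.

import bisect

class PIFO:
    def __init__(self):
        self._q = []  # kept sorted by rank

    def push(self, rank, pkt):
        bisect.insort(self._q, (rank, pkt))   # "push-in": insert anywhere

    def pop(self):
        return self._q.pop(0)[1]              # "first-out": head only

pifo = PIFO()
pifo.push(7, "low-priority")
pifo.push(2, "high-priority")
pifo.push(5, "medium")
print(pifo.pop())  # high-priority
```

Programmability comes from the rank computation, which is user-defined: strict priority, weighted fairness, and earliest-deadline schedules all become different ways of assigning ranks to the same queue.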
The paper discusses how to leverage configurable per-packet processing and
the ability to maintain mutable state inside switches to achieve fair
bandwidth allocation across all traversing flows. The problem is that
implementing fair queuing mechanisms in high-speed switches is expensive,
since they require complex flow classification, buffer allocation, and
scheduling on a per-packet basis. The proposed dequeuing scheduler, called
Rotating Strict Priority scheduler, simulates an ideal round-robin scheme
where each active flow transmits a single bit of data in every round, which
allows the switch to transmit packets from multiple queues in approximately
sorted order.
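The ideal bit-by-bit round-robin model being approximated can be sketched by tagging each packet with a finish round; the packet sizes below are arbitrary.

```python
# Sketch of the ideal round-robin ("bit-by-bit") model: a packet finishes
# in the round equal to its flow's previous finish round plus the packet
# length, and packets are transmitted in order of finish round.

finish = {}  # per-flow last finish round

def finish_round(flow, length):
    finish[flow] = finish.get(flow, 0) + length
    return finish[flow]

arrivals = [("A", 300), ("B", 100), ("A", 100), ("B", 100)]
tagged = [(finish_round(f, n), f, n) for f, n in arrivals]
order = [f for _, f, _ in sorted(tagged)]
print(order)  # ['B', 'B', 'A', 'A'] -- the small flow is not starved
```

The hard part in a real switch is doing this without per-flow queues and per-packet sorting, which is what the rotating-priority approximation buys.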
The trend toward increasing link speeds and slowdown in the scaling of
CPU speeds, leads to a situation where packet scheduling in software
results in lower precision and higher CPU utilization. While by
offloading packet scheduling to the hardware (e.g., the NIC), this
drawback can be overcome, one still would like to retain the
flexibility benefits of software packet schedulers: packet scheduling
in hardware should hence be programmable. Shrivastav proposes a
generalization of the Push-In-First-Out (PIFO) primitive used by
state-of-the-art hardware packet schedulers: Push-In-Extract-Out
(PIEO) maintains an ordered list of elements, but allows dequeue from
arbitrary positions in the list by supporting a programmable
predicate-based filtering at dequeue. PIEO supports most scheduling
(work-conserving and non-work conserving) algorithms and can be
implemented scalably in hardware.
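A software sketch of the PIEO abstraction; the eligibility predicate used below is an invented example, and the hardware design achieves the same operation scalably rather than by linear scan.

```python
# Sketch of PIEO (push-in extract-out): like a PIFO, the queue stays ordered
# by rank, but dequeue may take the first element satisfying a programmable
# predicate instead of only the head. This enables non-work-conserving
# schedules, e.g., skipping packets that are not yet eligible to depart.

import bisect

class PIEO:
    def __init__(self):
        self._q = []  # kept sorted by rank

    def push(self, rank, item):
        bisect.insort(self._q, (rank, item))

    def extract(self, predicate):
        for i, (_rank, item) in enumerate(self._q):  # scan in rank order
            if predicate(item):
                self._q.pop(i)
                return item
        return None

pieo = PIEO()
pieo.push(1, {"pkt": "x", "eligible_at": 50})
pieo.push(2, {"pkt": "y", "eligible_at": 10})
now = 20
out = pieo.extract(lambda p: p["eligible_at"] <= now)
print(out["pkt"])  # y -- the head is skipped because it is not yet eligible
```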
Architecture
We divide the discussion of switch architectures into software and hardware
switch architectures.
Software Switch Architectures
We first discuss the viability of software switching and then review
dataflow graph abstractions, also discussing, e.g., Click, ClickOS, and
software NICs. We proceed by revisiting literature on match-action
abstractions, discussing OVS and PISCES. We conclude with a review on
packet I/O libraries.
The paper is the first to study the performance limitations when building
both software routers and software virtual routers on commodity CPU
platforms. The authors observe that the fundamental performance bottleneck
is the memory system, and that through careful mapping of tasks to CPU
cores one can achieve very high forwarding rates. The authors also identify
principles for the construction of high-performance software router systems
on commodity hardware.
The paper introduces the FlowStream switch architecture, which enables flow
processing and forwarding at unprecedented flexibility and low cost by
consolidating middlebox functionality, such as load balancing, packet
inspection and intrusion detection, and commodity switch technologies,
offering the possibility to control the switching of flows in a
fine-grained manner, into a single integrated package deployed on commodity
hardware.
RouteBricks is concerned with enabling high-speed parallel processing in
software routers, using a software router architecture that parallelizes
router functionality both across multiple servers and across multiple cores
within a single server. RouteBricks adopts a fully programmable Click/Linux
environment and is built entirely from off-the-shelf, general-purpose
server hardware.
Introduces Click, a software architecture for building flexible and
configurable routers from packet processing modules implementing simple
router functions like packet classification, queuing, scheduling, organized
into a directed graph with packet processing modules at the vertices;
packets flow along the edges of the graph.
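The element-graph idea can be sketched in Python; the element names and the push-only chain are simplified assumptions (real Click elements are C++ classes with both push and pull ports).

```python
# Sketch of the Click model: small packet-processing elements wired into a
# directed graph; a packet entering an element is processed and handed to
# the next element along the edge.

class Element:
    def __init__(self, fn):
        self.fn, self.next = fn, None

    def push(self, pkt):
        out = self.fn(pkt)           # process (may drop by returning None)
        if out is not None and self.next:
            self.next.push(out)      # forward along the graph edge

log = []
classifier = Element(lambda p: p if p["proto"] == "ip" else None)
decrement_ttl = Element(lambda p: dict(p, ttl=p["ttl"] - 1))
sink = Element(lambda p: log.append(p))

classifier.next = decrement_ttl      # wire a simple three-element chain
decrement_ttl.next = sink

classifier.push({"proto": "ip", "ttl": 64})
classifier.push({"proto": "arp"})    # dropped by the classifier
print(len(log), log[0]["ttl"])  # 1 63
```

Much of the later literature in this section (Snap, ClickOS, BESS, VPP) keeps exactly this dataflow-graph abstraction and varies the execution substrate underneath it.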
The paper introduces Snap, a framework for packet processing that exploits
the parallelism available on modern GPUs, while remaining flexible, with
packet processing tasks implemented as simple modular elements that are
composed to build fully functional routers and switches. Snap is based on
the Click modular router, which it extends by adding new architectural
features that support batched packet processing, memory structures
optimized for offloading to coprocessors, and asynchronous scheduling with
in-order completion.
The paper introduces ClickOS, a high-performance, virtualized software
middlebox platform. ClickOS virtual machines are small (5MB), boot quickly
(about 30 milliseconds), add little delay (45 microseconds), and over one
hundred of them can be concurrently run while saturating a 10Gb pipe on a
commodity server. A wide range of middleboxes is implemented, including a
firewall, a carrier-grade NAT and a load balancer, and the evaluations
suggest that ClickOS can handle packets in the millions per second.
BESS is the Berkeley Extensible Software Switch developed at the University
of California, Berkeley and at Nefeli Networks. BESS is heavily inspired by
the Click modular router, representing a packet processing pipeline as a
dataflow (multi)graph that consists of modules, each of which implements a
NIC feature, and ports that act as sources and sinks for this
pipeline. Packets received at a port flow through the pipeline to another
port, and each module in the pipeline performs module-specific operations
on packets.
The authors make the observation that it is difficult to simultaneously
provide high packet rates, high throughput, low CPU usage, high port
density, and a flexible data plane in the same architecture. A new
architecture called mSwitch is proposed and four distinct modules are
implemented on top: a learning bridge, an accelerated Open vSwitch module,
a protocol demultiplexer for userspace protocol stacks, and a filtering
module that can direct packets to virtualized middleboxes.
NetBricks is an NFV framework adopting the “graph-based” pipeline
abstraction and embracing type checking and safe runtimes to provide
isolation efficiently in software, providing the same memory isolation as
containers and VMs without incurring the same performance penalties. The
new isolation technique is called zero-copy software isolation.
The paper describes the design and implementation of Open vSwitch, a
multi-layer, open source virtual switch. The design document details the
advanced flow classification and caching techniques that Open vSwitch uses
to optimize its operations and conserve hypervisor resources.
PISCES is a software switch derived from Open vSwitch (OVS), a hypervisor
switch whose behavior is customized using P4. PISCES is not hard-wired to
specific protocols; this independence makes it easy to add new
features. The paper also shows how the compiler can analyze the high-level
P4 specification to optimize forwarding performance; the evaluations show
that PISCES performs comparably to OVS but PISCES programs are about 40
times shorter than equivalent OVS source code.
The paper presents SoftFlow, an extension to Open vSwitch that seamlessly
integrates middlebox functionality while maintaining the familiar OpenFlow
forwarding model and performing significantly better than alternative
techniques for middlebox integration.
The authors argue that, instead of enforcing the same universal fast-path
semantics to all OpenFlow applications and optimizing for the common case,
as it is done in Open vSwitch, a programmable software switch should rather
automatically specialize its dataplane piecemeal with respect to the
configured workload. They introduce ESwitch, a switch architecture that
uses on-the-fly template-based code generation to compile any OpenFlow
pipeline into efficient machine code, which can then be readily used as the
switch fast-path, delivering superior packet processing speed, improved
latency and CPU scalability, and predictable performance.
The paper makes the observation that data-plane compilation is
fundamentally static, i.e., the input of the compiler is a fixed
description of the forwarding plane semantics and the output is code that
can accommodate any packet processing behavior set by the controller at
runtime. The authors advocate a dynamic approach to data plane compilation
instead, where not just the semantics but the intended behavior is
also input to the compiler, opening the door to a handful of runtime
optimization opportunities that can be leveraged to improve the performance
of custom-compiled datapaths beyond what is possible in a static setting.
This paper presents the design and experience with Andromeda, the network
virtualization stack underlying the Google Cloud Platform. Andromeda is
designed around the Hoverboard programming model, which uses gateways for
the long tail of low bandwidth flows enabling the control plane to program
network connectivity for tens of thousands of VMs in seconds, and applies
per-flow processing to elephant flows only. The paper cites statistics
indicating that above 80% of VM pairs never talk to each other in a
deployment and only 1–2% generate sufficient traffic to warrant per-flow
processing. The architecture also uses a high-performance OS bypass
software packet processing path for CPU-intensive per packet operations,
implemented on coprocessor threads.
Netmap is a framework that enables commodity operating systems to
handle millions of packets per second, without requiring custom
hardware or changes to applications. The idea is to eliminate
inefficiencies in OSes’ standard packet processing datapaths: per-packet
dynamic memory allocations are removed by preallocating resources, system
call overheads are amortized over large I/O batches, and memory copies are
eliminated by sharing buffers and metadata between kernel and userspace,
while still protecting access to device registers and other kernel memory
areas.
DPDK is a set of libraries and drivers for fast packet processing,
including a multicore framework, huge page memory, ring buffers, poll-mode
drivers for networking I/O, crypto and eventdev, etc. DPDK can be used to
receive and send packets within the minimum number of CPU cycles (usually
less than 80 cycles), develop fast packet capture algorithms (like
tcpdump), and run third-party fast path stacks.
FD.io (Fast data – Input/Output) is a collection of several projects and
libraries to support flexible, programmable and composable services on a
generic hardware platform, providing high-throughput, low-latency, and
resource-efficient I/O services suitable for many architectures (x86, ARM,
and PowerPC) and deployment environments (bare metal, VM, container).
Hardware Switch Architectures
We start off by discussing a first incarnation of a programmable switch,
PLUG, then discuss the SwitchBlade platform and the seminal paper on RMT
(Reconfigurable Match Tables). We then review existing performance
evaluation studies and literature dealing with performance monitoring and
the issue of potential inconsistencies in reconfigurable networks. We
conclude with a paper on Azure SmartNICs based on FPGAs.
The first incarnation of the “programmable switch”. PLUG (Pipelined Lookup
Grid) is a flexible lookup module that can achieve generality without
losing efficiency, because various custom lookup modules have the same
fundamental features that PLUG retains: area dominated by memories, simple
processing, and strict access patterns defined by the data structure. The
authors implemented IPv4, Ethernet, Ethane, and SEATTLE in a dataflow-based
programming model for PLUG and mapped them to the PLUG hardware, showing that
throughput, area, power, and latency of PLUGs are close to those of
specialized lookup modules.
SwitchBlade is a platform for rapidly deploying custom protocols on
programmable hardware. SwitchBlade uses a pipeline-based design that allows
individual hardware modules to be enabled or disabled on the fly,
integrates common packet-processing functions as hardware modules enabling
different protocols to use these functions without having to resynthesize
hardware, and uses a customizable forwarding engine that supports both
longest-prefix matching in the packet header and exact matching on a hash
value. SwitchBlade also allows multiple custom data planes to operate in
parallel on the same physical hardware, while providing complete isolation
for protocols running in parallel.
This seminal paper presents RMT to overcome two limitations in current
switching chips and OpenFlow: (1) conventional hardware switches are rigid,
allowing “Match-Action” processing on only a fixed set of fields, and (2)
the OpenFlow specification only defines a limited repertoire of packet
processing actions. The RMT (Reconfigurable Match Tables) model is a
RISC-inspired pipelined architecture for switching chips, including an
essential minimal set of action primitives to specify how headers are
processed in hardware. RMT allows the forwarding plane to be changed in the
field without modifying hardware.
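To make the match-action abstraction concrete, here is a toy Python model of a
reconfigurable pipeline in the spirit of RMT (all field names, rules, and
actions below are hypothetical illustrations, not from the paper): each stage
matches on an arbitrary, reconfigurable set of header fields and applies a
primitive action.

```python
# Toy model of a reconfigurable match-action pipeline: stages match on
# arbitrary header fields and apply primitive actions (illustrative only).

class Stage:
    def __init__(self, match_fields, rules):
        self.match_fields = match_fields   # e.g. ("eth_dst",)
        self.rules = rules                 # key tuple -> action function

    def process(self, headers):
        key = tuple(headers.get(f) for f in self.match_fields)
        action = self.rules.get(key)
        if action:
            action(headers)                # e.g. set egress port, decrement TTL
        return headers

def run_pipeline(stages, headers):
    for stage in stages:
        headers = stage.process(headers)
    return headers

# Example: an L2 stage sets the egress port, an IP stage decrements the TTL.
l2 = Stage(("eth_dst",), {("aa:bb",): lambda h: h.update(port=3)})
ipv4 = Stage(("port",), {(3,): lambda h: h.update(ttl=h["ttl"] - 1)})
out = run_pipeline([l2, ipv4], {"eth_dst": "aa:bb", "ttl": 64})
```

The point of RMT is that both the matched fields and the action programs of
each stage are reconfigurable in the field, rather than fixed in silicon.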
The paper presents a tool chain that maps a domain-specific declarative
packet-processing language with object-oriented semantics, called PX, to
high-performance reconfigurable-computing architectures based on
field-programmable gate array (FPGA) technology, including components for
packet parsing, editing, and table lookups.
The definitive source on OpenFlow switches and the differences between them.
The authors measure, report, and explain the performance characteristics of
the control- and data-planes in three hardware OpenFlow switches. The
results highlight differences between the OpenFlow specification and its
implementations that, if ignored, pose a serious threat to network security
and correctness.
The paper is motivated by the challenges involved in consistent updates of
distributed network configurations, given the complexity of modern switch
datapaths and the opaque configuration mechanisms they expose. The authors
demonstrate that even simple rule updates result in inconsistent packet
switching in multi-table datapaths. The main contribution of the paper is
a hardware design that supports a transactional configuration mechanism,
providing strong switch-level atomicity: all packets traversing the
datapath will encounter either the old configuration or the new one, and
never an inconsistent mix of the two. The approach is prototyped using the
NetFPGA hardware platform.
This paper focuses on accelerating NFs with FPGAs. FPGAs, however, are
predominantly programmed using low-level hardware description languages
(HDLs), which are hard to code and difficult to debug; more importantly,
HDLs are almost inaccessible to most software programmers. This paper
presents ClickNP, an FPGA-accelerated platform that is highly flexible, as
it is completely programmable using high-level C-like languages and exposes
a modular programming abstraction resembling the Click Modular Router,
while also delivering high performance.
A follow-up to the RMT paper. dRMT (disaggregated Reconfigurable
Match-Action Table) is a new architecture for programmable switches, which
overcomes two important restrictions of RMT: (1) table memory is local to
an RMT pipeline stage, implying that memory not used by one stage cannot be
reclaimed by another, and (2) RMT is hardwired to always sequentially
execute matches followed by actions as packets traverse pipeline
stages. dRMT resolves both issues by disaggregating the memory and compute
resources of a programmable switch, moving table memories out of pipeline
stages and into a centralized pool that is accessible through a
crossbar. In addition, dRMT replaces RMT’s pipeline stages with a cluster
of processors that can execute match and action operations in any order.
The authors ask what switch hardware primitives are required to support an
expressive language of network performance questions. They present a
performance query language, Marple, modeled on familiar functional
constructs, backed by a new programmable key-value store primitive on
switch hardware that performs flexible aggregations at line rate and scales
to millions of keys. Marple can express switch queries that could
previously run only on end hosts, while Marple queries only occupy a modest
fraction of a switch’s hardware resources.
Modern public cloud architectures rely on complex networking policies and
running the necessary network stacks on CPU cores takes away processing
power from VMs, increasing the cost of running cloud services, and adding
latency and variability to network performance. The paper presents the
design of AccelNet, the Azure Accelerated Networking scheme for offloading
host networking to hardware, using custom Azure SmartNICs based on FPGAs,
including the hardware/software co-design model, performance results on key
workloads, and experiences and lessons learned from developing and
deploying AccelNet.
Hybrid Hardware/Software Architectures
It is often believed that the performance of programmable network
processors is lower than that of hard-coded chips. There exists interesting
literature questioning this assumption and exploring these overheads
empirically. We also discuss opportunities coming from Graphics Processing
Units (GPUs) acceleration, e.g., for packet processing, as well as from
hybrid hardware/software architectures in general.
PacketShader is a high-performance software router framework for general
packet processing with Graphics Processing Unit (GPU) acceleration,
exploiting the massively-parallel processing power of GPU to address the
CPU bottleneck in software routers, combined with a high-performance packet
I/O engine. The paper presents implementations for IPv4 and IPv6
forwarding, OpenFlow switching, and IPsec tunneling to demonstrate the
flexibility and performance advantage of PacketShader.
Industry insight holds that programmable network processors are of lower
performance than their hard-coded counterparts, such as Ethernet chips. The
paper argues that, in contrast to the common view, the overhead of
programmability is relatively low, and that the apparent difference between
programmable and hard-coded chips is not primarily due to programmability
itself, but because the internal balance of programmable network processors
is tuned to more complex use cases.
The paper opens the debate as to whether Graphics Processing Units (GPUs)
are useful for accelerating software-based routing and packet handling
applications. The authors argue that for many such applications the
benefits arise less from the GPU hardware itself than from the expression
of the problem in a language such as CUDA or OpenCL that facilitates memory
latency hiding and vectorization through massive concurrency. They then
demonstrate that, by applying a similar style of optimization to
algorithm implementations, a CPU-only implementation is more
resource-efficient than the version running on the GPU.
The paper presents an architecture to allow high-speed forwarding even with
large rule tables and fast updates, by combining the best of hardware and
software processing. The CacheFlow system caches the most popular rules in
the small TCAM and relies on software to handle the small amount of
cache-miss traffic. The authors observe that one cannot blindly apply
existing cache-replacement algorithms, because of dependencies between
rules with overlapping patterns. Rather, long dependency chains must be
broken to cache smaller groups of rules while preserving the semantics of
the policy.
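The dependency problem can be illustrated with a toy sketch (hypothetical
bit-string prefix rules, not the paper's algorithm): a rule may only be
cached together with every higher-priority rule that overlaps it, otherwise
cache hits would return the wrong action.

```python
# Toy illustration of rule dependencies in a prioritized prefix classifier.
# Caching a rule without its higher-priority overlapping rules changes
# which rule a packet matches.

def overlaps(p1, p2):
    """Two prefix patterns on one field overlap iff one is a prefix of the other."""
    return p1.startswith(p2) or p2.startswith(p1)

def dependencies(rules):
    """rules: list of (priority, pattern), highest priority first.
    Returns, for each pattern, the higher-priority patterns it depends on."""
    deps = {}
    for i, (_, pat) in enumerate(rules):
        deps[pat] = [p for _, p in rules[:i] if overlaps(p, pat)]
    return deps

rules = [(3, "1010"), (2, "10"), (1, "1")]   # toy bit-string prefixes
d = dependencies(rules)
# Caching "10" alone would steal traffic that should hit "1010".
```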
This is the answer to the question raised by the “Raising the Bar for Using
GPUs” paper. Kalia et al. argue that the key enabler for high
packet-processing performance is the GPU’s inherent ability to
automatically hide memory access latency rather than its parallel
computation power, and claim that a CPU can outperform or match the
performance of a GPU if its code is re-arranged to run concurrently with
memory access. This paper revisits these claims and finds, with eight popular
algorithms widely used in network applications, that (a) there are many
compute-bound algorithms that do benefit from the parallel computation
capacity of GPU while CPU-based optimizations fail to help, and (b) the
relative performance advantage of CPU over GPU in most applications is due
to data transfer bottleneck in PCIe communication of discrete GPU rather
than lack of capacity of GPU itself.
The paper’s aim is to reinvigorate the discussion on the design of
network interface cards (NICs) in the network and OS community. The
authors argue that current operating systems fail to efficiently
exploit and manage the considerable hardware resources provided by
modern network interface controllers. They then describe Dragonet, a
network stack that represents both the physical capabilities of the
network hardware and the current protocol state of the machine as
dataflow graphs.
The paper presents a reference design and implementation for a
programmable NIC that is in wide-scale use today as an accessible
development environment that both reuses existing codebases and
enables new designs. NetFPGA SUME is an FPGA-based PCI Express board
with I/O capabilities for 100 Gbps operation as a network interface
card, multiport switch, firewall, or test and measurement environment.
SoftNIC is a hybrid software-hardware architecture to bridge the gap
between limited hardware capabilities and ever-changing user
demands. SoftNIC provides a programmable platform that allows applications
to leverage NIC features implemented in software and hardware, without
sacrificing performance. This paper serves as the foundation for the BESS
software switch.
The authors argue that the primary reason for high memory and
processing overheads inherent to packet processing applications is the
inefficient use of the memory and I/O resources by commodity
NICs. They propose FlexNIC, a flexible network DMA interface that can
be used to reduce packet processing overheads; FlexNIC allows services
to install packet processing rules into the NIC, which then executes
simple operations on packets while exchanging them with host
memory. This moves some of the packet processing traditionally done in
software to the NIC, where it can be done flexibly and at high speed.
The paper presents Floem, a set of programming abstractions for
NIC-accelerated applications to ease developing server applications
that offload computation and data to a NIC accelerator. Floem
simplifies typical offloading issues, like data placement and caching,
partitioning of code and its parallelism, and communication strategies
between program components across devices, by providing language-level
abstractions for logical and physical queues, global per-packet state,
remote caching, and interfacing with external application code. The
paper also presents evaluations to demonstrate how these abstractions
help explore a space of NIC-offloading designs for real-world
applications, including a key-value store and a distributed real-time
data analytics system.
The paper presents iPipe, a generic actor-based offloading framework
to run distributed applications on commodity SmartNICs. The paper
details the iPipe design, built around a hybrid scheduler that
combines first-come-first-served with deficit round-robin policies to
schedule offloading tasks at microsecond-scale precision on SmartNICs,
and presents three custom-built use cases (a real-time data analytics
engine, a distributed transaction system, and a replicated key-value
store) to show that SmartNIC offloading brings about significant
performance benefits in terms of traffic control, computing
capability, onboard memory, and host communication.
The paper addresses the noisy neighbor problem in the context of
NICs. Using data from a major public cloud provider, the paper
systematically characterizes how performance isolation can break in
virtualization stacks and finds a fundamental tradeoff between
isolation and efficiency. The paper then proposes PicNIC, the
Predictable Virtualized NIC, a new NIC design that shares resources
efficiently in the common case while rapidly reacting to ensure
isolation in order to provide predictable performance for isolated
workloads. PicNIC builds on three constructs to quickly detect
isolation breakdown and to enforce isolation when necessary: CPU-fair
weighted fair queues at receivers, receiver-driven congestion control
for backpressure, and sender-side admission control with
shaping. Evaluations show that this combination ensures isolation for
VMs at sub-millisecond timescales with negligible overhead.
Algorithms & HW Realizations
Another highly relevant research area concerns the design of algorithms and
data plane realizations for this new technology.
This paper presents the packet classifier algorithm that underlies the Open
vSwitch fast-path. Packet classification requires matching each packet
against a database of flow rules and forwarding the packet according to the
highest priority rule. The paper introduces a generic packet classification
algorithm, called Tuple Space Search (TSS), based on the observation that
real databases typically use only a small number of distinct field lengths.
Thus, by mapping rules to tuples even a simple linear search of the tuple
space can provide significant speedup over naive linear search over the
filters. Each tuple is maintained as a hash table that can be searched in
one memory access.
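A toy sketch of the TSS idea, assuming two prefix-matched 32-bit fields
(illustrative only, not the Open vSwitch code): rules are grouped by their
mask lengths, each group is one hash table, and a lookup probes each table
once with the correspondingly masked packet fields.

```python
# Toy Tuple Space Search over two prefix-matched 32-bit fields.

def mask(value, length, width=32):
    return value >> (width - length) if length else 0

class TupleSpace:
    def __init__(self):
        self.tuples = {}   # (len_src, len_dst) -> {masked key: (prio, action)}

    def insert(self, len_src, len_dst, src, dst, prio, action):
        table = self.tuples.setdefault((len_src, len_dst), {})
        table[(mask(src, len_src), mask(dst, len_dst))] = (prio, action)

    def lookup(self, src, dst):
        best = None
        for (ls, ld), table in self.tuples.items():          # linear scan of tuples
            hit = table.get((mask(src, ls), mask(dst, ld)))  # one hash probe each
            if hit and (best is None or hit[0] > best[0]):
                best = hit
        return best[1] if best else None

ts = TupleSpace()
ts.insert(8, 0, 0x0A000000, 0, 10, "fwd")    # 10.0.0.0/8 -> fwd
ts.insert(16, 0, 0x0A010000, 0, 20, "drop")  # 10.1.0.0/16 -> drop
```

Since real rule databases use only a handful of distinct mask-length
combinations, the linear scan runs over a few tuples rather than all rules.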
Content-addressable memory (CAM) and Ternary CAM (TCAM) chips are the most
important component in programmable switch ASICs, performing packet
classification according to configurable header fields, matching rules, and
priority, in a single clock cycle using dedicated comparison circuitry. The
paper surveys recent developments in the design of large-capacity CAMs. The
main CAM-design challenge is to reduce power consumption associated with
the large amount of parallel active circuitry, without sacrificing speed or
memory density. The paper reviews CAM-design techniques at the circuit
level and at the architectural level.
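The matching semantics a TCAM implements can be modeled in a few lines of
Python (the loop below is purely illustrative; a real TCAM compares all
entries in parallel in a single clock cycle):

```python
# Software model of TCAM matching semantics: each entry is a (value, mask)
# pair where 0 bits in the mask mean "don't care"; the highest-priority
# (first) matching entry wins.

def tcam_lookup(entries, key):
    for value, mask, action in entries:      # entries sorted by priority
        if key & mask == value & mask:
            return action
    return None

entries = [
    (0b10100000, 0b11110000, "A"),   # matches 1010xxxx
    (0b10000000, 0b11000000, "B"),   # matches 10xxxxxx
]
tcam_lookup(entries, 0b10101111)     # "A" (both entries match; first wins)
```

The power cost the paper discusses comes precisely from activating the
comparison circuitry of every entry on every lookup.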
Programmable routers often need to support a separate forwarding
information base (FIB) for each virtual router provisioned by the
controller, which leads to memory scaling challenges. In this paper, a
small, shared FIB data structure is presented, along with a fast lookup
algorithm that capitalizes on the commonality of IP prefixes across
FIBs. Experiments with real packet traces and routing tables show that the
approach achieves much lower memory requirements and considerably faster
lookup times.
Ternary Content-Addressable Memories (TCAMs) have become the industrial
standard for high-throughput packet classification, and as such, for
programmable switch ASICs. However, one major drawback of TCAMs is their
high power consumption. In this paper, a practical and efficient solution
is proposed which introduces a smart pre-classifier to reduce power
consumption of TCAMs for multi-dimensional packet classification. The
classifier dimension is reduced by pre-classifying a packet on two header
fields, the source and destination IP addresses; the high-dimensional
classification then needs only a small portion of the TCAM for a given packet. The
pre-classifier is built such that a given packet matches at most one entry
in the pre-classifier, and each rule is stored only once in one of the TCAM
blocks, which avoids rule replication. The presented solution uses
commodity TCAMs.
Programmable switches usually need to implement one or more match-action
tables in the fast-path. This paper presents CuckooSwitch, a
software-based switch design built around a memory-efficient,
high-performance, and highly-concurrent hash table for compact and fast FIB
lookup. The authors show that CuckooSwitch can process the maximum packets
per second rate achievable across the underlying hardware’s PCI buses while
maintaining a forwarding table of one billion forwarding entries.
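The two-choice cuckoo hashing idea underlying the design can be sketched as
follows (toy hash functions and parameters, not the CuckooSwitch
implementation): every key has two candidate buckets, so a lookup probes at
most two locations, while an insert may relocate ("kick") existing keys to
their alternate bucket.

```python
# Toy two-choice cuckoo hash table: lookups probe at most two buckets;
# inserts evict occupants to their alternate bucket when both are full.

class CuckooTable:
    def __init__(self, size=8, max_kicks=32):
        self.size, self.max_kicks = size, max_kicks
        self.slots = [None] * size

    def _buckets(self, key):
        # two toy deterministic hash functions for illustration
        return key % self.size, (key * 2654435761 >> 8) % self.size

    def lookup(self, key):
        for b in self._buckets(key):
            e = self.slots[b]
            if e is not None and e[0] == key:
                return e[1]
        return None

    def insert(self, key, val):
        item = (key, val)
        for _ in range(self.max_kicks):
            for b in self._buckets(item[0]):
                if self.slots[b] is None or self.slots[b][0] == item[0]:
                    self.slots[b] = item
                    return True
            b = self._buckets(item[0])[0]
            item, self.slots[b] = self.slots[b], item  # evict occupant, retry
        return False  # insertion failed; a real table would rehash or resize
```

The bounded number of probes per lookup is what makes the structure attractive
for a FIB: worst-case lookup cost is constant regardless of table occupancy.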
The main goal of this paper is to demonstrate how data compression can
benefit the networking community, by showing how to squeeze the
longest-prefix-matching lookup table, consulted by switches for IP lookup,
into information-theoretical entropy bounds with essentially zero cost on
lookup performance and FIB update. The state-of-the-art in compressed data
structures yields a static entropy-compressed FIB representation with
asymptotically optimal lookup. Since this data structure proves too slow
for practical use, the authors redesign the venerable prefix tree to also
admit entropy bounds and support lookup in optimal time and update in
nearly optimal time.
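For reference, the venerable binary prefix tree the authors start from looks
roughly like this (a toy sketch over bit strings, not the paper's
entropy-compressed structure): longest-prefix match walks the trie bit by bit,
remembering the last next-hop seen.

```python
# Toy binary trie for longest-prefix matching over bit strings.

class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None

def insert(root, prefix_bits, next_hop):
    node = root
    for b in prefix_bits:                 # prefixes given as bit strings, e.g. "1010"
        if node.children[int(b)] is None:
            node.children[int(b)] = TrieNode()
        node = node.children[int(b)]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    node, best = root, None
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop          # longest match so far
    return best

root = TrieNode()
insert(root, "10", "A")
insert(root, "1010", "B")
```

The paper's contribution is to represent essentially this structure within
information-theoretic entropy bounds while keeping lookup optimal.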
Packet classification is a core concern for programmable network
devices, but it is also very difficult to implement
efficiently. Traditional multi-field classification approaches, in both
software and ternary content-addressable memories (TCAMs), entail trade-offs
between (memory) space and (lookup) time. In this work, a novel approach is
presented that identifies properties of many classifiers allowing them to be
implemented in linear space with guaranteed worst-case logarithmic lookup
time, while supporting the addition of further fields, including range
constraints, without impacting the space and time complexities.
The authors argue that a hybrid software-hardware switch may help in
lowering the flow-table entries installation time, and present ShadowSwitch
(sSw), an OpenFlow switch prototype that implements such a design. sSw
builds on two key observations. First, software tables are very fast to
update; hence, forwarding table updates always happen in software first
and, eventually, entries are moved to the TCAM to achieve higher overall
throughput and offload the software forwarder. Lookup in software is
performed only when no entries match a packet in
hardware. Second, since deleting entries from the TCAM is much faster than
adding them, ShadowSwitch may translate an entry installation into a mix of
installations in software tables and deletions from hardware tables.
The paper presents the design and evaluation of Hermes, a practical and
immediately deployable framework that offers a novel method for
partitioning and optimizing switch TCAM to enable performance guarantees
for control plane actions, in particular, inserting, modifying, or deleting
rules. Hermes provides these guarantees by trading-off a nominal amount of
TCAM space for assured performance.
The paper presents PPS, an algorithm for locating occurrences of
string keywords in the payload of packets using programmable network
ASICs. PPS converts the string search problem into a Deterministic
Finite Automata (DFA) representation and then maps the DFA into a
sequence of forwarding tables implemented in the switch pipeline. The
evaluations show that PPS achieves significantly higher throughput
and lower latency than CPUs, GPUs, or FPGAs.
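The DFA-to-table idea can be sketched in software (illustrative only; PPS
itself maps these tables onto switch pipeline stages): the keyword's DFA
becomes a (state, byte) -> state transition table, exactly the kind of
exact-match table a pipeline stage can implement, with one lookup per
payload byte.

```python
# Toy keyword matcher: build a KMP-style DFA as a (state, char) -> state
# table, then scan a payload with one table lookup per character.

def build_dfa(keyword):
    table = {}
    for state in range(len(keyword)):
        for ch in set(keyword):
            if ch == keyword[state]:
                table[(state, ch)] = state + 1
            else:
                # fall back to the longest prefix of the keyword that is a
                # suffix of what has been matched so far plus ch
                s = keyword[:state] + ch
                k = 0
                for j in range(1, len(s) + 1):
                    if keyword.startswith(s[-j:]):
                        k = j
                table[(state, ch)] = k
    return table

def match(table, keyword, payload):
    state = 0
    for ch in payload:
        state = table.get((state, ch), 0)   # one "table lookup" per byte
        if state == len(keyword):
            return True
    return False
```

Characters not appearing in the keyword simply reset the automaton to state 0,
which is why the lookup's default value is 0.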
Applications
A main motivation for programmable data planes is the novel applications
they enable. We identify and, in the following, discuss five main
categories: applications related to resilient and efficient forwarding,
in-network computation, consensus, telemetry, and load-balancing.
One may wonder, what aspects of SDN and programmable data plane make these
applications possible? There is probably no single perfect answer to this
question.
Applications related to in-network computation typically leverage new
hardware-assisted primitive operations, supported in the data plane, to
provide novel functionality and improve performance. Resilient and
efficient routing (and to some extent load-balancing) leverage the unique
and unprecedented programmatic control over the way traffic flows through
the network, e.g., to implement advanced functionality in the data plane
(functionality formerly handled, e.g., in the control plane).
Measurement applications benefit from the improved traffic visibility
and/or from the improved latency and throughput at which high-volume and
highly variable traffic can be handled, if offloaded to the data plane.
Reduced latency and improved reaction time are arguably also key reasons
for consensus applications. Furthermore, measurement applications benefit
from the fact that they can be expressed in terms of simple primitives
(e.g., sketches). We also note that such applications are not limited to
being “performed (only) in the network”: for example, telemetry can (and
today often does) occur outside the network. That said, telemetry
applications also benefit from the new visibility into the network, e.g.,
queue occupancy levels of the switches along the path. Many interesting
applications also arise from offloading applications that were formerly
handled in a separate middlebox to programmable switches.
In general, any application designed for a non-programmable device may
benefit from the flexibilities introduced by a programmable counterpart
(e.g., allowing to evolve the application). Also, applications with a
strong networking component (e.g., request-response patterns) are more
likely to benefit from in-network services, as much communication traffic
naturally traverses the network anyway.
Resilient, Robust, and Efficient Forwarding
Data planes often operate much faster than the control plane, which
motivates moving functionality for maintaining connectivity and efficient
routing under failures to the switches. At the same time, implementing such
functionality is non-trivial, as discussed in the following research
papers.
This paper is motivated by the limitations of existing IP multipath
protocols relying on per-flow static hashing, which can result in
suboptimal throughput and bandwidth losses due to long-term
collisions. Hedera is a dynamic flow scheduling system for multi-stage
switch topologies as they often appear in data centers. Hedera uses flow
information from constituent switches and reroutes traffic to
non-conflicting routes accordingly. The authors show that the more global
view of routing and traffic demands allows Hedera to see bottlenecks that
switch-local schedulers cannot, and to adaptively schedule the switching
fabric in a way which significantly improves aggregate network utilization
with minimal overheads.
The authors propose to move the responsibility for maintaining basic
network connectivity (as opposed to the computation of optimal paths which
require global control plane knowledge) to the data plane, which operates
orders of magnitude faster than the control plane. Their Data-Driven
Connectivity (DDC) approach, which can handle arbitrary delays and losses,
relies on simple state changes which can be done at packet rates. In
particular, DDC relies on link reversal routing, adapted to suit the data
plane, e.g., to handle message loss.
The paper argues that in order to provide high availability,
connectivity, and robustness, dependable SDNs must implement functionality
for inband network traversals, e.g., to find failover paths in the presence
of link failures. Three fundamentally different mechanisms are described:
simple stateless mechanisms, efficient mechanisms based on packet tagging,
and mechanisms based on dynamic state at the switches.
The paper explores new possibilities, created by programmable
switches, for fast dataplane-driven rerouting upon signals triggered
by traffic disruptions. The proposed method, Blink, exploits
TCP-induced signals to detect failures; when compounded over multiple
flows, TCP behavior creates a strong and characteristic failure
signal. Blink analyzes TCP flows, at line rate, to reliably and
quickly detect major traffic disruptions and recover data-plane
connectivity. Evaluation results for a P4 implementation of Blink
running on a real Tofino switch indicate that it can achieve sub-second
rerouting for realistic Internet traffic and scales to protect large
fractions of realistic traffic.
In-network Computation
Offloading computation, on-path aggregation functionality, caching, or
even AI to the network has the potential to significantly improve the
efficiency of distributed applications. Accordingly, the study of such
mechanisms has recently received much attention.
This paper makes the case that many massive-scale information processing
and real-time applications may benefit from pushing data-aggregation load
from the network edge into the network. This is because in many of these
applications data is aggregated during the computation process and the
output size is a fraction of the input size. The authors explore a
different point in the design space, whereby instead of increasing the
network bandwidth they rather implement a MapReduce-like system on a
cluster design that uses a direct-connect network topology, with servers
directly linked to other servers, letting the servers perform in-network
aggregation of data during the shuffle phase. Camdoop was shown to
significantly reduce network traffic and to yield substantial performance gains.
This paper is motivated by the performance challenges faced by data-center
applications, such as Hadoop batch processing, during the data aggregation
phase: if the network struggles to support many-to-few, high-bandwidth
communication between servers, then it can become a bottleneck. Mai et
al. propose to depart from performing data aggregation at edge servers
and instead do it more efficiently along network paths. The presented software
platform, NETAGG, supports on-path aggregation for network-bound
partition/aggregation applications. It is based on a middlebox-like design,
in which dedicated servers execute aggregation functions provided
by applications. The authors demonstrate that NETAGG can improve throughput
substantially.
SHArP is designed to offload computational load to the network, by relying
on intelligent network devices manipulating data traversing the
datacenter. SHArP is implemented in Mellanox’s SwitchIB-2 ASIC, using
in-network trees to reduce data from a group of sources, and to distribute
the result. Multiple parallel jobs with several partially overlapping
groups are supported, and pipelining is used to further improve latency.
The authors ask the question, given that programmable data plane hardware
creates new opportunities for infusing intelligence into the network, what
kinds of computation should be delegated to the data plane? The paper
discusses the opportunities and challenges for co-designing data center
distributed systems with their network layer, under the constraints imposed
by the limitations of the network machine architecture of programmable
devices. They find that, in particular, aggregation functions raise
opportunities to exploit the limited computation power of networking
hardware to lessen network congestion and improve the overall application
performance.
This paper presents IncBricks, an in-network caching fabric with basic
computing primitives. IncBricks is a hardware-software co-designed system
that supports caching in the network using a programmable network
middlebox. As a key-value store accelerator, the prototype lowers request
latency by over 30% and doubles throughput for 1024-byte values in a common
cluster configuration. The results demonstrate the effectiveness of
in-network computing and show that efficient datacenter network request
processing is possible if the computation is carefully split across
programmable switches, network accelerators, and end hosts.
The main contribution of this work is providing a set of general building
blocks that mask the limitations of programmable switches (limited state,
support for only a limited set of operations, limited per-packet computation)
approximation techniques and thereby enabling the implementation of
realistic network protocols. These building blocks are then used to tackle
the network resource allocation problem within datacenters and realize
approximate variants of congestion control and load balancing protocols,
such as XCP, RCP, and CONGA, that require explicit support from the
network. The evaluations show that the proposed approximations are accurate
and that they do not exceed the hardware resource limits associated with
flexible switches.
This paper analyzes the feasibility and opportunities from using
programmable network devices (e.g., network cards and switches), as
accelerators for Artificial Neural Networks (NNs). In particular, the
authors investigate the properties of NN processing on CPUs, and find that
programmable network devices may indeed be a suitable engine for
implementing a CPU’s NN co-processor.
The paper presents N2Net, a system that implements binary neural networks
using commodity switching chips deployed in network switches and
routers. N2Net shows that these devices can run simple neural network
models, whose input is encoded in the network packets’ header, at packet
processing speeds (billions of packets per second). Furthermore, the
authors’ experience highlights that switching chips could support even more
complex models, provided that some minor and cheap modifications to the
chip’s design are applied.
Distributed Consensus
Another interesting application for programmable data planes is related to
consensus algorithms: the coordination among controllers or switches may be
performed most efficiently directly on the network devices. Over the last
years, several interesting first approaches have been reported in the
literature, not only to compute consensus but also to provide different
notions of consistency more generally.
This paper explores the possibility of implementing the widely deployed
Paxos consensus protocol in network devices. Two different approaches are
presented: (1) a detailed design description for implementing the full
Paxos logic in SDN switches, which identifies a sufficient set of required
OpenFlow extensions, and (2) an alternative, optimistic protocol which can
be implemented without changes to the OpenFlow API, but relies on
assumptions about how the network orders messages. Although neither of
these protocols can be fully implemented without changes to the underlying
switch firmware, the authors argue that such changes are feasible in
existing hardware.
This paper posits that there are significant performance benefits to be
gained by implementing the Paxos protocol, the foundation for building many
fault-tolerant distributed systems and services, in network devices. The
paper describes an implementation of Paxos in P4.
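As a rough illustration of why Paxos logic fits a switch pipeline, the acceptor’s phase-2 rule is just a compare-and-update on per-instance state. The sketch below models it in plain Python with an invented message format; it is not the papers’ actual P4 code:

```python
def acceptor_accept(state, msg):
    """Paxos phase-2 acceptor rule (sketch, invented message format).

    state: mutable dict with "rnd" (highest round seen/promised),
    "vrnd" and "value" (last accepted round and value).
    Accept iff the message's round is at least the highest round seen.
    """
    if msg["rnd"] >= state["rnd"]:
        state["rnd"] = msg["rnd"]
        state["vrnd"] = msg["rnd"]
        state["value"] = msg["value"]
        # in the in-network designs, this reply is emitted at line rate
        return {"type": "accepted", "rnd": msg["rnd"], "value": msg["value"]}
    return None  # stale round: drop
```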
SwitchKV implements a key-value store that leverages SDN switches to
balance load across the cache servers by routing traffic based on the
content of the network packets. To identify the content of a packet, the
key of a key-value entry is encoded in the packet header. A hybrid cache
strategy keeps the cache and switch forwarding rules updated, achieving
significant improvements in both throughput and latency.
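The content-based forwarding idea can be sketched as follows; the field and helper names are invented for illustration and do not reflect SwitchKV’s implementation:

```python
def home_server(key, num_backends):
    """Default routing: a deterministic toy hash to the key's home backend."""
    return sum(key.encode()) % num_backends

def forward(packet, hot_key_rules, num_backends):
    """Route a request whose key is carried in the packet header.

    hot_key_rules models the switch's exact-match table: keys currently
    cached, mapped to the cache node's port; all other keys go to their
    home backend server.
    """
    key = packet["key"]
    if key in hot_key_rules:
        return hot_key_rules[key]
    return "backend-%d" % home_server(key, num_backends)
```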
The paper considers the design of consistent distributed control planes in
which the actions performed on the data plane by different controllers need
to be synchronized. The authors propose a synchronization framework
based on atomic transactions implemented in the data-plane switches and
show that their approach makes it possible to realize fundamental consensus primitives
in the presence of controller failures. They also discuss applications for
consistent policy composition. With a proof-of-concept implementation, it
is demonstrated that the framework can be implemented using the standard
OpenFlow protocol.
NetCache implements a small cache for key-value stores in a programmable
hardware switch data plane. The switch works as a cache at the datacenter’s
rack level, handling requests directed to the rack’s servers. The
implementation deals with consistency problems and shows how to overcome
the constraints of hardware to provide throughput and latency improvements.
The paper presents an interesting alternative to NetCache: instead of
using in-network programmable switches to cache key-value pairs, it
leverages programmable NICs to accelerate key-value stores in an
“end-to-end” fashion. In particular, KV-Direct extends RDMA using
programmable NICs to enable remote direct key-value access to the main
host memory, yielding more than 1.2 billion operations per second
using 10 parallel NICs.
This paper presents NetChain, a new approach that provides scale-free
sub-RTT coordination in data centers. NetChain exploits programmable
switches to store data and process queries entirely in the network data
plane. This eliminates the query processing at coordination servers and
cuts the end-to-end latency to as little as half of an RTT. New protocols
and algorithms are designed so that NetChain guarantees strong consistency
and handles switch failures efficiently.
Monitoring, Telemetry, and Measurement
Perhaps the most interesting applications are related to network
measurement, monitoring and diagnosis. Indeed, programmable data planes can
be a game changer, providing deep insights into the network, even to
end-hosts, as we discuss in the following.
Jeyakumar et al. present an approach to give end hosts visibility into
network behavior and to quickly introduce new data-plane functionality, via
a new Tiny Packet Program (TPP) interface. TPPs are embedded into packets
by end hosts and can actively query and manipulate internal network
state. The idea is motivated by a clear division of work: switches forward and
execute TPPs in-band at line rate, and end hosts perform arbitrary (and
easily updated) computation on network state. The paper presents a number
of use-case descriptions motivating In-band Network Telemetry (INT).
In-band Network Telemetry (INT) is a powerful new network-diagnostics and
debug mechanism, which allows, e.g., to diagnose performance problems
related to latency spikes. The INT abstraction allows data packets to query
switch-internal state (e.g., queue size, link utilization, and queuing
latency). The paper reports on a prototype implemented in the P4 language,
hence supporting various different programmable network devices.
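The INT abstraction can be modeled compactly: each hop pushes its own metadata onto a stack carried by the packet, and the sink inspects the stack. The field names below are illustrative, not the INT specification’s exact layout:

```python
def int_transit(packet, switch_id, queue_depth, hop_latency_ns):
    """Executed at every INT-capable hop: push this hop's metadata."""
    packet.setdefault("int_stack", []).append({
        "switch_id": switch_id,
        "queue_depth": queue_depth,
        "hop_latency_ns": hop_latency_ns,
    })
    return packet

def diagnose(packet, latency_budget_ns):
    """At the INT sink: name the hops whose latency exceeds the budget."""
    return [m["switch_id"] for m in packet.get("int_stack", [])
            if m["hop_latency_ns"] > latency_budget_ns]
```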
The paper seeks accurate, feasible, and scalable traffic matrix
estimation approaches by designing feasible traffic measurement rules that
can be installed in TCAM entries of SDN switches. The statistics of the
measurement rules are collected by the controller to estimate a fine-grained
traffic matrix. Two strategies are proposed, called Maximum Load Rule First
(MLRF) and Large Flow First (LFF), both of which satisfy the flow
aggregation constraints (determined by the associated routing policies) and
have low complexity.
The paper describes HashPipe, a heavy hitter detection algorithm using
programmable data planes. HashPipe implements a pipeline of hash tables
which retain counters for heavy flows while evicting lighter flows over
time. HashPipe is prototyped in P4 and evaluated with packet traces from an
ISP backbone link and a data center.
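A simplified software model of HashPipe’s pipeline of hash tables, assuming small fixed-size stages and toy hash functions (the real system runs within a switch pipeline’s memory and per-stage constraints):

```python
NUM_STAGES = 3
SLOTS_PER_STAGE = 8

def stage_index(key, stage):
    # deterministic toy per-stage hash
    return (sum(key.encode()) * 31 + stage * 7) % SLOTS_PER_STAGE

def make_pipeline():
    return [[None] * SLOTS_PER_STAGE for _ in range(NUM_STAGES)]

def insert(pipeline, key):
    carried = (key, 1)  # (flow key, count) travelling down the pipeline
    for stage, table in enumerate(pipeline):
        idx = stage_index(carried[0], stage)
        resident = table[idx]
        if resident is None:               # empty slot: settle here
            table[idx] = carried
            return
        if resident[0] == carried[0]:      # same flow: just count
            table[idx] = (resident[0], resident[1] + carried[1])
            return
        if stage == 0 or carried[1] > resident[1]:
            # first stage always inserts; later stages keep the heavier
            # entry and evict the lighter one to the next stage
            table[idx] = carried
            carried = resident
        # else: resident stays, carried continues downstream
    # after the last stage the lighter entry is dropped

def count(pipeline, key):
    """Estimate a flow's size: sum its counters across all stages."""
    total = 0
    for table in pipeline:
        for entry in table:
            if entry is not None and entry[0] == key:
                total += entry[1]
    return total
```

Keeping the heavier of the two colliding entries at each stage is what biases the tables toward retaining heavy flows while light flows are gradually evicted.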
Dapper is a system which leverages emerging edge devices offering flexible
and high-speed packet processing on commodity hardware, to diagnose cloud
performance problems in a timely manner. In particular, Dapper analyzes TCP
performance in real time near the end-hosts, i.e., at the hypervisor, NIC,
or top-of-rack switch, by determining whether a connection is limited by
the sender, the network, or the receiver. Dapper was prototyped in P4.
The paper presents SketchVisor, a robust network measurement framework,
which augments sketch-based measurement in the data plane with a fast path
that is activated under high traffic load to provide high-performance local
measurement with slight accuracy degradation. It further recovers accurate
network-wide measurement results via compressive sensing. A SketchVisor
prototype is built on top of Open vSwitch; testbed experiments show that
SketchVisor achieves high throughput and high accuracy for a wide range of
network measurement tasks.
Load balancing
Last but not least, and similarly to the above discussion on resilient
routing, programmable data planes provide unprecedented flexibility (and
performance) in how traffic can be dynamically load-balanced.
HULA is motivated by the shortcomings of ECMP as well as of existing
congestion-aware load-balancing techniques such as CONGA, which, due to
limited switch memory, can only maintain a limited amount of
congestion-tracking state at the edge switches, and hence do not
scale. HULA is a more flexible and scalable data-plane load-balancing
algorithm in which each switch tracks congestion only for the best path to
a destination through a neighboring switch. HULA is designed for
programmable switches and is programmed in P4.
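The per-switch state HULA maintains is deliberately small: one best next hop and its path utilization per destination, refreshed by periodic probes. A toy model of the probe-processing rule (field names assumed, not HULA’s actual packet format):

```python
def process_probe(best_hop, probe, incoming_link_util, my_id):
    """Update this switch's best-hop table from one HULA-style probe.

    best_hop: dict mapping destination ToR -> (neighbor, path_util)
    probe: {"dst": ..., "max_util": ..., "from": neighbor}
    incoming_link_util: utilization of the link the probe arrived on
    Returns the probe to propagate to other neighbors, or None.
    """
    dst = probe["dst"]
    neighbor = probe["from"]
    # path utilization is the bottleneck: the worst link on the path
    path_util = max(probe["max_util"], incoming_link_util)
    current = best_hop.get(dst)
    # adopt the probe if it is better, or if it refreshes the current
    # best hop (so stale good news eventually ages out)
    if current is None or path_util < current[1] or current[0] == neighbor:
        best_hop[dst] = (neighbor, path_util)
        return {"dst": dst, "max_util": path_util, "from": my_id}
    return None
```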
The paper explores how to use programmable switching ASICs to build much
faster load balancers than have been built before. The proposed system,
called SilkRoad, is implemented in about 400 lines of P4 and, when compiled to a
state-of-the-art switching ASIC, it can load-balance ten million
connections simultaneously at line rate.
Memcached is an in-memory key-value distributed caching solution, commonly
used by web servers for fast content delivery. In order to deal with skewed
distributions of key popularity in key-value stores, the authors propose
and implement MBalancer, a switch-based L7 load balancing scheme, which
offloads requests from bottleneck Memcached servers. MBalancer runs as an
SDN application, identifies the (typically small number of) hot keys,
duplicates these hot keys to many (or all) Memcached servers, and adjusts
the switches’ forwarding tables accordingly. Experiences with an
implementation of MBalancer on a hardware-based OpenFlow switch indicate
significant throughput boost and latency reduction.
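The hot-key offloading scheme can be sketched in a few lines, with invented names; the real MBalancer realizes this in switch forwarding tables rather than software:

```python
import itertools

def hot_keys(access_counts, k):
    """Pick the k most frequently accessed keys (the typically small hot set)."""
    return set(sorted(access_counts, key=access_counts.get, reverse=True)[:k])

def make_router(hot, servers):
    """Hot keys, replicated everywhere, go round-robin; the rest hash."""
    rr = itertools.cycle(servers)
    def route(key):
        if key in hot:
            return next(rr)  # any replica holds a duplicated hot key
        return servers[sum(key.encode()) % len(servers)]
    return route
```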
This seminal paper challenges a fundamental design principle of modern
network architecture: that the control plane is software-based. With the
advent of programmable switch ASICs, which can run complex logic at
line rate, the paper revisits this principle by accelerating the
control plane, offloading some of its tasks directly to the network
hardware. Some simple control plane functionality can already be
successfully offloaded to P4 hardware, including failure detection and
notification, connectivity retrieval, and even policy-based routing
protocols, but complex cases involve several tradeoffs and
limitations; the paper outlines these and sketches interesting future
research directions towards hardware-software co-design of network
control planes.
Current implementations of time synchronization protocols, like PTP,
handle the protocol stack in the control-plane. The paper explores the
possibility of using programmable switching ASICs to design and
implement a time synchronization protocol, DPTP, with the core logic
running in the data plane. Comprehensive measurement studies of a
data-plane-accelerated DPTP, implemented in P4 and running on a Barefoot
Tofino switch, show that DPTP can achieve median and 99th-percentile
synchronization errors of 19 ns and 47 ns, respectively, even under heavy
network load.
Miscellaneous Topics
There is highly recommendable literature on the history of SDN and
programmable data planes. We also report on two other important topics:
the deployment of, and algorithms for, programmable data planes.
History
There are several interesting papers putting the technological trends
around programmable networks into a historic perspective.
One intellectual precursor to programmable networks is the Active Networks
concept. This paper surveys the application of active networks technology
to network management and monitoring. The main idea of Smart Packets, which
contain programs written in a safe language, is to move management decision
points closer to the nodes being managed, as well as to target specific
aspects of the node for information.
The paper reviews the state of the art in reconfigurable network systems,
covering hardware reconfiguration, SDN, and the interplay between them. It
starts with a tutorial on software-defined networks, then continues to
discuss programming languages as the linking element between different
levels of software and hardware in the network, reviews electronic
switching systems, highlighting programmability and reconfiguration
aspects, and describes the trends in reconfigurable network
elements.
A keynote from Nick McKeown at NetPL’17 on the many great research ideas
and new languages that have emerged for programmable forwarding. The talk
considers how we got here, why programmable forwarding planes are
inevitable, why now is the right time, why they are a final frontier for
SDN, and why they are here to stay.
Deployments
A very relevant question, which is also a research challenge, regards the
deployment of SDN and programmable data planes.
A seminal paper for deploying SDN in enterprise networks, this paper
presents Ethane, a network architecture that allows managers to define a
single network-wide fine-grained policy and then enforces it
directly. Ethane couples extremely simple flow-based Ethernet switches with
a centralized controller that manages the admittance and routing of
flows. While radical, this design is backwards-compatible with existing
hosts and switches. Ethane was implemented in both hardware and software,
supporting both wired and wireless hosts.
SoftCell aims to overcome today’s expensive, inflexible and complex
cellular core networks by supporting fine-grained policies for mobile
devices, using commodity switches and servers. In particular, SoftCell
allows operators to flexibly route traffic through sequences of middleboxes based on
subscriber attributes and applications. Scalability is achieved by
minimizing the size of the forwarding tables, using aggregation, and by
performing packet classification at the access switches, next to the base
stations.
The paper presents the design, implementation, and evaluation of B4, a
private WAN connecting Google’s data centers across the planet. B4 has a
number of unique characteristics: (1) massive bandwidth requirements
deployed to a modest number of sites, (2) elastic traffic demand that seeks
to maximize average bandwidth, and (3) full control over the edge servers
and network, which enables rate limiting and demand measurement at the
edge. These characteristics led to a Software Defined Networking architecture
using OpenFlow to control relatively simple switches built from merchant
silicon.
To support deploying SDNs into the Evolved Packet Core (EPC), the paper
presents the design and evaluation of a system architecture for a software
EPC that achieves high and scalable performance. The authors postulate that
the poor scaling of existing EPC systems stems from the manner in which the
system is decomposed, which leads to device state being duplicated across
multiple components, which in turn results in frequent interactions between
the different components. An alternate approach is proposed in which state
for a single device is consolidated in one location and EPC functions are
reorganized for efficient access to this consolidated state. The paper also
presents a prototype of PEPC, a software EPC that implements the key
components of the design.
The paper presents a new simulator platform on top of ns-3
specifically tailored to programmable dataplanes and P4 in
particular. Evaluations show that NS4 can effectively simulate
representative P4 programs and scales to large-scale P4-enabled
networks at a low cost.