Extended Berkeley Packet Filter (BPF) has emerged as a powerful method to extend packet-processing functionality in the Linux operating system. BPF allows users to write programs in high-level languages like C or Rust and attach them to specific points in the kernel, such as the network device driver. To ensure safe execution of a user-developed BPF program in kernel context, Linux employs an in-kernel static checker that accepts the program only if it can be shown to be crash-free and isolated from the rest of kernel memory.
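For orientation, a minimal BPF program of the kind described above, attached at the network device driver (XDP), might look like the following in restricted C. This is a sketch for illustration only (it is not part of the artifact), and it assumes libbpf's bpf_helpers.h is installed:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Attached at the driver (XDP) hook; lets every packet continue up the stack.
   Compiled to BPF bytecode with, e.g.: clang -O2 -target bpf -c xdp_pass.c -o xdp_pass.o */
SEC("xdp")
int xdp_pass(struct xdp_md *ctx)
{
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";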
However, BPF programming is not easy. First, even modest-sized BPF programs can be rejected by the kernel checker because they are deemed too large to analyze. Second, the in-kernel static checker may incorrectly determine that a BPF program exhibits unsafe behaviors. Third, even small performance optimizations to BPF code (e.g., 5% gains) must be meticulously hand-crafted by expert developers. Optimizing compilers for BPF are severely hampered because the kernel checker's safety requirements are incompatible with the rule-based optimizations used in traditional compilers.
We present K2, a program-synthesis-based compiler that automatically
optimizes BPF bytecode with formal correctness and safety guarantees.
K2 leverages stochastic search, a formalization of BPF in first-order
logic, and several domain-specific techniques to accelerate equivalence checking
of BPF programs. Relative to clang -O3, K2 produces code with 6--26% reduced size,
13--85 microseconds lower latency, and -0.03--5% higher throughput
(packets per second per core), across programs drawn from
production systems at Cilium and the Linux kernel. BPF programs produced by
K2 can pass the kernel checker.
We provide instructions to help validate the following claims about the paper.
Artifact availability (section 0). Our compiler and all its peripheral code in subsidiary repositories have been made publicly available. The source code of the K2 compiler and supporting software (including evaluation scripts) is publicly available at https://github.com/smartnic/. Further, for easy experimentation, we provide a Docker container. The install.sh
script in this repository provides instructions to download all the required components if one is building a container image from scratch. Many of the experiments in this repository use the Docker container.
Artifact functionality (sections 1 through 4). We show that the compiler can be exercised with different inputs, input formats, and compiler parameters.
- Hello world. This task exercises all of K2's main software components by showing how to optimize a small, simple program. [total estimated machine time: 1 minute; human time: 5 minutes]
- Changing the input program. We show how a user might change the input program to K2 to obtain different outputs. K2 accepts programs in three formats: BPF C macros used by the Linux kernel (file extension .bpf), a home-grown instruction format that we call the K2 language (files with extension .k2), and pre-compiled BPF object files. The compiler outputs optimized versions of the programs in the same format as the input. [total estimated machine time: 4 minutes; human time: 15 minutes]
- Changing compiler parameters. We show how a user might modify the parameters fed to K2 by modifying the K2 command line in our scripts. The full set of K2 parameters is available. [total estimated machine time: 2 minutes; human time: 10 minutes]
- Specific ACM SIGCOMM criteria for artifact functionality. This section explicitly addresses the three criteria for artifact functionality described in the call for artifacts. [total estimated (human) time: 2 minutes]
Reproduction of results (sections 5 through 8). We show how to reproduce the main claims in the empirical evaluation of the paper.
- Instruction count reduction: We provide scripts to reproduce a subset of results on the reductions in program instruction counts that K2 can obtain (table 1 in the submitted paper). The subset of benchmarks chosen corresponds to those programs that we believe can run fast enough on a user's laptop -- we chose benchmarks where the best programs were found within 30 minutes in our submission. Our paper submission claims program compaction anywhere between 6--26%. [total estimated time: 90 minutes.]
- Impact of optimizations on equivalence-checking time: We provide scripts to reproduce a subset of results on the impact of optimizations on reductions in equivalence-checking time (table 3 in the submitted paper). The subset of benchmarks chosen corresponds to those programs that we believe can run fast enough on a user's laptop -- we chose benchmarks where the experiment can finish running within 1 hour in our submission. Our paper submission claims an average 5 orders of magnitude reduction in equivalence-checking time. However, the subset of (faster) benchmarks in the container may only show about 4 orders of magnitude of benefit (10,000X or more). [total estimated time: 120 minutes.]
- Latency/throughput benefits: We provide scripts to reproduce a subset of results from our empirical evaluation of throughput and latency of programs optimized by K2 (table 2 in the submitted paper). This requires setting up an experiment on the NSF CloudLab testbed using the credentials and disk images that we provide. This experiment takes a while to run -- however, it runs on a server-class machine that you log into, and once the longer experiments start, they require your attention only after a few hours. Our paper submission claims that the best programs produced by K2 provide -0.03--5% higher throughput and 13--85 microseconds lower latency than the most optimized versions produced by Clang. Due to the slightly different nature of the experimental setup between the paper submission and the replication (i.e., this document), the exact numbers you will see will differ from those in Table 2 in the submission. See our caveat for more details. Instead of validating the exact improvements, the main claim to validate is that there is a reduction in tail latency (i.e., the latency at the MLFFR, see section 8 in the submitted paper) with similar or better throughput (i.e., the MLFFR). [total estimated machine time: 15 hours; human time: 40 minutes]
- Efficacy of safety checks: We provide scripts to reproduce the results on checking the safety of K2-produced programs using the kernel checker (table 7 in the submitted paper appendix). We test whether K2 programs can pass the kernel checker by loading each BPF program into the kernel. If the load succeeds, the program passes the kernel checker. The result shows that 35 of the 38 programs pass the kernel checker. [total estimated machine time: 2 minutes 30 seconds; human time: 5 minutes.]
- The results shown below should be treated as a reference only. Since K2 leverages stochastic search, we cannot guarantee that the results are exactly the same for every run. Further, experimentation on throughput and latency can show variations from run to run. We have observed that small programs often exhibit deterministic behavior.
- K2 is a synthesizing compiler that optimizes BPF programs better than traditional compilers like clang. However, it does so at the expense of additional compile time. When you run the compiler on a program, you may often need to wait longer (e.g., than using clang) to see results. We have made our best efforts to provide sample programs that can be near-interactively compiled on a reasonably powerful laptop, and we provide estimates of how long you might have to wait for each compilation to finish.
- The estimated times for the evaluations running in the Docker container are based on a machine with a 3.1 GHz dual-core Intel Core i7 processor and 16 GB of 1867 MHz DDR3 memory. We have tested the container on a few different laptops, and expect the experiments to run without too much trouble on a relatively modern laptop.
- Experiments 1--6 can be run in any order once you've set up the Docker container. Experiments 7 and 8 can be run in either order once you've set up the CloudLab experiment.
- Install Docker, if it is not already installed, by following the Docker install documentation.
- Download the prebuilt Docker image (shasum: 469dbb6cb935342bd7fbd5687c395ba9cb7ef5e5) from:
https://drive.google.com/file/d/1FT0YjxDlJYzAiu16V-pyNaDoBvcXC0aa/view?usp=sharing
- Load and run the Docker image, then change into the artifact directory:
docker load < docker.tar.gz
docker run -it sigcomm21_artifact:v1 bash
cd sigcomm21_artifact
In this experiment, we get started with a Hello World program to show the basic functionality of K2.
The compiler takes an input program benchmark.k2
written in the K2 language, optimizes this program,
and produces an output program benchmark.k2.out
in the K2 language.
Estimated runtime: 20 seconds.
cd 1_hello_world
sh k2.sh
After running the above commands, you will see some output on the screen. The key information is the performance costs (i.e., instruction counts) of the input and output programs: K2 reduces the input program from 3 instructions to 2.
...
original program's perf cost: 3
...
top 1 program's performance cost: 2
...
Run the following commands to have a look at the input and output programs.
cat benchmark.k2
cat benchmark.k2.out
You will see that the program
MOV64XC 0 1
ADD64XY 0 0
EXIT
is optimized to
MOV64XC 0 2
EXIT
Here are some comments to help understand the meanings of the two programs. Each line of the benchmark files is one instruction. r0, r1, ...
are BPF registers.
benchmark.k2
MOV64XC 0 1 // r0 = 1
ADD64XY 0 0 // r0 += r0
EXIT // exit, return r0
benchmark.k2.out
MOV64XC 0 2 // r0 = 2
EXIT // exit, return r0
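For intuition, a rough C analogue of this rewrite (hypothetical functions, shown only to relate the bytecode to source-level constant folding) is:

int before(void) { int r0 = 1; r0 += r0; return r0; }   /* MOV64XC, ADD64XY, EXIT */
int after(void)  { return 2; }                           /* MOV64XC, EXIT          */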
In this experiment, we introduce three input-output program formats supported by K2: BPF C macros, the K2 language, and pre-compiled BPF object files, and show how a user might modify a given program to see different outputs. Kernel developers often hard-code BPF assembly similar to the first format (BPF C macros); most real code that uses BPF is written in C and compiled by Clang
into the BPF object file format, which corresponds to the third format (pre-compiled BPF object files). We show all three program formats since the first two formats are easier to read and understand (especially to understand specific optimizations), while the last format is practically the most used and deployed.
The Linux kernel supports using BPF C macros to write BPF assembly programs (an example). In this experiment, we show that K2 takes a program written in BPF C macros and produces an optimized program also written in BPF C macros. We also use an example to show how a user might change the input program.
Estimated runtime: 40 seconds.
cd ../2_different_inputs/1_bpf_insn
sh k2.sh benchmark_before.bpf
The second command feeds program benchmark_before.bpf
to K2, and K2 will produce an output program in
benchmark_before.bpf.out.
You will see K2 reduces 12 instructions to 10.
...
original program's perf cost: 12
...
top 1 program's performance cost: 10
...
Run the following command to see the difference between the original and optimized programs.
diff benchmark_before.bpf benchmark_before.bpf.out
We can see that 2 two-byte load-and-store operations are optimized to 1 four-byte load-and-store operation.
1,4c1,2
< BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 0),
< BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -4),
< BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 2),
< BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -2),
---
> BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
> BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
Here are some comments to help understand the programs.
benchmark_before.bpf
BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 0), // r0 = *(u16 *)(r1 + 0)
BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -4), // *(u16 *)(r10 - 4) = r0
BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 2), // r0 = *(u16 *)(r1 + 2)
BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -2), // *(u16 *)(r10 - 2) = r0
BPF_LD_MAP_FD(BPF_REG_1, 4), // r1 = map fd 4
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), // r2 = r10
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), // r2 += -4
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), // call map_lookup_elem
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), // if r0 == 0, jmp 1 (exit)
BPF_MOV64_IMM(BPF_REG_0, 1), // r0 = 1
BPF_EXIT_INSN(), // exit, return r0
benchmark_before.bpf.out (the first two instructions)
BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0), // r0 = *(u32 *)(r1 + 0)
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), // *(u32 *)(r10 - 4) = r0
The comment call map_lookup_elem
refers to the instruction r0 = map_lookup_elem(map, &key)
.
The first function parameter map
is read from r1
, the second &key
is read from r2
.
This function reads a key from the stack memory (in this experiment, key size is set as 4 bytes,
so key = *(u32 *)r2
), looks up this key in the map pointed to by r1
,
and returns the value address (i.e., &map[key]) if the key exists in the map,
else returns NULL
(i.e., 0).
The instruction BPF_LD_MAP_FD
occupies two "slots", i.e., the space of two "regular" BPF instructions.
A 64-bit map descriptor is divided into two 32-bit immediate numbers stored in two separate slots.
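For readers who mostly write BPF in C, the macro sequence above corresponds roughly to the following restricted-C fragment. This is a sketch only: the map definition, section name, and the way the 4-byte key is obtained are assumptions made for illustration and are not part of the artifact.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 256);
    __type(key, __u32);
    __type(value, __u64);
} example_map SEC(".maps");

SEC("socket")
int example_prog(struct __sk_buff *skb)
{
    /* The macro version assembles a 4-byte key on the stack (at r10 - 4)
       from two 2-byte loads off the context (r1 + 0 and r1 + 2). */
    __u32 key = skb->len;                 /* stand-in for the real key */

    __u64 *value = bpf_map_lookup_elem(&example_map, &key);
    if (!value)
        return 0;                         /* lookup returned NULL: r0 == 0 */
    return 1;                             /* key found: r0 = 1 */
}

char _license[] SEC("license") = "GPL";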
Suppose we wanted to modify the input program from benchmark_before.bpf
to benchmark_after.bpf
.
We can recompile the program and observe the new output that K2 will produce.
Run the command to see the difference between benchmark_before.bpf
and benchmark_after.bpf
.
diff benchmark_before.bpf benchmark_after.bpf
You should see
1,4c1,4
< BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 0),
< BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -4),
< BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 2),
< BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -2),
---
> BPF_MOV64_IMM(BPF_REG_0, 0),
> BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
> BPF_MOV64_IMM(BPF_REG_0, 1),
> BPF_STX_MEM(BPF_B, BPF_REG_10, BPF_REG_0, -5),
Here are some notes to understand the changes in the benchmark_after.bpf
program relative to benchmark_before.bpf
.
You may observe that the third and fourth instructions are redundant, since the rest of the program does not read from
(r10 - 5)
.
BPF_MOV64_IMM(BPF_REG_0, 0), // r0 = 0
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), // *(u32 *)(r10 - 4) = r0
BPF_MOV64_IMM(BPF_REG_0, 1), // r0 = 1
BPF_STX_MEM(BPF_B, BPF_REG_10, BPF_REG_0, -5), // *(u8 *)(r10 - 5) = r0
Let's run the command that invokes K2 to optimize benchmark_after.bpf
. The output is stored in benchmark_after.bpf.out.
This usually takes about 40 seconds.
sh k2.sh benchmark_after.bpf
Run the following command to see the difference between the input and output programs:
diff benchmark_after.bpf benchmark_after.bpf.out
You may get (this is the result from one of our runs)
1,4c1
< BPF_MOV64_IMM(BPF_REG_0, 0),
< BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
< BPF_MOV64_IMM(BPF_REG_0, 1),
< BPF_STX_MEM(BPF_B, BPF_REG_10, BPF_REG_0, -5),
---
> BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
K2 has discovered an optimization using the instruction
BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0), // *(u32 *)(r10 - 4) = 0
K2 reduces 4 instructions to 1 instruction by directly storing an immediate number on the stack and removing the redundant instructions.
This subsection discusses the same process as 2.1 (making a change to the input program and observing different outputs), with the main difference that the program is encoded in the K2 language. The language is mainly used inside our compiler for ease of development and is not something that regular BPF developers use as a surface language. K2 opcodes have a one-to-one correspondence with the set of BPF assembly opcodes. You could look at the source code of our BPF interpreter (in /sigcomm21_artifact/dependencies/superopt/src/isa/ebpf/inst.cc in the container) for more details.
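To make the one-to-one correspondence concrete, here is the hello-world program from experiment 1 written side by side in the K2 language and (our own transcription into) the kernel's BPF C macros:

MOV64XC 0 1   <->   BPF_MOV64_IMM(BPF_REG_0, 1),                  // r0 = 1
ADD64XY 0 0   <->   BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_0), // r0 += r0
EXIT          <->   BPF_EXIT_INSN(),                              // exit, return r0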
Estimated runtime: 25 seconds.
cd ../2_k2_inst
sh k2.sh benchmark_before.k2
The second command feeds program benchmark_before.k2
to K2, and K2 will produce an output program in
benchmark_before.k2.out.
You will see K2 reduces 12 instructions to 10.
...
original program's perf cost: 12
...
top 1 program's performance cost: 10
...
Run the following command to see the difference between the input and output programs.
diff benchmark_before.k2 benchmark_before.k2.out
We can see that two one-byte load-and-store operations are optimized to one two-byte load-and-store operation.
1,4c1,2
< LDXB 0 1 0
< STXB 10 -2 0
< LDXB 0 1 1
< STXB 10 -1 0
---
> LDXH 0 1 0
> STXH 10 -2 0
Here are some comments to help understand the programs.
Note: call map_lookup_elem invokes the function r0 = map_lookup_elem(map, &key).
The first parameter map
is read from r1
, the second &key
is read from r2
.
This function reads the key from the stack (in this experiment, the key size is set to 2 bytes,
so key = *(u16 *)r2
), then looks up this key in the map,
returns the value address (i.e., &map[key]) if the key is in the map,
else returns NULL
(i.e., 0).
benchmark_before.k2:
LDXB 0 1 0 // r0 = *(u8 *)(r1 + 0)
STXB 10 -2 0 // *(u8 *)(r10 - 2) = r0
LDXB 0 1 1 // r0 = *(u8 *)(r1 + 1)
STXB 10 -1 0 // *(u8 *)(r10 - 1) = r0
LDMAPID 1 0 // r1 = map fd 0
MOV64XY 2 10 // r2 = r10
ADD64XC 2 -2 // r2 += -2
CALL 1 // call map_lookup_elem
JEQXC 0 0 1 // if r0 == 0, jmp 1 (exit)
MOV64XC 0 1 // r0 = 1
EXIT // exit, return r0
benchmark_before.k2.out: (the first two instructions)
LDXH 0 1 0 // r0 = *(u16 *)(r1 + 0)
STXH 10 -2 0 // *(u16 *)(r10 - 2) = r0
Here, we modify the input program from benchmark_before.k2
to benchmark_after.k2
.
We can observe that K2 will produce a different output.
Run the command to see the difference between benchmark_before.k2
and benchmark_after.k2
.
diff benchmark_before.k2 benchmark_after.k2
You will get
1,4c1,4
< LDXB 0 1 0
< STXB 10 -2 0
< LDXB 0 1 1
< STXB 10 -1 0
---
> MOV64XC 0 0
> STXH 10 -2 0
> MOV64XC 0 1
> STXB 10 -5 0
Here are some comments to help understand benchmark_after.k2
.
Note: the third and fourth instructions are redundant, since the remaining program does not read from
(r10 - 5)
.
MOV64XC 0 0 // r0 = 0
STXH 10 -2 0 // *(u16 *)(r10 - 2) = r0
MOV64XC 0 1 // r0 = 1
STXB 10 -5 0 // *(u8 *)(r10 - 5) = r0
Run the following command to invoke K2 to optimize benchmark_after.k2
and store the output in benchmark_after.k2.out
(estimated runtime: 20 seconds.)
sh k2.sh benchmark_after.k2
Run the following command to see the difference between the input and output programs:
diff benchmark_after.k2 benchmark_after.k2.out
You may see (this is the result from our run)
1,4c1
< MOV64XC 0 0
< STXH 10 -2 0
< MOV64XC 0 1
< STXB 10 -5 0
---
> STDW 10 -8 1
Comment to understand the modification:
STDW 10 -8 1 // *(u64 *)(r10 - 8) = 1
// It means that Mem[r10 - 7 : r10 - 1] is set to 0, while Mem[r10 - 8] is set to 1
// In your run, the immediate number could be different because of the stochastic search
K2 reduces 4 instructions to 1 instruction by directly storing an immediate number on the stack and removing redundant instructions.
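If the byte layout of that 8-byte store is not obvious, the following stand-alone C snippet (illustrative only, assuming a little-endian host such as the x86 machines used here) prints the same layout:

#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char stack[8];
    unsigned long long imm = 1;      /* models *(u64 *)(r10 - 8) = 1 */
    memcpy(stack, &imm, sizeof(stack));
    for (int i = 0; i < 8; i++)      /* stack[0] is the byte at r10 - 8 */
        printf("Mem[r10 - %d] = %d\n", 8 - i, stack[i]);
    return 0;
}

On a little-endian machine this prints Mem[r10 - 8] = 1 and 0 for the remaining seven bytes, matching the comment above.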
K2 can optimize BPF object files (.o
files) produced as a result of taking BPF code written in C and compiling with clang
. The result is another (optimized) object file (.o
), which can be used as a drop-in replacement for the original file.
Estimated runtime: 1 minute.
cd ../3_object_file
sh k2.sh xdp1_kern.o xdp1
The second command feeds the program xdp1
in the object file xdp1_kern.o
to K2, and K2 will produce an output object file xdp1_kern.o.out
.
You will see K2 reduces 61 instructions to 59.
...
original program's perf cost: 61
...
top 1 program's performance cost: 59
...
Run the following commands to see the difference between the input and output programs. The first and second commands use llvm-objdump to disassemble xdp1_kern.o and xdp1_kern.o.out, and store the disassembled programs in corresponding files with the extension .objdump.
llvm-objdump -d xdp1_kern.o > xdp1_kern.o.objdump
llvm-objdump -d xdp1_kern.o.out > xdp1_kern.o.out.objdump
diff xdp1_kern.o.objdump xdp1_kern.o.out.objdump
We can see that there are two goto +0 instructions in the output program. For convenience, we use goto +0 instead of a NOP to mark the deletion of an instruction.
...
< 5: 71 13 0c 00 00 00 00 00 r3 = *(u8 *)(r1 + 12)
< 6: 71 14 0d 00 00 00 00 00 r4 = *(u8 *)(r1 + 13)
< 7: 67 04 00 00 08 00 00 00 r4 <<= 8
< 8: 4f 34 00 00 00 00 00 00 r4 |= r3
---
> 5: 69 13 0c 00 00 00 00 00 r3 = *(u16 *)(r1 + 12)
> 6: 05 00 00 00 00 00 00 00 goto +0 <xdp_prog1+0x38>
> 7: bc 34 00 00 00 00 00 00 w4 = w3
> 8: 05 00 00 00 00 00 00 00 goto +0 <xdp_prog1+0x48>
K2 uses stochastic search to find higher-performance programs that are semantically equivalent to the input. The search is controlled by several parameters. (If you wish to take a look, the full set of K2 parameters is available.)
Optimizing smaller regions of the program at a time allows K2 to scale
to larger program sizes by performing synthesis and verification over smaller programs,
and composing the smaller results to form larger results (see more details in section 5, Modular verification
,
in our accepted PDF paper.) We call these smaller (contiguous) program regions windows.
There are constraints on the instructions that these windows can contain, documented
in the Appendix of our paper.
In this experiment, we will show how to set the optimization window parameters for K2. The manual window setting is mainly for illustration: in practice, we auto-detect and set these program windows. Specifically, we show explicit settings of two different window parameters for the same input program.
Note: in this section, we use window [s,e] to denote the program segment from instruction s to instruction e.
Estimated runtime: 40 seconds
cd ../../3_change_parameters/
sh k2.sh benchmark.bpf 0 3 benchmark_win1.bpf.out
This command invokes K2 to optimize window [0,3] of program benchmark.bpf and stores the output program in benchmark_win1.bpf.out.
For window [0,3], in our run, K2 reduces 16 instructions to 14.
...
original program's perf cost: 16
...
top 1 program's performance cost: 14
...
Run the following command to see the difference between the input and output programs (window [0,3]).
diff benchmark.bpf benchmark_win1.bpf.out
We can see that if the window is set to [0,3], the first 4 instructions are optimized.
1,4c1,2
< BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 0),
< BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -4),
< BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 2),
< BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -2),
---
> BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
> BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
Here are some comments to help understand the program.
1,4c1,2
< BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 0), // r0 = *(u16 *)(r1 + 0)
< BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -4), // *(u16 *)(r10 - 4) = r0
< BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1, 2), // r0 = *(u16 *)(r1 + 2)
< BPF_STX_MEM(BPF_H, BPF_REG_10, BPF_REG_0, -2), // *(u16 *)(r10 - 2) = r0
---
> BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0), // r0 = *(u32 *)(r1 + 0)
> BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), // *(u32 *)(r10 - 4) = r0
Estimated runtime: 40 seconds
sh k2.sh benchmark.bpf 4 5 benchmark_win2.bpf.out
For this command, K2 takes the same program benchmark.bpf as input, optimizes window [4,5], and stores the output program in benchmark_win2.bpf.out. For window [4,5], the number of instructions is reduced from 16 to 15.
...
original program's perf cost: 16
...
top 1 program's performance cost: 15
...
Run the following command to see the difference between the input and output programs (window [4,5]).
diff benchmark.bpf benchmark_win2.bpf.out
We can see that instructions 4 to 5 are optimized if the window is set to [4,5].
5,6c5
< BPF_MOV64_IMM(BPF_REG_0, 1),
< BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -16),
---
> BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1),
Here are some comments to help understand the programs.
< BPF_MOV64_IMM(BPF_REG_0, 1), // r0 = 1
< BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -16), // *(u64 *)(r10 - 16) = r0
---
> BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), // *(u64 *)(r10 - 16) = 1
If you take a look at the full command line that is used to invoke the compiler in this exercise, you may see that the window settings are incorporated using the flags --win_s_list
and --win_e_list
.
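To relate this to the positional arguments used above, the invocation for window [0,3] presumably ends up passing something like --win_s_list 0 --win_e_list 3 to the compiler. This exact form is our assumption: k2.sh in this directory shows the authoritative command line, and the list-valued flag names suggest that several windows can be supplied in one run.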
[Estimated reading time: 2 minutes]
Documentation: Our artifact includes the Docker container, which contains the primary source code of the compiler as well as some subsidiary repositories containing experimental evaluation scripts. This README serves as the documentation of the compiler, showing how to exercise the artifact.
Completeness: The source code of the compiler attached in the container (see the dependencies
folder in the container) contains all the main components of the compiler described in the paper. Inside the folder /sigcomm21_artifact/dependencies/superopt
in the container:
src/search
contains the code related to the stochastic search (section 3 in the paper)
src/verify
contains the generic framework for program equivalence checking within K2 (high-level section 4 logic)
src/isa/ebpf
contains the formalization of the BPF instruction set in first-order logic, for both program equivalence checking and safety (sections 4, 5, 6)
Exercisability: The code needed to run the experiments in the paper is in the folder 4_reproduce_results
in the container image. Detailed instructions to reproduce the results are provided below.
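As a small illustration of the first-order formalization mentioned in the src/isa/ebpf item above, the equivalence check for the two hello-world programs from experiment 1 boils down to a verification condition of roughly the following shape (an informal sketch of the idea, not the exact encoding used in the code):

   input program:    r0_1 = 1,  r0_2 = r0_1 + r0_1,  result = r0_2
   output program:   s0_1 = 2,                       result' = s0_1
   check (for all initial states):
       (r0_1 = 1) AND (r0_2 = r0_1 + r0_1) AND (s0_1 = 2)  =>  (result = result')

K2 hands the negation of such a formula to a solver; if it is unsatisfiable, the two programs are equivalent on all inputs.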
Estimated runtime: 1 hour 30 minutes
cd ../4_reproduce_results/1_insn_count
sh insn_count.sh
Note: the result reproduced on your machine may differ from the result of our run because of the stochastic search. The key point is that the Number of instructions column for K2 is the same as, or similar to, the corresponding results in the table shown here.
+----------------------------------------------------+
| Number of instructions |
+--------------------+--------+----+-----------------+
| Benchmark | -O2/O3 | K2 | Compression (%) |
+--------------------+--------+----+-----------------+
| socket-0 | 29 | 27 | 6.9 |
| socket-1 | 32 | 30 | 6.25 |
| sys_enter_open | 24 | 20 | 16.67 |
| xdp1_kern-xdp1 | 61 | 56 | 8.2 |
| xdp_cpumap_enqueue | 26 | 22 | 15.38 |
| xdp_devmap_xmit | 36 | 30 | 16.67 |
| xdp_exception | 18 | 16 | 11.11 |
| xdp_fw | 72 | 67 | 6.94 |
| xdp_map_access | 30 | 26 | 13.33 |
| xdp_pktcntr | 22 | 19 | 13.64 |
| xdp_redirect_err | 18 | 16 | 11.11 |
+--------------------+--------+----+-----------------+
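As a sanity check on the table: the Compression column is the relative reduction from the -O2/O3 instruction count to the K2 instruction count. For example, for socket-0, (29 - 27) / 29 ≈ 6.9%.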
+------------------------------------------------------------------+
| Elapsed time for finding programs (sec) |
+--------------------+------------------------------+--------------+
| Benchmark | Time to the smallest program | Overall time |
+--------------------+------------------------------+--------------+
| socket-0 | 1.9 | 30 |
| socket-1 | 1.2 | 38 |
| sys_enter_open | 1266.5 | 1564 |
| xdp1_kern-xdp1 | 585.5 | 671 |
| xdp_cpumap_enqueue | 252.6 | 291 |
| xdp_devmap_xmit | 992.5 | 1178 |
| xdp_exception | 2.7 | 21 |
| xdp_fw | 248.5 | 304 |
| xdp_map_access | 23.0 | 49 |
| xdp_pktcntr | 164.3 | 270 |
| xdp_redirect_err | 78.8 | 178 |
+--------------------+------------------------------+--------------+
Estimated runtime: 2 hours
cd ../2_eq_chk_time
sh eq_chk.sh
Note: the result reproduced on your machine may differ from the result of our run. The key point is that, for each benchmark, the equivalence-checking time with all optimizations I,II,III,IV enabled is smaller than the time with one or more of the optimizations disabled. Here is a sample result that we obtained.
+-----------------------------------------------------------------------------------+
| Time consumption (us) |
+--------------------+-------------+-----------+-----------+-----------+------------+
| Benchmark | I,II,III,IV | I,II,III | I,II | I | None |
+--------------------+-------------+-----------+-----------+-----------+------------+
| xdp_exception | 48887 | 1280730 | 3239100 | 1954550 | 50623600 |
| xdp_redirect_err | 35126 | 884464 | 1594880 | 1599790 | 143212000 |
| xdp_devmap_xmit | 69242 | 297381000 | 419262000 | 634626000 | 2430860000 |
| xdp_cpumap_kthread | 24184 | 111865000 | 103730000 | 125594000 | 605124000 |
| xdp_cpumap_enqueue | 301844 | 125391000 | 147459000 | 192472000 | 716662000 |
| xdp_pktcntr | 4056 | 3549610 | 15479500 | 17264000 | 89449200 |
+--------------------+-------------+-----------+-----------+-----------+------------+
+--------------------------------------------------------+
| Slow down (how many times slower than I,II,III,IV) |
+--------------------+----------+-------+-------+--------+
| Benchmark | I,II,III | I,II | I | None |
+--------------------+----------+-------+-------+--------+
| xdp_exception | 26x | 66x | 40x | 1036x |
| xdp_redirect_err | 25x | 45x | 46x | 4077x |
| xdp_devmap_xmit | 4295x | 6055x | 9165x | 35107x |
| xdp_cpumap_kthread | 4626x | 4289x | 5193x | 25022x |
| xdp_cpumap_enqueue | 415x | 489x | 638x | 2374x |
| xdp_pktcntr | 875x | 3816x | 4256x | 22054x |
+--------------------+----------+-------+-------+--------+
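As a sanity check relating the two tables: each slow-down entry is the ratio of the corresponding time to the I,II,III,IV time for the same benchmark. For example, for xdp_exception with no optimizations, 50623600 / 48887 ≈ 1036x.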
This section describes how to validate the experimental claim about latency/throughput improvements in code produced by K2. This experiment takes the longest time to run among all our experiments. (total estimated machine time: 15 hours, human time: 40 minutes). The high-level steps are as follows.
- Set up an experiment on the NSF CloudLab testbed using credentials that we provided on hotCRP.
- Measure the latency-throughput profile of one benchmark by varying the offered load and measuring the benchmark program's packet-processing rate, packet drop rate, and the round-trip latency of the packets that were successfully processed without dropping. The experiment graphs these quantities as a function of the offered load. We will describe how to obtain the throughput (MLFFR) and the tail latency (latency at the MLFFR) of the benchmark from the corresponding curve.
- Run a long-lived experiment covering multiple benchmarks to obtain similar curves for all variants of benchmark programs.
We are using a different setup from the one used in the paper submission for replicating this experiment. We are using a machine with an Intel architecture (cloudlab instance xl170
) rather than an AMD architecture (cloudlab instance d6515
). This change was made for two reasons.
(1) We consistently found it hard to obtain d6515
machines on CloudLab, due to conflicting reservations from other users. There are just more instances of xl170
available.
(2) We found results from the Intel architecture more repeatable and reliable, with the benefits of additional performance-enhancing features that we plan to use for the final numbers in the camera-ready version of our paper.
For these reasons, the throughput and latency numbers won't exactly match those in Table 2 in the paper submission. Instead, the main claim we wish to validate is that: relative to the best version produced by clang
, code produced by K2 has (1) similar or better throughput (MLFFR), and (2) lower tail latency (i.e., the latency at the MLFFR).
[Estimated Human Time: 30 minutes]
Visit https://cloudlab.us/ and click the "Log in" button. Please use the CloudLab account username and password provided in our SIGCOMM21 artifact hotCRP submission to log into the CloudLab console.
You can choose to start the experiment right away (just click "Finish" on the experiment-creation screen) or schedule it to start at some time in the future.
The default/initial duration for which an experiment may run (i.e., the time that the machines will be available to you) is 16 hours. You can extend an experiment later after the experiment boots up.
You may encounter the following failures/slow cases:
Sometimes, starting an experiment can fail when CloudLab has insufficient resources available. If your experiment fails due to insufficient resources, you can check for future resource availability at https://www.cloudlab.us/resinfo.php -- look for the future availability of machine instances of type "xl170" in the Utah cluster. You need at least 2 available machines for our experiment. You can also make reservations for machines at a future time by following instructions from http://docs.cloudlab.us/reservations.html. Please contact us if you have any difficulty.
If your experiment is successfully scheduled, it might still keep you waiting with the message Please wait while we get your experiments ready
. This can happen sometimes since we use a custom disk image. Please be patient for a few minutes.
Contact us or the CloudLab mailing list (https://groups.google.com/g/cloudlab-users) if you have any difficulties.
The experiment consists of two nodes, labeled node-0 and node-1. Node-0 serves as the device-under-test (DUT), which runs the packet processing programs we're evaluating. Node-1 runs a traffic generator. By sending and receiving from the same machine, the traffic generator measures both the throughput and the round-trip latencies of packets that are processed by the device under test. The detailed setup is described in section 8 of the submitted paper.
- SSH into node-1. You can determine the name of the node-1 and node-0 machines from the CloudLab console (go to "list view" once the experiment is ready)
You must use the private key provided as part of our hotCRP submission to log in to the machine labeled node-1 (hp125.utah.cloudlab.us in the example above). Suppose you've named the file containing the private key (from hotCRP) my.key. Type
ssh -p 22 -i my.key reviewer@hp125.utah.cloudlab.us
Sometimes, this step might fail with the error Permission denied (publickey). If that happens to you, consider adding the private key to your SSH agent using the command ssh-add my.key, and re-try SSHing.
- Once logged into node-1, use your favorite text editor to add the following line to ~/.bash_profile:
export PYTHONPATH=/usr/local/v2.87/automation/trex_control_plane/interactive
- cd /usr/local/trex-configuration/
- Run ./update-scripts.sh. When prompted, enter the following details for your experiment. The user@DUT (device under test) string is reviewer@<insert node-0 name you found above> and the device type should be xl170. For the example shown above, the full exchange looks like the following:
[reviewer@node-1 trex-configuration]$ ./update-scripts.sh
Setting up some information
Enter DUT as (username@machine):reviewer@hp124.utah.cloudlab.us
Enter Device type (d6515 or xl170):xl170
- Exit the session and log into node-1 again.
You will upload the SSH private key provided to you on hotCRP into node-1, so that it can remotely control node-0 (the DUT). Suppose your private key is stored in a file my.key
on your local machine.
- On your own machine, run the command
scp -i my.key my.key reviewer@hp125.utah.cloudlab.us:~/.ssh/id_ed25519
- On node-1, test that you can ssh into node-0. Do not skip this check! On the node-1 machine you're currently logged into, type
ssh -p 22 reviewer@hp124.utah.cloudlab.us
where you will replace hp124.utah.cloudlab.us with the name of the node-0 machine from the CloudLab console. You should be able to connect to node-0. Then, exit out of the node-0 session.
[Total Estimated Machine Time: 1 hour; Human Time: 10 minutes]
Now that the experiment is fully set up, we can stress test the DUT with traffic from the traffic generator at various offered loads (Tx rates) and look at how throughput (Rx rate) and latencies (average latency of packets returned to the traffic generator) vary with the offered load. Some of our benchmarks just drop packets at high speeds (a useful capability in itself, e.g., for DDoS mitigation). For those benchmarks, we can only measure throughput as reported by a program running on the DUT itself. We show both kinds of programs (benchmarks that will forward packets back as well as those that just drop) below. We consider the experiments in this section as "warm-ups" since they tell you how to measure and interpret one trial of one benchmark program; the full set of experiments for all benchmarks are in the next subsection.
Note: All data, logs, and graphs are saved in your home directory on CloudLab. The CloudLab server images don't come with GUIs, so in order to view graphs, you will need to copy the files to your local computer. Instructions are provided below.
[Estimated Machine Run Time: 30 minutes; human time: 1 minute]
- SSH into node-1: e.g.
ssh -p 22 -i my.key reviewer@hp125.utah.cloudlab.us
where my.key is the private SSH key on your local computer and hp125.utah.cloudlab.us is replaced with the name of the node-1 machine in your experiment.
- Change to directory:
cd /usr/local/v2.87
- Start run:
nohup python3 -u run_mlffr.py -b xdp_fwd -v o1 -d xdp_fwd/ -n 1 -c 6 &
This process will run in the background. Just press Enter.
- Check the progress of the logs:
tail -f $HOME/nohup.out
- Once it has completed running (it will say Completed Full Script in the logs; this takes about 30 minutes), generate the graphs:
cd /usr/local/trex-configuration/visualize-data-scripts/
- Generate the throughput graph:
python3 rx_plot.py -d ~/xdp_fwd -v o1 -b xdp_fwd -r 0
The graph will be located in the $HOME/xdp_fwd/rx directory and is called 0.png.
- Generate the latency graph:
python3 latency.py -d ~/xdp_fwd -type avg -v o1 -b xdp_fwd
The graph will be located in the $HOME/xdp_fwd directory and is called o1_avgL.png.
- Copy the graphs and view them on your computer. On your local machine, execute the following commands:
scp -i my.key reviewer@hp025.utah.cloudlab.us:/users/reviewer/xdp_fwd/rx/0.png .
scp -i my.key reviewer@hp025.utah.cloudlab.us:/users/reviewer/xdp_fwd/o1_avgL.png .
where you will replace hp025.utah.cloudlab.us with the name of your corresponding node-1 machine.
Sample graphs that you obtain may look like the following:
The graph on the left-hand side shows the rate at which packets were forwarded (labeled Rx rate
) by the BPF program xdp_fwd
in millions of packets per second per core, plotted against the offered load from the traffic generator (labeled Tx rate
). Here, the variant of xdp_fwd
being evaluated is the BPF bytecode generated by compiling the source with clang -O1
. The graph on the right-hand side shows the average round-trip latency (reported in microseconds) for packets that were processed successfully by xdp_fwd
and returned to the traffic generator. (The average latency does not include dropped packets, which have "infinite" latency).
We define the throughput as the maximum loss-free forwarding rate (MLFFR) (RFC 1242/2544), i.e., the Rx rate after the Rx rate tilts away from its linear growth with respect to the Tx rate and flattens out. The latency at the MLFFR (read off at the corresponding x-value from the graph on the right-hand side) is the tail latency of the program. In the graph above, we might report the throughput of xdp_fwd -O1
to be slightly above 4.9 Mpps with a tail latency of ~75 microseconds, which is the latency at 4.9 Mpps. (These numbers won't exactly match those in Table 2 in the paper, see our caveat). The latency curve tapers off at 200 microseconds, which corresponds to the buffering available (e.g., in the NIC) before packet drops occur.
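If you prefer to read these numbers off programmatically rather than from the graphs, the following stand-alone C sketch shows the idea (the sample values are made up for illustration, and the artifact's CSV layout is not assumed here): the MLFFR is approximately the plateau of the Rx curve, and the tail latency is the latency at the first offered load that reaches that plateau.

#include <stdio.h>

int main(void) {
    /* (tx, rx, latency) samples sorted by increasing offered load (Mpps, Mpps, us) */
    double tx[]  = {1, 2, 3, 4, 5, 6};
    double rx[]  = {1, 2, 3, 4, 4.9, 4.9};
    double lat[] = {20, 22, 25, 30, 75, 200};
    int n = 6;

    double mlffr = 0;
    for (int i = 0; i < n; i++)
        if (rx[i] > mlffr) mlffr = rx[i];                     /* plateau of the Rx curve */

    double tail = 0;
    for (int i = 0; i < n; i++)
        if (rx[i] >= 0.99 * mlffr) { tail = lat[i]; break; }  /* latency at the MLFFR */

    printf("MLFFR ~= %.1f Mpps, tail latency ~= %.0f us\n", mlffr, tail);
    return 0;
}

With the sample values above, this prints an MLFFR of ~4.9 Mpps and a tail latency of ~75 microseconds, matching the reading described in the previous paragraph.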
[Estimated Machine Run Time: 30 minutes; human time: 1 minute]
- SSH into node-1: e.g.
ssh -p 22 -i my.key reviewer@hp125.utah.cloudlab.us
where my.key is your private SSH key on your local computer and hp125.utah.cloudlab.us is replaced with the name of the node-1 machine in your experiment.
- Change to directory:
cd /usr/local/v2.87
- Start run:
nohup python3 -u run_mlffr_user.py -b xdp_map_access -v o1 -d xdp_map -n 1 -c 6 > $HOME/map.txt &
This process will run in the background; just press Enter.
- Check the progress of the logs:
tail -f $HOME/map.txt
- Once it has completed running (it will say Completed Full Script in the logs), generate the graphs. The logs are located on node-0.
- SSH into node-0: e.g.
ssh -p 22 -i my.key reviewer@hp124.utah.cloudlab.us
cd /usr/local/trex-configuration/visualize-data-scripts/
- Generate the throughput graph:
python3 generate_user_graphs.py -d ~/xdp_map -v o1 -b xdp_map_access -r 0
The graph will be located in the $HOME/xdp_map/rx/ directory and is called 0.png.
- Copy the graph and view it on your computer. Execute the following on your LOCAL computer:
scp -i my.key reviewer@hp024.utah.cloudlab.us:/users/reviewer/xdp_map/rx/0.png .
where you will replace hp024.utah.cloudlab.us with the name of your node-0 machine.
You may obtain a graph that looks like this. This throughput measurement is reported by a program running directly on the DUT, and not on the traffic generator. Since this only consists of one run, the actual throughput may vary up and down around a certain value, as offered load increases. The throughputs stabilize with more trials. With this single run, we would record the MLFFR throughput at ~15.0 Mpps.
[Total estimated machine time: 14 hours]
[Estimated machine run time: 8 hours; human time: 10 minutes]
- SSH into node-1: e.g.
ssh -p 22 -i my.key reviewer@hp125.utah.cloudlab.us
where my.key is your private SSH key on your local computer and hp125.utah.cloudlab.us is replaced with the name of the node-1 machine in your experiment.
- Change to directory:
cd /usr/local/v2.87
- Start run:
nohup python3 -u run_mlffr.py -b xdp_fwd -d xdp_fwd_all -n 3 -c 6 > $HOME/xdp_fwd_log.txt &
This process will run in the background; just press Enter.
- Check the progress of the logs:
tail -f $HOME/xdp_fwd_log.txt
- Once it has completed running (it will say Completed Full Script in the logs), generate the graphs:
cd /usr/local/trex-configuration/visualize-data-scripts/
- Generate the throughput and latency graphs:
python3 generate_graphs.py -d ~/xdp_fwd_all -b xdp_fwd -r 3
- The latency graph is located in ~/xdp_fwd_all/avgL/avg.png and the throughput graph is located in ~/xdp_fwd_all/rx/avg.png.
The results are the average of three trials. The raw data is located in ~/xdp_fwd_all/avgL-data.csv
and ~/xdp_fwd_all/rx-data.csv
. You may see graphs that look like this:
(If needed, see how to interpret throughput-latency graphs for a refresher on understanding one curve from this graph.) From the graphs above, we would write down the MLFFR throughput of the -O1
variant of xdp_fwd
as 4.9 Mpps. With the resolution available in the Tx rates on the x-axis, the MLFFR of the other variants (-O2
and all the variants produced by K2, labeled kX
) is also roughly 4.9 Mpps. The latency at the MLFFR of the best clang variant is ~57 microseconds (for -O2
) and that of the best K2 variant (for k2
) is ~50 microseconds.
[Estimated machine run Time: 6 hours, human time: 10 minutes]
- SSH into node-1: e.g.
ssh -p 22 -i my.key reviewer@hp125.utah.cloudlab.us
where my.key is your private SSH key on your local computer and hp125.utah.cloudlab.us is replaced with the name of the node-1 machine in your experiment.
- Change to directory:
cd /usr/local/v2.87
- Start run:
nohup python3 -u run_mlffr_user.py -b xdp_map_access -d xdp_map_all -n 3 -c 6 > $HOME/map_all.txt &
This process will run in the background; just press Enter.
- Check the progress of the logs:
tail -f $HOME/map_all.txt
- Once it has completed running (it will say Completed Full Script in the logs), generate the graphs:
cd /usr/local/trex-configuration/visualize-data-scripts/
- Generate the throughput graphs:
python3 generate_user_graphs.py -d ~/xdp_map_all -b xdp_map_access -r 3 -average
- The throughput graph is located in ~/xdp_map_all/rx/avg.png.
You might see a result that looks like this. These are averages over 3 trials. The raw data is in ~/xdp_map_all/avg_parsed_data.csv.
The peak throughput of the best K2 variant is ~15.5 Mpps (k1
) and that of the best clang
variant (O1
) is ~15 Mpps.
Note: this experiment runs on the NSF CloudLab testbed. If you haven't finished setting up the CloudLab experiment, please do that first.
This section evaluates whether the programs optimized by K2 can pass the kernel checker. We test this by
loading
the BPF programs into the kernel. If the program can be successfully loaded, it means this program passes
the kernel checker. Otherwise, we will receive an error from the kernel checker. Since it takes a long time
(more than 12 hours) for K2 to optimize all benchmarks in table 7, we directly use the optimized programs
from the compiler logs
produced by K2 when we submitted the manuscript.
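If you want to exercise the kernel checker on a single object file by hand (outside of the script used below), one common way is bpftool; for example (illustrative only, with placeholder file names; depending on the object file you may also need to pass the program type explicitly):

sudo bpftool prog load prog.o /sys/fs/bpf/prog_test
sudo rm /sys/fs/bpf/prog_test     # unpin after the check

A successful load means the program passed the checker; a rejection prints the checker's error log.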
Note.
There is a typo in Table 7 in the submitted paper. K2 produced 3 unique programs for benchmark
xdp_pktcntr
instead of 5. We will fix this error in the camera-ready version.
Estimated runtime: 2 minutes 30 seconds
SSH into node-0 (suppose you are currently on the node-1 machine):
ssh -p 22 reviewer@hp124.utah.cloudlab.us
where you will replace hp124.utah.cloudlab.us with the name of the node-0 machine from the CloudLab console.
cd /usr/local/trex-configuration/safety_chk
sudo python3 safety_chk.py
+-----------------+---------------------+------------------------------+
| Benchmark | # variants produced | # accepted by kernel checker |
+-----------------+---------------------+------------------------------+
| xdp2 | 5 | 4 |
| xdp_fw | 5 | 5 |
| xdp_router_ipv4 | 5 | 4 |
| xdp_fwd | 5 | 5 |
| xdp_pktcntr | 3 | 3 |
| xdp1 | 5 | 5 |
| xdp_map_access | 5 | 5 |
| xdp_redirect | 5 | 4 |
+-----------------+---------------------+------------------------------+
The End.