Implement configurable Network Instance MTU
The user is now able to set the MTU for the network instance bridge and all
application interfaces connected to it.

MTU determines the largest IP packet that the network
instance is allowed to carry. This does not include the L2 header size
(e.g. Ethernet header or a VLAN tag size). The value is a 16-bit
unsigned integer, representing the MTU size in bytes. The minimum
accepted value for the MTU is 1280 (RFC 8200, "IPv6 minimum link MTU").
If not defined (zero value), EVE will set the MTU to the default value
of 1500 bytes.

On the host side, EVE sets the MTU on interfaces. On the guest (app)
side, the responsibility to set the MTU lies either with EVE or with
the user/app, depending on the network instance, app type and the type of
interfaces used (local or switch, VM or container, virtio or something
else).

For container applications running inside an EVE-created shim-VM, EVE
initializes the MTU of interfaces during shim-VM boot. MTUs of all
interfaces are passed to the VM via kernel boot arguments (/proc/cmdline).
The init script parses out these values and applies them to application
interfaces (excluding direct assignments).
Furthermore, interfaces connected to local network instances will have
their MTUs automatically updated using DHCP if there is a change in MTU
configuration. To update the MTU of interfaces connected to switch
network instances, the user may run an external DHCP server in the network
and publish MTU changes via DHCP option 26 (the DHCP client run by EVE
inside the shim-VM will pick them up and apply them).

In the case of VM applications, it is mostly the responsibility of the
app/user to set and keep the MTUs up-to-date.
When the device provides HW-assisted virtualization capabilities, EVE (with
the kvm hypervisor) connects the VM with network instances using
para-virtualized virtio interfaces, which allow the MTU value to be
propagated from the host to the guest. If the virtio driver used by the app
supports MTU propagation (the VIRTIO_NET_F_MTU feature flag is set), the
initial MTU values will be set using virtio (regardless of the network
instance type).

To support MTU update for interfaces connected to local network instances,
the app can run a DHCP client and receive the latest MTU via DHCP option 26.
For switch network instances, the user can run their own external DHCP server
in the network with the MTU option configured.

For other hypervisors, DHCP-based MTU propagation is also available but
other options are limited:
- xen's VIF driver does not support MTU propagation from host to guest
- with kubernetes, the MTU value (initially) set on the VETH connecting
  pod with a network instance is propagated further into the VM by the
  kubevirt. However, kubevirt lacks the capability to detect MTU changes
  and propagate them to the VM.

Please note that application traffic leaving or entering the device
via a network adapter associated with the network instance is additionally
limited by the MTU value of the adapter, configured within the NetworkConfig
object. If the configured network instance MTU differs from the network
adapter MTU, EVE will flag the network instance with an error and use the
adapter's MTU for the network instance instead (to prevent traffic from
being dropped or fragmented inside EVE).

A significant part of this commit is also a refactoring of network instance
error management. There are different kinds of errors that an NI can be
flagged with. Some of those errors are critical and prevent the NI from being
created, while others can be ignored to some extent or might be transient.
It is difficult to manage all these possible error scenarios with only
one error attribute in NetworkInstanceStatus. Therefore, I have split the
error field into multiple attributes, one for each kind of error. This
significantly simplifies the error management while adding only a few new
fields to the structure.
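
The idea of one error attribute per error kind can be sketched roughly as
follows. This is a simplified illustration, not the actual EVE types: the
errorDescription and niStatus types are stand-ins, and the rule inside
eligibleForActivate (uplink and IP-conflict errors block activation, the others
do not) is an assumption made only for this example.

```go
package main

import (
	"fmt"
	"time"
)

// errorDescription is a simplified stand-in for the error type used by EVE.
type errorDescription struct {
	Error     string
	ErrorTime time.Time
}

func (e *errorDescription) SetErrorNow(msg string) {
	e.Error = msg
	e.ErrorTime = time.Now()
}

func (e *errorDescription) ClearError() { *e = errorDescription{} }

func (e *errorDescription) HasError() bool { return e.Error != "" }

// niStatus keeps one error attribute per kind of error, so each kind can be
// set and cleared independently of the others.
type niStatus struct {
	UplinkErr      errorDescription
	MTUConflictErr errorDescription
	IPConflictErr  errorDescription
	ReconcileErr   errorDescription
}

// eligibleForActivate encodes an ASSUMED policy for this sketch only:
// uplink and IP-conflict errors block activation, the others do not.
func (s *niStatus) eligibleForActivate() bool {
	return !s.UplinkErr.HasError() && !s.IPConflictErr.HasError()
}

func main() {
	var s niStatus
	s.MTUConflictErr.SetErrorNow("NI MTU differs from adapter MTU")
	fmt.Println("eligible with MTU conflict:", s.eligibleForActivate())
	s.UplinkErr.SetErrorNow("no selected uplink port")
	fmt.Println("eligible with uplink error:", s.eligibleForActivate())
	s.UplinkErr.ClearError() // clearing one kind leaves the others untouched
	fmt.Println("MTU conflict still set:", s.MTUConflictErr.HasError())
}
```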

Signed-off-by: Milan Lenco <milan@zededa.com>
milan-zededa authored and eriknordmark committed Jun 25, 2024
1 parent 865058c commit 1b13ae7
Showing 22 changed files with 640 additions and 181 deletions.
83 changes: 83 additions & 0 deletions docs/APP-CONNECTIVITY.md
@@ -322,3 +322,86 @@ propagated by DHCP to connected applications, unless network instance is air-gap
(without uplink) or the uplink is app-shared (not management) and does not have a default
route of its own. In both cases, it is possible to enforce default route propagation
by configuring a static default route for the network instance.

### Network Instance MTU

The user can adjust the Maximum Transmission Unit (MTU) size of the network instance
bridge and all application interfaces connected to it.
MTU determines the largest IP packet that the network instance is allowed to carry.
A smaller MTU value is often used to avoid packet fragmentation when some form of packet
encapsulation is being applied, while a larger MTU reduces the overhead associated with
packet headers, improves network efficiency, and increases throughput by allowing more
data to be transmitted in each packet (known as a jumbo frame).

EVE uses the L3 MTU, meaning the value does not include the L2 header size (e.g., Ethernet
header or VLAN tag size). The value is a 16-bit unsigned integer, representing the MTU size
in bytes. The minimum accepted value for the MTU is 1280, which is the minimum link MTU
needed to carry an IPv6 packet (see RFC 8200, "IPv6 minimum link MTU"). If the MTU for
a network instance is not defined (zero value), EVE will set the default MTU size of 1500
bytes.
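
The validation and defaulting rules above can be sketched as a small helper.
This is only an illustration of the documented behavior; the function name
`effectiveMTU` and the constant names are hypothetical and not taken from the
EVE code base.

```go
package main

import "fmt"

const (
	defaultMTU uint16 = 1500 // applied when the configured value is zero
	minMTU     uint16 = 1280 // IPv6 minimum link MTU (RFC 8200)
)

// effectiveMTU (hypothetical helper) maps a configured L3 MTU to the value
// that would be applied: zero means "use the default", values below the
// minimum are rejected, everything else is used as-is.
func effectiveMTU(configured uint16) (uint16, error) {
	if configured == 0 {
		return defaultMTU, nil
	}
	if configured < minMTU {
		return 0, fmt.Errorf("MTU %d is below the minimum of %d", configured, minMTU)
	}
	return configured, nil
}

func main() {
	for _, mtu := range []uint16{0, 1279, 1280, 9000} {
		got, err := effectiveMTU(mtu)
		fmt.Printf("configured=%d effective=%d err=%v\n", mtu, got, err)
	}
}
```

Using `uint16` mirrors the 16-bit value range described above, so values above
65535 cannot be expressed at all.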

On the host side, EVE sets the MTU on the bridge and app VIFs. On the guest (application)
side, the responsibility to set the MTU lies either with EVE or with the user/app,
depending on the network instance type (local or switch), app type (VM or container)
and the type of interfaces used (virtio or something else).

#### Container App VIF MTU

For container applications running inside an EVE-created shim-VM, EVE initializes the MTU
of interfaces during the shim-VM boot. MTUs of all interfaces are passed to the VM via kernel
boot arguments (/proc/cmdline). The init script parses out these values and applies them
to application interfaces (excluding direct assignments).
Furthermore, interfaces connected to local network instances will have their MTUs
automatically updated using DHCP if there is a change in the MTU configuration. To update
the MTU of interfaces connected to switch network instances, the user may run an external
DHCP server in the network and publish MTU changes via DHCP option 26 (the DHCP client
run by EVE inside the shim-VM will pick them up and apply them).
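
As a rough illustration of the boot-argument hand-off, the sketch below builds
an ` mtu=<m1>,<m2>,...` argument (the format this commit appends to the domain's
extra arguments) and parses it back from a command line. The parser is
hypothetical and written in Go only for illustration; it is not the init script
that runs inside the shim-VM.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildMTUArg renders the per-interface MTUs as " mtu=<m1>,<m2>,...".
func buildMTUArg(mtus []uint16) string {
	if len(mtus) == 0 {
		return ""
	}
	strs := make([]string, len(mtus))
	for i, mtu := range mtus {
		strs[i] = strconv.Itoa(int(mtu))
	}
	return " mtu=" + strings.Join(strs, ",")
}

// parseMTUArg (hypothetical) extracts the MTU list from a kernel command line.
// It returns nil if no mtu= argument is present.
func parseMTUArg(cmdline string) ([]uint16, error) {
	for _, field := range strings.Fields(cmdline) {
		if !strings.HasPrefix(field, "mtu=") {
			continue
		}
		var mtus []uint16
		for _, s := range strings.Split(strings.TrimPrefix(field, "mtu="), ",") {
			n, err := strconv.ParseUint(s, 10, 16)
			if err != nil {
				return nil, fmt.Errorf("invalid MTU %q: %w", s, err)
			}
			mtus = append(mtus, uint16(n))
		}
		return mtus, nil
	}
	return nil, nil
}

func main() {
	// "console=ttyS0" and "quiet" are placeholder arguments for the example.
	cmdline := "console=ttyS0" + buildMTUArg([]uint16{1500, 9000}) + " quiet"
	mtus, err := parseMTUArg(cmdline)
	fmt.Println(cmdline, mtus, err)
}
```

The position of each value in the list corresponds to the position of the
interface in the app's adapter list, which is why the order must be preserved.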

#### VM App VIF MTU

In the case of VM applications, it is mostly the responsibility of the app/user to set
and keep the MTUs up-to-date. When the device provides HW-assisted virtualization capabilities,
EVE (with the kvm or kubevirt hypervisor) connects the VM with network instances using
para-virtualized virtio interfaces, which allow the MTU value to be propagated from the host
to the guest. If the virtio driver used by the app supports MTU propagation, the initial
MTU values will be set using virtio (regardless of the network instance type).

To determine whether the virtio driver used by an app supports MTU propagation, check
whether the `VIRTIO_NET_F_MTU` feature flag is reported as `1`.
Given that:

```c
#define VIRTIO_NET_F_MTU 3
```
Check the feature flag with (replace `enp1s0` with your interface name):
```sh
# character positions of "cut -c" start at 1, hence bit 3 is at position 3+1
cat /sys/class/net/enp1s0/device/features | cut -c 4
1 # if not supported, prints 0 instead
```
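
The same check can be done programmatically. The sketch below assumes the same
sysfs layout as the shell command above (a string of `0`/`1` characters with
bit 0 first); the helper name `hasFeature` and the interface name are
placeholders.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const virtioNetFMTU = 3 // bit index of VIRTIO_NET_F_MTU

// hasFeature reports whether the given bit is set in a virtio "features"
// string, where character i (0-based) represents feature bit i.
func hasFeature(features string, bit int) bool {
	features = strings.TrimSpace(features)
	return bit >= 0 && bit < len(features) && features[bit] == '1'
}

func main() {
	ifName := "enp1s0" // replace with your interface name
	data, err := os.ReadFile("/sys/class/net/" + ifName + "/device/features")
	if err != nil {
		fmt.Println("cannot read virtio features:", err)
		return
	}
	fmt.Println("VIRTIO_NET_F_MTU supported:", hasFeature(string(data), virtioNetFMTU))
}
```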

Please note that with the Xen hypervisor, Xen's VIF driver does not support MTU
propagation from host to guest.

To support MTU changes at run time for interfaces connected to local network instances,
a VM app can run a DHCP client and receive the latest MTU via DHCP option 26.
For switch network instances, the user can run their own external DHCP server in the network
with the MTU option configured.

With Kubevirt, changing the MTU after the VMI is deployed is not possible. This is because
the bridge and the (virtio) TAP created by Kubevirt to connect the pod interface (VETH) with
the VMI interface are fully managed by Kubevirt, which lacks the ability to detect and apply
MTU changes. This means that even if the app updates the MTU on its side (e.g. using DHCP),
the path MTU may differ because the connection between the VMI and the underlying pod will
continue using the old MTU value.
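
For reference, DHCP option 26 ("Interface MTU", RFC 2132) carries the MTU as
a 2-byte big-endian value. The sketch below encodes and decodes just that
option; the function names are hypothetical, and a real DHCP client or server
would normally rely on a DHCP library instead.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const optInterfaceMTU = 26 // DHCP option code for "Interface MTU"

// encodeMTUOption returns the option as code, length (always 2), value.
func encodeMTUOption(mtu uint16) []byte {
	opt := []byte{optInterfaceMTU, 2, 0, 0}
	binary.BigEndian.PutUint16(opt[2:], mtu)
	return opt
}

// decodeMTUOption parses a single, complete option 26 TLV.
func decodeMTUOption(opt []byte) (uint16, error) {
	if len(opt) != 4 || opt[0] != optInterfaceMTU || opt[1] != 2 {
		return 0, errors.New("not a well-formed interface MTU option")
	}
	return binary.BigEndian.Uint16(opt[2:]), nil
}

func main() {
	opt := encodeMTUOption(1500)
	mtu, err := decodeMTUOption(opt)
	fmt.Printf("% x -> %d (err=%v)\n", opt, mtu, err)
}
```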

#### Network Instance MTU vs. Network Adapter MTU

Please note that application traffic leaving or entering the device via a network
adapter associated with the network instance is additionally limited by the MTU value
of the adapter, configured within the NetworkConfig object. If the configured network
instance MTU differs from the network adapter MTU, EVE will flag the network instance
with an error and use the adapter's MTU for the network instance instead (to prevent
traffic from being dropped or fragmented inside EVE).
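
The fallback rule can be sketched as below. This is a minimal illustration of
the behavior described in this section, not EVE's implementation; the function
name `resolveNIMTU` is hypothetical, and treating a zero adapter MTU as the
default of 1500 is an assumption made for the example.

```go
package main

import "fmt"

const defaultMTU uint16 = 1500

// resolveNIMTU (hypothetical) returns the MTU to apply to the network
// instance. On a mismatch it falls back to the adapter MTU and returns an
// error that the caller can attach to the network instance status.
func resolveNIMTU(niMTU, adapterMTU uint16) (uint16, error) {
	if niMTU == 0 {
		niMTU = defaultMTU
	}
	if adapterMTU == 0 {
		adapterMTU = defaultMTU // assumption: zero means "default" here too
	}
	if niMTU != adapterMTU {
		return adapterMTU, fmt.Errorf(
			"network instance MTU (%d) differs from adapter MTU (%d); using %d",
			niMTU, adapterMTU, adapterMTU)
	}
	return niMTU, nil
}

func main() {
	mtu, err := resolveNIMTU(9000, 1500)
	fmt.Println(mtu, err)
}
```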
8 changes: 7 additions & 1 deletion pkg/pillar/cmd/zedmanager/handledomainmgr.go
@@ -7,6 +7,8 @@ import (
     "errors"
     "fmt"
     "runtime"
+    "strconv"
+    "strings"

     "github.com/lf-edge/eve/pkg/pillar/types"
 )
@@ -124,10 +126,14 @@ func MaybeAddDomainConfig(ctx *zedmanagerContext,
     }
     if ns != nil {
         adapterCount := len(ns.AppNetAdapterList)
-
         dc.VifList = make([]types.VifConfig, adapterCount)
+        mtuStrList := make([]string, adapterCount)
         for i, adapter := range ns.AppNetAdapterList {
             dc.VifList[i] = adapter.VifInfo.VifConfig
+            mtuStrList[i] = strconv.Itoa(int(adapter.MTU))
         }
+        if dc.IsOCIContainer() && adapterCount > 0 {
+            dc.ExtraArgs += " mtu=" + strings.Join(mtuStrList, ",")
+        }
     }
     log.Functionf("MaybeAddDomainConfig done for %s", key)
1 change: 1 addition & 0 deletions pkg/pillar/cmd/zedrouter/appnetwork.go
@@ -95,6 +95,7 @@ func (z *zedrouter) prepareConfigForVIFs(config types.AppNetworkConfig,
         adapterStatus.Mac = z.generateAppMac(adapterNum, status, netInstStatus)
     }
     adapterStatus.HostName = config.Key()
+    adapterStatus.MTU = netInstStatus.MTU
     guestIP, err := z.lookupOrAllocateIPv4ForVIF(
         netInstStatus, *adapterStatus, status.UUIDandVersion.UUID)
     if err != nil {
92 changes: 60 additions & 32 deletions pkg/pillar/cmd/zedrouter/networkinstance.go
@@ -66,7 +66,8 @@ func (z *zedrouter) getNIBridgeConfig(
         MACAddress: status.BridgeMac,
         IPAddress: ipAddr,
         Uplink: z.getNIUplinkConfig(status),
-        IPConflict: status.IPConflict,
+        IPConflict: status.IPConflictErr.HasError(),
+        MTU: status.MTU,
     }
 }

@@ -88,6 +89,7 @@ func (z *zedrouter) getNIUplinkConfig(
         LogicalLabel: port.Logicallabel,
         IfName: ifName,
         IsMgmt: port.IsMgmt,
+        MTU: port.MTU,
         DNSServers: types.GetDNSServers(*z.deviceNetworkStatus, ifName),
         NTPServers: types.GetNTPServers(*z.deviceNetworkStatus, ifName),
     }
@@ -96,74 +98,102 @@ func (z *zedrouter) getNIUplinkConfig(
 // Update NI status and set interface name of the selected uplink
 // referenced by a logical label.
 func (z *zedrouter) setSelectedUplink(uplinkLogicalLabel string,
-    status *types.NetworkInstanceStatus) (waitForUplink bool, err error) {
+    status *types.NetworkInstanceStatus) error {
     if status.PortLogicalLabel == "" {
         // Air-gapped
         status.SelectedUplinkLogicalLabel = ""
         status.SelectedUplinkIntfName = ""
-        return false, nil
+        return nil
     }
     status.SelectedUplinkLogicalLabel = uplinkLogicalLabel
     if uplinkLogicalLabel == "" {
         status.SelectedUplinkIntfName = ""
         // This is potentially a transient state, wait for DPC update
         // and uplink probing eventually finding a suitable uplink port.
-        return true, fmt.Errorf("no selected uplink port")
+        return fmt.Errorf("no selected uplink port")
     }
     ports := z.deviceNetworkStatus.GetPortsByLogicallabel(uplinkLogicalLabel)
     switch len(ports) {
     case 0:
-        err = fmt.Errorf("label of selected uplink (%s) does not match any port (%v)",
+        err := fmt.Errorf("label of selected uplink (%s) does not match any port (%v)",
             uplinkLogicalLabel, ports)
         // Wait for DPC update
-        return true, err
+        return err
     case 1:
         if ports[0].InvalidConfig {
-            return false, fmt.Errorf("port %s has invalid config: %s", ports[0].Logicallabel,
+            return fmt.Errorf("port %s has invalid config: %s", ports[0].Logicallabel,
                 ports[0].LastError)
         }
         // Selected port is OK
         break
     default:
-        err = fmt.Errorf("label of selected uplink matches multiple ports (%v)", ports)
-        return false, err
+        // Note: soon we will support NI with multiple ports.
+        err := fmt.Errorf("label of selected uplink matches multiple ports (%v)", ports)
+        return err
     }
     ifName := ports[0].IfName
     status.SelectedUplinkIntfName = ifName
     ifIndex, exists, _ := z.networkMonitor.GetInterfaceIndex(ifName)
     if !exists {
         // Wait for uplink interface to appear in the network stack.
-        return true, fmt.Errorf("missing uplink interface '%s'", ifName)
+        return fmt.Errorf("missing uplink interface '%s'", ifName)
     }
     if status.IsUsingUplinkBridge() {
         _, ifMAC, _ := z.networkMonitor.GetInterfaceAddrs(ifIndex)
         status.BridgeMac = ifMAC
     }
-    return false, nil
+    return nil
 }

 // This function is called on DPC update or when UplinkProber changes uplink port
 // selected for network instance.
 func (z *zedrouter) doUpdateNIUplink(uplinkLogicalLabel string,
     status *types.NetworkInstanceStatus, config types.NetworkInstanceConfig) {
-    waitForUplink, err := z.setSelectedUplink(uplinkLogicalLabel, status)
-    if err != nil {
+
+    // Update association between the NI and the selected device port.
+    uplinkErr := z.setSelectedUplink(uplinkLogicalLabel, status)
+    if uplinkErr == nil && status.UplinkErr.HasError() {
+        // Uplink issue was resolved.
+        status.UplinkErr.ClearError()
+        z.publishNetworkInstanceStatus(status)
+    }
+    if uplinkErr != nil &&
+        uplinkErr.Error() != status.UplinkErr.Error {
+        // New uplink issue arose or the error has changed.
         z.log.Errorf("doUpdateNIUplink(%s) for %s failed: %v", uplinkLogicalLabel,
-            status.UUID, err)
-        status.SetErrorNow(err.Error())
+            status.UUID, uplinkErr)
+        status.UplinkErr.SetErrorNow(uplinkErr.Error())
+        z.publishNetworkInstanceStatus(status)
+    }
+
+    // Re-check MTUs between the NI and the port.
+    fallbackMTU, mtuErr := z.checkNetworkInstanceMTUConflicts(config, status)
+    if mtuErr == nil && status.MTUConflictErr.HasError() {
+        // MTU conflict was resolved.
+        status.MTUConflictErr.ClearError()
+        if config.MTU == 0 {
+            status.MTU = types.DefaultMTU
+        } else {
+            status.MTU = config.MTU
+        }
         z.publishNetworkInstanceStatus(status)
-        return
     }
+    if mtuErr != nil &&
+        mtuErr.Error() != status.MTUConflictErr.Error {
+        // New MTU conflict arose or the error has changed.
+        z.log.Error(mtuErr)
+        status.MTUConflictErr.SetErrorNow(mtuErr.Error())
+        status.MTU = fallbackMTU
+        z.publishNetworkInstanceStatus(status)
+    }
+
+    // Apply uplink/MTU changes in the network stack.
     if status.Activated {
         z.doUpdateActivatedNetworkInstance(config, status)
     }
-    if status.WaitingForUplink && !waitForUplink {
-        status.WaitingForUplink = false
-        status.ClearError()
-        if config.Activate && !status.Activated {
-            z.doActivateNetworkInstance(config, status)
-            z.checkAndRecreateAppNetworks(status.UUID)
-        }
+    if config.Activate && !status.Activated && status.EligibleForActivate() {
+        z.doActivateNetworkInstance(config, status)
+        z.checkAndRecreateAppNetworks(status.UUID)
     }
     z.publishNetworkInstanceStatus(status)
 }
@@ -175,7 +205,7 @@ func (z *zedrouter) doActivateNetworkInstance(config types.NetworkInstanceConfig
         z.runCtx, config, z.getNIBridgeConfig(status))
     if err != nil {
         z.log.Errorf("Failed to activate network instance %s: %v", status.UUID, err)
-        status.SetErrorNow(err.Error())
+        status.ReconcileErr.SetErrorNow(err.Error())
         z.publishNetworkInstanceStatus(status)
         return
     }
@@ -203,7 +233,7 @@ func (z *zedrouter) doInactivateNetworkInstance(status *types.NetworkInstanceSta
     niRecStatus, err := z.niReconciler.DelNI(z.runCtx, status.UUID)
     if err != nil {
         z.log.Errorf("Failed to deactivate network instance %s: %v", status.UUID, err)
-        status.SetErrorNow(err.Error())
+        status.ReconcileErr.SetErrorNow(err.Error())
         z.publishNetworkInstanceStatus(status)
         return
     }
@@ -221,7 +251,7 @@ func (z *zedrouter) doUpdateActivatedNetworkInstance(config types.NetworkInstanc
     if err != nil {
         z.log.Errorf("Failed to update activated network instance %s: %v",
             status.UUID, err)
-        status.SetErrorNow(err.Error())
+        status.ReconcileErr.SetErrorNow(err.Error())
         z.publishNetworkInstanceStatus(status)
         return
     }
@@ -314,17 +344,16 @@ func (z *zedrouter) checkAllNetworkInstanceIPConflicts() {
             continue
         }
         conflictErr := z.checkNetworkInstanceIPConflicts(niConfig)
-        if conflictErr == nil && niStatus.IPConflict {
+        if conflictErr == nil && niStatus.IPConflictErr.HasError() {
             // IP conflict was resolved.
+            niStatus.IPConflictErr.ClearError()
             if niStatus.Activated {
                 // Local NI was initially activated prior to the IP conflict.
                 // Subsequently, when the IP conflict arose, it was almost completely
                 // un-configured (only preserving app VIFs) to keep device connectivity
                 // unaffected. Now, it can be restored to full functionality.
                 z.log.Noticef("Updating NI %s (%s) now that IP conflict "+
                     "is not present anymore", niConfig.UUID, niConfig.DisplayName)
-                niStatus.IPConflict = false
-                niStatus.ClearError()
                 // This also publishes the new status.
                 z.doUpdateActivatedNetworkInstance(*niConfig, &niStatus)
             } else {
@@ -338,11 +367,10 @@ func (z *zedrouter) checkAllNetworkInstanceIPConflicts() {
                 z.handleNetworkInstanceCreate(nil, niConfig.Key(), *niConfig)
             }
         }
-        if conflictErr != nil && !niStatus.IPConflict {
+        if conflictErr != nil && !niStatus.IPConflictErr.HasError() {
             // New IP conflict arose.
             z.log.Error(conflictErr)
-            niStatus.IPConflict = true
-            niStatus.SetErrorNow(conflictErr.Error())
+            niStatus.IPConflictErr.SetErrorNow(conflictErr.Error())
             z.publishNetworkInstanceStatus(&niStatus)
             if niStatus.Activated {
                 // Local NI is already activated. Instead of removing it and halting