DAOS-16636 cart: force port range for tcp provider #15209

Draft: wants to merge 11 commits into base: master
8 changes: 4 additions & 4 deletions docs/QSG/build_from_scratch.md
@@ -81,10 +81,10 @@ the user outside of a virtual environment, in which case `~/.local/bin` will nee
PATH.

```bash
- $ python3 -m venv venv
- $ source venv/bin/activate
- $ python3 -m pip --no-cache-dir install --upgrade pip
- $ python3 -m pip install -r requirements-build.txt
+ $ python3 -m venv venv
+ $ source venv/bin/activate
+ $ python3 -m pip --no-cache-dir install --upgrade pip
+ $ python3 -m pip install -r requirements-build.txt
```

## Build DAOS
30 changes: 27 additions & 3 deletions docs/admin/predeployment_check.md
@@ -87,9 +87,9 @@ The DAOS Agent (running on the client nodes) is responsible for resolving a user
UID/GID to user/group names, which are then added to a signed credential and sent to
the DAOS storage nodes.

- ## HPC Fabric setup
+ ## Network Setup

- DAOS depends on the HPC fabric software stack and drivers. Depending on the type of HPC fabric
+ DAOS depends on the network fabric software stack and drivers. Depending on the type of fabric
that is used, a supported version of the fabric stack needs to be installed.

Note that for InfiniBand fabrics, DAOS is only supported with the MLNX\_OFED stack that is
@@ -162,9 +162,33 @@ Some distributions install a firewall as part of the base OS installation. DAOS
for its management service. If this port is blocked by firewall rules, neither `dmg` nor the
`daos_agent` on a remote node will be able to contact the DAOS server(s).

- Either configure the firewall to allow traffic for this port, or disable the firewall
If telemetry is enabled in the server configuration file, the telemetry port (9191 by default)
must also be accessible on the DAOS server nodes.

Depending on the provider used, each engine might also listen on a range of ports, as is
the case for the tcp provider. The range starts at the `fabric_iface_port` specified in the
server YAML file and uses two ports for management plus one port per target and one per helper xstream.
For instance, with `fabric_iface_port` set to 20000, 16 targets, and 4 helper xstreams, the engine
listens on ports 20000 through 20021, for a total of 22 ports.
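
The arithmetic above can be sketched as a small helper. This is only an illustration of the rule stated in the text (two management ports, then one port per target and per helper xstream); `struct port_range` and `engine_port_range()` are hypothetical names and are not part of DAOS.

```c
#include <stdint.h>

/* Hypothetical type, for illustration only: an inclusive port range. */
struct port_range {
	uint32_t low;   /* first port the engine listens on */
	uint32_t high;  /* last port the engine listens on */
	uint32_t count; /* number of ports in the range */
};

/*
 * Hypothetical helper: derive the engine port range from the server YAML
 * settings, following the rule described in the documentation text:
 * 2 management ports + 1 per target + 1 per helper xstream.
 */
static struct port_range
engine_port_range(uint32_t fabric_iface_port, uint32_t targets, uint32_t nr_xs_helpers)
{
	struct port_range r;

	r.count = 2 + targets + nr_xs_helpers;
	r.low   = fabric_iface_port;
	r.high  = fabric_iface_port + r.count - 1;
	return r;
}
```

Calling `engine_port_range(20000, 16, 4)` reproduces the example from the text: 22 ports, from 20000 through 20021.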

Moreover, there are cases where an engine might have to initiate a connection to a running application.
In such cases, inbound connections from the storage nodes to the compute nodes must be allowed.
With the tcp provider, the default port range used by applications is 20100-21100. This range can be
changed by setting the `FI_TCP_PORT_LOW_RANGE` and `FI_TCP_PORT_HIGH_RANGE` environment variables
before running the application.
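
As a rough sketch, an application (or a launcher wrapper) could export a custom range itself before any DAOS initialization takes place. `set_client_port_range()` is a hypothetical helper written for this example; only the two environment variable names come from the text, and the bounds check is an assumption about what a sensible range looks like.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#endif
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: export a custom client port range. It must run before
 * the application initializes DAOS so that the values are seen at startup.
 * Returns 0 on success, -1 on an invalid range or a setenv() failure.
 */
static int
set_client_port_range(unsigned int low, unsigned int high)
{
	char buf[16];

	/* Assumed sanity check: valid TCP ports, low bound not above high bound. */
	if (low == 0 || high > 65535 || low > high)
		return -1;

	snprintf(buf, sizeof(buf), "%u", low);
	if (setenv("FI_TCP_PORT_LOW_RANGE", buf, 1) != 0)
		return -1;

	snprintf(buf, sizeof(buf), "%u", high);
	if (setenv("FI_TCP_PORT_HIGH_RANGE", buf, 1) != 0)
		return -1;

	return 0;
}
```

Exporting the two variables from the shell or job script before launching the application has the same effect; the firewall on the compute nodes then needs to allow the chosen range instead of the default one.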

Either configure the firewall to allow traffic for these ports, or disable the firewall
(for example, by running `systemctl stop firewalld; systemctl disable firewalld`).

The table below summarizes all ports that should be opened on the firewall:

| Node Type | Component     | Process     | Settings                                              | Default     |
| --------- | ------------- | ----------- | ----------------------------------------------------- | ----------- |
| Server    | Control plane | daos_server | port:                                                 | 10001       |
| Server    | Telemetry     | daos_server | telemetry_port:                                       | 9191        |
| Server    | Data plane    | daos_engine | fabric_iface_port: + 2 + targets: + nr_xs_helpers:    | 20000-20019 |
| Client    | libdaos       | application | FI_TCP_PORT_LOW_RANGE/FI_TCP_PORT_HIGH_RANGE env vars | 20100-21100 |

## Install from Source

When DAOS is installed from source (and not from pre-built packages), extra manual
17 changes: 17 additions & 0 deletions src/cart/crt_init.c
@@ -523,6 +523,23 @@ prov_settings_apply(bool primary, crt_provider_t prov, crt_init_options_t *opt)
	if (prov != CRT_PROV_OFI_CXI && prov != CRT_PROV_OFI_TCP)
		d_setenv("NA_OFI_UNEXPECTED_TAG_MSG", "1", 0);

	/**
	 * Force specific port range for application when using tcp provider to know what
	 * ports to open when firewall is used.
	 */
	if (!crt_is_service() && (prov == CRT_PROV_OFI_TCP || prov == CRT_PROV_OFI_TCP_RXM)) {
		uint32_t port_low_range  = UINT32_MAX;
		uint32_t port_high_range = UINT32_MAX;

		crt_env_get(FI_TCP_PORT_LOW_RANGE, &port_low_range);
		crt_env_get(FI_TCP_PORT_HIGH_RANGE, &port_high_range);

		if (port_low_range == UINT32_MAX && port_high_range == UINT32_MAX) {
			d_setenv("FI_TCP_PORT_LOW_RANGE", "20100", 0);
			d_setenv("FI_TCP_PORT_HIGH_RANGE", "21100", 0);
		}
	}

	g_prov_settings_applied[prov] = true;
}
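
For readers without the CaRT sources at hand, the defaulting logic in this hunk can be approximated with plain POSIX calls. This is a sketch under assumptions, not the CaRT implementation: `apply_default_client_port_range()` is a hypothetical name, `getenv()`/`setenv()` stand in for `crt_env_get()`/`d_setenv()`, and it assumes that `crt_env_get()` leaves its output untouched when a variable is unset (which is what the `UINT32_MAX` sentinel in the hunk suggests).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#endif
#include <stdlib.h>

/*
 * Hypothetical stand-in for the hunk above: apply the 20100-21100 defaults
 * only when neither bound was provided by the user.
 * Returns 1 if defaults were applied, 0 if the environment was left alone,
 * and -1 if setenv() failed.
 */
static int
apply_default_client_port_range(void)
{
	const char *low  = getenv("FI_TCP_PORT_LOW_RANGE");
	const char *high = getenv("FI_TCP_PORT_HIGH_RANGE");

	/* Mirrors the "both still UINT32_MAX" test: any user value wins. */
	if (low != NULL || high != NULL)
		return 0;

	if (setenv("FI_TCP_PORT_LOW_RANGE", "20100", 0) != 0)
		return -1;
	if (setenv("FI_TCP_PORT_HIGH_RANGE", "21100", 0) != 0)
		return -1;

	return 1;
}
```

One consequence of this condition, visible in both the hunk and the sketch, is that setting only one of the two variables leaves the other bound unset rather than filling it with a default.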

2 changes: 2 additions & 0 deletions src/cart/crt_internal_types.h
@@ -220,6 +220,8 @@ struct crt_event_cb_priv {
ENV(SWIM_PING_TIMEOUT) \
ENV(SWIM_PROTOCOL_PERIOD_LEN) \
ENV(SWIM_SUSPECT_TIMEOUT) \
ENV(FI_TCP_PORT_LOW_RANGE) \
ENV(FI_TCP_PORT_HIGH_RANGE) \
Contributor

I don't understand why putting this in other order would help (saw your last patch).

Contributor Author

No idea, but I had to do it this way for the build to pass. @frostedcmos any idea?

Contributor

Hmm not sure. I submitted a PR in the order that caused you issues in run #3 and it passed for me:
#15228

Could there be some whitespace somewhere, resulting in bad macro expansion in previous tries? I can't easily tell in github views.

Contributor Author

maybe
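
For context on the thread above, here is a minimal sketch of the list-macro (X-macro) pattern that this header appears to use. Everything prefixed with `DEMO_`/`demo_` is hypothetical and much smaller than the real list; the point is only that each entry is expanded independently, so the order of `ENV(...)` lines should not normally matter, whereas every line but the last must end in a backslash. In standard C that backslash has to be the last character on the line; some compilers tolerate trailing whitespace after it with a warning, so the whitespace theory may or may not apply depending on the toolchain.

```c
#include <stddef.h>

/* Hypothetical, shortened list; each line but the last ends with '\'. */
#define DEMO_ENV_LIST                                                          \
	ENV(SWIM_SUSPECT_TIMEOUT)                                              \
	ENV(FI_TCP_PORT_LOW_RANGE)                                             \
	ENV(FI_TCP_PORT_HIGH_RANGE)                                            \
	ENV_STR(UCX_IB_FORK_INIT)

/* First expansion: one struct field per list entry. */
#define ENV(x)     unsigned int _##x;
#define ENV_STR(x) const char *_##x;
struct demo_envs {
	DEMO_ENV_LIST
};
#undef ENV
#undef ENV_STR

/* Second expansion: the entry names as strings, in list order. */
#define ENV(x)     #x,
#define ENV_STR(x) #x,
static const char *const demo_env_names[] = {DEMO_ENV_LIST};
#undef ENV
#undef ENV_STR

/* Number of entries in the demo list. */
static size_t
demo_env_count(void)
{
	return sizeof(demo_env_names) / sizeof(demo_env_names[0]);
}
```

Reordering the `ENV(...)` lines in this sketch changes only the field and string order, which is why a stray character after a continuation backslash is a plausible, though unconfirmed, explanation for the earlier build failures.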

ENV_STR(UCX_IB_FORK_INIT)

/* uint env */