Improve data locality by considering Kubernetes topology #595

Open

lfrancke opened this issue Oct 24, 2024 · 3 comments

@lfrancke
Member

Description

As users of the HDFS operator and a Stackable-deployed HDFS, we want it to ensure data locality by talking to a DataNode on the same Kubernetes node as the client first, if one exists.

Value

HDFS tries to store the first copy of a block on a "local" machine before shipping data to remote machines over the network. This relies on a simple IP address comparison in the HDFS code, which breaks in Kubernetes, where pods don't share the same IP even if they run on the same Kubernetes node.

I believe we can improve this situation by changing the HDFS code to consider the Kubernetes node while looking for a "local" machine.

We already have precedent with the hdfs-topology-provider, which does something similar. I believe we can plug this logic into the chooseLocalOrFavoredStorage method of BlockPlacementPolicyDefault.
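
To make this concrete, here is a rough, hypothetical sketch of the check such a patch (or a pluggable BlockPlacementPolicy) could perform. KubernetesNodeResolver is an assumed helper, not an existing class; it stands in for whatever lookup (Kubernetes API call, or reused topology-provider code) maps a pod IP to the name of its hosting node:

// Hypothetical sketch only: the locality check a patched
// chooseLocalOrFavoredStorage (or a custom BlockPlacementPolicy) could call.
public final class KubernetesLocalityCheck {

    /** Assumed helper: maps a pod IP to the name of the Kubernetes node hosting that pod. */
    public interface KubernetesNodeResolver {
        String nodeNameForPodIp(String podIp);
    }

    private final KubernetesNodeResolver resolver;

    public KubernetesLocalityCheck(KubernetesNodeResolver resolver) {
        this.resolver = resolver;
    }

    /**
     * Treat the DataNode as "local" if the client pod and the DataNode pod are
     * scheduled on the same Kubernetes node, even though their pod IPs differ.
     */
    public boolean isLocal(String clientIp, String datanodeIp) {
        String clientNode = resolver.nodeNameForPodIp(clientIp);
        String datanodeNode = resolver.nodeNameForPodIp(datanodeIp);
        return clientNode != null && clientNode.equals(datanodeNode);
    }
}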

We want this because it will probably benefit all workloads that use HDFS with locally attached storage together with tools like Spark or HBase, where processing can happen on the same Kubernetes node as the storage. The benefit is less network traffic and a boost in performance.

Dependencies

It probably makes sense to reuse code from the hdfs-topology-provider project.

Tasks

  • Understand exactly what the topology provider is doing: we need a way, from within Kubernetes, to get the node a client is "calling" from (via its IP) as well as the nodes the DataNodes are running on (see the sketch after this list)
  • Decide whether this behavior is going to be enabled by default or not
  • Discuss the best point to plug this behavior in and create a patch or pluggable class (e.g. a BlockPlacementPolicy)
  • Document the behavior
  • Test it on a multi-node cluster and also test it with external clients (listener etc.)
  • Marketing: Discuss with marketing whether we should create a blog post about it
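
For the first task, one possible sketch of resolving a client IP to its hosting Kubernetes node is shown below. It uses the fabric8 Kubernetes client as one option (not necessarily what the topology provider uses); a real implementation would cache results and scope the query rather than listing every pod on each call:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Sketch: resolve a pod IP to the Kubernetes node hosting that pod by listing
// pods and matching on status.podIP.
public class PodIpToNode {

    public static String nodeNameForPodIp(String podIp) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            return client.pods().inAnyNamespace().list().getItems().stream()
                    .filter(p -> p.getStatus() != null && podIp.equals(p.getStatus().getPodIP()))
                    .map(p -> p.getSpec().getNodeName())
                    .findFirst()
                    .orElse(null);
        }
    }
}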

Acceptance Criteria

  • We have a way to compare data locality based on Kubernetes nodes that falls back to the default in case there is an error
  • The behavior is documented even if it is not changeable/pluggable

Release Notes

The HDFS NameNodes will now look at the Kubernetes topology when considering whether a client request is made locally or not. This means they will consider all clients "local" that are hosted on the same Kubernetes node as a DataNode.

@adwk67
Member

adwk67 commented Nov 4, 2024

Summary of topology provider logic

The StackableTopologyProvider implements the DNSToSwitchMapping interface and uses the entry point public List<String> resolve(List<String> names) to return a topology for a given pod name/IP (a minimal sketch of this interface is included at the end of this comment).

The topology is in the form: /{resolved-label-1}/{resolved-label-2}/.... For example, in the integration test the labels:

rackAwareness:
  - nodeLabel: kubernetes.io/hostname
  - podLabel: app.kubernetes.io/role-group

are used to check that a topology has been created for a specific DataNode role group. Internally, StackableTopologyProvider first resolves the name (which could be an IP or a pod name) to an IP and maps that IP to labels identifying the Kubernetes node where that pod is running.

These steps are as follows (ignoring caching logic):

  • look up the IPs of the HDFS DataNodes
  • collect the Kubernetes node labels for each (DataNode) IP
  • collect the pod labels for each (DataNode) IP
    • for each HDFS DataNode IP we now have a collection of its pod and node labels
  • the entry-point input names are resolved to a collection of either the pod names (for non-listeners) or the DataNode IP to which the listener resolves (for listeners)
  • the names are resolved to their IPs (either their own, if they are DataNodes, or that of the DataNode on the node where this pod is running)
    • we now have the DataNode IP for each input pod 🟢
  • labels are looked up from the pod and the pod's node to build the topology

We can use the information here 🟢 for this ticket.
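
For reference, a minimal sketch of how a DNSToSwitchMapping implementation along these lines is shaped. The label-lookup helpers are placeholders for the Kubernetes queries described in the steps above, not the provider's real code:

import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Minimal sketch in the spirit of StackableTopologyProvider.
public class TopologyProviderSketch implements DNSToSwitchMapping {

    @Override
    public List<String> resolve(List<String> names) {
        // For each pod name or IP, build a path of the form
        // /{resolved-node-label}/{resolved-pod-label}
        return names.stream()
                .map(name -> "/" + nodeLabelFor(name) + "/" + podLabelFor(name))
                .collect(Collectors.toList());
    }

    @Override
    public void reloadCachedMappings() {
        // No caching in this sketch.
    }

    @Override
    public void reloadCachedMappings(List<String> names) {
        // No caching in this sketch.
    }

    // Assumed placeholder: resolve the name to an IP, find the node running that
    // pod, and read the configured node label (e.g. kubernetes.io/hostname).
    private String nodeLabelFor(String name) {
        return "node-label";
    }

    // Assumed placeholder: read the configured pod label
    // (e.g. app.kubernetes.io/role-group) from the pod with that IP.
    private String podLabelFor(String name) {
        return "pod-label";
    }
}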

@timrobertson100

timrobertson100 commented Nov 26, 2024

I'm going to leave a comment here which may be helpful, or which you may wish to remove.

We have a CDH5 cluster that runs a large daily Spark batch job to read Parquet and generate HFiles for bulk loading into HBase.
The job runs in 5.5 hrs on CDH5.

On our new ST installation, configured to run the job as equivalently as possible (600 cores, 36G executor memory, same code) but with newer libraries (e.g. Spark 2.3 -> 3) and hardware, we see the job run in 7.5 hrs. I'm still learning our setup, but Grafana is logging peaks of 640 MiB/s on the DataNode transmit and receive graphs during this job.

We might have a good environment for evaluating any fix (perhaps running patched code?) and we'd be happy to assist. We run 100s of Spark jobs like this daily.

@adwk67
Member

adwk67 commented Nov 26, 2024

We might have a good environment for evaluating any fix (perhaps running patched code?) and we'd be happy to assist. We run 100s of Spark jobs like this daily.

That's great to know - thank you! We'll definitely bear this in mind when we pick up this issue.
