Improve path management #22

Open
iannesbitt opened this issue May 17, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@iannesbitt
Contributor

When handling file paths, keep in mind that when this code is run across different nodes, the local file system might not be shared. On Delta, each node has access to its own local file system, plus some access to a shared network file system. Reads and writes tend to be faster when files are on the node that is doing the work, rather than on the shared file system. So in many cases, we copy smaller files onto each node before starting a job.
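
As a rough illustration of the copy-to-node step described above, here is a minimal sketch. The function name `stage_to_local` and both directory arguments are hypothetical and are not part of viz-points; the size comparison is only a cheap heuristic for skipping a repeat copy, not a real integrity check.

```python
import shutil
from pathlib import Path


def stage_to_local(shared_file, local_dir):
    """Copy a file from shared storage into a node-local directory.

    Hypothetical helper: returns the path of the node-local copy.
    """
    shared_file = Path(shared_file)
    local_dir = Path(local_dir)

    # Create the node-local directory if this is the first job to use it.
    local_dir.mkdir(parents=True, exist_ok=True)
    local_copy = local_dir / shared_file.name

    # Copy only when the local copy is missing or its size differs.
    if (not local_copy.exists()
            or local_copy.stat().st_size != shared_file.stat().st_size):
        shutil.copy2(shared_file, local_copy)

    return local_copy
```

Each worker would call this once per input before processing, then read from the returned path instead of the shared one.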

Let's take home = os.path.expanduser('~').replace('\\', '/') in __init__ as an example.

  • The os.path.expanduser('~') function returns the path of the current user's home directory. On a cluster, different nodes might have different filesystems. The user's home directory might not be the same on all nodes, or might not even exist on some nodes.
  • Even if all nodes have the same filesystem, the user's home directory might not be shared across all nodes. If the home directory is not shared, a file that is written to the home directory on one node will not be available on other nodes.
  • Since we need to access and save files to a specific path, it would be safer to use an absolute path to a directory that we know exists and is shared across all nodes in the cluster.
  • Replacing '\\' with '/' might not always give the correct results, especially if we want the code to be cross-platform. Better to use pathlib.
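
One way the points above could be sketched: resolve the base directory from an explicit setting rather than from `~`, and let pathlib handle separators. The function name `resolve_base_dir` and the environment variable `VIZ_POINTS_BASE_DIR` are assumptions made up for this example, not existing viz-points options.

```python
import os
from pathlib import Path


def resolve_base_dir(base_dir=None, env_var="VIZ_POINTS_BASE_DIR"):
    """Return an absolute base directory chosen by the caller, not '~'.

    Hypothetical helper: `env_var` is an illustrative name only.
    """
    # An explicit argument wins; otherwise fall back to the environment.
    candidate = base_dir or os.environ.get(env_var)
    if candidate is None:
        raise ValueError(
            f"No base directory given; pass base_dir or set {env_var}"
        )

    # pathlib normalizes separators for the current platform.
    path = Path(candidate).expanduser().resolve()

    # Fail early on a node where the directory is not available.
    if not path.is_dir():
        raise FileNotFoundError(f"Base directory does not exist: {path}")

    return path
```

Where a forward-slash string is still needed (for example, when passing a path to an external tool), `path.as_posix()` provides one without manual string replacement.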

In parts of viz-points, the input path is used to construct the output path. However, we will need the ability to write files to a different location than the location from where we read them in. In other words, the base path for input files could differ from the base path for output files. For example, sometimes we read input files from the shared network, but write files to the local node that is processing the output (to speed up writing).
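
A minimal sketch of decoupling the two base paths, assuming output files should mirror the input directory layout under a separate root. The function name `output_path_for` and its parameters are hypothetical, and the example paths in the usage note are placeholders.

```python
from pathlib import Path


def output_path_for(input_file, input_base, output_base, suffix=None):
    """Map an input file to an output path under a different base directory.

    Hypothetical helper: keeps the input's layout relative to `input_base`.
    """
    # relative_to raises ValueError if input_file is not under input_base.
    relative = Path(input_file).relative_to(input_base)
    out = Path(output_base) / relative

    # Optionally swap the file extension for the output format.
    if suffix is not None:
        out = out.with_suffix(suffix)

    return out
```

For example, `output_path_for("/shared/in/a/b.las", "/shared/in", "/tmp/out", ".json")` gives a path equal to `Path("/tmp/out/a/b.json")`, so inputs can stay on the shared file system while outputs go to node-local storage.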

Originally posted by @robyngit in #12 (review)

@iannesbitt iannesbitt added the enhancement New feature or request label May 17, 2023
@iannesbitt iannesbitt self-assigned this May 17, 2023
@iannesbitt iannesbitt added this to the 0.0.2 milestone Jul 5, 2023
@iannesbitt iannesbitt removed this from the 0.0.2 milestone Jul 6, 2023
@iannesbitt
Contributor Author

Reopening as not all points have been fully addressed. I need to know where we usually store parallel processing artifacts in order to make paths more cluster-friendly.
