Use local scr for initial downloads / heavy writes #148

Open
sgoodm opened this issue Jan 19, 2023 · 3 comments

Comments

@sgoodm
Member

sgoodm commented Jan 19, 2023

As a best practice, given the potential scale of many of our pipelines (i.e., potentially running dozens of tasks across nodes), I'd like to minimize continuous I/O on our main file system. This should be relatively simple: use the node's local disk for downloads/processing, then copy results to the main file system. Even in scenarios where operations are quick and not I/O-heavy, this shouldn't slow jobs down much and is worth the extra peace of mind.

(Currently, heavy IO is seemingly causing extra issues on our aging file system, but we are in the process of moving everything to a brand new file system.)

@sgoodm
Member Author

sgoodm commented Jan 19, 2023

@jacobwhall at some point we can update completed pipelines to adhere to this practice, but most of the recently completed ones should be pretty lightweight (and have already been fully run, so near-future I/O isn't an issue).

@jacobwhall
Member

Many of the dataset scripts currently follow a model like this:

```mermaid
graph LR;
    id1(data source)-- download tasks -->raw_dir;
    raw_dir-- process_tasks -->output_dir;
    output_dir-- ingest system -->GeoQuery;
```

Setting raw_dir to a local scratch folder (e.g. ~/lscr/TMPDIR on W&M HPC) would greatly reduce the I/O on shared filesystems, and we can make this the default location for raw_dir moving forward. Doing so will require changes to how the Dataset class currently assigns tasks to workers. Because download tasks and process tasks (as in the graph above) are currently assigned separately, the worker that processes a file will likely be different from the one that downloaded it, and won't have that file available in its local scratch directory. We'll need to ensure that the same worker performs both of these tasks.
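One way to guarantee colocation is to bundle the download and process steps into a single task function, so whichever worker picks it up necessarily has the downloaded file in its own local scratch. A minimal sketch below; the function names (`download`, `process`, `download_and_process`) are hypothetical placeholders, not the actual Dataset class API, and the "download" is faked with a local write:

```python
import os


def download(url, dest):
    # placeholder: write fake content instead of performing a real fetch
    with open(dest, "w") as f:
        f.write(f"data from {url}")
    return dest


def process(src, out_dir):
    # trivial stand-in for a real processing step
    out_path = os.path.join(out_dir, os.path.basename(src) + ".processed")
    with open(src) as f, open(out_path, "w") as g:
        g.write(f.read().upper())
    return out_path


def download_and_process(url, scratch_dir, output_dir):
    # single task unit: download into node-local scratch, process there,
    # and write only the final result to the shared output_dir
    raw_path = os.path.join(scratch_dir, "raw.txt")
    download(url, raw_path)
    return process(raw_path, output_dir)
```

Submitting `download_and_process` as one task (rather than chaining two separately-scheduled tasks) sidesteps the worker-affinity problem entirely, at the cost of coarser-grained scheduling.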

@jacobwhall
Member

From our conversation today, it sounds like raw_dir and output_dir both need to be permanent archives, so we can't have either of them point to a local scratch directory. We should specify a tmp_dir for tasks to write files into, and then at the end of each task copy the file to its final destination. Perhaps a Dataset class function could manage the tmp_dir for us.
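The write-to-scratch-then-archive pattern could be wrapped in a small context manager, so tasks never have to manage the temporary directory themselves. A minimal sketch, assuming stdlib only; `tmp_output` and its `tmp_root` parameter (which would point at node-local scratch, e.g. the TMPDIR above) are hypothetical names, not existing Dataset methods:

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def tmp_output(final_path, tmp_root=None):
    """Yield a node-local temporary path; on success, move the finished
    file to its permanent destination on the shared filesystem."""
    final_path = Path(final_path)
    tmp_dir = tempfile.mkdtemp(dir=tmp_root)  # tmp_root = local scratch
    tmp_path = Path(tmp_dir) / final_path.name
    try:
        yield tmp_path
        # only one write hits the shared filesystem, at the very end
        final_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(tmp_path), str(final_path))
    finally:
        # clean up local scratch whether the task succeeded or failed
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

A task body would then look like `with tmp_output(final) as p: p.write_text(...)`, keeping all heavy I/O on local disk until the single final move.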
