Use local scr for initial downloads / heavy writes #148

Open
sgoodm opened this issue Jan 19, 2023 · 3 comments

Comments

@sgoodm
Member

sgoodm commented Jan 19, 2023

As a best practice, given the potential scale of many of our pipelines (i.e., potentially running dozens of tasks across nodes), I'd like to minimize continuous I/O on our main file system. This should be relatively simple: use the node's local disk for downloads/processing, then copy results to the main file system. Even in scenarios where operations are quick and not I/O-heavy, this shouldn't slow jobs down much and is worth the extra peace of mind.

(Currently, heavy IO is seemingly causing extra issues on our aging file system, but we are in the process of moving everything to a brand new file system.)

@sgoodm
Member Author

sgoodm commented Jan 19, 2023

@jacobwhall at some point we can update completed pipelines to adhere to this practice, but most of the recently completed ones should be pretty lightweight (and have already been fully run, so near-future I/O isn't an issue).

@jacobwhall
Member

Many of the dataset scripts currently follow a model like this:

```mermaid
graph LR;
    id1(data source)-- download tasks -->raw_dir;
    raw_dir-- process_tasks -->output_dir;
    output_dir-- ingest system -->GeoQuery;
```

Setting raw_dir to a local scratch folder (e.g. ~/lscr/TMPDIR on W&M HPC) would greatly reduce the I/O on shared filesystems, and we can make this the default location for raw_dir moving forward. Doing so will require changes to how the Dataset class currently assigns tasks to workers. Because download tasks and process tasks (as in the graph above) are currently assigned separately, the worker that processes a file will likely be different from the one that downloaded it, and won't have that file available in its local scratch directory. We'll need to ensure that the same worker performs both of these tasks.
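One way to guarantee colocation is to bundle the download and process steps into a single task function, so whichever worker picks it up necessarily has the downloaded file in its own local scratch. A minimal sketch below; the function names (`download`, `process`, `download_and_process`) are hypothetical placeholders, not the actual Dataset class API, and the "download" is faked with a local write:

```python
import os


def download(url, dest):
    # placeholder: write fake content instead of performing a real fetch
    with open(dest, "w") as f:
        f.write(f"data from {url}")
    return dest


def process(src, out_dir):
    # trivial stand-in for a real processing step
    out_path = os.path.join(out_dir, os.path.basename(src) + ".processed")
    with open(src) as f, open(out_path, "w") as g:
        g.write(f.read().upper())
    return out_path


def download_and_process(url, scratch_dir, output_dir):
    # single task unit: download into node-local scratch, process there,
    # and write only the final result to the shared output_dir
    raw_path = os.path.join(scratch_dir, "raw.txt")
    download(url, raw_path)
    return process(raw_path, output_dir)
```

Submitting `download_and_process` as one task (rather than chaining two separately-scheduled tasks) sidesteps the worker-affinity problem entirely, at the cost of coarser-grained scheduling.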

@jacobwhall
Member

From our conversation today, it sounds like raw_dir and output_dir both need to be permanent archives, so we can't have either of them point to a local scratch directory. We should specify a tmp_dir for tasks to write files into, and then at the end of each task copy the file to its final destination. Perhaps a Dataset class function could manage the tmp_dir for us.
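The write-to-scratch-then-archive pattern could be wrapped in a small context manager, so tasks never have to manage the temporary directory themselves. A minimal sketch, assuming stdlib only; `tmp_output` and its `tmp_root` parameter (which would point at node-local scratch, e.g. the TMPDIR above) are hypothetical names, not existing Dataset methods:

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def tmp_output(final_path, tmp_root=None):
    """Yield a node-local temporary path; on success, move the finished
    file to its permanent destination on the shared filesystem."""
    final_path = Path(final_path)
    tmp_dir = tempfile.mkdtemp(dir=tmp_root)  # tmp_root = local scratch
    tmp_path = Path(tmp_dir) / final_path.name
    try:
        yield tmp_path
        # only one write hits the shared filesystem, at the very end
        final_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(tmp_path), str(final_path))
    finally:
        # clean up local scratch whether the task succeeded or failed
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

A task body would then look like `with tmp_output(final) as p: p.write_text(...)`, keeping all heavy I/O on local disk until the single final move.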
