added num_workers param in config

infocusp · Nov 7, 2024 · ac30dce
1 parent 59e5623
Showing 2 changed files with 6 additions and 0 deletions.
diff --git a/config/README.md b/config/README.md
@@ -36,6 +36,11 @@ default: `50000`
 Useful for low resource utilization. This will ensure all data is stored in multiple chunks of almost `sample_chunksize` samples. This does not hamper any logic in algorithms but simply ensures that the entire dataset is never loaded all at once on the RAM.  
 `null` value will disregard this optimization.
 
+**num_workers** {int}: `int | null`  
+default: `1`  
+Number of workers run in parallel to speed up writing data to disk. Choose this value with
+the number of cores available on the device in mind. *Note that this does not increase the memory usage of the pipeline*. The best speed-up was found at `num_workers = 3`.
+
 **train_val_test** {dict}:  
 This section splits the data using the splitting technique specified in `splitter_config` and required params such as `split_ratio` and `stratify`. Example below.
 

diff --git a/config/config.yaml b/config/config.yaml
@@ -13,6 +13,7 @@ experiment:
 # DATA CONFIG.
 data:
     sample_chunksize: 20000
+    num_workers: 1
 
     train_val_test:
         full_datapath: '/path/to/anndata.h5ad'
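
To illustrate how `sample_chunksize` and `num_workers` might interact, here is a minimal sketch of chunked writing with a worker pool. It is not the pipeline's implementation: `write_chunk` and `write_in_chunks` are hypothetical helpers, JSON files stand in for the real on-disk format, and a thread pool from the standard library is used for simplicity where the pipeline may well use processes. Treating a `null` value as a single worker is also an assumption.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_chunk(out_dir: Path, index: int, chunk: list) -> Path:
    """Write one chunk of samples to its own file and return the path."""
    path = out_dir / f"chunk_{index}.json"
    path.write_text(json.dumps(chunk))
    return path


def write_in_chunks(samples, out_dir, sample_chunksize, num_workers=1):
    """Split `samples` into chunks and write them using `num_workers` workers."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: a null/None value falls back to a single worker.
    workers = num_workers or 1
    # Each chunk holds at most `sample_chunksize` samples.
    chunks = [
        samples[start:start + sample_chunksize]
        for start in range(0, len(samples), sample_chunksize)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_chunk, out_dir, index, chunk)
            for index, chunk in enumerate(chunks)
        ]
        # Collect results in submission order, so paths follow chunk order.
        return [future.result() for future in futures]


with tempfile.TemporaryDirectory() as tmp:
    paths = write_in_chunks(list(range(10)), tmp, sample_chunksize=4, num_workers=3)
    names = [path.name for path in paths]
    print(names)
```

With ten samples and a chunk size of four, the sketch writes three files. The worker count only changes how many writes are in flight at once, not the number or size of the chunks.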