A JSON-based DSL for describing paths in your project.
My ETL/data-analysis projects are littered with code like:

```python
import os

DATA_DIR = "data"
CLEAN_DIR = os.path.join(DATA_DIR, "clean")
RAW_DIR = os.path.join(DATA_DIR, "raw")
TARGET_HTML = os.path.join(RAW_DIR, "something.html")
OUTPUT_FILE = os.path.join(CLEAN_DIR, "something.csv")

with open(TARGET_HTML) as fp:
    csv = process(fp)

with open(OUTPUT_FILE, "w") as fp:
    write_csv(fp)
```
It's fine for one file, but when you have a whole ETL pipeline tucked into a Makefile, the duplication leads to fragility and violates DRY. It's a REALLY common pattern in file-based processing. This package and format let you create a `.paths.json` file like:
```json
{
    "__ENV": {"VERSION": "0.0.1"},
    "DATA_DIR": ["data", "$$VERSION"],
    "CLEAN_DIR": ["$DATA_DIR", "clean"],
    "RAW_DIR": ["$DATA_DIR", "raw"],
    "SOMETHING_HTML": ["$RAW_DIR", "something.html"],
    "SOMETHING_CSV": ["$CLEAN_DIR", "something.csv"]
}
```
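Keys resolve recursively: a `$NAME` element refers to another entry in the file, and a `$$NAME` element refers to a variable declared under `__ENV`, so `SOMETHING_HTML` above expands to `data/0.0.1/raw/something.html`. A minimal sketch of that resolution logic (an illustration of the format's semantics, not the library's actual implementation):

```python
import json
import os

def resolve(paths, key):
    """Expand one entry of a .paths.json-style dict (illustrative sketch only)."""
    parts = []
    for part in paths[key]:
        if part.startswith("$$"):       # reference to an __ENV variable
            parts.append(paths["__ENV"][part[2:]])
        elif part.startswith("$"):      # reference to another key in the file
            parts.append(resolve(paths, part[1:]))
        else:                           # literal path component
            parts.append(part)
    return os.path.join(*parts)

with open(".paths.json") as fp:
    print(resolve(json.load(fp), "SOMETHING_HTML"))  # data/0.0.1/raw/something.html
```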
Then, from your Python scripts:

```python
from pathsjson.automagic import PATHS

print("Processing:", PATHS['SOMETHING_HTML'])

with PATHS.resolve('SOMETHING_HTML').open() as fp:
    csv = process(fp)

with PATHS.resolve('SOMETHING_CSV').open("w") as fp:
    write_csv(fp)
```
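Because every script in the pipeline reads the same `.paths.json`, renaming a directory or bumping `VERSION` only touches one file. A downstream step needs no path constants of its own; for example (a hypothetical second script, using only the calls shown above):

```python
# report.py -- hypothetical downstream step in the same pipeline
from pathsjson.automagic import PATHS

with PATHS.resolve('SOMETHING_CSV').open() as fp:
    rows = fp.read().splitlines()

print("Loaded", len(rows), "rows from", PATHS['SOMETHING_CSV'])
```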
Install it from PyPI:

```bash
pip install pathsjson
```
There is a `.paths.json` schema; files are validated against it with JSON-Schema. Full details are in the project documentation.
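If you want a sanity check in your own code, the `jsonschema` package can validate a file against a schema. A rough sketch using a deliberately simplified schema (the one shipped with pathsjson is the authoritative version):

```python
import json
from jsonschema import validate  # pip install jsonschema

# Simplified, illustrative schema: __ENV maps names to strings,
# every other key maps to a list of path components.
SCHEMA = {
    "type": "object",
    "properties": {
        "__ENV": {
            "type": "object",
            "additionalProperties": {"type": "string"},
        },
    },
    "additionalProperties": {
        "type": "array",
        "items": {"type": "string"},
    },
}

with open(".paths.json") as fp:
    validate(json.load(fp), SCHEMA)  # raises ValidationError if the file is malformed
```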