Luigi for reproducible data analysis workflow

luigi is a very powerful DAG workflow manager with strong extensibility. It is useful for building data analysis pipelines, but part of its default behavior works against reproducibility and consistency. By default, luigi.Task checks only whether its output object exists, so a task is considered complete even when its inputs have changed, as long as the output object is still there.

Here I present an extension of luigi.Task that is more suitable for reproducible data analysis workflows. It overrides the complete method of luigi.Task to compare the hash values of the inputs with those of the previous run.
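The difference between the two completeness checks can be illustrated without luigi at all. The sketch below is not code from this repository: the helper names (file_hash, complete_by_existence, complete_by_input_hash) are hypothetical, and SHA-256 over the file bytes is an assumed choice of hash.

```python
import hashlib
import os
import tempfile


def file_hash(path):
    """Return the SHA-256 hex digest of a file's bytes (assumed hash choice)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def complete_by_existence(output_path):
    # Existence-only check: done as soon as the output is on disk.
    return os.path.exists(output_path)


def complete_by_input_hash(input_path, output_path, recorded_input_hash):
    # Done only if the output exists AND the input still hashes to the
    # value recorded when the output was produced.
    return (
        os.path.exists(output_path)
        and file_hash(input_path) == recorded_input_hash
    )


def demo():
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "input.csv")
        out = os.path.join(d, "output.csv")
        with open(src, "w") as f:
            f.write("a,b\n1,2\n")
        # "Run" the task once and remember the input hash it saw.
        with open(out, "w") as f:
            f.write("sum\n3\n")
        recorded = file_hash(src)
        before = (
            complete_by_existence(out),
            complete_by_input_hash(src, out, recorded),
        )
        # The input changes, but the stale output is still on disk.
        with open(src, "w") as f:
            f.write("a,b\n10,20\n")
        after = (
            complete_by_existence(out),
            complete_by_input_hash(src, out, recorded),
        )
        return before, after


if __name__ == "__main__":
    print(demo())
```

After the input file is rewritten, the existence-only check still reports the task as done, while the hash-based check reports that it needs to run again.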

Thanks to the luigi team.

How to use?

Make your tasks inherit from hash_checking_tasks.TaskWithCheckingInputHash.
Make the task's output and all of its inputs inherit from hash_checking_tasks.HashableTarget.
Run your tasks.

How does it work?

TaskWithCheckingInputHash is an extension of luigi.Task with the following behavior:

It checks the completeness of the tasks it depends on in the complete() method.
It checks whether the inputs of the previous run are equal to those of the current run.
If the run is successful, it stores the information about the task.
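The three steps above can be sketched with small stand-in classes. This is only an illustration of the idea under stated assumptions, not the code in hash_checking_tasks.py: InMemoryTarget and SketchTask are hypothetical names, targets live in memory instead of on disk, and the combined input hash is assumed to be a SHA-256 digest of the upstream content hashes.

```python
import hashlib


class InMemoryTarget:
    """Hypothetical stand-in for a hashable target; not the repository's class."""

    def __init__(self):
        self.content = None            # None means "does not exist yet"
        self.stored_input_hash = None  # recorded after a successful run

    def exists(self):
        return self.content is not None

    def hash_content(self):
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


class SketchTask:
    """Hypothetical stand-in for a task that checks its input hashes."""

    def __init__(self, deps, compute):
        self.deps = list(deps)
        self.compute = compute
        self.target = InMemoryTarget()

    def input_hash(self):
        # Combine the content hashes of all upstream outputs into one digest.
        joined = "".join(dep.target.hash_content() for dep in self.deps)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def complete(self):
        # Step 1: every dependency must itself be complete.
        if not all(dep.complete() for dep in self.deps):
            return False
        if not self.target.exists():
            return False
        # Step 2: the inputs must match those recorded by the previous run.
        return self.target.stored_input_hash == self.input_hash()

    def run(self):
        for dep in self.deps:
            if not dep.complete():
                dep.run()
        inputs = [dep.target.content for dep in self.deps]
        self.target.content = self.compute(inputs)
        # Step 3: after a successful run, record which inputs produced the output.
        self.target.stored_input_hash = self.input_hash()


def demo():
    raw = SketchTask([], lambda _inputs: "1 2 3")
    total = SketchTask(
        [raw], lambda inputs: str(sum(int(x) for x in inputs[0].split()))
    )
    total.run()
    first = total.complete()
    raw.target.content = "4 5 6"  # simulate an upstream file being edited
    stale = total.complete()
    total.run()
    return first, stale, total.complete(), total.target.content


if __name__ == "__main__":
    print(demo())
```

In this sketch the downstream task becomes incomplete as soon as the upstream content changes, even though its own output still exists, and becomes complete again after a rerun.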

TaskWithCheckingInputHash relies on HashableTarget, which provides the following:

hash_content() returns a hash of the target's content, so two targets can be compared for equality by comparing these values.
get_current_input_hash() retrieves the information about the Task that produced the current output, if one exists.
store_input_hash() stores the information about the Task that produced the output.
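One possible shape for such a target is sketched below. Only the three method names come from the description above; everything else is an assumption made for illustration: the class name SidecarFileTarget, the argument to store_input_hash, the use of SHA-256, and the choice to keep the record in a JSON file next to the output. The actual HashableTarget may store this information differently.

```python
import hashlib
import json
import os
import tempfile


class SidecarFileTarget:
    """Hypothetical file target; only the three method names come from the README."""

    def __init__(self, path):
        self.path = path
        self.meta_path = path + ".input_hash.json"  # assumed sidecar location

    def exists(self):
        return os.path.exists(self.path)

    def hash_content(self):
        # Hash of the target's bytes, used to compare targets for equality.
        with open(self.path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def get_current_input_hash(self):
        # None when no output (or no record for it) exists yet.
        if not (self.exists() and os.path.exists(self.meta_path)):
            return None
        with open(self.meta_path) as f:
            return json.load(f)["input_hash"]

    def store_input_hash(self, input_hash):
        # Record which inputs produced the current output.
        with open(self.meta_path, "w") as f:
            json.dump({"input_hash": input_hash}, f)


def demo():
    with tempfile.TemporaryDirectory() as d:
        a = SidecarFileTarget(os.path.join(d, "a.txt"))
        b = SidecarFileTarget(os.path.join(d, "b.txt"))
        for target in (a, b):
            with open(target.path, "w") as f:
                f.write("same bytes")
        same = a.hash_content() == b.hash_content()
        before = a.get_current_input_hash()
        a.store_input_hash("abc123")  # placeholder value for the demo
        after = a.get_current_input_hash()
        return same, before, after


if __name__ == "__main__":
    print(demo())
```

Keeping the record beside the output means it travels with the file, at the cost of one extra file per target.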

TODO:

Docstrings for all public methods.


ngr-t/luigi_for_data_science
