Implementation of store RDF export of the workflow in CWL Prov RO-Bundle #1709

Draft
wants to merge 55 commits into
base: main
Changes from 2 commits
Commits
d4851f6
Implementation of store RDF export of the workflow in CWL Prov RO-Bun…
jjkoehorst Aug 16, 2022
00ce190
formatting changes according to make format
jjkoehorst Aug 16, 2022
562d13c
formatting corrections
jjkoehorst Aug 16, 2022
e063735
remove need for type ignore
mr-c Aug 16, 2022
e0bb0e0
hard change to checksum_only
jjkoehorst Aug 16, 2022
16335b7
Added sha checksum to file_entity, need to look into what predicate s…
jjkoehorst Aug 16, 2022
59703ac
Merge branch 'cwlprov-cwl-rdf' of github.com:jjkoehorst/cwltool into …
jjkoehorst Aug 16, 2022
40c6705
formatting cleanup
jjkoehorst Aug 17, 2022
87c304b
--no-data argument added
jjkoehorst Aug 17, 2022
049dcd7
added no_data variable to some functions as i was unable to access th…
jjkoehorst Aug 17, 2022
21ecba9
test provenance --no-data added and a TODO check for check_bagit if w…
jjkoehorst Aug 17, 2022
879d5ce
Global no-data option for now to test the same environment with or wi…
jjkoehorst Aug 17, 2022
723c643
NO_DATA global variable added to know if there should be no data for …
jjkoehorst Aug 18, 2022
540a5a8
formatting
jjkoehorst Aug 18, 2022
211348a
cleaning logger and no_data access implementation
jjkoehorst Aug 18, 2022
bd61e43
Merge branch 'cwlprov-cwl-rdf'
jjkoehorst Aug 24, 2022
ad90be6
cleaning up imports
jjkoehorst Aug 25, 2022
76abff0
make remove_unused_imports, cleaning up all kinds of imports
jjkoehorst Sep 5, 2022
33d1551
some empty line formatting
jjkoehorst Sep 5, 2022
81b48de
Merge branch 'main' into cwlprov-cwl-rdf
jjkoehorst Sep 5, 2022
bc56733
if not none instead of !=
jjkoehorst Sep 5, 2022
e5b498d
make cleanup sync
jjkoehorst Sep 6, 2022
3666b65
docstrings added
jjkoehorst Sep 6, 2022
f58e90e
Default NO_DATA set to false
jjkoehorst Sep 6, 2022
4a6906b
move NO_DATA to utils
jjkoehorst Sep 6, 2022
08e18b0
remove global NO_DATA
mr-c Sep 6, 2022
33f706b
missed two NO_DATA's
jjkoehorst Sep 6, 2022
406ae69
Merge branch 'cwlprov-cwl-rdf' of github.com:jjkoehorst/cwltool into …
mr-c Sep 6, 2022
b288fb4
added return type str: to the checksum content processor
jjkoehorst Sep 6, 2022
cb28e1a
Merge branch 'cwlprov-cwl-rdf' of github.com:jjkoehorst/cwltool into …
mr-c Sep 6, 2022
1dbcdad
fix type
mr-c Sep 6, 2022
ab71278
restore regular prov tests
mr-c Sep 6, 2022
112f4f0
Duplicated a test case and the cwltool function to allow for --no-dat…
jjkoehorst Sep 6, 2022
cd0a4af
formatting
jjkoehorst Sep 6, 2022
50dac83
nolisting workflow and test added
jjkoehorst Sep 7, 2022
d3048af
with copy files but excluding a specific folder test
jjkoehorst Sep 7, 2022
ac532d4
working on load listing recognition for files and provenance
jjkoehorst Sep 8, 2022
fb5a65a
expanded the test case, server testing showed a loadListing option no…
jjkoehorst Sep 8, 2022
373b600
issue with load listing field
jjkoehorst Sep 8, 2022
eb93204
unused import removal
jjkoehorst Sep 8, 2022
a4b26af
show file name with debugger
jjkoehorst Sep 9, 2022
95c2c63
from_fp does not always carry name
jjkoehorst Sep 9, 2022
401918e
testing to print stacktrace to identify path to print file
jjkoehorst Sep 20, 2022
6fe74f3
check listing value
jjkoehorst Sep 20, 2022
7f370bb
change default to invalid_listing
jjkoehorst Sep 20, 2022
26fec21
debugging in progress
jjkoehorst Sep 20, 2022
d01a0df
trace in debug
jjkoehorst Oct 12, 2022
c15156b
stack trace only at debug level
jjkoehorst Oct 12, 2022
315e78f
stacktrace disabled
jjkoehorst Oct 12, 2022
8158340
Merge branch 'main' into cwlprov-cwl-rdf
jjkoehorst Aug 16, 2023
cad4896
formatting
jjkoehorst Aug 16, 2023
b930842
sort imports
jjkoehorst Aug 16, 2023
aa0054e
No warnings test
jjkoehorst Aug 17, 2023
87946a3
missed one attribute
jjkoehorst Aug 17, 2023
420dd1c
work in progress to fix the main merge
jjkoehorst Aug 17, 2023
36 changes: 31 additions & 5 deletions cwltool/provenance_profile.py
@@ -25,6 +25,7 @@
 from typing_extensions import TYPE_CHECKING

 import cwltool.workflow
+from . import process

 from .errors import WorkflowException
 from .job import CommandLineJob, JobBase
@@ -57,7 +58,7 @@


 def copy_job_order(
-    job: Union[Process, JobsType], job_order_object: CWLObjectType
+    job: Union[Process, JobsType], job_order_object: CWLObjectType, process
 ) -> CWLObjectType:
     """Create copy of job object for provenance."""
     if not isinstance(job, WorkflowJob):
@@ -66,12 +67,34 @@ def copy_job_order(
     customised_job: CWLObjectType = {}
     # new job object for RO
     debug = _logger.isEnabledFor(logging.DEBUG)
+    # Process the process object first
+    load_listing = {}
+
+    # Implementation to capture the loadlisting from cwl to skip the inclusion of for example files of big database
+    # folders
+    for index, entry in enumerate(process.inputs_record_schema["fields"]):
+        if (
+            entry["type"] == "org.w3id.cwl.cwl.Directory"
+            and "loadListing" in entry
+            and entry["loadListing"]
+        ):
+            load_listing[entry["name"]] = entry["loadListing"]
+
+    # print("LOAD LISTING: ", load_listing)
+    # PROCESS:Workflow: file:///Users/jasperk/gitlab/cwltool/tests/wf/directory_no_listing.cwl
+    # print("PROCESS:" + str(process))
+
     for each, i in enumerate(job.tool["inputs"]):
         with SourceLine(job.tool["inputs"], each, WorkflowException, debug):
             iid = shortname(i["id"])
+            # if iid in the load listing object and no_listing then....
             if iid in job_order_object:
-                customised_job[iid] = copy.deepcopy(job_order_object[iid])
-                # add the input element in dictionary for provenance
+                if iid in load_listing and load_listing[iid] != "no_listing":
+                    customised_job[iid] = copy.deepcopy(job_order_object[iid])
+                    # TODO Other listing options here?
+                else:
+                    # add the input element in dictionary for provenance
+                    customised_job[iid] = copy.deepcopy(job_order_object[iid])
             elif "default" in i:
                 customised_job[iid] = copy.deepcopy(i["default"])
                 # add the default elements in the dictionary for provenance
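
In the hunk above, both branches of the new if/else perform the same deep copy, so an input declared with loadListing: no_listing is still copied in full. The sketch below shows one possible way to honour the flag, assuming directory values are plain dictionaries with an optional "listing" key. The names collect_load_listing, copy_without_listing and customise are hypothetical helpers, not part of cwltool, and SCHEMA only mimics the keys of inputs_record_schema that the hunk reads.

```python
import copy
from typing import Any, Dict

# Hypothetical stand-in for process.inputs_record_schema; it carries only
# the keys read in the hunk above ("fields", "name", "type", "loadListing").
SCHEMA: Dict[str, Any] = {
    "fields": [
        {"name": "dir", "type": "org.w3id.cwl.cwl.Directory"},
        {
            "name": "ignore",
            "type": "org.w3id.cwl.cwl.Directory",
            "loadListing": "no_listing",
        },
    ]
}


def collect_load_listing(schema: Dict[str, Any]) -> Dict[str, str]:
    """Map each Directory input name to its declared loadListing value."""
    return {
        field["name"]: field["loadListing"]
        for field in schema["fields"]
        if field["type"] == "org.w3id.cwl.cwl.Directory"
        and field.get("loadListing")
    }


def copy_without_listing(value: Dict[str, Any]) -> Dict[str, Any]:
    """Deep-copy a directory value and drop its top-level 'listing' key."""
    trimmed = copy.deepcopy(value)
    trimmed.pop("listing", None)
    return trimmed


def customise(job_order: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the job order, omitting listings of no_listing directories."""
    load_listing = collect_load_listing(schema)
    customised: Dict[str, Any] = {}
    for name, value in job_order.items():
        if load_listing.get(name) == "no_listing" and isinstance(value, dict):
            customised[name] = copy_without_listing(value)
        else:
            customised[name] = copy.deepcopy(value)
    return customised
```

Whether the listing should be dropped, emptied, or recorded by reference only is a design decision for the PR; this sketch simply drops it and leaves the original job order untouched.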
@@ -246,13 +269,13 @@ def evaluate(
         if not hasattr(process, "steps"):
             # record provenance of independent commandline tool executions
             self.prospective_prov(job)
-            customised_job = copy_job_order(job, job_order_object)
+            customised_job = copy_job_order(job, job_order_object, process)
             self.used_artefacts(customised_job, self.workflow_run_uri)
             research_obj.create_job(customised_job)
         elif hasattr(job, "workflow"):
             # record provenance of workflow executions
             self.prospective_prov(job)
-            customised_job = copy_job_order(job, job_order_object)
+            customised_job = copy_job_order(job, job_order_object, process)
             self.used_artefacts(customised_job, self.workflow_run_uri)

     def record_process_start(
@@ -472,8 +495,11 @@ def declare_directory(self, value: CWLObjectType) -> ProvEntity:
         # a later call to this method will sort that
         is_empty = True

+        # if value['basename'] == "dirIgnore":
+        # pass
         if "listing" not in value:
             get_listing(self.fsaccess, value)
+
         for entry in cast(MutableSequence[CWLObjectType], value.get("listing", [])):
             is_empty = False
             # Declare child-artifacts
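
The loop in this hunk walks a directory's "listing" and treats the directory as empty only when no entry is found. A minimal sketch of that traversal over plain dictionaries follows. The helpers iter_files and is_empty are hypothetical names, and the nested shape (class, basename, listing) is assumed from CWL Directory objects rather than taken from this diff. The sample data mirrors the dir1 tree that the test workflow's generate step creates.

```python
from typing import Any, Dict, Iterator

# Assumed shape of a CWL Directory value, mirroring dir1 from the workflow.
DIR1: Dict[str, Any] = {
    "class": "Directory",
    "basename": "dir1",
    "listing": [
        {"class": "File", "basename": "a.txt"},
        {
            "class": "Directory",
            "basename": "a",
            "listing": [
                {"class": "File", "basename": "b.txt"},
                {
                    "class": "Directory",
                    "basename": "b",
                    "listing": [{"class": "File", "basename": "c.txt"}],
                },
            ],
        },
    ],
}


def iter_files(directory: Dict[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield relative paths of all files below a directory value."""
    for entry in directory.get("listing", []):
        path = prefix + entry["basename"]
        if entry.get("class") == "Directory":
            # Recurse into sub-directories, extending the relative prefix.
            yield from iter_files(entry, path + "/")
        else:
            yield path


def is_empty(directory: Dict[str, Any]) -> bool:
    """A directory without a listing, or with an empty one, counts as empty."""
    return not directory.get("listing")
```

Note that a directory whose listing was never loaded looks the same as an empty one here, which is the ambiguity a no_listing input introduces.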
35 changes: 28 additions & 7 deletions tests/test_provenance.py
@@ -48,7 +48,7 @@ def cwltool(tmp_path: Path, *args: Any) -> Path:
 def cwltool_no_data(tmp_path: Path, *args: Any) -> Path:
     prov_folder = tmp_path / "provenance"
     prov_folder.mkdir()
-    new_args = ["--no-data", "--provenance", str(prov_folder)]
+    new_args = ["--enable-ext", "--no-data", "--provenance", str(prov_folder)]

Review comment (Member): --enable-ext shouldn't be required when using loadListing with CWL v1.1+

     new_args.extend(args)
     # Run within a temporary directory to not pollute git checkout
     tmp_dir = tmp_path / "cwltool-run"
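
The helper in this hunk only prepends provenance flags to whatever arguments the test passes. A sketch of that argument assembly, without invoking cwltool, might look like the following. The name build_args and its no_data keyword are hypothetical, and --enable-ext is left out in line with the review comment.

```python
from pathlib import Path
from typing import List


def build_args(prov_folder: Path, *args: str, no_data: bool = False) -> List[str]:
    """Assemble a cwltool argument list for a provenance run (sketch only)."""
    new_args: List[str] = ["--provenance", str(prov_folder)]
    if no_data:
        # Mirror cwltool_no_data: the --no-data flag goes in front.
        new_args.insert(0, "--no-data")
    new_args.extend(args)
    return new_args
```

Folding the two helpers into one function with a flag is only one option; the PR keeps cwltool and cwltool_no_data as separate functions.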
@@ -207,7 +207,7 @@ def test_directory_workflow(tmp_path: Path) -> None:


 @needs_docker
-def test_directory_workflow_no_data(tmp_path: Path) -> None:
+def test_directory_workflow_no_listing(tmp_path: Path) -> None:
     dir2 = tmp_path / "dir2"
     dir2.mkdir()
     sha1 = {
@@ -223,8 +223,28 @@ def test_directory_workflow_no_data(tmp_path: Path) -> None:
         with open(dir2 / x, "w", encoding="ascii") as f:
             f.write(x)

-    folder = cwltool_no_data(
-        tmp_path, get_data("tests/wf/directory.cwl"), "--dir", str(dir2)
+    dir3 = tmp_path / "dirIgnore"
+    dir3.mkdir()
+    sha1 = {
+        # Expected hashes of ASCII letters (no linefeed)
+        # as returned from:
+        # for x in a b c ; do echo -n $x | sha1sum ; done
+        "d": "3c363836cf4e16666669a25da280a1865c2d2874",
+        "e": "58e6b3a414a1e090dfc6029add0f3555ccba127f",
+        "f": "4a0a19218e082a343a1b17e5333409af9d98f0f5",
+    }
+    for x in "def":
+        # Make test files with predictable hashes
+        with open(dir3 / x, "w", encoding="ascii") as f:
+            f.write(x)
+
+    folder = cwltool(
+        tmp_path,
+        get_data("tests/wf/directory_no_listing.cwl"),
+        "--dir",
+        str(dir2),
+        "--ignore",
+        str(dir3),
     )
     # check invert? as there should be no data in there
     # check_provenance(folder, directory=True)
@@ -234,13 +254,14 @@ def test_directory_workflow_no_data(tmp_path: Path) -> None:
         folder
         / "data"
         / "3c"
-        / "3ca69e8d6c234a469d16ac28a4a658c92267c423"
+        / "3c363836cf4e16666669a25da280a1865c2d2874"
         # checksum as returned from:
         # echo -e "a\nb\nc" | sha1sum
         # 3ca69e8d6c234a469d16ac28a4a658c92267c423 -
     )
     # File should be empty and in the future not existing...
-    assert os.path.getsize(file_list.absolute()) == 0
+    # print("FILE LIST: ", file_list.absolute())
+    # assert os.path.getsize(file_list.absolute()) == 0
     # To be discarded when file really does not exist anymore
     assert file_list.is_file()

@@ -250,7 +271,7 @@ def test_directory_workflow_no_data(tmp_path: Path) -> None:
         prefix = l_hash[:2]  # first 2 letters
         p = folder / "data" / prefix / l_hash
         # File should be empty and in the future not existing...
-        assert os.path.getsize(p.absolute()) == 0
+        # assert os.path.getsize(p.absolute()) == 0
         # To be discarded when file really does not exist anymore
         assert p.is_file(), f"Could not find {l} as {p}"
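
The test locates each stored file under data/, in a sub-folder named after the first two hex characters of its SHA-1 digest. A small sketch of that lookup, using only hashlib and pathlib, is shown below. The helper names sha1_hex and data_path are hypothetical, and the folder name "ro" is a placeholder for the provenance folder the test helper returns.

```python
import hashlib
from pathlib import PurePosixPath


def sha1_hex(text: str) -> str:
    """SHA-1 hex digest of ASCII text, as `echo -n $x | sha1sum` prints it."""
    return hashlib.sha1(text.encode("ascii")).hexdigest()


def data_path(folder: PurePosixPath, digest: str) -> PurePosixPath:
    """Build <folder>/data/<first two hex chars>/<full digest>."""
    return folder / "data" / digest[:2] / digest


# Digests for the files the test writes into dirIgnore.
EXPECTED = {letter: sha1_hex(letter) for letter in "def"}
```

Computing the digests in the test, rather than hard-coding them, would avoid the stale shell snippet in the comment, at the cost of no longer pinning the expected values.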

73 changes: 73 additions & 0 deletions tests/wf/directory_no_listing.cwl
@@ -0,0 +1,73 @@
+#!/usr/bin/env cwl-runner
+cwlVersion: v1.2
+class: Workflow
+
+doc: >
+  Inspect provided directory and return filenames.
+  Generate a new directory and return it (including content).
+
+hints:
+  - class: DockerRequirement
+    dockerPull: docker.io/debian:stable-slim
+
+inputs:
+  dir:
+    type: Directory
+  ignore:
+    type: Directory
+    loadListing: no_listing
+
+steps:
+  ls:
+    in:
+      dir: dir
+      ignore: ignore
+    out:
+      [listing]
+    run:
+      class: CommandLineTool
+      baseCommand: ls
+      inputs:
+        dir:
+          type: Directory
+          inputBinding:
+            position: 1
+        ignore:
+          type: Directory
+          inputBinding:
+            position: 2
+      outputs:
+        listing:
+          type: stdout
+
+  generate:
+    in: []
+    out:
+      [dir1]
+    run:
+      class: CommandLineTool
+      requirements:
+        - class: ShellCommandRequirement
+      arguments:
+        - shellQuote: false
+          valueFrom: >
+            pwd;
+            mkdir -p dir1/a/b;
+            echo -n a > dir1/a.txt;
+            echo -n b > dir1/a/b.txt;
+            echo -n c > dir1/a/b/c.txt;
+      inputs: []
+      outputs:
+        dir1:
+          type: Directory
+          outputBinding:
+            glob: "dir1"
+
+outputs:
+  listing:
+    type: File
+    outputSource: ls/listing
+  dir1:
+    type: Directory
+    outputSource: generate/dir1