Merge pull request #15 from VForWaTer/data_section
Data section in tool.yml and parameters.json
mmaelicke authored Dec 11, 2023
2 parents c37bd02 + 03afe00 commit a7dc709
Showing 4 changed files with 175 additions and 90 deletions.
59 changes: 30 additions & 29 deletions docs/README.md
@@ -5,15 +5,14 @@
This document describes specifications for generic [`Tool`](./tool.md) entities. A `Tool` is:
* any executable software
* contained in a docker (compatible) container
* transforms [`Input`](./input.md) consisting of optional [`Parameters`](./input.md#parameters-file-specification) and optional [`Data`](./input.md#data-file-specification) into output

A very simplified workflow of a tool execution looks like the flowchart below:

```mermaid
flowchart LR
input -- parameters --> container --> output
input -- data --> container
```

The main objective is to create a community-driven tool interface specification,
@@ -39,8 +38,8 @@ of this specification.

This section lists the implementations we are aware of. By *implementation*,
we are referring to software packages for different programming languages, used in
either of the tools, that help to parse the *parametrization* and the *input data*
of a tool into a language-specific data structure. You can read more about
[parameter and data input](./input.md).

The available implementations as of now are:

@@ -56,30 +55,33 @@ The table below lists which implementations exist and what parts of the
tool specification are already covered:


| specification | json2args (Python 3.X) | json2aRgs (R) | getParameters.m (Octave / MATLAB) | js2args (Node.js) |
|:----------------------|:------------------------:|:------------------:|:-----------------------------------:|:-------------------:|
| **Parameter Types** ||
| string | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| integer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| float | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| enum | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| enum - check values | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
| boolean | :grey_question: | :grey_question: | :grey_question: | :grey_question: |
| datetime | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: |
| asset | :x: | :x: | :x: | :x: |
| **Parameter fields** ||
| array | :heavy_check_mark: | :grey_question: | :grey_question: | :grey_question: |
| default | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
| min & max | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
| **Data fields** ||
| extension - `.dat` | :x: | :x: | :x: | :x: |
| extension - `.csv` | :x: | :x: | :x: | :x: |
| extension - `.nc` | :x: | :x: | :x: | :x: |
| extension - `.sqlite` | :x: | :x: | :x: | :x: |
| load | :x: | :x: | :x: | :x: |
| wildcard search | :x: | :x: | :x: | :x: |
| empty input* | :x: | :x: | :x: | :x: |


\* `empty input` refers to the input specification requiring implementations to be able to handle an empty or missing `/in/input.json` by returning an appropriate empty data structure.
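For illustration, the Python implementation is used inside a tool roughly like this (a sketch based on json2args' documented entry point; treat the exact call as an assumption):

```python
# sketch: parse the parametrization into language-native Python objects
from json2args import get_parameter

kwargs = get_parameter()  # reads and parses the parameter file from /in
print(kwargs)             # a plain dict of the parsed parameter values
```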

## Frameworks

@@ -88,12 +90,11 @@ directly by operating the docker/podman CLI is the most low-level option and always possible.
The listed solutions take some of the management boilerplate off your hands and
might turn out useful.

* [`toolbox-runner`](https://github.com/hydrocode-de/tool-runner) (Python)
* [`tool-runner-js`](https://github.com/hydrocode-de/tool-runner-js) (NodeJS)


## Contents

* [`Tool`](./tool.md) specification
* [`Input (Parameters and Data)`](./input.md) specification
174 changes: 125 additions & 49 deletions docs/parameter.md → docs/input.md
@@ -1,18 +1,23 @@
# Input: Parameter and data definitions

Input of a tool consists of optional `Parameters` and optional `Data`.

We define a `Parameter` to be a structured argument, which is passed to a
[Tool](./tool.md) at runtime. The sum of all passed `Parameters` makes up the
*parameterization* or *parametrization* of a tool execution.

All tools define their `Parameters` and `Data` in the `tool.yml`. This is the
blueprint for the parameter values and input data that are acceptable or
required, along with specifications, e.g. about value ranges, default values and
data types. The actual definition of the input (*parameterization* and *input data*)
when running a tool is file-based and defaults to `/in/input.json`. The JSON
format is mandatory.

## Missing parameterization and input data

In case a [Tool](./tool.md) accepts only optional parameters and no input data,
or no parameters and no data are defined at all, the `/in/input.json` can be
an empty file:

```json
{}
```
@@ -26,25 +31,25 @@ In Python this would be an empty `dict`, in R an empty `list`.
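A minimal sketch of this required behaviour in plain Python (the `read_input` helper is hypothetical and not part of any listed implementation):

```python
import json
import os

def read_input(path: str = "/in/input.json") -> dict:
    # a missing or empty input.json yields an empty data structure
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return {}
    with open(path) as f:
        return json.load(f) or {}
```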

## Parameterization vs. Data

In the semantics of [Tools](tool.md), there is a difference between *data*, which
is processed by a tool, and *parameters*, which configure a tool.
On the one hand, this differentiation reflects the meaning of arguments passed
to generic tools; on the other hand, it has implications for reproducible workflows.

Changing the parameters of the tool results in a different analysis workflow, as
a change in parameters might in principle change the logic. Hence, a different
parameterization describes a different analysis.
Changing data does not change the tool logic. By definition, a tool is reproducible
if the parameterization of a tool can be applied to other data. That means the
**same** analysis is run on **different** data.

From a practical perspective, if you build a tool around these tool specifications,
the tool name and the content of the `parameters` and `data` sections of `/in/input.json`
can be used to create checksums and thereby help to establish reproducible workflows.
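A minimal sketch of that idea (assuming `input.json` nests the `parameters` and `data` sections under the tool name; the `run_checksum` helper is hypothetical):

```python
import hashlib
import json

def run_checksum(tool_name: str, input_path: str = "/in/input.json") -> str:
    # hash the tool name together with the parameters and data sections
    with open(input_path) as f:
        sections = json.load(f).get(tool_name, {})
    # canonical serialization, so that key order does not change the hash
    payload = json.dumps(
        {
            "tool": tool_name,
            "parameters": sections.get("parameters", {}),
            "data": sections.get("data", {}),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs with the same checksum then describe the same analysis on the same input.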


## Parameters: File specification

Each `Parameter` is described in a parameter block in the `/src/tool.yml` file.
All parameters are collected as the mandatory `tools.<tool_name>.parameters` block:
@@ -53,17 +58,18 @@
tools:
foobar:
parameters:
foo_parameter:
[...]
```
Refer to the section below to learn about mandatory and optional fields for a `Parameter`.


### Fields

The following section defines all mandatory and optional fields of a `Parameter` entity.

#### `type`

The `type` field is the only mandatory field. Each parameter needs a data-type.
Allowed data-types include:
@@ -73,15 +79,14 @@
* float
* boolean
* enum
* asset

##### `enum`

The `type=enum` field has an additional mandatory `values` field, which lists all
allowed enum values. Note that enums should be validated by a parsing library
or a library calling the tools. For the tools, enum parameters are treated like
strings as soon as they are read from an `input.json` file.

Example

@@ -97,28 +102,14 @@ tool:
```yaml
- option 3
```
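Since enum parameters reach the tool as plain strings, the check sketched below is the job of a parsing or calling library (names and values are illustrative):

```python
def validate_enum(value: str, allowed: list[str]) -> str:
    # fail early if the value is not one of the values listed in tool.yml
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed}")
    return value  # the tool itself simply receives this string

validate_enum("option 3", ["option 1", "option 2", "option 3"])
```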
#### `asset`
The `type=asset` can be used to specify paths to files or entire folders that are copied unchanged to the `/in/` path of the tool container and thus made available to the tool for further processing. The parsing library never attempts to load and process these files, therefore assets are available as-is in the container. Assets are neither Data nor parameters, but their dynamic nature might influence the tool execution. Hence, they are added as input to the tool.
Assets can be tool configurations, folders containing data, geometry files, or all kinds of other files.
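A sketch of how a tool-framework might stage an asset, using only the behaviour described above (the `stage_asset` helper is hypothetical):

```python
import os
import shutil

def stage_asset(src: str, in_dir: str = "/in") -> str:
    # assets are copied unchanged into the container's /in/ path;
    # no attempt is ever made to parse or load them
    dst = os.path.join(in_dir, os.path.basename(src))
    if os.path.isdir(src):
        shutil.copytree(src, dst)
    else:
        shutil.copy(src, dst)
    return dst
```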
#### `description`
The `description` is a multiline comment to describe the purpose of the parameter.
For the `description`, Markdown is allowed, although tool-frameworks are not required to parse it.
Descriptions are optional and can be omitted.
A multiline comment in YAML can be specified like:
@@ -129,29 +120,106 @@

```yaml
description: |
  This is the first line
  This is the second line
```

#### `array`

The `array` field takes a single boolean value and defaults to `array=false`. If set to `array=true`, the `Parameter` is an array of the specified `type`. The array field **cannot** be combined with the `type=enum` field.

#### `min`

Minimum value for constraining the value range. The `min` field is only valid for `type=integer` and `type=float`. Setting a minimum value is optional and can be omitted.
Note that if a `max` value is additionally specified for the parameter, `min` must be lower than `max`.

#### `max`

Maximum value for constraining the value range. The `max` field is only valid for `type=integer` and `type=float`. Setting a maximum value is optional and can be omitted.
Note that if a `min` value is additionally specified for the parameter, `max` must be higher than `min`.

#### `optional`

Boolean field which defaults to `false`. If set to `optional=true`, the parameter is not required by the tool. This implies that the tool implementation can handle an `input.json` in which the `Parameter` is entirely missing.

#### `default`

The `default` field is of the same data type as the `Parameter` itself. If a default value is set, the tool-framework is required to inject this parameter into the `input.json`, as the tool will treat the default like any other non-optional parameter.
Note that default parameters are only parsed if they are not set as `optional=true`.
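A sketch of the injection rule this implies for a tool-framework (the helper and field access are illustrative):

```python
def inject_defaults(params: dict, spec: dict) -> dict:
    # fill in every parameter that has a default, is not optional,
    # and was not passed by the user
    out = dict(params)
    for name, field in spec.items():
        if "default" in field and not field.get("optional", False) and name not in out:
            out[name] = field["default"]
    return out
```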


## Data: File specification

All input `Data` is described in a data block in the `/src/tool.yml` file.
All sets of input data are collected as the **optional** `tools.<tool_name>.data` block:

```yaml
tools:
foobar:
parameters:
[...]
data:
foo_data:
[...]
```
Refer to the section below to learn about mandatory and optional fields for `Data`.


### Fields

The following section defines all mandatory and optional fields of a `Data` entity.

#### `load`

This is the only **mandatory** field for an entity of `Data`.
It is a boolean field which defaults to `true`. If set to `load=false`, the file is not parsed by the
library used for parsing the input. In this case, file paths are passed as ordinary strings and
the parsing library will not attempt to load the file.

There are a number of file formats that are loaded by default:


| file extension | Python | R | Matlab | NodeJS |
| ---------------|--------|-----|---------|----------|
| .dat | `numpy.array` | `vector` | `matrix` | `number[][]` |
| .csv | `pandas.DataFrame` | `data.frame` | `matrix` | `number[][]` |


Note that setting `load=false` can be helpful when developing tools that need to load the
data in a different way than the parsing libraries provide.
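A sketch of the dispatch a parsing library might implement for the `load` field, mirroring the table above (illustrative, not taken from any listed implementation):

```python
import numpy as np
import pandas as pd

def load_data(path: str, load: bool = True):
    # load=False, or an unsupported format, passes the path through as a string
    if not load:
        return path
    if path.endswith(".csv"):
        return pd.read_csv(path)  # -> pandas.DataFrame, as in the table above
    if path.endswith(".dat"):
        return np.loadtxt(path)   # -> numpy.array, as in the table above
    return path
```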

#### `extension`

By default, the file format is derived from the file extension given in the path to the data
in `input.json`. Via the `extension` field, it is possible to override the file format of input
data. This way, it can be ensured that the library used for parsing the input always loads the
file into the respective data structure for the tool. If the file format / extension is not
supported by the parsing library, file paths are passed as plain strings and the parsing
library will not attempt to load the file (see above for supported formats).

```yaml
tools:
foobar:
parameters:
...
data:
foo_data:
load: true
extension: .csv
```
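Continuing the sketch from the `load` section, the `extension` override simply takes precedence over the file's actual suffix (illustrative):

```python
import os

def effective_format(path: str, extension: str | None = None) -> str:
    # the extension from tool.yml, if given, overrides the real file suffix
    return extension if extension is not None else os.path.splitext(path)[1]

effective_format("/in/foo_data", extension=".csv")  # -> ".csv"
```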

#### `description`

The `description` is a multiline comment to describe the input data.
For the `description`, Markdown is allowed, although tool-frameworks are not required to parse it.
Descriptions are optional and can be omitted.

A multiline comment in YAML can be specified like:

```yaml
description: |
This is the first line
This is the second line
```


## Example

```yaml
@@ -164,8 +232,6 @@ tools:
min: 0
max: 10
description: An integer between 0 and 10
foo_str:
type: string
default: My default string
@@ -180,4 +246,14 @@ tools:
array: true
optional: true
description: An optional array of floats
data:
foo_csv_data:
load: true
extension: .csv
description: |
The parsing library will try to load the data like .csv files,
regardless of the file extension.
foo_nc_data:
load: false
description: netCDF data that is not loaded by the parsing library.
```
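For the example above, a matching `/in/input.json` could look like the following sketch (the top-level nesting under the tool name is an assumption, and all values are made up):

```json
{
  "foobar": {
    "parameters": {
      "foo_int": 5,
      "foo_str": "not the default"
    },
    "data": {
      "foo_csv_data": "/in/foo_data.dat",
      "foo_nc_data": "/in/timeseries.nc"
    }
  }
}
```

Note how `foo_csv_data` points to a file without a `.csv` suffix, which is exactly the case the `extension` override above covers.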