Merge pull request #15 from VForWaTer/data_section
Data section in tool.yml and parameters.json
mmaelicke authored Dec 11, 2023
2 parents c37bd02 + 03afe00 commit a7dc709
Showing 4 changed files with 175 additions and 90 deletions.
59 changes: 30 additions & 29 deletions docs/README.md
@@ -5,15 +5,14 @@
This document describes specifications for generic [`Tool`](./tool.md) entities. A `Tool` is:
* any executable software
* contained in a docker (compatible) container
* transforms [`Input`](./input.md) consisting of optional [`Parameters`](./input.md#parameters-file-specification) and optional [`Data`](./input.md#data-file-specification) into output

A very simplified workflow of a tool execution looks like the flowchart below:

```mermaid
flowchart LR
input -- parameters --> container --> output
input -- data --> container
```

The main objective is to create a community-driven tool interface specification,
@@ -39,8 +38,8 @@ of this specification.

This section lists the implementations we are aware of. By *implementation*,
we are referring to software packages for different programming languages, used in
either of the tools, that help to parse the *parametrization* and the *input data*
of a tool into a language-specific data structure. You can read more about
[parameter and data input](./input.md).

The available implementations as of now are:

@@ -56,30 +55,33 @@ The table below lists which implementations exist and what parts of the
tool specification are already covered:


| specification | json2args (Python 3.X) | json2aRgs (R) | getParameters.m (Octave / MATLAB) | js2args (Node.js) |
|:----------------------|:------------------------:|:------------------:|:-----------------------------------:|:-------------------:|
| **Parameter Types** ||
| string | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| integer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| float | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| enum | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| enum - check values | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
| boolean | :grey_question: | :grey_question: | :grey_question: | :grey_question: |
| datetime | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: |
| asset | :x: | :x: | :x: | :x: |
| **Parameter fields** ||
| array | :heavy_check_mark: | :grey_question: | :grey_question: | :grey_question: |
| default | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
| min & max | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
| **Data fields** ||
| extension - `.dat` | :x: | :x: | :x: | :x: |
| extension - `.csv` | :x: | :x: | :x: | :x: |
| extension - `.nc` | :x: | :x: | :x: | :x: |
| extension - `.sqlite` | :x: | :x: | :x: | :x: |
| load | :x: | :x: | :x: | :x: |
| wildcard search | :x: | :x: | :x: | :x: |
| empty input* | :x: | :x: | :x: | :x: |


\* `empty input` refers to the input specification requiring implementations to be able to handle an empty or missing `/in/input.json` by returning an appropriate empty data structure.
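For illustration, the Python implementation is used inside a tool roughly like this (a sketch based on json2args' documented entry point; treat the exact call as an assumption):

```python
# sketch: parse the parametrization into language-native Python objects
from json2args import get_parameter

kwargs = get_parameter()  # reads and parses the parameter file from /in
print(kwargs)             # a plain dict of the parsed parameter values
```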

## Frameworks

@@ -88,12 +90,11 @@ directly by operating the docker/podman CLI is the most low-level option and always possible.
The listed solutions take some of the management boilerplate off your hands and
might turn out useful.

* [`toolbox-runner`](https://github.com/hydrocode-de/tool-runner) (Python)
* [`tool-runner-js`](https://github.com/hydrocode-de/tool-runner-js) (NodeJS)


## Contents

* [`Tool`](./tool.md) specification
* [`Input (Parameters and Data)`](./input.md) specification
174 changes: 125 additions & 49 deletions docs/parameter.md → docs/input.md
@@ -1,18 +1,23 @@
# Input: Parameter and data definitions

Input of a tool consists of optional `Parameters` and optional `Data`.

We define a `Parameter` to be a structured argument, which is passed to a
[Tool](./tool.md) at runtime. The sum of all passed `Parameters` makes up the
*parameterization* or *parametrization* of a tool execution.

All tools define their `Parameters` and `Data` in the `tool.yml`. This is the
blueprint for the parameter values and input data that are acceptable or
required, along with specifications, e.g. about value ranges, default values and
data types. The actual definition of the input (*parameterization* and *input data*)
when running a tool is file-based and defaults to `/in/input.json`. The JSON
format is mandatory.

## Missing parameterization and input data

In case a [Tool](./tool.md) accepts only optional parameters and no input data,
or no parameters and no data are defined at all, the `/in/input.json` can be
an empty file:

```json
{}
```
@@ -26,25 +31,25 @@ In Python this would be an empty `dict`, in R an empty `list`.
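A minimal sketch of this required behaviour in plain Python (the `read_input` helper is hypothetical and not part of any listed implementation):

```python
import json
import os

def read_input(path: str = "/in/input.json") -> dict:
    # a missing or empty input.json yields an empty data structure
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return {}
    with open(path) as f:
        return json.load(f) or {}
```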

## Parameterization vs. Data

In the semantics of [Tools](tool.md), there is a difference between *data*, which
is processed by a tool, and *parameters*, which configure a tool.
On the one hand, this differentiation reflects the meaning of arguments passed
to generic tools; on the other hand, it has implications for reproducible workflows.

Changing the parameters of the tool results in a different analysis workflow, as
a change in parameters might in principle change the logic. Hence, a different
parameterization describes a different analysis.
Changing data does not change the tool logic. By definition, a tool is reproducible
if the parameterization of a tool can be applied to other data. That means the
**same** analysis is run on **different** data.

From a practical perspective, if you build a tool around these tool specifications,
the tool name and the content of the `parameters` and `data` sections of `/in/input.json`
can be used to create checksums and thereby help to establish reproducible workflows.
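A minimal sketch of that idea (assuming `input.json` nests the `parameters` and `data` sections under the tool name; the `run_checksum` helper is hypothetical):

```python
import hashlib
import json

def run_checksum(tool_name: str, input_path: str = "/in/input.json") -> str:
    # hash the tool name together with the parameters and data sections
    with open(input_path) as f:
        sections = json.load(f).get(tool_name, {})
    # canonical serialization, so that key order does not change the hash
    payload = json.dumps(
        {
            "tool": tool_name,
            "parameters": sections.get("parameters", {}),
            "data": sections.get("data", {}),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs with the same checksum then describe the same analysis on the same input.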


## Parameters: File specification

Each `Parameter` is described in a parameter block in the `/src/tool.yml` file.
All parameters are collected as the mandatory `tools.<tool_name>.parameters` block:
@@ -53,17 +58,18 @@
tools:
foobar:
parameters:
foo_parameter:
[...]
```
Refer to the section below to learn about mandatory and optional fields for a `Parameter`.


### Fields

The following section defines all mandatory and optional fields of a `Parameter` entity.

#### `type`

The `type` field is the only mandatory field. Each parameter needs a data-type.
Allowed data-types include:
@@ -73,15 +79,14 @@
* float
* boolean
* enum
* asset

##### `enum`

The `type=enum` field has an additional mandatory `values` field, which lists all
allowed enum values. Note that enums should be validated by a parsing library
or a library calling the tools. For the tools, enum parameters are treated like
strings as soon as they are read from an `input.json` file.

Example

@@ -97,28 +102,14 @@ tool:
```yaml
- option 3
```
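Since enum parameters reach the tool as plain strings, the check sketched below is the job of a parsing or calling library (names and values are illustrative):

```python
def validate_enum(value: str, allowed: list[str]) -> str:
    # fail early if the value is not one of the values listed in tool.yml
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed}")
    return value  # the tool itself simply receives this string

validate_enum("option 3", ["option 1", "option 2", "option 3"])
```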
#### `asset`
The `type=asset` can be used to specify paths to files or entire folders that are copied unchanged to the `/in/` path of the tool container and thus made available to the tool for further processing. The parsing library never attempts to load and process these files, therefore assets are available as-is in the container. Assets are neither Data nor parameters, but their dynamic nature might influence the tool execution. Hence, they are added as input to the tool.
Assets can be tool configurations, folders containing data, geometry files, or all kinds of other files.
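A sketch of how a tool-framework might stage an asset, using only the behaviour described above (the `stage_asset` helper is hypothetical):

```python
import os
import shutil

def stage_asset(src: str, in_dir: str = "/in") -> str:
    # assets are copied unchanged into the container's /in/ path;
    # no attempt is ever made to parse or load them
    dst = os.path.join(in_dir, os.path.basename(src))
    if os.path.isdir(src):
        shutil.copytree(src, dst)
    else:
        shutil.copy(src, dst)
    return dst
```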
#### `description`
The `description` is a multiline comment to describe the purpose of the parameter.
For the `description`, Markdown is allowed, although tool-frameworks are not required to parse it.
Descriptions are optional and can be omitted.
A multiline comment in YAML can be specified like:
@@ -129,29 +120,106 @@

```yaml
description: |
  This is the first line
  This is the second line
```

#### `array`

The `array` field takes a single boolean value and defaults to `array=false`. If set to `array=true`, the `Parameter` is an array of the specified `type`. The array field **cannot** be combined with the `type=enum` field.

#### `min`

Minimum value for constraining the value range. The `min` field is only valid for `type=integer` and `type=float`. Setting a minimum value is optional and can be omitted.
Note that if a `max` value is additionally specified for the parameter, `min` must be lower than `max`.

#### `max`

Maximum value for constraining the value range. The `max` field is only valid for `type=integer` and `type=float`. Setting a maximum value is optional and can be omitted.
Note that if a `min` value is additionally specified for the parameter, `max` must be higher than `min`.

#### `optional`

Boolean field which defaults to `false`. If set to `optional=true`, the parameter is not required by the tool. This implies that the tool implementation can handle an `input.json` in which the `Parameter` is entirely missing.

#### `default`

The `default` field is of the same data type as the `Parameter` itself. If a default value is set, the tool-framework is required to inject this parameter into the `input.json`, as the tool will treat the default like any other non-optional parameter.
Note that default parameters are only parsed if they are not set as `optional=true`.
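A sketch of the injection rule this implies for a tool-framework (the helper and field access are illustrative):

```python
def inject_defaults(params: dict, spec: dict) -> dict:
    # fill in every parameter that has a default, is not optional,
    # and was not passed by the user
    out = dict(params)
    for name, field in spec.items():
        if "default" in field and not field.get("optional", False) and name not in out:
            out[name] = field["default"]
    return out
```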


## Data: File specification

All input `Data` is described in a data block in the `/src/tool.yml` file.
All sets of input data are collected as the **optional** `tools.<tool_name>.data` block:

```yaml
tools:
foobar:
parameters:
[...]
data:
foo_data:
[...]
```
Refer to the section below to learn about mandatory and optional fields for `Data`.


### Fields

The following section defines all mandatory and optional fields of a `Data` entity.

#### `load`

This is the only **mandatory** field for an entity of `Data`.
It is a boolean field which defaults to `true`. If set to `load=false`, the file is not parsed by the
library used for parsing the input. In this case, file paths are passed as ordinary strings and
the parsing library will not attempt to load the file.

There are a number of file formats that are loaded by default:


| file extension | Python | R | Matlab | NodeJS |
| ---------------|--------|-----|---------|----------|
| .dat | `numpy.array` | `vector` | `matrix` | `number[][]` |
| .csv | `pandas.DataFrame` | `data.frame` | `matrix` | `number[][]` |


Note that setting `load=false` can be helpful when developing tools that need to load the
data in a different way than the parsing libraries provide.
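A sketch of the dispatch a parsing library might implement for the `load` field, mirroring the table above (illustrative, not taken from any listed implementation):

```python
import numpy as np
import pandas as pd

def load_data(path: str, load: bool = True):
    # load=False, or an unsupported format, passes the path through as a string
    if not load:
        return path
    if path.endswith(".csv"):
        return pd.read_csv(path)  # -> pandas.DataFrame, as in the table above
    if path.endswith(".dat"):
        return np.loadtxt(path)   # -> numpy.array, as in the table above
    return path
```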

#### `extension`

By default, the file format is derived from the file extension given in the path to the data
in `input.json`. Via the `extension` field, it is possible to override the file format of input
data. This way, it can be ensured that the library used for parsing the input always loads the
file into the respective data structure for the tool. If the file format / extension is not
supported by the parsing library, file paths are passed as plain strings and the parsing
library will not attempt to load the file (see above for supported formats).

```yaml
tools:
foobar:
parameters:
...
data:
foo_data:
load: true
extension: .csv
```
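Continuing the sketch from the `load` section, the `extension` override simply takes precedence over the file's actual suffix (illustrative):

```python
import os

def effective_format(path: str, extension: str | None = None) -> str:
    # the extension from tool.yml, if given, overrides the real file suffix
    return extension if extension is not None else os.path.splitext(path)[1]

effective_format("/in/foo_data", extension=".csv")  # -> ".csv"
```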

#### `description`

The `description` is a multiline comment to describe the input data.
For the `description`, Markdown is allowed, although tool-frameworks are not required to parse it.
Descriptions are optional and can be omitted.

A multiline comment in YAML can be specified like:

```yaml
description: |
This is the first line
This is the second line
```


## Example

```yaml
@@ -164,8 +232,6 @@ tools:
min: 0
max: 10
description: An integer between 0 and 10
foo_str:
type: string
default: My default string
@@ -180,4 +246,14 @@ tools:
array: true
optional: true
description: An optional array of floats
data:
foo_csv_data:
load: true
extension: .csv
description: |
The parsing library will try to load the data like .csv files,
regardless of the file extension.
foo_nc_data:
load: false
description: netCDF data that is not loaded by the parsing library.
```
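For the example above, a matching `/in/input.json` could look like the following sketch (the top-level nesting under the tool name is an assumption, and all values are made up):

```json
{
  "foobar": {
    "parameters": {
      "foo_int": 5,
      "foo_str": "not the default"
    },
    "data": {
      "foo_csv_data": "/in/foo_data.dat",
      "foo_nc_data": "/in/timeseries.nc"
    }
  }
}
```

Note how `foo_csv_data` points to a file without a `.csv` suffix, which is exactly the case the `extension` override above covers.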