Merge pull request #19 from VForWaTer/remove-autoparse

Remove auto-load and specify data fields
VForWaTer · Dec 12, 2023 · ddb8afb
2 parents 9405d31 + 0620708
commit ddb8afb
Showing 1 changed file with 61 additions and 48 deletions.
diff --git a/docs/input.md b/docs/input.md
@@ -46,7 +46,7 @@ if the parameterization of a tool can be applied to other data. That means, the
 
 From a practical perspective, if you build a tool around these tool specifications,
 the tool name and content of the sections `parameters` and `data` of `/in/input.json` 
-can be used to create checksums and therefor help to establish reproducible workflows.
+can be used to create checksums and therefore help to establish reproducible workflows.
 
 
 ## Parameters: File specification
@@ -147,76 +147,92 @@ Note, that default parameters are only parsed if they are not set as `optional=t
 ## Data: File specification
 
 All input `Data` is described in a data block in the `/src/tool.yml` file.
-All sets of input data are collected as the **optional** `tools.<tool_name>.data` block:
+All sets of input data are collected as the **optional** `tools.<tool_name>.data` block.
+The simplest declaration of input data is to list all available data files in a
+single, top-level list:
 
 ```yaml
 tools:
   foobar:
     parameters:
       [...]
     data:
-      foo_data:
-        [...]
+      - foo_data
+      - foo_data2
 ```
 
-Refer to the section below to learn about mandatory and optional fields for `Data`.
+If any of the dataset sources requires a more detailed configuration, objects
+can be specified as well:
 
+```yaml
+tools:
+  foobar:
+    parameters:
+      [...]
+    data:
+      foo_data:
+        description: Our first dataset with foo properties
+      foo_data2:
+        description: Our second dataset with foo2 properties
+```
 
-### Fields
+Refer to the section below to learn about the fields for `Data`.
 
-The following section defines all mandatory and optional fields of a `Data` entity.
 
-#### `load`
+### Fields
 
-This is the only **mandatory** field for an entity of `Data`.  
-Boolean field which defaults to `true`. If set to `load=false`, the file is not parsed by the 
-library used for parsing input. In this case, file paths are passed as ordinary strings and 
-the parsing library will not attempt to load the file.
+The following section defines all fields of a `Data` entity.
 
-There are a number of file formats, which are loaded by default:
 
+#### `description`
 
-| file extension | Python |  R  |  Matlab |  NodeJS  |
-| ---------------|--------|-----|---------|----------| 
-| .dat  |  `numpy.array` | `vector` | `matrix`  | `number[][]` | 
-| .csv  |  `pandas.DataFrame` | `data.frame` |  `matrix` |  `number[][]` |
+The `description` is a single- or multiline comment to describe the input data.
+For the `description` Markdown is allowed, although tool-frameworks are not required to parse it.
+Descriptions are optional and can be omitted, but it is highly recommended to 
+add descriptions to all required data inputs.
 
+A multiline comment in YAML can be specified like:
 
-Note that setting `load=false` can be helpful when developing tools that require to load the
-data in a different way than it is provided by the parsing libraries.
+```yaml
+description: | 
+    This is the first line
+    This is the second line
+```
 
-#### `extension`
+#### `example`
 
-By default, the file format is derived from the file extension given in the path to the data
-in `input.json`. Via the `extension` field, it is possible to override the file format of input 
-data. This way, it can be ensured that the library used for parsing the input always loads the
-file in the respective datastructure to the tool.  If the file format / extension is not 
-supported by the parsing library, file paths are passed just as strings, the parsing library 
-will not attempt to load the file (see above for supported formats).
+The `example` field is optional and can be used to reference a sample dataset
+for the given input, **within** the container. Data examples are a prime source 
+for your users to understand what inputs should look like and how they are formatted.
 
 ```yaml
-tools:
-  foobar:
-    parameters:
-      ...
-    data:
-      foo_data:
-        load: true
-        extension: .csv
+example: /in/input_name.csv
 ```
 
-#### `description`
+It is considered good practice to add example data and example parameterizations
+to the `/in/` folder. At inspection time, when a client application reads the 
+`tool.yml`, this client can also access the examples in the `/in/` folder.
+At runtime, as the client application mounts data and parameterizations into the
+container at `/in/`, the examples are non-existent in the container and cannot 
+accidentally pollute the runtime container.
 
-The `description` is a multiline comment to describe the input data.
-For the `description` Markdown is allowed, although tool-frameworks are not required to parse it.
-Descriptions are optional and can be omitted.
 
-A multiline comment in YAML can be specified like:
+#### `extension`
+
+The `extension` field is optional and can be used to limit the permitted file 
+extensions for a data input. Either a single string or a list of strings is allowed.
+By convention, the leading dot `.` should be included in the `extension` as well.
 
 ```yaml
-description: | 
-    This is the first line
-    This is the second line
+extension: .csv
+```
+
+```yaml
+extension:
+  - .dat
+  - .txt
+  - .DAT
+  - .TXT
 ```
 
 
@@ -248,12 +264,9 @@ tools:
         description: An optional array of floats
     data:
       foo_csv_data:
-        load: true
-        extension: .csv
         description: |
-          The parsing library will try to load the data like .csv files,
-          regardless of the file extension.
+          This is a CSV file that should contain valid input. We currently do
+          not specify what exactly that means.
       foo_nc_data:
-        load: false
-        description: netCDF data that is not loaded by the parsing library.    
+        description: CF-netCDF 1.8 conformant climate model output.
 ```
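
The revised docs allow `data` to be declared either as a plain list of names or as a mapping of names to field objects, and `extension` to be either a single string or a list of strings. A client reading `tool.yml` might normalize both shapes before using them. The following is a minimal sketch under stated assumptions, not part of the specification or of any existing parsing library: the function names `normalize_data_block` and `extension_permitted` are hypothetical, the input is assumed to be the already-parsed YAML value of `tools.<tool_name>.data`, and extension matching is assumed to be case-sensitive because the docs list `.dat` and `.DAT` separately.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, List, Optional


def normalize_data_block(data: Any) -> Dict[str, Dict[str, Any]]:
    """Return {input_name: fields} for either declaration style (hypothetical helper)."""
    if data is None:
        # The data block is optional, so a missing block means no inputs.
        return {}
    if isinstance(data, list):
        # Simple form: a top-level list of names without any fields.
        return {name: {} for name in data}
    if isinstance(data, dict):
        # Detailed form: each name maps to an object of fields (possibly empty).
        return {name: dict(fields or {}) for name, fields in data.items()}
    raise TypeError("data must be a list of names or a mapping of names to fields")


def allowed_extensions(fields: Dict[str, Any]) -> Optional[List[str]]:
    """Return the permitted extensions as a list, or None if unrestricted."""
    ext = fields.get("extension")
    if ext is None:
        return None
    # A single string and a list of strings are both accepted.
    return [ext] if isinstance(ext, str) else list(ext)


def extension_permitted(path: str, fields: Dict[str, Any]) -> bool:
    """Check a file path against the optional `extension` field (assumed case-sensitive)."""
    allowed = allowed_extensions(fields)
    if allowed is None:
        return True
    # PurePosixPath.suffix includes the leading dot, matching the docs' convention.
    return PurePosixPath(path).suffix in allowed


if __name__ == "__main__":
    specs = normalize_data_block({"foo_csv_data": {"extension": ".csv"}, "foo_nc_data": None})
    print(extension_permitted("/in/input_name.csv", specs["foo_csv_data"]))
    print(extension_permitted("/in/model_output.nc", specs["foo_nc_data"]))
```

Normalizing to a mapping first keeps downstream code independent of which declaration style the tool author chose. Whether a client should reject or merely warn about a non-matching extension is left open here, since the docs only say the field can be used to limit permitted extensions.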