Merge pull request #19 from VForWaTer/remove-autoparse
Remove auto-load and specify data fields
mmaelicke authored Dec 12, 2023
2 parents 9405d31 + 0620708 commit ddb8afb
Showing 1 changed file with 61 additions and 48 deletions.
109 changes: 61 additions & 48 deletions docs/input.md
@@ -46,7 +46,7 @@ if the parameterization of a tool can be applied to other data. That means, the

From a practical perspective, if you build a tool around these tool specifications,
the tool name and content of the sections `parameters` and `data` of `/in/input.json`
can be used to create checksums and therefor help to establish reproducible workflows.
can be used to create checksums and therefore help to establish reproducible workflows.
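
As a minimal sketch of such a checksum, assuming the tool's section of `/in/input.json` has already been loaded into a dictionary with `parameters` and `data` keys: the function name `input_checksum`, the choice of SHA-256, and the canonical JSON serialization are illustrative assumptions, not part of the specification.

```python
import hashlib
import json


def input_checksum(tool_name: str, input_config: dict) -> str:
    """Return a SHA-256 hex digest over tool name, parameters and data.

    Hypothetical helper: the specification only states that these three
    pieces of information can be used for checksums, not how.
    """
    # Only the tool name and the two named sections contribute to the hash.
    payload = {
        "tool": tool_name,
        "parameters": input_config.get("parameters", {}),
        "data": input_config.get("data", {}),
    }
    # Sorted keys and fixed separators give a canonical serialization, so
    # equivalent configurations hash to the same value.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs with the same tool name, parameters, and data paths then produce the same digest regardless of key order, which is one way to detect whether a workflow step was re-run with identical inputs.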


## Parameters: File specification
@@ -147,76 +147,92 @@ Note, that default parameters are only parsed if they are not set as `optional=t
## Data: File specification

All input `Data` is described in a data block in the `/src/tool.yml` file.
All sets of input data are collected as the **optional** `tools.<tool_name>.data` block:
All sets of input data are collected as the **optional** `tools.<tool_name>.data` block.
The simplest declaration of input data is to list all available data files in a
single, top-level list:

```yaml
tools:
  foobar:
    parameters:
      [...]
    data:
      foo_data:
        [...]
      - foo_data
      - foo_data2
```
Refer to the section below to learn about mandatory and optional fields for `Data`.
If any of the dataset sources requires a more detailed configuration, objects
can be specified as well:
```yaml
tools:
  foobar:
    parameters:
      [...]
    data:
      foo_data:
        description: Our first dataset with foo properties
      foo_data2:
        description: Our second dataset with foo2 properties
```
### Fields
Refer to the section below to learn about the fields for `Data`.
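
Because the `data` block may be either a plain list of names or a mapping of names to field objects, a consumer of `tool.yml` might first bring both forms into one shape. The following is a minimal sketch, assuming the YAML has already been parsed into Python objects; `normalize_data_block` is a hypothetical helper and not part of any parsing library.

```python
def normalize_data_block(data):
    """Return a dict mapping each data name to a (possibly empty) field dict.

    Hypothetical helper: accepts the list form, the object form, or a
    missing block, since the `data` block is optional.
    """
    if data is None:
        # The block is optional, so a tool may declare no input data at all.
        return {}
    if isinstance(data, list):
        # List form: only names are given, no further fields.
        return {name: {} for name in data}
    if isinstance(data, dict):
        # Object form: copy the fields; a bare key yields an empty dict.
        return {name: dict(fields or {}) for name, fields in data.items()}
    raise TypeError("data block must be a list, a mapping, or omitted")


# The two declarations from the text end up with the same keys.
simple = normalize_data_block(["foo_data", "foo_data2"])
detailed = normalize_data_block({
    "foo_data": {"description": "Our first dataset with foo properties"},
    "foo_data2": {"description": "Our second dataset with foo2 properties"},
})
```

Downstream code can then look up fields such as `description` without caring which declaration style the tool author chose.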

The following section defines all mandatory and optional fields of a `Data` entity.

#### `load`
### Fields

This is the only **mandatory** field for an entity of `Data`.
Boolean field which defaults to `true`. If set to `load=false`, the file is not parsed by the
library used for parsing input. In this case, file paths are passed as ordinary strings and
the parsing library will not attempt to load the file.
The following section defines all fields of a `Data` entity.

There are a number of file formats, which are loaded by default:

#### `description`

| file extension | Python | R | Matlab | NodeJS |
| ---------------|--------|-----|---------|----------|
| .dat | `numpy.array` | `vector` | `matrix` | `number[][]` |
| .csv | `pandas.DataFrame` | `data.frame` | `matrix` | `number[][]` |
The `description` is a single- or multiline comment to describe the input data.
For the `description` Markdown is allowed, although tool-frameworks are not required to parse it.
Descriptions are optional and can be omitted, but it is highly recommended to
add descriptions to all required data inputs.

A multiline comment in YAML can be specified like:

Note that setting `load=false` can be helpful when developing tools that require to load the
data in a different way than it is provided by the parsing libraries.
```yaml
description: |
  This is the first line
  This is the second line
```

#### `extension`
#### `example`

By default, the file format is derived from the file extension given in the path to the data
in `input.json`. Via the `extension` field, it is possible to override the file format of input
data. This way, it can be ensured that the library used for parsing the input always loads the
file in the respective datastructure to the tool. If the file format / extension is not
supported by the parsing library, file paths are passed just as strings, the parsing library
will not attempt to load the file (see above for supported formats).
The `example` field is optional and can be used to reference a sample dataset
for the given input, **within** the container. Data examples are a prime source
for your users to understand what inputs should look like and how they should be formatted.

```yaml
tools:
  foobar:
    parameters:
      ...
    data:
      foo_data:
        load: true
        extension: .csv
        example: /in/input_name.csv
```

#### `description`
It is considered good practice to add example data and example parameterizations
to the `/in/` folder. At inspection time, when a client application reads the
`tool.yml`, this client can also access the examples in the `/in/` folder.
At runtime, the client application mounts data and parameterizations into the
container at `/in/`, so the examples no longer exist in the container and cannot
accidentally pollute the runtime container.

The `description` is a multiline comment to describe the input data.
For the `description` Markdown is allowed, although tool-frameworks are not required to parse it.
Descriptions are optional and can be omitted.

A multiline comment in YAML can be specified like:
#### `extension`

The `extension` field is optional and can be used to limit the permitted file
extensions for a data input. Either a single string or a list of strings is allowed.
By convention, the leading dot `.` is included in the `extension`.

```yaml
description: |
  This is the first line
  This is the second line
extension: .csv
```

```yaml
extension:
- .dat
- .txt
- .DAT
- .TXT
```
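
A check against the `extension` field could look like the following minimal sketch. It assumes a case-sensitive comparison, which the example listing both `.dat` and `.DAT` suggests but the specification does not state; `extension_allowed` is a hypothetical helper.

```python
from pathlib import Path


def extension_allowed(path, extension=None):
    """Return True if the file extension of `path` is permitted.

    Hypothetical helper: `extension` is either omitted (no restriction),
    a single string such as ".csv", or a list of such strings.
    """
    if extension is None:
        # The field is optional; without it, every file is accepted.
        return True
    # Accept both the single-string and the list form.
    allowed = [extension] if isinstance(extension, str) else list(extension)
    # Path.suffix includes the leading dot, matching the convention above.
    # Note that it only returns the last suffix, e.g. ".gz" for "x.tar.gz".
    return Path(path).suffix in allowed
```

For instance, `extension_allowed("/in/foo.DAT", [".dat", ".txt", ".DAT", ".TXT"])` is `True`, while a file ending in `.Dat` would be rejected under the case-sensitive assumption.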


@@ -248,12 +264,9 @@ tools:
description: An optional array of floats
data:
foo_csv_data:
load: true
extension: .csv
description: |
The parsing library will try to load the data like .csv files,
regardless of the file extension.
This is a CSV file that should contain valid input. We currently do
not specify what exactly that means.
foo_nc_data:
load: false
description: netCDF data that is not loaded by the parsing library.
description: CF-netCDF 1.8 conformant climate model output.
```
