Merge pull request #62 from ARTIST-Association/features/torch_dataset
Torch Dataset
Markus-Goetz authored Dec 11, 2024
2 parents e7d6d81 + 50cae29 commit f4ac086
Showing 70 changed files with 1,440 additions and 41 deletions.
197 changes: 184 additions & 13 deletions README.md
@@ -1,5 +1,7 @@
<p align="center">
<img src="logo.svg" alt="logo" width="450"/>
<a href="https://paint-database.org" target="_blank">
<img src="logo.svg" alt="logo" width="450"/>
</a>
</p>

# PAINT
@@ -11,15 +13,33 @@
[![fair-software.eu](https://img.shields.io/badge/fair--software.eu-%E2%97%8F%20%20%E2%97%8F%20%20%E2%97%8B%20%20%E2%97%8F%20%20%E2%97%8B-orange)](https://fair-software.eu)
[![codecov](https://codecov.io/gh/ARTIST-Association/PAINT/graph/badge.svg?token=B2pjCVgOhc)](https://codecov.io/gh/ARTIST-Association/PAINT)

## What is ``PAINT``
## Welcome to ``PAINT``

The PAINT database makes operational data of concentrating solar power plants available in accordance with the FAIR data
principles, i.e., making them findable, accessible, interoperable, and reusable. Currently, the data encompasses
calibration images, deflectometry measurements, kinematic settings, and weather information of the concentrating solar
power plant in Jülich, Germany, with the global power plant id (GPPD) WRI1030197. Metadata for all database entries
follows the spatio-temporal asset catalog (STAC) standard.
This repository contains code associated with the [PAINT database](https://paint-database.org). The ``PAINT`` database
makes operational data of concentrating solar power plants available in accordance with the FAIR data principles, i.e.,
making them findable, accessible, interoperable, and reusable. Currently, the data encompasses calibration images,
deflectometry measurements, kinematic settings, and weather information of the concentrating solar power plant in
Jülich, Germany, with the global power plant id (GPPD) WRI1030197. Metadata for all database entries follows the
spatio-temporal asset catalog (STAC) standard.

## Installation
## What can this repository do for you?

This repository contains two main types of code:
1. **Preprocessing:** This code was used to preprocess the data and extract all metadata into the STAC format. The
preprocessing included moving and renaming files into the correct structure, converting coordinates to the WGS84
format, and generating all STAC files (items, collections, and catalogs). The code is found in the subpackage
``paint.preprocessing`` and is executed by the scripts located in ``preprocessing-scripts``. It could be useful if
you have similar data that you would also like to process and include in the ``PAINT`` database!
2. **Data Access and Usage:** This code makes it easy to access data from the ``PAINT`` database from within a codebase
and to apply it to a specific use case. Specifically, we provide a ``StacClient`` for browsing the STAC metadata files in
the ``PAINT`` database and downloading specific files. Furthermore, we provide utilities for generating custom benchmarks
to evaluate calibration algorithms, as well as a ``torch.utils.data.Dataset`` for efficiently loading and using
calibration data. The code is found in the subpackage ``paint.data``, and usage examples are found in the ``scripts``
folder.

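To make the STAC terminology concrete, the sketch below builds a minimal STAC item as a plain Python dictionary and checks that the top-level fields required by the STAC item specification are present. The identifier, coordinates, timestamp, and asset path are hypothetical placeholders rather than entries from the ``PAINT`` database, and the helper ``missing_item_fields`` is not part of the ``paint`` package.

```python
import json

# Top-level fields required by the STAC item specification.
REQUIRED_ITEM_FIELDS = (
    "type",
    "stac_version",
    "id",
    "geometry",
    "bbox",
    "properties",
    "links",
    "assets",
)


def missing_item_fields(item: dict) -> list[str]:
    """Return the required top-level STAC item fields that are absent from ``item``."""
    return [field for field in REQUIRED_ITEM_FIELDS if field not in item]


# Hypothetical item: all values below are placeholders for illustration only.
example_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-calibration-item",
    # GeoJSON point in WGS84, given as [longitude, latitude] (rough placeholder values).
    "geometry": {"type": "Point", "coordinates": [6.39, 50.91]},
    "bbox": [6.39, 50.91, 6.39, 50.91],
    "properties": {"datetime": "2024-01-01T12:00:00Z"},
    "links": [],
    "assets": {"data": {"href": "./example.json", "type": "application/json"}},
}

print(json.dumps(example_item, indent=2))
print("Missing fields:", missing_item_fields(example_item))
```
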
In the following, we will highlight how to use the code in more detail!

### Installation
We strongly recommend installing the `PAINT` package in a dedicated `Python3.9+` virtual environment. You can
install ``PAINT`` directly from the GitHub repository via:
```bash
@@ -31,26 +51,177 @@ Alternatively, you can install ``PAINT`` locally. To achieve this, there are two
git clone https://github.com/ARTIST-Association/PAINT.git
```
2. Install the package from the main branch:
- Install basic dependencies: ``pip install -e .``
- Install basic dependencies: ``pip install .``
- If you want to develop paint, install an editable version with developer dependencies: ``pip install -e ".[dev]"``

## Structure
### Structure
The ``PAINT`` repository is structured as shown below:
```
.
├── data-preparation-scripts # Scripts used to generate STAC files and structure data
├── html # Code for the paint-database.org website
├── markers # Saved markers for the WRI1030197 power plant in Jülich
├── paint # Python package
│ ├── data
│ ├── preprocessing
│ └── util
├── plots # Scripts to generate plots
└── tests # Tests for the python package
├── plots # Scripts used to generate plots found in our paper
├── preprocessing-scripts # Scripts used for preprocessing and STAC generation
├── scripts # Scripts highlighting example usage of the data
└── test # Tests for the python package
    ├── data
    ├── preprocessing
    └── util
```

### Example usage: ``StacClient``
The ``StacClient`` provides a simple way of browsing and downloading data from the
[PAINT database](https://paint-database.org) directly within a Python program. The `StacClient` is based on
`pystac`; however, we have included a few wrapper functions tailored to the [PAINT database](https://paint-database.org).

An example of how to use the `StacClient` is found in [this example script](scripts/example_stac_client.py). In this
script, we first initialize the ``StacClient``:
```python
client = StacClient(output_dir=args.output_dir)
```
The `StacClient` includes built-in functions to automatically download the tower measurements for the power plant in
Jülich (global ID WRI1030197), or weather data from the DWD or Jülich for a desired period of time:
```python
# Download tower measurements.
client.get_tower_measurements()

# Download weather data for a given time period.
client.get_weather_data(
    data_sources=args.weather_data_sources,
    start_date=datetime.strptime(args.start_date, mappings.TIME_FORMAT),
    end_date=datetime.strptime(args.end_date, mappings.TIME_FORMAT),
)
```
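The date handling in the call above can be sketched in isolation. The format string below is an assumption chosen for illustration; the real value is whatever ``mappings.TIME_FORMAT`` defines in ``paint``. The helper ``parse_period`` is hypothetical and simply parses both dates and rejects a period that ends before it starts:

```python
from datetime import datetime

# Assumed format string for illustration; the real one is ``mappings.TIME_FORMAT``.
ASSUMED_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_period(
    start: str, end: str, time_format: str = ASSUMED_TIME_FORMAT
) -> tuple[datetime, datetime]:
    """Parse two date strings and check that they form a valid period."""
    start_date = datetime.strptime(start, time_format)
    end_date = datetime.strptime(end, time_format)
    if start_date > end_date:
        raise ValueError(f"Start date {start} lies after end date {end}.")
    return start_date, end_date


start_date, end_date = parse_period("2023-01-01T00:00:00Z", "2023-03-01T00:00:00Z")
print(f"Requesting weather data for {(end_date - start_date).days} days.")
```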
The most useful function enables data from one or more heliostats to be downloaded:
```python
# Download heliostat data.
client.get_heliostat_data(
    heliostats=args.heliostats,
    collections=args.collections,
    filtered_calibration_keys=args.filtered_calibration,
)
```
The arguments have the following meaning:
- ``heliostats`` - a list of heliostats or ``None``. If ``None`` is provided, data for all heliostats will be
downloaded.
- ``collections`` - the STAC collections from which data should be downloaded. These collections include
calibration data, deflectometry measurements, and heliostat properties. If ``None`` is provided, data for all
collections will be downloaded.
- ``filtered_calibration_keys`` - the calibration collection includes multiple items, i.e., raw images, cropped images,
flux images, flux centered images, and calibration properties. This argument selects which of these items
will be downloaded. If ``None`` is provided, data for all items will be downloaded.

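All three arguments share the convention that ``None`` means "everything". A minimal sketch of that convention is shown below; ``resolve_selection`` and the heliostat IDs are hypothetical, and raising an error for unknown entries is an assumption of this sketch, not documented behaviour of the ``StacClient``.

```python
from typing import Optional


def resolve_selection(
    requested: Optional[list[str]], available: list[str]
) -> list[str]:
    """Return all available entries for ``None``, otherwise the requested entries."""
    if requested is None:
        # ``None`` selects everything that is available.
        return list(available)
    unknown = sorted(set(requested) - set(available))
    if unknown:
        # Assumption of this sketch: unknown entries are treated as an error.
        raise ValueError(f"Unknown entries requested: {unknown}")
    return list(requested)


# Hypothetical heliostat IDs used purely for illustration.
available_heliostats = ["AA23", "AA24", "AB38"]

print(resolve_selection(None, available_heliostats))
print(resolve_selection(["AB38"], available_heliostats))
```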
Finally, if you are only interested in the metadata, for example to explore the data in more depth or to generate
plots, you can download the heliostat metadata with the following function:
```python
# Download metadata for all heliostats.
client.get_heliostat_metadata(heliostats=None)
```
Of course, this `StacClient` doesn't cover all possible use cases, but we hope the code provided gives you enough
information to write your own extensions if required!

### Example usage: ``DatasetSplitter``
The ``DatasetSplitter`` class is used to create benchmark dataset splits. When working with calibration data and
developing alignment algorithms, it is important that the train, test, and validation data are
diverse. Currently, there is no standard for benchmarking different algorithms, and part of the ``PAINT`` project is to
provide one. Therefore, we include methods for generating benchmark splits that can then be used for a
standardized evaluation process. We currently support the following splits:
- **Azimuth Split:** This splits the data based on the azimuth of the sun for each considered calibration sample. For a
single heliostat, the ``training_size`` indices with the smallest azimuth values are selected for the training split,
while the ``validation_size`` indices with the largest values are selected for the validation split. The remaining
indices are assigned to the test split. As a result, the train and validation splits contain samples with very
different azimuth values. This difference makes the benchmark difficult and is intended to encourage calibration
methods that are robust.
- **Solstice Split:** This splits the data based on the time of year, more specifically, on how close the measurement
date was to the winter or summer solstice. For a single heliostat, the ``training_size`` indices closest
to the winter solstice are selected for the training split, while the ``validation_size`` indices closest to the summer
solstice are selected for the validation split. The remaining indices are assigned to the test split. As a result, the
train and validation splits contain samples from very different seasons, i.e., from very different conditions. This
difference makes the benchmark difficult and is intended to encourage calibration methods that are robust.

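The azimuth split rule for a single heliostat can be sketched in a few lines of plain Python. This is an illustration of the rule as described above, not the ``DatasetSplitter`` implementation; the function ``azimuth_split``, the sample IDs, and the azimuth values are hypothetical.

```python
def azimuth_split(
    azimuths: dict[str, float], training_size: int, validation_size: int
) -> dict[str, list[str]]:
    """Split sample IDs of one heliostat by sun azimuth.

    The smallest azimuths go to train, the largest to validation, the rest to test.
    """
    if training_size + validation_size > len(azimuths):
        raise ValueError("Not enough samples for the requested split sizes.")
    # Sort the sample IDs by ascending azimuth value.
    ordered = sorted(azimuths, key=azimuths.get)
    validation_start = len(ordered) - validation_size
    return {
        "train": ordered[:training_size],
        "test": ordered[training_size:validation_start],
        "validation": ordered[validation_start:],
    }


# Hypothetical sample IDs mapped to hypothetical sun azimuth values in degrees.
samples = {"s1": -60.0, "s2": 10.5, "s3": -15.0, "s4": 75.0, "s5": 40.0}
splits = azimuth_split(samples, training_size=2, validation_size=1)
print(splits)
```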
The [example dataset splits script](scripts/example_dataset_splits.py) provides an example of how to use the ``DatasetSplitter``.
To generate splits we first initialize the class with an ``input_file`` that contains the path to the metadata required
to generate the splits and an ``output_dir`` where the split information will be saved:
```python
splitter = DatasetSplitter(
    input_file=args.input_file, output_dir=args.output_dir, remove_unused_data=False
)
```
Additionally, the ``remove_unused_data`` boolean indicates whether extra metadata not required for the split calculation
should be removed from the returned ``pandas.DataFrame``. This extra metadata may be useful for generating
plots or analysing the splits in more detail.

To generate the splits we simply call the ``get_dataset_splits()`` function:
```python
# Example for azimuth splits
azimuth_splits = splitter.get_dataset_splits(
    split_type="azimuth", training_size=10, validation_size=30
)
```
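For comparison, the solstice split rule described earlier can be sketched in plain Python. This is not the ``DatasetSplitter`` implementation: the sketch assumes fixed solstice dates (21 June and 21 December), measures closeness in whole days, and uses hypothetical sample IDs and measurement dates.

```python
from datetime import date


def days_to_nearest(day: date, month: int, day_of_month: int) -> int:
    """Return the distance in days from ``day`` to the nearest occurrence of a calendar date."""
    candidates = [date(day.year + offset, month, day_of_month) for offset in (-1, 0, 1)]
    return min(abs((day - candidate).days) for candidate in candidates)


def solstice_split(
    dates: dict[str, date], training_size: int, validation_size: int
) -> dict[str, list[str]]:
    """Split sample IDs of one heliostat by closeness to the winter and summer solstice."""
    if training_size + validation_size > len(dates):
        raise ValueError("Not enough samples for the requested split sizes.")
    # Train: samples closest to the (assumed) winter solstice on 21 December.
    by_winter = sorted(dates, key=lambda key: days_to_nearest(dates[key], 12, 21))
    train = by_winter[:training_size]
    remaining = [key for key in dates if key not in train]
    # Validation: remaining samples closest to the (assumed) summer solstice on 21 June.
    by_summer = sorted(remaining, key=lambda key: days_to_nearest(dates[key], 6, 21))
    validation = by_summer[:validation_size]
    test = [key for key in remaining if key not in validation]
    return {"train": train, "test": test, "validation": validation}


# Hypothetical sample IDs and measurement dates.
measurements = {
    "a": date(2023, 1, 5),
    "b": date(2023, 6, 18),
    "c": date(2023, 3, 30),
    "d": date(2023, 12, 10),
    "e": date(2023, 7, 20),
    "f": date(2023, 10, 1),
}
solstice_splits = solstice_split(measurements, training_size=2, validation_size=2)
print(solstice_splits)
```

Checking the previous and the following year in ``days_to_nearest`` ensures that a measurement in early January counts as close to the preceding December solstice rather than almost a year away from the next one.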

### Example usage: ``PaintCalibrationDataset``
Since multiple calibration items may be required for training an alignment optimization or a similar method, we have
created a custom ``torch.utils.data.Dataset`` that loads calibration items from the ``PAINT`` database. An example of
how to use this dataset is provided in [this script](scripts/example_dataset.py).

There are three ways of creating a ``PaintCalibrationDataset``:
1. Directly creating the dataset, based on calibration data that has already been downloaded and saved in a ``root_dir``:
```python
dataset = PaintCalibrationDataset(
    root_dir=direct_root_dir,
    item_ids=None,
    item_type=args.item_type,
)
```
Here, ``item_ids`` can be a list indicating which of the items contained in ``root_dir`` should be used; if it is
``None``, all items are used. The ``item_type`` determines which type of calibration item is loaded,
i.e., the raw image, cropped image, flux image, flux centered image, or calibration properties file.
2. Creating the dataset from a benchmark file (see above). In this case the ``benchmark_file`` must also be provided:
```python
train, test, val = PaintCalibrationDataset.from_benchmark(
    benchmark_file=benchmark_file,
    root_dir=benchmark_root_dir,
    item_type=args.item_type,
    download=True,
)
```
This class method generates three ``torch.utils.data.Dataset`` objects, one for each of the considered splits.
3. Creating the dataset from a single heliostat or a list of heliostats. In this case, a list of ``heliostats`` must be
provided, and all calibration items for these heliostats are used to create the dataset:
```python
heliostat_dataset = PaintCalibrationDataset.from_heliostats(
    heliostats=heliostats,
    root_dir=heliostat_root_dir,
    item_type=args.item_type,
    download=True,
)
```

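To illustrate what a map-style dataset with an optional ``item_ids`` filter looks like, the sketch below implements the ``__len__`` / ``__getitem__`` protocol that ``torch.utils.data.Dataset`` subclasses provide. It deliberately does not import ``torch`` so that it runs anywhere, and it is not the ``PaintCalibrationDataset`` implementation: the class ``JsonItemDataset``, the ``<item_id>.json`` file layout, and the item IDs are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class JsonItemDataset:
    """Map-style dataset over ``<item_id>.json`` files in ``root_dir`` (hypothetical layout)."""

    def __init__(self, root_dir: Path, item_ids: Optional[list[str]] = None) -> None:
        self.root_dir = Path(root_dir)
        if item_ids is None:
            # ``None`` selects every item found in the root directory.
            item_ids = sorted(path.stem for path in self.root_dir.glob("*.json"))
        self.item_ids = list(item_ids)

    def __len__(self) -> int:
        return len(self.item_ids)

    def __getitem__(self, index: int) -> dict:
        # Load the item lazily, only when it is requested.
        path = self.root_dir / f"{self.item_ids[index]}.json"
        with path.open() as handle:
            return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Write three hypothetical items to disk.
    for item_id in ("1001", "1002", "1003"):
        (root / f"{item_id}.json").write_text(json.dumps({"id": item_id}))

    full_length = len(JsonItemDataset(root))
    subset = JsonItemDataset(root, item_ids=["1002"])
    subset_length = len(subset)
    subset_item = subset[0]

print(full_length, subset_length, subset_item)
```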
### Example usage: Full dataset workflow
The [full dataset workflow](scripts/example_benchmark_dataset_full_workflow.py) provides an executable example of how
``PAINT`` calibration data could be used directly in further applications. When executed, the script downloads the
metadata required to generate benchmark splits (if it has not already been downloaded), generates the benchmark dataset
splits, and initializes a ``torch.utils.data.Dataset`` based on these splits (downloading the data if necessary). The
script takes the following arguments:
- ``metadata_input`` - Path to the file containing the metadata required to generate the dataset splits.
- ``output_dir`` - Root directory to save all outputs and data.
- ``split_type`` - The benchmark dataset split type to apply.
- ``train_size`` - The number of training samples required per heliostat. The total training size depends on the number
of heliostats.
- ``val_size`` - The number of validation samples required per heliostat. The total validation size depends on the
number of heliostats.
- ``remove_unused_data`` - Whether to remove metadata that is not required to load benchmark splits, but may be useful
for plots or data inspection.
- ``item_type`` - The type of calibration item to be loaded, i.e., raw image, cropped image, flux image, flux
centered image, or calibration properties.

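A command-line interface with these arguments could be declared with ``argparse`` roughly as follows. This is a sketch, not the parser used in the script: the ``--`` flag spelling, all default values, the ``"solstice"`` choice string, and the ``item_type`` default are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the workflow arguments (flag names and defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Benchmark dataset workflow sketch.")
    parser.add_argument("--metadata_input", type=str, required=True,
                        help="Path to the metadata file used to generate the splits.")
    parser.add_argument("--output_dir", type=str, default="outputs",
                        help="Root directory for all outputs and data.")
    parser.add_argument("--split_type", type=str, default="azimuth",
                        choices=["azimuth", "solstice"],
                        help="Benchmark dataset split type to apply.")
    parser.add_argument("--train_size", type=int, default=10,
                        help="Number of training samples per heliostat.")
    parser.add_argument("--val_size", type=int, default=30,
                        help="Number of validation samples per heliostat.")
    parser.add_argument("--remove_unused_data", action="store_true",
                        help="Remove metadata not required to load the splits.")
    parser.add_argument("--item_type", type=str, default="calibration_properties",
                        help="Type of calibration item to load (assumed identifier).")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--metadata_input", "metadata.csv", "--split_type", "solstice"]
)
print(args)
```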
Feel free to execute the script and have some fun :rocket:!

## How to contribute
Check out our [contribution guidelines](CONTRIBUTING.md) if you are interested in contributing to the `PAINT` project :fire:.
Please also carefully check our [code of conduct](CODE_OF_CONDUCT.md) :blue_heart:.