Torch Dataset #62

Merged · 45 commits · Dec 11, 2024
Changes from all commits
45b92ef
extend mappings for dataset
kalebphipps Dec 5, 2024
a7f1ed8
include initial dataset, only basic functions
kalebphipps Dec 5, 2024
6247b25
include logger
kalebphipps Dec 5, 2024
55cdc8b
include class method to download from benchmark file
kalebphipps Dec 5, 2024
e09b7dc
fix logging
kalebphipps Dec 6, 2024
16ce746
fix docstrings
kalebphipps Dec 6, 2024
15263a7
add new function for dataset - not yet finished
kalebphipps Dec 6, 2024
5b4a0a7
include boolean to save in one folder if the data is required for a d…
kalebphipps Dec 6, 2024
b8cc4ce
include from heliostat class method
kalebphipps Dec 6, 2024
98d3b30
update example
kalebphipps Dec 6, 2024
bcd4714
include docstrings
kalebphipps Dec 9, 2024
6ac0b87
example script for the entire workflow including downloading data
kalebphipps Dec 9, 2024
013da2b
fix error with mappings
kalebphipps Dec 9, 2024
32d3389
include torchvision as dependency
kalebphipps Dec 9, 2024
d562be5
improve loading from benchmark method to only consider data available…
kalebphipps Dec 9, 2024
720aa3f
include tests for the dataset
kalebphipps Dec 9, 2024
2133da6
ruff formatting
kalebphipps Dec 9, 2024
140f0d3
Merge branch 'main' into features/torch_dataset
kalebphipps Dec 9, 2024
0f60a77
add test data for dataset
kalebphipps Dec 9, 2024
768901c
rename unused variables
kalebphipps Dec 9, 2024
adb1db8
fix type hints
kalebphipps Dec 9, 2024
9503ecd
include test for custom string representation
kalebphipps Dec 9, 2024
ceb0d04
fix initial orientation values
kalebphipps Dec 9, 2024
515d847
fix formatting
kalebphipps Dec 9, 2024
007edc0
fix formatting
kalebphipps Dec 9, 2024
b215e6d
fix capitalization of index column
kalebphipps Dec 9, 2024
f46bfb9
rename test files
kalebphipps Dec 10, 2024
692549d
remove unused files
kalebphipps Dec 10, 2024
586bfd2
adjust test coverage
kalebphipps Dec 10, 2024
25b86e2
adjust test for new names
kalebphipps Dec 10, 2024
409031e
adjust test for new names and include test of string representation
kalebphipps Dec 10, 2024
0c6c594
adjust names of calibration item identifiers
kalebphipps Dec 10, 2024
0c3aa35
exclude download function from test coverage
kalebphipps Dec 10, 2024
1a7a736
adjust STAC for new file names
kalebphipps Dec 10, 2024
0be6d8a
improve memory efficiency and adjust file names
kalebphipps Dec 10, 2024
b2c41d1
improve argument description and include choices
kalebphipps Dec 10, 2024
7aa2b31
include choices
kalebphipps Dec 10, 2024
b892da3
include choices
kalebphipps Dec 10, 2024
e0a663e
include choices
kalebphipps Dec 10, 2024
b1e6f66
include choices
kalebphipps Dec 10, 2024
4f53c63
make executable
kalebphipps Dec 10, 2024
1b1a7c2
make executable
kalebphipps Dec 10, 2024
fa0781a
fix typos
kalebphipps Dec 10, 2024
afaab7d
update structure and include examples
kalebphipps Dec 10, 2024
50cae29
fix typo
kalebphipps Dec 10, 2024
197 changes: 184 additions & 13 deletions README.md
<p align="center">
<a href="https://paint-database.org" target="_blank">
<img src="logo.svg" alt="logo" width="450"/>
</a>
</p>

# PAINT
[![fair-software.eu](https://img.shields.io/badge/fair--software.eu-%E2%97%8F%20%20%E2%97%8F%20%20%E2%97%8B%20%20%E2%97%8F%20%20%E2%97%8B-orange)](https://fair-software.eu)
[![codecov](https://codecov.io/gh/ARTIST-Association/PAINT/graph/badge.svg?token=B2pjCVgOhc)](https://codecov.io/gh/ARTIST-Association/PAINT)

## Welcome to ``PAINT``

This repository contains code associated with the [PAINT database](https://paint-database.org). The ``PAINT`` database
makes operational data of concentrating solar power plants available in accordance with the FAIR data principles, i.e.,
making them findable, accessible, interoperable, and reusable. Currently, the data encompasses calibration images,
deflectometry measurements, kinematic settings, and weather information of the concentrating solar power plant in
Jülich, Germany, with the global power plant ID (GPPD) WRI1030197. Metadata for all database entries follows the
spatio-temporal asset catalog (STAC) standard.

## What can this repository do for you?

This repository contains two main types of code:
1. **Preprocessing:** This code was used to preprocess the data and extract all metadata into the STAC format. This
preprocessing included moving and renaming files to be in the correct structure, converting coordinates to the WGS84
format, and generating all STAC files (items, collections, and catalogs). This code is found in the subpackage
``paint.preprocessing`` and executed in the scripts located in ``preprocessing-scripts``. This code could be useful if
you have similar data that you would also like to process and include in the ``PAINT`` database!
2. **Data Access and Usage:** This code enables data from the ``PAINT`` database to be easily accessed from a codebase
and applied to a specific use case. Specifically, we provide a ``StacClient`` for browsing the STAC metadata files in
the ``PAINT`` database and downloading specific files. Furthermore, we provide utilities to generate custom benchmarks
for evaluating various calibration algorithms, as well as a ``torch.Dataset`` for efficiently loading and using
calibration data. This code is found in the subpackage ``paint.data``, and examples of its use are found in the
``scripts`` folder.

In the following, we will highlight how to use the code in more detail!

### Installation
We strongly recommend installing the `PAINT` package in a dedicated `Python3.9+` virtual environment. You can
install ``PAINT`` directly from the GitHub repository via:
```bash
pip install git+https://github.com/ARTIST-Association/PAINT
```
Alternatively, you can install ``PAINT`` locally. To achieve this, there are two steps:
1. Clone the repository:
```bash
git clone https://github.com/ARTIST-Association/PAINT.git
```
2. Install the package from the main branch:
   - Install basic dependencies: ``pip install .``
- If you want to develop paint, install an editable version with developer dependencies: ``pip install -e ".[dev]"``

### Structure
The ``PAINT`` repository is structured as shown below:
```
.
├── html                   # Code for the paint-database.org website
├── markers                # Saved markers for the WRI1030197 power plant in Jülich
├── paint                  # Python package
│   ├── data
│   ├── preprocessing
│   └── util
├── plots                  # Scripts used to generate plots found in our paper
├── preprocessing-scripts  # Scripts used for preprocessing and STAC generation
├── scripts                # Scripts highlighting example usage of the data
└── tests                  # Tests for the python package
    ├── data
    ├── preprocessing
    └── util
```

### Example usage: ``StacClient``
The ``StacClient`` provides a simple way to browse and download data from the
[PAINT database](https://paint-database.org) directly within a Python program. The `StacClient` we provide is based on
`pystac`; however, we have included a few wrapper functions tailored to the [PAINT database](https://paint-database.org).

An example of how to use the `StacClient` is found in [this example script](scripts/example_stac_client.py). In this
script, we first initialize the ``StacClient``:
```python
client = StacClient(output_dir=args.output_dir)
```
The `StacClient` includes built-in functions to automatically download the tower measurements for the power plant with
the global ID WRI1030197 in Jülich, as well as weather data from the DWD or Jülich for a desired period of time:
```python
# Download tower measurements.
client.get_tower_measurements()

# Download weather data for a given time period.
client.get_weather_data(
    data_sources=args.weather_data_sources,
    start_date=datetime.strptime(args.start_date, mappings.TIME_FORMAT),
    end_date=datetime.strptime(args.end_date, mappings.TIME_FORMAT),
)
```
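The start and end dates are parsed with ``datetime.strptime`` using the project's timestamp pattern,
``mappings.TIME_FORMAT``. A minimal sketch of this parse step with a placeholder pattern (the format string below is an
assumption for illustration, not the real constant):

```python
from datetime import datetime

# Placeholder pattern; the real value is paint's mappings.TIME_FORMAT.
TIME_FORMAT = "%Y-%m-%d"

start_date = datetime.strptime("2023-06-01", TIME_FORMAT)
end_date = datetime.strptime("2023-06-30", TIME_FORMAT)
assert start_date < end_date  # the client expects a valid period
```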
The most useful function enables data from one or more heliostats to be downloaded:
```python
# Download heliostat data.
client.get_heliostat_data(
    heliostats=args.heliostats,
    collections=args.collections,
    filtered_calibration_keys=args.filtered_calibration,
)
```
Here, the arguments are important:
- ``heliostats`` - a list of heliostats, or `None`. If `None` is provided, data for all heliostats will be downloaded.
- ``collections`` - indicates which STAC collections data should be downloaded from. These collections include
calibration data, deflectometry measurements, and heliostat properties. If ``None`` is provided, data for all
collections will be downloaded.
- ``filtered_calibration_keys`` - the calibration collection includes multiple items, i.e., raw images, cropped images,
flux images, flux-centered images, and calibration properties. This argument determines which of these items will be
downloaded. If ``None`` is provided, all items will be downloaded.
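The "``None`` means everything" convention shared by these arguments can be sketched in plain Python; the ``select``
helper below is purely illustrative and not part of the ``paint`` package:

```python
def select(requested, available):
    # None acts as a wildcard: download everything that is available.
    if requested is None:
        return list(available)
    return [entry for entry in requested if entry in available]

collections = ["calibration", "deflectometry", "properties"]
print(select(None, collections))             # all three collections
print(select(["calibration"], collections))  # only the calibration collection
```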

Finally, if you are only interested in the metadata for more in-depth data exploration or to generate plots, you can
download the heliostat metadata with the following function:
```python
# Download metadata for all heliostats.
client.get_heliostat_metadata(heliostats=None)
```
Of course, this `StacClient` does not cover all possible use cases - but with the code provided, we hope to give you
enough information to write your own extensions if required!

### Example usage: ``DatasetSplitter``
The ``DatasetSplitter`` class is used to create benchmark dataset splits. When working with calibration data and
developing alignment algorithms to optimize performance, it is important that the train, test, and validation data are
diverse. Currently, there is no standard for benchmarking different algorithms, and part of the ``PAINT`` project is to
provide this standard. Therefore, we include methods for generating benchmark splits that can then be used for a
standardized evaluation process. We currently support the following splits:
- **Azimuth Split:** This splits the data based on the azimuth of the sun for each considered calibration sample. For a
single heliostat, the ``training_size`` indices with the smallest azimuth values are selected for the training split,
while the ``validation_size`` indices with the largest values are selected for the validation split. The remaining
indices are assigned to the test split. This ensures that indices with very different azimuth values end up in the
train and validation splits, so these splits should contain very different samples. This difference leads to a high
level of difficulty and should guarantee that the trained calibration method is robust.
- **Solstice Split:** This splits the data based on the time of year, more specifically, how close the measurement
date was to the winter or summer solstice. For a single heliostat, the ``training_size`` indices closest to the winter
solstice are selected for the training split, while the ``validation_size`` indices closest to the summer solstice are
selected for the validation split. The remaining indices are assigned to the test split. This ensures that indices
from very different seasons, i.e., under very different conditions, are considered in training and validation, so the
train and validation splits should contain very different samples. This difference leads to a high level of difficulty
and should guarantee that the trained calibration method is robust.
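The azimuth selection rule can be sketched in a few lines of dependency-free Python; this is a minimal illustration of
the ranking logic, not the library's implementation:

```python
def azimuth_split(azimuths, training_size, validation_size):
    # Rank sample indices by their azimuth value.
    order = sorted(range(len(azimuths)), key=lambda i: azimuths[i])
    train = order[:training_size]          # smallest azimuth values
    validation = order[-validation_size:]  # largest azimuth values
    test = [i for i in order if i not in train and i not in validation]
    return train, validation, test

# Six toy azimuth values (degrees) for one heliostat.
train, validation, test = azimuth_split(
    [10.0, 85.0, 40.0, 170.0, 120.0, 60.0], training_size=2, validation_size=2
)
print(train, validation, test)  # [0, 2] [4, 3] [5, 1]
```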

The [example dataset splits script](scripts/example_dataset_splits.py) provides an example of how to use the ``DatasetSplitter``.
To generate splits, we first initialize the class with an ``input_file`` containing the path to the metadata required
to generate the splits and an ``output_dir`` where the split information will be saved:
```python
splitter = DatasetSplitter(
    input_file=args.input_file, output_dir=args.output_dir, remove_unused_data=False
)
```
Additionally, the ``remove_unused_data`` boolean indicates whether extra metadata not required for the split
calculation should be removed from the returned ``pandas.DataFrame``. This extra metadata may be useful for generating
plots or analyzing the splits in more detail.
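The effect of ``remove_unused_data=True`` can be illustrated on a plain dictionary; the helper and key names below are
hypothetical, and the real class operates on a ``pandas.DataFrame`` rather than single records:

```python
def prune_metadata(record, required_keys):
    # Keep only the fields needed to compute the splits.
    return {key: value for key, value in record.items() if key in required_keys}

sample = {"azimuth": 135.2, "datetime": "2023-06-01", "camera_id": 3}
print(prune_metadata(sample, required_keys={"azimuth", "datetime"}))
```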

To generate the splits, we simply call the ``get_dataset_splits()`` function:
```python
# Example for azimuth splits.
azimuth_splits = splitter.get_dataset_splits(
    split_type="azimuth", training_size=10, validation_size=30
)
```
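For the solstice split, the ranking key is the distance in days between a measurement date and a solstice. A
self-contained sketch, assuming solstices fall on June 21 and December 21 (the actual implementation may use exact
astronomical dates):

```python
from datetime import date

def days_to_solstice(measurement_date, month, day=21):
    # Distance in days to the nearest solstice in the given month (assumed on the 21st),
    # checking the previous, current, and next year to handle dates near year boundaries.
    candidates = [
        date(year, month, day)
        for year in (measurement_date.year - 1, measurement_date.year, measurement_date.year + 1)
    ]
    return min(abs((measurement_date - candidate).days) for candidate in candidates)

# Closest to the winter solstice -> training split; closest to the summer solstice -> validation split.
print(days_to_solstice(date(2024, 1, 1), month=12))  # 11
print(days_to_solstice(date(2024, 1, 1), month=6))   # 172
```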

### Example usage: ``PaintCalibrationDataset``
Since multiple calibration items may be required for training an alignment optimization algorithm or similar tasks, we
have created a custom ``torch.Dataset`` that loads calibration items from the ``PAINT`` database. An example of how to
use this dataset is provided in [this script](scripts/example_dataset.py).

There are three ways of creating a ``PaintCalibrationDataset``:
1. Directly creating the dataset, based on calibration data that has already been downloaded and saved in a ``root_dir``:
```python
dataset = PaintCalibrationDataset(
    root_dir=direct_root_dir,
    item_ids=None,
    item_type=args.item_type,
)
```
Here, ``item_ids`` can be a list indicating which of the items contained in the ``root_dir`` should be used; if
``None``, all items will be used. The ``item_type`` determines what type of calibration item should be loaded, i.e.,
the raw image, cropped image, flux image, flux-centered image, or calibration properties file.
2. Creating the dataset from a benchmark file (see above). In this case, the ``benchmark_file`` must also be provided:
```python
train, test, val = PaintCalibrationDataset.from_benchmark(
    benchmark_file=benchmark_file,
    root_dir=benchmark_root_dir,
    item_type=args.item_type,
    download=True,
)
```
This class method will generate three ``torch.Dataset`` objects, one for each of the considered splits.
3. Creating the dataset from a single heliostat or a list of heliostats. In this case, all calibration items for the
provided heliostats will be used to create the dataset, and a list of ``heliostats`` must be provided:
```python
heliostat_dataset = PaintCalibrationDataset.from_heliostats(
    heliostats=heliostats,
    root_dir=heliostat_root_dir,
    item_type=args.item_type,
    download=True,
)
```
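All three construction paths yield map-style datasets, i.e., objects implementing ``__len__`` and ``__getitem__``,
which is what allows them to be wrapped in a ``torch.utils.data.DataLoader`` for batched training. A dependency-free
stand-in illustrating this protocol (not the actual ``PaintCalibrationDataset``):

```python
class MiniCalibrationDataset:
    """Toy map-style dataset mimicking the interface a torch.Dataset exposes."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        # Number of calibration items available.
        return len(self.items)

    def __getitem__(self, index):
        # Return the item at the given index.
        return self.items[index]

dataset = MiniCalibrationDataset(["calib_item_a", "calib_item_b", "calib_item_c"])
print(len(dataset), dataset[1])  # 3 calib_item_b
```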

### Example usage: Full dataset workflow
The [full dataset workflow](scripts/example_benchmark_dataset_full_workflow.py) provides an executable example of how
``PAINT`` calibration data could be directly used in further applications. When executed, the script will download the
necessary metadata to generate benchmark splits (if this data is not already downloaded), generate the dataset benchmark
splits, and initialize a ``torch.Dataset`` based on these splits (downloading the data if necessary). The script takes
the following arguments:
- ``metadata_input`` - Path to the file containing the metadata required to generate the dataset splits.
- ``output_dir`` - Root directory to save all outputs and data.
- ``split_type`` - The benchmark dataset split type to apply.
- ``train_size`` - The number of training samples required per heliostat - the total training size depends on the number
of heliostats.
- ``val_size`` - The number of validation samples required per heliostat - the total validation size depends on the
number of heliostats.
- ``remove_unused_data`` - Whether to remove metadata that is not required to load benchmark splits, but may be useful
for plots or data inspection.
- ``item_type`` - The type of calibration item to be loaded, i.e., raw image, cropped image, flux image, flux-centered
image, or calibration properties.
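The control flow described above can be summarized in a small sketch; the function and step names here are
illustrative and do not reflect the script's actual API:

```python
import os

def full_workflow(metadata_input, output_dir, split_type, train_size, val_size):
    # Illustrative orchestration of the three stages: fetch metadata, split, build datasets.
    steps = []
    if not os.path.exists(metadata_input):
        steps.append("download metadata")  # only if not already downloaded
    steps.append(
        f"generate {split_type} splits (train={train_size}, val={val_size} per heliostat)"
    )
    steps.append("initialize torch datasets from the splits")
    return steps

print(full_workflow("missing_metadata.csv", "outputs", "azimuth", 10, 30))
```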

Feel free to execute the script and have some fun :rocket:!

## How to contribute
Check out our [contribution guidelines](CONTRIBUTING.md) if you are interested in contributing to the `PAINT` project :fire:.
Please also carefully check our [code of conduct](CODE_OF_CONDUCT.md) :blue_heart:.