Merge branch 'main' into specs-2024.8.0
ValentinaHutter committed Aug 12, 2024
2 parents 3e9f6aa + f3cd4dd commit 09871f0
Showing 41 changed files with 1,432 additions and 139 deletions.
46 changes: 46 additions & 0 deletions .github/CODE_OF_CONDUCT.md
@@ -0,0 +1,46 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at support@eodc.eu. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [http://contributor-covenant.org/version/1/4][version]

[homepage]: http://contributor-covenant.org
[version]: http://contributor-covenant.org/version/1/4/
137 changes: 137 additions & 0 deletions .github/CONTRIBUTING.md
@@ -0,0 +1,137 @@
# Welcome to openeo-processes-dask

Thank you for your efforts in contributing to our project! Any contribution you make, e.g. bug reports, bug fixes, additional documentation, enhancement suggestions, or other ideas, is welcome and will be reflected on [openeo-processes-dask](https://github.com/Open-EO/openeo-processes-dask).

If you have any questions, do not hesitate to ask by opening an issue or contacting us directly. We also have a biweekly online meeting where PRs and issues are discussed. Feel free to join! Contact: support@eodc.eu

Our aim is to work on this project together - everyone is welcome to contribute. Please follow our [Code of Conduct](./CODE_OF_CONDUCT.md).

On this page, we will guide you through the contribution workflow from opening an issue and creating a PR to reviewing and merging the PR.


### Getting started

To get an overview of the project, read the [README](../README.md) file. The process implementations are based on the openEO [specification](https://processes.openeo.org/). The aim of this project is to offer implementations of all listed processes based on the [xarray](https://github.com/pydata/xarray)/[dask](https://github.com/dask/dask) ecosystem.

To get a general introduction to openEO, see:
- [openEO docs](https://docs.openeo.cloud/)
- [openEO API](https://api.openeo.org/)
- [openEO registration](https://docs.openeo.cloud/join/free_trial.html#connect-with-egi-check-in)
- [openEO official website](https://openeo.cloud/)


### Issues and bugs

#### Create a new issue

Reporting bugs is an important part of improving the project. If you find any unclear documentation, unexpected behaviour in the implementation, missing features, etc., first [check if an issue already exists](https://github.com/Open-EO/openeo-processes-dask/issues). If a related issue doesn't exist, you can open a new one.

#### Create a bug report

A bug report should always contain Python code to recreate the behaviour. This can be formatted nicely using ` ```python ... ``` `. Add an explanation of which parts are unexpected. If the issue is related to a certain process, also have a look at the process specification to check what kind of results should be produced, which parameters are required, which error messages should be raised, etc.
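
For example, a minimal, self-contained reproduction might look like the sketch below (the imported process, its signature, and the expected value are placeholders - adapt them to the process you are actually reporting):

```python
# Minimal reproduction sketch for a bug report.
# NOTE: the import path, the process and the expected value are placeholders -
# replace them with the process you are actually reporting.
import numpy as np

from openeo_processes_dask.process_implementations.math import mean  # placeholder

data = np.array([1.0, np.nan, 3.0])

# Expected (per https://processes.openeo.org/#mean with ignore_nodata=True): 2.0
# Observed: nan
print(mean(data, ignore_nodata=True))
```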

#### Solve an issue

Scan through our [existing issues](https://github.com/Open-EO/openeo-processes-dask/issues) to find one that interests you. We normally do not assign issues to anyone. If you find an interesting issue to work on, you are welcome to open a PR with a fix.

### Make Changes

#### Version control, git and github

To make changes to the code, you will need a free [github](https://github.com/) account. The code is available on github, where we use [git](https://git-scm.com/) for version control.

#### Make changes locally

1. Create an account and log in to [github](https://github.com/).
2. Fork the repository. Go to the [project](https://github.com/Open-EO/openeo-processes-dask) and click the `Fork` button on the top of the page.
3. Clone the fork of the repository to your local machine.
```
git clone https://github.com/<YOUR USER NAME>/openeo-processes-dask.git
cd openeo-processes-dask
```
4. Set up the development environment using the instructions in the [README](../README.md).
```
poetry install --all-extras
```
5. Create a new branch
```
git checkout -b new-branch-name
```

Once this is set up, you can start making code changes.

### Commit your update

You can check which files contain changes using `git status`.

Before you commit your changes, make sure the tests pass and update them if required. It is recommended to cover all your changes with tests. Once you submit your changes, github will automatically check whether the new lines of code are covered by the tests.
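
As a rough idea of what such a test could look like (the imported process is a placeholder here; the existing modules under `tests/` show the actual conventions):

```python
# Illustrative test sketch - the imported process is a placeholder,
# see the existing tests in this repository for the actual conventions.
import numpy as np
import xarray as xr

from openeo_processes_dask.process_implementations.math import add  # placeholder


def test_add_scalar_to_datacube():
    cube = xr.DataArray(np.ones((2, 3)), dims=("x", "y"))
    result = add(x=cube, y=1)
    assert (result == 2).all()
```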

If you made complex changes, it is helpful to also include comments next to your code to document the changes for reviewers and other contributors.

We use pre-commit hooks to keep a consistent structure and formatting. See [pre-commit](https://pre-commit.com/) and [git hooks](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks).

Once your changes are ready, you can commit and push them to github.
1. To commit all modified files into the local copy of your repo, do `git commit -am 'A commit message'`.
2. To push the changes up to your forked repo on GitHub, do a `git push`.

### Pull Request

Once you’re ready or need feedback on your code, open a Pull Request, also known as a PR, on the github project page.
- Don't forget to [link the PR to an issue](https://github.com/Open-EO/openeo-processes-dask/issues) if you are solving one.
- Once you submit your PR, an openeo-processes-dask team member will review your changes. We may ask questions or request additional information.
- We may ask for changes to be made before a PR can be merged.
- As you update your PR and apply changes, mark each conversation as [resolved](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/commenting-on-a-pull-request#resolving-conversations).
- If you run into any merge issues, check out this [git tutorial](https://github.com/skills/resolve-merge-conflicts) to help you resolve merge conflicts and other issues.

You can comment directly in the PR if you want the reviewer to pay particular attention to something. If your request is not ready to be merged, just say so in your pull request message and create a "Draft PR". This way, you can get a preliminary code review and others can see what is currently being worked on.

## New releases

Once the reviewer(s) approve the PR and there are no more changes requested, the PR will be merged into the main branch. A new release will then be created based on the main branch.

There will be a new release at least every two weeks. (If there were no changes at all, the release might be skipped.)

If important changes are added - such as bug fixes and additional processes - new releases might be made in between.

Small changes - such as new comments, updated documentation - will be included in the bi-weekly releases.

## Adding new processes

If you only want to update implementation details in a process, the specification should remain as it is and you do not need to update the submodule.

If you want to add a new process or update the parameters of the process, you will also need to interact with the submodule.

The specifications come from a fork of the official openeo-processes: https://github.com/eodcgmbh/openeo-processes

To add a new process:
- add the specification to https://github.com/eodcgmbh/openeo-processes
- create a github fork
- check if the process you want to add is in the missing-processes folder and if so, move it to the root folder
- if not, create a new process definition
- create a PR and merge it
- update the submodule in openeo-processes-dask
- create a github fork of this repository
- Use `git submodule init` and `git submodule update` in your forked repository to update the specifications
- To specify the submodule explicitly, you can use
`git submodule update --remote openeo_processes_dask/specs/openeo-processes/`
- find more details on submodules [here](https://git-scm.com/book/en/v2/Git-Tools-Submodules)
- add the implementation in openeo-processes-dask
- cover the new implementation in the tests
- update the dependencies if you need to introduce a new package: `poetry add ...`.
- create a PR to merge your fork into openeo-processes-dask

New implementations can be tested using local [client-side processing](https://open-eo.github.io/openeo-python-client/cookbook/localprocessing.html). This allows testing processes without a connection to an openEO back-end, on a user's local netCDF, GeoTIFF, or Zarr files, or on remote STAC Collections/Items.
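
For instance, a local test run could look roughly like the sketch below (the STAC URL, extents, and band names are placeholders; the linked cookbook is the authoritative reference):

```python
# Sketch of client-side processing with the openEO Python client (placeholder values).
from openeo.local import LocalConnection

local_conn = LocalConnection("./")  # base path for local netCDF/GeoTIFF/ZARR files

datacube = local_conn.load_stac(
    url="https://example.com/collections/some-collection",  # placeholder STAC collection
    spatial_extent={"west": 11.2, "east": 11.3, "south": 46.4, "north": 46.5},
    temporal_extent=["2023-06-01", "2023-06-30"],
    bands=["red", "nir"],
)

result = datacube.execute()  # executes the process graph locally via openeo-processes-dask
print(result)
```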

For backend development, the specifications and implementations can be used to create a process registry, e.g. https://github.com/Open-EO/openeo-pg-parser-networkx/blob/main/examples/01_minibackend_demo.ipynb
```python
from openeo_processes_dask.specs import load_collection as load_collection_spec
process_registry["load_collection"] = Process(spec=load_collection_spec, implementation=load_collection)
```
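
A slightly fuller sketch of the registry setup, following the pattern from the linked demo notebook (the import paths are assumptions and may need adjusting; `load_collection` here is a stand-in for whatever loader your backend provides):

```python
# Sketch of a minimal process registry; import paths follow the linked demo notebook.
from openeo_pg_parser_networkx import ProcessRegistry
from openeo_pg_parser_networkx.process_registry import Process

from openeo_processes_dask.process_implementations.core import process
from openeo_processes_dask.specs import load_collection as load_collection_spec


def load_collection(id=None, spatial_extent=None, temporal_extent=None, bands=None, **kwargs):
    """Backend-specific data loading - replace this stub with your own implementation."""
    raise NotImplementedError


# wrap_funcs=[process] resolves named arguments the way openEO process graphs expect.
process_registry = ProcessRegistry(wrap_funcs=[process])
process_registry["load_collection"] = Process(
    spec=load_collection_spec, implementation=load_collection
)
```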

## Prior to submitting a PR - a checklist

- Add comments and documentation for your code
- Make sure the existing tests still pass and add new tests covering your changes.
- Format your code nicely - run `poetry run pre-commit install` and `pre-commit run --all-files`.
- Add a descriptive comment to your commit and push your code to [github](https://github.com/Open-EO/openeo-processes-dask).
- Create a PR with a descriptive title for your changes.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -23,7 +23,7 @@ jobs:
strategy:
matrix:
os: [Ubuntu]
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.10", "3.11"]
include:
- os: Ubuntu
image: ubuntu-22.04
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -1,5 +1,7 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
default_language_version:
python: python3
repos:
- repo: https://github.com/asottile/pyupgrade
rev: v3.10.1
5 changes: 5 additions & 0 deletions README.md
@@ -28,6 +28,11 @@ A subset of process implementations with heavy or unstable dependencies are hidd
## Development environment
openeo-processes-dask requires poetry >1.2, see their [docs](https://python-poetry.org/docs/#installation) for installation instructions.

Clone the repository with `--recurse-submodules` to also fetch the process specs:
```
git clone --recurse-submodules git@github.com:Open-EO/openeo-processes-dask.git
```

To setup the python venv and install this project into it run:
```
poetry install --all-extras
3 changes: 3 additions & 0 deletions docs/scalability/README.md
@@ -0,0 +1,3 @@
# Document issues with scalability

Edge cases that cannot be handled within `openeo-processes-dask`, but that we want to document.
70 changes: 70 additions & 0 deletions docs/scalability/aggregate-large-spatial-extents.md
@@ -0,0 +1,70 @@
# Aggregate data over large spatial extents

Date: 2024-02-26

## Context

https://github.com/Open-EO/openeo-processes-dask/issues/124

In a recent use case, the `aggregate_spatial` process was applied to both Sentinel 1 and Sentinel 2 data to generate `vector-cubes` with polygons from [here](https://github.com/openEOPlatform/SRR3_notebooks/blob/main/notebooks/resources/UC8/vector_data/target_canopy_cover_60m_WGS84/target_canopy_cover_WGS84_60m.geojson). As `process graphs` are executed node after node, we would first use `load_collection` to load the data over the total bounds of the polygons and hand the data to `aggregate_spatial` afterwards.
With the total bounds of all polygons being set to `{'west': 3, 'east': 18, 'south': 43, 'north': 51}`, loading the data led to the following situations:

- We were able to load the data, for short temporal extents:
<figure>
<img src="./data.png" alt="Datacube">
<figcaption>Figure 1: Lazy dask array for Sentinel 1 datacube. Note the size of data: 3.73 TiB</figcaption>
</figure>

- When the temporal interval was increased, the amount of data became too large for dask and the corresponding error was raised.
<figure>
<img src="./data_crash.png" alt="Errormessage">
</figure>

It is not trivial to solve this in dask, as
- the error means that one chunk cannot hold the amount of data that is supposed to be in it.
- setting a smaller chunk size means more and more chunks are generated, and the dask task graph can easily become too large itself.
- increasing the dask memory might solve the problem for one specific dataset, but it might occur again if a different dataset with a higher spatial or temporal resolution is used afterwards.

This means that running through the `process graph` raises an error before `aggregate_spatial` can even be executed.
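
For intuition, a rough back-of-the-envelope estimate shows why a cube over these total bounds quickly reaches terabyte scale (the resolution, band count, number of timesteps, and dtype below are assumptions, not the exact parameters of the use case):

```python
# Rough estimate of the datacube size over the total bounds (all numbers are assumptions).
import math

west, east, south, north = 3, 18, 43, 51   # total bounds of the polygons, in degrees
resolution_m = 10                          # assumed pixel size
bands = 2                                  # assumed number of bands
timesteps = 60                             # assumed number of acquisitions
bytes_per_value = 4                        # float32

metres_per_deg_lat = 111_320
metres_per_deg_lon = 111_320 * math.cos(math.radians((south + north) / 2))

width_px = (east - west) * metres_per_deg_lon / resolution_m
height_px = (north - south) * metres_per_deg_lat / resolution_m

size_tib = width_px * height_px * bands * timesteps * bytes_per_value / 2**40
print(f"{width_px:,.0f} x {height_px:,.0f} pixels, ~{size_tib:.1f} TiB")
```

Even with these modest assumptions the cube is several tebibytes, far more than a single worker can hold per chunk.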

## Possible solutions

There might be more than one solution to this.

Here is a first attempt at solving this.

In `openeo-processes-dask`, a general implementation of `aggregate_spatial` is available that can be run as long as the input data can be read by dask and the error described above does not occur. If the error is raised, the execution of the processes or the process implementations might need adaptations.

### Handle data loading within aggregate_spatial

Here is pseudo-code for another `aggregate_spatial` implementation:

<CodeSwitcher>
<template v-slot:py>

```python
def aggregate_spatial(data, geometries, reducer):
    vector_cube = []

    # Group nearby geometries so that each group only spans a small bounding box.
    groups = group_geometries(geometries)

    for group in groups:
        # Load only the data covered by this group's bounds and apply any
        # processes that come before aggregate_spatial in the process graph.
        small_data = load_collection(group.bounds)
        small_data = apply_processes(small_data)

        # Reduce the data over each geometry in the group.
        for geometry in group:
            polygon_data = aggregate(small_data, geometry, reducer)
            vector_cube.append(polygon_data)

    return vector_cube
```

Remarks:
- you might want to group your geometries based on how close they are to each other (see the sketch after this list)
- handling data loading inside of the aggregate_spatial process can be very backend specific, depending on how you define the load_collection process and on how you execute process graphs - you might need to delay the data loading in load_collection if the process is executed before aggregate_spatial.
- when you load the data in aggregate_spatial, you might also need to apply processes to it that appear in the process_graph and should be executed prior to aggregate_spatial.
- if you group your geometries in the beginning, the order of the resulting vector_cube might change, so you might want to sort the data in a final step.
- the code seems rather basic, as it makes use of a simple for-loop. We tried to avoid this, but ran into several issues with the other approaches we had. (Trying to handle this in dask might make the task graph too large. Trying to use sparse arrays can be very tricky when geometries are distributed all over Europe and do not have a native order.)
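
As a concrete starting point for the grouping step, here is a minimal sketch of a `group_geometries` helper that buckets polygons into coarse grid cells (it assumes geopandas and WGS84 coordinates; the grid-based strategy and the cell size are just one option):

```python
# Minimal sketch of grouping geometries by proximity (assumes geopandas, WGS84 coords).
import geopandas as gpd


def group_geometries(geometries: gpd.GeoDataFrame, cell_size_deg: float = 1.0) -> list:
    """Bucket geometries into coarse lon/lat grid cells so that each group
    covers a small bounding box that can be loaded independently."""
    centroids = geometries.geometry.centroid
    col = (centroids.x // cell_size_deg).astype(int)
    row = (centroids.y // cell_size_deg).astype(int)
    # Each (col, row) pair identifies one grid cell; geometries in the same cell form a group.
    return [group for _, group in geometries.groupby([col, row])]
```

Each group then exposes a small bounding box (e.g. via `total_bounds`) that the loop in the pseudo-code above can load independently.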
Binary file added docs/scalability/data.png
Binary file added docs/scalability/data_crash.png
13 changes: 7 additions & 6 deletions openeo_processes_dask/process_implementations/__init__.py
@@ -5,6 +5,7 @@
from .arrays import *
from .comparison import *
from .cubes import *
from .inspect import *
from .logic import *
from .math import *

@@ -15,12 +16,12 @@
"Did not load machine learning processes due to missing dependencies: Install them like this: `pip install openeo-processes-dask[implementations, ml]`"
)

try:
from .experimental import *
except ImportError as e:
logger.warning(
"Did not experimental processes due to missing dependencies: Install them like this: `pip install openeo-processes-dask[implementations, experimental]`"
)
# try:
# from .experimental import *
# except ImportError as e:
# logger.warning(
# "Did not experimental processes due to missing dependencies: Install them like this: `pip install openeo-processes-dask[implementations, experimental]`"
# )

import rioxarray as rio # Required for the .rio accessor on xarrays.
