Add stac-geoparquet and stac-arrow #287

Merged
merged 1 commit into from
Aug 8, 2024


gadomski
Member

@gadomski gadomski commented Aug 7, 2024

Closes

Description

A restart of the arrow/parquet work. geoarrow-rs seems to be in a bit of flux (especially w.r.t. 2 vs. 3 dimensions), so we're pinned to a sorta-random SHA there for now.

Checklist

  • Unit tests
  • Documentation, including doctests
  • Git history is linear
  • Commit messages are descriptive
  • (optional) Git commit messages follow conventional commits
  • Code is formatted (cargo fmt)
  • cargo test
  • Changes are added to the CHANGELOG

@gadomski gadomski self-assigned this Aug 7, 2024
@gadomski gadomski force-pushed the stac-geoparquet branch 9 times, most recently from aab7827 to 49bf5ba August 8, 2024 18:40
@gadomski gadomski marked this pull request as ready for review August 8, 2024 18:40
@gadomski
Member Author

gadomski commented Aug 8, 2024

@kylebarron redirecting your attention here (if you want). Pending a couple of fixups, I've got a working to/from stac-geoparquet converter:

cargo install stac-cli
stac translate in.json out.parquet
stac translate out.parquet back.json

And in Rust:

let file = std::fs::File::open("data.parquet").unwrap();
let item_collection = stac_geoparquet::from_reader(file).unwrap();
let file = std::fs::File::create("out.parquet").unwrap();
stac_geoparquet::to_writer(file, item_collection.into()).unwrap();
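Both `stac translate` calls above take only an input and an output path, so the format is presumably inferred from the paths. A minimal sketch of one way to do that from the file extension follows; the `Format` enum and `infer_format` function are hypothetical names made up for illustration, not the stac-cli API, and the real CLI may detect formats differently.

```rust
use std::path::Path;

/// Hypothetical format tag for illustration; not a stac-cli type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Json,
    Geoparquet,
}

/// Guess the format from the file extension, case-insensitively.
/// Returns `None` when there is no extension or it is not recognized.
fn infer_format(path: &Path) -> Option<Format> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "json" | "geojson" => Some(Format::Json),
        "parquet" | "geoparquet" => Some(Format::Geoparquet),
        _ => None,
    }
}

fn main() {
    for name in ["in.json", "out.parquet", "notes.txt"] {
        println!("{name}: {:?}", infer_format(Path::new(name)));
    }
}
```

Returning `Option` rather than defaulting to JSON leaves it to the caller to decide what to do with an unrecognized extension, for example falling back to an explicit format flag.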

Rough benchmarking (same procedure as in #256 (comment)) indicates that we're ~1.56x faster than the simplest Python converter (about 36% less wall-clock time):

$ hyperfine --warmup 3 'target/release/stac translate ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet' 'python benchmark.py'
Benchmark 1: target/release/stac translate ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet
  Time (mean ± σ):      1.114 s ±  0.020 s    [User: 0.918 s, System: 0.177 s]
  Range (min … max):    1.095 s …  1.157 s    10 runs
 
Benchmark 2: python benchmark.py
  Time (mean ± σ):      1.736 s ±  0.086 s    [User: 1.475 s, System: 0.212 s]
  Range (min … max):    1.657 s …  1.925 s    10 runs
 
Summary
  target/release/stac translate ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet ran
    1.56 ± 0.08 times faster than python benchmark.py
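For reference, the summary figure can be recomputed from the two means and standard deviations reported above. This is a small sketch, assuming standard first-order error propagation for a ratio of independent measurements, which reproduces the reported 1.56 ± 0.08; the function names are made up for illustration.

```rust
/// Ratio of mean runtimes: how many times faster the quicker command is.
fn speedup(slow_mean: f64, fast_mean: f64) -> f64 {
    slow_mean / fast_mean
}

/// First-order propagated uncertainty of the ratio, assuming the two
/// benchmarks are independent: relative errors add in quadrature.
fn speedup_sigma(slow_mean: f64, slow_sigma: f64, fast_mean: f64, fast_sigma: f64) -> f64 {
    let rel = ((slow_sigma / slow_mean).powi(2) + (fast_sigma / fast_mean).powi(2)).sqrt();
    speedup(slow_mean, fast_mean) * rel
}

/// Fraction of wall-clock time saved by the quicker command.
fn time_saved(slow_mean: f64, fast_mean: f64) -> f64 {
    1.0 - fast_mean / slow_mean
}

fn main() {
    // Means and standard deviations (seconds) from the hyperfine output above.
    let (py_mean, py_sigma) = (1.736, 0.086);
    let (rs_mean, rs_sigma) = (1.114, 0.020);

    println!(
        "speedup: {:.2} ± {:.2}",
        speedup(py_mean, rs_mean),
        speedup_sigma(py_mean, py_sigma, rs_mean, rs_sigma)
    );
    println!("time saved: {:.0}%", 100.0 * time_saved(py_mean, rs_mean));
}
```

The two views describe the same measurement: a speedup of about 1.56x corresponds to roughly 36% less wall-clock time.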

Caveats:

  • Things are pretty naive right now, so I'm sure there's more juice to squeeze out of the performance
  • No Python bindings yet but I'd like to create some of those too, to potentially provide an alternative interface there
  • I'm keeping my crate versions to v0.0.x for now, until geoarrow-rs has its next release (the current main was breaking for me due to new dimension stuff).

@kylebarron
Collaborator

Did you hit specific issues with latest main of geoarrow-rs, or just that it requires some changes to your code?

@gadomski
Member Author

gadomski commented Aug 8, 2024

> Did you hit specific issues with latest main of geoarrow-rs, or just that it requires some changes to your code?

Yeah, I hadn't opened an issue yet, but it looks like the Dimension hash set isn't being updated in https://github.com/geoarrow/geoarrow-rs/blob/c927f1281bf9de028687733c5013e5246faef4e5/src/datatypes.rs#L556-L668

stac-arrow/README.md (outdated, resolved)
@kylebarron
Collaborator

> Yeah, I hadn't opened an issue yet, but it looks like the Dimension hash set isn't being updated in geoarrow/geoarrow-rs@c927f12/src/datatypes.rs#L556-L668

Ah yes, good catch. If you can create an issue to track that, it would be helpful.

@gadomski
Member Author

gadomski commented Aug 8, 2024

Done, geoarrow/geoarrow-rs#682

Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
@gadomski gadomski enabled auto-merge (rebase) August 8, 2024 22:57
@gadomski gadomski merged commit 444c89c into main Aug 8, 2024
25 checks passed
@gadomski gadomski deleted the stac-geoparquet branch August 8, 2024 23:00
@gadomski gadomski mentioned this pull request Aug 8, 2024
8 tasks
Development

Successfully merging this pull request may close these issues.

Add stac-geoparquet reading and writing