Add stac-arrow #256
Indeed. See apache/arrow-rs#5318. I believe serde-json support will be removed in a future arrow-rs release.
I did a quick-and-dirty benchmark (stac-arrow/benches/read.rs) and it looks like it's worth it to use/port the deprecated
If there's anything in geoarrow-rs that you'd like to use that I can help with, let me know.
@kylebarron I think I've got a working start. There are obviously some (many) edge cases in the geoarrow/geoparquet writing that I'll need to handle later, but for now stac-geoparquet can read what I write, so it's a start. Some simple benchmarking against 10k sentinel-2 items indicates we might get some performance gains (though I'm not thinking about compression yet, so it could be a bad comparison):

    $ hyperfine --warmup 3 'target/release/stac convert ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet' 'python ~/Desktop/to_stac_geoparquet.py'
    Benchmark 1: target/release/stac convert ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet
      Time (mean ± σ):     1.322 s ± 0.051 s    [User: 0.971 s, System: 0.246 s]
      Range (min … max):   1.265 s … 1.410 s    10 runs
    Benchmark 2: python ~/Desktop/to_stac_geoparquet.py
      Time (mean ± σ):     2.374 s ± 0.061 s    [User: 2.111 s, System: 0.542 s]
      Range (min … max):   2.317 s … 2.490 s    10 runs
    Summary
      target/release/stac convert ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet ran
        1.80 ± 0.08 times faster than python ~/Desktop/to_stac_geoparquet.py

Query:

    target/release/stac search https://earth-search.aws.element84.com/v1 \
        -c sentinel-2-c1-l2a \
        --max-items 10000 \
        --sortby='-properties.datetime' \
        --intersects '{"type":"Point","coordinates":[-105.1019,40.1672]}' > ~/Desktop/longmont.json

Benchmark script:

    import json
    import stac_geoparquet.arrow
    from pyarrow import Table

    with open("/Users/gadomski/Desktop/longmont.json") as f:
        items = json.load(f)["features"]
    table = Table.from_batches(stac_geoparquet.arrow.parse_stac_items_to_arrow(items))
    stac_geoparquet.arrow.to_parquet(table, "/Users/gadomski/Desktop/longmont-py.parquet")
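For anyone wanting to sanity-check the "1.80 ± 0.08" summary figure, here is a small sketch that recomputes it from the two reported means and standard deviations. It assumes standard first-order error propagation for a ratio of independent measurements; the function name `speedup` is made up for illustration, and nothing here is taken from hyperfine's source.

```python
import math


def speedup(mean_slow, sd_slow, mean_fast, sd_fast):
    """Return (ratio, uncertainty) for mean_slow / mean_fast.

    Assumption: the two timings are independent, so relative
    errors add in quadrature (first-order propagation).
    """
    ratio = mean_slow / mean_fast
    relative_error = math.sqrt(
        (sd_slow / mean_slow) ** 2 + (sd_fast / mean_fast) ** 2
    )
    return ratio, ratio * relative_error


# Means and standard deviations from the hyperfine output above.
ratio, sigma = speedup(2.374, 0.061, 1.322, 0.051)
print(f"{ratio:.2f} ± {sigma:.2f}")  # 1.80 ± 0.08
```

Under that assumption the numbers match the reported summary, so the 1.8x figure is just the ratio of the mean wall-clock times, with the spread of both runs folded into the uncertainty.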
I'm a little surprised the Rust version isn't more than 1.8x faster. I suppose on the Python side once the JSON is converted to Arrow, it's fully compiled code anyways.
Ah, also this is the most efficient way to convert JSON to Arrow (though it uses the most memory) because it can pass those dicts directly to pyarrow, which then presumably is fully compiled in its conversion to Arrow
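To make the "pass those dicts directly" point concrete, here is a rough pure-Python sketch of the row-to-column pivot that has to happen somewhere when a list of item dicts becomes a columnar table. This is only an illustration of the idea: `rows_to_columns` and the sample items are hypothetical, and the real conversion in pyarrow happens in compiled code rather than in a Python loop like this.

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of equal-length column lists.

    Keys missing from a row become None, similar in spirit to a null
    entry in a columnar table.
    """
    # Collect keys in first-seen order so the column order is stable.
    keys = []
    for row in rows:
        for key in row:
            if key not in keys:
                keys.append(key)
    return {key: [row.get(key) for row in rows] for key in keys}


# Hypothetical, heavily simplified items (real STAC items are nested).
items = [
    {"id": "a", "cloud_cover": 10.5},
    {"id": "b"},
]
columns = rows_to_columns(items)
print(columns)
```

The memory cost mentioned above follows from this shape: all of the row dicts have to exist in memory at once before the columns can be built.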
A couple notes on this if you want to make Python bindings. I've been (slowly) making progress on https://github.com/kylebarron/arro3, a core library that aims to make it easier to make Arrow-based Python packages from Rust. You can see a high-level example in https://github.com/kylebarron/arro3/blob/fc47f7c624d753947d1921086a7714512c1d8bbe/arro3-compute/src/concat.rs, where you can just take You can also export a stream of record batches as a Python |
The goal for geoarrow-rs' python bindings for 0.3 is for it to exclusively use |
Marking this as draft until geoarrow-rs goes v0.3.
sounds like you've seen some breaking changes 🫣
I was finding that I needed to be on the v0.3-alpha release, or a specific SHA, to get the stuff I needed. No biggie :-)
I've also been making a decent number of changes on the Rust side, like 3D support
Superseded by #287
Description
Read and write stac-geoparquet.
Checklist
cargo fmt
cargo test