
Add stac-arrow #256
Closed
wants to merge 1 commit into from

Conversation

@gadomski (Member) commented May 31, 2024

Description

Read and write stac-geoparquet.

Checklist

  • Unit tests
  • Documentation, including doctests
  • Git history is linear
  • Commit messages are descriptive
  • (optional) Git commit messages follow conventional commits
  • Code is formatted (cargo fmt)
  • cargo test
  • Changes are added to the CHANGELOG

@gadomski self-assigned this on May 31, 2024
@gadomski force-pushed the stac-arrow branch 2 times, most recently from ec95a95 to 0fda505 on May 31, 2024 18:39
@gadomski marked this pull request as draft on May 31, 2024 18:40
@gadomski changed the title from "Add stac-arrow" to "Add stac-arrow and stac-duckdb" on May 31, 2024
@gadomski added the "[crate] duckdb" label (stac-duckdb) on May 31, 2024
@kylebarron (Collaborator) commented:

> it looks like arrow-json is moving away from providing direct serde_json support (e.g. this deprecated function: docs.rs/arrow-json/51.0.0/arrow_json/writer/fn.record_batches_to_json_rows.html).

Indeed. See apache/arrow-rs#5318. I believe serde_json support will be removed in a future arrow-rs release.
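For context, here is a minimal sketch (not code from this PR) of the Writer-based path that replaces the deprecated helper when serde_json rows are still needed; it round-trips through JSON bytes, which is the extra cost benchmarked further down:

use arrow_array::RecordBatch;
use arrow_json::ArrayWriter;
use serde_json::{Map, Value};

// Serialize record batches with arrow-json's Writer API, then re-parse the bytes
// with serde_json to get row maps (replacing record_batches_to_json_rows).
fn batches_to_json_rows(
    batches: &[&RecordBatch],
) -> Result<Vec<Map<String, Value>>, Box<dyn std::error::Error>> {
    let mut writer = ArrayWriter::new(Vec::new());
    writer.write_batches(batches)?;
    writer.finish()?;
    Ok(serde_json::from_slice(&writer.into_inner())?)
}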

@gadomski force-pushed the stac-arrow branch 4 times, most recently from eb6b6ff to ccaa98a on June 5, 2024 11:58
@gadomski changed the title from "Add stac-arrow and stac-duckdb" to "Add stac-arrow" on Jun 5, 2024
@gadomski removed the "[crate] duckdb" label (stac-duckdb) on Jun 5, 2024
@gadomski force-pushed the stac-arrow branch 2 times, most recently from 636ac2a to 189436c on June 5, 2024 12:38
@gadomski (Member, Author) commented Jun 5, 2024

I did a quick-and-dirty benchmark (stac-arrow/benches/read.rs), and it looks like it's worth using (or porting) the deprecated record_batches_to_json_rows for the "need to go through serde_json" case:

read/record_batches_to_json_rows    time:   [170.61 µs 174.72 µs 178.93 µs]
read/writer                         time:   [224.61 µs 231.71 µs 241.36 µs]
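
For illustration, a minimal criterion setup in the spirit of that benchmark might look like the following (this is not the actual stac-arrow/benches/read.rs; it assumes arrow-json 51, where the deprecated function still exists, and uses a made-up sample batch):

// benches/read_sketch.rs (the bench target needs `harness = false`).
#![allow(deprecated)]

use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray};
use arrow_json::{writer::record_batches_to_json_rows, ArrayWriter};
use criterion::{criterion_group, criterion_main, Criterion};
use serde_json::{Map, Value};

fn sample_batch() -> RecordBatch {
    let id: ArrayRef = Arc::new(StringArray::from(vec!["a", "b"]));
    let value: ArrayRef = Arc::new(Int32Array::from(vec![1, 2]));
    RecordBatch::try_from_iter(vec![("id", id), ("value", value)]).unwrap()
}

fn bench_read(c: &mut Criterion) {
    let batch = sample_batch();
    // Deprecated direct path: batches straight to serde_json rows.
    c.bench_function("read/record_batches_to_json_rows", |b| {
        b.iter(|| record_batches_to_json_rows(&[&batch]).unwrap())
    });
    // Writer path: serialize to JSON bytes, then re-parse with serde_json.
    c.bench_function("read/writer", |b| {
        b.iter(|| {
            let mut writer = ArrayWriter::new(Vec::new());
            writer.write_batches(&[&batch]).unwrap();
            writer.finish().unwrap();
            let rows: Vec<Map<String, Value>> =
                serde_json::from_slice(&writer.into_inner()).unwrap();
            rows
        })
    });
}

criterion_group!(benches, bench_read);
criterion_main!(benches);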

@gadomski mentioned this pull request on Jun 5, 2024
@kylebarron (Collaborator) commented:

If there's anything in geoarrow-rs that you'd like to use that I can help with, let me know.

@gadomski force-pushed the stac-arrow branch 7 times, most recently from 0710102 to 31ca4a2 on June 6, 2024 22:24
@gadomski (Member, Author) commented Jun 6, 2024

@kylebarron I think I've got a working start. There are obviously many edge cases in the geoarrow/geoparquet writing that I'll need to handle later, but for now stac-geoparquet can read what I write, so it's a start.
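
As an aside, a generic sanity check that the written file reads back as Arrow record batches could look like this (it uses the parquet crate directly and does not exercise the stac-geoparquet Python reader; the path is the one from the benchmark below):

use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the stac-geoparquet file written by `stac convert` and stream it back
    // as Arrow record batches, counting rows as a cheap round-trip check.
    let file = File::open("/Users/gadomski/Desktop/longmont-rs.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    let mut rows = 0;
    for batch in reader {
        rows += batch?.num_rows();
    }
    println!("read {rows} rows");
    Ok(())
}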

Some simple benchmarking against 10k Sentinel-2 items suggests we might get some performance gains (though I'm not accounting for compression yet, so it could be an unfair comparison):

$ hyperfine --warmup 3 'target/release/stac convert ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet' 'python ~/Desktop/to_stac_geoparquet.py'
Benchmark 1: target/release/stac convert ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet
  Time (mean ± σ):      1.322 s ±  0.051 s    [User: 0.971 s, System: 0.246 s]
  Range (min … max):    1.265 s …  1.410 s    10 runs
 
Benchmark 2: python ~/Desktop/to_stac_geoparquet.py
  Time (mean ± σ):      2.374 s ±  0.061 s    [User: 2.111 s, System: 0.542 s]
  Range (min … max):    2.317 s …  2.490 s    10 runs
 
Summary
  target/release/stac convert ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet ran
    1.80 ± 0.08 times faster than python ~/Desktop/to_stac_geoparquet.py

Query:

target/release/stac search https://earth-search.aws.element84.com/v1 \
    -c sentinel-2-c1-l2a \
    --max-items 10000 \
    --sortby='-properties.datetime' \
    --intersects '{"type":"Point","coordinates":[-105.1019,40.1672]}' > ~/Desktop/longmont.json

Benchmark script:

import json

import stac_geoparquet.arrow
from pyarrow import Table

with open("/Users/gadomski/Desktop/longmont.json") as f:
    items = json.load(f)["features"]
table = Table.from_batches(stac_geoparquet.arrow.parse_stac_items_to_arrow(items))
stac_geoparquet.arrow.to_parquet(table, "/Users/gadomski/Desktop/longmont-py.parquet")

@gadomski force-pushed the stac-arrow branch 7 times, most recently from 5b32adb to 703d0aa on June 7, 2024 16:03
scripts/requirements.in (review thread; outdated, resolved)
@kylebarron (Collaborator) commented:

> Some simple benchmarking against 10k Sentinel-2 items suggests we might get some performance gains (though I'm not accounting for compression yet, so it could be an unfair comparison):

I'm a little surprised the Rust version isn't more than 1.8x faster. I suppose on the Python side, once the JSON is converted to Arrow, it's all compiled code anyway.

@kylebarron (Collaborator) commented Jun 14, 2024

Ah, also: this is the most efficient way to convert JSON to Arrow (though it uses the most memory), because it can pass those dicts directly to pyarrow, whose conversion to Arrow is presumably fully compiled:

table = Table.from_batches(stac_geoparquet.arrow.parse_stac_items_to_arrow(items))

@kylebarron (Collaborator) commented Jun 25, 2024

A couple of notes on this if you want to make Python bindings. I've been (slowly) making progress on https://github.com/kylebarron/arro3, a core library that aims to make it easier to build Arrow-based Python packages from Rust. You can see a high-level example in https://github.com/kylebarron/arro3/blob/fc47f7c624d753947d1921086a7714512c1d8bbe/arro3-compute/src/concat.rs, where you can just take input: PyRecordBatchReader and access the same Arrow stream that parse_stac_items_to_arrow is generating. arro3 aims to handle all the Python-Rust FFI for you, so you can call input.into_reader() to get the Rust Box<dyn RecordBatchReader + Send>.

You can also export a stream of record batches as a Python RecordBatchReader using the to_python method. Note that this exports an arro3.core.RecordBatchReader, not a pyarrow.RecordBatchReader, but you can convert to a pyarrow RecordBatchReader with pyarrow.RecordBatchReader.from_stream(arro3.core.RecordBatchReader) at zero cost. This only requires a Python-side runtime dependency on arro3.core (which tries to be as small as possible, i.e. ~1 MB instead of pyarrow's ~120 MB).
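
A rough sketch of that pattern, with a hypothetical pass_through function; the pyo3_arrow method names (into_reader, to_python) are taken from the description above and may differ in current arro3 releases:

use arrow_array::RecordBatchReader;
use pyo3::prelude::*;
use pyo3_arrow::PyRecordBatchReader;

#[pyfunction]
fn pass_through(py: Python, input: PyRecordBatchReader) -> PyResult<PyObject> {
    // Access the underlying Rust stream handed over from Python.
    let reader: Box<dyn RecordBatchReader + Send> = input.into_reader()?;
    // ... read, transform, or wrap the record batches here ...
    // Export the stream back to Python as an arro3.core.RecordBatchReader.
    PyRecordBatchReader::new(reader).to_python(py)
}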

@kylebarron (Collaborator) commented:

The goal for geoarrow-rs's Python bindings for 0.3 is to exclusively use arro3.core objects and not define any new classes itself, so GeoTable will be replaced by returning an arro3.core.Table.

@gadomski (Member, Author) commented:

Marking this as draft until geoarrow-rs goes v0.3.

@kylebarron (Collaborator) commented:

> Marking this as draft until geoarrow-rs goes v0.3.

Sounds like you've seen some breaking changes 🫣

@gadomski (Member, Author) commented:

> Sounds like you've seen some breaking changes 🫣

I was finding that I needed to be on the v0.3-alpha release, or a specific SHA, to get the stuff I needed. No biggie :-)

@kylebarron (Collaborator) commented:

I've also been making a decent number of changes on the Rust side, like 3D support.

@gadomski mentioned this pull request on Aug 7, 2024
@gadomski (Member, Author) commented Aug 8, 2024

Superseded by #287

@gadomski closed this on Aug 8, 2024
@gadomski deleted the stac-arrow branch on August 8, 2024 23:08
Successfully merging this pull request may close these issues:
  • Add stac-geoparquet reading and writing

2 participants