
Add stac-arrow #256
Closed
wants to merge 1 commit into from

Conversation

@gadomski (Member) commented May 31, 2024

Description

Read and write stac-geoparquet.

Checklist

  • Unit tests
  • Documentation, including doctests
  • Git history is linear
  • Commit messages are descriptive
  • (optional) Git commit messages follow conventional commits
  • Code is formatted (cargo fmt)
  • cargo test
  • Changes are added to the CHANGELOG

@gadomski self-assigned this on May 31, 2024
@gadomski force-pushed the stac-arrow branch 2 times, most recently from ec95a95 to 0fda505 on May 31, 2024 18:39
@gadomski marked this pull request as draft on May 31, 2024 18:40
@gadomski changed the title from "Add stac-arrow" to "Add stac-arrow and stac-duckdb" on May 31, 2024
@gadomski added the "[crate] duckdb" label (stac-duckdb) on May 31, 2024
@kylebarron (Collaborator) commented:

> it looks like arrow-json is moving away from providing direct serde_json support (e.g. this deprecated function: docs.rs/arrow-json/51.0.0/arrow_json/writer/fn.record_batches_to_json_rows.html).

Indeed. See apache/arrow-rs#5318. I believe serde_json support will be removed in a future arrow-rs release.
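For context, here is a minimal sketch (not code from this PR) of the Writer-based path that replaces the deprecated helper when serde_json rows are still needed; it round-trips through JSON bytes, which is the extra cost benchmarked further down:

use arrow_array::RecordBatch;
use arrow_json::ArrayWriter;
use serde_json::{Map, Value};

// Serialize record batches with arrow-json's Writer API, then re-parse the bytes
// with serde_json to get row maps (replacing record_batches_to_json_rows).
fn batches_to_json_rows(
    batches: &[&RecordBatch],
) -> Result<Vec<Map<String, Value>>, Box<dyn std::error::Error>> {
    let mut writer = ArrayWriter::new(Vec::new());
    writer.write_batches(batches)?;
    writer.finish()?;
    Ok(serde_json::from_slice(&writer.into_inner())?)
}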

@gadomski force-pushed the stac-arrow branch 4 times, most recently from eb6b6ff to ccaa98a on June 5, 2024 11:58
@gadomski changed the title from "Add stac-arrow and stac-duckdb" to "Add stac-arrow" on Jun 5, 2024
@gadomski removed the "[crate] duckdb" label (stac-duckdb) on Jun 5, 2024
@gadomski force-pushed the stac-arrow branch 2 times, most recently from 636ac2a to 189436c on June 5, 2024 12:38
@gadomski (Member, Author) commented Jun 5, 2024

I did a quick-and-dirty benchmark (stac-arrow/benches/read.rs), and it looks like it's worth using (or porting) the deprecated record_batches_to_json_rows for the "need to go through serde_json" case:

read/record_batches_to_json_rows    time:   [170.61 µs 174.72 µs 178.93 µs]
read/writer                         time:   [224.61 µs 231.71 µs 241.36 µs]
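
For illustration, a minimal criterion setup in the spirit of that benchmark might look like the following (this is not the actual stac-arrow/benches/read.rs; it assumes arrow-json 51, where the deprecated function still exists, and uses a made-up sample batch):

// benches/read_sketch.rs (the bench target needs `harness = false`).
#![allow(deprecated)]

use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray};
use arrow_json::{writer::record_batches_to_json_rows, ArrayWriter};
use criterion::{criterion_group, criterion_main, Criterion};
use serde_json::{Map, Value};

fn sample_batch() -> RecordBatch {
    let id: ArrayRef = Arc::new(StringArray::from(vec!["a", "b"]));
    let value: ArrayRef = Arc::new(Int32Array::from(vec![1, 2]));
    RecordBatch::try_from_iter(vec![("id", id), ("value", value)]).unwrap()
}

fn bench_read(c: &mut Criterion) {
    let batch = sample_batch();
    // Deprecated direct path: batches straight to serde_json rows.
    c.bench_function("read/record_batches_to_json_rows", |b| {
        b.iter(|| record_batches_to_json_rows(&[&batch]).unwrap())
    });
    // Writer path: serialize to JSON bytes, then re-parse with serde_json.
    c.bench_function("read/writer", |b| {
        b.iter(|| {
            let mut writer = ArrayWriter::new(Vec::new());
            writer.write_batches(&[&batch]).unwrap();
            writer.finish().unwrap();
            let rows: Vec<Map<String, Value>> =
                serde_json::from_slice(&writer.into_inner()).unwrap();
            rows
        })
    });
}

criterion_group!(benches, bench_read);
criterion_main!(benches);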

@gadomski mentioned this pull request on Jun 5, 2024
@kylebarron (Collaborator) commented:

If there's anything in geoarrow-rs that you'd like to use that I can help with, let me know.

@gadomski force-pushed the stac-arrow branch 7 times, most recently from 0710102 to 31ca4a2 on June 6, 2024 22:24
@gadomski (Member, Author) commented Jun 6, 2024

@kylebarron I think I've got a working start. There are obviously many edge cases in the geoarrow/geoparquet writing that I'll need to handle later, but for now stac-geoparquet can read what I write, so it's a start.
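
As an aside, a generic sanity check that the written file reads back as Arrow record batches could look like this (it uses the parquet crate directly and does not exercise the stac-geoparquet Python reader; the path is the one from the benchmark below):

use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the stac-geoparquet file written by `stac convert` and stream it back
    // as Arrow record batches, counting rows as a cheap round-trip check.
    let file = File::open("/Users/gadomski/Desktop/longmont-rs.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    let mut rows = 0;
    for batch in reader {
        rows += batch?.num_rows();
    }
    println!("read {rows} rows");
    Ok(())
}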

Some simple benchmarking against 10k Sentinel-2 items suggests we might get some performance gains (though I'm not accounting for compression yet, so it could be an unfair comparison):

$ hyperfine --warmup 3 'target/release/stac convert ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet' 'python ~/Desktop/to_stac_geoparquet.py'
Benchmark 1: target/release/stac convert ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet
  Time (mean ± σ):      1.322 s ±  0.051 s    [User: 0.971 s, System: 0.246 s]
  Range (min … max):    1.265 s …  1.410 s    10 runs
 
Benchmark 2: python ~/Desktop/to_stac_geoparquet.py
  Time (mean ± σ):      2.374 s ±  0.061 s    [User: 2.111 s, System: 0.542 s]
  Range (min … max):    2.317 s …  2.490 s    10 runs
 
Summary
  target/release/stac convert ~/Desktop/longmont.json ~/Desktop/longmont-rs.parquet ran
    1.80 ± 0.08 times faster than python ~/Desktop/to_stac_geoparquet.py

Query:

target/release/stac search https://earth-search.aws.element84.com/v1 \
    -c sentinel-2-c1-l2a \
    --max-items 10000 \
    --sortby='-properties.datetime' \
    --intersects '{"type":"Point","coordinates":[-105.1019,40.1672]}' > ~/Desktop/longmont.json

Benchmark script:

import json

import stac_geoparquet.arrow
from pyarrow import Table

with open("/Users/gadomski/Desktop/longmont.json") as f:
    items = json.load(f)["features"]
table = Table.from_batches(stac_geoparquet.arrow.parse_stac_items_to_arrow(items))
stac_geoparquet.arrow.to_parquet(table, "/Users/gadomski/Desktop/longmont-py.parquet")

@gadomski force-pushed the stac-arrow branch 7 times, most recently from 5b32adb to 703d0aa on June 7, 2024 16:03
scripts/requirements.in (review thread; outdated, resolved)
@kylebarron (Collaborator) commented:

> Some simple benchmarking against 10k Sentinel-2 items suggests we might get some performance gains (though I'm not accounting for compression yet, so it could be an unfair comparison):

I'm a little surprised the Rust version isn't more than 1.8x faster. I suppose on the Python side, once the JSON is converted to Arrow, it's all compiled code anyway.

@kylebarron (Collaborator) commented Jun 14, 2024

Ah, also: this is the most efficient way to convert JSON to Arrow (though it uses the most memory), because it can pass those dicts directly to pyarrow, whose conversion to Arrow is presumably fully compiled:

table = Table.from_batches(stac_geoparquet.arrow.parse_stac_items_to_arrow(items))

@kylebarron (Collaborator) commented Jun 25, 2024

A couple of notes on this if you want to make Python bindings. I've been (slowly) making progress on https://github.com/kylebarron/arro3, a core library that aims to make it easier to build Arrow-based Python packages from Rust. You can see a high-level example in https://github.com/kylebarron/arro3/blob/fc47f7c624d753947d1921086a7714512c1d8bbe/arro3-compute/src/concat.rs, where you can just take input: PyRecordBatchReader and access the same Arrow stream that parse_stac_items_to_arrow is generating. arro3 aims to handle all the Python-Rust FFI for you, so you can call input.into_reader() to get the Rust Box<dyn RecordBatchReader + Send>.

You can also export a stream of record batches as a Python RecordBatchReader using the to_python method. Note that this exports an arro3.core.RecordBatchReader, not a pyarrow.RecordBatchReader, but you can convert to a pyarrow RecordBatchReader with pyarrow.RecordBatchReader.from_stream(arro3.core.RecordBatchReader) at zero cost. This only requires a Python-side runtime dependency on arro3.core (which tries to be as small as possible, i.e. ~1 MB instead of pyarrow's ~120 MB).
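
A rough sketch of that pattern, with a hypothetical pass_through function; the pyo3_arrow method names (into_reader, to_python) are taken from the description above and may differ in current arro3 releases:

use arrow_array::RecordBatchReader;
use pyo3::prelude::*;
use pyo3_arrow::PyRecordBatchReader;

#[pyfunction]
fn pass_through(py: Python, input: PyRecordBatchReader) -> PyResult<PyObject> {
    // Access the underlying Rust stream handed over from Python.
    let reader: Box<dyn RecordBatchReader + Send> = input.into_reader()?;
    // ... read, transform, or wrap the record batches here ...
    // Export the stream back to Python as an arro3.core.RecordBatchReader.
    PyRecordBatchReader::new(reader).to_python(py)
}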

@kylebarron (Collaborator) commented:

The goal for geoarrow-rs's Python bindings for 0.3 is to exclusively use arro3.core objects and not define any new classes itself, so GeoTable will be replaced by returning an arro3.core.Table.

@gadomski (Member, Author) commented:

Marking this as draft until geoarrow-rs goes v0.3.

@kylebarron (Collaborator) commented:

> Marking this as draft until geoarrow-rs goes v0.3.

Sounds like you've seen some breaking changes 🫣

@gadomski (Member, Author) commented:

> Sounds like you've seen some breaking changes 🫣

I was finding that I needed to be on the v0.3-alpha release, or a specific SHA, to get the stuff I needed. No biggie :-)

@kylebarron (Collaborator) commented:

I've also been making a decent number of changes on the Rust side, like 3D support.

@gadomski mentioned this pull request on Aug 7, 2024
@gadomski (Member, Author) commented Aug 8, 2024

Superseded by #287

@gadomski closed this on Aug 8, 2024
@gadomski deleted the stac-arrow branch on August 8, 2024 23:08
Successfully merging this pull request may close these issues:
  • Add stac-geoparquet reading and writing

2 participants