Zero-copy compatibility with the JS implementation? #3

Open
kylebarron opened this issue Oct 9, 2023 · 4 comments

Comments

@kylebarron

kylebarron commented Oct 9, 2023

I assume the answer will be no, in which case feel free to close this issue.

I'm interested in use cases where I can share the index data with other implementations, such as flatbush in the browser or a Python flatbush implementation. In the original JS implementation, the entire index is self-contained in a single buffer, which allows it to be shared across web workers in the browser and also with e.g. a rust-wasm implementation.

I see that your implementation takes a sensible Rust approach built on Rust-native types. Using a single backing buffer would probably require significant changes to the implementation, which you understandably might not be interested in if the only payoff is possible future FFI use. Any thoughts?

@jbuckmccready
Owner

Hmm, I don't have a use case for this, but I understand the utility when trying to interop with other languages using the same library in the same memory space. I think it could be done in Rust using the bytemuck crate (https://docs.rs/bytemuck/latest/bytemuck/): create a buffer of bytes and then cast subslices of the buffer to the different types (number type and index type) as needed. It would require additional constraints on the generic traits and add a dependency on bytemuck.
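
As a rough illustration of the byte-reinterpretation idea, the sketch below views sub-ranges of one aligned byte buffer as `f64` and `u32` slices without copying. It uses only the standard library rather than bytemuck, so that it has no dependencies. The `Plain` trait, `view_as`, and `AlignedBuf` are hypothetical names invented for this example; `Plain` stands in for the kind of extra constraint a crate like bytemuck would place on the number types.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical marker trait: any bit pattern is a valid value of the type.
/// Implementing it for a type where that is not true would be unsound.
unsafe trait Plain: Copy {}
unsafe impl Plain for f64 {}
unsafe impl Plain for u32 {}

/// View `bytes` as a slice of `T` without copying.
/// Returns `None` if the slice is misaligned for `T` or has a partial trailing element.
fn view_as<T: Plain>(bytes: &[u8]) -> Option<&[T]> {
    let size = size_of::<T>();
    let misaligned = (bytes.as_ptr() as usize) % align_of::<T>() != 0;
    if size == 0 || misaligned || bytes.len() % size != 0 {
        return None;
    }
    // SAFETY: the pointer is aligned for T, the length covers whole elements,
    // T accepts any bit pattern, and the result borrows from `bytes`.
    Some(unsafe { std::slice::from_raw_parts(bytes.as_ptr().cast::<T>(), bytes.len() / size) })
}

/// Hypothetical backing store: u64 words guarantee 8-byte alignment of the bytes.
struct AlignedBuf {
    words: Vec<u64>,
}

impl AlignedBuf {
    /// Allocate a zero-filled buffer, rounded up to a whole number of 8-byte words.
    fn zeroed(len_bytes: usize) -> Self {
        AlignedBuf { words: vec![0u64; (len_bytes + 7) / 8] }
    }

    fn as_bytes(&self) -> &[u8] {
        // SAFETY: u8 has alignment 1 and every word is initialized.
        unsafe { std::slice::from_raw_parts(self.words.as_ptr().cast::<u8>(), self.words.len() * 8) }
    }

    fn as_bytes_mut(&mut self) -> &mut [u8] {
        // SAFETY: same as `as_bytes`, with exclusive access through `&mut self`.
        unsafe { std::slice::from_raw_parts_mut(self.words.as_mut_ptr().cast::<u8>(), self.words.len() * 8) }
    }
}
```

Misaligned or odd-length ranges return `None` instead of panicking, so a caller could fall back to copying in that case.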

I am open to pull requests that add this functionality as an additional module behind a feature flag (to avoid the bytemuck dependency if not needed). The module would contain new spatial index types which utilize a single byte buffer, have different generic number constraints to work with bytemuck (or possibly not generic at all?), and share the same core algorithm (spatial index/math functions could be shared between the modules where possible).

If you are wanting to share the data using the same format but don't need to have it zero copy in the same memory space then writing a serializer/deserializer to/from a byte buffer in the same format would be simpler.
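
A minimal sketch of that simpler serializer/deserializer route, assuming a made-up layout (two little-endian `u32` counts, then the `f64` box coordinates, then the `u32` indices). This is not the actual Flatbush buffer format, and `IndexData`, `to_bytes`, and `from_bytes` are hypothetical names used only for illustration.

```rust
use std::convert::TryInto;

/// Hypothetical stand-in for the index contents.
#[derive(Debug, PartialEq)]
struct IndexData {
    boxes: Vec<f64>,
    indices: Vec<u32>,
}

/// Copy the index into a fresh byte buffer using the assumed layout.
fn to_bytes(data: &IndexData) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + data.boxes.len() * 8 + data.indices.len() * 4);
    out.extend_from_slice(&(data.boxes.len() as u32).to_le_bytes());
    out.extend_from_slice(&(data.indices.len() as u32).to_le_bytes());
    for v in &data.boxes {
        out.extend_from_slice(&v.to_le_bytes());
    }
    for i in &data.indices {
        out.extend_from_slice(&i.to_le_bytes());
    }
    out
}

/// Parse a buffer produced by `to_bytes`; returns `None` if the length does not match the header.
fn from_bytes(bytes: &[u8]) -> Option<IndexData> {
    let header = bytes.get(..8)?;
    let n_boxes = u32::from_le_bytes(header[..4].try_into().ok()?) as usize;
    let n_idx = u32::from_le_bytes(header[4..8].try_into().ok()?) as usize;
    let boxes_end = 8usize.checked_add(n_boxes.checked_mul(8)?)?;
    let idx_end = boxes_end.checked_add(n_idx.checked_mul(4)?)?;
    if bytes.len() != idx_end {
        return None;
    }
    // Each value is decoded and copied, so no alignment requirement applies here.
    let boxes = bytes[8..boxes_end]
        .chunks_exact(8)
        .map(|c| f64::from_le_bytes(c.try_into().unwrap()))
        .collect();
    let indices = bytes[boxes_end..idx_end]
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes(c.try_into().unwrap()))
        .collect();
    Some(IndexData { boxes, indices })
}
```

The trade-off relative to reinterpreting bytes in place is one full copy in each direction, in exchange for no `unsafe` code and no alignment constraints on the input buffer.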

@kylebarron
Author

Thanks for the reply! If you don't have an FFI-related use case, it might be simpler for me to prototype the library from scratch instead of trying to shoehorn it into how you've written the library so far. And then maybe at some point in the future once my version is implemented, we can see whether it makes sense to keep them as two libraries or combine them.

> If you are wanting to share the data using the same format but don't need to have it zero copy in the same memory space then writing a serializer/deserializer to/from a byte buffer in the same format would be simpler.

Yeah definitely, but I'm really focused on applications that share the same memory space (e.g. for rust-python), so even though zero-copy will be more work, I'm inclined towards that.

@jbuckmccready
Owner

jbuckmccready commented Oct 10, 2023

EDIT: Removed most of this reply because it was already answered!

I am curious what your use case is that causes the non-zero copy solution to be noticeable/the slow step, and if you do pursue the zero copy byte reinterpretation approach I'd be curious to see how it works out, cool project.

@kylebarron
Author

I'm trying to build out an open ecosystem for extensible, modular geospatial analytical data processing. I'm excited about Rust to speed up Python and JS via compiled extension modules. It's true that you can create Python bindings to a Rust library, have Rust manage the memory, and never need to worry about zero-copy. But when someone else writes a C library that would like to interface with your data, if you don't have an ABI-stable way to share the data, you need to serialize it and they need to deserialize it.

For example, in Python, Shapely (and by extension the C library GEOS) is used for most geospatial data storage. But separate Python libraries with C extensions can't use the same GEOS memory because the underlying storage isn't ABI-stable. So there has to be a serde step in between.

Apache Arrow solves this problem for generic tabular data, because it defines a language-independent, ABI-stable memory layout. So you can move memory between Python/Rust/C just by changing ownership of the pointer (e.g. Polars does this from Rust). GeoArrow, a WIP spec I'm a part of, builds on top of Arrow and solves that problem for geometries, so you can share an array of geometries between Python/Rust/C for free. (I've been working on a rust implementation of this).
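
To make "changing ownership of the pointer" concrete, here is a toy sketch of handing a heap buffer across an FFI-style boundary without copying its contents. It is not Arrow's actual interface, which carries more than a pointer and a length; `RawBuffer`, `export_buffer`, and `import_buffer` are hypothetical names, and the sketch assumes both sides agree on the layout and on who frees the memory.

```rust
/// Hypothetical C-compatible descriptor: a pointer plus a length in bytes.
#[repr(C)]
struct RawBuffer {
    ptr: *mut u8,
    len: usize,
}

/// Give up ownership of the bytes and return a raw descriptor for them.
/// `into_boxed_slice` drops excess capacity, which may reallocate once;
/// after that the bytes are never copied again.
fn export_buffer(data: Vec<u8>) -> RawBuffer {
    let boxed: Box<[u8]> = data.into_boxed_slice();
    let len = boxed.len();
    let ptr = Box::into_raw(boxed) as *mut u8;
    RawBuffer { ptr, len }
}

/// Take ownership back from a descriptor.
///
/// # Safety
/// `raw` must come from `export_buffer` and must not have been imported before.
unsafe fn import_buffer(raw: RawBuffer) -> Box<[u8]> {
    // SAFETY: per the contract above, ptr/len describe a live Box<[u8]> allocation.
    unsafe { Box::from_raw(std::ptr::slice_from_raw_parts_mut(raw.ptr, raw.len)) }
}
```

The importing side ends up owning the same allocation the exporting side created, so the cost is independent of the buffer size.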

But it's very useful to be able to share large spatial data, declare that the data is already spatially ordered, and share a spatial index for free. Of the RTree implementations I'm aware of, the only one that's fast, compact, and has an ABI-stable memory layout is the Flatbush algorithm.

My time horizon is pretty long... I'm planning on 2-5 years to maturity of GeoArrow. But watching how much innovation in the non-spatial realm is happening from Arrow, I think there's massive potential in a geospatial zero-copy ecosystem. And given that spatial indexes are absolutely core to geospatial algorithms, it's worth investing in a zero-copy rtree implementation.
