Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add a module to read all Substrait plan formats #45

Draft
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

EpsilonPrime
Copy link
Member

@EpsilonPrime EpsilonPrime commented Feb 6, 2024

With the generated Substrait plan protobufs already in this repository it is
possible to read and write binary protobufs, text protobufs, and json-encoded
protobufs. This PR adds the capability to additionally read and write the
Substrait text format as well as auto-detect formats when reading. It does this
by wrapping the substrait-cpp implementation with ctypes.

Copy link

github-actions bot commented Feb 6, 2024

ACTION NEEDED

Substrait follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@EpsilonPrime EpsilonPrime changed the title feature: Add a module to read all Substrait plan formats feat: add a module to read all Substrait plan formats Feb 6, 2024
@EpsilonPrime
Copy link
Member Author

EpsilonPrime commented Feb 6, 2024

This PR depends on substrait-io/substrait-cpp#91 (the substrait-cpp submodule dependency) which is already submitted. A concern will be to keep the version of Substrait that both substrait-python and substrait-cpp depend on in sync.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid the potential memory leak and then this is good from my perspective. Can't comment too much on the build.

tests/test_planloader.py Outdated Show resolved Hide resolved
tests/test_planloader.py Show resolved Hide resolved
Comment on lines 63 to 67
if result.contents.errorMessage:
raise PlanFileException(result.contents.errorMessage)
data = ctypes.string_at(result.contents.buffer, result.contents.size)
plan = plan_pb2.Plan()
plan.ParseFromString(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chance something goes wrong here that doesn't free the plan? For example, maybe the plan fails to deserialize into protobuf? Can you use a try / finally?

@gforsyth
Copy link
Member

gforsyth commented Feb 6, 2024

Thanks for putting this together, @EpsilonPrime ! Is the text format documented somewhere?

I have a few questions:

  1. Can we use something like pybind11 or nanobind instead of bare ctypes?
  2. How complicated is the text format? It might make more sense to reimplement the parser in Python for the Python library, just for packaging and testing reasons, because...

we currently ship a single no-arch wheel. Once we're wrapping c++ we'll have to build versions for every architecture and also for every version of Python we aim to support.

@EpsilonPrime
Copy link
Member Author

Thanks for putting this together, @EpsilonPrime ! Is the text format documented somewhere?

I have a few questions:

  1. Can we use something like pybind11 or nanobind instead of bare ctypes?
  2. How complicated is the text format? It might make more sense to reimplement the parser in Python for the Python library, just for packaging and testing reasons, because...

we currently ship a single no-arch wheel. Once we're wrapping c++ we'll have to build versions for every architecture and also for every version of Python we aim to support.

I'll look into the other ways of including the C++ code -- performance really isn't a concern though because we have to serialize and deserialize the plan anyway to cross the language boundary (unless we want to enforce C++ protobuffers in Python which seems even more restrictive since we'd need to match versions more closely).

While implementing in native Python would be nice there's a full parser behind the text format which required 6 months of implementation in C++ (take a look at the substrait-io/substrait-cpp change log). There is an Antlr4 grammar file which specifies the language for the text format so part of the work is available and perhaps it'd be easier to power through it given that groundwork. My availability is limited (probably no time for a second language version this year) so if it's possible in the short term, that'd be ideal.

One thing we may want to consider is temporarily making the new library hidden behind a flag so that systems not ready to take on the feature don't need to be burdened by an OS-specific build.

@danepitkin
Copy link
Contributor

danepitkin commented Feb 9, 2024

Very cool! It's great to see new features added to the python impl.

I'm a big +1 on using nanobind (1st choice) or pybind11.

Also, it might be nice to make these additional features optional to downstream dependencies. For example, would Ibis want this included by default? We could organize the python package such that substrait only includes the generated protobuf classes and substrait[serializers](?) includes the binary/text/json [de]serializers. And then substrait[all] will always include everything. WDYT?

Edit: these suggestions can be follow up PRs, not necessarily required for merging

@jorisvandenbossche
Copy link

We could organize the python package such that substrait only includes the generated protobuf classes and substrait[serializers](?) includes the binary/text/json [de]serializers. And then substrait[all] will always include everything. WDYT?

AFAIK that will only work if we package the python bindings for substrait-cpp separately (a separate python package), such that substrait-python can depend on that additional package (I think extras can only be used for dependencies, not for enabling features within the library itself, except by relying on an external dependency)

Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the pyproject.toml file we might be able to get this working if we add cmake to the requires list under [build-system] -- I believe that the setuptools.build_meta build-backend has at least some support for that.
If not, we can look at scikit-build to help with getting things building as part of the pip installation, and that will help us feed something like cibuildwheel for handling packaging for all the python and architecture versions.

Comment on lines +28 to +35
- name: Build Substrait planloader library
run: |
cd ${{ github.workspace }}/third_party/substrait-cpp
make release
- name: Install Substrait planloader library
run: |
cd ${{ github.workspace }}/third_party/substrait-cpp/build-Release/export/planloader
make install
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to handle compilation via the pyproject.toml file, I'll leave some notes there.

Comment on lines +29 to +36
- name: Build Substrait planloader library
run: |
cd ${{ github.workspace }}/third_party/substrait-cpp
make release
- name: Install Substrait planloader library
run: |
cd ${{ github.workspace }}/third_party/substrait-cpp/build-Release/export/planloader
make install
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to handle compilation via the pyproject.toml file, I'll leave some notes there.

@gforsyth
Copy link
Member

In the event that the nested submodules make this extra painful (which seems... likely?) we could always use git subtree, which has much more predictable behavior in my estimation.

@amol-
Copy link
Contributor

amol- commented Feb 15, 2024

@EpsilonPrime can you please add an example to the README to show how to parse/produce a text plan?

Also I think it would be helpful to be able to load/save the plans to a string instead of a file. Is that a feature the C++ library can expose? I guess we could implement that by using a NamedTemporaryFile and reading it, but that sounds more like an hack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants