-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Patches carried by conda-forge for packaging sentencepiece #876
Comments
@taku910, did you have a moment to have a look at this issue already? Would there be a way to collaborate on this, e.g. starting with the abseil situation? |
Gentle ping @taku910 - is there a way I could contribute e.g. some of the abseil changes (to truly build against an external abseil) in a way that it would be acceptable to you? |
I've turned the patches against 0.1.99 into a tag, and rebased the branch of our patches against current master. With all the recent changes, the patch set has shrunken a lot, which is great! Thanks for your efforts to solve some of these tricky problems! 🙏 Of the remaining patches, the main thing that would be interesting to us to upstream in some form would be:
Is this something that you would be interested in @taku910? Long-term it would be great to also support builds against shared abseil & protobuf (resp. shared sentencepiece builds on windows). |
Hi there - I help package many a package for conda-forge, which is a pretty big ecosystem of packages especially in data science and ML/AI. It started out mainly focussed on Python, but has come to include many other languages as well.
One strong point (especially in comparison to pip) is that we are able to sanely redistribute non-Python artefacts. We can track the relevant ABI of a package and rebuild its descendant packages where necessary. This allows us to (almost) exclusively distribute shared libraries, which lowers package footprints, and has several other benefits.
We've been shipping sentencepiece for about 3 years through the so-called feedstock, and packages can be installed like:
We have builds for linux-{64, aarch64, ppc64le}, osx-{64, arm64}, win-64, multiplied by CPython 3.8-3.11 as well as PyPy 3.8 & 3.9.
In the context of wanting to package torchtext (which depends on sentencepiece -- but really only the C++ bits) I wanted to split off a separate "output" / "package" called
libsentencepiece
that torchtext can build against, without having to spuriously rebuild it per python version (it would literally be the same in each case).Also, crucially, we absolutely need to rely on our own abseil & protobuf builds, which became a pretty big sticking point. I finally managed to pull off this split last year (there's references there to my very long-winded previous tries; eventually I gave up on trying to build a shared library on windows because it was getting too insane1).
However, at the time, I ran out of steam to raise an issue here, and discuss how - if at all - it would be possible to upstream some of these changes. That's all the more the case because some of the patches I came up with are definitely not ready for primetime here. I got quite exasperated with how the code in
thirdparty/absl
was neither third-party nor abseil proper2, so I ended up replacing that wholesale. Of course that doesn't mean I'm suggesting to do this here verbatim.However, the abseil situation is perhaps one of the biggest sticking points. There's a variable called
SPM_USE_EXTERNAL_ABSL
, but it really doesn't do what it says on the tin, because all the include paths still point tothird_party/absl
, rather than abseil proper. This issue has also been encountered by other people: #869.Since I came up with my initial patches, sentencepiece is not forcing static linkage (
/MT
) of the MSVC runtime anymore (yay!). Other list of changes that were necessary for us:PROTOBUF_USE_DLLS
3)CMAKE_INSTALL_PREFIX
libsentencepiece
(in our case, this will always match the underlying version 1:1, though the ABI-guarantees - or lack thereof - of sentencepiece would be good to know)I've rebased my patches (branch) against 0.1.99 and am currently re-building our redistribution recipe against both protobuf 3.21 & 4.23 (this also runs the test suite afterwards to catch regressions).
Would be happy if we could find a way that lets some of these improvements flow back upstream. 🙃
(I fully understand that some changes are specific to our needs, and it's fine if we keep carrying some patches; but any reduction helps, and some of the changes would also benefit other people I believe)
I'm happy to prepare dedicated PRs for individual changes, but wanted to discuss first which bits are interesting to the maintainers here, and how we could make them palatable for merging here (e.g. with switches based on symbols or CMake resp. environment variables).
Footnotes
by that point I was already patching the output generated by protobuf for some silly constexpr errors ↩
as force-replacing it with actual abseil lead to build errors because -- to the best of my understanding -- it contains sentencepiece-specific glue-code, but in the
absl
namespace. ↩this at least should be getting easier as of protobuf 23, because autotools is gone, and we have better metadata for CMake / pkgconfig. ↩
The text was updated successfully, but these errors were encountered: