Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Patches for clean Fedora RPM builds that pass unit tests #123

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

xiphmont
Copy link

Fedora has had a bad run past few years of shipping nonfunctional or only partially functional netgen builds despite carrying a number of downstream patches. This is an effort to get Fedora (my native development platform) support up to parity with Ubuntu/Win/Mac.

The issues addressed here are varied and wide-ranging; I'd not be surprised if my approach at various points isn't considered the best option. However, it's a complete minimal cut to get working packages that build and pass internal tests, and a sort of base-minimum I needed to continue on other pieces (like SALOME reconciliation).

One thing that's especially bugging me: Netgen has been at 6.2 for a comparatively long time now, and I'm not sure any of the 'tweak' releases have actually been cross-abi compatible. Is it perhaps time to beak to 6.3, or maybe even 7? The internal and external changes seem to warrant a step of that magnitude.

Having version.hpp throw an exception when built against a malformed
version number (due to, eg, not having the git tag available in an
automated package build from source bundle) results in _init_ failing
during an otherwise apparently successful dlopen(), as well as the
netgen-mesher executable crashing before hitting main().

This patch both forces an abort and prints a helpful message alerting
the builder/user that something went wrong instead of handing over
either a coredump or a weirdly malfunctioning library with no apparent
cause.  Hopefully this will avoid a repeat of automated Fedora builds
shipping nonfunctional Netgen RPMs for several years!
Add explicit defaults where missing for clarity.

Support passing in NETGEN_GIT_VERSION for use when building, eg, RPM
packages using rpkg, which requires building from an untagged source
bundle, that is, not directly from a git checkout.

Add option to choose build against internal or external Pybind11; add
cmake module to find and configure against system Pybind.

re-add versioning of shared library files.

Add section for GCC-specific options setting (right now, to turn off
harmless warnings).

Warn if pybind11_stubgen is not found.

Adjust stubgen paths to work when build is not in-place in the source tree.
The name of the 'netgen' executable collides with an older UNIX pcb
trace routing application also named 'netgen'.  Fedora, for this
reason, renames this mesher package to 'netgen-mesher' and the
executable, likewise, to 'netgen-mesher'.  I propose the same change
here.

Ironically though, the current python module is already named
'netgen-mesher', which complicates loading of the module as 'netgen',
often resulting in it being unfindable.  For this reason, I similarly
reverse the usage, renaming the python module from 'netgen-mesher' to
'netgen'.
Adjust path setup of the python modules in __init__ such that it's
possible to run pytests from a mock root during package build/test.

Also add checks to pytest/CMakelists.txt to make sure pytest abd check
are present instead of failing with an inscrutible traceback if
they're not installed or usable.
Hardened toolchains (eg, mainline Fedora) no longer tolerate null
derefs that were silently trapped/ignored in earlier versions.  This
eliminates a nullptr deref in archive.hpp that was failing several
unit tests (and causing crashes) when trying to manipulate
default-constructed archive objects.
A default-constructed (or just empty) ngarray will have 'ownmem' set
despite not having an allocated data array.  Destructor would then
trigger an abort.
Don't delete memory that wasn't allocated.

Similarly, we can't deref the first element of an array that was never
allocated.
Several occurrences of int -> cgsize_t
Add DLL_HEADER to Surface, Primitive, and STLTopology in order to make
typeinfo available to link stage (otherwise, link fails with GNU ld).
I'm uncertain about earlier/alternate versions, but the current
version of mainline pybind11_stubgen requires a number of changes in
the python binding cpp files to succeed:

__repr__ implementation added for some basic types

Added default descriptions via py::arg_v variant syntax for those
cases where a __repr__ implementation made no sense (eg, for a default
sonstructor of a complex type)

Moved class declarations to precede first use in arguments/returns

Added trailing default arguments, so that mandatory args did not
follow optional args (without resorting to kwargs trickery).
Build all Netgen RPMs from cli via:
rpkg local --spec package/fedora/netgen-mesher-bleed.spec
from the toplevel source directory
Comment on lines -298 to +300
if (ownmem)
delete [] data;
if (data)
if (ownmem)
delete [] data;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not make any sense. When data is a nullptr, then the delete [] is a noop.
Dito below.

Maybe the actual problem was fixed with 5423242

Copy link
Author

@xiphmont xiphmont Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ has somehow continued to make behavior of delete nullptr ambiguous, and gcc/glibc crash on this if they're built with object instance nullptr checking disabled. Unfortunately, that's now the default config on RedHat-style distros since F35.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an object instance nullptr (*this), but a regular delete [].
If this breaks on Fedora, they are doing something wrong.

https://isocpp.org/wiki/faq/freestore-mgmt#delete-handles-null

Do I need to check for null before delete p?
No!

vstring = vstring.substr(1,vstring.size()-1);
auto dot = vstring.find('.');
mayor_ = std::stoi(vstring.substr(0,dot));
if(dot == size_t(-1)) vstring = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compare with std::string::npos, instead of type-casting.

auto dot = vstring.find('.');
mayor_ = std::stoi(vstring.substr(0,dot));
if(dot == size_t(-1)) vstring = "";
else vstring = vstring.substr(dot+1, vstring.size()-dot-1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

substr() returns up to the end when the second argument is omitted (i.e. uses the std::string::npos default argument).

@@ -0,0 +1,202 @@
# - Find python libraries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, no vendored and totally outdated cmake stuff.
Pybind11 supports cmake config mode for ages, and does not require any external helpers.

Copy link
Author

@xiphmont xiphmont Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a patch that has been making the rounds for years as part of distro-packaged netgen-- I agree, it's pretty heavy/ugly. I used it only because I knew it worked. I'll do something better.

The general problem with CMake is that it is mostly unaware that RedHat distributions exist at all, and there does not seem to be any release testing against RH distros. A lot of the trouble is RedHat's own fault, of course, and CMake authors have been very responsive to suggested fixes. But that does mean that a bleeding-edge CMake is needed to build things, so I have generally opted for workarounds if they're not too awful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #127

Comment on lines -228 to +229
Do(&v[0], size);
if(size > 0) // can't deref v[0] if it doesn't exist
Do(&v[0], size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just v.data() instead? Thats always valid.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was just keeping the pattern. I'll change it.

@xiphmont
Copy link
Author

xiphmont commented Jul 6, 2022

This needs a rebase and general cleanup taking comments into account. I'll have an updated patchset soon after a little more discussion (eg, I name my experimental packages with 'bleed' and I had meant to undo that before submitting but forgot).

memcpy (p, line.col, line.maxsize * elsize);
delete [] (char*)line.col;

if (line.size > 0 && line.col != NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is incorrect.
Via SetSize2(row, 5); SetSize2(row, 0);, line.size becomes 0, but that does not mean the backing memory should not be freed.

Though, the allocation/deallocation is considerably flawed. The memory pointed to by line.col must only be freed when it does not point into oneblock. Unfortunately, that is not easy to check, as there is no end pointer.

@StefanBruens
Copy link
Contributor

This needs a rebase and general cleanup taking comments into account.

With github, this is typically noted by prefixing the PR title with WIP: .

@xiphmont xiphmont changed the title Patches for clean Fedora RPM builds that pass unit tests WIP: Patches for clean Fedora RPM builds that pass unit tests Jul 13, 2022
@xiphmont
Copy link
Author

xiphmont commented Jul 13, 2022

WIP: added to both PRs, as both will need more revision/discussion.
Apologies for the momentary lack of attention, work has been busy. I appreciate the deep and detailed review.

The biggest cause of latency on my side is looking into the nature of the nullptr checking changes in GCC and Glibc recently. In fact there has been some very deep churn there and I believe you're correct that something was very much not right in F35 (and may have been corrected already in F36/37).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants