Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vl-convert-python crashes with ARM64 PointerAuthentication error #67

Closed
sacundim opened this issue Jun 23, 2023 · 12 comments · Fixed by #68
Closed

vl-convert-python crashes with ARM64 PointerAuthentication error #67

sacundim opened this issue Jun 23, 2023 · 12 comments · Fixed by #68

Comments

@sacundim
Copy link

Environment:

  • Vega-Altair 5.0.1
  • vl-convert-python 0.10.3
  • Docker container in a Linux VM on an Apple Silicon Mac

I am porting a Python app of mine from Vega-Altair 4.2.x to 5.0.1, and altair_saver to vl-convert-python, and I get errors like this after I do a certain number (that I haven't measured) of SVG exports in the same process.

#
# Fatal error in , line 0
# Check failed: Deoptimizer::IsValidReturnAddress(PointerAuthentication::StripPAC(pc), isolate_).
#
#
#
#FailureMessage Object: 0xffff98f6e880
==== C stack trace ===============================

    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x11f9624) [0xffff8876d624]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x910b24) [0xffff87e84b24]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x11f2820) [0xffff88766820]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0xe43d5c) [0xffff883b7d5c]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x97ac54) [0xffff87eeec54]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x977b2c) [0xffff87eebb2c]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x1064234) [0xffff885d8234]

I don't have much of a clue how to troubleshoot this—any suggestions welcome. Searching the web a bit, this PointerAuthentication and StripPAC stuff is related to ARM64 security features to detect if a pointer has been modified, but I have no clue if this would be a stack overflow issue vs. e.g. the Javascript interpreter doing some business with JIT code generation or garbage collection.

Two things that I have managed to establish are:

  1. Running the application multiple times in a row seems to repeatably fail at the same point in its execution (well, at least so far in about 3-4 tries).
  2. However, the input JSON spec that the app attempts to export at that point is not the trigger for the failure. I can tell this because:
    • The app is a single-threaded loop through a list of independent tasks that generate a Vega-Lite JSON spec and export it as SVG;
    • I can trivially change the order in which these tasks are executed;
    • Doing so causes the subtask where I originally observed the failure to succeed, and one that I originally saw succeed to fail.

I'm still doing some investigation to see if e.g. there's a fixed number of invocations of SVG export that triggers the failure or there is one subtask that succeeds but somehow leaves the process in a bad state that causes the subsequent ones to fail, but I don't know if there is a more direct way of figuring out what the cause is so I thought I'd file this ticket right away.

@sacundim
Copy link
Author

Update: I just ran the same codebase in an x86 Mac and it worked fine there.

@sacundim
Copy link
Author

Searching and reading a bit more, the error message ("Check failed: Deoptimizer::IsValidReturnAddress(PointerAuthentication::StripPAC(pc), isolate_)") seems to come from the V8 engine. Some deep nuance about the way vl-convert is interfacing to V8?

sacundim pushed a commit to sacundim/vl-convert_arm64_test that referenced this issue Jun 23, 2023
@sacundim
Copy link
Author

I wrote a very small standalone test program to try and reproduce this issue, I just bundle all the Vega-Lite specs that are generated by the application where I see the problem and pass them straight into vl-convert-python:

But no dice, the test program runs just fine (takes about 45 seconds on my Mac). The issue must lie in something about the interaction either of the Rust interface to the V8 engine, or how the Rust library that does it interplays with the Python interpreter where it gets embedded.

@sacundim
Copy link
Author

Success! I changed the test program a bit to run multiple rounds of the whole set of files, and now it does reproduce the issue.

@jonmmease
Copy link
Collaborator

Thanks for the report, investigation, and repro @sacundim. This is really helpful.

I don't have any ideas off the top of my head, but one thing that comes to mind is that we've previously run into an issue with the arm64 linux wheels that cropped up only when cross-compiling from linux on x86. This issue went away when I compiled locally inside an arm64 docker container (running on an M1 mac). In the end we worked around the issue by downgrading deno/v8 and continuing to cross-compile with GitHub actions.

So the first thing I want to try is to see if there's a difference in behavior for wheels that are compiled inside an arm64 docker container (rather than cross compiled).

@jonmmease
Copy link
Collaborator

Ok, I was able to reproduce the crash (thanks again for the easy to use repro @sacundim).

I rebuilt the vl_convert_python-0.10.3-cp39-cp39-manylinux_2_31_aarch64.whl wheel locally using a rust base image on my M1 mac, and then reran the tests with this, and it's now completing successfully for me.

Could you give this wheel a try and see if it fixes the issue for you as well?

https://filedn.com/lo5VE4SmtWKXIvNsinHVy7F/packages/vl_convert_python-0.10.3-cp39-cp39-manylinux_2_31_aarch64.whl

The take away is that it looks like we can't trust cross-compilation for building aarch64 images. I'll start looking into options for building on aarch64 directly on GitHub actions.

@sacundim
Copy link
Author

I'm trying the hand-compiled one now. A 6-round run of the test case just succeeded, now I'm trying 12 rounds to really stress it. Will later try the original program.

I wonder if you get the compilation process to dump out all the exact compiler options it uses, and compare between the cross-compilation and native environments, if you will see that they're different. Because obviously I could be wrong, but my inclination is to suspect that cross-compilation should produce the same object code given the same source files and compiler options. (Modulo nondeterministic factors like random seeds affecting hash table orders and such)

@sacundim
Copy link
Author

The 12-round run with the recompiled package succeeds, and two runs of the original app do so as well.

@jonmmease
Copy link
Collaborator

Ok, sounds good! I've had some luck cross compiling working packages on GitHub actions using QEMU. I'll ping you again when I have another wheel to try out. Thanks!

@jonmmease
Copy link
Collaborator

Here's a new wheel that I built on GitHub Actions using QEMU (in #68). It's also built with manylinux2014 support, so should be just as compatible as the current cross-compiled wheels.

vl_convert_python-0.10.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

This is running the repro successfully for me, so hopefully this takes care of the issue!

@jonmmease
Copy link
Collaborator

Should be fixed in the just released 0.11.0. Thanks again for the repro, this wouldn't have gotten fixed otherwise!

sacundim pushed a commit to sacundim/covid-19-puerto-rico that referenced this issue Jun 26, 2023
@sacundim
Copy link
Author

I tried 0.11.0 and just confirmed that my app runs successfully. Thanks for the quick turnaround!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants