Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failures on big endian architecture. #446

Open
xnox opened this issue Aug 19, 2016 · 31 comments
Open

Test failures on big endian architecture. #446

xnox opened this issue Aug 19, 2016 · 31 comments

Comments

@xnox
Copy link
Contributor

xnox commented Aug 19, 2016

s390x is 64bit big endian.
test.txt

@pmienk
Copy link
Member

pmienk commented Aug 19, 2016

Assume fixed by #445, thanks!

@pmienk pmienk closed this as completed Aug 19, 2016
@xnox
Copy link
Contributor Author

xnox commented Aug 19, 2016

Nope, it is not =) #445 is compile failure. This is the test-suite failure that one observes after applying #445 fix.

@pmienk pmienk reopened this Aug 19, 2016
@narodnik
Copy link
Contributor

Is it possible to get safe access to a VPS with this arch to fix this issue? This looks like an endianness issue. See the errors like [7f8000 != ff8000] and [817f != 017f] Thanks.

@pmienk
Copy link
Member

pmienk commented Aug 21, 2016

I assume it's in large part due to a lack of conditionality in the endian
functions. I didn't see any simple way to get travis-ci to cover big
endian systems and don't have one myself. Was hoping to make a checkin in
the next day or two and see if it addresses the issue.

On Sun, Aug 21, 2016, 1:19 AM RojavaCrypto notifications@github.com wrote:

Is it possible to get safe access to a VPS with this arch to fix this
issue? This looks like an endianness issue. See the errors like [7f8000
!= ff8000] and [817f != 017f] Thanks.


You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
#446 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGL5OG1Ca46M74ox5N9QXMBDrpbmeJYwks5qiAn9gaJpZM4Jof3Y
.

@xnox
Copy link
Contributor Author

xnox commented Aug 22, 2016

@RojavaCrypto i am not aware of any public access to any big endian systems. E.g. powerpc (32bit, big endian) or the s390x (64bit, big endian). Whilst arm64 is wide spread, and hardware has support for arm64be - nobody has a linux port for arm64be, everyone does little endian only.

@pmienk
Copy link
Member

pmienk commented Aug 23, 2016

It appears I was wrong as to the involvement of the to_ and from_ endian functions as for the most part they rely on bit shift operators which should provide platform independence. I've corrected what looks to be an oversight I made there w.r.t. endianness, but doubt it has much to do with the failures seen.

@evoskuil
Copy link
Member

evoskuil commented Sep 6, 2016

Looking at the test failures, these are the most interesting:

test/utility/endian.cpp(98): fatal error: in "endian_tests/from_little_endian_stream_unsafe_eof_stream_partial_read": critical check expected == value has failed
test/utility/endian.cpp(128): fatal error: in "endian_tests/from_little_endian_stream_unsafe_valid": critical check expected == value has failed

Notice that all of the BE tests succeed, as do some of the LE tests. Unfortunately Boolean boost macros are used in these two tests, instead of integer value macros, so we can't see detail of the failures. I've created a PR to address this issue and better factor and standardize the endianness tests.

One of the three LE stream tests succeeds but it is a negative test with no value comparison - the output of its from_little_endian_stream_unsafe call is unused. Therefore the common denominator of the endian test failures is from_little_endian_stream_unsafe on BE architecture. Given that LE streaming is used extensively it could be the cause of some (or all) of the other failures.

@evoskuil
Copy link
Member

evoskuil commented Sep 6, 2016

Looking at recent history, it looks like the immediately-above issue has been addressed by 8c89522. This should resolve the two endian test failures and certainly a number of others. @xnox could you please re-execute the test pass with latest code?

@evoskuil evoskuil changed the title Test suite failure on s390x Endianness-related test failures on s390x (big endian). Sep 6, 2016
@evoskuil evoskuil added the bug label Sep 6, 2016
@evoskuil
Copy link
Member

evoskuil commented Sep 7, 2016

Based on the objectives of simplicity and full test coverage, the SHA1 and RIPEMD160 implementations have been replaced with architecture-independent implementations. Given that tests in these implementations were failing in this platform, this should also provide resolution for that part of the issue.

#462

@evoskuil
Copy link
Member

evoskuil commented Sep 7, 2016

A problem of invalid (overflowing) test data in the script_number implementation has been resolved by #459. The implementation has been modified to handle the full int64_t domain and to throw when that domain is over/under-flowed. The previous behavior was assertion-based and is documented to vary by architecture. It is likely that a large number of test failures resulted from this issue on this platform.

@evoskuil
Copy link
Member

evoskuil commented Sep 7, 2016

There are no known outstanding test failures or identified issues pertaining to endianness. Given that we do not have a BE test platform we await feedback.

@xnox
Copy link
Contributor Author

xnox commented Sep 8, 2016

Testing with 4beb7cc i still see thousands of failures, and there are still failures in e.g. script tests... the en/decoding still doesn't appear to take into account endiannes, either the testsuite expected output (false negatives) or the code itself (true failures)
Eg.
test/math/script_number.cpp(82): error: in "script_number_tests/check_operators": check encode_base16((add).bytes) == encode_base16((scriptnum1 + num2).data()) has failed [7e01 != fe01

Will attach test.log in a moment.

@xnox
Copy link
Contributor Author

xnox commented Sep 8, 2016

test.txt

@evoskuil
Copy link
Member

evoskuil commented Sep 8, 2016

@xnox Thanks for re-running the BE tests.

The number of failures is not indicative of the number of code errors. The script_number failures all results from one test, whereby the test iterates through a large amount of reference data. We try to avoid these types of non-isolating tests but many remain.

The code does (and previously did) take into account endiness. The problem is that the BE implementations are not regularly tested, and untested code is broken code. There are two ways to account for endinaness:

(1) Make architecture-aware code for each supported architecture, detect the architecture at compile time and switch code paths.
(2) Make architecture independent code.

The previous implementation was mostly architecture independent, but had two hash functions that were not. It was clear that, at least in this case, the architecture detection was not working. Given that this technique implies the code will not be tested on our current CI system, I changed out the two hash functions for architecture independent implementations. This was successful, there are no longer failures in test/math/hash.cpp , and all of the dependent test/wallet/ failures from the previous run are fixed.

There was also unintended architecture dependence in one of our endian functions, which was changed to architecture independent. These test/utility/endian.cpp test failures have been fixed in the latest run.

Consequently all of the previous test/chain/script.cpp failures have also been resolved. But we have somehow introduced a failure in sha256:

test/math/hash.cpp(66): fatal error: in "hash_tests/sha256_hash_test": ...

This is causing failures in all serializable types (including different failures than before in script.cpp).

I'm fairly certain that all of the remaining failures result from the (new) sha256 error and an error in script_number. We were not confident that the change to script_number in the last attempt would resolve the issue. We may have to alter its test cases in order to get more information.

@evoskuil
Copy link
Member

evoskuil commented Sep 8, 2016

It turns out that the sha256 test is dependent upon header.hash() and is therefore failing for the same reason as the serializable types. I've replaced the test with a clean test of functionality, which will resolve this. I believe that @pmienk has isolated the cause of the failures in the serializable types. This should just leave us with the issue in script_number.

@xnox
Copy link
Contributor Author

xnox commented Sep 9, 2016

Over at https://launchpad.net/ubuntu/+source/libbitcoin/2.11.0-1build1 the test-suite fails on arm64, armhf, ppc64el, s390x - which is a mix of big/little endian and 32bit/64bit platforms that are all non-x86. Seems like there is some x86-ish assumption somewhere. I'll figure out how to test/integrate these. I thought there were CI things that can hook into github and build stuff on non-x86.

@evoskuil
Copy link
Member

evoskuil commented Sep 9, 2016

@xnox The results show a test failure (not a compile failure) on three of the five little-endian (ppc64el) and bi-endian (arm64, armhf) platforms . This is a distinct issue from this compile issue, and probably not endianess-related. Without the test log we cannot resolve the issue, but I am aware of people building successfully for ARM. Again, untested means broken, and we don't currently have ARM or PPC in our test matrix. Please create a new issue for these.

The (remaining) powerpc and s390x failures are big-endian-related compile failures, but given these are in version 2.x that is unsurprising. We have been applying fixes to v3.x (master) only. In fact the error that you patched is the one exposed:

 #if SHA1_BIG_ENDIAN

Once we have the endianess issues resolved we may decide to backport to v2.x but presently all work is in 3.x.

It would be most helpful if you could provided s390x build output from the latest master. I am expecting to see only script_number failures at this point.

@evoskuil
Copy link
Member

evoskuil commented Nov 12, 2016

I just resolved an endian bug in master (v3). Conversion of hash values to uint256_t used a byte copy from the hash buffer to/from an array<8> of uint32_t. On big-endian architecture the assumed byte ordering is incorrect. This has been resolved by performing the conversions using our endian convertors.

Having reviewed the entire code base at this point I do not believe there to be any remaining endian bugs, but I have no platform to test on, so will leave this open.

@evoskuil evoskuil changed the title Endianness-related test failures on s390x (big endian). Test failures on big endian architecture. Nov 12, 2016
@evoskuil
Copy link
Member

evoskuil commented Feb 9, 2017

@xnox Any change you could give us another test run on big endian. We made a last fix but haven't been able to validate it and are tagging the v3 release this week. Thanks!

@evoskuil evoskuil added this to the 3.0 milestone Feb 9, 2017
@xnox
Copy link
Contributor Author

xnox commented Feb 13, 2017

@evoskuil a lot of tests still fail test.txt

@evoskuil
Copy link
Member

@xnox Thanks for this feedback. We'll give it another look. This would be simple if we had a machine of our own to test on :/

@pmienk
Copy link
Member

pmienk commented Feb 14, 2017

@xnox I augmented the logging. If you wouldn't mind, the output could be helpful and much appreciated.

@xnox
Copy link
Contributor Author

xnox commented Feb 15, 2017

@evoskuil i have none to give out =/ i'm not sure if there are any big endian emulators / virtual machines that one can run.

@evoskuil
Copy link
Member

No worries, once we get this I think we'll be able to maintain it.

@xnox
Copy link
Contributor Author

xnox commented Feb 16, 2017

ce91d4f

test.txt

@pmienk
Copy link
Member

pmienk commented Feb 17, 2017

Recent check-ins should reduce failures to those exposed by elliptic_curve.cpp. They appear to be an issue within secp256k1. I've opened the following issue: bitcoin-core/secp256k1#442

@evoskuil
Copy link
Member

Given the comments on the referenced issue it would be a good idea to execute the full libsecp256k1 test suite on the problem platform.

@evoskuil evoskuil modified the milestones: 4.0, 3.0 Mar 8, 2017
@evoskuil evoskuil reopened this Mar 30, 2017
@evoskuil evoskuil removed this from the 4.0 milestone Jan 3, 2019
@evoskuil
Copy link
Member

evoskuil commented Mar 6, 2021

TODO: add a compiler warning due to lack of CI testing on BE architecture.

/* Test for a little-endian machine */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__

http://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html

@evoskuil
Copy link
Member

evoskuil commented Mar 6, 2021

Add CI support via Travis: syscoin/syscoin@8a4438f

@evoskuil
Copy link
Member

evoskuil commented Mar 6, 2021

Note that the bug opened against libsecp256k1 was closed as 'spurious' given that the values passed in the independent tests are opaque. In other words, it's a test issue, which makes sense, and should be easily resolvable.

@evoskuil
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants