Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory bug patching/ASAN #1166

Closed
7 tasks done
EinarElen opened this issue May 31, 2023 · 27 comments
Closed
7 tasks done

Memory bug patching/ASAN #1166

EinarElen opened this issue May 31, 2023 · 27 comments
Assignees

Comments

@EinarElen
Copy link
Contributor

EinarElen commented May 31, 2023

List of memory issues caught by ASAN/UBSAN and their fix-status

  • PN DQM missing &
  • Ecal veto buffer overflow missing .size()
  • Uninitialized entries in HcalVeto Hcal#58
  • Ecal veto mapsx/mapsy
  • Undefined noise entry in calorimeter hits
  • Missing case in hcal scintillator orientation
  • SimSpecialID UB? Probably not worth dealing with

Describe the bug
A missing & in one of the helper functions in the PN DQM code meant that the particle map was copied rather than referenced which made any pointers to simparticles therein invalid after the scope of the function. This bug hasn't had any impact on the DQM output, if it did we would have seen some difference. Regardless, we should of course patch it.

The culprit is here

/**
*
* Find all daughter particles of a parent. Particles are included if they>
*
* - Are in the particle map,
* - Are not photons or nuclear fragment, and
* - Are not a light ion (Z < 4) if the \ref count_light_ions_ parameter is
set to false
*
* The products are sorted by kinetic energy, in descending order.
*
**/
std::vector<const ldmx::SimParticle *>
findDaughters(const std::map<int, ldmx::SimParticle> particleMap,
const ldmx::SimParticle *parent) const;

Where the map is supposed to be passed in by const ref.

I caught this while trying to diagnose a completely different issue by running with address sanitizer enabled. I'm wondering if it would be useful to include these kinds of tools directly in our DQM jobs. At least ASAN has a minimal performance overhead. UBSAN could be interesting, but it doesn't by default kill a job if it catches something (it has some false positives) so you would only benefit from it if you read the logs.

@EinarElen EinarElen changed the title Memory bug in PN DQM Memory bug in PN DQM and Ecal Geometry May 31, 2023
@EinarElen
Copy link
Contributor Author

Caught another one in the Ecal veto/geometry code. When we are trying to fill the isolated hit map, we assume the list of neighbouring cells that we get from the Ecal geometry object to always have 6 entries. However, I'm seeing some events where you get a smaller number of neighbours (e.g. 4). This makes
cellNbrIds[k] = ldmx::EcalID(id.layer(), cellNbrIds[k].module(), cellNbrIds[k].cell()); do a buffer overflow.

I'm not sure if the assumption that there are always 6 neighbours is correct, if so there's another bug in some other place (e.g. the ecal geometry code). If not, the solution is to just loop over the size of the vector. @tomeichlersmith maybe you know?

https://github.com/LDMX-Software/Ecal/blob/bb668db60852b86d85eab1bce910c64be02b16e2/src/Ecal/EcalVetoProcessor.cxx#L907-L922

    // Skip hits that have a readout neighbor
    // Get neighboring cell id's and try to look them up in the full cell map
    // (constant speed algo.)
    //  these ideas are only cell/module (must ignore layer)
    std::vector<ldmx::EcalID> cellNbrIds = geometry_->getNN(id);

    for (int k = 0; k < 6; k++) {
      // update neighbor ID to the current layer
      cellNbrIds[k] = ldmx::EcalID(id.layer(), cellNbrIds[k].module(),
                                   cellNbrIds[k].cell());
      // look in cell hit map to see if it is there
      if (cellMap_.find(cellNbrIds[k]) != cellMap_.end()) {
        isolatedHit = std::make_pair(false, cellNbrIds[k]);
        break;
      }
    }

@EinarElen
Copy link
Contributor Author

EinarElen commented May 31, 2023

I am also seeing an overflow in

      dis[0] = faceXY[0] - mapsx[index + step];

Which I'm guessing has to be related to the mapsx access.
Index and step are defined the following way

  int step = 0;
  std::vector<float>::iterator it;
  it = std::lower_bound(mapsx.begin(), mapsx.end(), faceXY[0]);

  index = std::distance(mapsx.begin(), it);
  if (index == mapsx.size()) {
    index += -1;
  }

So we can never have an issue because index is out of bounds. However, if index is max and step > 0 then this will also be an overflow.

Step is updated below

      if (up == 0) {
        step += 1;
      } else {
        step += -1;
      }

So, I'm guessing that we are entering the first branch here when index is max which then overflows (and the same will happen for the y-map). This could actually be a bit more of an issue than the previous ones, if we have bad luck then writing to index+step for mapsx could be writing into mapsy's memory

The relevant code is here
https://github.com/LDMX-Software/Ecal/blob/bb668db60852b86d85eab1bce910c64be02b16e2/src/Ecal/EcalVetoProcessor.cxx#L492-L536

@EinarElen
Copy link
Contributor Author

LDMX-Software/Hcal#58 was also caught by this

@tomeichlersmith
Copy link
Member

The assumption that there are always 6 neighbors is incorrect, there will be cells on the edge of the flowers that have less than 6 neighbors so ASAN has indeed found a bug for us. (hopefully only in memory and not in values).

@EinarElen
Copy link
Contributor Author

Ok, the patch for that is straight-forward. I'm less sure about the mapsx/mapsy code. I'm not entirely sure what it is doing, so I can only patch it in a brute-force manner by checking the index + step values. Do you think this issue arises from the same kind of thing?

Following the discussion at the software developers meeting, I'll convert this to an issue about memory issues in general.

@tomeichlersmith
Copy link
Member

Yea, I'm struggling to understand the mapsx/mapsy code as well - I think flagging it as an issue in the Ecal repo will allow it to be on my To-Do list (probably done at the same time I break up the enormouse EcalVetoProcessor).

@EinarElen EinarElen changed the title Memory bug in PN DQM and Ecal Geometry Memory bug patching/ASAN Jun 27, 2023
@EinarElen
Copy link
Contributor Author

Updating this to be about all the issues that need to be dealt with if we want to be able to require ASAN clean builds for running validation. Keeping a list of all issues I've identified at the top

@EinarElen
Copy link
Contributor Author

EinarElen commented Jun 27, 2023

I'm currently working on some testing for the Hcal and happened to run into a couple of new issues

  • getScintillatorOrientation in the HcalGeometry doesn't handle the back properly for v12/v13 (missing switch case)
  • The EcalRecProducer and HcalRecProducer doesn't set the isNoise member for the EcalHit/HcalHits. Since these aren't given a default value in the CalorimeterHit, that means the isNoise member ends up containing junk which are virtually guaranteed to correspond to true. Notably, this actually shows up in the output from the HCal test suite which has a CHECK on isNoise that fails but this doesn't show up in the ctest output :(
  • Extremely minor and probably not worth fixing, but the SimSpecialID ssid2(SimSpecialID::SimSpecialType(9), 0xFEDCB); code in the DetectorID tests is undefined behavior.

@EinarElen
Copy link
Contributor Author

Lastly, ASAN reports a bunch of use-after-free in the Framework tests. These seem to be related to conditions that have already been destructed. Here's a very condensed version of one of those reports (most of the catch internals removed)

==17==ERROR: AddressSanitizer: heap-use-after-free on address 0x60c000000dc0 at pc 0x55655e7eb1a2 bp 0x7fff47402580 sp 0x7fff47402570
READ of size 8 at 0x60c000000dc0 thread T0
    #0 0x55655e7eb1a1 in void conditions::test::matchesMeta<conditions::DoubleTableCondition>(conditions::DoubleTableCondition const&, conditions::DoubleTableCondition const&) /home/einarelen/ldmx/ldmx-sw/Conditions/test/TestConditions.cxx:27
    #1 0x55655e7c237a in conditions::test::matchesAll(conditions::DoubleTableCondition const&, conditions::DoubleTableCondition const&) /home/einarelen/ldmx/ldmx-sw/Conditions/test/TestConditions.cxx:32
    
0x60c000000dc0 is located 0 bytes inside of 128-byte region [0x60c000000dc0,0x60c000000e40)
freed by thread T0 here:
    #0 0x7f60155c322f in operator delete(void*, unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x55655e7e646c in conditions::DoubleTableCondition::~DoubleTableCondition() /home/einarelen/ldmx/ldmx-sw/Conditions/include/Conditions/SimpleTableCondition.h:193
    #2 0x7f601421c270 in framework::ConditionsObjectProvider::releaseConditionsObject(framework::ConditionsObject const*) /home/einarelen/ldmx/ldmx-sw/Framework/include/Framework/ConditionsObjectProvider.h:90
    #3 0x7f6010050ebe in framework::Conditions::getConditionPtr(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/src/Framework/Conditions.cxx:91
    #4 0x55655e7f6739 in conditions::DoubleTableCondition const& framework::Conditions::getCondition<conditions::DoubleTableCondition>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/include/Framework/Conditions.h:84
   
previously allocated by thread T0 here:
    #0 0x7f60155c21c7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7f6010d6c967 in conditions::SimpleCSVTableProvider::getCondition(ldmx::EventHeader const&) /home/einarelen/ldmx/ldmx-sw/Conditions/src/Conditions/SimpleCSVTableProvider.cxx:190
    #2 0x7f60100502c2 in framework::Conditions::getConditionPtr(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/src/Framework/Conditions.cxx:69
    #3 0x55655e7f6739 in conditions::DoubleTableCondition const& framework::Conditions::getCondition<conditions::DoubleTableCondition>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/einarelen/ldmx/ldmx-sw/Framework/include/Framework/Conditions.h:84
    #4 0x55655e7cc8b1 in CATCH2_INTERNAL_TEST_0 /home/einarelen/ldmx/ldmx-sw/Conditions/test/TestConditions.cxx:225

@tomeichlersmith I know you dealt with a bunch of issues with creating multiple instances of the Process object when working on the docker upgrade. Do you think this could be a similar issue but for conditions?

@EinarElen
Copy link
Contributor Author

Hmm, might not be the same kind of issue. Tried running the tests one by one and you only get the problem for the Conditions test

@EinarElen
Copy link
Contributor Author

EinarElen commented Jun 28, 2023

One last thing, I keep seeing TypeError: 'float' object cannot be interpreted as an integer at the start of the process. I'm not sure where this comes from or what it is about.

@EinarElen
Copy link
Contributor Author

In good news, after patching all of the above (hacking around the mapsx/mapsy-issue), we are ASAN-safe and mostly UBSAN-safe. Tried all of the validation configs

@tomeichlersmith
Copy link
Member

The TypeError: 'float' object cannot be interpreted as an integer originates in ConfigurePython. I am unsure on where it is coming from however since it is not crashing the program either.

For the use-after-free issues, I'm guessing that it has something to do with the tests where we load conditions via the ConfigurePython->Process pipeline and then test them.

https://github.com/LDMX-Software/Conditions/blob/4056cf8684bd6eea0e13e4ab3e24958ccf8124e3/test/TestConditions.cxx#L182-L195

@tvami
Copy link
Member

tvami commented May 30, 2024

Ecal veto buffer overflow missing .size()

I put the tick in for this as this was fixed already in 5c6bee2

@tvami
Copy link
Member

tvami commented May 31, 2024

@EinarElen, when you have some time can you send instruction on how you ran ASAN?

@EinarElen
Copy link
Contributor Author

You need to build ldmx-sw with some additional cmake commands, in particular

-DENABLE_SANITIZER_ADDRESS=ON

There is some additional documentation in the cmake module, I think you'll want the stuff about recovering on error

@EinarElen
Copy link
Contributor Author

@tvami
Copy link
Member

tvami commented Aug 16, 2024

Hey @EinarElen I'm back to this a bit again,
I ran

ldmx cmake -B build -S . -DENABLE_SANITIZER_ADDRESS=ON
ldmx cmake --build build --target install -- -j$(nproc)

and it didnt show anything new. Are you doing something else too?

@EinarElen
Copy link
Contributor Author

It is a runtime thing, running with the sanitizer settings when you build enables it. So just run your simulation and it should crash at some point :)

@tvami
Copy link
Member

tvami commented Aug 19, 2024

OK so good news it that I can ran this. Bad news is that the first thing I see, I have very little idea on what to do with. The only thing I understand is that there is potentially a bug in the EcalVeto. This is what I see

=7==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6290002761fc at pc 0x7f0df47a8a55 bp 0x7ffd2ab849b0 sp 0x7ffd2ab849a0
READ of size 4 at 0x6290002761fc thread T0
    #0 0x7f0df47a8a54 in ecal::EcalVetoProcessor::produce(framework::Event&) (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libEcal.so+0x1a8a54)
    #1 0x7f0e01b8444a in framework::Process::process(int, framework::Event&) const (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libFramework.so+0x18444a)
    #2 0x7f0e01b8b6a6 in framework::Process::run() (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libFramework.so+0x18b6a6)
    #3 0x5573a5fe23a6 in main (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/bin/fire+0x93a6)
    #4 0x7f0dff229d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #5 0x7f0dff229e3f in __libc_start_main_impl ../csu/libc-start.c:392
    #6 0x5573a5fe2b54 in _start (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/bin/fire+0x9b54)

0x6290002761fc is located 4 bytes to the left of 16384-byte region [0x629000276200,0x62900027a200)
allocated by thread T0 here:
    #0 0x7f0e01eb61e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7f0e00e3a27d in void std::vector<float, std::allocator<float> >::_M_realloc_insert<float const&>(__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, float const&) (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libSimCore_Event.so+0x427d)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/Users/tav/Documents/1Research/LDMX/BugFixes/ldmx-sw/install/lib/libEcal.so+0x1a8a54) in ecal::EcalVetoProcessor::produce(framework::Event&)
Shadow bytes around the buggy address:
  0x0c5280046be0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5280046bf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5280046c00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5280046c10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5280046c20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c5280046c30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
  0x0c5280046c40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5280046c50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5280046c60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5280046c70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5280046c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc

@EinarElen any ideas on how to extract more info on what's going on?
Alternatively I'll do some couts here and there

@EinarElen
Copy link
Contributor Author

Usually it is worth to run a debug build to get a bit more info out of it, but regardless it still tells you that there is a buffer overflow in EcalVetoProcessor::produce.

AFAIK it is this

#1166 (comment)

@tvami
Copy link
Member

tvami commented Aug 19, 2024

Ok I did a few couts, indeed it's the one with maps indices.

A good event

 index 1304  step 0
dis[0]: -3.40018
dis[1]: 74.6287

 index 1304  step 1
dis[0]: -3.40018
dis[1]: 59.6287

 index 1304  step 2
dis[0]: -3.40018
dis[1]: 44.6287

 index 1304  step 3
dis[0]: -3.40018
dis[1]: 29.6287

 index 1304  step 4
dis[0]: -3.40018
dis[1]: 14.6287

 index 1304  step 5
dis[0]: -3.40018
dis[1]: -0.371328

the problematic event

 index 0  step 0
dis[0]: -19.848
dis[1]: 96.0028

 index 0  step -1
>> breaks here

Then I find that the faceXY[0]: -262.335 which is outside of the range and thus results to the 0 index.

Another example, is where faceXY = 290.302 and

 recoilPos[0] -276.52 recoilP[0] -60.2409 recoilP[2] 16.6094

so already he recoilPos is outside the ECAL

@tvami
Copy link
Member

tvami commented Aug 26, 2024

OK resolved the

Ecal veto mapsx/mapsy

in #1410

I have student Ananda who'll look into the noise generation, so in that PR I think we can take the

Undefined noise entry in calorimeter hits

part.

I'll prob have the DQM one on my next todo list.

As for the other two points, I dont know the HCAL enough to fix that one, while for the test, do you have another suggestion to test instead @EinarElen ?

@tvami
Copy link
Member

tvami commented Sep 2, 2024

I'll prob have the DQM one on my next todo list.

Ah this was easy, it's already fixed:
https://github.com/LDMX-Software/ldmx-sw/blob/trunk/DQM/include/DQM/PhotoNuclearDQM.h#L104
(It was fixed in #1219)
I ticked the entry in this issue.

@tvami
Copy link
Member

tvami commented Sep 4, 2024

isNoise is fixed in #1434
for the other two I need @EinarElen

@EinarElen
Copy link
Contributor Author

https://github.com/LDMX-Software/ldmx-sw/blob/30e0b80a43a527f30ae08219eaec5529ea8de105/DetDescr/src/DetDescr/HcalGeometry.cxx#L91C1-L146C2

The first one should already be done :)

The other one, I think you just need to create a SimSpecialID in code built with undefined behaviour santizier

@tvami
Copy link
Member

tvami commented Sep 5, 2024

Yaay, all that list is now gone. I guess we can close this issue and have the rest of the problems in #1176 ?

@tvami tvami closed this as completed Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants