Memory bug patching/ASAN #1166
Comments
Caught another one in the Ecal veto/geometry code. When we are trying to fill the isolated hit map, we assume that the list of neighbouring cells we get from the Ecal geometry object always has 6 entries. However, I'm seeing some events where you get a smaller number of neighbours (e.g. 4), which makes the loop below read past the end of the vector. I'm not sure if the assumption that there are always 6 neighbours is correct; if so, there's another bug in some other place (e.g. the ecal geometry code). If not, the solution is to just loop over the size of the vector. @tomeichlersmith maybe you know?
// Skip hits that have a readout neighbor
// Get neighboring cell id's and try to look them up in the full cell map
// (constant speed algo.)
// these ideas are only cell/module (must ignore layer)
std::vector<ldmx::EcalID> cellNbrIds = geometry_->getNN(id);
for (int k = 0; k < 6; k++) {
// update neighbor ID to the current layer
cellNbrIds[k] = ldmx::EcalID(id.layer(), cellNbrIds[k].module(),
cellNbrIds[k].cell());
// look in cell hit map to see if it is there
if (cellMap_.find(cellNbrIds[k]) != cellMap_.end()) {
isolatedHit = std::make_pair(false, cellNbrIds[k]);
break;
}
} |
I am also seeing an overflow in
dis[0] = faceXY[0] - mapsx[index + step];
which I'm guessing has to be related to how index and step are set. The index comes from
int step = 0;
std::vector<float>::iterator it;
it = std::lower_bound(mapsx.begin(), mapsx.end(), faceXY[0]);
index = std::distance(mapsx.begin(), it);
if (index == mapsx.size()) {
index += -1;
}
So we can never have an issue because of index by itself. Step is updated below
if (up == 0) {
step += 1;
} else {
step += -1;
}
So, I'm guessing that we are entering the first branch here in the events that overflow. The relevant code is here |
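One way to make the suspected failure mode visible is to check the combined index before using it. The following is only a sketch under assumptions: `checkedIndex` is a hypothetical helper that does not exist in ldmx-sw, and it assumes `mapsx` is sorted in ascending order (which `std::lower_bound` requires anyway). It reproduces the lookup and the clamp from the snippet above, then refuses to return an index that `step` has pushed outside the vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <optional>
#include <vector>

// Hypothetical helper (not part of ldmx-sw): returns the position in the
// sorted vector `mapsx` that `mapsx[index + step]` would read, or
// std::nullopt if that position lies outside the vector.
std::optional<std::size_t> checkedIndex(const std::vector<float>& mapsx,
                                        float x, int step) {
  if (mapsx.empty()) return std::nullopt;
  // std::lower_bound is only meaningful on a sorted range.
  assert(std::is_sorted(mapsx.begin(), mapsx.end()));

  const auto size = static_cast<std::ptrdiff_t>(mapsx.size());
  const auto it = std::lower_bound(mapsx.begin(), mapsx.end(), x);
  std::ptrdiff_t index = std::distance(mapsx.begin(), it);

  // Same clamp as in the snippet above: x is beyond the last entry.
  if (index == size) index = size - 1;

  // The clamp does not protect against the step that is added afterwards.
  const std::ptrdiff_t target = index + step;
  if (target < 0 || target >= size) return std::nullopt;
  return static_cast<std::size_t>(target);
}
```

With `mapsx = {1, 2, 3}` and `x = 10`, the clamp gives index 2, so a step of +1 would read element 3 of a 3-element vector; the helper reports that case as `std::nullopt` instead of reading past the end.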
LDMX-Software/Hcal#58 was also caught by this |
The assumption that there are always 6 neighbors is incorrect: there will be cells on the edge of the flowers that have fewer than 6 neighbors, so ASAN has indeed found a bug for us (hopefully only in memory and not in values). |
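Given that edge cells have fewer than 6 neighbors, the loop can iterate over whatever the geometry returns instead of a fixed count. The sketch below is a self-contained illustration, not the actual patch: `CellId` is a hypothetical stand-in for `ldmx::EcalID`, `hitCells` stands in for the keys of `cellMap_`, and `firstHitNeighbor` is an invented name.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical stand-in for ldmx::EcalID (layer, module, cell).
struct CellId {
  int layer;
  int module;
  int cell;
  bool operator<(const CellId& other) const {
    return std::tie(layer, module, cell) <
           std::tie(other.layer, other.module, other.cell);
  }
};

// Returns the first neighbor of `id` (moved onto id's layer) that has a hit,
// or std::nullopt if the hit is isolated. Iterates over the actual number of
// neighbors rather than assuming there are always 6.
std::optional<CellId> firstHitNeighbor(const CellId& id,
                                       const std::vector<CellId>& neighbors,
                                       const std::set<CellId>& hitCells) {
  // Sanity check: a hexagonal cell has at most 6 neighbors.
  assert(neighbors.size() <= 6);
  for (const CellId& nbr : neighbors) {
    // Neighbor IDs are only cell/module, so update them to the current layer.
    const CellId onLayer{id.layer, nbr.module, nbr.cell};
    if (hitCells.count(onLayer) > 0) return onLayer;
  }
  return std::nullopt;
}
```

A range-based loop avoids the index entirely, so a neighbor list of size 4 is handled the same way as one of size 6.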
Ok, the patch for that is straightforward. I'm less sure about the mapsx overflow.
Following the discussion at the software developers meeting, I'll convert this to an issue about memory issues in general. |
Yea, I'm struggling to understand the |
Updating this to be about all the issues that need to be dealt with if we want to be able to require ASAN clean builds for running validation. Keeping a list of all issues I've identified at the top |
I'm currently working on some testing for the Hcal and happened to run into a couple of new issues
|
Lastly, ASAN reports a bunch of
@tomeichlersmith I know you dealt with a bunch of issues with creating multiple instances of the |
Hmm, might not be the same kind of issue. Tried running the tests one by one and you only get the problem for the |
One last thing, I keep seeing |
In good news, after patching all of the above (hacking around the mapsx/mapsy-issue), we are ASAN-safe and mostly UBSAN-safe. Tried all of the validation configs |
The For the use-after-free issues, I'm guessing that it has something to do with the tests where we load conditions via the ConfigurePython->Process pipeline and then test them. |
I put the tick in for this as this was fixed already in 5c6bee2 |
@EinarElen, when you have some time can you send instructions on how you ran ASAN? |
You need to build ldmx-sw with some additional cmake options, in particular -DENABLE_SANITIZER_ADDRESS=ON. There is some additional documentation in the cmake module; I think you'll want the stuff about recovering on error |
Hey @EinarElen I'm back to this a bit again,
and it didn't show anything new. Are you doing something else too? |
It is a runtime thing; building with the sanitizer settings enables it. So just run your simulation and it should crash at some point :) |
OK so good news is that I can run this. Bad news is that I have very little idea what to do with the first thing I see. The only thing I understand is that there is potentially a bug in the EcalVeto. This is what I see
@EinarElen any ideas on how to extract more info on what's going on? |
Usually it is worth running a debug build to get a bit more info out of it, but regardless it still tells you that there is a buffer overflow in EcalVetoProcessor::produce. AFAIK it is this |
Ok I did a few A good event
the problematic event
Then I find that the Another example, is where faceXY = 290.302 and
so already the recoilPos is outside the ECAL |
OK resolved the
in #1410. I have a student, Ananda, who'll look into the noise generation, so in that PR I think we can take the
part. I'll prob have the DQM one on my next todo list. As for the other two points, I don't know the HCAL well enough to fix that one, while for the test, do you have another suggestion to test instead @EinarElen? |
Ah this was easy, it's already fixed: |
isNoise is fixed in #1434 |
The first one should already be done :) The other one, I think you just need to create a SimSpecialID in code built with undefined behaviour sanitizer |
Yaay, all that list is now gone. I guess we can close this issue and have the rest of the problems in #1176 ? |
List of memory issues caught by ASAN/UBSAN and their fix-status
.size()
noise
entry in calorimeter hits
Describe the bug
A missing & in one of the helper functions in the PN DQM code meant that the particle map was copied rather than referenced, which made any pointers to simparticles therein invalid after the scope of the function. This bug hasn't had any impact on the DQM output; if it did we would have seen some difference. Regardless, we should of course patch it.
The culprit is here
ldmx-sw/DQM/include/DQM/PhotoNuclearDQM.h
Lines 96 to 110 in 0079ce0
Where the map is supposed to be passed in by const ref.
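The pattern can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: `SimParticle` is a hypothetical stand-in for the real class, and `findParticle` and `energyOf` are invented names, not the helper in PhotoNuclearDQM.h. The point is only that a pointer into a by-value map parameter dies with the function's copy, while a pointer into a map passed by const reference stays valid as long as the caller's map does.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for the real sim particle class.
struct SimParticle {
  double energy;
};

// Bug pattern (signature only, deliberately not implemented): taking the map
// by value copies it, so any pointer returned into that copy dangles as soon
// as the function returns.
//   const SimParticle* findParticle(std::map<int, SimParticle> particles, int id);

// Fixed pattern: the map is passed by const reference, so the returned
// pointer refers to the caller's map entry.
const SimParticle* findParticle(const std::map<int, SimParticle>& particles,
                                int id) {
  const auto it = particles.find(id);
  return it == particles.end() ? nullptr : &it->second;
}

// Example use: the pointer is dereferenced while the caller's map is alive.
double energyOf(const std::map<int, SimParticle>& particles, int id) {
  const SimParticle* particle = findParticle(particles, id);
  assert(particle != nullptr && "id must be present in the map");
  return particle->energy;
}
```

With the by-reference signature, the returned pointer compares equal to the address of the entry in the caller's own map, which is the property the by-value version silently breaks.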
I caught this while trying to diagnose a completely different issue by running with address sanitizer enabled. I'm wondering if it would be useful to include these kinds of tools directly in our DQM jobs. At least ASAN has a minimal performance overhead. UBSAN could be interesting, but it doesn't by default kill a job if it catches something (it has some false positives) so you would only benefit from it if you read the logs.