Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove duplicate detection #79

Merged
merged 10 commits into from
Mar 19, 2024

Conversation

Pennycook
Copy link
Contributor

This is quite embarrassing, but I think a full explanation here is necessary to understand the changes here.

The deduplication functionality originally introduced in #7 was clearly motivated by something, but it's not clear what. I've tried generating some compilation databases with CMake and different build directories, and the default behavior appears to be to generate absolute paths specifying the original source file (i.e., outside of the build directory). Looking back at JSON files we had generated for case studies in the past, this was true even when the functionality from #7 was introduced. The justification in the commit that deduplication is required to handle out-of-tree builds is clearly wrong.

The only situation I can find where this deduplication functionality is helpful is the original test_duplicates test. This makes me think that the functionality may have been introduced due to a misunderstanding of how compilation databases work, or to address issues observed in handling synthetic/hypothetical codebases that don't exist in real life.

In short:

  • There are real cases where deduplication leads to an incorrect result.
  • There are no known cases where deduplication is required for a correct result.

My proposal is to disable deduplication for now, and make sure that we document (and warn about) potential shortcomings. If we later encounter build directory structures that cause issues, we can introduce some new tests and try to identify the true root-cause before embracing deduplication as the fix.

@laserkelvin: I'd appreciate it if in addition to review here, you could try to actually go through the workflow of generating some compilation databases and running codebasin, just to make sure my reasoning is sound.

Related issues

Fixes #72. Should be merged after #77.

Proposed changes

  • Disable deduplication/merging duplicates. Note this is a one-line change -- removing the functionality is deferred to 2.0.0.
  • Add a test to cover the case reported in Deduplication leads to incorrect/unexpected results #72.
  • Add a test for handling of CMake-style compilation databases.
  • Add a warning for empty platforms that may arise from misconfigured compilation databases.
  • Add a test to ensure that the warning is being generated.

After internal testing and review, we reached the conclusion that using -p to
both define and filter platforms was too confusing and difficult to teach.
Additionally, there are concerns that use of flags to define platforms may not
scale, since real-life use-cases may require more than a list of commands.

This commit is a partial revert of intel#49: the ability to filter platforms
remains, but the ability to define platforms directly via JSON is removed.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Separates legacy behavior (YAML files) and modern behavior (TOML files).
When using TOML files, the output name is updated to reflect the subset
of platforms that were actually included in the analysis.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
This replaces the old "duplicates" test, which was wrong.

Duplicates SHOULD count towards divergence by default. Assuming that duplicates
only arise due to out-of-tree builds is incorrect.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
The JSON compilation databases used in this test are designed to mimic those
generated by CMake. Notably, the "file" field of such databases is an absolute
path to a file in the source directory.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
If a compilation database specifies files relative to a build directory
(or otherwise points to paths that do not exist) then the resulting platform
definition will be empty. We should ensure that we issue a warning in this
case.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook added bug Something isn't working enhancement New feature or request labels Mar 7, 2024
@Pennycook Pennycook added this to the 1.2.0 milestone Mar 7, 2024
@laserkelvin
Copy link
Contributor

I'll keep this on my to-do and try get to it ASAP.

Do you have any code bases in mind that you could suggest?

@Pennycook
Copy link
Contributor Author

Do you have any code bases in mind that you could suggest?

Anything you're familiar with that does an out-of-tree build should work. I wrote up a simple (synthetic) test with CMake. I'd expect something like BabelStream or CloverLeaf would also work.

I'm reluctant to give you a code base, because I really want to see if you find something I didn't.

@laserkelvin
Copy link
Contributor

Makes sense!

@laserkelvin
Copy link
Contributor

laserkelvin commented Mar 14, 2024

Running this on main branch, the outputs of nbody look like this:

---------------------------------
          Platform Set LOC % LOC
---------------------------------
                    {} 611 30.64
                {cuda}   4  0.20
      {serial, openmp} 790 39.62
{serial, openmp, cuda} 589 29.54
---------------------------------
Code Divergence: 0.38
Unused Code (%): 30.64
Total SLOC: 1994

Distance Matrix
--------------------------
       cuda openmp serial
--------------------------
  cuda 0.00   0.57   0.57
openmp 0.57   0.00   0.00
serial 0.57   0.00   0.00
--------------------------

Checking out this PR, the result looks like this:

---------------------------------
          Platform Set LOC % LOC
---------------------------------
                    {} 751 25.68
              {serial} 790 27.02
                {cuda}   4  0.14
              {openmp} 790 27.02
{cuda, serial, openmp} 589 20.14
---------------------------------
Code Divergence: 0.63
Unused Code (%%): 25.68
Total SLOC: 2924

Distance Matrix
--------------------------
       cuda openmp serial
--------------------------
  cuda 0.00   0.57   0.57
openmp 0.57   0.00   0.73
serial 0.57   0.73   0.00
--------------------------

So it seems that the de-duplication is doing something...

tests/duplicates/test_duplicates.py Outdated Show resolved Hide resolved
Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook mentioned this pull request Mar 19, 2024
3 tasks
@Pennycook Pennycook merged commit 5d0a6a0 into intel:main Mar 19, 2024
2 checks passed
@Pennycook Pennycook deleted the remove-duplicate-detection branch March 19, 2024 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deduplication leads to incorrect/unexpected results
2 participants