Remove duplicate detection #79
After internal testing and review, we reached the conclusion that using -p to both define and filter platforms was too confusing and difficult to teach. Additionally, there are concerns that use of flags to define platforms may not scale, since real-life use-cases may require more than a list of commands. This commit is a partial revert of intel#49: the ability to filter platforms remains, but the ability to define platforms directly via JSON is removed. Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Separates legacy behavior (YAML files) and modern behavior (TOML files). When using TOML files, the output name is updated to reflect the subset of platforms that were actually included in the analysis. Signed-off-by: John Pennycook <john.pennycook@intel.com>
This replaces the old "duplicates" test, which was wrong. Duplicates SHOULD count towards divergence by default. Assuming that duplicates only arise due to out-of-tree builds is incorrect. Signed-off-by: John Pennycook <john.pennycook@intel.com>
The JSON compilation databases used in this test are designed to mimic those generated by CMake. Notably, the "file" field of such databases is an absolute path to a file in the source directory. Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
If a compilation database specifies files relative to a build directory (or otherwise points to paths that do not exist) then the resulting platform definition will be empty. We should ensure that we issue a warning in this case. Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
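As a rough illustration of the warning described in the commits above, here is a minimal sketch. The function name `existing_sources` and the warning text are assumptions made for this example and are not taken from codebasin; the only thing relied upon is the JSON compilation database format, where a relative "file" is interpreted relative to the entry's "directory".

```python
import json
import os
import warnings


def existing_sources(db_path):
    """Return the source files listed in a compilation database that exist.

    Hypothetical helper for illustration only; not the codebasin API.
    """
    with open(db_path) as f:
        entries = json.load(f)

    found = []
    for entry in entries:
        # os.path.join keeps "file" unchanged when it is already absolute,
        # and otherwise resolves it against the entry's "directory".
        path = os.path.join(entry.get("directory", ""), entry["file"])
        path = os.path.normpath(path)
        if os.path.isfile(path):
            found.append(path)

    if not found:
        # An empty result means the platform definition would be empty.
        warnings.warn(
            f"{db_path}: none of the listed files exist; "
            "the resulting platform definition is empty"
        )
    return found
```

A database whose entries point into a build directory that contains no sources would return an empty list and trigger the warning, rather than silently producing an empty platform.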
I'll keep this on my to-do list and try to get to it ASAP. Do you have any code bases in mind that you could suggest?
Anything you're familiar with that does an out-of-tree build should work. I wrote up a simple (synthetic) test with I'm reluctant to give you a code base, because I really want to see if you find something I didn't.
Makes sense!
Running this on
---------------------------------
Platform Set            LOC % LOC
---------------------------------
{}                      611 30.64
{cuda}                    4  0.20
{serial, openmp}        790 39.62
{serial, openmp, cuda}  589 29.54
---------------------------------
Code Divergence: 0.38
Unused Code (%): 30.64
Total SLOC: 1994
Distance Matrix
--------------------------
        cuda openmp serial
--------------------------
cuda    0.00   0.57   0.57
openmp  0.57   0.00   0.00
serial  0.57   0.00   0.00
--------------------------
Checking out this PR, the result looks like this:
---------------------------------
Platform Set            LOC % LOC
---------------------------------
{}                      751 25.68
{serial}                790 27.02
{cuda}                    4  0.14
{openmp}                790 27.02
{cuda, serial, openmp}  589 20.14
---------------------------------
Code Divergence: 0.63
Unused Code (%%): 25.68
Total SLOC: 2924
Distance Matrix
--------------------------
        cuda openmp serial
--------------------------
cuda    0.00   0.57   0.57
openmp  0.57   0.00   0.73
serial  0.57   0.73   0.00
--------------------------
So it seems that the de-duplication is doing something...
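To see where the two divergence values come from, the numbers can be recomputed from the platform-set tables. This is a sketch that assumes code divergence is the mean pairwise Jaccard distance between platforms (an assumption that reproduces both reported values); the function names are made up for the example and are not codebasin's.

```python
from itertools import combinations


def jaccard_distance(setmap, a, b):
    """1 - |A & B| / |A | B|, measured in lines of code."""
    inter = sum(loc for plats, loc in setmap.items() if a in plats and b in plats)
    union = sum(loc for plats, loc in setmap.items() if a in plats or b in plats)
    return 1.0 - inter / union


def code_divergence(setmap):
    """Average Jaccard distance over all pairs of platforms."""
    platforms = sorted(set().union(*setmap))
    pairs = list(combinations(platforms, 2))
    return sum(jaccard_distance(setmap, a, b) for a, b in pairs) / len(pairs)


# LOC per platform set, copied from the first table.
before = {
    frozenset(): 611,
    frozenset({"cuda"}): 4,
    frozenset({"serial", "openmp"}): 790,
    frozenset({"serial", "openmp", "cuda"}): 589,
}

# LOC per platform set, copied from the table produced with this PR.
after = {
    frozenset(): 751,
    frozenset({"serial"}): 790,
    frozenset({"cuda"}): 4,
    frozenset({"openmp"}): 790,
    frozenset({"cuda", "serial", "openmp"}): 589,
}

print(f"{code_divergence(before):.2f}")  # 0.38
print(f"{code_divergence(after):.2f}")   # 0.63
```

The jump from 0.38 to 0.63 comes entirely from the openmp/serial pair: once the 790 lines are counted separately for each platform, their distance rises from 0.00 to 0.73 while the distances to cuda stay at 0.57.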
Signed-off-by: John Pennycook <john.pennycook@intel.com>
This is quite embarrassing, but I think a full explanation is necessary to understand the changes here.
The deduplication functionality originally introduced in #7 was clearly motivated by something, but it's not clear what. I've tried generating some compilation databases with CMake and different build directories, and the default behavior appears to be to generate absolute paths specifying the original source file (i.e., outside of the build directory). Looking back at JSON files we had generated for case studies in the past, this was true even when the functionality from #7 was introduced. The justification in the commit that deduplication is required to handle out-of-tree builds is clearly wrong.
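The reasoning above can be illustrated with a small sketch. The entries below are hypothetical (they show only the "directory" and "file" fields, with made-up paths) and merely mimic the absolute-path behavior described for CMake; `resolved_sources` is a helper invented for this example.

```python
import posixpath


def resolved_sources(entries):
    """Resolve each entry's "file" against its "directory" (POSIX paths)."""
    return {
        posixpath.normpath(posixpath.join(e["directory"], e["file"]))
        for e in entries
    }


# Hypothetical entries for two out-of-tree builds of the same source tree,
# using absolute "file" paths.
build_a = [{"directory": "/proj/build-a", "file": "/proj/src/main.cpp"}]
build_b = [{"directory": "/proj/build-b", "file": "/proj/src/main.cpp"}]

# Both builds refer to the same source file, so nothing needs deduplicating.
print(resolved_sources(build_a) == resolved_sources(build_b))

# With relative "file" paths, each build resolves to a different location,
# which is the only case where identical copies could appear as duplicates.
rel_a = [{"directory": "/proj/build-a", "file": "main.cpp"}]
rel_b = [{"directory": "/proj/build-b", "file": "main.cpp"}]
print(resolved_sources(rel_a) & resolved_sources(rel_b))
```

With absolute paths the two sets are equal; with relative paths they share no entries.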
The only situation I can find where this deduplication functionality is helpful is the original test_duplicates test. This makes me think that the functionality may have been introduced due to a misunderstanding of how compilation databases work, or to address issues observed in handling synthetic/hypothetical codebases that don't exist in real life.
In short:
My proposal is to disable deduplication for now, and make sure that we document (and warn about) potential shortcomings. If we later encounter build directory structures that cause issues, we can introduce some new tests and try to identify the true root-cause before embracing deduplication as the fix.
@laserkelvin: I'd appreciate it if, in addition to reviewing here, you could try to actually go through the workflow of generating some compilation databases and running codebasin, just to make sure my reasoning is sound.
Related issues
Fixes #72. Should be merged after #77.
Proposed changes