Copyright detection would be amazing #389

ben-spiller · 2023-08-23T16:51:21Z

Since many open-source licenses (e.g. MIT) require publishing a list of copyright attributions from the dependencies you use, it'd be awesome to have support for detecting copyrights in this tool to populate the CycloneDX "copyright" field and comply with this common requirement.

This could be implemented by using a regex (user-configurable would be great) to detect copyright messages from various standard locations inside the jar (a configurable set of globs) e.g. NOTICES, META-INF/MANIFEST.MF, README.* etc.
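A minimal sketch of the regex-plus-globs idea described above, using only the JDK. The class name `CopyrightScanner`, the default glob list, and the default pattern are all assumptions made for illustration; none of them comes from the plug-in, and the pattern only catches notices of the form "Copyright [(c)] <year> ...".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CopyrightScanner {

    // Hypothetical defaults; a real plug-in would make both user-configurable.
    public static final List<String> DEFAULT_GLOBS = List.of(
            "NOTICE*", "LICENSE*", "README*",
            "META-INF/NOTICE*", "META-INF/LICENSE*", "META-INF/MANIFEST.MF");

    // Assumed pattern: "Copyright", an optional "(c)" or "©", then a four-digit year.
    public static final Pattern DEFAULT_PATTERN =
            Pattern.compile("(?i)copyright\\s+(?:\\(c\\)\\s*|©\\s*)?\\d{4}.*");

    /** Returns true if the JAR entry name matches at least one glob. */
    public static boolean matchesAnyGlob(String entryName, List<String> globs) {
        Path path = Paths.get(entryName);
        for (String glob : globs) {
            if (FileSystems.getDefault().getPathMatcher("glob:" + glob).matches(path)) {
                return true;
            }
        }
        return false;
    }

    /** Collects the matching part of each line, de-duplicated, in encounter order. */
    public static Set<String> extractFromLines(Iterable<String> lines, Pattern pattern) {
        Set<String> found = new LinkedHashSet<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                found.add(m.group().trim());
            }
        }
        return found;
    }

    /** Scans only the entries selected by the globs, not every file in the JAR. */
    public static Set<String> scanJar(Path jar, List<String> globs, Pattern pattern)
            throws IOException {
        Set<String> found = new LinkedHashSet<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.isDirectory() || !matchesAnyGlob(entry.getName(), globs)) {
                    continue;
                }
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        jarFile.getInputStream(entry), StandardCharsets.UTF_8))) {
                    List<String> lines = reader.lines().collect(Collectors.toList());
                    found.addAll(extractFromLines(lines, pattern));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: CopyrightScanner <path-to-jar>");
            return;
        }
        scanJar(Paths.get(args[0]), DEFAULT_GLOBS, DEFAULT_PATTERN)
                .forEach(System.out::println);
    }
}
```

Note that this sketch reads files line by line, so it would miss a notice that `MANIFEST.MF` wraps across lines, and it treats every selected entry as UTF-8 text.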

Even more amazing would be to download the associated source JAR from Maven Central in case the binary doesn't contain copyrights (but even just binary scanning would be a big win).
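For the source-JAR idea, here is a rough sketch of how the location of a `-sources.jar` can be derived from Maven coordinates under the standard Maven Central repository layout. The class and method names are hypothetical. Inside a Maven plug-in one would more likely resolve the `sources` classifier through Maven's own artifact resolution than build URLs by hand, and not every artifact publishes a sources JAR.

```java
import java.net.URI;

public class SourcesJarLocator {

    // Base URL of Maven Central.
    public static final String MAVEN_CENTRAL = "https://repo1.maven.org/maven2/";

    /** Builds the conventional location of the "sources" classifier JAR. */
    public static URI sourcesJarUri(String groupId, String artifactId, String version) {
        String path = groupId.replace('.', '/') + "/"
                + artifactId + "/"
                + version + "/"
                + artifactId + "-" + version + "-sources.jar";
        return URI.create(MAVEN_CENTRAL + path);
    }

    public static void main(String[] args) {
        System.out.println(sourcesJarUri("org.apache.commons", "commons-lang3", "3.12.0"));
    }
}
```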

sithmein · 2023-09-11T14:37:41Z

I'm also very much interested in this. I found https://github.com/JD-CSTx/license-maven-plugin which does exactly what is needed here for a different Maven plug-in. If @JD-CSTx agrees I would volunteer to take his code and try to add it to the cyclonedx-maven-plugin.

Master-Code-Programmer · 2023-09-16T11:49:00Z

> I'm also very much interested in this. I found https://github.com/JD-CSTx/license-maven-plugin which does exactly what is needed here for a different Maven plug-in. If @JD-CSTx agrees I would volunteer to take his code and try to add it to the cyclonedx-maven-plugin.

Of course I agree. Also, I couldn't disagree even if I wanted to: it's a fork of the MojoHaus License Maven Plugin (which was abandoned for a long time), and is under the LGPL 3.0 license: https://www.mojohaus.org/license-maven-plugin/licenses.html.

sithmein · 2023-10-09T13:42:02Z

I started working on this at https://github.com/sithmein/cyclonedx-maven-plugin/tree/issue-389-copyright-detection . The Maven plug-in has a new configuration parameter extractCopyrights which is false by default. If set to true the plug-in will look into all artifacts' Jar files (binaries and sources) and extract copyright information.
I tested it with a project of ours that has ~300 components and the plug-in is able to extract almost all copyright information that I was able to find manually.

This is only a first iteration but you can already give it a try by installing it locally (I bumped the version) and then running the new version on a project.

One open question is about the format when there are multiple copyrights found. CycloneDX only has a text field for copyright. The plug-in currently joins all found copyrights with a semicolon.
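To make the joining step concrete, here is a small sketch of collapsing several detected notices into the single CycloneDX text field. The class name, the whitespace normalization, the de-duplication, and the exact separator `"; "` are assumptions for illustration and not necessarily what the branch does.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CopyrightJoiner {

    /** Normalizes whitespace, drops blanks and duplicates, then joins with "; ". */
    public static String join(Collection<String> copyrights) {
        Set<String> unique = new LinkedHashSet<>();
        for (String copyright : copyrights) {
            if (copyright == null) {
                continue;
            }
            String normalized = copyright.trim().replaceAll("\\s+", " ");
            if (!normalized.isEmpty()) {
                unique.add(normalized);
            }
        }
        return String.join("; ", unique);
    }

    public static void main(String[] args) {
        System.out.println(join(List.of(
                "Copyright 2019 A", " Copyright  2019 A ", "Copyright 2020 B")));
    }
}
```

One caveat with this format: a notice that itself contains a semicolon can no longer be split back out of the joined string unambiguously.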

pombredanne · 2023-10-09T19:50:18Z

You may want to check out ScanCode toolkit (which I co-maintain) for this. It is considered one of the best-in-class tools for copyright detection. It is written in Python, not Java, though. https://github.com/nexB/scancode-toolkit/tree/develop/src/cluecode

sithmein · 2023-10-09T19:56:02Z

I already tried it but the results were not really satisfactory. It reported quite a lot of nonsense in our case. And it took waaaay too much time, likely because it looked at each and every file. I don't believe this is necessary, though. If the publisher of an artifact doesn't bother providing copyright information in some usable way, you cannot expect users of that artifact to dig it up themselves by looking at every single file. My - totally non-legal - opinion.

pombredanne · 2023-10-10T17:25:13Z

@sithmein re:

> I already tried it but the results were not really satisfactory. It reported quite a lot of nonsense in our case

That's a bug to me then. Do you have the input you used?

prabhu · 2023-10-13T08:45:34Z

This can't be implemented using regex. Been there and bought the T-shirt. Use a project like JavaParser and parse comment nodes from the AST for Java. For other files, find a suitable tree-sitter implementation.

sithmein · 2023-10-13T08:51:32Z

What do you mean by "it can't be implemented"? Obviously it works. It may not detect every kind of weird copyright notice, but I doubt that any other approach will.
The question is: what do we want to achieve in the end? My goal right now is to extract copyright information that is provided in an obvious and clean way. The goal is not to reimplement ScanCode in Java (as an example), also because we are not working on the sources but on the official (binary) artifacts.

hboutemy added the enhancement label Sep 12, 2023
