Copyright detection would be amazing #389

Open
ben-spiller opened this issue Aug 23, 2023 · 8 comments

Comments

@ben-spiller

Since many open-source licenses (e.g. MIT) require publishing a list of copyright attributions from the dependencies you use, it'd be awesome to have support for detecting copyrights in this tool to populate the CycloneDX "copyright" field and comply with this common requirement.

This could be implemented by using a regex (user-configurable would be great) to detect copyright messages from various standard locations inside the jar (a configurable set of globs) e.g. NOTICES, META-INF/MANIFEST.MF, README.* etc.
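As a rough illustration of the regex-plus-globs idea, here is a minimal sketch using only the JDK. The class name `CopyrightScanner`, the default pattern, and the default glob list are all hypothetical and not part of cyclonedx-maven-plugin; a real implementation would expose both the pattern and the globs as plug-in configuration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of the plug-in.
class CopyrightScanner {
    // Assumed default: "Copyright", an optional (c) or (c)-sign, then a four-digit year.
    static final Pattern DEFAULT_PATTERN = Pattern.compile(
            "Copyright\\s+(?:\\(c\\)|\\u00A9)?\\s*\\d{4}.*", Pattern.CASE_INSENSITIVE);

    // Assumed default locations inside the JAR; meant to be user-configurable.
    static final List<String> DEFAULT_GLOBS =
            List.of("NOTICE*", "README.*", "META-INF/NOTICE*", "META-INF/MANIFEST.MF");

    // True if the JAR entry name matches at least one of the globs.
    static boolean matchesAnyGlob(String entryName, List<String> globs) {
        Path path;
        try {
            path = Paths.get(entryName);
        } catch (InvalidPathException e) {
            return false; // entry name is not a valid path on this platform
        }
        for (String glob : globs) {
            if (FileSystems.getDefault().getPathMatcher("glob:" + glob).matches(path)) {
                return true;
            }
        }
        return false;
    }

    // Returns the first pattern match of every line that contains one.
    static List<String> extractFromLines(List<String> lines, Pattern pattern) {
        List<String> found = new ArrayList<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                found.add(m.group().trim());
            }
        }
        return found;
    }

    // Scans only the entries selected by the globs and collects distinct matches.
    static Set<String> scanJar(Path jar, List<String> globs, Pattern pattern) throws IOException {
        Set<String> result = new LinkedHashSet<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory() || !matchesAnyGlob(entry.getName(), globs)) {
                    continue;
                }
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip.getInputStream(entry), StandardCharsets.UTF_8))) {
                    result.addAll(extractFromLines(
                            reader.lines().collect(Collectors.toList()), pattern));
                }
            }
        }
        return result;
    }
}
```

Calling `CopyrightScanner.scanJar(path, CopyrightScanner.DEFAULT_GLOBS, CopyrightScanner.DEFAULT_PATTERN)` would return the distinct matches in entry order. The sketch works line by line, so it will miss notices that wrap across lines (for example long manifest values).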

Even more amazing would be to download the associated source jar from Maven Central in case the binary doesn't contain copyrights (but even just binary scanning would be a big win).
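For the source jar idea, the standard Maven repository layout makes the location predictable for release versions. The sketch below only builds the URL; the class and method names are hypothetical, and a real plug-in would more likely resolve the `sources` classifier through Maven's own artifact resolution so that mirrors, private repositories, and snapshot versions are handled.

```java
// Hypothetical helper, not part of the plug-in.
class SourceJarLocator {
    // Public Maven Central base URL (assumes no mirror or private repository).
    static final String MAVEN_CENTRAL = "https://repo1.maven.org/maven2/";

    // Builds the conventional location of the "-sources" jar for a release version:
    // <groupId with dots as slashes>/<artifactId>/<version>/<artifactId>-<version>-sources.jar
    static String sourcesJarUrl(String groupId, String artifactId, String version) {
        return MAVEN_CENTRAL
                + groupId.replace('.', '/') + "/"
                + artifactId + "/"
                + version + "/"
                + artifactId + "-" + version + "-sources.jar";
    }
}
```

Not every artifact publishes a sources jar, so a caller would need to treat a missing file as a normal outcome and fall back to whatever the binary scan found.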

@sithmein

I'm also very much interested in this. I found https://github.com/JD-CSTx/license-maven-plugin which does exactly what is needed here for a different Maven plug-in. If @JD-CSTx agrees I would volunteer to take his code and try to add it to the cyclonedx-maven-plugin.

@Master-Code-Programmer

> I'm also very much interested in this. I found https://github.com/JD-CSTx/license-maven-plugin which does exactly what is needed here for a different Maven plug-in. If @JD-CSTx agrees I would volunteer to take his code and try to add it to the cyclonedx-maven-plugin.

Of course I agree. Also, I couldn't disagree even if I wanted to: it's a fork of the MojoHaus License Maven Plugin (which was abandoned for a long time) and is under the LGPL 3.0 license: https://www.mojohaus.org/license-maven-plugin/licenses.html.

@sithmein

sithmein commented Oct 9, 2023

I started working on this at https://github.com/sithmein/cyclonedx-maven-plugin/tree/issue-389-copyright-detection. The Maven plug-in has a new configuration parameter, extractCopyrights, which is false by default. If set to true, the plug-in will look into all artifacts' JAR files (binaries and sources) and extract copyright information.
I tested it with a project of ours that has ~300 components and the plug-in is able to extract almost all copyright information that I was able to find manually.

This is only a first iteration but you can already give it a try by installing it locally (I bumped the version) and then running the new version on a project.

One open question is about the format when multiple copyrights are found. CycloneDX only has a single text field for copyright. The plug-in currently joins all found copyrights with a semicolon.
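The joining step might look roughly like the sketch below. It is not the code from the branch: the class name is hypothetical, and trimming, de-duplication, and the exact `"; "` separator are assumptions made for illustration.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not taken from the plug-in branch.
class CopyrightJoiner {
    // Collapses several copyright statements into one string for the single
    // CycloneDX copyright text field. Blank entries are dropped and duplicates
    // are removed while keeping first-seen order (both are assumptions).
    static String join(Collection<String> copyrights) {
        Set<String> unique = new LinkedHashSet<>();
        for (String copyright : copyrights) {
            if (copyright == null) {
                continue;
            }
            String trimmed = copyright.trim();
            if (!trimmed.isEmpty()) {
                unique.add(trimmed);
            }
        }
        return String.join("; ", unique);
    }
}
```

One caveat with any separator-based format is that a copyright statement may itself contain a semicolon, so consumers cannot reliably split the field back into individual notices.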

@pombredanne

You may want to check out ScanCode toolkit (which I co-maintain) for this. It is considered one of the best-in-class tools for copyright detection. It is written in Python, not Java, though. https://github.com/nexB/scancode-toolkit/tree/develop/src/cluecode

@sithmein

sithmein commented Oct 9, 2023

I already tried it but the results were not really satisfactory. It reported quite a lot of nonsense in our case. And it took waaaay too much time, likely because it looked at each and every file. I don't believe this is necessary, though. If the publisher of an artifact doesn't bother providing copyright information in some usable way, you cannot expect users of that artifact to dig it up themselves by looking at every single file. My - totally non-legal - opinion.

@pombredanne

@sithmein re:

> I already tried it but the results were not really satisfactory. It reported quite a lot of nonsense in our case

That's a bug to me then. Do you have the input you used?

@prabhu

prabhu commented Oct 13, 2023

This can't be implemented using regex. Been there and bought the T-shirt. Use a project like JavaParser and parse the comment nodes from the AST for Java. For other files, find a suitable tree-sitter implementation.

@sithmein

What do you mean by "it can't be implemented"? Obviously it works. It may not detect every kind of weird copyright notice, but I doubt that any other approach will.
The question is what we want to achieve in the end. My goal right now is to extract copyright information that is provided in an obvious and clean way. The goal is not to reimplement ScanCode in Java (as an example), also because we are not working on the sources but on the official (binary) artifacts.
