The snowflake package identifier problem #594

Closed
lumjjb opened this issue Mar 16, 2023 · 5 comments
Labels
data-quality Things related to data quality and document ingestion data-sources long-term Things for the future


@lumjjb
Contributor

lumjjb commented Mar 16, 2023

The snowflake package identifier problem

This issue was created from a problem raised by @knrc and @dejanb around SBOMs for the Java ecosystem. However, a similar issue has also been seen in the Debian ecosystem in how SBOMs express package identifiers. We will therefore give the problem a more generic name: the “snowflake package identifier” problem (the name is open to change, but I think it is representative of the issue, since we are talking about the same package used in different contexts, which makes each use subtly different).

The problem

Java example

In Java, you can declare that you want to use a package but exclude some of its transitive dependencies and use a different library instead. This is similar to scenarios where, for compliance reasons, certain FIPS-approved cryptographic libraries must be used, so during compilation we choose not to include the original library but a compatible one with all the required symbols. How this is expressed in Java is discussed in more detail in this issue created by @knrc.

The reason this gets tricky is that two builds of a Java application that use the same package A can each use different transitive packages (via package overriding). However, both applications refer to package A by the same name and version. This is a problem in SBOMs and GUAC because, when represented in the data model, there is no way to differentiate between the two usages of package A.

To illustrate, we have two built Java applications:

  • App 1 uses Package X, which uses a transitive package A.
  • App 2 uses Package X but overrides package A with package B.

When GUAC ingests the information about both applications, the two uses of Package X are merged into a single node.

The main issue is that the identifiers used for the packages are PURLs, and the amount of semantic meaning in a PURL varies. Some PURLs provide a way to verify the content of the package and its descendants, and some do not. As a result, two separate uses of a package that have different descendants end up being aggregated, and one can no longer properly reason about the use of the package in different contexts.

For example, in this graph, App1 and App2 after ingestion point to X, but because the identifier of package X does not distinguish between both uses, they are no longer distinguishable. This is not only an issue in GUAC but in any system that wants to consume SBOMs (CycloneDX/cyclonedx-maven-plugin#306).
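
The collapse can be reproduced with a toy graph keyed only by package identifier. This is a minimal sketch, not GUAC's ingestion code: the PURLs, the `ingest` helper, and the edge-list shape of each SBOM are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical PURLs for the packages in the example.
APP1 = "pkg:maven/org.example/app1@1.0.0"
APP2 = "pkg:maven/org.example/app2@1.0.0"
X = "pkg:maven/org.example/x@1.0.0"
A = "pkg:maven/org.example/a@1.0.0"
B = "pkg:maven/org.example/b@1.0.0"

# Each SBOM is modeled as a list of (parent, child) dependency edges.
APP1_SBOM = [(APP1, X), (X, A)]
APP2_SBOM = [(APP2, X), (X, B)]  # App 2 overrides A with B


def ingest(*sboms):
    """Merge dependency edges into one graph keyed by PURL only."""
    graph = defaultdict(set)
    for sbom in sboms:
        for parent, child in sbom:
            graph[parent].add(child)
    return graph


graph = ingest(APP1_SBOM, APP2_SBOM)
# Both resolutions now hang off the same node for X.
x_deps = sorted(graph[X])
```

Once both SBOMs are ingested, X appears to depend on both A and B, and nothing in the graph records which application saw which resolution.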

As a tangent, one may argue that the SBOM should be expressed as a flattened list of dependencies, represented as a one-layer tree (or a star node). However, this loses context that can be used for vulnerability remediation. So even with a flattened list, it is helpful to express the dependency-use relationships.

Debian example

Another example we ran into was with Debian. In this case, there were container images where information from the package manager was used to describe a package, but some containers had been minimized, so certain files were not present in certain uses of the package. Because there was no way to express the difference in the package's inventory, we end up with the same issue: a container depends on a package that says it contains a certain file, but the file is not there.
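
The mismatch can be made concrete by comparing the file list a package declares with the files actually present in a minimized image. This is a minimal sketch, assuming both inventories are already available as lists of paths; the paths and the `missing_files` helper are hypothetical.

```python
def missing_files(declared, present):
    """Return declared paths that are absent from the image, sorted."""
    return sorted(set(declared) - set(present))


# Hypothetical inventory recorded by the package manager for one package.
DECLARED = [
    "/usr/bin/tool",
    "/usr/share/doc/tool/README",
    "/usr/share/man/man1/tool.1.gz",
]
# Hypothetical files actually found in a minimized container image.
PRESENT = ["/usr/bin/tool"]

gaps = missing_files(DECLARED, PRESENT)
```

A non-empty result means the package identifier alone overstates what this particular container holds.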

Proposed Solution(s)

In the CycloneDX PR (CycloneDX/cyclonedx-maven-plugin#306), the proposal is to add a hash to the reference, which acts as a Merkle tree of the PURLs that a package depends on.

In GUAC, we can take a similar approach: hash the descendants of a package when parsing the SBOMs and express the result in our pkg data model as a qualifier (qualifiers are used to express specific instances of a library). This can be done by serializing the GUAC pkg predicates for the descendants, ordering them lexically by serialization, and using the resulting Merkle tree hash as the qualifier.
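
The idea can be sketched roughly as follows. This is not GUAC's implementation: it assumes plain PURL strings stand in for serialized pkg predicates, that the dependency graph is acyclic, and that the PURLs carry no subpath. The qualifier name `deps_hash` and both helper functions are hypothetical.

```python
import hashlib


def subtree_hash(purl, deps):
    """Merkle-style hash of `purl` and everything below it.

    `deps` maps a PURL to the PURLs it directly depends on. Children are
    visited in lexical order of their PURL so the result is deterministic.
    Assumes the dependency graph has no cycles.
    """
    digest = hashlib.sha256(purl.encode("utf-8"))
    for child in sorted(deps.get(purl, ())):
        digest.update(b"\n")
        digest.update(subtree_hash(child, deps).encode("ascii"))
    return digest.hexdigest()


def with_deps_qualifier(purl, deps):
    """Append a truncated subtree hash as a qualifier (hypothetical name)."""
    separator = "&" if "?" in purl else "?"
    return f"{purl}{separator}deps_hash={subtree_hash(purl, deps)[:16]}"


X = "pkg:maven/org.example/x@1.0.0"
A = "pkg:maven/org.example/a@1.0.0"
B = "pkg:maven/org.example/b@1.0.0"

APP1_DEPS = {X: [A]}  # X as resolved in App 1
APP2_DEPS = {X: [B]}  # X as resolved in App 2, with A overridden by B

x_in_app1 = with_deps_qualifier(X, APP1_DEPS)
x_in_app2 = with_deps_qualifier(X, APP2_DEPS)
```

With this, the two uses of X get different identifiers, while two builds that resolve X to the same descendants still share one.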

The ideal situation is that the Java ecosystem would encode a way to differentiate between such instances or provide the identifiers to do this analysis, possibly as a qualifier on a PURL.

@lumjjb
Contributor Author

lumjjb commented Mar 16, 2023

@loosebazooka fyi!

@loosebazooka

Just some notes from chatting with Brandon earlier.

  1. X is probably versioned (1.0.0, 2.1.3, etc.), so should X:1.0.0 have its own subtree for each resolution (the A vs. B resolution strategy)?
graph TD;
    App1-->X;
    App1-.->X:1.0.0:hash1
    App2-->X;
    X-->X:1.0.0;
    X:1.0.0-->X:1.0.0:hash1;
    X:1.0.0-->X:1.0.0:hash2;
    App2-.->X:1.0.0:hash2
    X:1.0.0:hash1-->A
    X:1.0.0:hash2-->B
  2. This problem is not limited to Java; the strategy has to work for all ecosystems (go.mod has replace, etc.).
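
One way to read the graph above is as a three-level index: package name, then version, then a hash per resolved dependency set. This is a minimal sketch, assuming the per-instance hashes (`hash1`, `hash2`) are computed elsewhere; the `PackageIndex` class and its methods are hypothetical and not GUAC's data model.

```python
from collections import defaultdict


class PackageIndex:
    """Tracks which applications use which resolved instance of a package."""

    def __init__(self):
        # (name, version, instance_hash) -> set of applications using it
        self._users = defaultdict(set)

    def add_use(self, app, name, version, instance_hash):
        self._users[(name, version, instance_hash)].add(app)

    def users_of(self, name, version=None, instance_hash=None):
        """Return applications using `name`, optionally narrowed further."""
        found = set()
        for (n, v, h), apps in self._users.items():
            if n != name:
                continue
            if version is not None and v != version:
                continue
            if instance_hash is not None and h != instance_hash:
                continue
            found |= apps
        return sorted(found)


index = PackageIndex()
index.add_use("App1", "X", "1.0.0", "hash1")  # X resolved with A
index.add_use("App2", "X", "1.0.0", "hash2")  # X resolved with B

all_x_users = index.users_of("X")
hash1_users = index.users_of("X", "1.0.0", "hash1")
```

Queries at the name or version level still see both applications, while a query at the instance level separates them.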

@lumjjb
Contributor Author

lumjjb commented May 17, 2023

@knrc mentioned that he will look into this!

@knrc
Contributor

knrc commented May 17, 2023

@lumjjb Please assign this issue to me and I'll start working on it later today

@lumjjb
Contributor Author

lumjjb commented Apr 2, 2024

This issue has been resolved by #1367

@lumjjb lumjjb closed this as completed Apr 2, 2024