The snowflake package identifier problem #594

lumjjb · 2023-03-16T15:50:21Z

The snowflake package identifier problem

This issue is being created from a problem raised by @knrc and @dejanb around SBOMs for the Java ecosystem. However, a similar issue has also been seen in the Debian ecosystem in how SBOMs express package identifiers. Thus, we will give this problem a more generic name: the “snowflake package identifier” problem (the name is up for change, but I thought it was fairly representative of the issue, since we are talking about the same package used in different contexts, which makes its instances differ in subtle ways).

The problem

Java example

In Java, you can declare that you want to use a package but exclude some of its transitive dependencies and use a different library instead. This is not dissimilar to scenarios where, for compliance reasons, certain FIPS-approved cryptographic libraries need to be used, so during compilation we choose not to include the original library but a compatible one with all required symbols instead. How this can be expressed in Java is discussed in more detail in this issue created by @knrc.

The reason this gets tricky is that two builds of a Java application that use the same package A can each use different transitive packages (via package overriding). However, both applications refer to package A by the same name and version. This is a problem in SBOMs and GUAC because, when represented in the data model, there is no way to differentiate between the two usages of package A.

To illustrate, we have two built Java applications:

App 1 uses Package X that uses a transitive package A.
App 2 uses Package X but overrides package A with package B.
When GUAC ingests the information about both applications, both end up pointing to the same node for Package X.

The main issue is that the identifiers used for the packages are PURLs, and the amount of semantic meaning a PURL carries varies. Some PURLs provide a way to verify the content of the package and its descendants, and some do not. As a result, two separate uses of a package that have different descendants end up being aggregated, and one cannot properly reason about the use of the package in different contexts.

For example, in this graph, App1 and App2 both point to X after ingestion, but because the identifier of package X does not distinguish between the two uses, they can no longer be told apart. This is not only an issue in GUAC but in any system that wants to consume SBOMs (CycloneDX/cyclonedx-maven-plugin#306).
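To make the collision concrete, here is a minimal sketch of ingestion into a graph keyed only by PURL. It is not GUAC's actual ingestion code; the PURLs and the `ingest` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical PURLs standing in for App 1, App 2, X, A and B.
APP1 = "pkg:maven/org.example/app1@1.0.0"
APP2 = "pkg:maven/org.example/app2@1.0.0"
X = "pkg:maven/org.example/x@1.0.0"
A = "pkg:maven/org.example/a@1.0.0"
B = "pkg:maven/org.example/b@1.0.0"

# Each SBOM is reduced to a list of (parent, child) dependency edges.
sbom_app1 = [(APP1, X), (X, A)]
sbom_app2 = [(APP2, X), (X, B)]


def ingest(graph, edges):
    """Merge dependency edges into a graph whose nodes are keyed only by PURL."""
    for parent, child in edges:
        graph[parent].add(child)


graph = defaultdict(set)
ingest(graph, sbom_app1)
ingest(graph, sbom_app2)

# Both applications now point at one shared X node that depends on A and B,
# so the graph no longer records which application used which of them.
print(sorted(graph[X]))
```

After both SBOMs are merged, the single X node lists A and B as dependencies, and nothing in the graph says that App 1 only ever pulled in A and App 2 only ever pulled in B.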

As a tangent, one may argue that the SBOM should be expressed as a flattened list of dependencies, represented as a one-layer tree (or a star node). However, this loses context that can be used for vulnerability remediation. So even then, it is helpful to express the dependency relationships.

Debian example

Another example that we ran into was with Debian. In this case, there were container images where information from the package manager was used to describe a package, but in certain containers minimization had been done, so certain files were not present in that use of the package. However, because there was no way to express the difference in the package's inventory, we end up with the same issue: a container depends on a package that says it contains a certain file, but it does not.
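As a rough illustration of that mismatch, the sketch below compares the file inventory declared for a package with the files actually present in a minimized image. The package contents and the `missing_files` helper are made up for this example and do not come from a real Debian package.

```python
# Hypothetical inventory that the package manager reports for a package.
declared_files = {
    "/usr/bin/tool",
    "/usr/share/doc/tool/README",
    "/usr/share/man/man1/tool.1.gz",
}

# Hypothetical files left in the container image after minimization.
present_files = {"/usr/bin/tool"}


def missing_files(declared, present):
    """Return the declared files that are absent from the image, sorted."""
    return sorted(declared - present)


print(missing_files(declared_files, present_files))
```

Both the full and the minimized installation would still be described by the same package identifier, even though their inventories differ.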

Proposed Solution(s)

In the CycloneDX PR (CycloneDX/cyclonedx-maven-plugin#306), the proposal is to add a hash to the reference that acts as a Merkle tree of the PURLs a pkg depends on.

In GUAC, we can take a similar approach and hash the descendants of a package when parsing SBOMs, then express the hash in our pkg data model as a qualifier (qualifiers are used to express specific instances of a library). This can be done by serializing the GUAC pkg predicates of the descendants in lexical order and using the resulting Merkle tree hash as the qualifier.
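A minimal sketch of this idea follows, under several assumptions: it hashes plain PURL strings rather than serialized GUAC pkg predicates, the dependency graph is acyclic, the PURLs carry no subpath, and the qualifier name `deps_hash` is hypothetical. It is meant to show the shape of the approach, not the implementation that was eventually merged.

```python
import hashlib

# Hypothetical PURLs for illustration.
X = "pkg:maven/org.example/x@1.0.0"
A = "pkg:maven/org.example/a@1.0.0"
B = "pkg:maven/org.example/b@1.0.0"


def subtree_hash(purl, deps):
    """Merkle-style hash of a package and its descendants.

    Children are visited in lexical order of their PURL so that the result
    does not depend on the order in which the SBOM listed them.
    Assumes the dependency graph has no cycles.
    """
    h = hashlib.sha256()
    h.update(purl.encode())
    for child in sorted(deps.get(purl, ())):
        h.update(b"\x00")  # separator between the node and each child hash
        h.update(subtree_hash(child, deps).encode())
    return h.hexdigest()


def with_deps_qualifier(purl, deps):
    """Append the (hypothetical) deps_hash qualifier to a PURL."""
    digest = subtree_hash(purl, deps)
    sep = "&" if "?" in purl else "?"
    return f"{purl}{sep}deps_hash={digest[:16]}"


# X as resolved by App 1 (depends on A) and by App 2 (depends on B).
x_in_app1 = with_deps_qualifier(X, {X: [A]})
x_in_app2 = with_deps_qualifier(X, {X: [B]})
print(x_in_app1 != x_in_app2)
```

With the qualifier attached, the two uses of X get different identifiers, while two builds that resolve X to the same descendants still collapse to the same node.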

The ideal situation is that the Java ecosystem would encode a way to differentiate between such instances, or provide the identifiers needed to do this analysis, possibly as a qualifier on a PURL.

lumjjb · 2023-03-16T16:07:05Z

@loosebazooka fyi!

loosebazooka · 2023-03-22T14:23:08Z

Just some notes from chatting with Brandon earlier.

X is probably versioned (1.0.0, 2.1.3, etc.), so should X:1.0.0 have its own subtree for each resolution (A vs. B resolution strategy)?

graph TD;
    App1-->X;
    App1-.->X:1.0.0:hash1
    App2-->X;
    X-->X:1.0.0;
    X:1.0.0-->X:1.0.0:hash1;
    X:1.0.0-->X:1.0.0:hash2;
    App2-.->X:1.0.0:hash2
    X:1.0.0:hash1-->A
    X:1.0.0:hash2-->B

This problem is not limited to Java; this strategy has to work for all ecosystems (go.mod has replace, etc.).
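The layering in the graph above (package name, then version, then one node per resolved subtree) could be sketched as a nested index. The structure and the `register` helper are hypothetical, and `hash1` and `hash2` are the placeholder labels from the graph rather than real digests.

```python
from collections import defaultdict

# name -> version -> resolution hash -> resolved direct dependencies
index = defaultdict(lambda: defaultdict(dict))


def register(name, version, resolution_hash, deps):
    """Record one resolved instance of name@version under its resolution hash."""
    index[name][version][resolution_hash] = sorted(deps)


register("X", "1.0.0", "hash1", ["A"])  # the instance App1 resolved
register("X", "1.0.0", "hash2", ["B"])  # the instance App2 resolved

# A query for X:1.0.0 sees every resolved instance, while each application
# can still link to the exact instance it was built with.
print(sorted(index["X"]["1.0.0"]))
```

Nothing here is specific to Java: any ecosystem that can swap a transitive dependency would only need to supply its own resolution hash.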

lumjjb · 2023-05-17T14:43:09Z

@knrc mentioned that he will look into this!

knrc · 2023-05-17T15:12:10Z

@lumjjb Please assign this issue to me and I'll start working on it later today

lumjjb · 2024-04-02T23:03:49Z

This issue has been resolved by #1367

lumjjb added the long-term (Things for the future), data-quality (Things related to data quality and document ingestion), and data-sources labels Mar 16, 2023

stevespringett mentioned this issue Mar 22, 2023

Add tight scoping to nodes in the dependency graph CycloneDX/specification#197

Open

pxp928 assigned knrc May 17, 2023

lumjjb mentioned this issue Jun 22, 2023

[Discussion] Principles of IsDepedency data model and beyond #966

Open

lumjjb mentioned this issue Jul 5, 2023

[feature] Investigate if artifacs can/should be used in VEX files #1016

Open

lumjjb mentioned this issue Aug 1, 2023

[feature] Have IsDependency be able to point a version as well on top of existing function #1117

Closed

2 tasks

lumjjb mentioned this issue Sep 12, 2023

[feature] Adding optional IDs as part of mutation API #1261

Closed

lumjjb closed this as completed Apr 2, 2024
