externalReferences type for "source" packages #98

Closed
gernot-h opened this issue Nov 3, 2021 · 12 comments · Fixed by #269
Labels
proposed core enhancement, tc54 accepted, tc54 reviewed
Milestone

Comments

@gernot-h

gernot-h commented Nov 3, 2021

Sorry if I overlooked something obvious, but I am missing a way to specify a source archive URL for a component, as the logical counterpart to the distribution type.

Many ecosystems have the concept of a source package and a package derived from it in some way. In Python's PyPI you have a "wheel" and a "source" package (check https://pypi.org/project/chardet/#files), for Linux packages there are binary and corresponding source packages (check https://packages.debian.org/buster/libgcc1), etc.

Deriving the correct "source" package for a component isn't always straightforward, but it is important for many use cases (for example for license clearing, or for mapping source-level security advisories to binary components). So it would be very helpful to store them in a CycloneDX BOM in a canonical way. Therefore I suggest adding a source type for externalReferences.

Note that this is in most cases not equal to the "vcs" type (which is often some kind of upstream project), because many repositories provide their own source archive exactly reflecting what was used when building their "binary" packages.

Example:

      "name": "chardet",
      "version": "4.0.0",
      "externalReferences": [
        {
          "type": "distribution",
          "url": "https://files.pythonhosted.org/packages/19/c7/fa589626997dd07bd87d9269342ccb74b1720384a4d739a1872bd84fbe68/chardet-4.0.0-py2.py3-none-any.whl",
          "comment": "PyPI wheel file"
        },
        {
          "type": "source",
          "url": "https://files.pythonhosted.org/packages/ee/2d/9cdc2b527e127b4c9db64b86647d567985940ac3698eeabc7ffaccb4ea61/chardet-4.0.0.tar.gz",
          "comment": "PyPI source archive"
        },
        {
          "type": "vcs",
          "url": "https://github.com/chardet/chardet",
          "comment": "upstream repository"
        }
      ]
@stevespringett
Member

Distribution is intentionally not specific to binary, source, hybrid, or other. Multiple distributions can be specified for a component.

Take Maven for example. A single component may have multiple artifacts that are part of the distribution.
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.3.1/

In this case, there are artifacts for the:

  • binary
  • javadoc
  • sources
  • tests
  • test sources
  • pom

It's not the intent to describe every possible artifact type for every ecosystem. I think if we start separating out the types of distributions, we'll create confusion as not all ecosystems are black and white (source and binary).

For ecosystems where the component is the source (e.g. Perl), there would be confusion about which type to use, as both distribution and source could be equally relevant. JavaScript (npm) could actually be a hybrid containing both source and binary, depending on the package.

In the Python example provided, it's easy enough to identify which distribution is the wheel and which one is not. In the Maven example, Maven has naming conventions so simple pattern matching against the distributions will tell you what they are. Other ecosystems may not be as predictable.
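As a rough illustration of the pattern matching mentioned above, here is a minimal Python sketch that guesses the kind of a Maven artifact from its file name. The suffix table follows Maven's usual classifier naming (`-sources.jar`, `-javadoc.jar`, and so on); the function name and the file names in the usage note are assumptions made for this example, not part of CycloneDX or Maven tooling.

```python
import re

# Assumed mapping from Maven file-name suffixes to artifact kinds.
# More specific patterns come first so "-test-sources.jar" is not
# mistaken for "-sources.jar" or a plain ".jar".
_MAVEN_KINDS = [
    (re.compile(r"-test-sources\.jar$"), "test sources"),
    (re.compile(r"-sources\.jar$"), "sources"),
    (re.compile(r"-javadoc\.jar$"), "javadoc"),
    (re.compile(r"-tests\.jar$"), "tests"),
    (re.compile(r"\.pom$"), "pom"),
    (re.compile(r"\.jar$"), "binary"),
]


def classify_maven_artifact(url: str) -> str:
    """Return a best-guess artifact kind for a Maven repository URL."""
    filename = url.rsplit("/", 1)[-1]
    for pattern, kind in _MAVEN_KINDS:
        if pattern.search(filename):
            return kind
    return "unknown"
```

For instance, a URL ending in `hadoop-common-3.3.1-sources.jar` would be classified as `sources`, and one ending in `hadoop-common-3.3.1.jar` as `binary`. Ecosystems without such naming conventions would need a different heuristic, which is the predictability concern raised above.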

@coderpatros, @DarthHater what are your thoughts?

@gernot-h
Author

gernot-h commented Nov 8, 2021

Ah, I see, so for my example above I should just use this today:

      "name": "chardet",
      "version": "4.0.0",
      "externalReferences": [
        {
          "type": "distribution",
          "url": "https://files.pythonhosted.org/packages/19/c7/fa589626997dd07bd87d9269342ccb74b1720384a4d739a1872bd84fbe68/chardet-4.0.0-py2.py3-none-any.whl",
          "comment": "PyPI wheel file"
        },
        {
          "type": "distribution",
          "url": "https://files.pythonhosted.org/packages/ee/2d/9cdc2b527e127b4c9db64b86647d567985940ac3698eeabc7ffaccb4ea61/chardet-4.0.0.tar.gz",
          "comment": "PyPI source archive"
        },
        {
          "type": "vcs",
          "url": "https://github.com/chardet/chardet",
          "comment": "upstream repository"
        }
      ]

And it would be the task of the application either to do pattern matching on the URL to differentiate between package types, or to use other means such as application-specific comment conventions.

@jkowalleck
Member

@gernot-h is this still an open issue?

@gernot-h
Author

gernot-h commented May 31, 2023

> @gernot-h is this still an open issue?

Thanks for asking! Yes, definitely. Within Siemens AG, we created a kind of downstream specification extending and narrowing down CycloneDX (parts of it are public in https://github.com/siemens/cyclonedx-property-taxonomy). As a workaround, we specify defined comment fields:

[image]

We would highly appreciate it if there were some interoperable upstream solution for this, so BOM scanners can be extended to provide this information over time.

By the way, we also discussed whether a second purl entry for stating source references might be needed, as source URLs are never unambiguous, but for now we don't think it's a good idea.

@jkowalleck
Member

jkowalleck commented May 31, 2023

That VCS reference could point to the general VersionControlSystem of the project,
while source could point to the actual source used for generating the component, which is not necessarily hosted in a VCS and is not intended to be distributed.
But then there already is the idea of source distribution, which is a specific type of distribution, one that is intended to be used downstream.

Why would it be necessary to document the source of a component, if it was not distributed from source in the first place? I still do not understand.
How I see this: if you had an SBOM for product A which has a component B, and B was built/assembled/compiled/generated/packed from some source, then B should provide a BOM for itself, describing the build process. Providing these capabilities is the goal of #31.
There is no need for readers of A's BOM to know how (from which source) B was built or claimed to be built. All they need to know is where B came from (distribution), plus some hashes for integrity checks on B.

@tsjensen
Contributor

A VCS reference would not be sufficient even in cases where the source code is hosted in a public VCS, because we would want a reference to the sources for the particular version of the component, which is always a deep link. Example:

Determining this deep link to the correct sources can require specific knowledge of the source ecosystem. For example, it may be necessary to understand how Maven Central handles source archives, or what a Golang Proxy is.
Therefore, it would be great if the tool which has this knowledge (such as a CycloneDX scanner) could also record it in its output SBOM.

Currently, it can do so in an externalReferences section with type distribution:

"externalReferences": [
  {
    "type": "distribution",
    "url": "https://github.com/apache/commons-lang/archive/refs/tags/rel/commons-lang-3.12.0.zip",
    "comment": "source archive (download location)"
  }
]

While such an entry is correct, it is very difficult to consume. There can easily be multiple distribution entries - which one contains the source reference?
We currently work around this problem by using a defined comment string, but that is obviously a fragile construct which doesn't scale to partners and customers.

A type of source (or any other type which is clearly distinguished) would greatly improve our situation here.
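To make the difference concrete, here is a minimal Python sketch of a consumer that prefers a dedicated reference type and only falls back to the comment-string workaround. The function name, the default type name `source`, and the use of the comment from the example above as the "defined comment string" are assumptions for illustration; the actual comment conventions used internally are not shown in this thread.

```python
def find_source_references(component,
                           source_type="source",
                           legacy_comment="source archive (download location)"):
    """Return the external references that point at source archives.

    Preferred: entries whose type equals `source_type` (hypothetical at the
    time of this comment). Fallback: `distribution` entries whose comment
    matches an agreed string exactly, which breaks as soon as a producer
    words the comment differently.
    """
    refs = component.get("externalReferences", [])
    typed = [r for r in refs if r.get("type") == source_type]
    if typed:
        return typed
    return [
        r for r in refs
        if r.get("type") == "distribution" and r.get("comment") == legacy_comment
    ]
```

With a dedicated type, the lookup is a single equality check on a schema-defined value. The fallback branch depends on every producer and consumer agreeing on free text, which is the fragility described above.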

@gernot-h
Author

gernot-h commented Jun 28, 2023

Looks like this topic was already picked up as a proposed enhancement, but let me still try to answer the question.

> Why would it be necessary to document the source of a component, if it was not distributed from source in the first place? I still do not understand.
> How I see this: if you had an SBOM for product A which has a component B, and B was built/assembled/compiled/generated/packed from some source, then B should provide a BOM for itself, describing the build process. Providing these capabilities is the goal of #31.
> There is no need for readers of A's BOM to know how (from which source) B was built or claimed to be built. All they need to know is where B came from (distribution), plus some hashes for integrity checks on B.

For our team, this is a compliance as well as maintenance topic. Think about providing a Linux firmware image with several hundred packages based on a certain Linux distribution. Or think about providing a vendored NPM/Ruby... bundle as part of an application download or product.

Now you not only need to provide a "binary" SBOM for your customer, but you also need to check the licenses of all the contained components internally. And you might also want to mirror a snapshot of the used source packages internally in case you need to patch your product/app 5 years from now. For all these topics, we need our BOMs to describe the sources which were used by a 3rd party to provide the binary packages we used. (For well-designed ecosystems like Python or Debian, the 3rd party provides this information, but each in a different way, and you want to import it all in a common format into a central place.) And we don't want to generate several hundred derived BOMs to describe how each of the integrated components was built.

I'm no security guy, but according to anchore/syft#1700 (comment), having the source information for a given "binary image BOM" is also valuable in vulnerability matching. That's why they invented their own proprietary extension to include this information by adding custom purl qualifiers, just as we specified Siemens-wide CycloneDX comment strings for source links.

We think this is relevant for many distribution use cases and we should have a common solution to express this information.

@jkowalleck
Member

jkowalleck commented Jun 28, 2023

Thank you very much for your insights.
I have thought about the topic a lot lately. Here is what I came up with:

A distribution not only has a URL, but has other attributes, too:

  • Kind: either "source" or "binary", where binary could be anything that is not source.
  • Format (tar, tar.gz, exe, rpm, deb, jar, war, phar, pkg, dmg, apk, wheel, egg, gem, nupkg, ...).
  • constraints
    • OperatingSystem (examples: RedHat, Debian-9, NixOS, OpenBSD, Windows-11, Windows-XP, macOS-13.4, iOS-11.2, Android-9, TempleOS, ...)
      • Maybe even version ranges for the operating system ...
    • ProcessorArchitecture (i86, amd64, arm, M1, M2, ...)
    • Runtime (python2, python3.10, node19-or-later, ruby-*, php8, java-11, DotNet3.1, ...)
      • Maybe even version ranges for the runtime ...
    • ... and more ...

There might be a lot of attributes related to a distribution that would be handy to document.
If you are documenting distributions in a BOM, the most important thing for me is to mark the one distribution that you actually used to build your product.
I might not care about all the possible dists and sources, but I must know which one was actually used during the build process, so that I can reproduce and attest the build.
Therefore, I would need a marker. (I would like to see an XML constraint that allows only one of the distributions to have this marker.)
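The "only one distribution may carry the marker" constraint could also be checked outside the schema. The following Python sketch assumes a purely hypothetical boolean field named `x-used-in-build` on a reference entry; no such field exists in any CycloneDX schema, and the function name is likewise made up for this illustration.

```python
def find_build_distribution(external_references, marker="x-used-in-build"):
    """Return the single distribution flagged as used in the build, if any.

    `marker` is a hypothetical field name, not part of CycloneDX.
    Raises ValueError when more than one distribution carries the marker,
    mirroring the uniqueness constraint proposed in the comment above.
    """
    marked = [
        ref for ref in external_references
        if ref.get("type") == "distribution" and ref.get(marker) is True
    ]
    if len(marked) > 1:
        raise ValueError(
            f"expected at most one distribution with {marker!r}, found {len(marked)}"
        )
    return marked[0] if marked else None
```

A check like this would run after schema validation. Returning `None` when nothing is marked keeps BOMs without the marker valid, which matches the idea that documenting distributions is optional.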

Just some examples:

@tsjensen
Contributor

Don't overthink it though. I would only need one extra item in the list of possible types. That list was already extended from 16 values in 1.4 to 39 values in 1.5. Let's make it 40 values in 1.6 by adding:

  • source = The URL of a source archive from which the component can be built

I don't need to know any additional details. (Of course, then I won't be able to actually build the component given only the SBOM, but frankly, that will be a problem no matter how much metadata you encode into the SBOM.)

@agschrei

I'm with @tsjensen on this. The latest spec revision already gives people plenty of options to choose from for specialized types of references. But the one that we are still missing for our needs is the reference to source code.

For us it is critical to know not only which specific distribution of a component is in use in an application, but also which source it was generated from. This provenance information allows us to conduct additional analysis. For the scope of this analysis we do not need all the information required to reproducibly build an artifact from source; a reference to the source itself is sufficient.

To provide a simple example:
For a component describing a maven package I would expect a "distribution" reference describing the maven repository layout the artifact came from and a "source" link that points to the GitHub release, VCS commit snapshot or any other deep link to the code the artifact was built from. With the current options for the reference type we have no option to clearly express both without resorting to comments.

tsjensen added a commit to tsjensen/specification that referenced this issue Jul 26, 2023
tsjensen added a commit to tsjensen/specification that referenced this issue Jul 26, 2023
Signed-off-by: Thomas Jensen <tsjensen@users.noreply.github.com>
tsjensen added a commit to tsjensen/specification that referenced this issue Sep 4, 2023
Signed-off-by: Thomas Jensen <tsjensen@users.noreply.github.com>
@jkowalleck jkowalleck linked a pull request Dec 4, 2023 that will close this issue
@jkowalleck
Member

We discussed this topic in our last core working group meeting.
It is still being considered for 1.6. We might use alternative wording, something along the lines of "source-distribution".
CC @stevespringett @coderpatros @DarthHater @CycloneDX/core-team
// #269 (comment)

tsjensen added a commit to tsjensen/specification that referenced this issue Dec 11, 2023
…X#98

Signed-off-by: Thomas Jensen <tsjensen@users.noreply.github.com>
tsjensen added a commit to tsjensen/specification that referenced this issue Jan 12, 2024
…X#98

Signed-off-by: Thomas Jensen <tsjensen@users.noreply.github.com>
@jkowalleck
Member

fixed via #269
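For readers who want to see what the chardet example from the top of this issue might look like after the fix, here is a small Python sketch that builds the component and serializes it as JSON. It assumes the enum value added via #269 is named `source-distribution`, as suggested by the working group comment above; check the 1.6 schema before relying on that name. The `"type": "library"` field is also an addition for this example.

```python
import json

# Assumption: the new externalReferences type is called "source-distribution".
component = {
    "type": "library",
    "name": "chardet",
    "version": "4.0.0",
    "externalReferences": [
        {
            "type": "distribution",
            "url": "https://files.pythonhosted.org/packages/19/c7/fa589626997dd07bd87d9269342ccb74b1720384a4d739a1872bd84fbe68/chardet-4.0.0-py2.py3-none-any.whl",
            "comment": "PyPI wheel file",
        },
        {
            "type": "source-distribution",
            "url": "https://files.pythonhosted.org/packages/ee/2d/9cdc2b527e127b4c9db64b86647d567985940ac3698eeabc7ffaccb4ea61/chardet-4.0.0.tar.gz",
            "comment": "PyPI source archive",
        },
        {
            "type": "vcs",
            "url": "https://github.com/chardet/chardet",
            "comment": "upstream repository",
        },
    ],
}

# Serialize the component fragment as it would appear inside a BOM.
component_json = json.dumps(component, indent=2)
print(component_json)
```

The comments are now purely informational: a consumer can tell the wheel from the source archive by the reference type alone.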

@jkowalleck jkowalleck mentioned this issue Jan 12, 2024
@stevespringett stevespringett added tc54 reviewed Ecma TC54 has reviewed the feature candidate tc54 accepted Ecma TC54 has accepted the feature candidate labels Jan 12, 2024
stevespringett added a commit that referenced this issue Apr 9, 2024
## Added

* Core enhancement: Attestation
([#192](#192) via
[#348](#348))
* Core enhancement: Cryptography Bill of Materials — CBOM
([#171](#171),
[#291](#291) via
[#347](#347))
* Feature to express the URL to source distribution
([#98](#98) via
[#269](#269))
* Feature to express the URL to RFC 9116 compliant documents
([#380](#380) via
[#381](#381))
* Feature to express tags/keywords for services and components (via
[#383](#383))
* Feature to express details for component authors
([#335](#335) via
[#379](#379))
* Feature to express details for component and BOM manufacturer
([#346](#346) via
[#379](#379))
* Feature to express concluded values from observed
evidences ([#411](#411)
via [#412](#412))
* Features to express license acknowledgement
([#407](#407) via
[#408](#408))
* Feature to express environmental consideration information for model
cards ([#396](#396) via
[#395](#395))
* Feature to express the address of organizational entities (via
[#395](#395))
* Feature to express additional component identifiers: Universal Bill Of
Receipts Identifier and Software Heritage persistent IDs
([#413](#413) via
[#414](#414))

## Fixed

* Allow multiple evidence identities by XML/JSON schema
([#272](#272) via
[#359](#359))
  This was already correct via ProtoBuf schema.
* Prevent empty `license` entities by XML schema
([#288](#288) via
[#292](#292))
  This was already correct in JSON/ProtoBuf schema.
* Prevent empty or malformed `property` entities by JSON schema
([#371](#371) via
[#375](#375))
  This was already correct in XML/ProtoBuf schema.
* Allow multiple `licenses` in `Metadata` by ProtoBuf schema
([#264](#264) via
[#401](#401))
  This was already correct in XML/JSON schema.

## Changed

* Allow arbitrary `$schema` values by JSON schema
([#402](#402) via
[#403](#403))
* Increased max length of `versionRange` (via
[`3e01ce6`](3e01ce6))
* Harmonized length of `version` (via
[#417](#417))

## Deprecated

* Data model "Component"'s field `author` was deprecated. (via
[#379](#379))
  Use field `authors` or field `manufacturer` instead.
* Data model "Metadata"'s field `manufacture` was deprecated.
([#346](#346) via
[#379](#379))
  Use "Metadata"'s field `component`'s field `manufacturer` instead. 
  - for XML: `/bom/metadata/component/manufacturer`
  - for JSON: `$.metadata.component.manufacturer`
  - for ProtoBuf: `Bom:metadata.component.manufacturer`

## Documentation

* Centralize version and version-range (via
[#322](#322))
* Streamlined SPDX expression related descriptions (via
[#327](#327))
* Enhanced descriptions of `bom-ref`/`refType`
([#336](#336) via
[#344](#344))
* Enhanced readability of enum documentation in JSON schema
([#361](#361) via
[#362](#362))
* Fixed typo "compliment" -> "complement" (via
[#369](#369))
* Added documentation for enum "ComponentScope"'s values in JSON schema
([#293](#293) via
[`d92e58e`](d92e58e))
  Texts were taken from the existing ones in XML/ProtoBuf schema.
* Added documentation for enum "TaskType"'s values
([#245](#245) via
[#377](#377))
* Improve documentation for data model "Metadata"'s field `licenses`
([#273](#273) via
[#378](#378))
* Added documentation for enum "MachineLearningApproachType"'s values
([#351](#351) via
[#416](#416))
* Rephrased some texts here and there.

## Test data

* Added test data for newly added use cases
* Added quality assurance for our ProtoBuf schemas
([#384](#384) via
[#385](#385))