Add support for RPM-based distros for docker and rootfs images scanpipe #6

Closed
pombredanne opened this issue Sep 15, 2020 · 6 comments

Comments

@pombredanne
Member

pombredanne commented Sep 15, 2020

There is no easy way to access the RPM database other than through librpm and the rpm executable.
The installed RPM database comes in three formats:

  1. bdb: a legacy Berkeley DB hash used as a key/value store where the value is a binary blob that contains all the RPM data. The format of this blob should be the same as the RPM header format, and scancode-toolkit can parse the headers. This is the format that was/is used in older RH, CentOS, Fedora and most other RPM distros.
  2. sqlite: a SQLite database where one table is used as a key/value store where the value is a binary blob that contains all the RPM data in the same binary format as in 1., i.e. the RPM header format. This is the format that is used in newer RH, CentOS and Fedora versions.
  3. ndb: a new key/value store that is built into librpm. This is the format used by newer openSUSE distros.

librpm provides support for each of these formats and also contains a built-in read-only handler for the bdb format (1.), so that librpm can be built without Berkeley DB and can still read an older RPM db (for instance to convert it to a newer format).

It needs to be built with specific flags to enable all these formats (typically a given build for a distro does not need to support all the formats).

The installed DB locations are:

| Distro | Path | Format |
|---|---|---|
| CentOS 8 | /var/lib/rpm/Packages | Berkeley DB (Hash, version 9, native byte-order) |
| CentOS 5 | /var/lib/rpm/Packages | Berkeley DB (Hash, version 8, native byte-order) |
| Fedora 30 | /var/lib/rpm/rpmdb.sqlite | SQLite 3.x database |
| Fedora 20 | /var/lib/rpm/Packages | Berkeley DB (Hash, version 9, native byte-order) |
| openMandriva | /var/lib/rpm/Packages | Berkeley DB (Hash, version 10, native byte-order) |
| RHEL 8 | /var/lib/rpm/Packages | Berkeley DB (Hash, version 9, native byte-order) |
| openSUSE 20200528 | /usr/lib/sysimage/rpm/Packages.db | data, but this is the ndb format |
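
The locations listed above lend themselves to a simple lookup. Here is a minimal sketch, assuming only the three file paths listed above and that the file name alone is enough to tell the formats apart; `find_rpmdb` and `KNOWN_RPMDB_PATHS` are hypothetical names, not something that exists in scancode-toolkit:

```python
import os

# Candidate locations taken from the list above, relative to the rootfs.
# Assumption: the format is inferred from the file name only.
KNOWN_RPMDB_PATHS = [
    ("var/lib/rpm/rpmdb.sqlite", "sqlite"),
    ("var/lib/rpm/Packages", "bdb"),
    ("usr/lib/sysimage/rpm/Packages.db", "ndb"),
]


def find_rpmdb(root_dir):
    """
    Return a (path, format) tuple for the first known RPM database file found
    under the extracted root filesystem `root_dir`, or None if none is found.
    """
    for rel_path, db_format in KNOWN_RPMDB_PATHS:
        candidate = os.path.join(root_dir, rel_path)
        if os.path.isfile(candidate):
            return candidate, db_format
    return None
```

A more robust version would probably also look at the file content, since other distros may use these paths with a different format.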

In addition, on Fedora distros there are files under /etc/yum.repos.d/* that contain the base and mirror URLs for the repos used to install RPMs. Each file is in .ini format. On openSUSE and SLES, these are under /etc/zypp/repos.d
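
Since these repo files are in .ini format, the standard library `configparser` should be able to read them. This is a rough sketch, assuming the URLs are stored under the usual `baseurl` and `mirrorlist` keys; `collect_repo_urls` is a hypothetical helper name:

```python
import configparser
import glob
import os


def collect_repo_urls(root_dir, repos_dir="etc/yum.repos.d"):
    """
    Return a list of dicts with the repo id and any baseurl or mirrorlist
    value found in the .ini-style repo files under `repos_dir` in `root_dir`.
    """
    repos = []
    pattern = os.path.join(root_dir, repos_dir, "*")
    for path in sorted(glob.glob(pattern)):
        # Interpolation is disabled because repo files use $releasever-style
        # variables and may contain literal % characters.
        parser = configparser.ConfigParser(interpolation=None, strict=False)
        try:
            parser.read(path, encoding="utf-8")
        except configparser.Error:
            # Skip files that are not valid .ini files.
            continue
        for section in parser.sections():
            entry = {"id": section, "file": path}
            for key in ("baseurl", "mirrorlist"):
                if parser.has_option(section, key):
                    entry[key] = parser.get(section, key)
            repos.append(entry)
    return repos
```

For openSUSE and SLES the same function could be called with `repos_dir="etc/zypp/repos.d"`.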

The licenses (when not deleted as in some CentOS Docker images) are found in /usr/share/licenses/<package name>/<license files> or /usr/share/doc/<package name>/<license files>
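
Collecting these files for a given package could look like the sketch below. It assumes the directory is named exactly after the package, which may not always hold (a doc directory can carry a version suffix); `find_package_doc_files` and `LICENSE_DIRS` are hypothetical names:

```python
import os

# The two base directories mentioned above, relative to the rootfs.
LICENSE_DIRS = ("usr/share/licenses", "usr/share/doc")


def find_package_doc_files(root_dir, package_name):
    """
    Return the sorted paths of all files found under the per-package license
    and doc directories of the extracted root filesystem `root_dir`.
    """
    found = []
    for base in LICENSE_DIRS:
        package_dir = os.path.join(root_dir, base, package_name)
        # os.walk yields nothing when the directory does not exist.
        for top, _dirs, files in os.walk(package_dir):
            found.extend(os.path.join(top, name) for name in files)
    return sorted(found)
```

An empty list is expected for images where the license files were deleted.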

If using the rpm cli, this can create an XML-like output:
./rpm --query --all --qf '[%{*:xml}\n]' --rcfile=./rpmrc --dbpath=<path to>/var/lib/rpm > somefile.xml
The --rcfile option may not be needed in general, but it is needed when using a fresh RPM build.
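
The same command could be driven from Python with `subprocess`. This is only a sketch that mirrors the flags of the command line above; `build_rpm_query_command` and `dump_rpmdb_as_xml` are hypothetical helper names, and the `rpm` executable is assumed to be in the PATH unless `rpm_exe` is given:

```python
import subprocess


def build_rpm_query_command(dbpath, rpm_exe="rpm", rcfile=None):
    """
    Return the argument list for an rpm query that dumps all installed
    package headers as XML-like text, using the flags shown above.
    """
    # A raw string keeps the literal backslash-n, as the shell quoting does.
    cmd = [rpm_exe, "--query", "--all", "--qf", r"[%{*:xml}\n]"]
    if rcfile:
        cmd.append("--rcfile=" + rcfile)
    cmd.append("--dbpath=" + dbpath)
    return cmd


def dump_rpmdb_as_xml(dbpath, output_path, rpm_exe="rpm", rcfile=None):
    """
    Run the rpm query and write its standard output to `output_path`.
    Raise subprocess.CalledProcessError if rpm exits with an error.
    """
    cmd = build_rpm_query_command(dbpath, rpm_exe=rpm_exe, rcfile=rcfile)
    with open(output_path, "wb") as output:
        subprocess.run(cmd, stdout=output, check=True)
    return output_path
```

Passing the arguments as a list avoids any shell quoting issue with the query format.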

The RPM db may need to be rebuilt first when it is in a bdb format from an older version than the bdb with which librpm was built.

tdruez added a commit that referenced this issue Nov 18, 2020
Signed-off-by: Thomas Druez <tdruez@nexb.com>
@pombredanne
Member Author

pombredanne commented Jan 5, 2021

To simply parse the XML output of the rpm command:

>>> import xmltodict
>>> rpm=open('rpm.xml.txt','rb').read()
>>> rpm=rpm.decode('utf-8', errors='replace')
>>> rpms = ['<rpmHeader>' + val for val in rpm.split('<rpmHeader>') if val]
>>> parsed=[xmltodict.parse(r) for r in rpms]

I have also attached the sample XML, created with:
rpm --query --all --qf '[%{*:xml}\n]' --dbpath=<path to RPM DB directory, typically some /var/lib/rpm/> > ~/rpm.xml.txt

rpm.xml.txt

@pombredanne
Member Author

Note that for the odd cases where the rpmdb is not in a format that the current rpm tool can read, the process could be to rebuild the database. It is not entirely clear whether a rebuild could miss some RPMs; if so, we may need to either read the RPM db directly or use multiple older versions of the rpm executable.

  1. Make a copy of the DB and work on that copy, since the rebuild is destructive
  2. Run rpmdb --rebuilddb --dbpath=<path to RPM DB directory typically some /var/lib/rpm/ where the copy was made> --root=<path to the pseudo root of the filesystem>
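
The two steps above could be sketched as follows. This is an untested outline: `copy_rpmdb` and `rebuild_rpmdb_copy` are hypothetical names, the `rpmdb` executable is assumed to be in the PATH, and how `--dbpath` combines with `--root` has not been verified here:

```python
import os
import shutil
import subprocess
import tempfile


def copy_rpmdb(rpmdb_dir):
    """
    Copy the RPM DB directory to a fresh temporary directory and return the
    path of the copy, leaving the original untouched.
    """
    work_dir = tempfile.mkdtemp(prefix="rpmdb-copy-")
    copy_dir = os.path.join(work_dir, "rpm")
    shutil.copytree(rpmdb_dir, copy_dir)
    return copy_dir


def rebuild_rpmdb_copy(rpmdb_dir, root_dir, rpmdb_exe="rpmdb"):
    """
    Rebuild a copy of the RPM DB with the flags from step 2 and return the
    path of the rebuilt copy. The original DB is never modified.
    """
    copy_dir = copy_rpmdb(rpmdb_dir)
    cmd = [rpmdb_exe, "--rebuilddb", "--dbpath=" + copy_dir, "--root=" + root_dir]
    subprocess.run(cmd, check=True)
    return copy_dir
```

Comparing the package count before and after the rebuild would be one way to check whether some RPMs go missing.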

@chinyeungli
Contributor

chinyeungli commented Jan 6, 2021

@pombredanne I have written a simple parser to parse the above XML file and convert to a JSON file.

#
#  Copyright (c) nexB Inc. and others. All rights reserved.
#

import click
import io
import json
import sys
import xmltodict

"""
The current code parses everything, converts it and saves it as JSON output.

Perhaps we only need to collect the interesting bits:
============
String value
============
Sha1header
Name
Version
Release
Summary
Description
Size
Distribution
Vendor
License
Os
Arch
Sourcerpm
Rpmversion

==========
List value
==========
Basenames
Filesizes
Filedigests <-- This is the MD5 value
Dirindexes
Fileclass

Dirnames
Classdict

Requireflags
Requirename
Requireversion

********
* Note *
********
The values of Dirindexes and Fileclass are indexes into Dirnames and Classdict.
For instance,
Dirindexes: [u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'1', u'2', u'3']
Dirnames: [u'/etc/', u'/usr/share/doc/', u'/usr/share/doc/setup-2.5.58/', u'/var/log/']

Fileclass: [u'1', u'2', u'4', u'4', u'2', u'1', u'2', u'3', u'2', u'2', u'1', u'4', u'2', u'3', u'2', u'4']
Classdict: [None, u'ASCII English text', u'ASCII text', u'directory', u'empty']
"""

def parse_rpm_db(input):
    """
    Parse the rpm DB with xmltodict
    """
    parsed_result = []
    with io.open(input, encoding="utf8", errors='ignore') as loc:
        contents = loc.read()
        sections = contents.split('<rpmHeader>')
        for section in sections:
            if section:
                result = '<rpmHeader>' + section
                parsed_result.append(xmltodict.parse(result))
    return parsed_result


def format_to_dict(parsed_result):
    """
    Convert the parsed information to a list of dictionaries
    """
    results = []
    for result in parsed_result:
        tags = result['rpmHeader']['rpmTag']
        new_dict = {}
        for tag in tags:
            # The keys should be '@name' and the value type (such as
            # string/integer etc). This is the convention from xmltodict.
            assert len(tag) == 2
            name = tag['@name']
            # Take the single non-'@name' entry as the value. This also works
            # on Python 3, where dict.keys() cannot be indexed.
            value = next(v for k, v in tag.items() if k != '@name')
            new_dict[name] = value
        results.append(new_dict)
    return results


def save_to_json(results, output):
    """
    Save the output to a JSON file
    """
    with open(output, 'w') as jsonfile:
        json.dump(results, jsonfile, indent=3)


@click.command()
@click.argument('input',
    required=True,
    metavar='INPUT',
    type=click.Path(
        exists=True, file_okay=True, dir_okay=False, readable=True, resolve_path=True))

@click.argument('output',
    required=True,
    metavar='OUTPUT',
    type=click.Path(exists=False, dir_okay=False, writable=True, resolve_path=True))

@click.help_option('-h', '--help')
def cli(input, output):
    if not output.endswith('.json'):
        print("The output has to be in JSON format.")
        sys.exit(1)
    parsed_result = parse_rpm_db(input)
    results = format_to_dict(parsed_result)
    save_to_json(results, output)


# Run the command when this file is executed as a script
if __name__ == '__main__':
    cli()

Attached is the parsed result:
parsed.json.txt
Some further logic is needed to resolve Dirindexes against Dirnames and Fileclass against Classdict,
but I don't know what output the pipeline needs.

Suggestions and feedback are welcome.
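
One possible way to resolve these indexes, as a sketch only: it assumes each package is a dict shaped like the parsed JSON above, with list values whose items are strings, and that a full path is the directory name followed by the base name. `expand_file_entries` and `as_list` are hypothetical names:

```python
def as_list(value):
    """
    Return `value` as a list. xmltodict yields a bare value instead of a
    list when a tag has a single item, and None when it is empty.
    """
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def expand_file_entries(package):
    """
    Return a list of (path, file_class) tuples for one package dict, built by
    resolving Dirindexes against Dirnames and Fileclass against Classdict.
    """
    basenames = as_list(package.get("Basenames"))
    dirindexes = as_list(package.get("Dirindexes"))
    dirnames = as_list(package.get("Dirnames"))
    fileclass = as_list(package.get("Fileclass"))
    classdict = as_list(package.get("Classdict"))

    entries = []
    for i, basename in enumerate(basenames):
        # Index values are stored as strings in the parsed output.
        dirname = dirnames[int(dirindexes[i])]
        file_class = None
        if i < len(fileclass):
            file_class = classdict[int(fileclass[i])]
        entries.append((dirname + basename, file_class))
    return entries
```

For example, with made-up base names:

```python
package = {
    "Basenames": ["passwd", "README"],
    "Dirindexes": ["0", "1"],
    "Dirnames": ["/etc/", "/usr/share/doc/setup-2.5.58/"],
    "Fileclass": ["1", "2"],
    "Classdict": [None, "ASCII English text", "ASCII text"],
}
print(expand_file_entries(package))
```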

pombredanne added a commit to aboutcode-org/scancode-plugins that referenced this issue Feb 17, 2021
This is to support these tickets:

aboutcode-org/scancode-toolkit#437
aboutcode-org/scancode.io#6
aboutcode-org/scancode-toolkit#2058

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Mar 5, 2021
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
tdruez added a commit that referenced this issue Apr 2, 2021
Signed-off-by: Thomas Druez <tdruez@nexb.com>
@tdruez
Contributor

tdruez commented Apr 2, 2021

@pombredanne running a docker pipeline on a rpm-centos-latest.tar image with the latest 6-rpm-support branch (scancode-toolkit[packages]==21.3.31):

collect_installed_rpmdb_xmlish_from_rpmdb_loc: Failed to execute RPM command: /usr/local/lib/python3.9/site-packages/rpm_inspector_rpm/bin/rpm --query --all --qf [%{*:xml}
] --dbpath /var/scancodeio/workspace/projects/rpm-eb756fcb/codebase/50-rpm-centos-latest.tar-extract/cbab9bd72bc8f39dfc72b687ab56cd464589268d2a468ea8104fee89e4ca8b84/var/lib/rpm
/usr/local/lib/python3.9/site-packages/rpm_inspector_rpm/bin/rpm: error while loading shared libraries: libpopt.so.0: cannot open shared object file: No such file or directory
Traceback:
  File "/opt/scancodeio/scanpipe/pipelines/__init__.py", line 95, in execute
    step(self)
  File "/opt/scancodeio/scanpipe/pipelines/docker.py", line 79, in collect_and_create_system_packages
    docker.scan_image_for_system_packages(self.project, image)
  File "/opt/scancodeio/scanpipe/pipes/docker.py", line 122, in scan_image_for_system_packages
    for i, (purl, package, layer) in enumerate(installed_packages):
  File "/usr/local/lib/python3.9/site-packages/container_inspector/image.py", line 329, in get_installed_packages
    for purl, package in layer.get_installed_packages(packages_getter):
  File "/opt/scancodeio/scanpipe/pipes/rpm.py", line 30, in package_getter
    packages = rpm.get_installed_packages(root_dir, detect_licenses=detect_licenses)
  File "/usr/local/lib/python3.9/site-packages/packagedcode/rpm.py", line 163, in get_installed_packages
    xmlish_loc = rpm_installed.collect_installed_rpmdb_xmlish_from_rootfs(root_dir)
  File "/usr/local/lib/python3.9/site-packages/packagedcode/rpm_installed.py", line 401, in collect_installed_rpmdb_xmlish_from_rootfs
    return collect_installed_rpmdb_xmlish_from_rpmdb_loc(rpmdb_loc)
  File "/usr/local/lib/python3.9/site-packages/packagedcode/rpm_installed.py", line 456, in collect_installed_rpmdb_xmlish_from_rpmdb_loc
    raise Exception(msg)

pombredanne added a commit to aboutcode-org/scancode-toolkit that referenced this issue Apr 2, 2021
This is a problem on Linux when using full RPM support otherwise.

See: aboutcode-org/scancode.io#6 (comment)
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member Author

I pushed a new RPM plugin for the toolkit https://pypi.org/project/rpm-inspector-rpm/4.16.1.3.210404/ and two commits:

  • 52bbe6a
  • 8bc9b4a
    These make things work on Linux. I still need to make a few more refinements to also run on macOS with RPM in the path.

@pombredanne
Member Author

I tested this locally on a centos:latest docker image with success.

tdruez added a commit that referenced this issue Apr 5, 2021
Signed-off-by: Thomas Druez <tdruez@nexb.com>
tdruez added a commit that referenced this issue Apr 5, 2021
* Add minimal support for RPM distros #6

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Relax scancode-toolkit version requirements

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Install scancode-toolkit[packages] for rpm support #6

Signed-off-by: Thomas Druez <tdruez@nexb.com>

* Require newest RPM plugin and its deps

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Update documentation for all OSes

open is a macOS'ism

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Remove explicit dependency on rpm-inspector-rpm

This is not needed as it comes with scancode-tk

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Add changelog entry for RPM support #6

Signed-off-by: Thomas Druez <tdruez@nexb.com>

Co-authored-by: Philippe Ombredanne <pombredanne@nexb.com>
@tdruez tdruez closed this as completed Apr 5, 2021
tdruez added a commit that referenced this issue Apr 5, 2021
Signed-off-by: Thomas Druez <tdruez@nexb.com>