Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort Turtle output #1978

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from
Draft

Conversation

ajnelson-nist
Copy link
Contributor

Summary of changes

This patch series starts an API for sorting graph serializations, beginning with Turtle. The main objective is to produce consistent Turtle output, no matter the order of RDF triples being added to a Graph.

This PR will close Issue 1890.

The initial pair of patches only starts the PR. Some discussion will be needed to design the remaining patches.

The PR's total effects are expected to be additive and preserve backward compatibility.

Checklist

  • Checked that there aren't other open pull requests for the same change.
  • Added tests for any changes that have a runtime impact.
  • Checked that all tests pass. (Test added, known to not yet pass, but should pass by end of PR development.)
  • Checked that all type checking passes. (Type signatures are being added to touched code.)
  • For changes that have a potential impact on users of this project:
    • Updated relevant documentation to avoid inaccuracies. (Not yet; unknown what to update.)
    • Considered adding additional documentation. (Not yet; unknown what to update.)
    • Considered adding an example in ./examples for new features. (Not yet; unknown what to update. Hard-coded sorted graph in new unit test can be copied to desired location.)
    • Considered updating our changelog (CHANGELOG.md).
  • Considered granting push permissions to the PR branch, so maintainers can fix minor issues and keep your PR up to date.

This patch adds a test to start specifying what sorting Turtle output
would look like.  This is intended to start discussion about
expectations of blank node sorting, and to set an initial interface for
triggering sorted output with a propagated keyword argument in
`Graph.serialize()`.

This patch will fail CI, but should not fail for code-style reasons.
The new test script was reviewed with black, flake8, isort, and
mypy (--strict).

References:
* RDFLib#1890

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
@ajnelson-nist ajnelson-nist marked this pull request as draft June 1, 2022 22:21
@aucampia
Copy link
Member

aucampia commented Jun 4, 2022

Something that may be of interest to you:

https://github.com/aucampia/rdflib/blob/38e854abc845a13dcaa61a66d5fcdba69bcf3ba1/test/utils/earl.py#L438-L446

class OrderedMemory(Memory):
    def __init__(self, configuration=None, identifier=None):
        super().__init__(configuration, identifier)
        self.__spo = OrderedDict()
        self.__pos = OrderedDict()
        self.__osp = OrderedDict()
        self.__namespace = OrderedDict()
        self.__prefix = OrderedDict()
        self.__context_obj_map = OrderedDict()

I wanted stable output for test reports, so it is easier to diff them and so they are not just noisy spam when added to git, so I made that. Seems to work, but it depends on order that things are added, and I'm naming blank nodes to make it work. Doing something similar with SortedDict may also work.

@ajnelson-nist
Copy link
Contributor Author

@aucampia - thanks for the strategy suggestion. I just gave SortedDict a shot in the same manner as you used in the earl test. It wasn't that simple a solution, I'm afraid.

After reviewing the code, there's a chance SortedDict won't be needed. I think where most of the sorting would need to have an impact would be in the Turtle (not longturtle) RecursiveSerializer class, particularly its objectList() method and the nearby _object_comparator sort-key function. The random-order Git noise is coming from BNodes being cast with str() in _object_comparator. If instead the BNodes could be compared in a consistently-sorting form, the random-noise issue should go away naturally, because sortProperties() unconditionally sorts object-lists.

I think one end-result of this analysis is that making consistently-sorting BNodes may have a significant impact on the serializing algorithm flow. BNodes would need to be serialized into some sortable form first (a flat string?), so they may need to be crawled/stringified before doing anything else. There may be some optimizations that can be done based on number of triples where the BNode is the object.

Does this way of thinking sound like it's on the right track?

Unrelatedly, I was using mypy --strict on the new test. Once I brought Memory into the test file, mypy --strict crawls rdflib/plugins/stores/memory.py and comes back with ~1,100 wide-spreading issues, mostly that type annotations haven't been done yet except in a few spots. So, that'll be a decent leg of type review to bring to fruition, but likely doesn't need to be a blocker on this PR.

@nicholascar
Copy link
Member

@ajnelson-nist do you want to pick this up again, now that we have R's being merged again?

Also, is this in line with recent W3C work on canonical hashing of RDF graphs:

@nicholascar nicholascar added the awaiting feedback More feedback is needed from the author of the PR or Issue. label Jul 25, 2024
@sneakers-the-rat
Copy link

+1 would love to have this in the lib, lmk how i can help

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
This patch aligns the type signatures on `Serializer` subclasses,
including renaming the arbitrary-keywords dictionary to always be
`**kwargs`.  This is in part to prepare for the possibility of adding
`*args` as a positional-argument delimiter.

References:
* RDFLib#1890 (comment)

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
@ajnelson-nist
Copy link
Contributor Author

I've made some updates to align type signatures down through the Serializer subclasses. Because some serializers have a new parameter spacious: Optional[bool], I added a positional argument delimiter to handle that some serializers use spacious, some use (or would use) sort, and some don't use either.

Backwards-incompatibility:
I added the positional argument delimiter in a way that might not be backwards-compatible. It might be worth peeling off that minor API change if the maintainers think this varied-arguments approach is likely to be the way Serializer.serialize() will be used in the future. I'm happy to peel that off into a minor PR if there's agreement.

Meanwhile, I have not had time to look deeply at @nicholascar 's link, and I'm likely to be distracted from it before 2025. @sneakers-the-rat , if you would like to pick this PR up, please either file PRs against the branch in my fork, or file a superseding PR here. I think either's fine, as we can merge each other's branches if there's need to sync up.

FWIW, I'm still not well exercised on Poetry, so to keep down on commits flailing against CI I've been using this Makefile. In a fresh clone of this repository, checking out my issue-1890 branch, you should catch up to where I'm at, with the type review passing, but pytest not passing because the sorting around BNodes is not implemented yet.

#!/usr/bin/make -f

# Portions of this file contributed by NIST are governed by the
# following statement:
#
# This software was developed at the National Institute of Standards
# and Technology by employees of the Federal Government in the course
# of their official duties. Pursuant to Title 17 Section 105 of the
# United States Code, this software is not subject to copyright
# protection within the United States. NIST assumes no responsibility
# whatsoever for its use by other parties, and makes no guarantees,
# expressed or implied, about its quality, reliability, or any other
# characteristic.
#
# We would appreciate acknowledgement if the software is used.

SHELL := /bin/bash

all: check

.venv.done.log: \
  devtools/requirements-poetry.in
	test ! -r venv/bin/pre-commit \
	  || ( \
	    source venv/bin/activate \
	      && pre-commit uninstall \
	    )
	rm -rf venv
	python3 -m venv venv
	source venv/bin/activate && pip install --upgrade pip
	source venv/bin/activate && pip install pre-commit
	source venv/bin/activate && pre-commit install
	source venv/bin/activate && pip install --requirement devtools/requirements-poetry.in
	source venv/bin/activate && poetry install --with dev
	touch $@

check: \
  .venv.done.log
	source venv/bin/activate \
	  && mypy rdflib test
	source venv/bin/activate \
	  && pytest -vv test/test_turtle_sort_issue1890.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting feedback More feedback is needed from the author of the PR or Issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sorting Turtle output?
4 participants