Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Order of hashes no longer canonical #105

Closed
SomberNight opened this issue Jan 17, 2019 · 12 comments
Closed

Order of hashes no longer canonical #105

SomberNight opened this issue Jan 17, 2019 · 12 comments
Assignees

Comments

@SomberNight
Copy link

Since version 0.14, the order of hashes does not seem to be deterministic.
Before 0.14, the hashes used to be lexicographically ordered.

I am guessing this was changed in #95

What is the reason for this?

We use hashin in electrum for deterministic builds.
The requirements files are generated by this script.
The freeze_packages.sh script is manually rerun to update dependencies (see e.g. spesmilo/electrum@ba4af29).
If the order of hashes for a given package is not canonical, then rerunning the script can result in changes for packages that have not actually been updated, making the diff considerably less readable.

@peterbe
Copy link
Owner

peterbe commented Jan 17, 2019

The original issue was: #93

The reasoning was that when the code that determines whether to edit the list of hashes kick in, it should be agnostic of order. Because, the order of them does not matter to pip. Also, it doesn't really make sense to sort something like that.

Also, I don't understand why you point out this commit. There's no change of order of hashes there.

@peterbe
Copy link
Owner

peterbe commented Jan 17, 2019

By the way, I couldn't help but to notice...

for requirement in $requirements; do
    echo -e "\r  Hashing $requirement..."
    $other_python -m hashin -r $contrib/deterministic-build/requirements${i}.txt ${requirement}
done

Can, in theory, be replaced with:

echo -e "\r  Hashing $requirements..."
$other_python -m hashin -r $contrib/deterministic-build/requirements${i}.txt ${requirements}

Referring to this.

@SomberNight
Copy link
Author

Also, I don't understand why you point out this commit. There's no change of order of hashes there.

Yes, there is no change there. I just linked to that as an example of rerunning the script. I have just locally rerun the script again, and noticed this ordering issue now.

The reasoning was that when the code that determines whether to edit the list of hashes kick in, it should be agnostic of order. Because, the order of them does not matter to pip. Also, it doesn't really make sense to sort something like that.

If pip is agnostic to it, it might as well be canonical. :)
It makes sense to sort it for the reason I describe, to make diffs that update these lists less cluttered.

Thanks for the code cleanup suggestion.

@peterbe
Copy link
Owner

peterbe commented Jan 17, 2019

Actually, I think the order is deterministic and it's based on PyPI. What I mean is that it downloads, for example, https://pypi.org/pypi/hashin/json and based on the info it figures out the version and then whichever order they releases are in that version, in the JSON, is the order it writes it to the requirements file.

E.g. when you run

▶ curl https://pypi.org/pypi/hashin/json | jq '.releases["0.14.1"]'
[
  {
    "comment_text": "",
    "digests": {
      "md5": "29526146d5c8d4401bd74b5b329613ea",
      "sha256": "5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1"
    },
    "downloads": -1,
    "filename": "hashin-0.14.1-py3-none-any.whl",
    "has_sig": false,
    "md5_digest": "29526146d5c8d4401bd74b5b329613ea",
    "packagetype": "bdist_wheel",
    "python_version": "py3",
    "requires_python": ">=2.7,!=3.0,!=3.1,!=3.2,!=3.3",
    "size": 15387,
    "upload_time": "2018-12-13T01:29:32",
    "url": "https://files.pythonhosted.org/packages/da/f7/fe7093f1a4bed62f34000c710130ee15a99638e46717f18d0281a6cacff9/hashin-0.14.1-py3-none-any.whl"
  },
  {
    "comment_text": "",
    "digests": {
      "md5": "c7bc5c943853fc192bc01471bbcfa771",
      "sha256": "43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b"
    },
    "downloads": -1,
    "filename": "hashin-0.14.1.tar.gz",
    "has_sig": false,
    "md5_digest": "c7bc5c943853fc192bc01471bbcfa771",
    "packagetype": "sdist",
    "python_version": "source",
    "requires_python": ">=2.7,!=3.0,!=3.1,!=3.2,!=3.3",
    "size": 20941,
    "upload_time": "2018-12-13T01:29:34",
    "url": "https://files.pythonhosted.org/packages/72/9e/971798d201c1a67d66c343d36c335708c4c4585e5afda04e3c2bb046e91c/hashin-0.14.1.tar.gz"
  }
]

the order is

  1. 5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1
  2. 43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b

and that is always the case. So, what you get in your requirements file is something like this:

--- Old
+++ New
@@ -0,0 +1,3 @@
+hashin==0.14.1 \
+    --hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1 \
+    --hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b

What #93 was about was being smarter about "merging" the hashes. For example, suppose I take that diff above, but manually, in my editor, rearrange the order so it looks like this:

▶ cat /tmp/reqs.txt
hashin==0.14.1 \
    --hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1 \
    --hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b

▶ code /tmp/reqs.txt

▶ cat /tmp/reqs.txt
hashin==0.14.1 \
    --hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b \
    --hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1

(note that the only change is the order of the hashes).

Now, if I were to run hashin again...:

▶ python hashin.py hashin -r /tmp/reqs.txt --dry-run

You get an empty output because the hashes have NOT changed according to hashin. It ignores the fact that the order is "wrong".

Basically, as long as you don't manually "mess" with the order of the hashes, it is deterministic. Always. And even if you did mess with the order of the hashes hashin won't try to fix that.

So, no the order is not canonical but it's certainly deterministic as long as PyPI.org is.

@mythmon
Copy link
Contributor

mythmon commented Jan 18, 2019

I think that Hashin should take a stronger approach here than "whatever PyPI.org does". Consistency is useful, both for deterministic builds and for smaller, easier to read diffs. I think that sorting the hashes in the file anytime Hashin writes them goes a long way towards that end.

Deterministic build are important because of projects like Debian's ReproducibleBuilds, which strive to increase trust and verifiability of packages. These are some of the same goals that hashing packages in Python is intended to solve.

Another tool that is similar to Hashin also uses sorting to help consistency: Yarn. It sorts entries both in package.json and also yarn.lock. It doesn't ever have lists of hashes, but it sorts dependencies alphabetically when listing them.

@peterbe
Copy link
Owner

peterbe commented Jan 22, 2019

@di Do you have an opinion on this one? Should the requirements file that hashin builds sort the hashes lexicographically ordered, no matter what, or should it sort them by the same way that PyPI presents them?

I would argue that it's a no-problem because the only time it comes up is if you compare two different requirements files made in two different places when PyPI might have returned files in a different order. E.g.

# On machine X
$ hashin Django -r reqs-x.txt
# On machine Y
$ hashin Django -r reqs-y.txt
$ rsync reqs-y.txt machineX:
# On machine X
# This *will* diff if the order of files 
# on https://pypi.org/pypi/Django/json changed over time. 
$ diff reqs-x.txt reqs-y.txt

Contrast with doing it all and always on the same machine:

# Monday
$ hashin Django==2.1.5 -r /path/to/reqs.txt
# Tuesday
# This will ONLY diff if the SET of hashes has changed between Monday and Tuesday
$ hashin Django==2.1.5 -r /path/to/reqs.txt --dry-run

Essentially, what hashin does when it figures out to bother changing the requirements file is this:

if set(hashes_from_parsing_existing_file) != set(hashes_from_pypi_download):
    amend_requiments_file(hashes_from_pypi_download)

@SomberNight
Copy link
Author

if you compare two different requirements files made in two different places when PyPI might have returned files in a different order

In Electrum's case, the file is not always generated by the same developer. I think it's unrealistic to assume in general that frozen dependency files are always updated from the same machine.

@di
Copy link
Contributor

di commented Jan 22, 2019

@peterbe Thanks for looping me in.

I would expect the hashes to be sorted lexicographically.

The ordering from PyPI doesn't have much meaning to users: it is ordered by the filename of the distribution artifact. Since hashin doesn't include filenames in it's output, this ordering wouldn't make much sense to a user trying to understand why hashin insists on it.

@peterbe
Copy link
Owner

peterbe commented Jan 22, 2019

I get the feeling that all y'all people are saying that it would be good to NOT rely on the sorting of PyPI and instead always write down the hashes lexicographically sorted. Yes, it has the advantage that the files created a more predictable when compared as a whole.

I don't think the existing logic that determines whether it should amend the hashes should change but when it does it should sort the hashes before writing them.

@SomberNight
Copy link
Author

Yes I think that's a good solution.
To restate, if the set of hashes already in the file and the new set of hashes are equal, it's okay to leave the file as is (ignore order when comparing, detect as no change);
but if the sets are unequal, what gets written should be lexicographically sorted.

@peterbe peterbe self-assigned this Jan 30, 2019
peterbe added a commit that referenced this issue Jan 30, 2019
@peterbe
Copy link
Owner

peterbe commented Jan 30, 2019

Anybody willing to take a look at #110 ?

The unit tests plus my extensive manual testing (in the PR description) makes me pretty confident already but, you know, more eyes etc.

peterbe pushed a commit that referenced this issue Jan 31, 2019
Fixes #105 

CC @SomberNight

### Here's how I tested it:

First observe that for `requests` the sha256 digests appear in this order:

1. `7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b`
2. `502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e`

<details>
▶ curl https://pypi.org/pypi/requests/json | jq '.releases["2.21.0"]'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  107k  100  107k    0     0   483k      0 --:--:-- --:--:-- --:--:--  485k
[
  {
    "comment_text": "",
    "digests": {
      "md5": "ed3af234ffcad0b3c1e521e1dfde19be",
      "sha256": "7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b"
    },
    "downloads": -1,
    "filename": "requests-2.21.0-py2.py3-none-any.whl",
    "has_sig": false,
    "md5_digest": "ed3af234ffcad0b3c1e521e1dfde19be",
    "packagetype": "bdist_wheel",
    "python_version": "py2.py3",
    "requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
    "size": 57987,
    "upload_time": "2018-12-10T15:40:08",
    "url": "https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl"
  },
  {
    "comment_text": "",
    "digests": {
      "md5": "1bcd0e0977c3f8db1848ba0e2b7ab904",
      "sha256": "502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e"
    },
    "downloads": -1,
    "filename": "requests-2.21.0.tar.gz",
    "has_sig": false,
    "md5_digest": "1bcd0e0977c3f8db1848ba0e2b7ab904",
    "packagetype": "sdist",
    "python_version": "source",
    "requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
    "size": 111528,
    "upload_time": "2018-12-10T15:40:11",
    "url": "https://files.pythonhosted.org/packages/52/2c/514e4ac25da2b08ca5a464c50463682126385c4272c18193876e91f4bc38/requests-2.21.0.tar.gz"
  }
]
</details>

Run it the first time:

```
▶ python hashin.py -r /tmp/reqs.txt requests

▶ cat /tmp/reqs.txt
requests==2.21.0 \
    --hash=sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e \
    --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b
```
Note that `502a824...` comes before `7bf2a778...` which is different from the JSON from Pypi.org. 

Next, mess with the list of hashes manually:

```
▶ jed /tmp/reqs.txt

▶ cat /tmp/reqs.txt
requests==2.21.0 \
    --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b \
    --hash=sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e

▶ python hashin.py -r /tmp/reqs.txt requests --dry-run

```

In other words, it didn't touch the file. It's because when it compares the previous hashes with the found hashes from pypi, it does it by comparing to `set`s. 

Now really mess with it:

```
▶ cat /tmp/reqs.txt
requests==2.21.0 \
    --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b \
    --hash=sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e

▶ jed /tmp/reqs.txt

▶ cat /tmp/reqs.txt
requests==2.21.0 \
    --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b

▶ python hashin.py -r /tmp/reqs.txt requests

▶ cat /tmp/reqs.txt
requests==2.21.0 \
    --hash=sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e \
    --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b
```

This time, the sets aren't equal so it goes to write down the hashes and when it does it does so in a sorted order.
@peterbe
Copy link
Owner

peterbe commented Jan 31, 2019

This is now out in version 0.14.5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants