_binary_array_to_hex gives wrong value #51
I think this may just be about the choice of bit order. As long as str(ph) and hex_to_hash understand each other, it works. Maybe you have some ideas on how to simplify this, but it might break backward compatibility.
I have further investigated it, and you are right:

    >>> _binary_array_to_hex(numpy.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
    '0f00000000000000'
    >>> _binary_array_to_hex(numpy.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
    'f000000000000000'
    >>> _binary_array_to_hex(numpy.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
    '000f000000000000'
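For illustration, here is a pure-Python sketch of two encodings, inferred only from the outputs above (this is not the library's actual code): a plain most-significant-bit-first conversion, and a variant that additionally swaps the two hex digits of every byte, which matches the values `_binary_array_to_hex` returned in these examples:

```python
# Sketch only: inferred from the outputs reported in this thread,
# not imagehash's actual implementation.

def to_hex_msb_first(bits):
    # "Standard" encoding: most-significant bit first within each byte,
    # bytes in array order.
    out = []
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append("{:02x}".format(byte))
    return "".join(out)

def to_hex_nibble_swapped(bits):
    # Reproduces the observed behavior: same byte order, but the two hex
    # digits (nibbles) of each byte are swapped.
    std = to_hex_msb_first(bits)
    return "".join(std[i + 1] + std[i] for i in range(0, len(std), 2))

bits = [1, 1, 1, 1] + [0] * 60
print(to_hex_msb_first(bits))       # -> f000000000000000
print(to_hex_nibble_swapped(bits))  # -> 0f00000000000000, as reported above
```

Both functions are named hypothetically; the second one merely reproduces the reported outputs, without claiming that is how the library computes them.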
I'd like to reopen this issue, because the implementation of the binary-to-hexadecimal conversion is not interoperable with other implementations of the same algorithm, and not interoperable with "standard" value-conversion functions either. Therefore, if I calculate the pHash with
When I say "standard" I mean something like "widely used". Take for example the
versus
Again, for Hamming-distance comparison, as long as the bits are the same, it doesn't matter. However, if you use other libraries, you might get weird results.
That's why I have my own
I could submit a PR changing the behavior to "standard". But it will definitely break backward compatibility. If you would like it, just say so.
Is this a big-endian vs little-endian thing?
Nope.

    >>> _binary_array_to_hex(numpy.array([1,1,1,1,0,0,0,1, 0,0,0,0,0,0,0,0, 0,0,1,1,1,1,1,1, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]))
    '1f00f30000000000'
    # big-endian:
    'f1003f0000000000'
    # little-endian:
    '00000000003f00f1'

Plus, the big vs little endian problem is about hardware, not about software/programming. As far as I know, the standard and default is big-endian at the software/programming level, simply because that is how we write text in Latin languages: from left to right. Even in C, which is considered a low-level language, the standard is big-endian when you code. (I am not aware of any language that does not behave in this way.) How the compiler (C) or the interpreter (Python) translates it for the machine where it is running is not our problem. We should not care about it when writing code. Reference:
I see that @djunzu established earlier that this is about the byte order of the encoding, namely that it is swapping byte pairs? I suspect that one can reproduce imagehash's implementation by reversing the byte sequence, transforming to hex, and reversing that. I don't think there is a single "correct" solution for ordering the bytes from a 2d array into a hex string; any encoding is a choice. For accessing the bytes, the .hash attribute is the way to go. Could you explain some more why you need consistent encoding behaviour with other modules here, or did you just stumble across this oddity while making your own encoding? If you have an alternative and shorter string codec that does not introduce new dependencies, I would be interested.
That's simply not true; see your references for network layers, file encoding, etc.
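The reverse-then-reverse description above can be checked against the example values from this thread. The sketch below assumes the "standard" byte values are obtained by reading bits most-significant-first; the function names are hypothetical:

```python
# Sketch: checks the hypothesis that imagehash's string equals the hex of the
# *reversed* byte sequence, with that hex string itself reversed.

def msb_bytes(bits):
    # Pack bits most-significant-first into byte values, in array order.
    return [int("".join(map(str, bits[i:i + 8])), 2)
            for i in range(0, len(bits), 8)]

def reverse_hex_reverse(bits):
    rev = msb_bytes(bits)[::-1]                         # reverse the byte sequence
    hexstr = "".join("{:02x}".format(b) for b in rev)   # transform to hex
    return hexstr[::-1]                                 # reverse the string

bits = [1,1,1,1,0,0,0,1, 0,0,0,0,0,0,0,0, 0,0,1,1,1,1,1,1] + [0] * 40
print(reverse_hex_reverse(bits))  # -> 1f00f30000000000, the value reported above
```

On this example the hypothesis reproduces the reported output exactly, which supports the byte-pair-swapping reading of the behavior.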
Not quite. The byte order seems correct. The problem lies in how bits inside a byte are read. As far as I could understand
Not sure I understood your algo. I am not 100% confident, but I think one can reproduce imagehash's implementation by reversing the bit sequence for each byte before transforming it to a hex value.
Agreed. Your choice was to use
There is also no "correct" solution for ordering bits in a byte. Your choice was to consider the right-most bit as the most significant bit (if I understood correctly). Usually most people consider the right-most bit the least significant (and we do it just because that is how we write numbers: 23 == 2 * 10 + 3, and 2 + 3 * 10 != 23).
Totally agree.
I must save all hashes as hex values in plain text files, and I must insert all values into a Postgres DB as bit sequences. Now I can't debug or implement some things the way I'd like, because I have two different bit sequences: one from imagehash and one from the DB (even though the hex value is the same!). Sometimes I can follow other ways to debug/implement something, but it is always easier to have consistent behavior across software.
I will submit a PR later.
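To make the mismatch described above concrete, here is a small sketch. It assumes the database decodes hex plainly, most-significant bit first (as Postgres bit strings do); the variable names are illustrative only:

```python
# Assumption: the DB decodes hex most-significant-bit-first,
# e.g. hex '0f' becomes the bit string 00001111.
hexstr = "0f"
db_bits = bin(int(hexstr, 16))[2:].zfill(4 * len(hexstr))
print(db_bits)  # -> 00001111
# Under the swapped encoding discussed in this thread, '0f' was produced
# from the bits 11110000, so the bit sequence stored in the DB no longer
# matches the one imagehash computed, even though the hex text is identical.
```

So the same eight hex characters name two different bit sequences depending on which convention produced them, which is exactly why debugging across the two stores becomes awkward.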
I think we agree big/little endian is not the problem here, so it is already off topic. But...

Binary files are a problem. Right, I forgot that, basically because it is not a problem with ASCII files. But a good binary file will have some kind of indication about it: for example, Unicode text can start with a BOM just for that.

My point still stands for the network layer. If I write a server/client in Python or in C or any other language, I will not care about big vs little endian and it will work. It will work even if the server is on a big-endian machine and the client is on a little-endian machine (or the other way around). It will work because the libraries will take care of it at some point. Analogous are the 7 OSI layers: we usually don't care about them because libraries take care of them for us.

I could have stated it in a different way: "Before you saw big or little endian, you may have had no idea it even existed. That's because it's reasonably well-hidden from you." That is my point. Big/little endian is a low-level thing we should not care about most of the time in high-level programming. If everything has consistent behavior, we do not need to deal with little details.
@JohannesBuchner, some questions in order to build an alternative:
    >>> str(imagehash.phash(Image.open('Lenna.png'), hash_size=8))
    '99636ab4aecc4569'
    >>> str(imagehash.phash(Image.open('Lenna.png'), hash_size=2))
    ''
    >>> imagehash.phash(Image.open('Lenna.png'), hash_size=2)
    array([[ True, False],
           [ True, False]], dtype=bool)
ad 1: Yes, that's a bug. Could you please add your example as a unit test?
It definitely has some bugs:

    >>> str(imagehash.phash(Image.open('Lenna.png'), hash_size=6))
    'd9a8d22e'
    >>> imagehash.hex_to_hash('d9a8d22e', hash_size=6)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/perfilgp/.pyenv/versions/3.6.1/lib/python3.6/site-packages/imagehash/__init__.py", line 106, in hex_to_hash
        raise ValueError(emsg.format(count))
    ValueError: Expected hex string size of 6.

PR coming in the next days.
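Both failures in this thread, the empty string for hash_size=2 and the truncated round-trip for hash_size=6, are consistent with an encoder that only emits complete bytes and silently drops trailing bits. This is a guess at the mechanism, not the actual library code:

```python
# Hypothesis only: if str() packs bits into whole bytes and never flushes a
# partial byte, any hash whose bit count is not a multiple of 8 loses bits.
for hash_size, reported in [(6, "d9a8d22e"), (2, "")]:
    n_bits = hash_size * hash_size   # total bits in the hash
    n_hex = (n_bits // 8) * 2        # hex chars from complete bytes only
    print(hash_size, n_bits, n_hex, len(reported))
# -> 6 36 8 8   (36 bits, 4 whole bytes, 8 hex chars: 4 bits were dropped)
# -> 2 4 0 0    (4 bits, no whole byte at all, hence the empty string)
```

Under this reading, hex_to_hash then rejects the string because the 8 characters it receives cannot encode the 36 bits a 6x6 hash needs.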
@phretor, could you check if the newest version (4.0) works OK for you?
Correct me if I am wrong, but _binary_array_to_hex gives wrong values. If I made no mistake, let me know and I will submit a PR.