Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_binary_array_to_hex gives wrong value #51

Closed
djunzu opened this issue Feb 21, 2017 · 12 comments · Fixed by #64
Closed

_binary_array_to_hex gives wrong value #51

djunzu opened this issue Feb 21, 2017 · 12 comments · Fixed by #64

Comments

@djunzu
Copy link

djunzu commented Feb 21, 2017

Correct me if I am wrong, but _binary_array_to_hex gives wrong values.

Python 2.7.13 (default, Feb 16 2017, 19:11:00) 
[GCC 5.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image
>>> import imagehash
>>> img = Image.open('sample_image.png')
>>> ph = imagehash.phash(img)
>>> bool_array = ph.hash.flatten()
>>> bool_array
array([ True, False, False,  True,  True,  True,  True, False, False,
       False,  True,  True,  True,  True, False, False,  True,  True,
       False, False, False, False, False,  True,  True,  True,  True,
        True, False, False, False, False,  True,  True,  True,  True,
       False, False, False, False,  True,  True, False, False, False,
       False,  True,  True,  True,  True,  True, False, False,  True,
        True,  True,  True,  True, False, False, False, False, False, False], dtype=bool)
>>> bit_array = 1*bool_array
>>> bit_array
array([1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
>>> bit_string = ''.join(str(b) for b in bit_array)
>>> bit_string
'1001111000111100110000011111000011110000110000111110011111000000'
>>> int(bit_string, 2)
11402201597170935744L
>>> hex(int(bit_string, 2))
'0x9e3cc1f0f0c3e7c0L' # This is the expected hex value for the given bit array (phash)
>>> str(ph)
'793c830f0fc3e703' # and here is the actual value.

If I made no mistake, let me know and I will submit a PR.

@JohannesBuchner
Copy link
Owner

I think this may just be about the choice of bit order. As long str(ph) and hex_to_hash understand each other. Maybe you have some ideas on how to simplify this, but it might break backward compatibility.

@djunzu
Copy link
Author

djunzu commented Feb 22, 2017

I think this may just be about the choice of bit order.

I have further investigated it, and you are right:

>>> _binary_array_to_hex(numpy.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
'0f00000000000000'
>>> _binary_array_to_hex(numpy.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
'f000000000000000'
>>> _binary_array_to_hex(numpy.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
'000f000000000000'

_binary_array_to_hex parses the bit array in a very unexpected way. But in this case it does not matter: the bit array is used just to compute hamming distances. So bit order is not important as long as we can represent 2^64 values.

@djunzu djunzu closed this as completed Feb 22, 2017
@phretor
Copy link

phretor commented Oct 26, 2017

I'd like to reopen this issue, because the implementation of the binary-to-hexadecimal conversion is not interoperable with other implementations of the same algorithm, and simply not interoperable with "standard" value-conversion functions.

Therefore, if I calculate the pHash with imagehash and then I take its hex value and interpret it with another "standard" hex-to-bin function, I obtain a completely different value.

When I say "standard" I mean something like "widely used". Take for example the bitstring.BitArray class:

In [3]: BitArray(bin='1000111100001111000011110000111100001111000010110000101101111010')
Out[3]: BitArray('0x8f0f0f0f0f0b0b7a')

versus

In [13]: _binary_array_to_hex(np.array([bool(int(x)) for x in '1000111100001111000011110000111100001111000010110000101101111010']))
Out[13]: 'f1f0f0f0f0d0d05e'

Again, for Hamming-distance comparison, as long as the bits are the same, it doesn't matter. However, if you use other libraries, you might get weird results.

@djunzu
Copy link
Author

djunzu commented Oct 26, 2017

interoperable with other implementations

That's why I have my own bool_array_to_hex and use bool_array_to_hex(imagehash.phash(image).hash).

Maybe you have some ideas on how to simplify this, but it might break backward compatibility.

I could submit a PR changing the behavior to "standard". But it will definitely break backward compatibility. If you would like it, just say so.

@JohannesBuchner
Copy link
Owner

is this a big-endian vs small-endian thing?

@djunzu
Copy link
Author

djunzu commented Oct 27, 2017

is this a big-endian vs small-endian thing?

Nop.
In both cases the byte sequence is preserved. Using big-endian the bytes are ordered from left to right. Using little-endian the bytes are ordered from right to left. But in both cases the bit order inside each byte does not change. The byte F1 will be represented like F1 in both cases.
The _binary_array_to_hex's unexpected behavior is exactly the changing of bit order inside each byte. That is: F1 is being represented like 1F.

>>> _binary_array_to_hex(numpy.array([1,1,1,1,0,0,0,1,  0,0,0,0,0,0,0,0,  0,0,1,1,1,1,1,1,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]))
'1f00f30000000000'
# big-endian:
'f1003f0000000000'
# little-endian:
'00000000003f00f1'

Plus, the big vs little endian problem is about hardware and not about software/programing. As far as I know, the standard and default is big-endian in a software/programing level. Simply because that is how we write text in latin languages: from left to right. Even in C, which is considered a low level language, the standard is big-endian when you code. (I am not aware about any language that does not behave in this way.) How the compiler (C) or the interpreter (Python) will translate it to machine where it is running is not our problem. We should not care about it when writing code.

Reference:
https://en.wikipedia.org/wiki/Endianness
https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Data/endian.html

@JohannesBuchner
Copy link
Owner

JohannesBuchner commented Oct 27, 2017

I see that @djunzu established earlier that this is about the encoding by byte order, namely that it is swapping byte pairs? I suspect that one can reproduce imagehash's implementation by reversing the byte sequence, transforming to hex and reversing that.

I don't think there is a single "correct" solution for ordering the bytes from a 2d array into a hex string, any encoding is a choice. For accessing the bytes, the .hash attribute is the way to go.
I fully expect that many users would implement their own encoding for their needs (e.g. storing in databases, etc.). Some may store the byte array directly, others a hex encoding, others may choose something that takes advantage of a larger alphabet.

Could you explain some more why you need consistent encoding behaviour with other modules here, or you just stumbled across this oddity while making your own encoding? If you have a alternative and shorter string codec that does not introduce new dependencies, I would be interested.

Plus, the big vs little endian problem is about hardware and not about software/programing.

That's simply not true, see your references for network layers, file encoding, etc.

@djunzu
Copy link
Author

djunzu commented Oct 27, 2017

this is about the encoding by byte order, namely that it is swapping byte pairs?

Not quite. The byte order seems correct. The problem relies on how bits inside a byte are read. As far as I could understand _binary_array_to_hex considers the left most bit the least significant and the right most bit the most significant.

one can reproduce imagehash's implementation by reversing the byte sequence, transforming to hex and reversing that.

Not sure I understood your algo. I am not 100% confident but I think one can reproduce imagehash's implementation by reversing the bit sequence, for each byte, before transforming it to hex value.

I don't think there is a single "correct" solution for ordering the bytes from a 2d array into a hex string, any encoding is a choice.

Agreed. Your choice was to use arr.flatten() for it. Could we say that is the "correct" Python way to do it?

There is also no "correct" solution for ordering bits in a byte. Your choice was to consider the right most bit as the most significant bit (if I understood correctly). Usually most of people consider the right most bit the least significant (and people do it just because we do it writing numbers: 23 == 2 * 10 + 3 and 2 + 3 *10 != 23).

For accessing the bytes, the .hash attribute is the way to go.
I fully expect that many users would implement their own encoding for their needs (e.g. storing in databases, etc.). Some may store the byte array directly, others a hex encoding, others may choose something that takes advantage of a larger alphabet.

Totally agree.
But I still would like to have a "standard" hex value.

Could you explain some more why you need consistent encoding behaviour with other modules

I must save all hashes as hex values in plain text files and I must insert all values in a Postgres DB as bit sequences. Now I can't debug or implement something following some ways because I have two different bit sequences: one from imagehash and one from DB (even though the hex value is the same!). Sometimes I can follow another ways to debug/implement something. But it is always easier to have a consistent behavior across softwares.

if you have a alternative and shorter string codec that does not introduce new dependencies, I would be interested.

I will submit a PR later.

That's simply not true, see your references for network layers, file encoding, etc.

I think we agree big / little endian is not the problem here. So it is already off topic. But...

Binary files are a problem. Right. I forget that. Basically because that is not a problem with ASCII files. But a good binary file will have some kind of indication about it: for example unicode text can start with a BOM just for it.

My point still stands for the network layer. If I write a server/client in python or in C or any other language, I will not care about big vs little endian and it will work. It will work even though the server is in a big endian machine and the client is in a litle endian machine (or the other way around). It will work because the libraries will take care of it at some point. Analogous is the 7 OSI layers. We usually don't care about them because libraries take care of them for us.

I could have stated it in a different way: "Before you see big or little endian, you may have had no idea it even existed. That's because it's reasonably well-hidden from you." That is my point. Big/little endian is a low level thing we should not care about most of time in a high level programming. If everything have a consistent behavior, we do not need to deal with little details.

@djunzu djunzu reopened this Nov 1, 2017
@djunzu
Copy link
Author

djunzu commented Nov 1, 2017

@JohannesBuchner , some questions in order to build an alternative.

  1. Suppose I want a hash_size = 2. Currently _binary_array_to_hex implementation returns a blank string for it as the code below shows. That is a bug, isn't it? Or hash_size can't be 2? (see next question)
    (With hash_size=2 there can be just 2⁴ distinct images. But anyway, let's just suppose someone wanna do it for whatever reason.)
>>> str(imagehash.phash(Image.open('Lenna.png'), hash_size=8))
'99636ab4aecc4569'
>>> str(imagehash.phash(Image.open('Lenna.png'), hash_size=2))
''
>>> imagehash.phash(Image.open('Lenna.png'), hash_size=2)
array([[ True, False],
       [ True, False]], dtype=bool)
  1. For whash is explicit stated that hash_size must be a power of 2 and less than image_scale. Does it hold for every other *hash here? Or for the other *hash functions one could choose image_hash = 6, for example?

@JohannesBuchner
Copy link
Owner

ad 1: Yes, that's a bug. Could you please add your example as a unit test?
ad 2: IIRC for ahash and dhash hash_size is just the width and height of the downsampled grayscale image (which basically is the hash). So it can any positive integer should be acceptable. Same for phash. Wavelets have more requirements.

@djunzu
Copy link
Author

djunzu commented Nov 1, 2017

It definitely has some bugs:

>>> str(imagehash.phash(Image.open('Lenna.png'), hash_size=6))
'd9a8d22e'
>>> imagehash.hex_to_hash('d9a8d22e', hash_size=6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/perfilgp/.pyenv/versions/3.6.1/lib/python3.6/site-packages/imagehash/__init__.py", line 106, in hex_to_hash
    raise ValueError(emsg.format(count))
ValueError: Expected hex string size of 6.

PR coming in the next days.

@JohannesBuchner
Copy link
Owner

@phretor, could you check if the newest version (4.0) works OK for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants