_binary_array_to_hex gives wrong value #51

djunzu · 2017-02-21T22:53:59Z

Correct me if I am wrong, but _binary_array_to_hex gives wrong values.

Python 2.7.13 (default, Feb 16 2017, 19:11:00) 
[GCC 5.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image
>>> import imagehash
>>> img = Image.open('sample_image.png')
>>> ph = imagehash.phash(img)
>>> bool_array = ph.hash.flatten()
>>> bool_array
array([ True, False, False,  True,  True,  True,  True, False, False,
       False,  True,  True,  True,  True, False, False,  True,  True,
       False, False, False, False, False,  True,  True,  True,  True,
        True, False, False, False, False,  True,  True,  True,  True,
       False, False, False, False,  True,  True, False, False, False,
       False,  True,  True,  True,  True,  True, False, False,  True,
        True,  True,  True,  True, False, False, False, False, False, False], dtype=bool)
>>> bit_array = 1*bool_array
>>> bit_array
array([1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
>>> bit_string = ''.join(str(b) for b in bit_array)
>>> bit_string
'1001111000111100110000011111000011110000110000111110011111000000'
>>> int(bit_string, 2)
11402201597170935744L
>>> hex(int(bit_string, 2))
'0x9e3cc1f0f0c3e7c0L' # This is the expected hex value for the given bit array (phash)
>>> str(ph)
'793c830f0fc3e703' # and here is the actual value.

If I made no mistake, let me know and I will submit a PR.

The text was updated successfully, but these errors were encountered:

JohannesBuchner · 2017-02-22T10:38:55Z

I think this may just be about the choice of bit order. As long str(ph) and hex_to_hash understand each other. Maybe you have some ideas on how to simplify this, but it might break backward compatibility.

djunzu · 2017-02-22T19:17:59Z

I think this may just be about the choice of bit order.

I have further investigated it, and you are right:

>>> _binary_array_to_hex(numpy.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
'0f00000000000000'
>>> _binary_array_to_hex(numpy.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
'f000000000000000'
>>> _binary_array_to_hex(numpy.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
'000f000000000000'

_binary_array_to_hex parses the bit array in a very unexpected way. But in this case it does not matter: the bit array is used just to compute hamming distances. So bit order is not important as long as we can represent 2^64 values.

phretor · 2017-10-26T08:20:05Z

I'd like to reopen this issue, because the implementation of the binary-to-hexadecimal conversion is not interoperable with other implementations of the same algorithm, and simply not interoperable with "standard" value-conversion functions.

Therefore, if I calculate the pHash with imagehash and then I take its hex value and interpret it with another "standard" hex-to-bin function, I obtain a completely different value.

When I say "standard" I mean something like "widely used". Take for example the bitstring.BitArray class:

In [3]: BitArray(bin='1000111100001111000011110000111100001111000010110000101101111010')
Out[3]: BitArray('0x8f0f0f0f0f0b0b7a')

versus

In [13]: _binary_array_to_hex(np.array([bool(int(x)) for x in '1000111100001111000011110000111100001111000010110000101101111010']))
Out[13]: 'f1f0f0f0f0d0d05e'

Again, for Hamming-distance comparison, as long as the bits are the same, it doesn't matter. However, if you use other libraries, you might get weird results.

djunzu · 2017-10-26T23:10:21Z

interoperable with other implementations

That's why I have my own bool_array_to_hex and use bool_array_to_hex(imagehash.phash(image).hash).

Maybe you have some ideas on how to simplify this, but it might break backward compatibility.

I could submit a PR changing the behavior to "standard". But it will definitely break backward compatibility. If you would like it, just say so.

JohannesBuchner · 2017-10-27T00:18:03Z

is this a big-endian vs small-endian thing?

djunzu · 2017-10-27T13:18:28Z

is this a big-endian vs small-endian thing?

Nop.
In both cases the byte sequence is preserved. Using big-endian the bytes are ordered from left to right. Using little-endian the bytes are ordered from right to left. But in both cases the bit order inside each byte does not change. The byte F1 will be represented like F1 in both cases.
The _binary_array_to_hex's unexpected behavior is exactly the changing of bit order inside each byte. That is: F1 is being represented like 1F.

>>> _binary_array_to_hex(numpy.array([1,1,1,1,0,0,0,1,  0,0,0,0,0,0,0,0,  0,0,1,1,1,1,1,1,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]))
'1f00f30000000000'
# big-endian:
'f1003f0000000000'
# little-endian:
'00000000003f00f1'

Plus, the big vs little endian problem is about hardware and not about software/programing. As far as I know, the standard and default is big-endian in a software/programing level. Simply because that is how we write text in latin languages: from left to right. Even in C, which is considered a low level language, the standard is big-endian when you code. (I am not aware about any language that does not behave in this way.) How the compiler (C) or the interpreter (Python) will translate it to machine where it is running is not our problem. We should not care about it when writing code.

Reference:
https://en.wikipedia.org/wiki/Endianness
https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Data/endian.html

JohannesBuchner · 2017-10-27T16:52:54Z

I see that @djunzu established earlier that this is about the encoding by byte order, namely that it is swapping byte pairs? I suspect that one can reproduce imagehash's implementation by reversing the byte sequence, transforming to hex and reversing that.

I don't think there is a single "correct" solution for ordering the bytes from a 2d array into a hex string, any encoding is a choice. For accessing the bytes, the .hash attribute is the way to go.
I fully expect that many users would implement their own encoding for their needs (e.g. storing in databases, etc.). Some may store the byte array directly, others a hex encoding, others may choose something that takes advantage of a larger alphabet.

Could you explain some more why you need consistent encoding behaviour with other modules here, or you just stumbled across this oddity while making your own encoding? If you have a alternative and shorter string codec that does not introduce new dependencies, I would be interested.

Plus, the big vs little endian problem is about hardware and not about software/programing.

That's simply not true, see your references for network layers, file encoding, etc.

djunzu · 2017-10-27T19:29:12Z

this is about the encoding by byte order, namely that it is swapping byte pairs?

Not quite. The byte order seems correct. The problem relies on how bits inside a byte are read. As far as I could understand _binary_array_to_hex considers the left most bit the least significant and the right most bit the most significant.

one can reproduce imagehash's implementation by reversing the byte sequence, transforming to hex and reversing that.

Not sure I understood your algo. I am not 100% confident but I think one can reproduce imagehash's implementation by reversing the bit sequence, for each byte, before transforming it to hex value.

I don't think there is a single "correct" solution for ordering the bytes from a 2d array into a hex string, any encoding is a choice.

Agreed. Your choice was to use arr.flatten() for it. Could we say that is the "correct" Python way to do it?

There is also no "correct" solution for ordering bits in a byte. Your choice was to consider the right most bit as the most significant bit (if I understood correctly). Usually most of people consider the right most bit the least significant (and people do it just because we do it writing numbers: 23 == 2 * 10 + 3 and 2 + 3 *10 != 23).

For accessing the bytes, the .hash attribute is the way to go.
I fully expect that many users would implement their own encoding for their needs (e.g. storing in databases, etc.). Some may store the byte array directly, others a hex encoding, others may choose something that takes advantage of a larger alphabet.

Totally agree.
But I still would like to have a "standard" hex value.

Could you explain some more why you need consistent encoding behaviour with other modules

I must save all hashes as hex values in plain text files and I must insert all values in a Postgres DB as bit sequences. Now I can't debug or implement something following some ways because I have two different bit sequences: one from imagehash and one from DB (even though the hex value is the same!). Sometimes I can follow another ways to debug/implement something. But it is always easier to have a consistent behavior across softwares.

if you have a alternative and shorter string codec that does not introduce new dependencies, I would be interested.

I will submit a PR later.

That's simply not true, see your references for network layers, file encoding, etc.

I think we agree big / little endian is not the problem here. So it is already off topic. But...

Binary files are a problem. Right. I forget that. Basically because that is not a problem with ASCII files. But a good binary file will have some kind of indication about it: for example unicode text can start with a BOM just for it.

My point still stands for the network layer. If I write a server/client in python or in C or any other language, I will not care about big vs little endian and it will work. It will work even though the server is in a big endian machine and the client is in a litle endian machine (or the other way around). It will work because the libraries will take care of it at some point. Analogous is the 7 OSI layers. We usually don't care about them because libraries take care of them for us.

I could have stated it in a different way: "Before you see big or little endian, you may have had no idea it even existed. That's because it's reasonably well-hidden from you." That is my point. Big/little endian is a low level thing we should not care about most of time in a high level programming. If everything have a consistent behavior, we do not need to deal with little details.

djunzu · 2017-11-01T13:55:45Z

@JohannesBuchner , some questions in order to build an alternative.

Suppose I want a hash_size = 2. Currently _binary_array_to_hex implementation returns a blank string for it as the code below shows. That is a bug, isn't it? Or hash_size can't be 2? (see next question)
(With hash_size=2 there can be just 2⁴ distinct images. But anyway, let's just suppose someone wanna do it for whatever reason.)

>>> str(imagehash.phash(Image.open('Lenna.png'), hash_size=8))
'99636ab4aecc4569'
>>> str(imagehash.phash(Image.open('Lenna.png'), hash_size=2))
''
>>> imagehash.phash(Image.open('Lenna.png'), hash_size=2)
array([[ True, False],
       [ True, False]], dtype=bool)

For whash is explicit stated that hash_size must be a power of 2 and less than image_scale. Does it hold for every other *hash here? Or for the other *hash functions one could choose image_hash = 6, for example?

JohannesBuchner · 2017-11-01T14:03:15Z

ad 1: Yes, that's a bug. Could you please add your example as a unit test?
ad 2: IIRC for ahash and dhash hash_size is just the width and height of the downsampled grayscale image (which basically is the hash). So it can any positive integer should be acceptable. Same for phash. Wavelets have more requirements.

djunzu · 2017-11-01T15:12:31Z

It definitely has some bugs:

>>> str(imagehash.phash(Image.open('Lenna.png'), hash_size=6))
'd9a8d22e'
>>> imagehash.hex_to_hash('d9a8d22e', hash_size=6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/perfilgp/.pyenv/versions/3.6.1/lib/python3.6/site-packages/imagehash/__init__.py", line 106, in hex_to_hash
    raise ValueError(emsg.format(count))
ValueError: Expected hex string size of 6.

PR coming in the next days.

JohannesBuchner · 2017-12-07T00:20:40Z

@phretor, could you check if the newest version (4.0) works OK for you?

djunzu closed this as completed Feb 22, 2017

djunzu reopened this Nov 1, 2017

djunzu mentioned this issue Nov 2, 2017

Fix binary array to hex #64

Merged

JohannesBuchner closed this as completed in #64 Dec 6, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

_binary_array_to_hex gives wrong value #51

_binary_array_to_hex gives wrong value #51

djunzu commented Feb 21, 2017 •

edited

Loading

JohannesBuchner commented Feb 22, 2017

djunzu commented Feb 22, 2017

phretor commented Oct 26, 2017

djunzu commented Oct 26, 2017

JohannesBuchner commented Oct 27, 2017

djunzu commented Oct 27, 2017

JohannesBuchner commented Oct 27, 2017 •

edited

Loading

djunzu commented Oct 27, 2017

djunzu commented Nov 1, 2017 •

edited

Loading

JohannesBuchner commented Nov 1, 2017

djunzu commented Nov 1, 2017

JohannesBuchner commented Dec 7, 2017

_binary_array_to_hex gives wrong value #51

_binary_array_to_hex gives wrong value #51

Comments

djunzu commented Feb 21, 2017 • edited Loading

JohannesBuchner commented Feb 22, 2017

djunzu commented Feb 22, 2017

phretor commented Oct 26, 2017

djunzu commented Oct 26, 2017

JohannesBuchner commented Oct 27, 2017

djunzu commented Oct 27, 2017

JohannesBuchner commented Oct 27, 2017 • edited Loading

djunzu commented Oct 27, 2017

djunzu commented Nov 1, 2017 • edited Loading

JohannesBuchner commented Nov 1, 2017

djunzu commented Nov 1, 2017

JohannesBuchner commented Dec 7, 2017

djunzu commented Feb 21, 2017 •

edited

Loading

JohannesBuchner commented Oct 27, 2017 •

edited

Loading

djunzu commented Nov 1, 2017 •

edited

Loading