How are content types handled by this project? What risks are there? #128

ross-spencer · 2024-05-20T12:27:53Z


Is there any more information about what happens in the content-hash part of this algorithm? How is it expected to be predictable and reliable to generate each time, now and into the future?

Currently there are other content types:

Text
Image
Audio
Video
Mixed

There are many more content types than these. Perhaps in most cases it is better just to use a Data-Code (or Mixed, which is quite a large umbrella)?

With the Image-Code, there seems to be potential to introduce errors, e.g. in cropping the image, or in the algorithm used to convert an image to greyscale without a specific color-to-grey map. Also, is there a list of image formats that can be used and that will be maintained over the long term, so hashes can be recreated in future?

Ideally I can create a content hash reliably over the longest period of time, so my identifier can be recreated (by myself or someone else), but it seems like a risk that this won't always be possible given the complexity of this additional content similarity hash.

I'd love to hear more about how folks are doing with this and examples IRL.

titusz · 2024-05-28T15:20:00Z

Maintainer

Even though the ISCC is used for content identification, it is not an identifier in the traditional sense.

You can mix and match the different ISCC-UNITs depending on your requirements. For instance, you could create an ISCC-CODE with SubType SUM. The ISCC-CODE SUM is a composite Data-Code and Instance-Code, both of which are reproducible, deterministic, and independent of the content/media type. Even if you only create and use these two units, you will still be able to match them against other ISCCs that include more units. This can be done by decomposing ISCCs and indexing each type of UNIT separately.
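To make the "decompose and index each UNIT type separately" idea concrete, here is a minimal sketch in plain Python. It is not the ISCC reference implementation: the asset names, the unit values (made-up hex strings), and the `build_index` / `match` helpers are all hypothetical, and it assumes the ISCCs have already been decomposed into their units.

```python
from collections import defaultdict

# Hypothetical, already-decomposed ISCCs: asset id -> {unit type: unit value}.
# The hex strings are made-up placeholders, not real ISCC-UNITs.
ASSETS = {
    "asset-1": {"content": "a1f0", "data": "77c2", "instance": "9e41"},
    "asset-2": {"meta": "0b3d", "data": "77c2", "instance": "5d10"},
    "asset-3": {"data": "1234", "instance": "beef"},
}


def build_index(assets):
    """Build one lookup table per unit type: value -> set of asset ids."""
    index = defaultdict(lambda: defaultdict(set))
    for asset_id, units in assets.items():
        for unit_type, value in units.items():
            index[unit_type][value].add(asset_id)
    return index


def match(index, query_units):
    """Look up each unit of the query in the table for its own unit type."""
    return {
        unit_type: sorted(index[unit_type].get(value, set()))
        for unit_type, value in query_units.items()
    }


# A query that only carries Data and Instance units (as a SUM code would)
# can still hit entries whose ISCCs contain additional units.
index = build_index(ASSETS)
result = match(index, {"data": "77c2", "instance": "9e41"})
print(result)
```

This sketch only does exact lookups per unit type; similarity matching on a unit would replace the dictionary lookup with a distance-based search.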

Regarding the reproducibility of Content-Codes, your best bet is to stick with a single implementation and also add information about the generator to the metadata.

You are correct that Content-Code UNITs depend on media-type-specific content extraction & preprocessing, which may diverge across implementations. However, the purpose of the Content-Code units is to match near-identical content based on their ISCC.

For example, consider having both a JPEG and a PNG version of an image. The role of the Image-Code is to enable the user to find the JPEG version (or other slightly modified versions) based on the ISCC of the PNG version (or vice versa) within or across large collections of ISCCs. This can be done by performing an efficient Hamming-distance-based nearest neighbor search over the bit-vector of the Image-Code.
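As a rough illustration of the Hamming-distance matching described above, the sketch below compares 64-bit values by counting differing bits. The bit-vectors, file names, and the `max_distance` threshold are made up for the example and are not real Image-Codes. It uses a brute-force linear scan; a large collection would more likely use a dedicated nearest-neighbor index.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two bit-vectors differ."""
    return bin(a ^ b).count("1")


def nearest(query: int, codes: dict, max_distance: int) -> list:
    """Return (distance, name) pairs within max_distance, closest first."""
    scored = ((hamming(query, code), name) for name, code in codes.items())
    return sorted(pair for pair in scored if pair[0] <= max_distance)


# Hypothetical 64-bit bit-vectors (placeholders, not real Image-Codes).
png_code = 0xF0F0F0F0F0F0F0F0
collection = {
    "photo.jpg": png_code ^ 0b101,      # differs from the PNG in 2 bits
    "unrelated.png": 0x0123456789ABCDEF,
}

matches = nearest(png_code, collection, max_distance=8)
print(matches)
```

The threshold is a tuning choice: a larger `max_distance` finds more heavily modified versions at the cost of more false positives.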

As long as different implementations feed "similar" enough data into the algorithms, the resulting Content-Codes will very likely be identical or close enough to be matched.

1 reply

titusz May 28, 2024
Maintainer

Here is an example ISCC with SubType SUM: ISCC:KUAG5LUBVP23N3DOHCHWIYGXVN7ZS

I encourage you to go to the INSPECT Tab at https://huggingface.co/spaces/iscc/iscc-playground and paste it there to see how it is decoded/decomposed.
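For readers who want to see the decomposition without the playground, here is a hedged sketch using only the Python standard library. It assumes the canonical form is unpadded base32 after the `ISCC:` prefix, that each of the four header fields (MainType, SubType, Version, Length) fits in a single 4-bit nibble for this particular code, and that the 128-bit body is the Data-Code body followed by the Instance-Code body. The `decode_sum` helper is hypothetical; the iscc-core library is the authoritative decoder.

```python
import base64


def decode_sum(iscc: str):
    """Split an ISCC string into header nibbles and raw body bytes."""
    code = iscc.split(":", 1)[-1]
    # Restore the base32 padding that the canonical form omits.
    padded = code + "=" * (-len(code) % 8)
    raw = base64.b32decode(padded)
    # Assumption: every header field is one nibble wide in this example.
    header = (raw[0] >> 4, raw[0] & 0x0F, raw[1] >> 4, raw[1] & 0x0F)
    return header, raw[2:]


header, body = decode_sum("ISCC:KUAG5LUBVP23N3DOHCHWIYGXVN7ZS")
main_type, sub_type, version, length = header

# Assumption: first 64 bits are the Data unit, last 64 bits the Instance unit.
data_body, instance_body = body[:8], body[8:]

print("header nibbles:", header)
print("data body:     ", data_body.hex())
print("instance body: ", instance_body.hex())
```

The two 8-byte halves are the values you would index separately when matching this code against ISCCs that carry more units.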
