How are content types handled by this project? What risks are there? #128
-
Even though the ISCC is used for content identification, it is not an identifier in the traditional sense. You can mix and match the different ISCC-UNITs depending on your requirements. For instance, you could create an ISCC-CODE with SubType SUM. The ISCC-CODE SUM is a composite of a Data-Code and an Instance-Code, both of which are reproducible, deterministic, and independent of the content/media type. Even if you only create and use these two units, you will still be able to match them against other ISCCs that include more units. This can be done by decomposing ISCCs and indexing each type of UNIT separately.

Regarding the reproducibility of Content-Codes, your best bet is to stick with a single implementation and also add information about the generator to the metadata. You are correct that Content-Code UNITs depend on media-type-specific content extraction and preprocessing, which may diverge across implementations. However, the purpose of the Content-Code units is to match near-identical content based on their ISCC.

For example, consider having both a JPEG and a PNG version of an image. The role of the Image-Code is to enable the user to find the JPEG version (or other slightly modified versions) based on the ISCC of the PNG version (or vice versa), within or across large collections of ISCCs. This can be done by performing an efficient Hamming-distance-based nearest-neighbor search over the bit-vector of the Image-Code. As long as different implementations feed "similar" enough data into the algorithms, the resulting Content-Codes will very likely be identical or close enough to be matched.
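To make the Hamming-distance matching a bit more concrete, here is a minimal sketch in plain Python. It does not use any ISCC library: the 64-bit values are made-up stand-ins for Image-Code bit-vectors, the file names are hypothetical, and the threshold of 8 bits is an arbitrary assumption rather than a recommended value. It also does a simple linear scan, whereas a real index over a large collection would use a dedicated nearest-neighbor structure.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def nearest(
    query: bytes, index: Dict[str, bytes], max_distance: int = 8
) -> List[Tuple[str, int]]:
    """Return (name, distance) pairs within max_distance, closest first.

    max_distance is an illustrative assumption, not a value from the spec.
    """
    hits = [(name, hamming_distance(query, code)) for name, code in index.items()]
    return sorted((h for h in hits if h[1] <= max_distance), key=lambda h: h[1])


# Hypothetical 64-bit bit-vectors standing in for Image-Codes.
index = {
    "photo.png": bytes.fromhex("f0f0f0f0f0f0f0f0"),
    "photo.jpg": bytes.fromhex("f0f0f0f0f0f0f0f3"),  # differs in 2 bits
    "other.png": bytes.fromhex("0f0f0f0f0f0f0f0f"),  # differs in all 64 bits
}

matches = nearest(index["photo.png"], index)
print(matches)
```

Here the JPEG stand-in is found from the PNG stand-in because the two codes differ in only a couple of bits, while the unrelated image is far outside the threshold.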
-
Is there any more information about what's happening in the content-hash part of this algorithm? How is it expected to be predictable and reliable to generate each time, and into the future?
Currently there are other content types:
There are many more content types than this; perhaps it is better in most cases just to use a Data-Code (or Mixed, which is quite a large umbrella)?
With the Image-Code, there seems to be potential to introduce errors, e.g. in cropping the image, or in the algorithm used to convert an image to greyscale without a specific color-to-grey map. Also, is there a list of image formats that can be used and will be maintained over the long term, so hashes can be recreated in future?
Ideally I can create a content hash reliably over the longest period of time, so my identifier can be recreated (by myself or someone else), but it seems like a risk that this won't always be possible given the complexity of this additional content similarity hash.
I'd love to hear more about how folks are doing with this and examples IRL.