Missing offset table? #4

erikogabrielsson · 2024-02-19T09:02:37Z

Hi,

I'm reading some WSI files retrieved from the IDC portal that I assume were created using this script.

The files I have looked at so far seem to lack both a basic and an extended offset table. This makes it very slow to read out tiles (using wsidicom). Is there a reason for this?

As the files are dual DICOM/TIFF, I have experimented with reading the tile offsets from the TIFF tags, and that works.

fedorov · 2024-02-19T15:50:37Z

@erikogabrielsson yes, all of the DICOM WSI files available from IDC at this point (as of IDC data release v17) were created using this script. We will need to wait to hear from David regarding the offset tables.

dclunie · 2024-02-19T20:38:05Z

AFAIK, the GHC implementation of the DICOM store that IDC uses does not take advantage of the offset tables but computes its own frame index, as I think dcm4chee does too (not 100% sure). So I did not add the BOT (which wouldn't have worked for the larger images anyway, given its 32-bit offset limit) or the EOT (which we added to the standard after I started doing these conversions). The EOT is also a (pair of) very large metadata data element(s), which potentially causes problems with GHC (which has limits on total metadata size and, in its general release, doesn't support BulkDataURI).
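To illustrate the 32-bit limit mentioned above, here is a minimal sketch of how offset table entries grow with the pixel data. It assumes one fragment per frame and relies on each fragment being preceded by an 8-byte item header. The function names are hypothetical and are not part of the conversion script or of any library.

```python
from itertools import accumulate

ITEM_HEADER_LENGTH = 8  # 4-byte item tag + 4-byte item length
MAX_BOT_OFFSET = 2**32 - 1  # BOT entries are unsigned 32-bit integers


def frame_offsets(fragment_lengths):
    """Offsets of each frame's item tag, relative to the first item after the BOT.

    Assumes one fragment per frame (an assumption of this sketch).
    """
    sizes = (ITEM_HEADER_LENGTH + length for length in fragment_lengths)
    # The first frame is at offset 0; each later one follows the previous item.
    return [0, *accumulate(sizes)][:-1]


def fits_in_bot(fragment_lengths):
    """True if every frame offset can be stored as an unsigned 32-bit value."""
    offsets = frame_offsets(fragment_lengths)
    return not offsets or offsets[-1] <= MAX_BOT_OFFSET
```

For example, 100,000 tiles of 50,000 bytes each add up to roughly 5 GB of pixel data, so `fits_in_bot([50_000] * 100_000)` is false, and only the 64-bit EOT could index such an image.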

It is something to think about going forward though, esp. for those who are processing offline files and can make use of the offset tables.

Walking forward skipping to the next fragment (without reading or parsing them) to build their own index should not be that expensive an operation for the consumer. Does wsidicom do that in an optimized manner? The sweet spot is when there is one fragment per frame (same number of fragments as frames); otherwise one needs to detect start and end fragments of frames, preferably without full JPEG/J2K marker syntax parsing, which is a pain (I cheat by walking back looking for EOI or EOI+padding).
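The walk described above could look roughly like the following sketch. It assumes a seekable stream positioned at the first item of the encapsulated pixel data (the BOT item, possibly empty) and a little-endian transfer syntax. It reads only the 8-byte item headers and seeks past the values. The function names are hypothetical, and this is not how wsidicom or any other reader is known to implement it.

```python
import struct
from io import BytesIO

ITEM_TAG = b"\xfe\xff\x00\xe0"  # (FFFE,E000) in little-endian byte order
SEQUENCE_DELIMITER_TAG = b"\xfe\xff\xdd\xe0"  # (FFFE,E0DD)


def index_fragments(stream):
    """Return (value_position, length) for each fragment after the BOT item."""
    fragments = []
    is_offset_table = True
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("Unexpected end of pixel data")
        tag, length = header[:4], struct.unpack("<I", header[4:])[0]
        if tag == SEQUENCE_DELIMITER_TAG:
            return fragments
        if tag != ITEM_TAG:
            raise ValueError(f"Unexpected tag {tag.hex()}")
        if is_offset_table:
            is_offset_table = False  # The first item is the BOT, not a fragment.
        else:
            fragments.append((stream.tell(), length))
        stream.seek(length, 1)  # Skip the value without reading it.


def ends_with_eoi(fragment):
    """True if the bytes end with FFD9, optionally followed by one padding byte."""
    return fragment[-2:] == b"\xff\xd9" or fragment[-3:-1] == b"\xff\xd9"


# Synthetic example: empty BOT, two fragments, sequence delimiter.
demo_pixel_data = BytesIO(
    ITEM_TAG + struct.pack("<I", 0)
    + ITEM_TAG + struct.pack("<I", 4) + b"\xff\xd8\xff\xd9"
    + ITEM_TAG + struct.pack("<I", 6) + b"\xff\xd8\x01\xff\xd9\x00"
    + SEQUENCE_DELIMITER_TAG + struct.pack("<I", 0)
)
demo_fragment_index = index_fragments(demo_pixel_data)  # [(16, 4), (28, 6)]
```

When there are more fragments than frames, `ends_with_eoi` can be applied to the last three bytes of each fragment to guess where a frame ends. This is a heuristic: FFD9 marks the end of both JPEG and JPEG 2000 codestreams, and the sketch does not verify that the preceding bytes are valid.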

dclunie · 2024-02-19T20:57:34Z

I updated the documentation to mention this.

erikogabrielsson · 2024-02-20T15:36:32Z

Thanks for the quick responses @fedorov and @dclunie and for the technical reasons for not including a BOT.

Wsidicom can parse either the BOT or the EOT, or create the index from the pixel data (but only if there is one fragment per frame).

I tested reading WSIs from the IDC S3 bucket like this:

from wsidicom import WsiDicom

# Public IDC bucket and series prefix to read from.
bucket = "idc-open-data/66384e58-9ff9-4e60-94c1-83a906a793d7"
s3_wsi = WsiDicom.open(
    files=f"s3://{bucket}",
    file_options={
        "s3": {
            "anon": True,  # Anonymous access, no credentials.
            "endpoint_url": "https://s3.amazonaws.com",
            "default_cache_type": "first",
        },
    },
)

(Using a development version of wsidicom that supports fsspec).

However, without an index, reading the first tile from the higher-resolution levels takes a long time (minutes), presumably because the random read access is slow. Adding a parser for the tile offsets in the TIFF tags reduced the first tile access time to a few seconds.
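As an illustration of reading tile offsets from TIFF tags, here is a minimal sketch that parses the TileOffsets (324) and TileByteCounts (325) tags from the first IFD of a classic TIFF. It is not the parser added to wsidicom. It assumes a classic TIFF rather than BigTIFF, handles only SHORT and LONG values, and ignores any further IFDs. The function name is hypothetical.

```python
import struct
from io import BytesIO

TILE_OFFSETS, TILE_BYTE_COUNTS = 324, 325
TYPE_FORMATS = {3: "H", 4: "I"}  # SHORT and LONG; other types are not handled


def read_tile_index(stream):
    """Return (offsets, byte_counts) of the tiles in the first IFD of a classic TIFF."""
    stream.seek(0)
    header = stream.read(8)
    order = {b"II": "<", b"MM": ">"}[header[:2]]
    magic, ifd_offset = struct.unpack(order + "HI", header[2:])
    if magic != 42:
        raise ValueError("Not a classic TIFF (BigTIFF uses magic number 43)")
    stream.seek(ifd_offset)
    (entry_count,) = struct.unpack(order + "H", stream.read(2))
    entries = stream.read(12 * entry_count)
    found = {}
    for i in range(entry_count):
        entry = entries[12 * i : 12 * i + 12]
        tag, value_type, count = struct.unpack(order + "HHI", entry[:8])
        if tag not in (TILE_OFFSETS, TILE_BYTE_COUNTS):
            continue
        fmt = order + str(count) + TYPE_FORMATS[value_type]
        size = struct.calcsize(fmt)
        if size <= 4:
            data = entry[8 : 8 + size]  # Values are stored inline in the entry.
        else:
            (value_offset,) = struct.unpack(order + "I", entry[8:])
            stream.seek(value_offset)
            data = stream.read(size)
        found[tag] = list(struct.unpack(fmt, data))
    return found[TILE_OFFSETS], found[TILE_BYTE_COUNTS]


# Synthetic little-endian TIFF: one IFD with two tiles.
demo_tiff = BytesIO(
    struct.pack("<2sHI", b"II", 42, 8)  # Header: first IFD at offset 8.
    + struct.pack("<H", 2)  # Two IFD entries.
    + struct.pack("<HHII", TILE_OFFSETS, 4, 2, 38)  # Values stored at offset 38.
    + struct.pack("<HHIHH", TILE_BYTE_COUNTS, 3, 2, 10, 12)  # Inline values.
    + struct.pack("<I", 0)  # No next IFD.
    + struct.pack("<2I", 46, 56)  # The two tile offsets.
)
demo_tile_index = read_tile_index(demo_tiff)  # ([46, 56], [10, 12])
```

The whole index comes from a handful of small reads near the IFD, which is presumably why this approach helps on high-latency storage.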

Closing this, as it was only a question of curiosity. Thanks again.

dclunie · 2024-02-21T12:00:14Z

@erikogabrielsson, reopening, because if this choice is causing problems for some readers, it should be investigated +/- mitigated.

Also created IDC issue "Evaluate impact of absence of frame offset tables (BOT, EOT) in converted DICOM WSI files +/- mitigate" (#1732), but you may not have access to that.

Do you know if wsidicom makes use of random access to binary file mechanisms, or does it read the whole pixel data value to make sense of it in the absence of an offset table? For the offset table to improve performance it must have some ability to seek to the appropriate location (assuming the whole thing is not already loaded into memory or memory-mapped).
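To make the seek-based access concrete, here is a minimal sketch of reading a single frame once an offset is known. It assumes one fragment per frame, a little-endian transfer syntax, and that the caller already knows the file position of the first item after the BOT. The function name and parameters are hypothetical.

```python
import struct
from io import BytesIO

ITEM_TAG = b"\xfe\xff\x00\xe0"  # (FFFE,E000) in little-endian byte order


def read_frame(stream, first_item_position, frame_offset):
    """Read one single-fragment frame using an offset-table entry.

    `first_item_position` is the file position of the first item tag after
    the Basic Offset Table item; `frame_offset` is relative to that position.
    """
    stream.seek(first_item_position + frame_offset)
    tag, length = struct.unpack("<4sI", stream.read(8))
    if tag != ITEM_TAG:
        raise ValueError("Offset does not point at an item tag")
    return stream.read(length)


# Synthetic example: empty BOT item followed by two fragments.
demo_frames = BytesIO(
    ITEM_TAG + struct.pack("<I", 0)
    + ITEM_TAG + struct.pack("<I", 4) + b"AAAA"
    + ITEM_TAG + struct.pack("<I", 6) + b"BBBBBB"
)
demo_second_frame = read_frame(demo_frames, 8, 12)  # b"BBBBBB"
```

With an offset table, a reader touches only the 8-byte header and the frame itself. Without one, it first has to visit every preceding item header, which is cheap on local disk but costs one round trip per read on uncached remote storage.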

erikogabrielsson · 2024-02-21T14:09:36Z

Hi @dclunie

I made a summary of BOT/EOT and concatenation support in the open-source readers for DICOM WSI that I know of and have used:

| Reader | Library | BOT | EOT | Pixel data parsing | Concatenated files |
|---|---|---|---|---|---|
| WsiDicom | | X | X | By fragment length | X |
| Openslide | libdicom | X | X | By fragment length | issue |
| Dicomslide | dicomweb-client | X | | By fragment length | No? |
| Bioformats | | | | By JPEG magic bytes | X (not tested) |

WsiDicom, and I assume all of the above, rely on random access to tiles.

When accessing local files (on fast storage) with WsiDicom, I have observed that the absence of a BOT/EOT increases the load time (including building a tile index) from <50 ms to >200 ms (DICOM-converted file of CMU-1.svs). For S3 storage it goes from <5 s to >40 s, but I have yet to determine how much this can be improved by the caching strategy.

fedorov assigned dclunie Feb 19, 2024

erikogabrielsson closed this as completed Feb 20, 2024

dclunie reopened this Feb 21, 2024

dclunie assigned DanielaSchacherer, CPBridge and erikogabrielsson Feb 21, 2024
