Missing offset table? #4

Open
erikogabrielsson opened this issue Feb 19, 2024 · 6 comments

@erikogabrielsson

Hi,

I'm reading some WSI files retrieved from the IDC portal that I assume were created using this script.

The files I have looked at so far seem to lack both the basic and the extended offset table. This makes it very slow to read out tiles (using wsidicom). Is there a reason for this?

As the files are dual DICOM/TIFF, I have experimented with reading the tile offsets from the TIFF tags, and that works.

@fedorov
Member

fedorov commented Feb 19, 2024

@erikogabrielsson yes, all of the DICOM WSI files available from IDC at this point (as of IDC data release v17) were created using this script. We will need to wait to hear from David regarding the offset tables.

@dclunie
Collaborator

dclunie commented Feb 19, 2024

AFAIK, the GHC implementation of the DICOM store that IDC uses does not take advantage of the offset tables but computes its own frame index, as I think dcm4chee does too (not 100% sure). So I did not add the BOT (which wouldn't have worked for the larger images anyway, given its 32-bit offset limit) or the EOT (which we added to the standard after I started doing these conversions). The EOT is also a (pair of) very large metadata data element(s), which potentially causes problems with GHC (which has limits on total metadata size and, in general release, doesn't support BulkDataURI).

It is something to think about going forward though, esp. for those who are processing offline files and can make use of the offset tables.
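
To make the 32-bit limit concrete, here is a minimal sketch (not taken from the conversion script) that computes where each frame's item would start and checks whether those offsets still fit in a BOT. It assumes one fragment per frame; the function names `frame_offsets` and `fits_in_basic_offset_table` are hypothetical.

```python
from itertools import accumulate

ITEM_HEADER_LEN = 8            # 4-byte item tag + 4-byte item length
MAX_BOT_OFFSET = 2**32 - 1     # BOT entries are unsigned 32-bit values


def frame_offsets(frame_lengths):
    """Offset of each frame's item tag, counted from the first item
    that follows the offset table item (assumes one fragment per frame)."""
    if not frame_lengths:
        return []
    # Fragments are padded to an even number of bytes.
    padded = [n + (n % 2) for n in frame_lengths]
    item_sizes = [ITEM_HEADER_LEN + n for n in padded]
    # The first frame starts at 0; each later frame starts after all earlier items.
    return [0] + list(accumulate(item_sizes))[:-1]


def fits_in_basic_offset_table(frame_lengths):
    """True if every frame offset can be stored in a 32-bit BOT entry."""
    offsets = frame_offsets(frame_lengths)
    return not offsets or max(offsets) <= MAX_BOT_OFFSET


print(frame_offsets([10, 11, 4]))  # [0, 18, 38]
```

Once the compressed pixel data grows beyond 4 GiB, the later offsets no longer fit, which is the case the 64-bit EOT was added to cover.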

Walking forward skipping to the next fragment (without reading or parsing them) to build their own index should not be that expensive an operation for the consumer. Does wsidicom do that in an optimized manner? The sweet spot is when there is one fragment per frame (same number of fragments as frames); otherwise one needs to detect start and end fragments of frames, preferably without full JPEG/J2K marker syntax parsing, which is a pain (I cheat by walking back looking for EOI or EOI+padding).
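
The walk described above can be sketched roughly as follows. This is an illustration only, not how wsidicom or any other reader implements it. It assumes a little-endian transfer syntax, a seekable binary stream already positioned at the first item of the encapsulated pixel data (the offset table item), and one fragment per frame; `index_fragments` is a hypothetical name.

```python
import io
import struct

ITEM_TAG = b"\xfe\xff\x00\xe0"                # (FFFE,E000), little-endian
SEQUENCE_DELIMITER_TAG = b"\xfe\xff\xdd\xe0"  # (FFFE,E0DD), little-endian


def index_fragments(stream):
    """Return [(value_offset, length), ...] for every fragment.

    Only the 8-byte item headers are read; fragment values are skipped
    with seek(), so no compressed data is loaded or parsed.
    """
    fragments = []
    is_offset_table_item = True  # the first item is the (possibly empty) BOT
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("unexpected end of pixel data")
        tag = header[:4]
        (length,) = struct.unpack("<I", header[4:])
        if tag == SEQUENCE_DELIMITER_TAG:
            return fragments
        if tag != ITEM_TAG:
            raise ValueError(f"unexpected tag bytes {tag!r}")
        if not is_offset_table_item:
            fragments.append((stream.tell(), length))
        is_offset_table_item = False
        stream.seek(length, io.SEEK_CUR)  # skip the value without reading it
```

With one fragment per frame, the returned list is directly usable as a frame index. With several fragments per frame, an extra step is needed to group fragments into frames, which is the harder case mentioned above.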

@dclunie
Collaborator

dclunie commented Feb 19, 2024

I updated the documentation to mention this.

@erikogabrielsson
Author

Thanks for the quick responses @fedorov and @dclunie and for the technical reasons for not including a BOT.

Wsidicom can parse either BOT, EOT, or create the index from the pixel data (but only if one fragment per frame).

I tested reading WSIs from the IDC-S3 bucket like:

from wsidicom import WsiDicom

bucket = "idc-open-data/66384e58-9ff9-4e60-94c1-83a906a793d7"
s3_wsi = WsiDicom.open(
    files=f"s3://{bucket}",
    file_options={
        "s3": {
            "anon": True,
            "endpoint_url": "https://s3.amazonaws.com",
            "default_cache_type": "first"
        },
    }
)

(Using a development version of wsidicom that supports fsspec).

However, without an index, reading the first tile from the higher resolution levels takes a long time (minutes), presumably because the random read access is slow. Adding a parser for the tile offsets in the TIFF tags reduced the first tile access time to a few seconds.
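
As a rough illustration of reading tile offsets from the TIFF tags, the sketch below pulls TileOffsets (tag 324) and TileByteCounts (tag 325) out of the first IFD using only the standard library. It is not the parser added to wsidicom. It assumes a little-endian classic TIFF header at the start of the file and handles only the first IFD; BigTIFF, which very large slides may need, is not covered. `read_tile_index` is a hypothetical name.

```python
import io
import struct

TILE_OFFSETS, TILE_BYTE_COUNTS = 324, 325
TYPE_FORMATS = {3: ("H", 2), 4: ("I", 4)}  # TIFF SHORT and LONG


def read_tile_index(stream):
    """Return [(offset, byte_count), ...] for the tiles of the first IFD."""
    stream.seek(0)
    byte_order, magic, ifd_offset = struct.unpack("<2sHI", stream.read(8))
    if byte_order != b"II" or magic != 42:
        raise ValueError("sketch handles little-endian classic TIFF only")
    stream.seek(ifd_offset)
    (entry_count,) = struct.unpack("<H", stream.read(2))
    entries = stream.read(12 * entry_count)  # each IFD entry is 12 bytes
    values = {}
    for i in range(entry_count):
        tag, field_type, count = struct.unpack_from("<HHI", entries, 12 * i)
        if tag not in (TILE_OFFSETS, TILE_BYTE_COUNTS):
            continue
        code, size = TYPE_FORMATS[field_type]
        value_field = entries[12 * i + 8 : 12 * i + 12]
        if count * size <= 4:
            raw = value_field[: count * size]   # values stored inline
        else:
            stream.seek(struct.unpack("<I", value_field)[0])
            raw = stream.read(count * size)     # values stored elsewhere
        values[tag] = struct.unpack(f"<{count}{code}", raw)
    return list(zip(values[TILE_OFFSETS], values[TILE_BYTE_COUNTS]))


# Synthetic two-tile TIFF: header, one IFD with two entries, then the arrays.
ifd = (struct.pack("<H", 2)
       + struct.pack("<HHII", TILE_OFFSETS, 4, 2, 38)
       + struct.pack("<HHII", TILE_BYTE_COUNTS, 4, 2, 46)
       + struct.pack("<I", 0))
demo = (struct.pack("<2sHI", b"II", 42, 8) + ifd
        + struct.pack("<2I", 100, 200) + struct.pack("<2I", 50, 60))
print(read_tile_index(io.BytesIO(demo)))  # [(100, 50), (200, 60)]
```

Reading the index this way needs only a handful of small reads, which is why it helps so much on remote storage compared with walking every fragment header.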

Closing this, as it was only a question of curiosity. Thanks again.

@dclunie
Collaborator

dclunie commented Feb 21, 2024

@erikogabrielsson, reopening, because if this choice is causing problems for some readers, it should be investigated +/- mitigated.

I also created IDC issue #1732, "Evaluate impact of absence of frame offset tables (BOT, EOT) in converted DICOM WSI files +/- mitigate", but you may not have access to that.

Do you know if wsidicom makes use of random access to binary file mechanisms, or does it read the whole pixel data value to make sense of it in the absence of an offset table? For the offset table to improve performance it must have some ability to seek to the appropriate location (assuming the whole thing is not already loaded into memory or memory-mapped).
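
To illustrate the point about seeking, here is a minimal sketch of frame access once an index of `(offset, length)` pairs is available, whether it came from a BOT, an EOT, TIFF tags, or a fragment walk. It is not wsidicom's code; `read_frame` and the index layout are assumptions made for the example.

```python
import io


def read_frame(stream, frame_index, frame_number):
    """Read one frame by seeking straight to it.

    frame_index is a list of (absolute_offset, length) pairs. Only the
    requested bytes are read, so the cost does not depend on file size.
    """
    offset, length = frame_index[frame_number]
    stream.seek(offset)
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("frame extends past the end of the stream")
    return data


# Tiny stand-in for a pixel data stream holding three "frames".
demo_stream = io.BytesIO(b"....AAAA..BBBBBB.CC")
demo_index = [(4, 4), (10, 6), (17, 2)]
print(read_frame(demo_stream, demo_index, 1))  # b'BBBBBB'
```

The same pattern works on any seekable file-like object. On remote storage each seek plus read typically becomes a ranged request, so the caching strategy matters as much as the index itself.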

@erikogabrielsson
Author

Hi @dclunie

I made a summary of BOT/EOT and concatenation support for the open-source readers for DICOM WSI that I know of and have used:

| Reader | Library | BOT | EOT | Pixel data parsing | Concatenated files |
| --- | --- | --- | --- | --- | --- |
| WsiDicom | | X | X | By fragment length | X |
| Openslide | libdicom | X | X | By fragment length | issue |
| Dicomslide | dicomweb-client | X | | By fragment length | No? |
| Bioformats | | | | By JPEG magic bytes | X (not tested) |

WsiDicom, and I assume all of the above, rely on random access to tiles.

When accessing local files (on fast storage) with WsiDicom, I have observed that the absence of BOT/EOT increases the load time (including building a tile index) from <50 ms to >200 ms (DICOM-converted file of CMU-1.svs). For S3 storage it goes from <5 s to >40 s, but I have yet to determine how much this can be improved by the caching strategy.
