Missing offset table? #4
Comments
@erikogabrielsson yes, all of the DICOM WSI files available from IDC at this point (as of IDC data release v17) were created using this script. We will need to wait to hear from David regarding the offset tables.
AFAIK, the GHC implementation of the DICOM store that IDC uses does not take advantage of the offset tables but computes its own frame index, as I think does dcm4chee (not 100% sure). So I did not add the BOT (which wouldn't have worked for the larger images anyway, given its 32-bit offset limit) or the EOT (which we added to the standard after I started doing these conversions). The EOT is also a (pair of) very large metadata data element(s), which potentially causes problems with GHC (which has limits on total metadata size and, in general release, doesn't support BulkDataURI).
It is something to think about going forward though, especially for those who are processing offline files and can make use of the offset tables. Walking forward and skipping to the next fragment (without reading or parsing the fragments) to build an index should not be that expensive an operation for the consumer. Does wsidicom do that in an optimized manner? The sweet spot is when there is one fragment per frame (same number of fragments as frames); otherwise one needs to detect the start and end fragments of frames, preferably without full JPEG/J2K marker syntax parsing, which is a pain (I cheat by walking back looking for EOI or EOI+padding).
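For readers who want to see what "walking forward" looks like, here is a minimal sketch, not taken from wsidicom or any other reader in this thread. It assumes little-endian encapsulated pixel data, a stream already positioned at the first item (the Basic Offset Table item, which may be empty), and one fragment per frame. The function name `index_fragments` is made up for illustration.

```python
import io
import struct

ITEM_TAG = b"\xfe\xff\x00\xe0"       # (FFFE,E000) Item, little endian
SEQ_DELIM_TAG = b"\xfe\xff\xdd\xe0"  # (FFFE,E0DD) Sequence Delimitation Item


def index_fragments(stream):
    """Return [(data_offset, length), ...] for every fragment.

    Assumes `stream` is positioned at the Basic Offset Table item.
    Only the 8-byte item headers are read; fragment data is skipped.
    """
    fragments = []
    is_offset_table = True  # the first item is the BOT, not a fragment
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("unexpected end of pixel data")
        tag = header[:4]
        (length,) = struct.unpack("<I", header[4:])
        if tag == SEQ_DELIM_TAG:
            break
        if tag != ITEM_TAG:
            raise ValueError(f"unexpected tag {tag.hex()}")
        if not is_offset_table:
            fragments.append((stream.tell(), length))
        is_offset_table = False
        stream.seek(length, io.SEEK_CUR)  # skip the data without reading it
    return fragments
```

With one fragment per frame, entry `i` of the returned list is frame `i`. On remote storage each header read is a separate small request unless the file layer caches, which may explain part of the slow first-tile access reported below.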
I updated the documentation to mention this.
Thanks for the quick responses, @fedorov and @dclunie, and for the technical reasons for not including a BOT. Wsidicom can parse either the BOT or the EOT, or create the index from the pixel data (but only if there is one fragment per frame). I tested reading WSIs from the IDC S3 bucket like this:

    from wsidicom import WsiDicom

    bucket = "idc-open-data/66384e58-9ff9-4e60-94c1-83a906a793d7"
    s3_wsi = WsiDicom.open(
        files=f"s3://{bucket}",
        file_options={
            "s3": {
                "anon": True,
                "endpoint_url": "https://s3.amazonaws.com",
                "default_cache_type": "first",
            },
        },
    )

(Using a development version of wsidicom that supports fsspec.) However, without an index, reading the first tile from the higher-resolution levels takes a long time (minutes), presumably because random read access is slow. Adding a parser for the tile offsets in the TIFF tags reduced the first-tile access time to a few seconds. Closing this, as it was only a question of curiosity. Thanks again.
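As an illustration of reading tile offsets from the TIFF tags, here is a rough standard-library sketch. It is not the wsidicom parser. It assumes a classic (non-BigTIFF) little-endian TIFF whose header sits at byte 0 of the file, reads only the first IFD, and handles only SHORT and LONG values; the name `read_tile_index` is made up for illustration.

```python
import io
import struct

TILE_OFFSETS, TILE_BYTE_COUNTS = 324, 325  # baseline TIFF tag numbers
TYPE_FORMATS = {3: ("H", 2), 4: ("I", 4)}  # SHORT and LONG field types


def read_tile_index(stream):
    """Return [(offset, byte_count), ...] for the tiles of the first IFD."""
    stream.seek(0)
    byte_order, magic, ifd_offset = struct.unpack("<2sHI", stream.read(8))
    if byte_order != b"II" or magic != 42:
        raise ValueError("sketch handles classic little-endian TIFF only")
    stream.seek(ifd_offset)
    (entry_count,) = struct.unpack("<H", stream.read(2))
    entries = stream.read(12 * entry_count)
    values = {}
    for i in range(entry_count):
        tag, field_type, count, raw = struct.unpack(
            "<HHI4s", entries[12 * i:12 * i + 12]
        )
        if tag not in (TILE_OFFSETS, TILE_BYTE_COUNTS):
            continue
        code, size = TYPE_FORMATS[field_type]
        if count * size > 4:
            # Values do not fit inline; the 4 bytes hold a file offset.
            stream.seek(struct.unpack("<I", raw)[0])
            raw = stream.read(count * size)
        values[tag] = struct.unpack(f"<{count}{code}", raw[:count * size])
    return list(zip(values[TILE_OFFSETS], values[TILE_BYTE_COUNTS]))
```

Each returned pair can be used directly for a seek-and-read of one compressed tile. Large slides may be written as BigTIFF with 64-bit offsets, which this sketch deliberately rejects.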
@erikogabrielsson, reopening, because if this choice is causing problems for some readers, it should be investigated +/- mitigated. I also created IDC issue "Evaluate impact of absence of frame offset tables (BOT, EOT) in converted DICOM WSI files +/- mitigate" #1732, but you may not have access to that.
Do you know if wsidicom makes use of random-access mechanisms for binary files, or does it read the whole pixel data value to make sense of it in the absence of an offset table? For the offset table to improve performance, the reader must have some ability to seek to the appropriate location (assuming the whole thing is not already loaded into memory or memory-mapped).
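To make the seek requirement concrete, here is a small sketch of how a reader might use a Basic Offset Table once it exists. It assumes a seekable stream positioned at the BOT item and one fragment per frame. Per the DICOM encapsulation rules, BOT entries are 32-bit offsets measured from the first byte of the first item tag after the BOT. The function names are made up for illustration and are not from any library.

```python
import io
import struct

ITEM_TAG = b"\xfe\xff\x00\xe0"  # (FFFE,E000) Item, little endian


def read_item_header(stream):
    """Read one 8-byte item header and return the item length."""
    header = stream.read(8)
    if len(header) < 8 or header[:4] != ITEM_TAG:
        raise ValueError("expected an item tag")
    return struct.unpack("<I", header[4:])[0]


def read_basic_offset_table(stream):
    """Return (base, offsets); `base` is where the first fragment item starts."""
    length = read_item_header(stream)
    raw = stream.read(length)
    offsets = list(struct.unpack(f"<{length // 4}I", raw))
    return stream.tell(), offsets


def read_frame(stream, base, offset):
    """Seek straight to one frame (single-fragment assumption) and read it."""
    stream.seek(base + offset)
    length = read_item_header(stream)
    return stream.read(length)
```

Reading frame `i` is then `read_frame(stream, base, offsets[i])`: one seek and two short reads, regardless of how many frames precede it. Because each entry is 32 bits wide, offsets beyond 4 GiB cannot be represented, which is the limit mentioned earlier in the thread.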
Hi @dclunie, I made a summary of BOT/EOT and concatenation support for the open-source readers for DICOM WSI that I know of and have used:
WsiDicom, and I assume all of the above, relies on random access to tiles. When accessing local files (on fast storage) with WsiDicom, I have observed that the absence of a BOT/EOT increased the load time (i.e. including building a tile index) from <50 ms to >200 ms (DICOM-converted file of CMU-1.svs). For S3 storage it goes from <5 s to >40 s, but I have yet to determine how much this can be improved by the caching strategy.
Hi,
I'm reading some WSI files retrieved from the IDC portal that I assume were created using this script.
The files I have looked at so far seem to lack both the basic and the extended offset table. This makes it very slow to read out tiles (using wsidicom). Is there a reason for this?
As the files are dual DICOM/TIFF, I have experimented with reading the tile offsets from the TIFF tags, and that works.