-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How does CytoTable handle duplicate ImageNumbers? #30
Comments
Thank you @shntnu ! I don't believe this has been addressed yet in this project and I'll look into how this might be implemented. In addition to |
I think that would be fine. To clarify, both these hashes serve the goal of creating a unique ID for the compartments. md5sum <(echo s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/)
# fcc94af68878c557192a6d8c3d069cb7 /dev/fd/63
md5sum <(aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Image.csv -)
# 9a7a2deed81a80ebacb139e7a249e3ab /dev/fd/63 We'd prefer the second when possible because although both are unique, the second one is path invariant (but it doesn't matter too much, just that in the first option the cc'ing @bethac07 to keep her posted because this relates to CellProfiler output. |
Hi @shntnu and @gwaybio, I've been thinking about how to implement new features into CytoTable in order to add TableNumber. I wanted to ask for your input on what would be best when it comes to SQLite data sources and generating hashes. I can see two approaches:
Option 1. seems most in alignment with current procedures but could come with performance impacts (unsure of the most performant way to generate a hash from the SQLite data before an export to other formats). Option 2. would deviate from "table-focused" alignment of the hash data, but could come with a performance benefit. Any thoughts or preferences here? Thank you in advance for your consideration on this. |
@d33bs hm - I don't think I followed what you proposed. Neither of the options you propose would generate a hash per Note that typically, but not always, an We want to create a hash per Does that make sense? |
Thanks @shntnu for the greater clarity here! I wanted to outline what I understand of your thoughts and how these might extrapolate to SQLite data sources in addition to CSV's (attempting to maintain uniformity in behavior). flowchart LR
images[(images)]
cellprofiler["CellProfiler"]
subgraph cellprofiler_csv
image_csv[("image.csv")]
compartment_csv[("compartment(s).csv")]
end
subgraph cellprofiler_sqlite
image_tbl[("image.tbl")]
compartment_tbl[("compartment(s).tbl")]
end
cytotable_hash_extract["CytoTable \n creates a hash"]
cytotable_extract["CytoTable \n extracts full data \n with appended hash column"]
image_parquet[(image.parquet)]
compartments_parquet[("compartments(s).parquet")]
images --> cellprofiler
cellprofiler --> cellprofiler_csv
cellprofiler --> cellprofiler_sqlite
image_csv --> | infer a tablenumber hash \n from image.csv | cytotable_hash_extract
image_tbl --> | infer a tablenumber hash \n from image table | cytotable_hash_extract
cytotable_hash_extract -.-> | afterwards, extract all data \n from cellprofiler sources | cytotable_extract
cytotable_extract --> | image file or table hash is \n added to image data | image_parquet
cytotable_extract --> | image file or table hash is \n added to compartment data | compartments_parquet
With this diagram I'm trying to illustrate that the "tablenumber" hash from the image csv or SQLite table is added to image and compartment data in an effort to retain data relationships (especially when parsing multiple nested groups of files or SQLite files). I've left out some lines connecting the cellprofiler data sources to elements further on the right side to help highlight the hashing elements specifically. One challenge with the SQLite based image table approach is that we may be forced to read the data from the table in order to generate the hash (incurring performance costs and branching complexity within the code). Does this align with your thoughts on how this should work? Very open to input here and seeking to help make this work the best for you and the community. |
@d33bs Thanks for the diagram and for clarifying how all this fits in. I now understand the logic of Option 2 in your previous comment #30 (comment). Indeed, Option 2 will work fine. Here's why: CellProfiler's Therefore, the SQLite file hash will be just as identifiable as the Image table hash. I was previously confused because I thought the SQLite file we are talking about here is the one we create after ingesting, but it is, in fact, the SQLite file that is the output of Please proceed with Option 2:
|
Will do, thank you @shntnu ! |
Hi @shntnu, in working towards changes related to this I found a discrepancy in the current way CytoTable operates vs Cytominer-database. Cytominer-database utilizes various test data and techniques for eliminating candidates for data processing. Some of these tests are used with CytoTable to help ensure similar functionality where desired. One of these test datasets, data_b, shows discrepancies between Cytominer-database and CytoTable functionality. Cytominer-database removes directory J21-2 entirely due to Cells.csv being invalid (I believe due to csvkit flagging the file as having mismatched row value counts on line 2 here). CytoTable removes only Cells.csv (and not the other compartment or image tables) due to similar issues with the file (via DuckDB errors in reading the file, which manifest as read errors and are filtered from output downstream of here). I noticed this when testing and comparing results of Cytominer-database and CytoTable, which I'm using to judge to ensure data output matches expectations (where possible). The difference in these processing methods results in a mismatch for TableNumber production. Question: Should CytoTable remove the entirety of a directory of source data (all compartments + image table data) when a single file shows issues in that directory? If it makes more sense to discuss this outside of this issue, I can open a new issue to discuss which data are invalid vs not. |
Sorry for taking so long to respond to this
I think no, it should not drop. When analyzing the data downstream, one can always explicitly drop such cases. |
No worries, and thanks @shntnu ! We will stick with the existing functionality in CytoTable to drop individual compartments (files or tables) where there are errors and not an entire directory of image + compartment data. |
ImageNumber
is guaranteed to be unique only within anImage.csv
but not across allImage.csv
files in a dataset.Does
pycytominer-transform
handle this correctly?In
cytominer-database
, we create a column,TableNumber
, to identify rows in theImage
table uniquely.TableNumber
is set to be the hash of theImage.csv
file:https://github.com/cytomining/cytominer-database/blob/5aa00f58e4a31bbbd2a3779c87e7a3620b0030db/cytominer_database/ingest_variable_engine.py#L99
The text was updated successfully, but these errors were encountered: