How does CytoTable handle duplicate ImageNumbers? #30

shntnu · 2023-02-07T22:22:45Z

ImageNumber is guaranteed to be unique only within an Image.csv but not across all Image.csv files in a dataset.

Does pycytominer-transform handle this correctly?

In cytominer-database, we create a column, TableNumber, to identify rows in the Image table uniquely. TableNumber is set to be the hash of the Image.csv file:

https://github.com/cytomining/cytominer-database/blob/5aa00f58e4a31bbbd2a3779c87e7a3620b0030db/cytominer_database/ingest_variable_engine.py#L99

The text was updated successfully, but these errors were encountered:

d33bs · 2023-02-08T15:19:04Z

Thank you @shntnu ! I don't believe this has been addressed yet in this project and I'll look into how this might be implemented. In addition to TableNumber, and acknowledging some of the s3://cellpainting-gallery object structures, would it make sense to utilize "directory" paths as another way to retain provenance and uniqueness across ImageNumber's for compartment objects? I'm wondering about this especially for scenarios where we may not have access to the image files themselves.

shntnu · 2023-02-09T01:15:30Z

would it make sense to utilize "directory" paths as another way to retain provenance and uniqueness across ImageNumber's for compartment objects? I'm wondering about this especially for scenarios where we may not have access to the image files themselves.

I think that would be fine.

To clarify, both these hashes serve the goal of creating a unique ID for the compartments.

md5sum <(echo s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/)
# fcc94af68878c557192a6d8c3d069cb7  /dev/fd/63

md5sum <(aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Image.csv -)
# 9a7a2deed81a80ebacb139e7a249e3ab  /dev/fd/63

We'd prefer the second when possible because although both are unique, the second one is path invariant (but it doesn't matter too much, just that in the first option the TableNumber is a function of the path, which is a bit odd but not wrong)

cc'ing @bethac07 to keep her posted because this relates to CellProfiler output.

d33bs · 2023-05-18T18:55:00Z

Hi @shntnu and @gwaybio, I've been thinking about how to implement new features into CytoTable in order to add TableNumber. I wanted to ask for your input on what would be best when it comes to SQLite data sources and generating hashes. I can see two approaches:

Create a unique hash per SQLite table, meaning we must read the data inside the table in order to generate the hash (as it is not a standalone file).
Create a unique hash per SQLite database, using the same hash (from the .sqlite or .db file) for each compartment and metadata table result.

Option 1. seems most in alignment with current procedures but could come with performance impacts (unsure of the most performant way to generate a hash from the SQLite data before an export to other formats). Option 2. would deviate from "table-focused" alignment of the hash data, but could come with a performance benefit.

Any thoughts or preferences here? Thank you in advance for your consideration on this.

shntnu · 2023-06-09T17:37:28Z

@d33bs hm - I don't think I followed what you proposed.

Neither of the options you propose would generate a hash per Image.csv, correct?

Note that typically, but not always, an Image.csv file generated by CellProfiler has multiple rows, one corresponding to each image processed in that set.

We want to create a hash per Image.csv so that the hash, together with the ImageNumber, can uniquely identify each row of the Image table that would result from ingesting all the Image.csvs from a dataset.

Does that make sense?

d33bs · 2023-06-22T19:23:41Z

Thanks @shntnu for the greater clarity here! I wanted to outline what I understand of your thoughts and how these might extrapolate to SQLite data sources in addition to CSV's (attempting to maintain uniformity in behavior).

flowchart LR

images[(images)]
cellprofiler["CellProfiler"]
subgraph cellprofiler_csv
  image_csv[("image.csv")]
  compartment_csv[("compartment(s).csv")]
end
subgraph cellprofiler_sqlite
  image_tbl[("image.tbl")]
  compartment_tbl[("compartment(s).tbl")]
end

cytotable_hash_extract["CytoTable \n creates a hash"]
cytotable_extract["CytoTable \n extracts full data \n with appended hash column"]

image_parquet[(image.parquet)]
compartments_parquet[("compartments(s).parquet")]

images --> cellprofiler

cellprofiler --> cellprofiler_csv
cellprofiler --> cellprofiler_sqlite

image_csv --> | infer a tablenumber hash \n from image.csv | cytotable_hash_extract
image_tbl --> | infer a tablenumber hash \n from image table | cytotable_hash_extract

cytotable_hash_extract -.-> | afterwards, extract all data \n from cellprofiler sources | cytotable_extract

cytotable_extract --> | image file or table hash is \n added to image data | image_parquet
cytotable_extract --> | image file or table hash is \n added to compartment data | compartments_parquet

With this diagram I'm trying to illustrate that the "tablenumber" hash from the image csv or SQLite table is added to image and compartment data in an effort to retain data relationships (especially when parsing multiple nested groups of files or SQLite files). I've left out some lines connecting the cellprofiler data sources to elements further on the right side to help highlight the hashing elements specifically.

One challenge with the SQLite based image table approach is that we may be forced to read the data from the table in order to generate the hash (incurring performance costs and branching complexity within the code).

Does this align with your thoughts on how this should work? Very open to input here and seeking to help make this work the best for you and the community.

shntnu · 2023-06-23T15:50:50Z

@d33bs Thanks for the diagram and for clarifying how all this fits in.

I now understand the logic of Option 2 in your previous comment #30 (comment).

Indeed, Option 2 will work fine. Here's why:

CellProfiler's ExportToDatabase module will create a single SQLite file (assuming we are exporting to SQLite; we can ignore server-based backends entirely) for each run, just as it would create a single set of CSVs for each run in ExportToSpreadsheet.

Therefore, the SQLite file hash will be just as identifiable as the Image table hash.

I was previously confused because I thought the SQLite file we are talking about here is the one we create after ingesting, but it is, in fact, the SQLite file that is the output of ExportToDatabase.

Please proceed with Option 2:

Create a unique hash per SQLite database, using the same hash (from the .sqlite or .db file) for each compartment and metadata table result.

d33bs · 2023-06-23T18:29:25Z

Will do, thank you @shntnu !

d33bs · 2023-08-16T20:04:17Z

Hi @shntnu, in working towards changes related to this I found a discrepancy in the current way CytoTable operates vs Cytominer-database. Cytominer-database utilizes various test data and techniques for eliminating candidates for data processing. Some of these tests are used with CytoTable to help ensure similar functionality where desired. One of these test datasets, data_b, shows discrepancies between Cytominer-database and CytoTable functionality.

Cytominer-database removes directory J21-2 entirely due to Cells.csv being invalid (I believe due to csvkit flagging the file as having mismatched row value counts on line 2 here). CytoTable removes only Cells.csv (and not the other compartment or image tables) due to similar issues with the file (via DuckDB errors in reading the file, which manifest as read errors and are filtered from output downstream of here).

I noticed this when testing and comparing results of Cytominer-database and CytoTable, which I'm using to judge to ensure data output matches expectations (where possible). The difference in these processing methods results in a mismatch for TableNumber production. Question: Should CytoTable remove the entirety of a directory of source data (all compartments + image table data) when a single file shows issues in that directory? If it makes more sense to discuss this outside of this issue, I can open a new issue to discuss which data are invalid vs not.

shntnu · 2024-03-28T12:38:50Z

Sorry for taking so long to respond to this

Should CytoTable remove the entirety of a directory of source data (all compartments + image table data) when a single file shows issues in that directory? If it makes more sense to discuss this outside of this issue, I can open a new issue to discuss which data are invalid vs not.

I think no, it should not drop. When analyzing the data downstream, one can always explicitly drop such cases.

d33bs · 2024-03-29T16:49:02Z

No worries, and thanks @shntnu ! We will stick with the existing functionality in CytoTable to drop individual compartments (files or tables) where there are errors and not an entire directory of image + compartment data.

shntnu added the question Further information is requested label Feb 7, 2023

d33bs self-assigned this Apr 5, 2023

d33bs added the enhancement New feature or request label May 18, 2023

d33bs mentioned this issue May 19, 2023

Add ImageNumber distinction through TableNumber capabilities #59

Closed

13 tasks

gwaybio added this to the v0.0.1 milestone Jul 26, 2023

d33bs changed the title ~~How does pycytominer-transform handle duplicate ImageNumbers?~~ How does CytoTable handle duplicate ImageNumbers? Sep 24, 2023

shntnu mentioned this issue Mar 23, 2024

fix: SQLite joins should be on ImageNumber,TableNumber and not ImageNumber cytomining/pycytominer#378

Merged

10 tasks

This was referenced Apr 12, 2024

Verify how large floating-point numbers are exported #187

Open

Add tablenumber capabilities to distinguish repeated image numbers from multiple datasets #188

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How does CytoTable handle duplicate ImageNumbers? #30

How does CytoTable handle duplicate ImageNumbers? #30

shntnu commented Feb 7, 2023

d33bs commented Feb 8, 2023

shntnu commented Feb 9, 2023 •

edited

Loading

d33bs commented May 18, 2023 •

edited

Loading

shntnu commented Jun 9, 2023

d33bs commented Jun 22, 2023

shntnu commented Jun 23, 2023 •

edited

Loading

d33bs commented Jun 23, 2023

d33bs commented Aug 16, 2023 •

edited

Loading

shntnu commented Mar 28, 2024

d33bs commented Mar 29, 2024

How does CytoTable handle duplicate ImageNumbers? #30

How does CytoTable handle duplicate ImageNumbers? #30

Comments

shntnu commented Feb 7, 2023

d33bs commented Feb 8, 2023

shntnu commented Feb 9, 2023 • edited Loading

d33bs commented May 18, 2023 • edited Loading

shntnu commented Jun 9, 2023

d33bs commented Jun 22, 2023

shntnu commented Jun 23, 2023 • edited Loading

d33bs commented Jun 23, 2023

d33bs commented Aug 16, 2023 • edited Loading

shntnu commented Mar 28, 2024

d33bs commented Mar 29, 2024

shntnu commented Feb 9, 2023 •

edited

Loading

d33bs commented May 18, 2023 •

edited

Loading

shntnu commented Jun 23, 2023 •

edited

Loading

d33bs commented Aug 16, 2023 •

edited

Loading