Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How does CytoTable handle duplicate ImageNumbers? #30

Open
shntnu opened this issue Feb 7, 2023 · 10 comments · May be fixed by #188
Open

How does CytoTable handle duplicate ImageNumbers? #30

shntnu opened this issue Feb 7, 2023 · 10 comments · May be fixed by #188
Assignees
Labels
enhancement New feature or request question Further information is requested
Milestone

Comments

@shntnu
Copy link
Member

shntnu commented Feb 7, 2023

ImageNumber is guaranteed to be unique only within an Image.csv but not across all Image.csv files in a dataset.

Does pycytominer-transform handle this correctly?

In cytominer-database, we create a column, TableNumber, to identify rows in the Image table uniquely. TableNumber is set to be the hash of the Image.csv file:

https://github.com/cytomining/cytominer-database/blob/5aa00f58e4a31bbbd2a3779c87e7a3620b0030db/cytominer_database/ingest_variable_engine.py#L99

@shntnu shntnu added the question Further information is requested label Feb 7, 2023
@d33bs
Copy link
Member

d33bs commented Feb 8, 2023

Thank you @shntnu ! I don't believe this has been addressed yet in this project and I'll look into how this might be implemented. In addition to TableNumber, and acknowledging some of the s3://cellpainting-gallery object structures, would it make sense to utilize "directory" paths as another way to retain provenance and uniqueness across ImageNumber's for compartment objects? I'm wondering about this especially for scenarios where we may not have access to the image files themselves.

@shntnu
Copy link
Member Author

shntnu commented Feb 9, 2023

would it make sense to utilize "directory" paths as another way to retain provenance and uniqueness across ImageNumber's for compartment objects? I'm wondering about this especially for scenarios where we may not have access to the image files themselves.

I think that would be fine.

To clarify, both these hashes serve the goal of creating a unique ID for the compartments.

md5sum <(echo s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/)
# fcc94af68878c557192a6d8c3d069cb7  /dev/fd/63

md5sum <(aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Image.csv -)
# 9a7a2deed81a80ebacb139e7a249e3ab  /dev/fd/63

We'd prefer the second when possible because although both are unique, the second one is path invariant (but it doesn't matter too much, just that in the first option the TableNumber is a function of the path, which is a bit odd but not wrong)

cc'ing @bethac07 to keep her posted because this relates to CellProfiler output.

@d33bs d33bs self-assigned this Apr 5, 2023
@d33bs
Copy link
Member

d33bs commented May 18, 2023

Hi @shntnu and @gwaybio, I've been thinking about how to implement new features into CytoTable in order to add TableNumber. I wanted to ask for your input on what would be best when it comes to SQLite data sources and generating hashes. I can see two approaches:

  1. Create a unique hash per SQLite table, meaning we must read the data inside the table in order to generate the hash (as it is not a standalone file).
  2. Create a unique hash per SQLite database, using the same hash (from the .sqlite or .db file) for each compartment and metadata table result.

Option 1. seems most in alignment with current procedures but could come with performance impacts (unsure of the most performant way to generate a hash from the SQLite data before an export to other formats). Option 2. would deviate from "table-focused" alignment of the hash data, but could come with a performance benefit.

Any thoughts or preferences here? Thank you in advance for your consideration on this.

@d33bs d33bs added the enhancement New feature or request label May 18, 2023
@shntnu
Copy link
Member Author

shntnu commented Jun 9, 2023

@d33bs hm - I don't think I followed what you proposed.

Neither of the options you propose would generate a hash per Image.csv, correct?

Note that typically, but not always, an Image.csv file generated by CellProfiler has multiple rows, one corresponding to each image processed in that set.

We want to create a hash per Image.csv so that the hash, together with the ImageNumber, can uniquely identify each row of the Image table that would result from ingesting all the Image.csvs from a dataset.

Does that make sense?

@d33bs
Copy link
Member

d33bs commented Jun 22, 2023

Thanks @shntnu for the greater clarity here! I wanted to outline what I understand of your thoughts and how these might extrapolate to SQLite data sources in addition to CSV's (attempting to maintain uniformity in behavior).

flowchart LR

images[(images)]
cellprofiler["CellProfiler"]
subgraph cellprofiler_csv
  image_csv[("image.csv")]
  compartment_csv[("compartment(s).csv")]
end
subgraph cellprofiler_sqlite
  image_tbl[("image.tbl")]
  compartment_tbl[("compartment(s).tbl")]
end

cytotable_hash_extract["CytoTable \n creates a hash"]
cytotable_extract["CytoTable \n extracts full data \n with appended hash column"]

image_parquet[(image.parquet)]
compartments_parquet[("compartments(s).parquet")]

images --> cellprofiler

cellprofiler --> cellprofiler_csv
cellprofiler --> cellprofiler_sqlite

image_csv --> | infer a tablenumber hash \n from image.csv | cytotable_hash_extract
image_tbl --> | infer a tablenumber hash \n from image table | cytotable_hash_extract

cytotable_hash_extract -.-> | afterwards, extract all data \n from cellprofiler sources | cytotable_extract

cytotable_extract --> | image file or table hash is \n added to image data | image_parquet
cytotable_extract --> | image file or table hash is \n added to compartment data | compartments_parquet
Loading

With this diagram I'm trying to illustrate that the "tablenumber" hash from the image csv or SQLite table is added to image and compartment data in an effort to retain data relationships (especially when parsing multiple nested groups of files or SQLite files). I've left out some lines connecting the cellprofiler data sources to elements further on the right side to help highlight the hashing elements specifically.

One challenge with the SQLite based image table approach is that we may be forced to read the data from the table in order to generate the hash (incurring performance costs and branching complexity within the code).

Does this align with your thoughts on how this should work? Very open to input here and seeking to help make this work the best for you and the community.

@shntnu
Copy link
Member Author

shntnu commented Jun 23, 2023

@d33bs Thanks for the diagram and for clarifying how all this fits in.

I now understand the logic of Option 2 in your previous comment #30 (comment).

Indeed, Option 2 will work fine. Here's why:

CellProfiler's ExportToDatabase module will create a single SQLite file (assuming we are exporting to SQLite; we can ignore server-based backends entirely) for each run, just as it would create a single set of CSVs for each run in ExportToSpreadsheet.

Therefore, the SQLite file hash will be just as identifiable as the Image table hash.

I was previously confused because I thought the SQLite file we are talking about here is the one we create after ingesting, but it is, in fact, the SQLite file that is the output of ExportToDatabase.

Please proceed with Option 2:

Create a unique hash per SQLite database, using the same hash (from the .sqlite or .db file) for each compartment and metadata table result.

@d33bs
Copy link
Member

d33bs commented Jun 23, 2023

Will do, thank you @shntnu !

@gwaybio gwaybio added this to the v0.0.1 milestone Jul 26, 2023
@d33bs
Copy link
Member

d33bs commented Aug 16, 2023

Hi @shntnu, in working towards changes related to this I found a discrepancy in the current way CytoTable operates vs Cytominer-database. Cytominer-database utilizes various test data and techniques for eliminating candidates for data processing. Some of these tests are used with CytoTable to help ensure similar functionality where desired. One of these test datasets, data_b, shows discrepancies between Cytominer-database and CytoTable functionality.

Cytominer-database removes directory J21-2 entirely due to Cells.csv being invalid (I believe due to csvkit flagging the file as having mismatched row value counts on line 2 here). CytoTable removes only Cells.csv (and not the other compartment or image tables) due to similar issues with the file (via DuckDB errors in reading the file, which manifest as read errors and are filtered from output downstream of here).

I noticed this when testing and comparing results of Cytominer-database and CytoTable, which I'm using to judge to ensure data output matches expectations (where possible). The difference in these processing methods results in a mismatch for TableNumber production. Question: Should CytoTable remove the entirety of a directory of source data (all compartments + image table data) when a single file shows issues in that directory? If it makes more sense to discuss this outside of this issue, I can open a new issue to discuss which data are invalid vs not.

@d33bs d33bs changed the title How does pycytominer-transform handle duplicate ImageNumbers? How does CytoTable handle duplicate ImageNumbers? Sep 24, 2023
@shntnu
Copy link
Member Author

shntnu commented Mar 28, 2024

Sorry for taking so long to respond to this

Should CytoTable remove the entirety of a directory of source data (all compartments + image table data) when a single file shows issues in that directory? If it makes more sense to discuss this outside of this issue, I can open a new issue to discuss which data are invalid vs not.

I think no, it should not drop. When analyzing the data downstream, one can always explicitly drop such cases.

@d33bs
Copy link
Member

d33bs commented Mar 29, 2024

No worries, and thanks @shntnu ! We will stick with the existing functionality in CytoTable to drop individual compartments (files or tables) where there are errors and not an entire directory of image + compartment data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
3 participants