-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature selected files available, with more row annotations included #48
Conversation
This comment has been minimized.
This comment has been minimized.
Simpler test: No diffs when uncompressed
But binaries differ
and that's because of the date
There must be some git attribute setting to address this |
One solution is to specify the diff rule for
|
@gwaygenomics I just realized that you must have encountered this issue, right? Do you explicitly check for diffs rather than relying on git e.g. |
(there's an additional, separate issue related to this cytomining/pycytominer#82 which affects the GCT file) |
Woah, never knew this was a thing. Thanks for digging into it! So, if I'm understanding correctly, the md5sum for the file we provided CLUE will fail to validate? |
Nope, that aspect of it is fine. It is actually a somewhat minor but annoying issue. If you rerun any notebook that produces a file and then gzips it, git will detect the .gz as a modified file because the hashes are different (because the dates of the compressed data are different, even if the contents the file are identical). See below. We now have a fix! (old notes are collapsed below). We need to use mkdir gittest
cd gittest/
git init .
# Initialized empty Git repository in /private/tmp/gittest/.git/
echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6 x.txt.gz
git add x.txt.gz
git commit -m "add file"
# [master (root-commit) 540f687] add file
# 1 file changed, 0 insertions(+), 0 deletions(-)
# create mode 100644 x.txt.gz
echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
# x.txt.gz already exists -- do you wish to overwrite (y or n)? y
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6 x.txt.gz
git status
# On branch master
# nothing to commit, working tree clean Click to expand old notes
However, if we do these two things:
to this
then although a git status will show that the gz files have been modified, no diffs will be reported because it will use the But this still doesn't solve the issue that a user may inadvertently do a |
@gwaygenomics Please see the updates to the previous comment. We need to modify these lines appropriately so that
Unfortunately pandas does not currently support specifying options if the compression method is gzip, so you'd first need to save as CSV and then compress https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html |
Revert git diff because we no longer need that
What a coincidence! For now, I suggest we take the non-pretty approach from Dan's snippet to address this issue: |
Awesome - great we could track this down! Will this change be made in this PR? |
Can this line lincs-cell-painting/consensus/scripts/nbconverted/build-consensus-signatures.py Lines 206 to 208 in e6852b4
can be replaced with this? cyto_utils.output(
df=consensus_df,
output_filename=consensus_file,
float_format="%5g",
compression=gzip,
) I think yes, and if so, then the changes should be made in In that case, this PR will do 3 things, after the
Note that I recommend doing (2) instead of running all the scripts again (except the consensus one because that's easy) – that's too much work! Bash script # create gz,csv pairs
for gzip_file in `find . -name "*.csv.gz" `;
do
echo $gzip_file,${gzip_file%.*}
done > /tmp/rename.csv
# re-zip the file
parallel -a /tmp/rename.csv -C "," -L 2 \
"echo gunzip {1} && echo gzip -n {2}" |
Yep, we can make that update - indeed i do think it will improve In general, though, we want to avoid processing version 1 lincs-cell-painting with multiple pycytominer versions. i.e. the whole image-based-profiling pipeline was processed with pycytominer@dd064c2185435e19541bafa7c976d55da15cf09e, a small change introduced here for a relatively minor improvement would bump the version. So, unless we then reprocess the whole pipeline with the updated hash, it would be tough to track down which pieces were processed with which pycytominer version. I recommend adding this fix to a version 2 wishlist, and writing the consensus output with the proposed gzip solution in the Is there something I am overlooking here? What do you think? |
I updated my previous comment (looks like we were responding simultaneously :D). My only additional note is that this PR can be part of Version 2; no need to produce the GCT file for Version 1 in any case. Admittedly the GCT and the CSV issue are distinct components, and this PR might be a little too bulky for our liking, but it's not terrible. |
@gwaygenomics I don't think I've fully absorbed #48 (comment). But given that this PR will go in version 2, is it ok I delay looking into this further? Or is that blocking you on your immediate goals? |
From #48 (comment) @gwaygenomics said
Although |
This PR will do 2 things:
Bash script # create gz,csv pairs
for gzip_file in `find . -name "*.csv.gz" `;
do
echo $gzip_file,${gzip_file%.*}
done > /tmp/rename.csv
# re-zip the file
parallel -a /tmp/rename.csv -C "," -L 2 \
"echo gunzip {1} && echo gzip -n {2}" |
a quick heads up @shntnu - I am running through the spherize operations and I noticed a couple things:
What should we do? If we update pycytominer version, we'll likely need to reprocess everything. It'll also enable us to output gzip profiles without timestamps |
This is not a good fix either, because it seems to introduce other, undesirable metadata (see metadata of new file) Original file, i.e. before running the script:
New file, i.e.after running the script:
@gwaygenomics Can you run If that does not have this extra metadata, it wouldn't make sense for us to include this manual |
Actually, coming to think of it, there's no upside to including the fix, because our next version will anyway have the updated pycytominer. I'll ponder and ping again. |
There's literally no upside. I verified that this command did produce any outputs. # copy newly create files to a new folder
cp -r 2016_04_01_a549_48hr_batch1 2016_04_01_a549_48hr_batch1_shsingh
# restore old files
git stash
# do a diff on gz contents
(ls 2016_04_01_a549_48hr_batch1/*.csv.gz) | parallel basename {} | parallel "diff <(gzcat 2016_04_01_a549_48hr_batch1/{}) <(gzcat 2016_04_01_a549_48hr_batch1_shsingh/{})" We are all set. In a future version, we will re-run using the update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - my only question is: do we want to perform feature selection before creating the .gct?
Ah yes, we should! Will do |
So in all, we don't need to bump the pycytominer version if 2. works out fine |
@gwaygenomics bumping to make sure you saw that it's back on your desk to review (no hurry from my end) Edit: I just realized that you're working on the spherizing thingie, so we need to sort that out first. Ping me if it needs anything from me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One suggestion, one small fix, and one question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All set on my end, but I'll let you merge in case you want to make any additional adjustments.
Co-authored-by: Greg Way <gregory.way@gmail.com>
gzip -n
problem Feature selected files available, with more row annotations included #48 (comment)