Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve scanpy compatibility #774

Merged
merged 11 commits into from
Jul 30, 2024
Merged

Improve scanpy compatibility #774

merged 11 commits into from
Jul 30, 2024

Conversation

jashapiro
Copy link
Member

Closes #773

I'm filing this as a draft because I have not yet really tested it, but I wanted to get something up while it was fresh in my mind.

The goal here is to improve the compatibility with scanpy as described in #773, so I have done 4 main things:

  • Use lower case for X_pca and X_umap to match scanpy.
  • Convert the PCA and UMAP matrices to numpy matrices.
  • add a highly_variable column to the var table
  • Add PCA metadata to uns

The last step involves exporting the variance explained data and then importing that separately, with the assumption that the PCA was centered and highly variable genes were used. I could (and probably should) make those assumptions into arguments for the script just to be safe and maybe a bit future-proof.

There are probably also a few places where I am making other assumptions that I should check more explicitly. As I said, a draft...

I also renamed the script to be a bit more generally named, but annoyingly there were apparently enough changes that GitHub is not displaying it as a rename but as a deletion and recreation. Probably because it was run through a code formatter automatically.

@jashapiro jashapiro changed the title Improve scanpy compatilibility Improve scanpy compatibility Jul 29, 2024
Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this all looks good, I just found a few places we might want to add or update checks.

I also think it would be good to test this using the default run IDs that we have in the CCDL profile if you haven't already.

bin/reformat_anndata.py Outdated Show resolved Hide resolved
Comment on lines 89 to 91
adata.var["highly_variable"] = adata.var.gene_ids.isin(
adata.uns["highly_variable_genes"]
)
Copy link
Member

@allyhawkins allyhawkins Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might want to only do this if args.mask_var != "None"? And does this fail if adata.uns["highly_variable_genes"] does not exist? If so, we should add a check for that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct, as I realized when looking at the first comment!

bin/reformat_anndata.py Outdated Show resolved Hide resolved
move_counts_anndata.py --anndata_file ${rna_h5ad_file}
${has_adt ? "move_counts_anndata.py --anndata_file ${feature_h5ad_file}" : ''}
reformat_anndata.py --anndata_file ${rna_h5ad_file}
${has_adt ? "reformat_anndata.py --anndata_file ${feature_h5ad_file}" : ''}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to include the PCA information in these objects too like you do for the processed individual objects? I believe we do re-calculate PCs after merging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't planning to change anything for the merged files. We don't move the reduced dims to X_PCA and X_UMAP for those, do we?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do move them in those objects.

reducedDimNames(sce) <- glue::glue("X_{reducedDimNames(sce)}")

I can see both arguments, but I think for consistency between objects we would want to do this for merged objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should make the names match. I think I don't want to add the .uns["pca"] object for merged data though, since we use a different function for the PCA calculation for merged data that I am not sure would be equivalent to what ScanPy would do. So I think probably better to just leave that off?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense. So for the merged object the only change should be the lower case naming.

jashapiro and others added 3 commits July 29, 2024 16:51
Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>
@jashapiro
Copy link
Member Author

Test run here completed successfully, and at first examination the outputs are as expected.

Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants