
Post process add mondo utils #22

Merged: 23 commits into main, Jun 6, 2024
Conversation

@leokim-l (Member)

@justaddcoffee please try and see if you can run the whole pipeline with 2 (or a few) phenopackets.

You may have to create test input and output folders to be passed as CLI arguments to pheval. The input folder should contain a folder called phenopacket-store, which in turn contains another folder holding the 2 phenopackets.
The jar file may give issues. I have been working on the Java code and can update the jar substantially soon.

@caufieldjh (Member)

May make sense to include input phenopackets as test fixtures, in tests/input

@caufieldjh (Member)

Plus the corresponding prompts, of course.
I see that by default the runner will just process all prompts in /prompts even if it isn't provided the corresponding phenopacket JSON.

@caufieldjh (Member)

Also, it looks like phenopackets2prompt.jar is only producing en and es prompts, though the output includes empty files for the full set of five languages. As a result, malco raises a FileNotFoundError when it can't find the prompts and a pandas.errors.EmptyDataError when it tries to parse the results.
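One way a runner could guard against these zero-byte placeholder files is a simple pre-check before parsing. This is a minimal sketch, not malco's actual code; the function name is made up:

```python
import os

def has_parseable_results(path: str) -> bool:
    """Return True only if `path` exists and is non-empty, i.e. safe to
    hand to pandas.read_csv without hitting FileNotFoundError for a
    missing prompt file or pandas.errors.EmptyDataError for a
    zero-byte placeholder."""
    return os.path.exists(path) and os.path.getsize(path) > 0
```

Skipping (or warning on) files that fail this check would let the pipeline proceed for the languages that were actually generated.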

@caufieldjh (Member) left a comment

I ran this on two phenopackets and it didn't get to the point of producing a plot, but it did produce results for en and es. Results below:

```
label	term	score	rank
PMID_15266616_3.json_en-prompt.txt	MONDO:0016512	1.0	1
PMID_15266616_3.json_en-prompt.txt	MONDO:0018997	0.5	2
PMID_15266616_3.json_en-prompt.txt	MONDO:0016033	0.3333333333333333	3
PMID_15266616_3.json_en-prompt.txt	MONDO:0019188	0.25	4
PMID_15266616_3.json_en-prompt.txt	MONDO:0008678	0.2	5
PMID_15266616_3.json_en-prompt.txt	MONDO:0009341	0.16666666666666666	6
PMID_15266616_3.json_en-prompt.txt	MONDO:0008965	0.14285714285714285	7
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0019200	1.0	1
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0018998	0.5	2
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0019501	0.3333333333333333	3
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0015993	0.25	4
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0019262	0.2	5

label	term	score	rank
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0019200	1.0	1
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0019501	0.5	2
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0018998	0.3333333333333333	3
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0015993	0.25	4
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0015229	0.2	5
PMID_15266616_3.json_es-prompt.txt	MONDO:0018997	1.0	1
PMID_15266616_3.json_es-prompt.txt	MONDO:0008678	0.5	2
PMID_15266616_3.json_es-prompt.txt	MONDO:0016033	0.3333333333333333	3
PMID_15266616_3.json_es-prompt.txt	MONDO:0016512	0.25	4
PMID_15266616_3.json_es-prompt.txt	MONDO:0019188	0.2	5
PMID_15266616_3.json_es-prompt.txt	MONDO:0010035	0.16666666666666666	6
PMID_15266616_3.json_es-prompt.txt	MONDO:0018923	0.14285714285714285	7
PMID_15266616_3.json_es-prompt.txt	MONDO:0008564	0.125	8
PMID_15266616_3.json_es-prompt.txt	MONDO:0009341	0.1111111111111111	9
```

@caufieldjh (Member)

This is confusing me, though - perhaps I'm missing something about how the scores are calculated.
The results above for PMID_15266616_3 indicate that the top hit for the en prompt is MONDO:0016512 (Kabuki syndrome) and the top hit for the es prompt is MONDO:0018997 (Noonan syndrome). The results for each include the other top hit at a different rank, e.g., the en prompt results include MONDO:0018997 as the second result, for half score.

But according to the source phenopacket, the correct diagnosis is OMIM:147791 (Jacobsen syndrome), or MONDO:0007838. This isn't among the predicted diagnoses at all.
Which output, if any, will tell me that?

@leokim-l (Member, Author)

> This is confusing me, though - perhaps I'm missing something about how the scores are calculated. [...] But according to the source phenopacket, the correct diagnosis is OMIM:147791 (Jacobsen syndrome), or MONDO:0007838. This isn't among the predicted diagnoses at all. Which output, if any, will tell me that?

These results come from this function:
https://github.com/monarch-initiative/malco/blob/2ad4f452bee194e25f71210654af982388646fcf/src/malco/post_process/post_process_results_format.py#L27

whereas the comparison with the correct result is done here:
https://github.com/monarch-initiative/malco/blob/2ad4f452bee194e25f71210654af982388646fcf/src/malco/post_process/compute_mrr.py#L9

Does this answer your question?
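For orientation, a mean reciprocal rank over ranked predictions like these can be sketched as follows. This is a generic illustration, not the code at the links above; the data layout is an assumption:

```python
def reciprocal_rank(ranked_terms: list[str], correct_term: str) -> float:
    """1/rank of the correct term, or 0.0 if it is absent from the
    predictions entirely (as for MONDO:0007838 in the example above)."""
    for rank, term in enumerate(ranked_terms, start=1):
        if term == correct_term:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], str]]) -> float:
    """Average reciprocal rank over (predictions, correct_term) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(preds, correct) for preds, correct in runs) / len(runs)
```

Under this scheme a correct diagnosis that never appears in the predictions contributes 0 to the average, so a missed diagnosis like the Jacobsen syndrome case shows up as a lower MRR rather than as an explicit per-case flag.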

@leokim-l (Member, Author)

> Also looks like the phenopackets2prompt.jar is only producing en and es prompts, though the output includes empty files for the full set of five languages (so malco raises a FileNotFoundError when it can't find the prompts and a pandas.errors.EmptyDataError when it tries to parse the results)

Sorry about this; I added the new .jar in the right place, see 2a30bb1

@leokim-l (Member, Author) commented May 29, 2024

> Plus the corresponding prompts, of course. I see that by default the runner will just process all prompts in /prompts even if it isn't provided the corresponding phenopacket JSON.

I'm not sure I understand this correctly; maybe it is related to your first comment. The runner does compute everything that is in prompts, but the prepare step runs phenopacket2prompt.jar, which populates the /prompts/{language} folder based on what is in phenopacket_store_path. That path is the location of a folder named phenopacket-store inside the input_directory, which is given to pheval via a command line argument. Thus, one can place any set of phenopackets in such a folder and run the whole thing effortlessly. We could provide a test_inputdir, shipped directly with the code, containing:

  1. 10 phenopackets (to play around with)
  2. 1 phenopacket per gene
  3. all phenopackets.

We discussed this with Justin and Peter and agreed we want to change this: go from setup_phenopackets(), which fetches and downloads from the internet (currently from the latest phenopacket store release), to having the phenopackets saved directly in the GitHub repo :)

Let me know if this makes sense!
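The input layout described above can be sketched as follows (folder names other than phenopacket-store are illustrative only):

```
input_directory/                  # passed to pheval as a CLI argument
└── phenopacket-store/
    └── SOME_GENE/                # the store groups phenopackets by gene
        ├── PMID_15266616_3.json
        └── PMID_10874631_II.2.json
```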

@caufieldjh (Member)

> Plus the corresponding prompts, of course. [...]
>
> Not sure I understand this correctly, maybe this is related to your first comment. [...] Let me know if this makes sense!

No worries - I had the full collection of pregenerated prompts in my prompts directory already, and was surprised to see that the runner operated over every phenopacket I had specified on the command line, plus everything else there was a prompt for. This may not be an issue on a fresh install. In the end I did exactly what you suggest - I assembled a test set of 2 phenopackets and ran only on them and their associated prompts.

@caufieldjh (Member)

> This is confusing me, though - perhaps I'm missing something about how the scores are calculated. [...]
>
> These results come from this function:
> https://github.com/monarch-initiative/malco/blob/2ad4f452bee194e25f71210654af982388646fcf/src/malco/post_process/post_process_results_format.py#L27
>
> whereas the comparison with the correct result is done here:
> https://github.com/monarch-initiative/malco/blob/2ad4f452bee194e25f71210654af982388646fcf/src/malco/post_process/compute_mrr.py#L9
>
> Does this answer your question?

I think, for me, the operation was just failing before it got to that point. It seems to be working more as expected now:

[image: malcoplot1]

That's just 2 phenopackets, of course - likely not representative of any real phenomena.

@justaddcoffee (Member)

> @justaddcoffee try and see if you can run the whole pipeline with 2/few phenopackets, please.

Just catching up on this thread - I think the code needs to be tidied up a bit. Glad to help with this, of course. It's pretty difficult to run from scratch in a new directory. As far as I can tell, you have to:

  • find and copy a config.yaml file
  • manually create the output directory
  • find and copy a directory called tool_input_commands (not sure what this does) to the output directory
  • pass a -t '--testdata-dir' argument (not sure what this does either - what exactly is the test data directory?)

and after the above, I still see several runtime errors:

```
testinput/phenopacket-store exists, skipping download.
org.monarchinitiative.phenol.base.PhenolRuntimeException: Error loading JSON
	at org.monarchinitiative.phenol.io.OntologyLoader.loadGraphDocument(OntologyLoader.java:109)
	at org.monarchinitiative.phenol.io.OntologyLoader.loadGraphDocument(OntologyLoader.java:90)
	at org.monarchinitiative.phenol.io.OntologyLoader.loadOntology(OntologyLoader.java:43)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/jtr4v/PythonProject/malco_new/testrun/prompts/en/'
WARNING:ontogpt.clients:llm_gpt4all module not found. GPT4All support will be disabled.
```

(a similar error occurs for es), and finally the MRR calculation fails.

Things work better when I start in the base project directory where everything already exists. The run starts, but I see many errors like this:

```
[ERROR] Could not find it translation for Multiple lentigines (HP:0001003).
```

which we should probably sort out. Other than that, things seem to run correctly (still running).

@caufieldjh (Member)

> WARNING:ontogpt.clients:llm_gpt4all module not found. GPT4All support will be disabled.

That one's on me - it's an OntoGPT warning I keep trying, unsuccessfully, to squash. It shouldn't impact output here, though.

@leokim-l (Member, Author)

A few quick comments (via phone) since I may not be able to work on this until after ESHG:

Regarding the output directory and config.yaml: this is a quick fix; we should simply have them ready to go when one clones.

tool_input_commands is a PhEval thing; in my case there are no issues with it - it gets automatically generated in the output folder, if I remember right.

The test data dir is a PhEval thing; I don't think we are using it.

Regarding the first errors: /prompts should be in the project top directory, or I guess wherever os.system(java...) is run from, whereas the folder phenopacket-store should have a subfolder containing the phenopackets. This is because in the store they are always grouped into folders by gene. I can check with @pnrobinson whether we can edit it so that it will simply (recursively) take all .json files it finds within the folder it is given, without needing a subdirectory.

The Italian errors are due to the missing Italian translation in the public versions; I circulated on Slack a file that also contains translations in Italian and German.

Hope this helps in my absence.
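The recursive collection described above could look something like this. This is only an illustration of the intended directory-walking behavior (phenopacket2prompt itself is Java, and the function name here is made up):

```python
from pathlib import Path

def collect_phenopackets(root: str) -> list[Path]:
    """Recursively gather every phenopacket .json under `root`, so the
    gene-named subfolders used by phenopacket-store no longer matter."""
    return sorted(Path(root).rglob("*.json"))
```

With this behavior, pointing the tool either at phenopacket-store itself or at any flat folder of .json files would work the same way.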

@justaddcoffee merged commit 9fca933 into main on Jun 6, 2024 (1 check passed), and deleted the post_process_add_mondo_utils branch on June 6, 2024 at 09:42.