
Post process add mondo utils #22

Merged: 23 commits into main, Jun 6, 2024
Conversation

@leokim-l (Member)

@justaddcoffee please try and see if you can run the whole pipeline with 2 (or a few) phenopackets.

You may have to create test input and output folders to be passed as CLI arguments to pheval. The input folder should contain a folder called phenopacket-store, which in turn contains another folder holding the 2 phenopackets.
The jar file may give issues. I have been working on the Java code and can update the jar substantially soon.

@caufieldjh (Member)

May make sense to include input phenopackets as test fixtures, in tests/input

@caufieldjh (Member)

Plus the corresponding prompts, of course.
I see that by default the runner will just process all prompts in /prompts even if it isn't provided the corresponding phenopacket JSON.

@caufieldjh (Member)

Also, it looks like phenopackets2prompt.jar is only producing en and es prompts, though the output includes empty files for the full set of five languages. As a result, malco raises a FileNotFoundError when it can't find the prompts and a pandas.errors.EmptyDataError when it tries to parse the results.
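One way a runner could guard against these zero-byte placeholder files is a simple pre-check before parsing. This is a minimal sketch, not malco's actual code; the function name is made up:

```python
import os

def has_parseable_results(path: str) -> bool:
    """Return True only if `path` exists and is non-empty, i.e. safe to
    hand to pandas.read_csv without hitting FileNotFoundError for a
    missing prompt file or pandas.errors.EmptyDataError for a
    zero-byte placeholder."""
    return os.path.exists(path) and os.path.getsize(path) > 0
```

Skipping (or warning on) files that fail this check would let the pipeline proceed for the languages that were actually generated.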

@caufieldjh (Member) left a comment

I ran this on two phenopackets and it didn't get to the point of producing a plot, but it did produce results for en and es. Results below:

```
label	term	score	rank
PMID_15266616_3.json_en-prompt.txt	MONDO:0016512	1.0	1
PMID_15266616_3.json_en-prompt.txt	MONDO:0018997	0.5	2
PMID_15266616_3.json_en-prompt.txt	MONDO:0016033	0.3333333333333333	3
PMID_15266616_3.json_en-prompt.txt	MONDO:0019188	0.25	4
PMID_15266616_3.json_en-prompt.txt	MONDO:0008678	0.2	5
PMID_15266616_3.json_en-prompt.txt	MONDO:0009341	0.16666666666666666	6
PMID_15266616_3.json_en-prompt.txt	MONDO:0008965	0.14285714285714285	7
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0019200	1.0	1
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0018998	0.5	2
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0019501	0.3333333333333333	3
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0015993	0.25	4
PMID_10874631_II.2.json_en-prompt.txt	MONDO:0019262	0.2	5

label	term	score	rank
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0019200	1.0	1
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0019501	0.5	2
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0018998	0.3333333333333333	3
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0015993	0.25	4
PMID_10874631_II.2.json_es-prompt.txt	MONDO:0015229	0.2	5
PMID_15266616_3.json_es-prompt.txt	MONDO:0018997	1.0	1
PMID_15266616_3.json_es-prompt.txt	MONDO:0008678	0.5	2
PMID_15266616_3.json_es-prompt.txt	MONDO:0016033	0.3333333333333333	3
PMID_15266616_3.json_es-prompt.txt	MONDO:0016512	0.25	4
PMID_15266616_3.json_es-prompt.txt	MONDO:0019188	0.2	5
PMID_15266616_3.json_es-prompt.txt	MONDO:0010035	0.16666666666666666	6
PMID_15266616_3.json_es-prompt.txt	MONDO:0018923	0.14285714285714285	7
PMID_15266616_3.json_es-prompt.txt	MONDO:0008564	0.125	8
PMID_15266616_3.json_es-prompt.txt	MONDO:0009341	0.1111111111111111	9
```

@caufieldjh (Member)

This is confusing me, though - perhaps I'm missing something about how the scores are calculated.
The results above for PMID_15266616_3 indicate that the top hit for the en prompt is MONDO:0016512 (Kabuki syndrome) and the top hit for the es prompt is MONDO:0018997 (Noonan syndrome). The results for each include the other top hit at a different rank, e.g., the en prompt results include MONDO:0018997 as the second result, for half score.

But according to the source phenopacket, the correct diagnosis is OMIM:147791 (Jacobsen syndrome), or MONDO:0007838. This isn't among the predicted diagnoses at all.
Which output, if any, will tell me that?

@leokim-l (Member, Author)

> This is confusing me, though - perhaps I'm missing something about how the scores are calculated. [...] But according to the source phenopacket, the correct diagnosis is OMIM:147791 (Jacobsen syndrome), or MONDO:0007838. This isn't among the predicted diagnoses at all. Which output, if any, will tell me that?

These results come from this function:
https://github.com/monarch-initiative/malco/blob/2ad4f452bee194e25f71210654af982388646fcf/src/malco/post_process/post_process_results_format.py#L27

whereas the comparison with the correct result is done here:
https://github.com/monarch-initiative/malco/blob/2ad4f452bee194e25f71210654af982388646fcf/src/malco/post_process/compute_mrr.py#L9

Does this answer your question?
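For orientation, a mean reciprocal rank over ranked predictions like these can be sketched as follows. This is a generic illustration, not the code at the links above; the data layout is an assumption:

```python
def reciprocal_rank(ranked_terms: list[str], correct_term: str) -> float:
    """1/rank of the correct term, or 0.0 if it is absent from the
    predictions entirely (as for MONDO:0007838 in the example above)."""
    for rank, term in enumerate(ranked_terms, start=1):
        if term == correct_term:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], str]]) -> float:
    """Average reciprocal rank over (predictions, correct_term) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(preds, correct) for preds, correct in runs) / len(runs)
```

Under this scheme a correct diagnosis that never appears in the predictions contributes 0 to the average, so a missed diagnosis like the Jacobsen syndrome case shows up as a lower MRR rather than as an explicit per-case flag.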

@leokim-l (Member, Author)

> Also looks like the phenopackets2prompt.jar is only producing en and es prompts, though the output includes empty files for the full set of five languages (so malco raises a FileNotFoundError when it can't find the prompts and a pandas.errors.EmptyDataError when it tries to parse the results)

Sorry about this; I added the new .jar in the right place, see 2a30bb1

@leokim-l (Member, Author) commented May 29, 2024

> Plus the corresponding prompts, of course. I see that by default the runner will just process all prompts in /prompts even if it isn't provided the corresponding phenopacket JSON.

I'm not sure I understand this correctly; maybe it is related to your first comment. The runner does compute everything that is in prompts, but the prepare step runs phenopacket2prompt.jar, which populates the /prompts/{language} folder based on what is in phenopacket_store_path. That path is the location of a folder named phenopacket-store inside the input_directory, which is given to pheval via a command line argument. Thus, one can place any set of phenopackets in such a folder and run the whole thing effortlessly. We could provide a test_inputdir, shipped directly with the code, containing:

  1. 10 phenopackets (to play around with)
  2. 1 phenopacket per gene
  3. all phenopackets.

We discussed this with Justin and Peter and agreed we want to change this: go from setup_phenopackets(), which fetches and downloads from the internet (currently from the latest phenopacket store release), to having the phenopackets saved directly in the GitHub repo :)

Let me know if this makes sense!
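The input layout described above can be sketched as follows (folder names other than phenopacket-store are illustrative only):

```
input_directory/                  # passed to pheval as a CLI argument
└── phenopacket-store/
    └── SOME_GENE/                # the store groups phenopackets by gene
        ├── PMID_15266616_3.json
        └── PMID_10874631_II.2.json
```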

@caufieldjh (Member)

> Plus the corresponding prompts, of course. [...]
>
> Not sure I understand this correctly, maybe this is related to your first comment. [...] Let me know if this makes sense!

No worries - I had the full collection of pregenerated prompts in my prompts directory already, and was surprised to see that the runner operated over every phenopacket I had specified on the command line, plus everything else there was a prompt for. This may not be an issue on a fresh install. In the end I did exactly what you suggest - I assembled a test set of 2 phenopackets and ran only on them and their associated prompts.

@caufieldjh (Member)

> This is confusing me, though - perhaps I'm missing something about how the scores are calculated. [...]
>
> These results come from this function:
> https://github.com/monarch-initiative/malco/blob/2ad4f452bee194e25f71210654af982388646fcf/src/malco/post_process/post_process_results_format.py#L27
>
> whereas the comparison with the correct result is done here:
> https://github.com/monarch-initiative/malco/blob/2ad4f452bee194e25f71210654af982388646fcf/src/malco/post_process/compute_mrr.py#L9
>
> Does this answer your question?

I think, for me, the operation was just failing before it got to that point. It seems to be working more as expected now:

[image: malcoplot1]

That's just 2 phenopackets, of course - likely not representative of any real phenomena.

@justaddcoffee (Member)

> @justaddcoffee try and see if you can run the whole pipeline with 2/few phenopackets, please.

Just catching up on this thread - I think the code needs to be tidied up a bit. Glad to help with this, of course. It's pretty difficult to run from scratch in a new directory. As far as I can tell, you have to:

  • find and copy a config.yaml file
  • manually create the output directory
  • find and copy a directory called tool_input_commands (not sure what this does) to the output directory
  • pass a -t '--testdata-dir' argument (not sure what this does either - what exactly is the test data directory?)

and after the above, I still see several runtime errors:

```
testinput/phenopacket-store exists, skipping download.
org.monarchinitiative.phenol.base.PhenolRuntimeException: Error loading JSON
	at org.monarchinitiative.phenol.io.OntologyLoader.loadGraphDocument(OntologyLoader.java:109)
	at org.monarchinitiative.phenol.io.OntologyLoader.loadGraphDocument(OntologyLoader.java:90)
	at org.monarchinitiative.phenol.io.OntologyLoader.loadOntology(OntologyLoader.java:43)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/jtr4v/PythonProject/malco_new/testrun/prompts/en/'
WARNING:ontogpt.clients:llm_gpt4all module not found. GPT4All support will be disabled.
```

(a similar error occurs for es), and finally the MRR calculation fails.

Things work better when I start in the base project directory where everything already exists. The run starts, but I see many errors like this:

```
[ERROR] Could not find it translation for Multiple lentigines (HP:0001003).
```

which we should probably sort out. Other than that, things seem to run correctly (still running).

@caufieldjh (Member)

> WARNING:ontogpt.clients:llm_gpt4all module not found. GPT4All support will be disabled.

That one's on me - it's an OntoGPT warning I keep trying, unsuccessfully, to squash. It shouldn't impact output here, though.

@leokim-l (Member, Author)

A few quick comments (via phone) since I may not be able to work on this until after ESHG:

Regarding the output directory and config.yaml: this is a quick fix; we should simply have them ready to go when one clones.

tool_input_commands is a PhEval thing; in my case there are no issues with it - it gets automatically generated in the output folder, if I remember right.

The test data dir is a PhEval thing; I don't think we are using it.

Regarding the first errors: /prompts should be in the project top directory, or I guess wherever os.system(java...) is run from, whereas the folder phenopacket-store should have a subfolder containing the phenopackets. This is because in the store they are always grouped into folders by gene. I can check with @pnrobinson whether we can edit it so that it will simply (recursively) take all .json files it finds within the folder it is given, without needing a subdirectory.

The Italian errors are due to the missing Italian translation in the public versions; I circulated on Slack a file that also contains translations in Italian and German.

Hope this helps in my absence.
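The recursive collection described above could look something like this. This is only an illustration of the intended directory-walking behavior (phenopacket2prompt itself is Java, and the function name here is made up):

```python
from pathlib import Path

def collect_phenopackets(root: str) -> list[Path]:
    """Recursively gather every phenopacket .json under `root`, so the
    gene-named subfolders used by phenopacket-store no longer matter."""
    return sorted(Path(root).rglob("*.json"))
```

With this behavior, pointing the tool either at phenopacket-store itself or at any flat folder of .json files would work the same way.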

@justaddcoffee merged commit 9fca933 into main on Jun 6, 2024 (1 check passed), and deleted the post_process_add_mondo_utils branch on June 6, 2024 at 09:42.