
Rosetta Score issue #39

Closed · jeep3 opened this issue Jan 30, 2021 · 54 comments

@jeep3

jeep3 commented Jan 30, 2021

Hi Andrew,
I want to use the Rosetta score to detect some links. Metcalf scoring was working (for example, Metcalf score values of 2.3 to 3.2), but after switching scoring methods I found that Rosetta scoring still doesn't work. Do you think my configuration of the antismash folder is correct? (A screenshot can be found below.) The output files from the antiSMASH 5 run on the server don't include the strain name at the start of the *.gbk file names, so I added the strain name to the front of all the *.gbk files for NPLinker to identify and parse. I also added the strain name to the front of all the *.txt files in the knownclusterblast folder. Do I need to do any other preparatory work? Thanks a lot!

Best wishes!

@andrewramsay
Contributor

It sounds like it may be a problem with the way the knownclusterblast filenames are derived from the .gbk filenames. I haven't looked at that part of the code for a long time and it may need to be updated to account for some of the more recent changes that have been made.

I'll do some testing with it and add some more debug output too so it's easier to check what's going on from the log data.

@andrewramsay andrewramsay self-assigned this Feb 1, 2021
@andrewramsay
Contributor

andrewramsay commented Feb 1, 2021

In addition, I also added the strain name in front of all *.txt files in the knownclusterblast folder. Or do I need to do some other preparatory work?

Do you have a screenshot / list of the original .txt filenames you had before doing this? Ideally we'd want NPLinker to parse everything without requiring manual renaming, so it'd help to know what it might have to deal with as there seem to be different naming conventions in different datasets.

@andrewramsay
Contributor

I've updated the Docker image so you can pull the latest version now.

Before you run it, can you:

  • delete the "rosetta" folder you'll find inside your dataset folder (this will remove any cached files leftover from the old version)
  • edit your nplinker.toml file and add/edit the line loglevel = "DEBUG" at the top (this will increase the number of log messages to make finding problems easier; see the snippet below)
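For reference, the top of nplinker.toml would then look something like this (a minimal sketch; any settings you already have stay as they are):

```toml
# increases logging verbosity to make debugging easier
loglevel = "DEBUG"
```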

I've adjusted the code so that it tries to load the knownclusterblast data before the metabolomics data. If it doesn't find the knownclusterblast files it will abort immediately, so it should be faster to check whether it's working or not.

If it seems to work straight away that'd be great, but if not just upload the latest log.txt and we can go from there.

@andrewramsay
Contributor

It looks like it's successfully found the knownclusterblast .txt files this time so that part worked!

The new problem is happening later on when it is gathering all the parsed data together. It is crashing because it's trying to divide by the length of a list which should have at least 1 item in it but is apparently empty. I haven't seen this happen before and I don't know if it's a badly formatted file or something else that's preventing it from parsing correctly.

At the moment there's no way to tell which file caused the error, but I've pushed an update to the Docker image which should print a log message when it detects this situation. Can you try pulling the new image and sending me another log from running it again? That should show us which file(s) it is failing to parse and then we can try to figure out what is wrong with them.

@andrewramsay
Contributor

OK, unfortunately it looks like there are quite a lot of these problematic files, I was expecting only a couple. If you look at the log, every line like this indicates an instance of the problem:

13:25:23 [DEBUG] kcb.py:147, KCBParser failed to extract any MiBIG genes from file /data/antismash/JXZS-118-5B/knownclusterblast/JXZS-118-5B_c00001_ctg1_pi.._c3.txt, BGC ID BGC0001446

Could you send 2 or 3 of the files that are mentioned in all of these messages? So for the message above it would be [dataset folder]/antismash/JXZS-118-5B/knownclusterblast/JXZS-118-5B_c00001_ctg1_pi.._c3.txt and so on. It doesn't matter which particular files, just pick a few different ones at random.

@jeep3
Author

jeep3 commented Feb 3, 2021

[screenshot]
In addition, Andrew, I would like to ask a question: this summary file exists in every antiSMASH folder. Do I need to remove it? In fact, the *.gbk files in the antismash folder and the *.txt files in the knownclusterblast folder from the antiSMASH 5 output do not contain the strain name prefix; I added it in batch afterwards. Is that operation necessary? It seems that when the *.gbk files are not prefixed by the strain name, NPLinker cannot parse them.

@justinjjvanderhooft

Could you add the "knownclusterblastoutput.txt" for some strains?

@jeep3
Author

jeep3 commented Feb 3, 2021

Thanks, Justin, I'll first correct the file names and see if it works.

@andrewramsay
Contributor

Sure. Sorry, I didn't see the message in time. I think I know what the problem is: the file names are probably wrong.
JXZS-118-5B_JXZS-118-5B_c00001_ctg1_pi.._c1.txt
JXZS-118-5B_c00012_ctg12_p.._c2.txt
[screenshot]

I don't think the filenames are the problem here. Or at least they aren't the whole problem. If some of them are wrong after you renamed them, the best thing to do is probably remove the renamed files and copy in the originals again so everything is consistent. I think they should be handled correctly this time.

But there's a line in the log which records how many .gbks were matched up with a .txt successfully:

13:25:47 [INFO] rosetta.py:166, GenBank files matched with knownclusterblast results: 1878/1902

That shows almost all of them were apparently OK. There are 24 files that it failed to match up, and 24 instances of this type of message in the log:

13:25:43 [WARNING] rosetta.py:151, KCBParser failed to find file "/data/antismash/HNCD-7A/knownclusterblast/HNCD-7A_c00001_ctg1_pi.._c1.txt"

which makes sense.

Both of the JXZS-118-5B... files you attached are formatted correctly; they don't contain any hits, but that's OK and they parse as expected. I don't see either of them mentioned in the log as having problems either.

What I was looking for is some of the files that are listed with the "failed to extract any MiBIG genes" messages. There might be multiple messages linked to the same file but there still seem to be a few hundred of them. Some of the files I'd be interested in seeing are:

/data/antismash/GDZJ-105-2A/knownclusterblast/GDZJ-105-2A_c00007_ctg7_pi.._c3.txt
/data/antismash/HBHA-134-8A/knownclusterblast/HBHA-134-8A_c00004_ctg4_pi.._c3.txt
/data/antismash/HNCD-7A/knownclusterblast/HNCD-7A_c00008_ctg8_pi.._c3.txt
/data/antismash/SichGA-2A/knownclusterblast/SichGA-2A_c00004_ctg4_pi.._c4.txt

If you can pick out 1 or 2 of those that'd be very helpful because there must be something different about the way they are formatted that is breaking the parser we have.


In addition, Andrew, I would like to ask a question: this summary file exists in every antiSMASH folder. Do I need to remove it? In fact, the *.gbk files in the antismash folder and the *.txt files in the knownclusterblast folder from the antiSMASH 5 output do not contain the strain name prefix; I added it in batch afterwards. Is that operation necessary? It seems that when the *.gbk files are not prefixed by the strain name, NPLinker cannot parse them.

I think you can just leave those files in place. You can remove them if you prefer but it shouldn't break anything to have them there.

@andrewramsay
Contributor

Great, I can see the problem now. For some of the "Details" sections in these .txt files, there are no entries in the section that the Rosetta code usually parses. This is causing the error at the end of the log because the length of the list of table entries is zero. So for example in GDZJ-105-2A_c00007_ctg7_pi.._c3.txt, the first "Details" section is:

1. BGC0001445
Source: leporin B
Type: NRP + Polyketide:Iterative type I
Number of proteins with BLAST hits to this cluster: 10
Cumulative BLAST score: 15649

Table of genes, locations, strands and annotations of subject cluster:

Table of Blast hits (query gene, subject gene, %identity, blast score, %coverage, e-value):
evm.TU.ctg7_pilon_pilon_pilon.237	AFLA_066840	99	7695	100.45824847250509	0.0
evm.TU.ctg7_pilon_pilon_pilon.238	AFLA_066850	95	171	94.79166666666666	4.9e-43
evm.TU.ctg7_pilon_pilon_pilon.239	AFLA_066860	100	1134	80.1418439716312	0.0
evm.TU.ctg7_pilon_pilon_pilon.240	AFLA_066880	85	746	110.72261072261071	2.8e-215
evm.TU.ctg7_pilon_pilon_pilon.241	AFLA_066890	99	1008	100.0	4.7e-294
evm.TU.ctg7_pilon_pilon_pilon.242	AFLA_066900	98	1626	102.09617755856965	0.0
evm.TU.ctg7_pilon_pilon_pilon.243	AFLA_066910	99	705	100.0	3.5e-203
evm.TU.ctg7_pilon_pilon_pilon.244	AFLA_066920	100	740	100.0	9.8e-214
evm.TU.ctg7_pilon_pilon_pilon.245	AFLA_066930	100	1049	93.88888888888889	1.9e-306
evm.TU.ctg7_pilon_pilon_pilon.246	AFLA_066940	100	775	100.0	3e-224

The code expects to find entries in the "Table of genes, locations, strands and annotations of subject cluster:" table, and in this case it's empty. The next BGC in the same file is fine though, it has a single entry there:

2. BGC0001254
Source: ACT-Toxin II
Type: Polyketide
Number of proteins with BLAST hits to this cluster: 1
Cumulative BLAST score: 2103

Table of genes, locations, strands and annotations of subject cluster:
BAJ14522.1	BAJ14522.1	83	7456	+	polyketide_synthase

Table of Blast hits (query gene, subject gene, %identity, blast score, %coverage, e-value):
evm.TU.ctg7_pilon_pilon_pilon.237	BAJ14522.1	47	2103	63.365580448065174	0.0

I'm not sure what to do about this since I didn't write this part of the code and I'm not very familiar with the intricacies of antiSMASH and knownclusterblast (e.g. I don't know what it means that the "Table of Blast hits" is populated when the other one is empty).

@sdrogers, should I just modify the code to skip over cases like this or is there an alternative way the parsing could be done?

@jeep3
Author

jeep3 commented Feb 3, 2021

Great, I admire that you can quickly find the root of the problem.

@justinjjvanderhooft

@grimur do you have a suggestion? @andrewramsay I would for now skip over the "empty" entries to make it work for those that are correct, but I think Grímur wrote this code, and I don't exactly know what piece of information gets parsed from here....

@andrewramsay
Contributor

@andrewramsay I would for now skip over the "empty" entries to make it work for those that are correct, but I think Grímur wrote this code, and I don't exactly know what piece of information gets parsed from here....

OK, I've updated the Docker image to do that - there will still be a warning message in the log each time it happens but it should remove the affected objects so that the divide-by-zero exception doesn't occur.
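The change amounts to a guard of roughly this shape (a minimal sketch with hypothetical names, not the actual kcb.py code):

```python
# Sketch: skip "Details" entries whose subject-cluster gene table is empty
# instead of letting a later division by len(genes) raise ZeroDivisionError.
def parse_details_sections(sections):
    parsed = []
    for bgc_id, genes, blast_hits in sections:
        if len(genes) == 0:
            # log and skip rather than crash later in the scoring code
            print(f"KCBParser: no MiBIG genes for {bgc_id}, skipping")
            continue
        parsed.append((bgc_id, genes, blast_hits))
    return parsed
```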

@grimur
Contributor

grimur commented Feb 6, 2021 via email

@justinjjvanderhooft

@jeep3 I think you ran antiSMASH on all the genomes, right? Could you rerun some of the ones giving problems and check if any errors pop up?
@grimur @sdrogers yes, the text format seems to be the legacy format, so switching over to the JSON format seems the logical option to prevent more things falling over....

@andrewramsay
Contributor

OK, I guess what I'll need to do is add support for parsing the JSON files first while keeping the text parser around as a fallback for older datasets.

I just picked one of the JSON files at random from a dataset I have here and it covers some of the files which have missing output, so that's helpful to check if I can find the missing info in the JSON correctly. From a quick look through things seem promising... I think I can see how to extract everything that is normally parsed from the text version, including the missing lines from the table!

I'll take a better look at it during next week and let you know when I've got NPLinker updated.

@justinjjvanderhooft

Awesome @andrewramsay - I think that would be the best solution indeed. Those JSON files are included in the standard output for all samples, right?

@andrewramsay
Contributor

Just a quick update: still working on this and making progress, but there are more cases to handle than I thought at first so it might be another day or two before I have a new Docker version you can try out.

@grimur
Contributor

grimur commented Feb 9, 2021 via email

@andrewramsay
Contributor

I think I've finally got this working enough that you can try it, there's a new Docker image available now.

It will check for a .json file in each antiSMASH folder as the primary source of knownclusterblast info, then fall back to the text files if the .json file isn't available.

I've tested it with a couple of datasets including the Crusemann one (which has a mix of JSON and text) and it seems OK, but as usual just include the log output if/when you run into any problems.
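The selection logic is roughly this (a sketch only; the parser entry points here are hypothetical stand-ins, not NPLinker's real function names):

```python
import os

def parse_kcb_json(path):
    ...  # stand-in for the real JSON parser

def parse_kcb_text(path):
    ...  # stand-in for the legacy text-file parser

def load_knownclusterblast(antismash_dir):
    # prefer the antiSMASH .json output, fall back to the legacy .txt files
    json_files = [f for f in os.listdir(antismash_dir) if f.endswith(".json")]
    if json_files:
        return parse_kcb_json(os.path.join(antismash_dir, json_files[0]))
    return parse_kcb_text(os.path.join(antismash_dir, "knownclusterblast"))
```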

@jeep3
Author

jeep3 commented Feb 11, 2021

Nice, I'm preparing a larger dataset to try, and then I'll report back to you. Thanks, Andrew.

@jeep3
Author

jeep3 commented Feb 15, 2021

Hi Andrew, a new error occurred while NPLinker was loading data. Can you help find out what went wrong? Thanks.

@jeep3
Author

jeep3 commented Feb 15, 2021

For example, I don't know what to adjust based on 'TypeError: list indices must be integers or slices, not str'.

@andrewramsay
Contributor

Looks like a bug, not anything you did wrong. Probably something to do with the new JSON parsing code I added. I'll see if I can get it to produce the same error here.

@jeep3
Author

jeep3 commented Feb 15, 2021

Ok, if you need any information, please feel free to tell me.

@andrewramsay
Contributor

I think I've found the problem, there was a mismatch in the structure of the data returned by the JSON parser compared with the text parser. There's a new Docker image available which should fix it.

@jeep3
Author

jeep3 commented Feb 15, 2021

Excuse me, I don't know what happened; the text got bigger.

@andrewramsay
Contributor

I think this is one of the downsides of relying on a browser-based interface, there can be a lot more overhead than in a native application. From the screenshot that looks like a much larger dataset (in terms of the number of BGCs/spectra) than any of the ones I've had access to so far which probably explains some of the slowness. I thought Crusemann was on the larger end and it only has a fraction of that amount.

What sort of hardware are you currently running it on? If it's not particularly fast and doesn't have a lot of RAM it might struggle to load everything.

There are a couple of things you can try:

  • increase the threshold used in the initial pass of Metcalf scoring that is performed before loading the web interface. You can see in the screenshot (top right) that it is trying to load the tables with all objects that appear in any links with a score of >=2.0. For such a large dataset it might make more sense to raise this threshold, which will reduce the number of objects loaded into the webapp. You can do that by adding a section to the config file:

    [webapp]
    tables_metcalf_threshold = 4.0

  • you should be able to run the Docker image on a different machine from the one you view the webapp on, so you can view it on a laptop and have a more powerful server do the bulk of the work. You need to modify the Docker command slightly to allow connections from a remote machine, like this:

    # assuming the server has IP 192.168.10.100
    docker run --name webapp -p 5006:5006 -v c:\your_shared_folder:/data:rw -e BOKEH_ALLOW_WS_ORIGIN=192.168.10.100:5006 andrewramsay/nplinker

  • if what you are most interested in is the output of the Rosetta scoring, you should find a file in the dataset/rosetta folder called "rosetta_hits.csv" which gives a simple listing of all the hits discovered by that method (see the snippet after this list)
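If you just want a quick look at that CSV, a minimal sketch (assuming pandas is installed and your dataset folder is ./dataset):

```python
import pandas as pd

# inspect the Rosetta hits without loading the webapp at all
hits = pd.read_csv("dataset/rosetta/rosetta_hits.csv")
print(hits.head())
```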

@andrewramsay
Contributor

One other thing I just remembered: when we were emailing about a different problem back in December, I think you had asked about adding a way to filter out some parts of the dataset while it was being loaded. I added an issue for this (#33) but had forgotten about it until now.

Would having that feature be useful for this large dataset? You could write a list of strain IDs into a file and NPLinker would discard all objects that didn't contain, or weren't linked to, one of those strains. (I'm not sure what would make the most sense here.)

@andrewramsay
Contributor

Let me split this reply in two:

Docker/server stuff:

The URL thing can be a bit confusing, I should clarify:

You can see in the log that it is blocking connections that don't match the configured URL:

2021-02-17 12:02:02,728 Refusing websocket connection from Origin 'http://localhost:5006'; use --allow-websocket-origin=localhost:5006 or set BOKEH_ALLOW_WS_ORIGIN=localhost:5006 to permit this; currently we allow origins {'thornton.bioinformatics.nl:5006'}

You can have multiple allowed URLs though, so I will tweak the Docker image so that localhost:5006/npapp will always work in addition to any server URL that's been configured.
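In the meantime you can also list both origins explicitly, since BOKEH_ALLOW_WS_ORIGIN accepts a comma-separated list (the server hostname below is just an example):

```
docker run --name webapp -p 5006:5006 -v c:\your_shared_folder:/data:rw -e BOKEH_ALLOW_WS_ORIGIN=localhost:5006,thornton.bioinformatics.nl:5006 andrewramsay/nplinker
```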

Performance problems

From the log I can see it's having a lot of problems even on the server, e.g.

11:33:01 [INFO] methods.py:420, MetcalfScoring.setup (bgcs=18333, gcfs=422, spectra=22837, molfams=10745, strains=234
11:39:51 [INFO] methods.py:427, MetcalfScoring.setup completed

So that's taking over 6 minutes just to run the initial pass of Metcalf scoring, which is normally very fast - it takes seconds for the Crusemann data!

Maybe I need to reduce the number of samples.

Did you see my question about allowing filtering of the dataset, so you could effectively select a subset of the whole thing and only load that particular part into NPLinker? This would give you control over how many objects are loaded. If that's something you'd like to try, let me know.

@justinjjvanderhooft

Thanks @andrewramsay!

Regarding the performance issue --> does it always calculate all Metcalf scores? Trying a threshold of 10 may help trim the results down a bit further?
I think @jeep3 will be interested in knowing how to pass on a selection of the samples, as that means we would not need to repeat the original setup with all the files and could make different selections.

Regarding the server and local browser --> @jeep3 to tunnel the localhost from the server to your laptop, the command will look like this: ssh -L 5006:thornton:5006 myers.bioinformatics.nl

@andrewramsay
Contributor

Regarding the performance issue --> does it always calculate all Metcalf scores? Trying a threshold of 10 may help trim the results down a bit further?

Yes - it's using the code Florian wrote a couple of years ago which builds co-occurrence matrices and uses them to precalculate the scores for all combinations of objects. His code was much, much faster than the original implementation, which is why I'm surprised it's taking so long here. The score data it generates could actually be cached quite easily; I hadn't done that before because it was always fast enough that it didn't really matter. I'll add that in soon.
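To illustrate the co-occurrence idea: with binary strain presence/absence matrices, the counts needed for every GCF/spectrum pair come from a handful of matrix products. This is a sketch only, not Florian's actual code, and the weights are placeholders rather than the real Metcalf weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.integers(0, 2, size=(422, 234))   # GCF x strain presence/absence
S = rng.integers(0, 2, size=(2000, 234))  # spectrum x strain (real dataset: 22837 spectra)

both      = G @ S.T              # strains present in both objects
gcf_only  = G @ (1 - S).T        # present in the GCF only
spec_only = (1 - G) @ S.T        # present in the spectrum only
neither   = (1 - G) @ (1 - S).T  # absent from both

scores = 10 * both - 10 * gcf_only - 10 * spec_only + neither  # placeholder weights
print(scores.shape)  # one score for every GCF/spectrum pair
```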

I think @jeep3 will be interested in knowing how to pass on a selection of the samples, as that means we would not need to repeat the original setup with all the files and could make different selections.

OK, what would be the most sensible way of doing this? A whitelist of strains? Separate whitelists for genomics and metabolomics objects? If I know what to add I can try to get that done this week.

I might also make a start on a native version of the webapp, although that'd be a longer term thing. Some of the problems (like the Metcalf scoring) are not being caused by running things in the browser, but I still think trying to stuff tens of thousands of objects into those HTML-based tables is going to remain very slow even if most other things are fast.

@justinjjvanderhooft

Justin, I didn't catch your meaning of "tunnel the localhost from the server to your laptop, the command will look like this: ssh -L 5006:thornton:5006 myers.bioinformatics.nl". Do you mean typing ssh -L 5006:thornton.bioinformatics.nl in the server shell? Because at this point, NPLinker occupies the command window on my laptop.
@jeep3 if you use a cmd shell and run the ssh command there, you can then open the localhost page on your own laptop (whilst keeping the connection open)

@justinjjvanderhooft

justinjjvanderhooft commented Feb 17, 2021

Yes - it's using the code Florian wrote a couple of years ago which builds co-occurrence matrices and uses them to precalculate the scores for all combinations of objects. His code was much, much faster than the original implementation, which is why I'm surprised it's taking so long here. The score data it generates could actually be cached quite easily; I hadn't done that before because it was always fast enough that it didn't really matter. I'll add that in soon.

Okay @andrewramsay that makes sense then - this is quite a large dataset (pushing boundaries....) - then raising the threshold to a Metcalf score of 10 will not reduce the calculation time, but it may speed up the rendering in the browser?

@justinjjvanderhooft

OK, what would be the most sensible way of doing this? A whitelist of strains? Separate whitelists for genomics and metabolomics objects? If I know what to add I can try to get that done this week.

I might also make a start on a native version of the webapp, although that'd be a longer term thing. Some of the problems (like the Metcalf scoring) are not being caused by running things in the browser, but I still think trying to stuff tens of thousands of objects into those HTML-based tables is going to remain very slow even if most other things are fast.

@andrewramsay yes, it will probably remain slow, but the key thing is that it will at least work (at reasonable speed) once loaded.

And I think a selection could best be done on either the strain labels (pointing to both genomics and metabolomics items) or a list of the genomics strains. Since the objects are linked, do we also need to add the metabolomics items? I am assuming that all the data is in the relevant place, including the file that links the genomic and metabolomic data files (and has strain labels?!), and that we now make a subset of that to load into the web app (and calculate the scores).
@jeep3 what do you think?

@jeep3
Author

jeep3 commented Feb 17, 2021

And I think a selection could best be done on either the strain labels (pointing to both genomics and metabolomics items) or a list of the genomics strains. Since the objects are linked, do we also need to add the metabolomics items? I am assuming that all the data is in the relevant place, including the file that links the genomic and metabolomic data files (and has strain labels?!), and that we now make a subset of that to load into the web app (and calculate the scores).
Same here.

@justinjjvanderhooft

Sorry, I'm a little confused. This command is run in the cmd shell of my laptop, right? Once NPLinker is loaded, I open the website (http://thornton.bioinformatics.nl:5006/npapp) in the browser on my laptop? As you say, at the moment the cmd shell is occupied by NPLinker. Should I open a new cmd shell to input ssh -L 5006:thornton.bioinformatics.nl?

Indeed, you need a new cmd shell, but the tunnelling command should be something like this: ssh -L 5006:thornton:5006 myers.bioinformatics.nl

@andrewramsay
Contributor

andrewramsay commented Feb 17, 2021

Okay @andrewramsay that makes sense then - this is quite a large dataset (pushing boundaries....) - then raising the threshold to a Metcalf score of 10 will not reduce the calculation time, but it may speed up the rendering in the browser?

Exactly, anything that reduces the number of objects should make the rendering process a bit faster.

And I think a selection could best be done on either the strain labels (pointing to both genomics and metabolomics items) or a list of the genomics strains. Since the objects are linked, do we also need to add the metabolomics items? I am assuming that all the data is in the relevant place, including the file that links the genomic and metabolomic data files (and has strain labels?!), and that we now make a subset of that to load into the web app (and calculate the scores).

Assuming the strains are all correctly identified in strain_mappings.csv, I think either way should work and there wouldn't be any need to add entries for the metabolomics items manually. As long as there is a set of user-supplied strains, it can filter out both metabolomics and genomics objects which don't contain at least one of the strains in the set (see the sketch below).
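A minimal sketch of that filtering idea (hypothetical attribute names, not the real NPLinker classes):

```python
def filter_by_strains(bgcs, spectra, keep_strains):
    # keep only objects containing at least one user-supplied strain;
    # assumes each object exposes a set-like .strains attribute (hypothetical)
    bgcs = [b for b in bgcs if set(b.strains) & keep_strains]
    spectra = [s for s in spectra if set(s.strains) & keep_strains]
    return bgcs, spectra
```

Any GCFs or MolFams left empty after this pass would then be deleted as well.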

I'll push an update to the Docker image once I've finished this (and added the Metcalf scoring caching too).

@andrewramsay
Contributor

A quick question about the strain filtering feature...

Currently NPLinker has a built-in filtering step which takes the combined set of strain IDs parsed from the genomics & metabolomics data and trims it down to include only those which are common to both (this doesn't remove any BGC/GCF/Spectrum/MolFam objects, just reduces the number of strains).

How should this work when you're also supplying a list of strains you explicitly want included? There are a few potential ways to handle it and I'm not sure what would make most sense:

  1. Do the common strains step as usual, then do the "include only these strains" step?
  2. Do the "include only these strains" step, then remove the non-common strains?
  3. Do the common strains step as usual, unless there is a set of "include these strains only" supplied by the user. If so, then only do that part and skip the common strains step?

Any suggestions?

@jeep3
Author

jeep3 commented Feb 22, 2021

Cool, I prefer option 3.

@justinjjvanderhooft

Good question @andrewramsay - I will touch base with @grimur tomorrow, so I will shortly discuss it with him. To me, if we store the results of the original "common strains" step, 1 makes most sense, as you can then supply the "include only these strains" step and easily make another selection based on the same initial matching. If those initial matching results are not stored, then option 3 makes sense to me.

@grimur
Contributor

grimur commented Feb 23, 2021

After discussing with @justinjjvanderhooft I agree that option 1 would make most sense, as long as it's a display issue, and not an issue of keeping the strains in memory. The problem with option 3, as I see it, is that you really don't want to do e.g. strain correlation scoring unless you actually have genomic and metabolomic info for all the strains in your analysis. I think we should always restrict analysis to shared strains, or at least make it so that it takes conscious effort to introduce asymmetry.

@justinjjvanderhooft

Thanks @grimur for summarizing that - to add to this, if NPLinker caches the input data and IOKR scores, it makes sense to do that once (assuming that the size of the dataset mainly poses issues for the HTML rendering of the visualization and not for the backend calculations). Furthermore, BiG-SCAPE results will also differ for different amounts of input data, so doing that only once on the complete matchable set makes most sense to us.

@andrewramsay
Contributor

After discussing with @justinjjvanderhooft I agree that option 1 would make most sense, as long as it's a display issue, and not an issue of keeping the strains in memory.

@grimur sorry to keep asking questions, but I'm not sure I understand what you're getting at here? My intention was to take this set of user-supplied strains, then go through the lists of BGC/Spectrum objects and remove any that didn't include any of those strains (and then in turn delete any GCFs/MolFams that ended up empty). The issue with rendering the UI that Justin mentions is not (directly) due to the number of strains involved, it's the numbers of the other object types. There should be no problem in keeping all the strains in memory since they're not listed in one of the tables like the BGCs/Spectra/GCFs/MolFams. But if I keep all the strains, then the number of other objects isn't going to change and that's the bigger problem...

Thanks @grimur for summarizing that - to add to this, if NPLinker caches the input data and IOKR scores, it makes sense to do that once (assuming that the size of the dataset mainly poses issues for the HTML rendering of the visualization and not for the backend calculations). Furthermore, BiG-SCAPE results will also differ for different amounts of input data, so doing that only once on the complete matchable set makes most sense to us.

I'd also really like to clarify some points touched on here in case I've been making some very wrong assumptions. This is a summary of the steps NPLinker would normally take when loading the webapp:

  1. (if needed) retrieve remote genomics + metabolomics data when using a platform dataset
  2. (if needed) run BiG-SCAPE on the antiSMASH data (only done once)
  3. parse strain_mappings.csv to create a list of valid strain IDs/labels
  4. parse all the data files and build full lists of BGC/GCF/Spectrum/MolFam objects (this includes another filtering step I'd forgotten about which removes GCFs that only contain MiBIG BGCs)
  5. filter the original list of strains down to the common set between genomics+metabolomics. This only removes strain objects, not any of the other types, although it does modify the GCFs and Spectra by updating their lists of strains
  6. do the initial pass of Metcalf scoring with the configured threshold. This will be run on the full set of objects that exists at this point (and the results will be cached starting from the next version, this was the step taking 6 minutes on your big dataset)
  7. finally there's some further preprocessing done to establish the links between the various types of object in the format that the code behind the tables expects, but no object modification/deletion is involved

My concern is that I was planning to put this extra filtering step between 5 and 6 in that list. It would involve removing objects from memory, so they would be excluded from the initial Metcalf pass and any other user-initiated scoring done via the browser interface.

But now I'm thinking this might not be very helpful if it messes up things like the strain correlation scoring Grimur mentioned by pruning the number of objects available.

Am I missing something? I don't think this will be a lot of work to implement once I'm clear on what needs to be done, I'm just keen to avoid a situation where someone ends up wasting a lot of time because they didn't realise they weren't working with the set of objects they thought they were!

@justinjjvanderhooft

Thanks @andrewramsay for the clear overview - I think we can put the additional filtering step between 5 and 6, as the GCFs and MFs do not change anymore - just the number of their members will fluctuate (and with it the actual Metcalf score, and the other scores as a result - something to be aware of....) if you restart the analysis on the same data but with a different subset (all the spectral and BGC objects are still stored) - @grimur could you confirm that this would work for the calculation of your new scores as well? This subsetting is trickier than I thought it would be....

@grimur
Contributor

grimur commented Feb 23, 2021

@andrewramsay not at all, ask away - I'm not even sure myself if what I'm saying makes sense. And thanks for the clarification.

Is there anything to gain by keeping the strains in memory, then? The reason @justinjjvanderhooft and I had for keeping them around was some vague notion that it would be easier to redo the analysis for a different subset, but that's perhaps a premature optimisation and simpler to just change the subset list and reload the whole thing.

Putting the filtering step between steps 5 and 6 makes perfect sense, just as long as any computation of scores is done after any filtering steps that might remove strains from the population.

I'm wondering, though - would swapping steps 4 and 5 make sense - to only load the strain IDs in the final set of IDs, instead of loading all the strains, creating all the objects and deleting them afterwards if the strain is excluded? So the order of the steps would be 1, 2, 3, 5, user filtering, 4, 6, 7?

@andrewramsay
Contributor

Thanks for the responses, that helps make things clearer for me! I've been really busy with other things this week but will try to get an updated Docker image ready as soon as I can find some time.

I'm wondering, though - would swapping steps 4 and 5 make sense - to only load the strain IDs in the final set of IDs, instead of loading all the strains, creating all the objects and deleting them afterwards if the strain is excluded? So the order of the steps would be 1, 2, 3, 5, user filtering, 4, 6, 7?

I can see how this could work, but I think the problem as things stand is that there's no way for NPLinker to do step 5 before parsing the various data files in step 4, because it relies on constructing the objects to discover which strains occur where. The list of strain IDs it reads from strain_mappings doesn't have any context; it's just "here's a strain ID and possibly various aliases for it". Until it goes through the files and finds where those IDs occur, it doesn't know which are found in genomics objects, metabolomics objects, or both.

It should already skip over creating objects which have strain IDs that don't appear in strain_mappings.csv so that should help to limit the number of objects created+deleted at least a little.

The implementation of step 5 is just a few lines here in case there's an obvious alternative way to do it. I suppose it could parse the data files, extract strain IDs along with an indicator of the source, do the step 5 filtering based on that, and only then go back and create all the BGC/GCF/Spectrum/MolFam objects using the reduced set of strains? Don't see why that wouldn't work, but it'd take a bit more time to rewrite the parsing code since it currently constructs the objects as it goes through the files line by line.

@justinjjvanderhooft

--> I suppose it could parse the data files, extract strain IDs along with an indicator of the source, do the step 5 filtering based on that, and only then go back and create all the BGC/GCF/Spectrum/MolFam objects using the reduced set of strains?

@andrewramsay if that happens on the basis of the same initial BGC/GCF/Spectrum/MolFam assignments for the different subsets a user selects, then it would at least provide "consistent" results and scores. Some GCFs may become (very) small.... and another thought occurred to me: we could let the user recalculate the scores based on the subset of strains.
@grimur what are your thoughts?
@andrewramsay in the end, it would be nice to get your view on how to make this process most robust to future influx of datasets and possible novel scores to calculate.

@andrewramsay
Contributor

I've updated the Docker image with my first attempt at implementing this change. The key points are:

  1. The app now looks for an optional file called include_strains.csv in the dataset folder. This should contain a single column of strain labels as found in strain_mappings.csv (it doesn't matter which label you use for a particular strain, as long as it's in the mappings somewhere). Once the common strains filtering (step 5) is finished, the app will:
    • take the list of common strains and filter it down to only those given in this file
    • delete all BGC & Spectrum objects that don't include any of the remaining strains
    • delete all GCF & MolFam objects that now don't contain any BGCs or Spectra respectively
    • this all happens before any scoring takes place
  2. While I was testing the changes I realised it was easy to end up with a set of strains in include_strains.csv that doesn't overlap with the set of common genomic+metabolomic strains NPLinker collects. If that happens you end up with zero strains and other objects, and it has to abort. To make this easier to avoid, the app now automatically writes a list of those common strains into a file called common_strains.csv, one strain label per line. My idea was that you'd run NPLinker once to get this list, then copy/paste it into include_strains.csv, delete the strains you didn't want to retain, then run NPLinker again so it could apply that extra filtering (see the example below).
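In other words, the intended workflow is something like this (assuming your dataset folder is ./dataset):

```
# 1. run NPLinker once; it writes dataset/common_strains.csv
# 2. seed the include list from it
cp dataset/common_strains.csv dataset/include_strains.csv
# 3. edit include_strains.csv to keep only the strains you want, then rerun NPLinker
```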

@justinjjvanderhooft

Thanks @andrewramsay! I think this makes sense and can be easily explained to people. Same GCFs/MFs, but different scorings dependent on the subset.

@jeep3 could you test it at some point? You could make a selection from the common_strains.csv that covers the different clades, make a subset of 100 strains, and see if that loads okay, for example.

@jeep3
Author

jeep3 commented Feb 26, 2021

Nice, got it, thanks.

@andrewramsay
Contributor

Nice, got it, thanks.

I just found a bug in the webapp code in the last version that would have made it seem like the "Show scores ..." buttons wouldn't do anything. If you update to the latest Docker image today that should be fixed.

@jeep3
Author

jeep3 commented Mar 3, 2021

Very cool, Andrew. I haven't used NPLinker recently, due to some other things. Thanks.

@CunliangGeng
Member

I assume the issue has been solved. Please reopen it if not :-)
