Rosetta Score issue #39
Comments
It sounds like it may be a problem with the way that the knownclusterblast filenames are derived from the .gbk filenames. I haven't looked at that part of the code for a long time and it may need to be updated to account for some of the more recent changes that have been made. I'll do some testing with it and add some more debug output too so it's easier to check what's going on from the log data.
Do you have a screenshot / list of the original .txt filenames you had before doing this? Ideally we'd want NPLinker to parse everything without requiring manual renaming, so it'd help to know what it might have to deal with as there seem to be different naming conventions in different datasets. |
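For context, the matching step being discussed boils down to deriving expected knownclusterblast .txt names from each .gbk filename. A minimal sketch, assuming a hypothetical naming convention (this is not the actual NPLinker logic):

```python
from pathlib import Path

def candidate_kcb_files(gbk_path: Path) -> list:
    """Guess which knownclusterblast .txt files belong to a given .gbk file.

    Assumes a layout of <strain_dir>/<name>.gbk with results under
    <strain_dir>/knownclusterblast/<name>_c<N>.txt -- this convention is an
    assumption for illustration; real datasets clearly use different schemes.
    """
    stem = gbk_path.stem  # e.g. "STRAIN-1_c00001"
    kcb_dir = gbk_path.parent / "knownclusterblast"
    if not kcb_dir.is_dir():
        return []
    # any .txt file whose name starts with the .gbk stem is a candidate match
    return sorted(p for p in kcb_dir.glob("*.txt") if p.name.startswith(stem))
```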
I've updated the Docker image so you can pull the latest version now. Before you run it, can you:
I've adjusted the code so that it will try to load the knownclusterblast data first instead of the metabolomics data. If it doesn't find the knownclusterblast files it should abort the process immediately so it should be faster to check if it's working or not. If it seems to work straight away that'd be great, but if not just upload the latest log.txt and we can go from there. |
It looks like it's successfully found the knownclusterblast .txt files this time so that part worked! The new problem is happening later on when it is gathering all the parsed data together. It is crashing because it's trying to divide by the length of a list which should have at least 1 item in it but is apparently empty. I haven't seen this happen before and I don't know if it's a badly formatted file or something else that's preventing it from parsing correctly. At the moment there's no way to tell which file caused the error, but I've pushed an update to the Docker image which should print a log message when it detects this situation. Can you try pulling the new image and sending me another log from running it again? That should show us which file(s) it is failing to parse and then we can try to figure out what is wrong with them. |
OK, unfortunately it looks like there are quite a lot of these problematic files; I was expecting only a couple. If you look at the log, every line like this indicates an instance of the problem:
Could you send 2 or 3 of the files that are mentioned in all of these messages? So for the message above it would be [dataset folder]/antismash/JXZS-118-5B/knownclusterblast/JXZS-118-5B_c00001_ctg1_pi.._c3.txt and so on. It doesn't matter which particular files, just pick a few different ones at random. |
Could you add the "knownclusterblastoutput.txt" for some strains?
Thanks, Justin. I'll first correct the file names and see if it works.
I don't think the filenames are the problem here. Or at least they aren't the whole problem. If some of them are wrong after you renamed them, the best thing to do is probably remove the renamed files and copy in the originals again so everything is consistent. I think they should be handled correctly this time. But there's a line in the log which records how many .gbks were matched up with a .txt successfully:
That shows almost all of them were apparently OK. There are 24 files that it failed to match up, and 24 instances of this type of message in the log:
which makes sense. Both of those JXZZ-118-5B... files you attached are formatted correctly; they don't contain any hits, but that's OK and they parse as expected. I don't see either of them mentioned in the log as having problems either. What I was looking for is some of the files that are listed with the "failed to extract any MiBIG genes" messages. There might be multiple messages linked to the same file, but there still seem to be a few hundred of them. Some of the files I'd be interested in seeing are:
If you can pick out 1 or 2 of those that'd be very helpful because there must be something different about the way they are formatted that is breaking the parser we have.
I think you can just leave those files in place. You can remove them if you prefer but it shouldn't break anything to have them there. |
Great, I can see the problem now. For some of the "Details" sections in these .txt files, there are no entries in the section that the Rosetta code usually parses. This is causing the error at the end of the log because the length of the list of table entries is zero. So for example in GDZJ-105-2A_c00007_ctg7_pi.._c3.txt, the first "Details" section is:
The code expects to find entries in the "Table of genes, locations, strands and annotations of subject cluster:" table, and in this case it's empty. The next BGC in the same file is fine though; it has a single entry there:
I'm not sure what to do about this since I didn't write this part of the code and I'm not very familiar with the intricacies of antiSMASH and knownclusterblast (e.g. I don't know what it means that the "Table of Blast hits" is populated when the other one is empty). @sdrogers, should I just modify the code to skip over cases like this or is there an alternative way the parsing could be done? |
Great, I'm impressed by how quickly you found the root of the problem.
@grimur do you have a suggestion? @andrewramsay I would for now skip over the "empty" entries to make it work for those that are correct, but I think Grímur wrote this code, and I don't exactly know what piece of information gets parsed from here.... |
OK, I've updated the Docker image to do that - there will still be a warning message in the log each time it happens but it should remove the affected objects so that the divide-by-zero exception doesn't occur. |
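A minimal sketch of the kind of guard described here; the function and the hit structure are assumptions for illustration, not the actual NPLinker parser:

```python
import logging

logger = logging.getLogger(__name__)

def filter_kcb_hits(filename, hits):
    """Drop knownclusterblast hits whose MiBIG gene table came back empty.

    `hits` is assumed to be a list of dicts with an "all_mibig_genes" list;
    this structure is illustrative, not the real parser output.
    """
    kept = []
    for hit in hits:
        if not hit.get("all_mibig_genes"):
            logger.warning("No MiBIG genes extracted for a hit in %s, skipping it", filename)
            continue  # skipping avoids the later division by a zero-length list
        kept.append(hit)
    return kept
```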
Hi -
The known cluster blast parser is Simon’s, but as far as I can see, this is due to antiSMASH returning a malformed output file: normally, this list should contain all the genes in the MIBiG entry being referenced, but for some reason, it is occasionally empty.
We can see that antiSMASH does at some point have the genes in the MIBiG entry, because it reports BLAST hits for some of them, but I don’t know why it doesn’t output that list.
Are you using the antiSMASH results from antiSMASH DB, or running it yourself? If the latter, do you get any error messages? It might be worth finding an assembly that has a BGC with an empty MIBiG hit and see if antiSMASH alerts us about this (I can see this happen in my data as well, but when I was running antiSMASH, I tended to do so in batches and not capture the logs…)
I’m presuming the error occurs when we’re computing the % of the BGC which matches the MIBiG BGC. As far as I can tell, we have two options: return a sensible error message when this list is empty (something like ‘can’t determine coverage’) or parse the list of genes from MIBiG on our own. The latter would require MIBiG to be baked into NPLinker - which it isn’t, right?
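To make that failure mode concrete, a sketch of the coverage calculation being described, including the 'can't determine coverage' fallback; the function name and inputs are hypothetical:

```python
def mibig_coverage(matched_genes, mibig_genes):
    """Fraction of a MiBIG entry's genes that the BGC hits.

    Returns None when the gene table from knownclusterblast is empty,
    i.e. coverage can't be determined (illustrative only).
    """
    if not mibig_genes:
        return None  # "can't determine coverage"
    return len(set(matched_genes) & set(mibig_genes)) / len(mibig_genes)
```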
In any case, if this is from antiSMASH DB - or if we can verify that we aren’t doing anything seriously wrong - I think this calls for a bug report to the antiSMASH people...
… that being said, I think we might want to consider changing the KCB parser to use the json results, as discussed here:
antismash/antismash#291
Best regards,
-grímur
@jeep3 I think you run antiSMASH on all the genomes, right? Could you rerun some of the ones giving problems and check if any error pops up? |
OK, I guess what I'll need to do is add support for parsing the JSON files first while keeping the text parser around as a fallback for older datasets. I just picked one of the JSON files at random from a dataset I have here and it covers some of the files which have missing output, so that's helpful to check if I can find the missing info in the JSON correctly. From a quick look through things seem promising... I think I can see how to extract everything that is normally parsed from the text version, including the missing lines from the table! I'll take a better look at it during next week and let you know when I've got NPLinker updated. |
Awesome @andrewramsay - I think that would be the best solution indeed. Those JSON files are included in the standard output for all samples, right? |
Just a quick update: still working on this and making progress, but there are more cases to handle than I thought at first so it might be another day or two before I have a new Docker version you can try out. |
Ok, thanks for the info. If you take a look at the known cluster blast output files, is there still the empty-table problem? The files are the .txt files in the knownclusterblast folder, and the empty table that’s causing trouble is “Table of genes, locations, …” for any of the hits in the “Details” section.
On 9 Feb 2021, 14:13 +0000, jeep3 <notifications@github.com>, wrote:
Good job, many thanks, Andrew. I re-ran 5 strains using antiSMASH 5.1.2 and there seem to be no errors. There is only one warning, but it looks OK. I think one possible reason for the blank values is that in the past I only used FASTA input to antiSMASH; I should use FASTA+GFF input instead. I'll try it as soon as possible and then report back to you. Thanks again.
antismashlog.txt<https://github.com/sdrogers/nplinker/files/5951878/antismashlog.txt>
I think I've finally got this working enough that you can try it, there's a new Docker image available now. It will check for a .json file in each antiSMASH folder as the primary source of knownclusterblast info, then fall back to the text files if the .json file isn't available. I've tested it with a couple of datasets including the Crusemann one (which has a mix of JSON and text) and it seems OK, but as usual just include the log output if/when you run into any problems. |
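Roughly, the loading order described above looks like the sketch below; the helper names and return structure are assumptions, not the real NPLinker code:

```python
import json
from pathlib import Path

def load_kcb_results(antismash_dir: Path) -> dict:
    """Prefer antiSMASH's .json output, fall back to knownclusterblast .txt files.

    Returns a mapping of region/BGC identifier -> knownclusterblast hits.
    Both parse_* functions below are placeholders for illustration.
    """
    json_files = sorted(antismash_dir.glob("*.json"))
    if json_files:
        with open(json_files[0]) as f:
            return parse_kcb_from_json(json.load(f))
    txt_files = sorted((antismash_dir / "knownclusterblast").glob("*.txt"))
    return parse_kcb_from_text(txt_files)

def parse_kcb_from_json(data: dict) -> dict:
    # placeholder: walk the antiSMASH JSON and pull out knownclusterblast results
    return {}

def parse_kcb_from_text(txt_files) -> dict:
    # placeholder: run the existing line-by-line text parser over each file
    return {}
```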
Nice, I'm preparing a larger dataset to try, and then I'll report back to you. Thanks, Andrew.
Hi Andrew, a new error occurred when NPLinker was loading data. Can you help find out what went wrong? Thanks.
For example, I don't know what to adjust based on 'TypeError: list indices must be integers or slices, not str'.
Looks like a bug, not anything you did wrong. Probably something to do with the new JSON parsing code I added. I'll see if I can get it to produce the same error here. |
Ok, if you need any information, please feel free to tell me. |
I think I've found the problem, there was a mismatch in the structure of the data returned by the JSON parser compared with the text parser. There's a new Docker image available which should fix it. |
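For context, that TypeError is what you get when downstream code indexes with a string key but one parser hands back a list instead of a dict. A contrived illustration (the field names are made up):

```python
# The text parser returns hits keyed by BGC id (a dict)...
hits_from_text = {"bgc001": [{"mibig_id": "BGC0000001"}]}
# ...while the JSON parser returned the same hits as a plain list.
hits_from_json = [{"mibig_id": "BGC0000001"}]

print(hits_from_text["bgc001"][0]["mibig_id"])  # works as the caller expects

try:
    print(hits_from_json["bgc001"])  # caller still indexes with a string key
except TypeError as err:
    print(err)  # "list indices must be integers or slices, not str"
```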
Excuse me, I don't know what happened; the text got bigger.
I think this is one of the downsides of relying on a browser-based interface, there can be a lot more overhead than in a native application. From the screenshot that looks like a much larger dataset (in terms of the number of BGCs/spectra) than any of the ones I've had access to so far which probably explains some of the slowness. I thought Crusemann was on the larger end and it only has a fraction of that amount. What sort of hardware are you currently running it on? If it's not particularly fast and doesn't have a lot of RAM it might struggle to load everything. There are a couple of things you can try:
One other thing I just remembered: when we were emailing about a different problem back in December, I think you had asked about adding a way to filter out some parts of the dataset while it was being loaded. I added an issue for this (#33) but had forgotten about it until now. Would having that feature be useful for this large dataset? You could write a list of strain IDs into a file and NPLinker would discard all objects that didn't contain or weren't linked to one of those strains? (I'm not sure what would make the most sense here) |
Let me split this reply in two. Docker/server stuff: The URL thing can be a bit confusing, I should clarify:
You can see in the log that it is blocking connections that don't match the configured URL:
You can have multiple allowed URLs though, so I will tweak the Docker image so that localhost:5006/npapp will always work in addition to any server URL that's been configured. Performance problems: From the log I can see it's having a lot of problems even on the server, e.g.
So that's taking over 6 minutes just to run the initial pass of Metcalf scoring, which is normally very fast - it takes seconds for the Crusemann data!
Did you see my question about allowing filtering of the dataset, so you could effectively select a subset of the whole thing and only load that particular part into NPLinker? This would give you control over how many objects you have loaded. If that's something you'd like to try, let me know.
Thanks @andrewramsay! Regarding the performance issue --> does it always calculate all Metcalf scores? Trying a threshold of 10 may work to further trim the results down a bit. Regarding the server and local browser --> @jeep3, to tunnel the local host from the server to your laptop, the command will look like this: ssh -L 5006:thornton:5006 myers.bioinformatics.nl
Yes - it's using the code Florian wrote a couple of years ago which builds co-occurrence matrices and uses them to precalculate the scores for all combinations of objects. His code was much much faster than the original implementation which is why I'm surprised it's taking so long here. The score data it generates could actually be cached quite easily, I hadn't done that before because it was always fast enough it didn't really matter. I'll add that in soon.
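The co-occurrence approach mentioned here reduces to a few matrix products over binary strain-presence matrices. A rough sketch; the weights and array layout are placeholders, not a claim about the exact values NPLinker uses:

```python
import numpy as np

def metcalf_scores(spec_occ: np.ndarray, gcf_occ: np.ndarray,
                   w_both: float = 10.0, w_only: float = -10.0) -> np.ndarray:
    """Precompute a (n_spectra x n_gcfs) matrix of co-occurrence-based scores.

    spec_occ: binary matrix, spectra x strains (1 = spectrum detected in strain)
    gcf_occ:  binary matrix, GCFs x strains    (1 = GCF present in strain)
    The weights are placeholders; the real scoring details may differ.
    """
    both = spec_occ @ gcf_occ.T              # strains with both spectrum and GCF
    spec_only = spec_occ @ (1 - gcf_occ).T   # strain has the spectrum but not the GCF
    gcf_only = (1 - spec_occ) @ gcf_occ.T    # strain has the GCF but not the spectrum
    return w_both * both + w_only * (spec_only + gcf_only)
```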
OK, what would be the most sensible way of doing this? A whitelist of strains? Separate whitelists for genomics and metabolomics objects? If I know what to add I can try to get that done this week. I might also make a start on a native version of the webapp, although that'd be a longer term thing. Some of the problems (like the Metcalf scoring) are not being caused by running things in the browser, but I still think trying to stuff tens of thousands of objects into those HTML-based tables is going to remain very slow even if most other things are fast. |
Justin, I didn't catch your meaning of "tunnel the local host from the server to your laptop, the command will look like this: ssh -L 5006:thornton:5006 myers.bioinformatics.nl". Do you mean that in the server shell I should type: ssh -L 5006:thornton.bioinformatics.nl? Because at this point, NPLinker occupies the command window on my laptop.
_OK, what would be the most sensible way of doing this? A whitelist of strains? Separate whitelists for genomics and metabolomics objects? If I know what to add I can try to get that done this week. I might also make a start on a native version of the webapp, although that'd be a longer term thing. Some of the problems (like the Metcalf scoring) are not being caused by running things in the browser, but I still think trying to stuff tens of thousands of objects into those HTML-based tables is going to remain very slow even if most other things are fast._ @andrewramsay yes, it will probably remain slow, but the key thing is that it will at least work (at reasonable speed) once loaded. And I think a selection could best be done on either the strain labels (pointing to both genomics and metabolomics items) or a list of the genomics strains. Since the objects are linked, do we also need to add the metabolomics items? I am assuming that all the data is in the relevant place, including the file that links the genomic and metabolomic data files (and has strain labels?!), and now we make a subset of that to load into the web app (and calculate the scores).
Sorry, I'm a little confused. This command is run in the cmd shell of my laptop, right? And once NPLinker has loaded, I open the website (http://thornton.bioinformatics.nl:5006/npapp) in the browser on my laptop?
Exactly, anything that reduces the number of objects should make the rendering process a bit faster.
Assuming the strains are all correctly identified in strain_mappings.csv I think either way should work and there wouldn't be any need to add entries for the metabolomics items manually. As long as there is a set of user-supplied strains it can filter out both metabolomics and genomics objects which don't contain at least one of the strains in the set. I'll push an update to the Docker image once I've finished this (and added the Metcalf scoring caching too). |
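In other words, the planned filtering amounts to something like this sketch; the `strains` attribute and helper names are assumptions for illustration, not the actual NPLinker classes:

```python
def load_strain_whitelist(path):
    """Read one strain ID per line from a user-supplied file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def filter_by_strains(bgcs, spectra, whitelist):
    """Keep only BGC/Spectrum objects containing at least one whitelisted strain.

    Each object is assumed to expose a `strains` set of strain IDs (illustrative).
    """
    bgcs = [b for b in bgcs if b.strains & whitelist]
    spectra = [s for s in spectra if s.strains & whitelist]
    return bgcs, spectra
```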
A quick question about the strain filtering feature... Currently NPLinker has a built-in filtering step which takes the combined set of strain IDs parsed from the genomics & metabolomics data and trims it down to include only those which are common to both (this doesn't remove any BGC/GCF/Spectrum/MolFam objects, just reduces the number of strains). How should this work when you're also supplying a list of strains you explicitly want included? There are a few potential ways to handle it and I'm not sure what would make most sense:
Any suggestions? |
Cool, I prefer option 3. |
Good question @andrewramsay - I will touch base with @grimur tomorrow and discuss it with him shortly. To me, if we store the results of the original "common strains" step, option 1 makes most sense, as you can then apply the "include only these strains" step and easily make another selection based on the same initial matching. If those initial matching results are not stored, then option 3 makes sense to me.
After discussing with @justinjjvanderhooft I agree that option 1 would make most sense, as long as it's a display issue, and not an issue of keeping the strains in memory. The problem with option 3, as I see it, is that you really don't want to do e.g. strain correlation scoring unless you actually have genomic and metabolomic info for all the strains in your analysis. I think we should always restrict analysis to shared strains, or at least make it so that it takes conscious effort to introduce asymmetry. |
Thanks @grimur for summarizing that - to add to this, if NPLinker caches the input data and IOKR scores, it makes sense to do that once (assuming that the size of the dataset mainly poses issues for the HTML rendering of the visualization and not for the backend calculations). Furthermore, BiG-SCAPE results will also differ for different amounts of input data, so doing that only once on the complete matchable set makes most sense to us.
@grimur sorry to keep asking questions, but I'm not sure I understand what you're getting at here? My intention was to take this set of user-supplied strains, then go through the lists of BGC/Spectrum objects and remove any that didn't include any of those strains (and then in turn delete any GCFs/MolFams that ended up empty). The issue with rendering the UI that Justin mentions is not (directly) due to the number of strains involved, it's the numbers of the other object types. There should be no problem in keeping all the strains in memory since they're not listed in one of the tables like the BGCs/Spectra/GCFs/MolFams. But if I keep all the strains, then the number of other objects isn't going to change and that's the bigger problem...
I'd also really like to clarify some points touched on here in case I've been making some very wrong assumptions. This is a summary of the steps NPLinker would normally take when loading the webapp:
My concern is that I was planning to put this extra filtering step between 5 and 6 in that list. It would involve removing objects from memory, so they would be excluded from the initial Metcalf pass and any other user-initiated scoring done via the browser interface. But now I'm thinking this might not be very helpful if it messes up things like the strain correlation scoring Grimur mentioned by pruning the number of objects available. Am I missing something? I don't think this will be a lot of work to implement once I'm clear on what needs to be done, I'm just keen to avoid a situation where someone ends up wasting a lot of time because they didn't realise they weren't working with the set of objects they thought they were! |
Thanks @andrewramsay for the clear overview - I think we can put the additional filtering step between 5 and 6, as the GCFs and MFs do not change anymore; just the number of their members will fluctuate if you restart the analysis on the same data but with a different subset (and with them the actual Metcalf score and the other scores that build on it - something to be aware of). All the spectral and BGC objects are still stored. @grimur, could you confirm that this would work for the calculation of your new scores as well? This subsetting is trickier than I thought it would be....
@andrewramsay not at all, ask away - I'm not even sure myself if what I'm saying makes sense. And thanks for the clarification. Is there anything to gain by keeping the strains in memory, then? The reason @justinjjvanderhooft and I had for keeping them around was some vague notion that it would be easier to redo the analysis for a different subset, but that's perhaps a premature optimisation and simpler to just change the subset list and reload the whole thing. Putting the filtering step between steps 5 and 6 makes perfect sense, just as long as any computation of scores is done after any filtering steps that might remove strains from the population. I'm wondering, though - would swapping steps 4 and 5 make sense - to only load the strain IDs in the final set of IDs, instead of loading all the strains, creating all the objects and deleting them afterwards if the strain is excluded? So the order of the steps would be 1, 2, 3, 5, user filtering, 4, 6, 7? |
Thanks for the responses, that helps make things clearer for me! I've been really busy with other things this week but will try to get an updated Docker image ready as soon as I can find some time.
I can see how this could work, but I think the problem as things stand is that there's no way for NPLinker to do step 5 before parsing the various data files in step 4, because it relies on constructing the objects to discover which strains occur where. The list of strain IDs it reads from strain_mappings doesn't have any context; it's just "here's a strain ID and possibly various aliases for it". Until it goes through the files and finds where those IDs occur, it doesn't know which are found in genomics/metabolomics/both types of objects. It should already skip over creating objects which have strain IDs that don't appear in strain_mappings.csv, so that should help to limit the number of objects created+deleted at least a little. The implementation of step 5 is just a few lines here in case there's an obvious alternative way to do it. I suppose it could parse the data files, extract strain IDs along with an indicator of the source, do the step 5 filtering based on that, and only then go back and create all the BGC/GCF/Spectrum/MolFam objects using the reduced set of strains? I don't see why that wouldn't work, but it'd take a bit more time to rewrite the parsing code since it currently constructs the objects as it goes through the files line by line.
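As a rough illustration of what the step 5 filtering does (a sketch only; the `strains` attribute is hypothetical and this is not the actual code being linked to):

```python
def common_strains(bgcs, spectra):
    """Keep only strain IDs that appear in both the genomics and metabolomics data."""
    genomics_strains = set()
    for bgc in bgcs:
        genomics_strains.update(bgc.strains)       # hypothetical attribute
    metabolomics_strains = set()
    for spec in spectra:
        metabolomics_strains.update(spec.strains)  # hypothetical attribute
    return genomics_strains & metabolomics_strains
```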
--> I suppose it could parse the data files, extract strain IDs along with an indicator of the source, do the step 5 filtering based on that, and only then go back and create all the BGC/GCF/Spectrum/MolFam objects using the reduced set of strains? @andrewramsay if that happens on the basis of the same initial BGC/GCF/Spectrum/MolFam assignments for different subsets a user selects, then it would at least provide "consistent" results and scores. Some GCFs may become (very) small.... Another thought occurred to me: we could let the user recalculate the scores based on the subset of strains.
I've updated the Docker image with my first attempt at implementing this change. The key points are:
Thanks @andrewramsay! I think this makes sense and can be easily explained to people. Same GCFs/MFs, but different scores depending on the subset. @jeep3 could you test it at some point? You could make a selection from the common_strains.csv that covers the different clades, make a subset of 100 strains, and see if that loads okay, for example.
Nice, Got it, thanks. |
I just found a bug in the webapp code in the last version that would have made it seem like the "Show scores ..." buttons wouldn't do anything. If you update to the latest Docker image today that should be fixed. |
Very cool, Andrew. I haven't used NPLinker recently, due to some other things. Thanks.
I assume the issue has been solved. Please reopen it if not :-) |
Hi, Andrew,
I want to use the Rosetta score to detect some links. Metcalf scoring was working (for example, Metcalf score values of 2.3 to 3.2); however, I found that the Rosetta score is still not working after switching scoring methods. Do you think my configuration in the antismash folder is correct? (A screenshot can be found below.) Since the output files I produced on the server with antiSMASH 5 do not include the strain name in front of the *.gbk file names, I added the strain name in front of all *.gbk files for NPLinker to identify and parse. In addition, I also added the strain name in front of all *.txt files in the knownclusterblast folder. Or do I need to do some other preparatory work? Thanks a lot!
Best wishes!