JabRef rewrites the entire journal list on addition #3323
Comments
The code that does the actual writing of the abbreviations is actually quite simple.
I don't see an immediate problem in it, and the performance problems might simply come from the fact that there are so many abbreviations (around 14,000, if I recall correctly). Last time I dealt with this, I think Notepad++ also had problems processing the file. This is just a guess, but we might be able to improve this if we compute the string representation of the whole list first and only then pass it to the write call. I am unsure why the sorting is lost; that needs further investigation.
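A minimal sketch of that idea, assuming the current code writes entry by entry (the class and method names below are illustrative, not JabRef's actual API): build the whole string first, then write it to disk with a single call.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not the real JabRef writer.
class AbbreviationWriterSketch {

    // Renders all ~14,000 lines into one string in memory and writes it
    // with a single Files.write call instead of one small write per entry.
    static void writeAll(List<String> abbreviationLines, Path file) throws IOException {
        String content = abbreviationLines.stream()
                .collect(Collectors.joining(System.lineSeparator()));
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    }
}
```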
Are the abbreviations read from or stored in a HashMap somewhere? That might explain the different order.
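For illustration only (this is not the actual JabRef code): a plain HashMap iterates in hash order, so writing its contents back to disk scrambles the original file order, whereas a LinkedHashMap keeps insertion order.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class OrderingSketch {
    public static void main(String[] args) {
        Map<String, String> hashed = new HashMap<>();
        Map<String, String> linked = new LinkedHashMap<>();
        for (String name : new String[] {"Zeitschrift fuer Physik", "Acta Mathematica", "Journal of Botany"}) {
            hashed.put(name, name);
            linked.put(name, name);
        }
        System.out.println(hashed.keySet()); // order depends on hash buckets
        System.out.println(linked.keySet()); // insertion order preserved
    }
}
```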
Regarding the performance issue: Has the following suggestion already been incorporated into JabRef?
From the user side I would have two suggestions:
@AEgit As far as I can see, this has not yet been addressed. @bdcaf Regarding 2.: Can you suggest how to provide files that don't cause performance issues? We can delete 75% of the abbreviations; then the performance issues are gone, but that's surely not the right solution. Let me make this clear: the performance problems stem from the fact that the list of abbreviations is so gigantic. There are probably some optimizations we can do in the code, but they won't make the huge amount of data to process go away. I would be very happy about suggestions on how to improve the file structure!
Maybe I understand something wrong, but I believe the implementation is flawed. To explain what I mean, see this memory profile that was collected during exactly what @bdcaf described. It shows the call-tree, and you can see that the memory consumption goes down to this addEntry method:

```java
public void addEntry(Abbreviation abbreviation) {
    Objects.requireNonNull(abbreviation);
    if (abbreviations.contains(abbreviation)) {
        Abbreviation previous = getAbbreviation(abbreviation.getName()).get();
        abbreviations.remove(previous);
        LOGGER.info("Duplicate journal abbreviation - old one will be overwritten by new one\nOLD: "
                + previous + "\nNEW: " + abbreviation);
    }
    abbreviations.add(abbreviation);
}
```

First the obvious part: we log every duplicate entry to the log file. While this surely was well intended, note that several ten thousand lines are written into the log file during a run like this, which should be fast.

My major question is the following: can someone explain why the costly lookup of the previous entry via getAbbreviation is needed here at all?

So I propose this change for addEntry:

```java
public void addEntry(Abbreviation abbreviation) {
    Objects.requireNonNull(abbreviation);
    if (abbreviations.contains(abbreviation)) {
        // Abbreviation previous = getAbbreviation(abbreviation.getName()).get();
        abbreviations.remove(abbreviation);
        LOGGER.info("Duplicate journal abbreviation - old one will be overwritten by new one: " + abbreviation);
    }
    abbreviations.add(abbreviation);
}
```
@halirutan Thanks once again for an in-depth analysis :-) Maybe we should just remove the logging here, or alternatively switch to a log level that will not execute in regular mode. As for the removal of the abbreviation, I agree with what you write. I think we should remove the logging and change the code as you described.
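As a sketch of the log-level idea (using java.util.logging here purely for illustration; JabRef's actual logger and class layout differ), the message could be demoted to a debug-level call whose string is only built when that level is enabled:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical repository sketch, not JabRef's real class.
class AbbreviationRepositorySketch {
    private static final Logger LOGGER = Logger.getLogger(AbbreviationRepositorySketch.class.getName());
    private final List<String> abbreviations = new ArrayList<>();

    public void addEntry(String abbreviation) {
        Objects.requireNonNull(abbreviation);
        if (abbreviations.contains(abbreviation)) {
            abbreviations.remove(abbreviation);
            // FINE is below the default INFO threshold, and the supplier means the
            // message string is not even concatenated unless FINE logging is enabled.
            LOGGER.log(Level.FINE, () -> "Duplicate journal abbreviation overwritten: " + abbreviation);
        }
        abbreviations.add(abbreviation);
    }
}
```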
@lenhard I think there is a good reason for the way equality is implemented here: making two Abbreviations equal by comparing their names sounds OK to me. I would have chosen a different layout for this, though. I tested my fix in the meantime and it runs in maybe 200 ms for the largest of those abbreviation files, with no memory consumption worth mentioning, even with logging. Maybe we should wait for some others to give their OK.
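The Abbreviation class itself is not quoted in this thread, but a name-based equality along these lines (a sketch, not necessarily JabRef's exact implementation) is what would make abbreviations.remove(abbreviation) evict the previously stored duplicate:

```java
import java.util.Objects;

class Abbreviation {
    private final String name;
    private final String abbreviation;

    Abbreviation(String name, String abbreviation) {
        this.name = name;
        this.abbreviation = abbreviation;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Abbreviation)) {
            return false;
        }
        // Two entries denote the same journal if the full names match,
        // regardless of how each one abbreviates it.
        return Objects.equals(name, ((Abbreviation) other).name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name);
    }
}
```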
@JabRef/developers What do you think about the proposed solution? Can we go ahead?
Anything that increases performance is welcome ;)
The fix suggested by @halirutan looks good. There are even some tests for the abbreviation stuff, so I don't expect anything serious to break. This, however, still leaves the question of why we actually rewrite an abbreviation list that was just imported.
@lenhard About shortening the lists: IMO the goal should be lists that load efficiently. The largest list is Web of Science with 87,230 lines, followed by Entrez with 19,506 lines. I had a quick look at them. There seems to be a serious number of non-English journals; I suppose they could be separated into their own files. When I look at the Web of Science file, there are 5,745 Proceedings, 3,995 Conferences and 2,028 Symposium entries, many of which have numbers or years in the full name. Maybe put these in their own file. I'm not certain, but wouldn't it be correct to have these numbers and dates in different fields? Then the list could be reduced even further.
JabRef version: JabRef 4.0
Operating system: Mac OS X 10.13 x86_64
Java version: 1.8.0_144
I was surprised that the files from abbrv.jabref.org were rewritten in a randomised order. This happens under serious CPU and memory usage. I suppose this is not intended.
Steps to reproduce:
`head journal_abbreviations_ams.txt` now looks like this: