Jabref rewrites the entire journal list on addition #3323

bdcaf · 2017-10-18T18:09:24Z

JabRef versionJabRef 4.0
Mac OS X 10.13 x86_64
Java 1.8.0_144

I was surprised that the files from abbrv.jabref.org were rewritten in a randomised order. This happens under serious CPU and memory usage. I suppose this is not intended.

Steps to reproduce:

clone JabRef/abbrv.jabref.org
Select: Options - Manage journal abbreviations
Add one of the journal abbreviations
Click save changes. In Task Monitor I now see memory jumps from about 200MB to >1GB, CPU load > 100%
The file is now changed with entries in randomised order
Possible related - after doing this Jabref takes forever to quit - like minutes. After the window closes the symbol still stays in the task bar. Still the same memory and CPU load.
If the process is killed at next start the journal abbreviations are not in the list and the process needs to be repeated.

head journal_abbreviations_ams.txt now looks like this:

Philosophy of Science = Philos. Sci.
Philosophia Naturalis = Philos. Natur.
MIT Press Series in Artificial Intelligence = MIT Press Ser. Artificial Intelligence
Acta Academiae Aboensis = Acta Acad. Abo. Ser. B
Advances in Chemical Physics = Adv. Chem. Phys.
Cahiers de Topologie et Géométrie Différentielle Catégoriques = Cahiers Topologie Géom. Différentielle Catég.
Malaysian Mathematical Society = Bull. Malaysian Math. Soc. (2)
ASA-SIAM Series on Statistics and Applied Probability = ASA-SIAM Ser. Stat. Appl. Probab.
Nonlinear Analysis = Nonlinear Anal.
Problems of Information Transmission = Problems Inform. Transmission

The text was updated successfully, but these errors were encountered:

lenhard · 2017-10-19T15:14:02Z

The code that does the actual writing of the abbreviations is actually quite simple:

jabref/src/main/java/org/jabref/logic/journals/AbbreviationWriter.java

Lines 29 to 38 in 1751cba

    
           public static void writeOrCreate(Path path, List<Abbreviation> abbreviations, Charset encoding) throws IOException { 
        
               try (OutputStream outStream = Files.newOutputStream(path); 
        
                       OutputStreamWriter writer = new OutputStreamWriter(outStream, encoding)) { 
        
                   for (Abbreviation entry : abbreviations) { 
        
                       writer.write(entry.getName()); 
        
                       writer.write(" = "); 
        
                       writer.write(entry.getAbbreviation()); 
        
                       writer.write(OS.NEWLINE); 
        
                   } 
        
               }

I don't see an immediate problem in it and the performance problems might simply come from the fact that there are so many abbreviations (around 14 000) if I recall it correctly. Last time I dealt with this, I think also Notepad++ had problems processing the file. This is just a guess, but we might be able to improve this if we compute the string representation before passing it to writer.write() and calling this method only once.

I am unsure as to why the sorting is lost. This needs further investigation.

Siedlerchr · 2017-10-19T15:36:06Z

Are the abbreviations somewhere read or stored in a HashMap ? That might explain the different order

AEgit · 2017-10-19T15:49:45Z

Regarding the performance issue: Has the following suggestion already been incorporated into JabRef?
#2831

bdcaf · 2017-10-20T06:51:48Z

From user side I would have two suggestions:

as it is not intended to change the files - would it be possible to add them read only?
as these files are provided by jabref and suggested in the documentation. I think it would make sense either to provide files that don't cause performance issues or clearly point that out in the documentation.

lenhard · 2017-10-20T07:43:23Z

@AEgit As far as I can see this has not yet been addressed.

@bdcaf Regarding 2.: Can you suggest how to provide files that don't cause performance issues? We can delete 75% of the abbreviations. Then performance issues are gone, but that's surely not the right solution. Let me make this clear: The performance problems stem from the fact that the list of abbreviations is so gigantic. There's probably some optimizations that we can do in the code, but that won't make the huge amount of data to process go away. I would be very happy about suggestions on how to improve the file structure!

halirutan · 2017-10-20T08:43:49Z

Maybe I understand something wrong, but I believe the implementation flawed. To explain what I mean, see this memory profile that was collected during exactly what @bdcaf described

This shows the call-tree and you see that the memory consumption goes down to String.replace. What troubles me is, why addEntry is implemented like this

    public void addEntry(Abbreviation abbreviation) {
        Objects.requireNonNull(abbreviation);

        if (abbreviations.contains(abbreviation)) {
            Abbreviation previous = getAbbreviation(abbreviation.getName()).get();
            abbreviations.remove(previous);
           LOGGER.info("Duplicate journal abbreviation - old one will be overwritten by new one\nOLD: "
                    + previous + "\nNEW: " + abbreviation);
        }

        abbreviations.add(abbreviation);
    }

First the obvious part: We log every duplicate entry to the log file. While this surely was well intended, note that several ten thousand lines are written into the log-file during a run like this that should be fast.

My major question is the following: The abbreviations list is a HashSet which means it cannot contain duplicates. When I see this right, then equals for Abbreviation is well defined by comparing the keys (and hashCode also).

Can someone explain why the costly battle down to isMatched is necessary (@simonharrer)? Why can we not simply ~~insert~~ "remove the new entry" and add it again?

So what I propose this change for addEntry:

    public void addEntry(Abbreviation abbreviation) {
        Objects.requireNonNull(abbreviation);

        if (abbreviations.contains(abbreviation)) {
//            Abbreviation previous = getAbbreviation(abbreviation.getName()).get();
            abbreviations.remove(abbreviation);
            LOGGER.info("Duplicate journal abbreviation - old one will be overwritten by new one: " + abbreviation);
        }

        abbreviations.add(abbreviation);
    }

lenhard · 2017-10-20T10:01:10Z

@halirutan Thanks once again for an in-depth analysis :-)

Maybe we should just remove the logging here or alternatively switch to a log level that will not execute in regular mode.

As for the removal of the abbreviation, I agree with what you write. If contains() returns true here, then also remove() will operate as expected. I do not know why there should be a need to retrieve the old abbreviation. The implementation of equals() in Abbreviation seems flawed to me, because it is just based on the name and not the actual abbreviation. But I do not know if this is intentional and what may break if we change it.

I think we should remove the logging and change the code as you described.

halirutan · 2017-10-20T10:11:08Z

@lenhard I think there is a good reason for the equals in Abbreviation. When I see this right, the intention is as follows:

There can only be exactly one Journal XXXX, but there can be different abbreviations for it. In JabRef, we always have exactly one current abbreviation for Journal XXXX and we can
1. insert new Journal names where we currently don't have an abbr for
2. overwrite existing one if we need a different style

Therefore, making two "Abbreviations" equal by comparing their name sounds OK to me. I would have chosen a different layout for this though.

I tested my fix in mean-time and it runs in maybe 200ms for the largest of those abbr-files with no memory worth mentioning, even with logging. Maybe we should wait for some others to give their OK.

lenhard · 2017-10-20T10:17:19Z

@JabRef/developers What do you think about the proposed solution? Can we go ahead?

Siedlerchr · 2017-10-20T10:20:02Z

All that increases performance is welcomed ;)

tobiasdiez · 2017-10-20T11:15:44Z

The fix suggested by @halirutan looks good. There are even some tests for the abbreviation stuff so I don't expect that something serious breaks.

This, however, still leaves the question why do we actually rewrite an abbreviation list that just got imported?

bdcaf · 2017-10-20T14:11:23Z

@lenhard about shortening the lists. IMO the goal should be lists that load efficiently.

The largest list is web of science having 87230 lines and entrez with 19506 lines. I had a quick look at them. There seem to be a serious number of non-english journals - I suppose they could be separated in their own files. When I look at the webofscience files there are 5745 Proceedings, 3995 Conferences and 2028 Symopsium. Many of which having numbers or years in the full name. Maybe put these in their own file. I'm not certain - but wouldn't it be correct to have these numbers and dates in different fields? Then the list could be even more reduced.

lenhard added the type: performance label Oct 23, 2017

lenhard mentioned this issue Feb 16, 2018

Improve performance for managing journal abbreviations #3729

Merged

5 tasks

Siedlerchr closed this as completed in #3729 Feb 16, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jabref rewrites the entire journal list on addition #3323

Jabref rewrites the entire journal list on addition #3323

bdcaf commented Oct 18, 2017

lenhard commented Oct 19, 2017 •

edited

Loading

Siedlerchr commented Oct 19, 2017

AEgit commented Oct 19, 2017

bdcaf commented Oct 20, 2017

lenhard commented Oct 20, 2017

halirutan commented Oct 20, 2017 •

edited

Loading

lenhard commented Oct 20, 2017

halirutan commented Oct 20, 2017 •

edited

Loading

lenhard commented Oct 20, 2017

Siedlerchr commented Oct 20, 2017

tobiasdiez commented Oct 20, 2017 •

edited

Loading

bdcaf commented Oct 20, 2017 •

edited

Loading

Jabref rewrites the entire journal list on addition #3323

Jabref rewrites the entire journal list on addition #3323

Comments

bdcaf commented Oct 18, 2017

lenhard commented Oct 19, 2017 • edited Loading

Siedlerchr commented Oct 19, 2017

AEgit commented Oct 19, 2017

bdcaf commented Oct 20, 2017

lenhard commented Oct 20, 2017

halirutan commented Oct 20, 2017 • edited Loading

lenhard commented Oct 20, 2017

halirutan commented Oct 20, 2017 • edited Loading

lenhard commented Oct 20, 2017

Siedlerchr commented Oct 20, 2017

tobiasdiez commented Oct 20, 2017 • edited Loading

bdcaf commented Oct 20, 2017 • edited Loading

lenhard commented Oct 19, 2017 •

edited

Loading

halirutan commented Oct 20, 2017 •

edited

Loading

halirutan commented Oct 20, 2017 •

edited

Loading

tobiasdiez commented Oct 20, 2017 •

edited

Loading

bdcaf commented Oct 20, 2017 •

edited

Loading