Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jabref rewrites the entire journal list on addition #3323

Closed
bdcaf opened this issue Oct 18, 2017 · 12 comments · Fixed by #3729
Closed

Jabref rewrites the entire journal list on addition #3323

bdcaf opened this issue Oct 18, 2017 · 12 comments · Fixed by #3729

Comments

@bdcaf
Copy link

bdcaf commented Oct 18, 2017

JabRef versionJabRef 4.0
Mac OS X 10.13 x86_64
Java 1.8.0_144

I was surprised that the files from abbrv.jabref.org were rewritten in a randomised order. This happens under serious CPU and memory usage. I suppose this is not intended.

Steps to reproduce:

  1. clone JabRef/abbrv.jabref.org
  2. Select: Options - Manage journal abbreviations
  3. Add one of the journal abbreviations
  4. Click save changes. In Task Monitor I now see memory jumps from about 200MB to >1GB, CPU load > 100%
  5. The file is now changed with entries in randomised order
  6. Possible related - after doing this Jabref takes forever to quit - like minutes. After the window closes the symbol still stays in the task bar. Still the same memory and CPU load.
  7. If the process is killed at next start the journal abbreviations are not in the list and the process needs to be repeated.

head journal_abbreviations_ams.txt now looks like this:

Philosophy of Science = Philos. Sci.
Philosophia Naturalis = Philos. Natur.
MIT Press Series in Artificial Intelligence = MIT Press Ser. Artificial Intelligence
Acta Academiae Aboensis = Acta Acad. Abo. Ser. B
Advances in Chemical Physics = Adv. Chem. Phys.
Cahiers de Topologie et Géométrie Différentielle Catégoriques = Cahiers Topologie Géom. Différentielle Catég.
Malaysian Mathematical Society = Bull. Malaysian Math. Soc. (2)
ASA-SIAM Series on Statistics and Applied Probability = ASA-SIAM Ser. Stat. Appl. Probab.
Nonlinear Analysis = Nonlinear Anal.
Problems of Information Transmission = Problems Inform. Transmission
@lenhard
Copy link
Member

lenhard commented Oct 19, 2017

The code that does the actual writing of the abbreviations is actually quite simple:

public static void writeOrCreate(Path path, List<Abbreviation> abbreviations, Charset encoding) throws IOException {
try (OutputStream outStream = Files.newOutputStream(path);
OutputStreamWriter writer = new OutputStreamWriter(outStream, encoding)) {
for (Abbreviation entry : abbreviations) {
writer.write(entry.getName());
writer.write(" = ");
writer.write(entry.getAbbreviation());
writer.write(OS.NEWLINE);
}
}

I don't see an immediate problem in it and the performance problems might simply come from the fact that there are so many abbreviations (around 14 000) if I recall it correctly. Last time I dealt with this, I think also Notepad++ had problems processing the file. This is just a guess, but we might be able to improve this if we compute the string representation before passing it to writer.write() and calling this method only once.

I am unsure as to why the sorting is lost. This needs further investigation.

@Siedlerchr
Copy link
Member

Are the abbreviations somewhere read or stored in a HashMap ? That might explain the different order

@AEgit
Copy link

AEgit commented Oct 19, 2017

Regarding the performance issue: Has the following suggestion already been incorporated into JabRef?
#2831

@bdcaf
Copy link
Author

bdcaf commented Oct 20, 2017

From user side I would have two suggestions:

  1. as it is not intended to change the files - would it be possible to add them read only?
  2. as these files are provided by jabref and suggested in the documentation. I think it would make sense either to provide files that don't cause performance issues or clearly point that out in the documentation.

@lenhard
Copy link
Member

lenhard commented Oct 20, 2017

@AEgit As far as I can see this has not yet been addressed.

@bdcaf Regarding 2.: Can you suggest how to provide files that don't cause performance issues? We can delete 75% of the abbreviations. Then performance issues are gone, but that's surely not the right solution. Let me make this clear: The performance problems stem from the fact that the list of abbreviations is so gigantic. There's probably some optimizations that we can do in the code, but that won't make the huge amount of data to process go away. I would be very happy about suggestions on how to improve the file structure!

@halirutan
Copy link
Collaborator

halirutan commented Oct 20, 2017

Maybe I understand something wrong, but I believe the implementation flawed. To explain what I mean, see this memory profile that was collected during exactly what @bdcaf described

img

This shows the call-tree and you see that the memory consumption goes down to String.replace. What troubles me is, why addEntry is implemented like this

    public void addEntry(Abbreviation abbreviation) {
        Objects.requireNonNull(abbreviation);

        if (abbreviations.contains(abbreviation)) {
            Abbreviation previous = getAbbreviation(abbreviation.getName()).get();
            abbreviations.remove(previous);
           LOGGER.info("Duplicate journal abbreviation - old one will be overwritten by new one\nOLD: "
                    + previous + "\nNEW: " + abbreviation);
        }

        abbreviations.add(abbreviation);
    }

First the obvious part: We log every duplicate entry to the log file. While this surely was well intended, note that several ten thousand lines are written into the log-file during a run like this that should be fast.

My major question is the following: The abbreviations list is a HashSet which means it cannot contain duplicates. When I see this right, then equals for Abbreviation is well defined by comparing the keys (and hashCode also).

Can someone explain why the costly battle down to isMatched is necessary (@simonharrer)? Why can we not simply insert "remove the new entry" and add it again?

So what I propose this change for addEntry:

    public void addEntry(Abbreviation abbreviation) {
        Objects.requireNonNull(abbreviation);

        if (abbreviations.contains(abbreviation)) {
//            Abbreviation previous = getAbbreviation(abbreviation.getName()).get();
            abbreviations.remove(abbreviation);
            LOGGER.info("Duplicate journal abbreviation - old one will be overwritten by new one: " + abbreviation);
        }

        abbreviations.add(abbreviation);
    }

@lenhard
Copy link
Member

lenhard commented Oct 20, 2017

@halirutan Thanks once again for an in-depth analysis :-)

Maybe we should just remove the logging here or alternatively switch to a log level that will not execute in regular mode.

As for the removal of the abbreviation, I agree with what you write. If contains() returns true here, then also remove() will operate as expected. I do not know why there should be a need to retrieve the old abbreviation. The implementation of equals() in Abbreviation seems flawed to me, because it is just based on the name and not the actual abbreviation. But I do not know if this is intentional and what may break if we change it.

I think we should remove the logging and change the code as you described.

@halirutan
Copy link
Collaborator

halirutan commented Oct 20, 2017

@lenhard I think there is a good reason for the equals in Abbreviation. When I see this right, the intention is as follows:

  1. There can only be exactly one Journal XXXX, but there can be different abbreviations for it. In JabRef, we always have exactly one current abbreviation for Journal XXXX and we can
    1. insert new Journal names where we currently don't have an abbr for
    2. overwrite existing one if we need a different style

Therefore, making two "Abbreviations" equal by comparing their name sounds OK to me. I would have chosen a different layout for this though.

I tested my fix in mean-time and it runs in maybe 200ms for the largest of those abbr-files with no memory worth mentioning, even with logging. Maybe we should wait for some others to give their OK.

@lenhard
Copy link
Member

lenhard commented Oct 20, 2017

@JabRef/developers What do you think about the proposed solution? Can we go ahead?

@Siedlerchr
Copy link
Member

All that increases performance is welcomed ;)

@tobiasdiez
Copy link
Member

tobiasdiez commented Oct 20, 2017

The fix suggested by @halirutan looks good. There are even some tests for the abbreviation stuff so I don't expect that something serious breaks.

This, however, still leaves the question why do we actually rewrite an abbreviation list that just got imported?

@bdcaf
Copy link
Author

bdcaf commented Oct 20, 2017

@lenhard about shortening the lists. IMO the goal should be lists that load efficiently.

The largest list is web of science having 87230 lines and entrez with 19506 lines. I had a quick look at them. There seem to be a serious number of non-english journals - I suppose they could be separated in their own files. When I look at the webofscience files there are 5745 Proceedings, 3995 Conferences and 2028 Symopsium. Many of which having numbers or years in the full name. Maybe put these in their own file. I'm not certain - but wouldn't it be correct to have these numbers and dates in different fields? Then the list could be even more reduced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants