Updates to institution citation keys #7210

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-20T20:29:31Z

Fixes #6942. Fixes #7199.

TL;DR

Authors only having a last name are abbreviated. #6942 is miss-abbreviated because name parts containing uni are assumed to be universities, e.g., United Airlines. #7199 is triggers the abbreviation method because, from BibTeX´s point-of-view, the author only has a last name. The name is later removed because Java does not classify the first character as an uppercase letter, hence the heuristic assumes it is an insignificant word. E.g., {eBay} gets removed, {JabRef} doesn't.

Background

Authors that are institutions can have very lengthy citation keys unless they are abbreviated in some fashion. The abbreviations are generated in

jabref/src/main/java/org/jabref/logic/citationkeypattern/BracketedPattern.java

Line 1243 in 6718c08

private static String generateInstitutionKey(String content) {

which is based on a heuristic that tries to determine a suitable abbreviation based on the content of the specific author.

Issue #6942 is due to a regexp being overly broad when determining if an author is a university.

Issue #7199 was created from #6706 where I changed how a company/institute is identified. In previous versions of JabRef, an author is an institute if it is enclosed in curly brackets (e.g., {JabRef}), and I changed it to also include authors' who only have a last name (e.g., JabRef), which means that the author in issue #7199 is treated as an institution.
This would not have been an issue if it wasn't because the heuristic for creating a citation key identifies words starting with a lowercase letter in the author's name as an invalid part of an institute abbreviation. Think of the "and" in "National Aeronautics and Space Administration"). Since the author's name only consist of one "word", it is removed. (this is why the workaround @tobiasdiez mentioned does not work, even if it should)

Changes for

#6942

Identifying universities in the authors' names is arbitrary. The new regexp matches ^uni(v|b|$).* instead of ^uni. There are other suggestions in Readability for citation key patterns #6706 that can be implemented instead.
The inline institute abbreviation should now work again (e.g., {JabRef ({JB})} should produce JB)
I am ignoring the shortauthor field, because I find it unclear how and when to use the field (e.g., for multiple authors).

#7199

I'll change how institutes are identified, so that the Author must only have a last name and that last name must contain a space. This should only happen if the Author originally were enclosed in brackets. In all other cases I'd argue it is acceptable to not abbreviate the author since the name is only one word.
1. Drawbacks to this approach is that I am using the LatexToUnicodeAdapter to remove braces, and it doesn't like unbalanced ones.

Todo

Log a warning for unbalanced brackets in Authors
Change in CHANGELOG.md described (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

The regexp is still broad, but unless there are further complaints perhaps it is enough.

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-21T13:28:51Z

Except for logging when LatexToUnicodeAdapter fails (otherwise names such as "{Link{"{o}}ping University}}" will generate only Uni, due to the unbalanced brackets, and no warning, I don't intend any more code changes.

There will be some updates to the documentation, change log etc.

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-21T14:01:48Z

Wait... I can't check for failed parsing?

https://github.com/tomtung/latex2unicode/blob/71979d6904fcf7efade786297983b881a1b019d4/src/main/scala/com/github/tomtung/latex2unicode/LaTeX2Unicode.scala#L139-L146

When generating a key from a university name it should contain at least two parts, "university" and the university's name. If it does not it is likely that the name contained latex that could not be resolved correctly.

tobiasdiez · 2020-12-21T14:21:21Z

I guess you can use parse directly? Is this ready for review?

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-21T14:24:38Z

@tobiasdiez except for JavaDoc and changelog, yes, it is ready for review.

Which parse?

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-21T14:26:21Z

@tobiasdiez sorry, I missed to push 1 local commit.

tobiasdiez · 2020-12-21T14:45:04Z

https://github.com/tomtung/latex2unicode/blob/71979d6904fcf7efade786297983b881a1b019d4/src/main/scala/com/github/tomtung/latex2unicode/LaTeX2Unicode.scala#L126
this one?

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-21T14:54:33Z

O.o
That would indeed work. Should I put this PR back in draft then? ~~I am not sure how to go about this one, should I create a LatexToUnicodeParseAdapter that throws an exception so that the rest of the code in LatexToUnicodeAdapter is not repeated?~~ Edit I were thinking about the Formatter classes, which LatexToUnicodeAdapter is not.

tobiasdiez · 2020-12-21T15:09:24Z

I have to admit I had not yet looked at your PR. It was just in response to your question. What do you want to do?

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-21T15:13:41Z

@tobiasdiez good then :)

In my opinion it is better if the generation warns if there is incorrect latex. But I am not sure how to do it in practice. I don't have experience with Scala <-> Java but if I am not misreading the LaTeX2Unicode#parse I am going to have to check if the result is instanceof fastparse.Success in which case we need to add fastparse as a dependency?

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-21T15:15:09Z

I don't know the drawbacks/advantages of that approach, but if this is the only part of the JabRef code that needs this, perhaps it is better to leave the code as-is?

tobiasdiez · 2020-12-21T15:37:16Z

We already have farstparse as a dependency, so that's not a problem. I agree it would be nice to check for parsing errors, so I support you here (sadly only morally since I don't have any scala experience too).

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-21T15:46:14Z

fastparse is an indirect dependency from latex2unicode?
The issue I am having is that latex2unicode#parse returns a fastparse.core.Parsed instance. Unless I am missing something it means I must be able to import fastparse.core which I cannot right now.
Should I add fastparse to module-info.java?
I am missing something in the JabRef building process, I'll see if I can figure out.

src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java

Co-authored-by: Christoph <siedlerkiller@gmail.com>

src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-22T22:50:03Z

I’ll be looking at it shortly (dinner)

…

On December 22, 2020 at 17:29:07, Christoph ***@***.******@***.***)) wrote: @Siedlerchr commented on this pull request. In src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java(#7210 (comment)): > + * @return a String with LaTeX resolved into Unicode + * @throws IllegalArgumentException if the LaTeX could not be parsed + */ + public static Optional<String> parse(String inField) throws IllegalArgumentException { + Objects.requireNonNull(inField); + String toFormat = UNDERSCORE_MATCHER.matcher(inField).replaceAll(REPLACEMENT_CHAR); + try { + var parsingResult = LaTeX2Unicode.parse(toFormat); + if (parsingResult instanceof Parsed.Success) { + String text = parsingResult.get().value(); + toFormat = Normalizer.normalize(text, Normalizer.Form.NFC); + return Optional.of(UNDERSCORE_PLACEHOLDER_MATCHER.matcher(toFormat).replaceAll("_")); + } else { + return Optional.empty() + } + } catch (Throwable throwable) { the try and catch can now be removed — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub(#7210 (review)), or unsubscribe(https://github.com/notifications/unsubscribe-auth/AAT2NZ3T6F3IGY5GR7TBESDSWEMTHANCNFSM4VDLJ7NQ).

Siedlerchr · 2020-12-26T20:40:00Z

Is this ready for review? (I think it looks good)

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-27T02:05:50Z

Is this ready for review? (I think it looks good)

Yup, it is ready.

tobiasdiez

The change look good to me. Thanks a lot!

Could you please also add the examples from the issues as test to verify that the issue doesn't reoccur. Thanks!

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-27T15:25:47Z

@tobiasdiez good catch, apparently I did not examine my assumptions enough. My concerns regarding including it is,

1.

Based on #6706 (comment) and others,

(i) when using in the bibtex key generator, the non-ascii characters are converted to a "good" representation (as in your example: Göd -> Goe)
...
, because the bibtex tooling is not fully UTF8 (biblatex+biber is, but not the more commonly used bibtex tool).

my thoughts were that, to the largest extent possible, BibTeX citation keys should be ASCII/ANSI only, in which case #7199 should not be added as a test case. If {eBay} is successful, it is likely that #7199 will be as well (from BibTeX/Java´s point-of-view both contain only a last name that starts with a lowercase letter, and it was the institute abbreviation method that removed 杨秀群).

2.

I am not convinced that the currently generated citation key is correct, but you seem to know more about this particular issue than I, so feel free to correct me 😛
I based my decision on @xiaodongcentury's comment

"杨秀群", but not "杨秀群", the former has a space between last name and first name of a Chinese name.

In my opinion, the "correct" citation key should be either 杨 or 秀群 (i.e., based on the documentation, the citation key shoul be the last name). With eBay I know the citation key is correct.

However, all of that being said (and my apologies about the rather lengthy reading), if this is how JabRef is used, perhaps the test case should be added anyway?

tobiasdiez · 2020-12-27T17:56:24Z

As we have quite a lot of Chinese users, it would be good to support this as far as possible. Thus, if the user enters Chinese authors, then I would say we should generate a key with chinese characters. In the best case, one would actually replace the Chinese character by their romanization, but I'm not sure if there exist a convenient library for this.

For now, I would say we simply add 杨秀群 as a test (with whatever result it currently produces).

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-28T11:30:23Z

For now, I would say we simply add 杨秀群 as a test (with whatever result it currently produces).

Sure!

In the best case, one would actually replace the Chinese character by their romanization, but I'm not sure if there exist a convenient library for this.

pinyin4j seem to be the most popular Java alternative. However, given that there isn't a need of high performance (clean up operation?), there are more popular javascript libraries. I don't know if there is an interest for this?

tobiasdiez · 2020-12-28T12:20:29Z

Thanks, also for the research concerning pinyin libraries. Let's see if users report this as a feature request.

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-28T13:03:09Z

@Siedlerchr @tobiasdiez thank you both for your time, reviews and suggestions, they are appreciated!

xiaodongcentury · 2020-12-28T13:14:27Z

Thank you ALL!
The JabRef 5.3--2020-12-28--78b08b5 works perfectly with Chinese names.
Thanks again.

…dtask * upstream/master: Improved detection of long DOI's within text (#7260) Add missing author and fix name Fix style of highlighted checkboxes while searching in preferences (#7258) Updates to institution citation keys (#7210)

* upstream/master: (201 commits) Only disable move to file dir when path equals (#7269) Improved detection of long DOI's within text (#7260) Add missing author and fix name Fix style of highlighted checkboxes while searching in preferences (#7258) Updates to institution citation keys (#7210) Bump xmlunit-core from 2.8.1 to 2.8.2 (#7251) Bump classgraph from 4.8.97 to 4.8.98 (#7250) Bump bcprov-jdk15on from 1.67 to 1.68 (#7249) Bump xmlunit-matchers from 2.8.1 to 2.8.2 (#7252) Bump unirest-java from 3.11.06 to 3.11.09 (#7254) Bump org.beryx.jlink from 2.23.0 to 2.23.1 (#7253) Bump pascalgn/automerge-action from v0.12.0 to v0.13.0 (#7255) Added a check to integrate with the flatpak package (#7248) New translations JabRef_en.properties (Chinese Traditional) (#7247) Update code-howtos.md GitBook: [master] 5 pages and 25 assets modified New Crowdin updates (#7246) add language mapping for chinese remove chinese content fix hamcrest link ... # Conflicts: # src/main/java/org/jabref/gui/JabRefFrame.java # src/main/java/org/jabref/gui/dialogs/AutosaveUiManager.java # src/main/java/org/jabref/gui/exporter/SaveAction.java # src/main/java/org/jabref/gui/exporter/SaveDatabaseAction.java # src/main/java/org/jabref/logic/autosaveandbackup/BackupManager.java # src/test/java/org/jabref/gui/exporter/SaveDatabaseActionTest.java

* updateGradle7: (39 commits) try upgrading gradle Only disable move to file dir when path equals (#7269) Improved detection of long DOI's within text (#7260) Add missing author and fix name Fix style of highlighted checkboxes while searching in preferences (#7258) Updates to institution citation keys (#7210) Bump xmlunit-core from 2.8.1 to 2.8.2 (#7251) Bump classgraph from 4.8.97 to 4.8.98 (#7250) Bump bcprov-jdk15on from 1.67 to 1.68 (#7249) Bump xmlunit-matchers from 2.8.1 to 2.8.2 (#7252) Bump unirest-java from 3.11.06 to 3.11.09 (#7254) Bump org.beryx.jlink from 2.23.0 to 2.23.1 (#7253) Bump pascalgn/automerge-action from v0.12.0 to v0.13.0 (#7255) Added a check to integrate with the flatpak package (#7248) New translations JabRef_en.properties (Chinese Traditional) (#7247) Update code-howtos.md GitBook: [master] 5 pages and 25 assets modified New Crowdin updates (#7246) add language mapping for chinese remove chinese content ...

* upstream/master: fix checsktyle Fix for application dialogs opening in wrong displays (#7273) Only disable move to file dir when path equals (#7269) Improved detection of long DOI's within text (#7260) Add missing author and fix name Fix style of highlighted checkboxes while searching in preferences (#7258) Updates to institution citation keys (#7210) Bump xmlunit-core from 2.8.1 to 2.8.2 (#7251) Bump classgraph from 4.8.97 to 4.8.98 (#7250) Bump bcprov-jdk15on from 1.67 to 1.68 (#7249) Bump xmlunit-matchers from 2.8.1 to 2.8.2 (#7252) Bump unirest-java from 3.11.06 to 3.11.09 (#7254) Bump org.beryx.jlink from 2.23.0 to 2.23.1 (#7253) Bump pascalgn/automerge-action from v0.12.0 to v0.13.0 (#7255) # Conflicts: # src/main/java/org/jabref/gui/openoffice/OpenOfficePanel.java

k3KAW8Pnf7mkmdSMPHz27 added 12 commits December 18, 2020 15:13

Add test cases

3dd0b28

Fix overly broad regexp

6b26d0c

The regexp is still broad, but unless there are further complaints perhaps it is enough.

Fix case-sensitivity in test case

1d6d393

Fix inline abbreviation for institutes

5d61a44

Drop test for short author

c76bb28

Add test case

df0144c

Add test case

e5b5874

Fix test case for single word author

95dbc21

Fix un-escaped backslash in test case

e635759

Fix unbalanced brackets in text cases

f6e41cc

Fix institute author abbreviations

d531b88

Readability modifications

bcfa8da

k3KAW8Pnf7mkmdSMPHz27 changed the title ~~[WIP] Updates to institution citation keys~~ Updates to institution citation keys Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 marked this pull request as ready for review December 21, 2020 13:25

Add log output for generating university key

b55c413

When generating a key from a university name it should contain at least two parts, "university" and the university's name. If it does not it is likely that the name contained latex that could not be resolved correctly.

k3KAW8Pnf7mkmdSMPHz27 added 3 commits December 21, 2020 09:32

Fix JavaDoc

a787acc

Update CHANGELOG.md

5b6454f

Merge branch 'master' into fix-for-issue-6942

7b5271d

Siedlerchr reviewed Dec 22, 2020

View reviewed changes

src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java Outdated Show resolved Hide resolved

Siedlerchr reviewed Dec 22, 2020

View reviewed changes

src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java Outdated Show resolved Hide resolved

Siedlerchr reviewed Dec 22, 2020

View reviewed changes

src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java Outdated Show resolved Hide resolved

Siedlerchr reviewed Dec 22, 2020

View reviewed changes

src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java Outdated Show resolved Hide resolved

Siedlerchr reviewed Dec 22, 2020

View reviewed changes

src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java Outdated Show resolved Hide resolved

k3KAW8Pnf7mkmdSMPHz27 and others added 4 commits December 22, 2020 17:16

Update src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java

3eac89d

Co-authored-by: Christoph <siedlerkiller@gmail.com>

Update src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java

e90f9ca

Co-authored-by: Christoph <siedlerkiller@gmail.com>

Update src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java

f6165cb

Co-authored-by: Christoph <siedlerkiller@gmail.com>

Update src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java

e78e8b9

Co-authored-by: Christoph <siedlerkiller@gmail.com>

Siedlerchr reviewed Dec 22, 2020

View reviewed changes

src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java Outdated Show resolved Hide resolved

k3KAW8Pnf7mkmdSMPHz27 added 2 commits December 22, 2020 19:42

Removes unnecessary checked exception

3e3deae

Fix missed NFC normalization

ad4dab9

Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 26, 2020

tobiasdiez requested changes Dec 27, 2020

View reviewed changes

tobiasdiez added the status: changes required Pull requests that are not yet complete label Dec 27, 2020

Add test case

1fc991e

tobiasdiez merged commit 78b08b5 into JabRef:master Dec 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updates to institution citation keys #7210

Updates to institution citation keys #7210

k3KAW8Pnf7mkmdSMPHz27 commented Dec 20, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

tobiasdiez commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

tobiasdiez commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

tobiasdiez commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

tobiasdiez commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 22, 2020 via email

Siedlerchr commented Dec 26, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 27, 2020

tobiasdiez left a comment

k3KAW8Pnf7mkmdSMPHz27 commented Dec 27, 2020 •

edited

Loading

tobiasdiez commented Dec 27, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 28, 2020

tobiasdiez commented Dec 28, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 28, 2020

xiaodongcentury commented Dec 28, 2020

Updates to institution citation keys #7210

Updates to institution citation keys #7210

Conversation

k3KAW8Pnf7mkmdSMPHz27 commented Dec 20, 2020 • edited Loading

TL;DR

Background

Changes for

Todo

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 • edited Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

tobiasdiez commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 • edited Loading

tobiasdiez commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 • edited Loading

tobiasdiez commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 • edited Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

tobiasdiez commented Dec 21, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 • edited Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 22, 2020 via email

Siedlerchr commented Dec 26, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 27, 2020

tobiasdiez left a comment

Choose a reason for hiding this comment

k3KAW8Pnf7mkmdSMPHz27 commented Dec 27, 2020 • edited Loading

1.

2.

tobiasdiez commented Dec 27, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 28, 2020

tobiasdiez commented Dec 28, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 28, 2020

xiaodongcentury commented Dec 28, 2020

k3KAW8Pnf7mkmdSMPHz27 commented Dec 20, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 27, 2020 •

edited

Loading