Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for issue 5850: Journal abbreviations in UTF-8 not recognized #7639

Merged
merged 30 commits into from
Apr 23, 2021
Merged

Fix for issue 5850: Journal abbreviations in UTF-8 not recognized #7639

merged 30 commits into from
Apr 23, 2021

Conversation

MrGhabi
Copy link
Contributor

@MrGhabi MrGhabi commented Apr 17, 2021

Fixes #5850

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

Reproduce the issue:

  1. New library
  2. New article
  3. BibeTx source adds the following:
    @article{杨芙清2005软件工程技术发展思索,
      title={软件工程技术发展思索},
      author={杨芙清},
      journal={软件学报},
      volume={16},
      number={1},
      year={2005},
      publisher={Citeseer}
    }
    
  4. click "check integrity"

The main reason for this bug is the check-tools Check integrity only accept the charset ASCII. It works well in English citations, but jabref has users across the world and they have different charsets.

The screenshot:

  1. before:
    before

  2. after
    intigrity-check

The way to fix:

  1. First I find the bug related to the class ASCIICharacterChecker.java
  2. In this class
boolean asciiOnly = CharMatcher.ascii().matchesAllOf(field.getValue());

any non-ASCII encoded characters will be warned.
3. Then I remove the steps in IntegrityCheck .

And still, I want to give a warning about non-UTF8 encoded characters.

  1. I get the default encoding from the system (Since we need to give a warning when the field cannot be decoded in UTF-8. And this may happen when users' default encoding charset is non-UTF-8)
  2. check whether the value of fields(string) can be decoded in UTF-8
  3. if not, just give a warning about "Non-UTF-8 encoded found"

To check this, we need first set out the default charset(for example GBK) in the whole environment.
Then we can get the following warning when using Integrity check:
image

public List<IntegrityMessage> check(BibEntry entry) {
List<IntegrityMessage> results = new ArrayList<>();
for (Map.Entry<Field, String> field : entry.getFieldMap().entrySet()) {
Charset charset = Charset.forName(System.getProperty("file.encoding"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would extract this out of the loop, as it doesn't depend on the loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason to use System.getProperty("file.encoding") and not say the encoding specified in the Library properties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since different users have different charsets due to the operating system or the default settings of the computer. And System.getProperty("file.encoding") is used get the default charset. If the charset is not UTF-8, we should give a warning about that.
And the reason not to use the Library properties & Database properties: Maybe the user doesn't know the default charset in his computer or he set the charset for jabref, but we should give a warning about that since Non-UTF-8 charset may cause to garbled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks for the explanation.

But doesn't this give a lot of false positives? Say I have my library encoded in Charset A, and my systems default is Charset B. If all characters in my database are properly encoded with Charset A, then I shouldn't get any warnings even though some of the characters may not be encodable in Charset B, right?

But I also have to admit that I do not yet understand the use-case from the user perspective, so maybe I'm missing something obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about that. In my test, if one user's default charset is A then his paste-board, his input is all encoded in A. So when he input something maybe just garbled. Maybe there is an example: #7629
So the scenario may be rare. That's the reason I don't choose to get the charset by Charset charset = bibDatabaseContext.getMetaData().getEncoding().orElse(preferences.getDefaultEncoding());

By the way, I have a question about the design. If the bibtex is only allowed in ascii in design, why do we allow the user to save it into different charsets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, reviewer! @tobiasdiez After thinking for a long time and doing some tests, I think maybe it's better to give 2 kinds of warning:

  1. In BibLatex, if the Library charset is not UTF-8, then give a warning Non-UTF-8 field found.
  2. In both BibLatex and BibTeX, if the System env is not UTF-8, give the warning Non-UTF-8 env, may cause garbled.
    And I'm eagerly waiting for your suggestions and reply!

CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Member

@Siedlerchr Siedlerchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks already good. PLease have a look at the checkstyle issues

@Siedlerchr Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Apr 17, 2021
@tobiasdiez
Copy link
Member

I might be not up-to-date, but I always thought UTF8 characters are only allowed in biblatex and that bibtex only handles asci characters. Did this change?

@MrGhabi
Copy link
Contributor Author

MrGhabi commented Apr 17, 2021

I might be not up-to-date, but I always thought UTF8 characters are only allowed in biblatex and that bibtex only handles asci characters. Did this change?

Yeah, some journals and papers use non-ASCII characters as their names.. etc(just as the bib in bibtex I added before). and maybe it is difficult to do with it in jabref. The details are shown in the issue. So I think maybe it is better to trade them equally.

@tobiasdiez
Copy link
Member

I don't really have experience with say Chinese names (as authors or journals) with bibtex. But the only evidence I could find was always suggesting bibLAtex, since bibtex doesn't support UTF8, see e.g. https://tex.stackexchange.com/questions/100092/how-to-include-a-chinese-paper-in-reference-via-bibtex.

So does it make more sense to keep the asci check for bibtex, and add the new utf8 check for biblatex?

@Siedlerchr
Copy link
Member

I agree with @tobiasdiez we need the utf8 check for biblatex and the ascii checker for bibtex then.

@MrGhabi
Copy link
Contributor Author

MrGhabi commented Apr 17, 2021

I don't really have experience with say Chinese names (as authors or journals) with bibtex. But the only evidence I could find was always suggesting bibLAtex, since bibtex doesn't support UTF8, see e.g. https://tex.stackexchange.com/questions/100092/how-to-include-a-chinese-paper-in-reference-via-bibtex.

So does it make more sense to keep the asci check for bibtex, and add the new utf8 check for biblatex?

I agree with @tobiasdiez we need the utf8 check for biblatex and the ascii checker for bibtex then.

Good idea!I will refactor my code to meet this need! (After searching more information about bibtex and biblatex, I agree with you~ )

And the check of utf8 for biblatex maybe it's not a bug but an enhancement? (laugh) I will focus on it!

@MrGhabi
Copy link
Contributor Author

MrGhabi commented Apr 17, 2021

Hi Reviewers! I have added the UTF-8 check for biblatex and recovery the ASCII check for bibtex!
Is there anything I should do to give a better submission?

@Siedlerchr
Copy link
Member

So far looks good, you only need to add the new localization string the l10 files, see here for more details https://devdocs.jabref.org/getting-into-the-code/code-howtos#using-localization-correctly

@MrGhabi
Copy link
Contributor Author

MrGhabi commented Apr 17, 2021

Hi reviewers!I added this statement to all language packs, but I rely on Google Translate for most of my translations, so please double check it for errors~

@Siedlerchr
Copy link
Member

You only need to add it to the English file. All otherttranslations are managed by crowdin.

@MrGhabi
Copy link
Contributor Author

MrGhabi commented Apr 17, 2021

You only need to add it to the English file. All otherttranslations are managed by crowdin.

Emmm, so I need to subtract all the files except the English file, right?

@MrGhabi
Copy link
Contributor Author

MrGhabi commented Apr 18, 2021

I have changed that. Hope everything goes well...

@MrGhabi
Copy link
Contributor Author

MrGhabi commented Apr 19, 2021

I added the javaDoc for UTFChecker and fix a little problem in my Junit test.

String NonUTF8 = "";
try {
NonUTF8 = new String("你好,这条语句使用GBK字符集".getBytes(), "GBK");
} catch (Exception e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simply remove that catch here and add throws Exception to the test method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK!

add 2 Junit Test for UTF8Checker.UTF8EncodingChecker in UTF8CheckerTest

add 2 Junit Test for IntegrityCheck in IntegrityCheckTest
@MrGhabi MrGhabi requested a review from tobiasdiez April 19, 2021 12:26
@MrGhabi
Copy link
Contributor Author

MrGhabi commented Apr 19, 2021

Hi reviewers! I have added 2 Junit Test for UTF8Checker and 2 for IntegrityCheck. I'm not quite sure if these test cases are redundant and standardized, so please give me some advice if problems exist!

@MrGhabi MrGhabi requested a review from Siedlerchr April 21, 2021 23:00
@Siedlerchr Siedlerchr merged commit 434250d into JabRef:main Apr 23, 2021
@Siedlerchr
Copy link
Member

Thanks a lot for your contribution!

Siedlerchr added a commit that referenced this pull request Apr 24, 2021
…om.tngtech.archunit-archunit-junit5-api-0.18.0

* upstream/main:
  Fix exception when searching (#7659)
  Fixes Jabref#7660 (#7663)
  Fix for issue 5850: Journal abbreviations in UTF-8 not recognized (#7639)
  Fix SSLHandshake Exception by using bypass (#7657)
  Fix for issue 7633: Unable to download arXiv pdfs if Title contains curly brackets (#7652)
  Fix#7195 partly Opacity of disabled icon-buttons
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Journal abbreviations in UTF-8 not recognized
3 participants