Fix for issue 5850: Journal abbreviations in UTF-8 not recognized #7639
public List<IntegrityMessage> check(BibEntry entry) {
    List<IntegrityMessage> results = new ArrayList<>();
    for (Map.Entry<Field, String> field : entry.getFieldMap().entrySet()) {
        Charset charset = Charset.forName(System.getProperty("file.encoding"));
I would extract this out of the loop, as it doesn't depend on the loop.
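A minimal sketch of the suggestion above, under stated assumptions: the class name `HoistedCharsetSketch`, the plain `Map<String, String>` standing in for the entry's field map, and the `canEncode` test inside the loop are illustrative only and are not taken from JabRef. The point is just that the charset-dependent setup runs once, before the loop.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the checker under review; not JabRef's actual class.
final class HoistedCharsetSketch {

    // Returns the names of the fields whose values cannot be encoded in the given charset.
    static List<String> check(Map<String, String> fields, Charset charset) {
        // Loop-invariant work is done once, before iterating over the fields.
        CharsetEncoder encoder = charset.newEncoder();
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Placeholder per-field test (an assumption; the PR's real test may differ).
            if (!encoder.canEncode(field.getValue())) {
                results.add(field.getKey());
            }
        }
        return results;
    }
}
```

A caller that keeps the charset choice from the snippet under review could pass `Charset.forName(System.getProperty("file.encoding"))` as the second argument.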
What's the reason to use System.getProperty("file.encoding") and not, say, the encoding specified in the Library properties?
Different users have different default charsets, depending on the operating system or the computer's default settings, and System.getProperty("file.encoding") is used to get that default charset. If the charset is not UTF-8, we should give a warning about that.
As for why I don't use the Library properties or Database properties: the user may not know the default charset on their computer, or may have set a different charset for JabRef, but we should still give a warning, since a non-UTF-8 charset may lead to garbled text.
Ok thanks for the explanation.
But doesn't this give a lot of false positives? Say I have my library encoded in Charset A, and my system's default is Charset B. If all characters in my database are properly encoded with Charset A, then I shouldn't get any warnings even though some of the characters may not be encodable in Charset B, right?
But I also have to admit that I do not yet understand the use-case from the user perspective, so maybe I'm missing something obvious.
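The false-positive concern above can be made concrete with a small sketch. It assumes ISO-8859-1 as "Charset A" (the library encoding) and US-ASCII as "Charset B" (the system default); the class and method names are hypothetical and chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the "Charset A vs. Charset B" scenario.
final class FalsePositiveSketch {

    // True if every character of the text can be encoded in the given charset.
    static boolean encodableIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // True if a check against the system default would warn about text
    // that is perfectly valid in the library's own encoding.
    static boolean isFalsePositive(String text, Charset libraryCharset, Charset systemCharset) {
        return encodableIn(text, libraryCharset) && !encodableIn(text, systemCharset);
    }
}
```

With the journal name "Zeitschrift f\u00fcr Physik" (the escape stands for the letter u with umlaut), `isFalsePositive(name, StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII)` holds: the value is fine in the library encoding, yet a check against the system default would flag it.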
I have thought about that. In my tests, if a user's default charset is A, then their clipboard and their keyboard input are all encoded in A, so whatever they enter may end up garbled. There may be an example in #7629.
So the scenario you describe may be rare. That's the reason I chose not to get the charset via Charset charset = bibDatabaseContext.getMetaData().getEncoding().orElse(preferences.getDefaultEncoding());
By the way, I have a question about the design. If BibTeX only allows ASCII by design, why do we allow the user to save the file in different charsets?
Hi, reviewer! @tobiasdiez After thinking about this for a long time and doing some tests, I think it may be better to give 2 kinds of warning:
- In biblatex mode, if the library charset is not UTF-8, give the warning: Non-UTF-8 field found.
- In both biblatex and BibTeX mode, if the system environment is not UTF-8, give the warning: Non-UTF-8 env, may cause garbled.
I'm eagerly waiting for your suggestions and reply!
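The two-warning proposal above could be sketched roughly as follows. This is only an illustration of the proposed rule, not the code that was merged: the `Mode` enum, the class name, and the method signature are assumptions, and only the two message strings come from the comment itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two proposed warnings; names are illustrative.
final class EncodingWarningSketch {

    enum Mode { BIBTEX, BIBLATEX }

    static List<String> warnings(Mode mode, Charset libraryCharset, Charset systemCharset) {
        List<String> result = new ArrayList<>();
        // Warning 1: biblatex only, based on the library's own encoding.
        if (mode == Mode.BIBLATEX && !StandardCharsets.UTF_8.equals(libraryCharset)) {
            result.add("Non-UTF-8 field found.");
        }
        // Warning 2: both modes, based on the system default charset.
        if (!StandardCharsets.UTF_8.equals(systemCharset)) {
            result.add("Non-UTF-8 env, may cause garbled.");
        }
        return result;
    }
}
```

Keeping the two conditions separate means a BibTeX library on a UTF-8 system produces no encoding warning at all, while a biblatex library can trigger either warning or both.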
Thanks! Looks already good. Please have a look at the checkstyle issues.
I might not be up to date, but I always thought UTF-8 characters are only allowed in biblatex and that BibTeX only handles ASCII characters. Did this change?
Yeah, some journals and papers use non-ASCII characters in their names, etc. (just like the BibTeX entry I added before), and that may be difficult to handle in JabRef. The details are shown in the issue. So I think it may be better to treat them equally.
I don't really have experience with, say, Chinese names (as authors or journals) in BibTeX. But the only evidence I could find always suggested biblatex, since BibTeX doesn't support UTF-8; see e.g. https://tex.stackexchange.com/questions/100092/how-to-include-a-chinese-paper-in-reference-via-bibtex. So does it make more sense to keep the ASCII check for BibTeX, and add the new UTF-8 check for biblatex?
I agree with @tobiasdiez: we need the UTF-8 check for biblatex and the ASCII checker for BibTeX then.
Good idea! I will refactor my code to meet this need. (After reading more about BibTeX and biblatex, I agree with you.) And maybe the UTF-8 check for biblatex is not a bug fix but an enhancement? (laugh) I will focus on it!
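As a rough sketch of the split agreed on above (ASCII check for BibTeX, UTF-8 check for biblatex), the two predicates might look like this. The class and method names are hypothetical, and validating raw bytes with a strict `CharsetDecoder` is one possible technique, not necessarily what the PR's checker does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the PR's ASCIICharacterChecker or UTF-8 checker.
final class ModeChecksSketch {

    // BibTeX side: every character must be in the 7-bit ASCII range.
    static boolean isAscii(String value) {
        return value.chars().allMatch(c -> c <= 0x7F);
    }

    // biblatex side: the raw bytes must form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // A strict decoder throws instead of silently substituting U+FFFD.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The UTF-8 predicate works on bytes rather than on a `String` because an already-decoded Java string no longer carries its original encoding.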
Co-authored-by: Christoph <siedlerkiller@gmail.com>
Hi reviewers! I have added the UTF-8 check for biblatex and restored the ASCII check for BibTeX!
So far looks good. You only need to add the new localization string to the l10n files; see here for more details: https://devdocs.jabref.org/getting-into-the-code/code-howtos#using-localization-correctly
Hi reviewers! I added this string to all language packs, but I relied on Google Translate for most of my translations, so please double-check it for errors.
You only need to add it to the English file. All other translations are managed by Crowdin.
Hmm, so I need to revert the changes to all files except the English file, right?
I have changed that. Hope everything goes well...
I added the Javadoc for UTFChecker and fixed a small problem in my JUnit test.
String NonUTF8 = "";
try {
    NonUTF8 = new String("你好,这条语句使用GBK字符集".getBytes(), "GBK");
} catch (Exception e) {
You can simply remove that catch here and add throws Exception to the test method
OK!
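A dependency-free sketch of the reviewer's suggestion (drop the catch, declare `throws Exception`): in a JUnit test the method would additionally carry `@Test`, which is left out here so the snippet compiles on its own. The class and method names are hypothetical, the string is shortened and written with Unicode escapes, and `getBytes(StandardCharsets.UTF_8)` replaces the bare `getBytes()` from the snippet under review so the result does not depend on the platform default; that substitution is an assumption about intent.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the PR's actual test class.
final class GbkDecodeSketch {

    // No try/catch: the checked UnsupportedEncodingException from
    // new String(byte[], String) simply propagates via "throws Exception".
    static String decodeUtf8BytesAsGbk() throws Exception {
        // "\u4f60\u597d" is the greeting from the reviewed snippet, shortened.
        byte[] utf8Bytes = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8);
        // Deliberately decoding with the wrong charset yields garbled text.
        return new String(utf8Bytes, "GBK");
    }
}
```

If the runtime does not ship the GBK charset, the call fails with `UnsupportedEncodingException`, which a test runner would report as a test error rather than having it swallowed by an empty catch block.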
Add 2 JUnit tests for UTF8Checker.UTF8EncodingChecker in UTF8CheckerTest
Add 2 JUnit tests for IntegrityCheck in IntegrityCheckTest
Hi reviewers! I have added 2 JUnit tests for UTF8Checker and 2 for IntegrityCheck. I'm not sure whether these test cases are redundant or whether they follow the project's conventions, so please give me some advice if there are problems!
Thanks a lot for your contribution!
…om.tngtech.archunit-archunit-junit5-api-0.18.0
* upstream/main:
  Fix exception when searching (#7659)
  Fixes Jabref#7660 (#7663)
  Fix for issue 5850: Journal abbreviations in UTF-8 not recognized (#7639)
  Fix SSLHandshake Exception by using bypass (#7657)
  Fix for issue 7633: Unable to download arXiv pdfs if Title contains curly brackets (#7652)
  Fix#7195 partly Opacity of disabled icon-buttons
Fixes #5850
CHANGELOG.md described in a way that is understandable for the average user (if applicable)
Reproduce the issue:
The main reason for this bug is that the Check integrity tool only accepts the ASCII charset. This works well for English citations, but JabRef has users across the world, and they use different charsets.
The screenshot:
before:
after
The way to fix:
ASCIICharacterChecker.java: any non-ASCII encoded characters trigger a warning.
3. Then I removed the steps in IntegrityCheck.
Still, I want to give a warning about non-UTF-8 encoded characters.
To test this, we first need to set the default charset (for example, GBK) for the whole environment.
Then we get the following warning when running the integrity check: