BOM now missing at beginning of bibliography file -- causes JabRef to not recognize existing library #9496

Open
2 tasks done
andrewhw opened this issue Dec 24, 2022 · 12 comments
Labels
export / save unicode unicode related issues
@andrewhw

JabRef version

5.8 (latest release)

Operating system

macOS

Details on version and operating system

Darwin daphne.local 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000 arm64

Checked with the latest development build

  • I made a backup of my libraries before testing the latest development version.
  • I have tested the latest development version and the problem persists

Steps to reproduce the behaviour

  1. Begin with an existing bibliography file
  2. Update to the newest JabRef
  3. Save the database (no edits required)
  4. You will likely get a warning that "the library has been modified by another program". This is not actually true. Dismiss changes.
  5. Examine the bibliography file using a text editor. The BOM (the bytes at the beginning of the file forming the Byte Order Mark) is now missing.
  6. Reopening the file using JabRef will now cause a "no content in table" error after opening.

Note that if you reestablish the BOM using an external editor and then open the file again using JabRef, all is well until the bibliography is saved again.

Note that this may be apparent on my machine because I have an ARM processor, so this error may not be reproducible on an older Mac with an Intel processor.

The underlying problem is simply that the BOM is now missing during write. Putting the BOM back in (as it was in older JabRef versions) will fix the problem.
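
Until the writer is fixed, the BOM can be put back outside JabRef. The following is a minimal sketch in Python, assuming the file really is big-endian UTF-16; the function name `ensure_utf16be_bom` and the one-line payload are made up for illustration and are not part of JabRef.

```python
import codecs

def ensure_utf16be_bom(data: bytes) -> bytes:
    """Return data with a big-endian UTF-16 BOM (FE FF) at the front.

    Assumes the payload is already UTF-16BE; nothing is re-encoded.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data  # BOM already present, leave the bytes untouched
    return codecs.BOM_UTF16_BE + data

# Placeholder header line encoded as big-endian UTF-16 without a BOM
without_bom = "% Encoding: UTF-16BE\n".encode("utf-16-be")
repaired = ensure_utf16be_bom(without_bom)
```

To apply this to a real file, read it in binary mode, pass the bytes through the function, and write the result back in binary mode. As noted above, the BOM will be lost again the next time JabRef saves the file.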

@Siedlerchr
Member

Siedlerchr commented Dec 24, 2022

Thanks for reporting. Does the bib file include a header line of the form "% Encoding: <encoding>"? In general, JabRef tries to detect the encoding when reading and writes plain UTF-8 if no header line is present.
Additionally, could you please provide the bib file for us for debugging? You can also send it privately to web@jabref.org

@andrewhw
Author

Yes, the bib file does include a % Encoding line. It now reads "% Encoding: UTF-16BE"; however, at the last update I had (when the BOM was working), this line read "% Encoding: UTF-16" (that is, without the "BE").

I have attached two bib files. The first, "tiny-1-withBOM.bib", works fine and can be successfully read by JabRef. If, however, you read this file and save it, it will then match "tiny-2-noBOM.bib". The only difference between the files is that the two BOM bytes (0xFE 0xFF, i.e. U+FEFF) that precede the '%' beginning the header proper are missing in the second one.
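
The difference can be reproduced in memory. The sketch below uses Python's codecs as a stand-in (JabRef itself is written in Java, and the one-line payload is a placeholder, not the content of the attached files). It shows that an explicit `utf-16-be` encoder writes no BOM, while the generic `utf-16` encoder does.

```python
import codecs

text = "% Encoding: UTF-16BE\n"  # placeholder payload, not the real attachment

# Explicit byte order: Python writes no BOM
no_bom = text.encode("utf-16-be")

# What the working file looks like: BOM FE FF followed by the same bytes
with_bom = codecs.BOM_UTF16_BE + no_bom

# The generic codec does write a BOM (in the platform's native byte order)
generic = text.encode("utf-16")
```

Java's standard charsets document the same split (`UTF-16` writes a BOM when encoding, `UTF-16BE` does not), which would be consistent with the header changing from "UTF-16" to "UTF-16BE". Whether that is the actual cause inside JabRef is not confirmed here.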

Thanks for looking into this.
tiny-bib-example.zip

@Siedlerchr Siedlerchr added export / save unicode unicode related issues labels Dec 24, 2022
@andrewhw
Author

I just looked up what "UTF-16BE" is meant to mean, and the "BE" part is trying to flag that the file is "big endian".

The problem with this, in this context, is that the endianness of the file is required in order to correctly parse the 16-bit characters of the file, so without the BOM the "first" character (the "%" sign) will get loaded as character 0x2500 ("Box drawings light horizontal") rather than as 0x0025 ("percent").
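
This misreading is easy to check. A small sketch in Python (used here only as a neutral tool, not as a statement about JabRef's internals) decodes the same two bytes with each byte order:

```python
import unicodedata

raw = b"\x00\x25"  # the two bytes that encode "%" in big-endian UTF-16

as_big_endian = raw.decode("utf-16-be")     # high byte first: U+0025
as_little_endian = raw.decode("utf-16-le")  # low byte first: U+2500

# Official Unicode names of the two resulting characters
names = [unicodedata.name(as_big_endian), unicodedata.name(as_little_endian)]
```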

The "% Encoding" strategy works well for UTF-8 as it is an orderless encoding (one byte processed at a time), but UTF-16 requires the byte order to be known before any characters are parsed at all.
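
One way around this chicken-and-egg problem is to look at the raw bytes before decoding anything. The sketch below is a hypothetical heuristic, not how JabRef does it; `sniff_encoding` is a made-up name, and it relies on the assumption that the file starts with the "%" of the header line.

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Guess a Python codec name from the first bytes of a .bib file.

    Hypothetical heuristic for illustration only.
    """
    # 1. A BOM settles the question before any character is parsed.
    #    The generic "utf-16" codec reads the BOM and strips it.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # 2. No BOM: the header starts with "%", so the position of its
    #    zero byte gives the byte order away.
    if data.startswith(b"\x00%"):
        return "utf-16-be"
    if data.startswith(b"%\x00"):
        return "utf-16-le"
    # 3. Otherwise assume a byte-oriented encoding such as UTF-8.
    return "utf-8"

header = "% Encoding: UTF-16BE\n"  # placeholder header line
be_bytes = header.encode("utf-16-be")
le_bytes = header.encode("utf-16-le")
bom_bytes = codecs.BOM_UTF16_BE + be_bytes
```

Step 2 only works because the first character is known in advance; a file that starts with anything other than "%" would fall through to the UTF-8 guess.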

Not sure if this helps, or if this is already obvious to everyone. Sorry if I am over-explaining.

@Siedlerchr
Member

Thanks for the additional information. For reference, we have been down that rabbit hole in #8947 and unicode-org/icu#2127

@andrewhw
Author

andrewhw commented Dec 25, 2022 via email

@andrewhw
Author

andrewhw commented Dec 26, 2022

In light of the examples in linked threads, maybe it is helpful to show the direct byte encodings in the files. I have shown them here with hexdump(1) and od(1) "octal dump" -- both of these are available command line tools under Linux and MacOSX.

[Image: byte-encodings-UTF-16-big-endian]

Note the two bytes forming the BOM (0xFE 0xFF) shown prior to the two-byte sequence (<nul>-'%') forming the first readable Unicode character of the file.
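
For anyone without hexdump(1) or od(1) at hand, the same leading bytes can be inspected with a few lines of Python. This is a minimal sketch; `hex_prefix` is a made-up helper name and the sample bytes are built in memory rather than read from the attached files.

```python
import codecs

def hex_prefix(data: bytes, count: int = 4) -> str:
    """Return the first `count` bytes as space-separated hex, like a tiny hexdump."""
    return data[:count].hex(" ")

# BOM followed by "%" in big-endian UTF-16
sample = codecs.BOM_UTF16_BE + "%".encode("utf-16-be")
prefix = hex_prefix(sample)  # "fe ff 00 25"
```

For a file on disk, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper.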

@koppor koppor self-assigned this Jan 2, 2023
@github-project-automation github-project-automation bot moved this to Normal priority in Prioritization Jan 2, 2023
@koppor koppor moved this from Normal priority to High priority in Prioritization Jan 2, 2023
@koppor
Member

koppor commented Jun 6, 2023

Could you try the latest development version?

I think this is a duplicate of #9926, which was fixed recently.

@andrewhw
Author

andrewhw commented Jun 6, 2023 via email

@koppor
Member

koppor commented Jun 7, 2023

Regarding the Mac OS X bug, there is a workaround: #9553

@koppor
Member

koppor commented Jul 31, 2023

Note to us: There was a fix on May 20 (#9927), but the comment from June said that some files can be broken. We need

@andrewhw
Author

If you are referring to the files I uploaded in the tiny-bib-example.zip file on Dec 24, 2022 above, then the test cases are simply this:

  • open JabRef with no database
  • select one of the two .bib files in the zip file above

Expected behaviour (as far as I understand it):

  • tiny-1-withBOM.bib -- successfully opens the file
  • tiny-2-noBOM.bib -- an error (the description mentions parsing) when loaded on a big-endian machine (I am on a MacOSX M1 chip machine). I do not know what will happen when loaded on a little-endian machine (e.g. intel chip). The fact that the files themselves have a big-endian ordering and it does not work when loaded on a big-endian machine causes me to suspect that things will go no better on Intel.

Is that what you need?

@andrewhw
Author

If it is helpful, here are the "tiny" files in both big and little endian formats, with and without BOM markers.

tiny-bib-example-endian-and-BOM-combinations.zip
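
The four combinations can also be generated and checked in memory. The sketch below is an illustration under stated assumptions: the payload is a placeholder rather than the content of the attached files, and `decode_default_big_endian` is a made-up reader that honours a BOM and otherwise assumes big-endian. The outcome is determined by the bytes and that assumed default, so it is the same on any machine.

```python
import codecs

text = "% Encoding: UTF-16\n"  # placeholder payload

# Both byte orders, each with and without a BOM
variants = {
    "be-with-bom": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "be-no-bom": text.encode("utf-16-be"),
    "le-with-bom": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "le-no-bom": text.encode("utf-16-le"),
}

def decode_default_big_endian(data: bytes) -> str:
    """Honour a BOM if present; otherwise assume big-endian (an assumed default)."""
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return data.decode("utf-16")  # the codec reads and strips the BOM
    return data.decode("utf-16-be")

# True where the reader recovers the original text
results = {name: decode_default_big_endian(raw) == text
           for name, raw in variants.items()}
```

With this reader, only the little-endian file without a BOM comes back garbled; with a little-endian default it would be the big-endian file without a BOM instead. Either way, the BOM variants are the only ones that are unambiguous.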

@koppor koppor added this to the 6.0-beta milestone Jul 9, 2024