LaTeX to Unicode formatter should not replace `\%` with `%` #8490

JasonGross · 2022-02-08T18:12:10Z

JabRef version

5.5 (latest release)

Operating system

Windows

Details on version and operating system

Windows 10

Checked with the latest development build

I made a backup of my libraries before testing the latest development version.
I have tested the latest development version and the problem persists

Steps to reproduce the behaviour

Create the following .bib file:

@Misc{test,
  abstract = {10\%},
}

Open the file in JabRef, select the entry, click Quality -> Cleanup Entries, ensure that "Enable Field Formatters" is checked and "LaTeX to Unicode" is enabled for Abstract, and then click "Ok".
Notice that the abstract is now abstract = {10%}

Since % is a comment character in LaTeX, this change is incorrect. More generally, escaped special characters in LaTeX should not be unescaped when converting to Unicode (or at least the general "convert to Unicode" formatter should not behave this way).
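
To illustrate why a bare % matters once the field is typeset, here is a minimal Python sketch of how TeX reads a single input line under its default settings: everything from an unescaped % to the end of the line is ignored. The function name `strip_tex_comment` is made up for this illustration and is not part of JabRef.

```python
import re

# An unescaped % is one preceded by an even number of backslashes (including zero).
# Group 1 keeps any complete "\\" pairs that sit directly in front of the %.
_COMMENT = re.compile(r"(?<!\\)((?:\\\\)*)%.*")


def strip_tex_comment(line: str) -> str:
    """Return the part of a single line that TeX would still read."""
    return _COMMENT.sub(r"\1", line)


print(strip_tex_comment(r"10\% of cases"))     # 10\% of cases
print(strip_tex_comment("10% of cases"))       # 10
print(strip_tex_comment("abstract = {10%}"))   # abstract = {10
```

In the last call the closing brace is swallowed by the comment, which is the kind of breakage the unescaped output can cause.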

Siedlerchr · 2022-02-08T19:15:13Z

Well, technically this is the correct behavior: it converts everything to Unicode. What you probably want is to use the LaTeXCleanup formatter as well, which respects those things.
https://docs.jabref.org/finding-sorting-and-cleaning-entries/saveactions#latex-cleanup

JasonGross · 2022-02-08T19:21:55Z

Technically correct but practically wrong. LaTeXCleanup will fix the issue with % but will not escape $, right? I want a transformer that will transform LaTeX to Unicode-aware LaTeX, preferring Unicode characters when available. What use is "LaTeX to Unicode" if it generates text that breaks the .bib file?

ThiloteE · 2022-04-06T18:25:08Z

Thinking about this a little, the way forward might indeed be to transform the $ sign to \$ when using the LaTeX cleanup action.
Background: $ opens math mode in LaTeX. One does not want to accidentally open math mode just because a $ sign was in the library.

The code that would need to be changed is here: https://github.com/JabRef/jabref/blob/main/src/main/java/org/jabref/logic/formatter/bibtexfields/LatexCleanupFormatter.java

ThiloteE · 2022-04-07T07:43:30Z

LaTeXtoUnicode:

The LaTeXtoUnicode converter assumes the bibliographic data is formatted in LaTeX syntax. In LaTeX syntax, writing the percent sign requires a backslash in front (\%). A bare % would denote the start of a LaTeX comment.

Hence, removing the backslash is correct.

From this we can see that:

If the bibliographic data is already in Unicode format, using the LaTeXtoUnicode converter is not advised.
If the bibliographic data is in mixed LaTeX and Unicode format, using the LaTeXtoUnicode converter is not advised. Manual cleanup (or another cleanup action) might be necessary.

LaTeX Cleanup:

Furthermore, the "LaTeX cleanup" action turns out to be a bit of a Frankenstein: https://docs.jabref.org/finding-sorting-and-cleaning-entries/saveactions#latex-cleanup. The name is misleading. It does not only clean up redundant LaTeX code or special characters. It actually mostly does the opposite: it makes bibliographies ready to be used with LaTeX (although it does so partly by removing characters).

I would recommend a name change, e.g. to something like "Make LaTeX ready", or at least a link to the documentation page for this command within JabRef.

Examples:

On the one hand, the command makes the bibliographic data ready to be used with LaTeX, e.g. "Escape percent character (e.g. 50% ⇒ 50\%)".
On the other hand, this command removes LaTeX code, e.g. by removing redundant $ signs. "Redundant" here means, for example, two $$ in a row. This makes the data ready to be used with programs that require Unicode, if there was a lot of math-mode material in the bibliographic data before. Of course, the command would also make bibliographies formatted in Unicode ready to be used with LaTeX.

Interestingly, I just did a test. Running the LaTeXcleanup command does NOT remove a singular $ sign!

Jason, maybe you still had your LaTeXtoUnicode cleanup running before or after you used the LaTeX Cleanup action? Maybe you actually had math-mode stuff in the library?

Fun fact: Searching on google scholar for % or $ yields 0 results.
Maybe not a good idea to put these special characters into the title of an entry :D

After having written all this, I am still of the opinion that the way forward would be to change the LaTeX Cleanup action OR the UnicodeToLaTeX action to add a backslash to the $ sign. Maybe do both.

UnicodeToLaTeX:

Doing a similar test for UnicodeToLaTeX, for whatever reason, neither the $ nor the % sign gets backslashed ... am I missing something?

ThiloteE · 2022-04-13T12:37:04Z

> Interestingly, I just did a test. Running the LaTeXcleanup command does NOT remove a singular $ sign!

Since this is the case, I assume you should have no problems anymore.

Closing this.

If you still have problems, feel free to open again and report them.

ThiloteE · 2022-04-22T12:11:35Z

> Technically correct but practically wrong. LaTeXCleanup will fix the issue with % but will not escape $, right?

@JasonGross The next release of JabRef will contain a separate cleanup action that escapes $ signs. Please do not use it lightly. JabRef cannot know whether a dollar sign was meant to A) start math mode or B) simply render a $ sign. Using this cleanup action will require a double check by users, unless you want to challenge your "luck".
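
A minimal Python sketch of what such an "escape $" action might do, assuming it simply puts a backslash in front of every $ that is not already escaped. The function name `escape_dollars` is hypothetical and this is not the JabRef implementation. The last call shows the ambiguity described above: math-mode delimiters get escaped too.

```python
import re

# A $ counts as unescaped when an even number of backslashes (including zero) precedes it.
_UNESCAPED_DOLLAR = re.compile(r"(?<!\\)((?:\\\\)*)\$")


def escape_dollars(text: str) -> str:
    """Put a backslash in front of every unescaped $ sign."""
    # In the replacement template, "\\" is a literal backslash and "$" is a literal $.
    return _UNESCAPED_DOLLAR.sub(r"\1\\$", text)


print(escape_dollars("costs $5"))        # costs \$5
print(escape_dollars(r"costs \$5"))      # costs \$5 (already escaped, left alone)
print(escape_dollars(r"0.35 $\mu$m"))    # 0.35 \$\mu\$m (math mode is destroyed)
```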

JasonGross · 2022-04-22T14:54:07Z

I am still interested in a cleanup action that converts LaTeX to mixed LaTeX and Unicode, i.e., the result should be valid LaTeX code and display the same, but anything that can be replaced by a non-special Unicode character is replaced. As I've said above, the current behavior of LaTeX to Unicode is useless because it generates invalid bibliographic files. Should I open a new issue for this, or reopen this one?

ThiloteE · 2022-04-22T15:01:25Z

I would propose trying to fix this via an integrity check. #8712
You could convert from LaTeX to Unicode and then run the integrity check. Would that work for you?

ThiloteE · 2022-04-22T15:05:44Z

The problem is, somebody would need to do the mapping from LaTeX to "Unicode aware LaTeX" or, since we are at it, from Unicode to "LaTeX aware Unicode", which is a lot of work. The Comprehensive LaTeX Symbol List lists

> 18150 symbols and the corresponding LaTeX commands that produce them. Some of these symbols are guaranteed to be available in every LaTeX 2ε system; others require fonts and packages that may not accompany a given distribution and that therefore need to be installed.

A conversion (e.g. via cleanup actions) is non-trivial.

JasonGross · 2022-04-22T16:34:20Z

> I would propose trying to fix this via an integrity check. #8712 You could convert from LaTeX to Unicode and then run the integrity check. Would that work for you?

That would be great! However, even better would be a version of LaTeX to Unicode that lets the user explicitly deactivate any subset of the mapping that they'd like. The default exclusion list would just include special/control characters like % and \.
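
A rough Python sketch of that idea, under loud assumptions: the mapping table below is a tiny hand-picked sample, and the names `LATEX_TO_UNICODE`, `DEFAULT_EXCLUDED` and `convert_with_exclusions` are invented for illustration. None of this reflects how JabRef or latex2unicode actually work.

```python
# Hypothetical sample mapping; a real converter would cover far more commands.
LATEX_TO_UNICODE = {
    r"\textcopyright{}": "©",
    r'\"{o}': "ö",
    r"\%": "%",
    r"\$": "$",
    r"\&": "&",
    r"\#": "#",
    r"\_": "_",
}

# Assumed default: characters that are special in LaTeX stay escaped.
DEFAULT_EXCLUDED = frozenset({"%", "$", "&", "#", "_", "{", "}", "\\"})


def convert_with_exclusions(text, excluded=DEFAULT_EXCLUDED):
    """Replace LaTeX commands by Unicode, skipping targets listed in `excluded`."""
    active = {cmd: uni for cmd, uni in LATEX_TO_UNICODE.items() if uni not in excluded}
    # Longest commands first, so a short command never clips a longer one.
    for cmd in sorted(active, key=len, reverse=True):
        text = text.replace(cmd, active[cmd])
    return text


print(convert_with_exclusions(r"10\% \textcopyright{} 2022"))
# 10\% © 2022
print(convert_with_exclusions(r"10\% \textcopyright{} 2022", excluded=frozenset()))
# 10% © 2022
```

With the default exclusions the output stays valid LaTeX; passing an empty set mimics the current "convert everything" behavior. Plain string replacement ignores context (for example `\\%`), so a real implementation would need a proper LaTeX tokenizer.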

> The Comprehensive LaTeX Symbol List lists
>
> > 18150 symbols and the corresponding LaTeX commands that produce them. Some of these symbols are guaranteed to be available in every LaTeX 2ε system; others require fonts and packages that may not accompany a given distribution and that therefore need to be installed.
>
> A conversion (e.g. via cleanup actions) is non-trivial.

This is a red herring. If the symbol is not available in the font, it doesn't matter whether it comes from a Unicode character or not. If the symbol is available via command and you're using a Unicode-aware TeX engine, I expect it to be available by Unicode character too.

ThiloteE · 2022-04-22T17:15:03Z

> That would be great! However, even better would be a version of LaTeX to Unicode that lets the user explicitly deactivate any subset of the mapping that they'd like. The default exclusion list would just include special/control characters

Ok, I finally may understand why this might be useful. If you want to bring really old databases up to date and transform to unicode, but not for the sake of using the database to export to LibreOffice/OpenOffice or Microsoft Office (These would be fine with "pure" Unicode I think), but still would want to continue to export them to a (La)TeX engine (that can read unicode), you would only need to do ONE conversion (with some excluded terms), instead of TWO conversions + integrity check. You would not need to check all entries via "integrity check", because the terms you excluded were already working fine with LaTeX before the conversion.

Suggestion to change the name of this issue to: "Add cleanup action for "LaTeX to LaTeX aware Unicode"".

Have you tried what Christoph suggested, by the way? Using "LaTeX cleanup"? Have you run into problems with it?

It does:

Escape percent character (e.g. 50% ⇒ 50\%)
Remove redundant $, {, and } (but not if the } is part of a command argument)
Move numbers, +, -, /, and brackets into equations
Move numbers followed by a space left of $ inside the equation (e.g. 0.35 $\mu$m)
Replace all @@ with $
Replace multiple spaces with a single space
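
Two of the rules listed above (escaping percent characters and collapsing multiple spaces) can be sketched in Python as follows. This is only an illustration under the assumption that already-escaped percent signs are left alone; the function names are invented and the code is not taken from JabRef's LatexCleanupFormatter.

```python
import re


def escape_percent(text: str) -> str:
    """Escape every % that is not already preceded by an escaping backslash."""
    return re.sub(r"(?<!\\)((?:\\\\)*)%", r"\1\\%", text)


def collapse_spaces(text: str) -> str:
    """Replace runs of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


def make_latex_ready(text: str) -> str:
    """Apply both rules in sequence."""
    return collapse_spaces(escape_percent(text))


print(make_latex_ready("50%  of   cases"))   # 50\% of cases
print(make_latex_ready(r"50\% done"))        # 50\% done (unchanged)
```

Because escaped input is left alone, running the sketch twice gives the same result as running it once.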

JasonGross · 2022-04-22T22:15:15Z

> Ok, I finally may understand why this might be useful. If you want to bring really old databases up to date and transform to unicode, but not for the sake of using the database to export to LibreOffice/OpenOffice or Microsoft Office (These would be fine with "pure" Unicode I think), but still would want to continue to export them to a (La)TeX engine

Yes! (Though more often it's "I copy-pasted from Google Scholar or some internet-provided .bib file" than "I had a really old database".)

> Suggestion to change the name of this issue to: "Add cleanup action for "LaTeX to LaTeX aware Unicode"".

Name changed, please reopen issue.

> Have you tried what Christoph suggest by the way? Using "Latex cleanup"? Have you run into problems with it?

I have not tried it yet. I'll try it the next time I'm manipulating databases.

ThiloteE · 2022-04-23T12:12:47Z

@JasonGross let's rename this issue back to "LaTeX to Unicode formatter should not replace \% with %", then close it and open a new issue with a well-explained first post that is understandable for people who have no clue about these issues, listing:

- the problem
- the desired solution
- an example of what a future workflow would look like
- the "special symbols" that would need to be excluded

E.g., you can copy paste following text:

Problem:

There is no cleanup action that allows converting (old) bibliographic data that is (still) formatted in LaTeX with Non-Unicode characters to Unicode aware LaTeX formatting (newer LaTeX engines (e.g. LaTeX2e) can now read most Unicode characters).
Current workarounds include converting from LaTeX to Unicode and then back to LaTeX, while manually checking whether any characters were wrongly converted. This is inefficient and takes a long time.
- This workaround is bothersome because there are symbols that do not get converted when using the LaTeXToUnicode and UnicodeToLaTeX cleanup actions (e.g. Rework superscript: latex-to-unicode and unicode-to-latex roundtrip not working #3644), and there are other special symbols that SHOULD not get converted automatically, because multiple conversions are possible and users would need to choose manually (e.g. Add integrity check for LaTeX special characters #8712).

Desired Solution:

Create cleanup action for "LaTeX to Unicode aware LaTeX".

Example workflow:

Have the following entry (BEFORE using the cleanup action):

@Article{Testkey,
  author   = {Testauthor},
  title    = {Bibliographic data that can be read by LaTeX engines},
  a = {Here is a backslashed percentage sign \% and it should be excluded from conversion},
  b = {Here is a \textcopyright{} and it should be converted to Unicode}, 
}

(Comment: \textcopyright{} can be converted to © by the inputenc package. When using the LaTeX to Unicode aware LaTeX cleanup action, the result of the conversion should also be ©)

Use cleanup action "LaTeX to Unicode aware LaTeX"

AFTER using the cleanup action, the following result should emerge:

@Article{Testkey,
  author   = {Testauthor},
  title    = {Bibliographic data that can be read by LaTeX engines},
  a = {Here is a backslashed percentage sign \% and it should be excluded from conversion},
  b = {Here is a © and it should be converted to Unicode}, 
}

"Special Symbols" that would need to be excluded from conversion:

The list should be similar to the symbols mentioned in Add integrity check for LaTeX special characters #8712.
At the very least Table 1 on Page 15, which lists escapable special characters in LaTeX.
Maybe also Table 2 on Page 15 and Table 3 on Page 16.
There might be a lot more, but I am not knowledgeable enough to list them here. If you know of any, just post them in this thread.

Additional Information

When working on this, The Comprehensive LaTeX Symbol List will be of help, especially the chapters about "Unicode" (Page 272) and "Special Characters" (Pages 15-16).
JabRef currently uses https://github.com/tomtung/latex2unicode; Maybe it can be adapted internally in JabRef (e.g. some pre-processing). Another solution would be to fork it or ask tomtung about creating a LaTeX2UnicodeAwareLaTeX converter.

This was referenced Apr 12, 2022

CSL and entry table renders $ only, when it is backslashed. E.g. \$ #8650

Open

Rename cleanup actions: "Prepare for LaTeX" and "Prepare for BibTeX" #8672

Closed

Add cleanup action: "Make LaTeX ready: Escape $" #8673

Closed

ThiloteE added the unicode and status: waiting-for-feedback labels Apr 13, 2022

ThiloteE closed this as completed Apr 13, 2022

JasonGross changed the title ~~LaTeX to Unicode formatter should not replace \% with %~~ Add cleanup action for "LaTeX to LaTeX aware Unicode" Apr 22, 2022

ThiloteE reopened this Apr 23, 2022

JasonGross mentioned this issue Apr 23, 2022

Add cleanup action for "LaTeX to LaTeX aware Unicode" #8715

Open

JasonGross changed the title ~~Add cleanup action for "LaTeX to LaTeX aware Unicode"~~ LaTeX to Unicode formatter should not replace \% with % Apr 23, 2022

JasonGross closed this as completed Apr 23, 2022

ThiloteE removed the status: waiting-for-feedback label May 12, 2022
