New quality check and cleanup for & #585

koppor · 2022-07-04T18:09:36Z

Quality check for presence of unescaped & -- current user documentation: https://docs.jabref.org/finding-sorting-and-cleaning-entries/checkintegrity
Cleanup action for escaping & (when unescaped) -- current user documentation: https://docs.jabref.org/finding-sorting-and-cleaning-entries/cleanupentries

The text was updated successfully, but these errors were encountered:

KonstantinEger · 2023-03-29T11:42:48Z

Hi, seems like this is still free to take. We would like to take this as a first issue and post a PR soon :)

koppor · 2023-03-29T12:08:49Z

Thank you for choosing JabRef! Looking forward to your PR! - Would it be possible to tell us which course recommended JabRef? 🤩 - Here or via email (which you find on my profile)

As a general advice for newcomers: check out Contributing for a start. Also, guidelines for setting up a local workspace is worth having a look at.

Feel free to ask here at GitHub, if you have any issue related questions. If you have questions about how to setup your workspace use JabRef's Gitter chat. Try to open a (draft) pull-request early on, so that people can see you are working on the issue and so that they can see the direction the pull request is heading towards. This way, you will likely receive valuable feedback.

Siedlerchr · 2023-04-15T22:59:51Z

Quality check is now implemented: JabRef#9758

Zylence · 2023-04-22T16:16:11Z

Regarding the Cleanup action, would you like the UI component as checkbox or entry inside 'field formatters'? (We currently use a checkbox for testing purposes)

koppor · 2023-04-24T21:05:23Z

This is a part of JabRef#8712. When working on this, maybe completely fix JabRef#8712?

Zylence · 2023-04-26T15:26:09Z

Hello, looking into this, I just realized that the cleanup action for unescaped ampersands is already implemented in jabref. So I am definitely up for a new task. :)
I have to add though that I will not be able to work as much on it the next weeks - so if this is a time critical task, someone else should take it.

As far as I understand we need to add missing integrity checks for %, _, $ to warn the user if they are not escaped. We could easily do that extracting the ampersand check as a template.

We additionally need to decide whether we alter the integrity check for # to be usable with the template. As of right now it functions differently from the & check "firing" only if there is an uneven amount of unescaped #. I guess when implementing it, it was assumed that the user rather wants the formatting functionality than the sign.

Further, we need cleanup actions for % and # (whereas the # cleanup action should only be implemented if the integrity check follows the same pattern as the ampersand integrity check does: assuming the user wants plain signs over formatter functionalities).

And as far as the integrity check for table 2 symbols goes. Do we check the latex commands or input unicode chars? I mean: Do we warn the user if he uses a symbol that's correctly displayed in jabref but not in latex (when inputting unicode chars), such as '○' or vice versa? Or do we check the latex commands for spelling mistakes? (I think I did not fully understand this part)

Siedlerchr · 2023-04-26T15:48:50Z

Be careful with the "#" sign. The # is used for BibTex-Strings! https://docs.jabref.org/advanced/strings

koppor · 2023-04-26T20:28:27Z

As far as I understand we need to add missing integrity checks for %, _, $ to warn the user if they are not escaped. We could easily do that extracting the ampersand check as a template.

While you are talking about, it is more complicated:

%: Yes, there needs to be an escape check!

$ can be an indicator of math formulas. Can you try the following: If $ appear in pairs, and there is a space before the first $ and a space after the second $ (or beginning/end of the string), it is ok?

Example: Title with a math formula: $x=x+1$.

Now, the thing with _ gets even more difficult: It is perfectly valid in math formulas as follows: $a_2+b_2=c_2$ .

Maybe, this detailed parsing is too much.

Idea: A LaTeX parser should be used to check if the title parses correctly. SnuggleTeX (Example: https://github.com/davemckain/snuggletex/blob/development_1_2_x/snuggletex-core/src/main/java/uk/ac/ed/ph/snuggletex/samples/MinimalExample.java) seems to be good. A fork of it can be retrieved from maven central: https://github.com/rototor/snuggletex. Note: Playing around with that will take some time... In the long run, this could replace our latex2unicode converter. -- In the short run, it should only be checked if the content can be parsed using LaTeX.

We additionally need to decide whether we alter the integrity check for # to be usable with the template. As of right now it functions differently from the & check "firing" only if there is an uneven amount of unescaped #. I guess when implementing it, it was assumed that the user rather wants the formatting functionality than the sign.

This is a hard one. As Christoph stated: # is typically used to concatenate strings: https://docs.jabref.org/advanced/strings. The current entry editor (and all the logic code) has no good support for that. The "magic" is happening at reading/writing a field. Not sure whether you really want to dive into that. -- Maybe it's as simple as checking the entry editor content similar to the $ above: # must appear in pairs.

Further, we need cleanup actions for % and # (whereas the # cleanup action should only be implemented if the integrity check follows the same pattern as the ampersand integrity check does: assuming the user wants plain signs over formatter functionalities).

Cleanup for % is good.

No straight-forward cleanup for # possible. One could search for defined strings. If a matched string is found, then corrected. Example: User inputs #kopp and #kubovy, which is corrected to #kopp# and #kubovy#, because kopp and kubovy are defined strings. Maybe, its also possible to just correct without checking the strings. Needs some more thougt/testing.

And as far as the integrity check for table 2 symbols goes. Do we check the latex commands or input unicode chars? I mean: Do we warn the user if he uses a symbol that's correctly displayed in jabref but not in latex (when inputting unicode chars), such as '○' or vice versa? Or do we check the latex commands for spelling mistakes? (I think I did not fully understand this part)

I would leave that part as is. No checking whatever.

I could neverthless try to understand it: Take an arbitrary command of the table. If it renders in JabRef correctly: No worries. If it does not render correctly: Add it to a check. -- However, this heavily depends on the latex2unicode converter. Instead of checking for things it cannot handle, the converter should be made better (or replaced by SnuggleTeX as outlined above). -- In other words: The user should not adapt their bibtex library because JabRef cannot handle it.

Zylence · 2023-04-27T16:03:47Z

Thank you for the detailed response. I will look into it and try out snuggletex this weekend.

$ can be an indicator of math formulas. Can you try the following: If $ appear in pairs, and there is a space before the first $ and a space after the second $ (or beginning/end of the string), it is ok?

In case we decide against snuggletex and need to check it manually, we can surely do that. But why check for spaces at the beginning and end? As far as I know they will not effect the functionality and the user may just leave them out. In any case, we can use the BibStringChecker (#) for orientation, it already implements the desired behaviour (just without the checks for spaces).

Zylence · 2023-04-27T16:13:17Z

Be careful with the "#" sign. The # is used for BibTex-Strings! https://docs.jabref.org/advanced/strings

I am sorry, initially I missunderstood the task: I thought the task was simply to warn the users whenever they use latex special characters, but now I realized that this would be very restricting and that the real task is to check whether it's valid latex to guarantee a pain free workflow for the user when integrating the bib files into his own documents.

Siedlerchr · 2023-04-27T17:25:46Z

It's a mix of both. But I would start with the low hanging fruits:

a) Check and b) cleanup for % and & ( I think % is already there -> LatexCleanup)
Check for balanced $ (e.g. always pairs)
Check if underscore is only inside a group of $$ (can be probaly done easily with regex)
If an underscore is outside $$ offer a cleanup either escaping it with \ or replacing it with \textunderscore

For the moment I would not consider # and the more complex cases. That should be fine.

Zylence · 2023-05-01T15:07:40Z

It's a mix of both. But I would start with the low hanging fruits:

a) Check and b) cleanup for % and & ( I think % is already there -> LatexCleanup)

Check for balanced $ (e.g. always pairs)

Check if underscore is only inside a group of $$ (can be probaly done easily with regex)

If an underscore is outside $$ offer a cleanup either escaping it with \ or replacing it with \textunderscore

For the moment I would not consider # and the more complex cases. That should be fine.

Okay, but I would like to use a proper parser like snuggletex "from the get go", since the code I write without using it would need to be replaced eventually anyways. Further I just now had some time to try it out and really like it! :-)

I have already tried some integration testing with snuggletex in jabref - it does not seem to immediately break any Tests:

Before:

After:

Using snuggletex it could be as simple as this (basicly takes care of all integrity checks):

SnuggleEngine engine = new SnuggleEngine();
SnuggleSession session = engine.createSession();
SnuggleInput input = new SnuggleInput(field.getValue()); //field is a BibEntryField
session.parseInput(input);
if (!session.getErrors().isEmpty()) {
    ErrorCode code = session.getErrors().get(0).getErrorCode();
     // filter only for tokenization errors
    if (code.getErrorGroup().equals(CoreErrorGroup.TTE)) { 
        //todo use custom translation
        String message = CorePackageDefinitions.getPackage().getErrorMessageBundle().getString(code.getName()); 
        //todo "handle" parsing error error
        session.getErrors().clear();
    }
}

This is jsut a mock and has to be improved but just for fun I edited one of the IntegrityChecks to use Snuggletex, seems promising:

We'd need to copy and slightly adjust the libraries properties file into our own for custom translation, but that's no big deal. And Since CoreErrorCode is exposed, we could also easily filter for specific errors if we did not want to catch them all.

If you want to try how snuggletex integrates in jabref yourself, you can add these lines to the build.gradle:

implementation "de.rototor.snuggletex:snuggletex:1.3.0"
implementation ("de.rototor.snuggletex:snuggletex-jeuclid:1.3.0") {
    exclude group: "org.apache.xmlgraphics"
}

and this line to module-info.java:

requires snuggletex.core;

If you @koppor , @Siedlerchr and the other maintainers allow it, I'd like to integrate it. That way, the integrity checks are easily implemented, and we do not rely on handwritten comparisons or checking edge cases, but instead use a proper parser. The only thing left to do would be cleanup actions.

PS: I have not checked if all functions of it work, but tokenization and parsing seem fine.

Siedlerchr · 2023-05-01T15:58:42Z

Thanks for the investigation. We can try, but I think we run into problems with jlink + jpackage and the method too large error. You can try this locally, run a gradlew jpackage and see if it works. https://bugs.openjdk.org/browse/JDK-8240567

If this doesn't work we need your "manual" fix.

Create a branch with the snuggletex implementation (if this fails due to the jdk issue we can still keep the code and put it here at koppor's repo)
Create a branch without the external dependency reliance

koppor · 2023-05-02T07:40:47Z

The error messages are really great! Looking forward to see it integrated.

Regarding the jlink issue, we need to spend energy again to get openjdk/jdk#10704 done.

Zylence · 2023-05-13T16:16:14Z

If this doesn't work we need your "manual" fix.

It did indeed fail, unfortunately. :(

I see you are working on this fix since quite a while, just out of curiosity when do you hope to have this resolved?

Regarding the snuggletex implementation, unfortunately it can not be used to handle all our use cases (for example it can not check for # out of the box) - but we have access to the parsed tokens which means we can implement our own functionality on top of that. (This also raises the question, whether we should consider rewriting the cleanup actions based on that.)

Create a branch with the snuggletex implementation (if this fails due to the jdk issue we can still keep the code and put it here at koppor's repo)

I am currently working on getting it cleaned up (as of now it's barely a hack).

Create a branch without the external dependency reliance

Just started doing checks for _, %, $ and cleanup for $, I am aiming to get this ready over the upcoming holidays (~2 weeks)

koppor · 2023-05-14T07:44:52Z

No worries regarding method too large. More motivation to get it fixed.

Regarding #, maybe it's a SnuggleTex limitation? Is it able to parse field = a # b # {text}?

Zylence · 2023-05-15T21:04:05Z

Is it able to parse field = a # b # {text}

No matter the constellation, if there is at least one # in the text, Snuggletex does treat it as error TTEG04, with this message: "Argument placeholder tokens (e.g. # 1) may only appear in command and environment definitions".

It can parse it, but it treats # differently than we do. We can easily work around this since we have access to the tokens. For starters, since the # integrity check is already implemented, we may as well just exclude TTEG04.

koppor · 2023-05-15T21:20:51Z

Excluding TTEG04 is OK for me.

Can you file an issue at https://github.com/davemckain/snuggletex/issues?

Zylence · 2023-05-17T13:10:53Z

Can you file an issue at https://github.com/davemckain/snuggletex/issues?

I've looked into the library, there is no bibtex reference anywhere. It does not seem like it was ever ment to support including / validating bibtex files. It's build just for plain latex and hence only allows # in "inside command/environment definitions" LaTeXTokeniser.java.

Their website states: "SnuggleTeX’s parser pretends that TeX never happened and may behave slightly differently to what experienced LaTeX users might expect." And since this is a bibtex and no latex feature, I do not think this change should be made to the source, nor that the maintainer would do it.

Should I still file the issue?

Zylence · 2023-05-17T15:23:38Z

Just created a PR with what is ready so far.

It is by no means perfect and does break some unit tests (because of localization). But its functional. The error messages are ported from the original snuggletex repo, we may need to rewrite them and can probably exclude some.

koppor · 2023-05-17T15:39:21Z

Ah, thank you, I was too quick in writing my thought. THank you for digging deeper!

You must not pass the field value directly to SnuggleTeX.

The handling of # is solved not very well in JabRef.

See test cases for examples

https://github.com/JabRef/jabref/blob/2ee5c9cf3fbb095edf71d6d9044400188ebe0c0b/src/test/java/org/jabref/logic/importer/fileformat/BibtexParserTest.java#L640
org.jabref.logic.importer.fileformat.BibtexParserTest#parseRecognizesStringAndEntry

Seeing the code of org.jabref.logic.bibtex.FieldWriter#formatAndResolveStrings, you should just ignore checking if the field value contains #.

Future work can provide a more proper data model for a BibTeX field

ThiloteE · 2023-07-25T08:34:44Z

"method too large" problem is fixed in the current development version.

Zylence · 2023-07-25T17:14:25Z

"method too large" problem is fixed in the current development version.

Thats good news! The requested test cases in the koppor repo for the snuggletex integration where also added. So I guess we are good to go?

koppor · 2023-09-26T12:34:55Z

I tried the escape ampersands:

Works!

Siedlerchr · 2023-09-26T12:37:50Z

But we need to disable it for file field

koppor added this to Candidates for University Projects Jul 4, 2022

koppor moved this to Free to take in Candidates for University Projects Jul 4, 2022

koppor mentioned this issue Oct 14, 2022

Improve Equality check for Journal abbreviations JabRef/jabref#8948

Closed

koppor assigned KonstantinEger Mar 29, 2023

koppor moved this from Free to take to Reserved in Candidates for University Projects Mar 29, 2023

Siedlerchr mentioned this issue Apr 15, 2023

Quality check escaped ampersand JabRef/jabref#9758

Merged

koppor assigned Zylence Apr 26, 2023

koppor mentioned this issue Jun 12, 2023

Snuggletex Integration #646

Closed

6 tasks

koppor closed this as completed Sep 26, 2023

github-project-automation bot moved this from Reserved to Done in Candidates for University Projects Sep 26, 2023

koppor mentioned this issue Sep 26, 2023

Exclude check for ampersand (&) at verbatim fields JabRef/jabref#10419

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New quality check and cleanup for & #585

New quality check and cleanup for & #585

koppor commented Jul 4, 2022 •

edited

Loading

KonstantinEger commented Mar 29, 2023

koppor commented Mar 29, 2023

Siedlerchr commented Apr 15, 2023

Zylence commented Apr 22, 2023

koppor commented Apr 24, 2023

Zylence commented Apr 26, 2023

Siedlerchr commented Apr 26, 2023

koppor commented Apr 26, 2023

Zylence commented Apr 27, 2023 •

edited

Loading

Zylence commented Apr 27, 2023

Siedlerchr commented Apr 27, 2023

Zylence commented May 1, 2023

Siedlerchr commented May 1, 2023 •

edited

Loading

koppor commented May 2, 2023

Zylence commented May 13, 2023

koppor commented May 14, 2023

Zylence commented May 15, 2023 •

edited

Loading

koppor commented May 15, 2023

Zylence commented May 17, 2023 •

edited

Loading

Zylence commented May 17, 2023

koppor commented May 17, 2023

ThiloteE commented Jul 25, 2023

Zylence commented Jul 25, 2023

koppor commented Sep 26, 2023

Siedlerchr commented Sep 26, 2023

New quality check and cleanup for & #585

New quality check and cleanup for & #585

Comments

koppor commented Jul 4, 2022 • edited Loading

KonstantinEger commented Mar 29, 2023

koppor commented Mar 29, 2023

Siedlerchr commented Apr 15, 2023

Zylence commented Apr 22, 2023

koppor commented Apr 24, 2023

Zylence commented Apr 26, 2023

Siedlerchr commented Apr 26, 2023

koppor commented Apr 26, 2023

Zylence commented Apr 27, 2023 • edited Loading

Zylence commented Apr 27, 2023

Siedlerchr commented Apr 27, 2023

Zylence commented May 1, 2023

Siedlerchr commented May 1, 2023 • edited Loading

koppor commented May 2, 2023

Zylence commented May 13, 2023

koppor commented May 14, 2023

Zylence commented May 15, 2023 • edited Loading

koppor commented May 15, 2023

Zylence commented May 17, 2023 • edited Loading

Zylence commented May 17, 2023

koppor commented May 17, 2023

ThiloteE commented Jul 25, 2023

Zylence commented Jul 25, 2023

koppor commented Sep 26, 2023

Siedlerchr commented Sep 26, 2023

koppor commented Jul 4, 2022 •

edited

Loading

Zylence commented Apr 27, 2023 •

edited

Loading

Siedlerchr commented May 1, 2023 •

edited

Loading

Zylence commented May 15, 2023 •

edited

Loading

Zylence commented May 17, 2023 •

edited

Loading