Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snuggletex Integration #646

Closed
wants to merge 28 commits into from

Conversation

Zylence
Copy link

@Zylence Zylence commented May 17, 2023

Snuggletex

Snuggletex is a library from the University of Edinburgh for converting latex to XML, but can be used for latex parsing as well. It is extendible, easy to use and powerful, all whilst containing almost no external dependencies.
In the future, it could become our main latex parser for integrity checks.

What it does do

  • ^
  • _
  • $
  • commands: \command[opt]{arg}{...}
  • verbs: \verb(*)...
  • environments: \begin{env}[opt]{arg}{...}...\end{env}
  • user-defined commands

Takeaways

Some commands may be missing, for example I found \text{} to be absent,
to check which commands are supported by default, refer here: CorePackageDefinitions.java
Thankfully, enough of the package is exposed to be able to inject new commands, like so for example:
engine.getPackages().get(0).addComplexCommandOneArg("text", false, ALL_MODES,LR, StyleDeclarationInterpretation.NORMALSIZE, null, TextFlowContext.ALLOW_INLINE);

I have not checked if this is the correct way to represent the text command, but now it parses it correctly.

What it does not do

  • &
  • #
  • %

What we could do

  1. Use the tokens provided by snuggletex to implement our own parser on top

    Or

  2. Keep our integrity checks for & and # and implement % like we used to

What I would like to do

I am really fascinated by this library, it's clean, well documented and build thoughfully and extendible. I'd really like to do more with it. If you do not mind, I'd like to port our integrity checks to snuggletex, rather than writing them as we used to.

Mandatory checks

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@Zylence Zylence mentioned this pull request May 17, 2023
2 tasks
@koppor
Copy link
Member

koppor commented Jun 12, 2023

In JabRef's localization, we keep the English keys equal to the English full text. In this way, we can have readabile code. Details at https://devdocs.jabref.org/code-howtos/localization.html.

Regarding the SnuggleTeX errors, it would be OK to show "LaTeX cannot be parsed" and have the detail message in English only. These are all very technical terms and I thnk, no JabRef tranlsator should be bothered to translate them.

@koppor
Copy link
Member

koppor commented Jun 12, 2023

To workaround the method-too-large exception, I am trying to include our custom JDK build.

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks pretty readable.

Please add test cases - and then this should be good to go for a review at the jabref repo

CHANGELOG.md Outdated Show resolved Hide resolved
@koppor
Copy link
Member

koppor commented Jun 12, 2023

Thank you for having worked on this. Sorry for the late reply. Hope, you will continue on this!

Some commands may be missing, for example I found \text{} to be absent,

\text appears in math-mode only. Not outside math mode. Maybe, it is available inside math mode (e..g, $\text{test}$).

I have not checked if this is the correct way to represent the text command, but now it parses it correctly.

Which concrete case did you have?

What it does not do

Please unzip what you mean.

Ah, I understand: It does not check for ampersands, wrong bibtex string concatenation, and percent signs.

For BibTeX string concatenation, we already discussed. Difficult thing as this is JabRef custom logic, too.

What we could do

1. Use the tokens provided by snuggletex to implement our own parser on top
   **Or**

Could be a fun excercise? With the possible outcome that the code is harder to read as our existing code?

What I would like to do

I am really fascinated by this library, it's clean, well documented and build thoughfully and extendible. I'd really like to do more with it. If you do not mind, I'd like to port our integrity checks to snuggletex, rather than writing them as we used to.

Suggestion: Finish this PR with the current functionality. - Create a new branch (based on this branch) to rewrite the other checks. In this way, you can work in parallel. If the updated code turns out to be good, we can continue working forward to include it. If it turns out it is not that maintainable, the issue JabRef#8712 is still fixed.

@Zylence
Copy link
Author

Zylence commented Jun 13, 2023

In JabRef's localization, we keep the English keys equal to the English full text. In this way, we can have readabile code. Details at https://devdocs.jabref.org/code-howtos/localization.html.

Regarding the SnuggleTeX errors, it would be OK to show "LaTeX cannot be parsed" and have the detail message in English only. These are all very technical terms and I thnk, no JabRef tranlsator should be bothered to translate them.

Okay, just to make sure I understand you correctly, a message would for example look like this:

LaTeX cannot be parsed:  Nothing following \

And we would only translate this part:  LaTeX cannot be parsed.

So having  Nothing following \ inside the properties file is no necessary?

Consequently we would only need one entry inside the properties file:
LaTeX\ cannot\ be\ parsed:=LaTeX cannot be parsed: %0

and "feed it" the internal error messages.

@Zylence
Copy link
Author

Zylence commented Jun 13, 2023

To workaround the method-too-large exception, I am trying to include our custom JDK build.

Sounds great, I am looking forward to trying it out :^)

@Zylence
Copy link
Author

Zylence commented Jun 13, 2023

Thank you for having worked on this. Sorry for the late reply. Hope, you will continue on this!

No worries, I was very busy anyway. I would be happy to continue :^) .

  \text appears in math-mode only. Not outside math mode. Maybe, it is available inside math mode (e..g, ).

It really is not handled. I looked through the library with text search and tried it out. \text is an undefined command according to snuggletex.

  Ah, I understand: It does not check for ampersands, wrong bibtex string concatenation, and percent signs.

Exactly, could have made this clearer, sorry for the confusion.

 Could be a fun excercise? With the possible outcome that the code is harder to read as our existing code?

Just guessing, I'd say it won't make much of a difference. Perhaps we should just be happy with what we already have. I will do a case study regardless, so we can see the possible outcome before deciding. 

Suggestion: Finish this PR with the current functionality. - Create a new branch (based on this branch) to rewrite the other checks. In this way, you can work in parallel. If the updated code turns out to be good, we can continue working forward to include it. If it turns out it is not that maintainable, the issue https://github.com/JabRef/jabref/issues/8712 is still fixed.

Sounds good to me, I'll do it. :)

@koppor
Copy link
Member

koppor commented Jun 14, 2023

Okay, just to make sure I understand you correctly, a message would for example look like this:

LaTeX cannot be parsed:  Nothing following \

And we would only translate this part:  LaTeX cannot be parsed.

So having  Nothing following \ inside the properties file is no necessary?

Yes. We should move forward in that way. Later, we should collect (somehow) the returned errors and translate them. We have telemetry infrastructure for that, but currently not working. Needs some updates...

Consequently we would only need one entry inside the properties file:
LaTeX\ cannot\ be\ parsed:=LaTeX cannot be parsed: %0

The %0 needs to be on the left side, too. Key and value really need to be the same 😅

and "feed it" the internal error messages.

Yes!

@koppor
Copy link
Member

koppor commented Jun 14, 2023

Thank you for having worked on this. Sorry for the late reply. Hope, you will continue on this!

No worries, I was very busy anyway. I would be happy to continue :^) .

Looking forward 🤩

  \text appears in math-mode only. Not outside math mode. Maybe, it is available inside math mode (e..g, ).
It really is not handled. I looked through the library with text search and tried it out. \text is an undefined command according to snuggletex...

Maybe worth a PR for SnuggleTex? 😅

 Could be a fun excercise? With the possible outcome that the code is harder to read as our existing code?
Just guessing, I'd say it won't make much of a difference. Perhaps we should just be happy with what we already have. I will do a case study regardless, so we can see the possible outcome before deciding. 

👍

Zylence and others added 3 commits June 18, 2023 18:06
Localization now uses a string from jabrefs properties to wrap the internal error messages.
Local fields have been made static members in order to improve performance with large bib files. We no longer instanciate an Engine and Session per bib entry.
Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>
@Zylence
Copy link
Author

Zylence commented Jun 18, 2023

Later, we should collect (somehow) the returned errors and translate them. We have telemetry infrastructure for that, but currently not working. Needs some updates...

Do you mean to translate them automaticly, via a service?

The %0 needs to be on the left side, too. Key and value really need to be the same 😅

Slip of the pen. 😅

Yes!

Errors are now prefixed with "LaTeX Parsing Error:" I found that quite appealing, but we can change that back to "LaTeX cannot be parsed" if you like.

@Zylence
Copy link
Author

Zylence commented Jun 18, 2023

Maybe worth a PR for SnuggleTex? 😅

I can try, but do you think it will be merged? I saw your PR in the repo is still dangling around unmerged. 😅

@Zylence
Copy link
Author

Zylence commented Jun 18, 2023

Please add test cases - and then this should be good to go for a review at the jabref repo

Tests will be ready over the next week. :)

Code looks pretty readable.

Thanks, just did some performance oriented refactoring regardless, but that should not have too big of an effect on readability. Engine and Session are now static members, prior to that we instantiated them per bib entry. (Keeping the references aroung only adds ~1 KB of memory overhead)

@koppor
Copy link
Member

koppor commented Jun 19, 2023

Later, we should collect (somehow) the returned errors and translate them. We have telemetry infrastructure for that, but currently not working. Needs some updates...
Do you mean to translate them automaticly, via a service?

No ^^. Here my line of thought:

  • The number error messages is greater than ten.
  • We do not know if we need to translate them all.
  • How to find out which messages should be translated?
  • Idea: Use telemtry! Collect the send error messages centrally.
  • After some time, we know the set of tuples (error message, number of occurrence)
  • These can then be translated.

The translation will be done as usual using JabRef_en.properties.

Errors are now prefixed with "LaTeX Parsing Error:" I found that quite appealing,

Reas good!

@koppor
Copy link
Member

koppor commented Jun 19, 2023

Maybe worth a PR for SnuggleTex? 😅
I can try, but do you think it will be merged? I saw your PR in the repo is still dangling around unmerged. 😅

@davemckain is the original developer on snuggletex. My bet is that he is happy if several people contribute to his repository - https://github.com/davemckain/snuggletex.

Note to self: The other "maintained" fork seems to be https://github.com/rototor/snuggletex. I would, however, like to stick to the "original" one ^^.

Zylence and others added 4 commits June 25, 2023 01:04
…therefore for its encapsulated SnuggleTex Parser). Further, slightly adjusted the LatexIntegrityChecker to expose a static errorMessageFormatHelper method to increase maintainability.
@Zylence
Copy link
Author

Zylence commented Jul 25, 2023

Code looks pretty readable.

Please add test cases - and then this should be good to go for a review at the jabref repo

Test cases have been added in this commit. Some that I expected to work did not (due to snuggletex) these are commented out for now.

@Zylence Zylence closed this Jul 25, 2023
@Zylence
Copy link
Author

Zylence commented Jul 25, 2023

Code looks pretty readable.
Please add test cases - and then this should be good to go for a review at the jabref repo

Test cases have been added in this commit. Some that I expected to work did not (due to snuggletex) these are commented out for now.

Well that was unfortunate. I accidentally closed the PR in this comment, sorry.

@Zylence Zylence reopened this Jul 25, 2023
@Zylence
Copy link
Author

Zylence commented Jul 25, 2023

Later, we should collect (somehow) the returned errors and translate them. We have telemetry infrastructure for that, but currently not working. Needs some updates...
Do you mean to translate them automaticly, via a service?

No ^^. Here my line of thought:

  • The number error messages is greater than ten.
  • We do not know if we need to translate them all.
  • How to find out which messages should be translated?
  • Idea: Use telemtry! Collect the send error messages centrally.
  • After some time, we know the set of tuples (error message, number of occurrence)
  • These can then be translated.

The translation will be done as usual using JabRef_en.properties.

Errors are now prefixed with "LaTeX Parsing Error:" I found that quite appealing,

Reas good!

Oh, okay, now I get it - I did not know you had telemetry infrastructure for that kind of thing. Thanks for letting me know. I agree, narrowing down the selection of messages that need translation in advance is the better choice.

@Zylence Zylence marked this pull request as ready for review July 25, 2023 19:23
@koppor
Copy link
Member

koppor commented Sep 7, 2023

This should now be a PR to JabRef's main repo. I resolved the merge conflicts at 4fb512e.

Should go as one commit in the upstream repo - if possible.

@Zylence
Copy link
Author

Zylence commented Sep 8, 2023

This should now be a PR to JabRef's main repo. I resolved the merge conflicts at 4fb512e.

Should go as one commit in the upstream repo - if possible.

Thank you, I am happy to hear that. I am excited to see how it will do in the wild! :d

@koppor
Copy link
Member

koppor commented Sep 12, 2023

Submitted JabRef#10376 - therefore closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants