[WIP] CSL citations: Fix sync numbering #11688

Closed
wants to merge 13 commits into from


subhramit
Collaborator

@subhramit subhramit commented Aug 30, 2024

This PR is a work in progress, and is open to anyone who wishes to work on it. Consider this a very rough playground with no focus on code quality (till we get the functionality right). Effort needed for testing will be a bit high, as you need to test using LibreOffice every time you implement something new.

Aim:

To try to solve the issue subhramit#22 (actually an extended version of it - as the citation numbering should update on changing the order of citations in the document as well, not just deletion, although both of them will have the same solution - to update the citation text as well as reference mark whenever a new entry is cited).

Current progress (starting point) on this PR:

I have been able to update the citation numbers (after insertion of the citations), but as a consequence of my changes, the following issues exist that I am unable to solve:

  1. The indexing begins from 0, mostly because of public int getCitationNumber(String citationKey). If I change the default to 1, the behavior begins to be weird. One needs to investigate.
  2. The citation text update happens in the reverse order, and I am unable to fix it:
    image
  3. The new citation marks inserted always have a CID of 0. This needs to be fixed. Many moving parts to keep track of.
  4. There is a mismatch between the reference marks and the cited entries.
  5. Important: One may suggest that before working on this, we should make changes in the way reference marks are currently implemented for CSL (another good candidate for a university project):
    a) In the current implementation, when citing a group of entries, the citation text is not wrapped in a combined reference mark; instead, the reference mark for each entry is inserted one by one after the citation text. Reason: I could not find a way to segment the grouped citation so that each number gets its respective reference mark, or to generate a combined reference mark for all entries in the group and parse it (maybe one can look at how JStyles does it, but it is complicated, and also makes some assumptions based on the specified opening/closing braces, which may not apply to CSL). This was also because CSL styles vary a lot in formatting and in the separators between the entries of a group. Result: This makes the marks difficult to anchor, trace, or update.
    b) As a result of (a), even if we are able to make a working model for single citations, we cannot apply it to citation groups when it comes to updating their text and their reference marks. The obstacles are the positioning and the variety in what brace or separator surrounds the number. Some styles have no braces, only formatting such as <sup>, which cannot be parsed from the document: we can get the text but not the formatting details, so they are as good as naked numbers. Regex parsing won't apply, as the rest of the document can contain both braces and numbers.

However, it is up to whoever takes this problem whether they want to make a working model for single citations first. For that, we just need to solve points 1, 2, 3 and 4.
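For point 1, a minimal sketch of 1-based numbering by order of first appearance may help frame the problem. Only the method name getCitationNumber comes from this PR; the class CitationNumbering, the renumber method, and the -1 return value for uncited keys are assumptions made for illustration, not the existing JabRef code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: numbers citation keys from 1, in order of first appearance. */
class CitationNumbering {
    private final Map<String, Integer> numbers = new LinkedHashMap<>();

    /** Rebuilds the numbering from the citation groups as they occur in the document, top to bottom. */
    void renumber(List<List<String>> citationGroupsInDocumentOrder) {
        numbers.clear();
        for (List<String> group : citationGroupsInDocumentOrder) {
            for (String key : group) {
                // size() + 1 makes the first key number 1 rather than 0;
                // a key that is cited again keeps its first number
                numbers.putIfAbsent(key, numbers.size() + 1);
            }
        }
    }

    /** Returns the number of a cited key, or -1 if the key is not cited (an assumption of this sketch). */
    int getCitationNumber(String citationKey) {
        return numbers.getOrDefault(citationKey, -1);
    }
}
```

Rebuilding the whole map on every change covers both deletion and reordering, at the cost of touching every citation in the document.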

I would especially invite @Siedlerchr and other maintainers to work on this whenever they find time.

What we have at our disposal: Look at the code for how JStyle numbers are synced, and also how https://github.com/zotero/zotero-libreoffice-integration does it.

Note:

The issues need to be migrated to JabRef:main as I alone would not be able to solve them. Two of them are potentially good Candidates for University Projects or medium GSoC projects (large if combined with Zotero format migration + unification of JStyle/CSL style backend). I would ask @ThiloteE or @koppor (the authors of the issues) to migrate them as open issues with a more detailed framing. (maybe melting pot? I would like to help). The link to the issue above needs to be updated once migration is done.

Mandatory checks

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@subhramit subhramit changed the title [WIP] CSL: Fix sync numbering [WIP] CSL citations: Fix sync numbering Aug 30, 2024

@koppor
Member

koppor commented Aug 31, 2024

@subhramit Issue migration is separate from this PR. Do you intend to keep your fork? If yes, we can keep the issues as is and do the refinement when time allows. In case you are going to delete it, we need to prioritize the issue refinement (which I would do only if absolutely necessary).

@subhramit
Collaborator Author

subhramit commented Aug 31, 2024

@subhramit Issue migration is separate from this PR. Do you intend to keep your fork? If yes, we can keep the issues as is and do the refinement when time allows. In case you are going to delete it, we need to prioritize the issue refinement (which I would do only if absolutely necessary).

Of course, I will be keeping my fork, I wish to continue contributing. There is no hurry. We can take our time.
(Although I may branch off upstream more often).

Point behind issue migration is just to keep them free-to-take and give more detailed context to contributors (by refining).

With regards to this particular issue, I expect people using numeric styles to anyway raise this soon enough when they face it.

@subhramit
Collaborator Author

subhramit commented Sep 4, 2024

@antalk2 Karoly, if you ever get some time to go through this, I'd love to know your suggestions.

The PR description is a bit more detailed, but to summarise, I have questions like: "how to generate a combined citation mark", "what could be a possible way to deal with the variety in CSL" and finally, "how can one make a unified system for anchoring the reference marks, changing them as well as changing the citation text when its surrounding characters or separators are not known". Maybe even some suggestions for simplifying the case of single-entry citations: as you can see, I am confused about why starting with "1" bugs out, why the text is updated in the reverse order, and why the reference marks each get updated with CID = 0.

We can ignore other safety mechanisms like "check if cursor is in restricted area", "what if mark is in footer", etc. that you implemented in JStyles, for now.

If the changes in this PR are not neat (well, they are not neat) or weird, you can check out the unchanged files for the current implementation (it does not have the update citations functionality when order changes/ middle element is deleted).

A relevant update which may help: .csl files can be parsed to get the prefix, suffix and delimiter. [as per last discussion with @Siedlerchr]
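As a rough illustration of that update, the sketch below reads the prefix, suffix and delimiter attributes from the layout element inside citation, using only the JDK's DOM parser. The class name CslLayoutReader and the record CitationLayout are made up for this example. It also assumes the affixes sit directly on layout; styles may place affixes or formatting on child elements or macros instead, so this would not cover every style.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the in-text citation affixes of a CSL style. */
class CslLayoutReader {
    /** Attributes of citation/layout; an absent attribute is reported as an empty string. */
    record CitationLayout(String prefix, String suffix, String delimiter) {}

    static CitationLayout read(String cslXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(cslXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Style is not parseable as XML", e);
        }
        // The style root contains one <citation> element describing in-text citations
        NodeList citations = doc.getElementsByTagName("citation");
        if (citations.getLength() == 0) {
            throw new IllegalArgumentException("No <citation> element found");
        }
        NodeList layouts = ((Element) citations.item(0)).getElementsByTagName("layout");
        if (layouts.getLength() == 0) {
            throw new IllegalArgumentException("No <layout> element inside <citation>");
        }
        Element layout = (Element) layouts.item(0);
        // getAttribute returns "" when the attribute is missing
        return new CitationLayout(
                layout.getAttribute("prefix"),
                layout.getAttribute("suffix"),
                layout.getAttribute("delimiter"));
    }
}
```

For a numeric style whose layout is declared with prefix "[", suffix "]" and delimiter ", ", this returns exactly those three strings, which could then be used when writing the citation text instead of guessing the braces from the document.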

@antalk2
Contributor

antalk2 commented Sep 5, 2024

The PR description is a bit more detailed, but to summarise, I have
questions like:

"how to generate a combined citation mark",

"what could be a possible way to deal with the variety in CSL"

"how can one make a unified system for anchoring the reference
marks, changing them as well as changing the citation text when its
surrounding characters or separators are not known"

  • Embarrassingly, I probably do not understand the questions. My notes
    below may be misplaced.

"how to generate a combined citation mark",

I thought CSL does that. We just have to insert the result.

"what could be a possible way to deal with the variety in CSL"

Is this about parsing the CSL output?

My impression is that you are trying to parse the CSL output in order
to get its parts. That is going to be hard or impossible.

Updating citation groups (calculating the text to be inserted) is already hard
(that is why we need CSL). Inverting the process is
probably worse. (Maybe one could send "colored" info through CSL and
look at the output where the different colors end up? How do we color
years in a way that does not interfere with CSL operation? What colors do we expect in
(Smith 2000a,b)?)

I never finished reading about all the features of CSL, but already met some
requirements one needs to be aware of.

  • The order of citations in a group is not necessarily the order the
    user provided. For example the input (Smith 2000a, Jones 2001, Smith 2000b)
    may have to generate (Smith 2000ab, Jones 2001) (ordered by year)
    or (Jones 2001, Smith 2000ab) (ordered by author)

  • If (Smith 2000) is already cited somewhere, and the user inserts a (Smith 2000) citing another article they need to be modified to (Smith 2000a) and (Smith 2000b).

  • For some styles, the first in-text citation of a source is different
    than the others. (For example, it may mention more authors.)

  • Some styles put the (Smith 2000) in a footnote when the citation is in the
    text body. What should happen when the citation is already in a footnote? Or in an
    insert capable of having footnotes? And if not capable?
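The second requirement above (year suffixes) can be made concrete with a toy example. This is only an illustration of why a single insertion changes labels elsewhere; a real CSL processor applies its own, richer disambiguation rules. The class YearSuffixSketch, the record Source, and the "Author Year" label format are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy year-suffix disambiguation; not how a CSL processor actually implements it. */
class YearSuffixSketch {
    record Source(String key, String author, int year) {}

    /** Maps each citation key to its label, adding a, b, ... when author and year collide. */
    static Map<String, String> labels(List<Source> citedInDocumentOrder) {
        // Collect the distinct keys behind each "Author Year", in order of first appearance
        Map<String, List<String>> keysByLabel = new LinkedHashMap<>();
        for (Source s : citedInDocumentOrder) {
            List<String> keys = keysByLabel.computeIfAbsent(
                    s.author() + " " + s.year(), label -> new ArrayList<>());
            if (!keys.contains(s.key())) {
                keys.add(s.key());
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : keysByLabel.entrySet()) {
            List<String> keys = entry.getValue();
            for (int i = 0; i < keys.size(); i++) {
                // A suffix is only needed when two different sources share a label
                // (this sketch assumes at most 26 colliding sources)
                String suffix = keys.size() > 1 ? String.valueOf((char) ('a' + i)) : "";
                result.put(keys.get(i), entry.getKey() + suffix);
            }
        }
        return result;
    }
}
```

With one Smith 2000 source the label is "Smith 2000"; once a second, different Smith 2000 source is cited, the first becomes "Smith 2000a" even though its own citation was not touched.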

The upshot is:

  • In general, when inserting a citation, even generating the
    corresponding in-text form is a global operation. Updating the other
    in-text references and the bibliography is even more so.

  • For larger documents with many citations this may become slow.

    • May need an alternative, non-final in-text form.

The flow of information would be

  • The user inserts one or more in-text citations at a location.

    • This becomes a citation group.
    • It may include some extra, like "pp 10-19" for each cited source
    • It may include some extra text like "See" shared by the group.
  • Later, we need

    • Find where the citation groups are in the text.

      • Usual solutions: Specially named reference marks or bookmarks.
        Maybe "Comments". Mainly because these can be queried (get a
        list or iterator) and provide associated text ranges.
    • Get the associated data (citation keys, and the extras)

    • Decide (1) the order of citation groups (and (2) citations in them) for
      numbering and the a,b suffixes (in 2000a,b)

      • For (1), the problem is: apart from the main text, citations may
        (hopefully) go to several types of inserts (like text boxes,
        tables (in-cell or in the description below), figure legend) and
        even if we chase where these inserts are anchored, my
        experience with libreoffice is that moving the insert around
        moves the anchor in unexpected ways. I would not want the numbering
        of references to rely on something I could not reliably control.
        • To solve this, I was thinking about allowing markers in the text
          with the meaning "Table 1 is logically here". This would require
          extra effort from the user, but probably make the order easier to control.
        • The solution JabRef used was asking the coordinates (with its
          problems with two-column documents and when showing
          two or more pages, because the routine used was created
          for a different purpose).
          • To avoid these, maybe libreoffice should be modified to
            provide a similar function that knows about the
            columns. Still, the logical location of an insert placed
            into a two-column text, spanning the whole or parts of the
            two columns is ambiguous. Maybe we should ask the user.
          • When a "float" (in LaTeX sense) is moved to the next
            page or elsewhere: should this change the order of citations
            inside w.r.t those outside?
      • For (2): the order of citations within a group might depend on settings
        in a CSL style (could be order by year, join by same author ...). Hopefully
        CSL will handle this for us.

      So, as I do not understand all the possible variations that CSL considers,
      I would try to avoid parsing its output (as much as possible).

    • When we have decided the orders, we can feed CSL and get our texts
      to be inserted (for citation groups and for the reference list)

    • What we need to store is the information necessary to construct the
      input for CSL.

  • Zotero seems to store the text of in-text citations and use it for
    detecting if the user changed the corresponding part in the text.

    • This stored copy can also be used to diff the earlier and freshly
      created versions and decide if the text needs update, and maybe to
      minimize the updated parts. Diff does not have to understand
      (parse into meaningful parts, like "this is a citation number")
      what it sees (although it should not emit invalid markup)
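One way to picture "specially named reference marks" carrying the citation group data is a small codec between the data and the mark name. The naming scheme below ("JABREF_CSL:<id>:<key1>,<key2>") is invented for this sketch and is not JabRef's actual format; MarkNameCodec and CitationGroupData are hypothetical names. It assumes citation keys contain no commas.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical codec: stores the data of one citation group in a reference mark name. */
class MarkNameCodec {
    // Invented scheme for illustration: JABREF_CSL:<id>:<key1>,<key2>,...
    private static final String PREFIX = "JABREF_CSL";
    private static final Pattern NAME = Pattern.compile(PREFIX + ":(\\d{1,9}):(.+)");

    record CitationGroupData(int id, List<String> citationKeys) {}

    static String encode(CitationGroupData data) {
        return PREFIX + ":" + data.id() + ":" + String.join(",", data.citationKeys());
    }

    /** Returns empty for marks that do not follow the scheme, e.g. marks created by the user. */
    static Optional<CitationGroupData> decode(String markName) {
        Matcher matcher = NAME.matcher(markName);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        int id = Integer.parseInt(matcher.group(1));
        // Keys are assumed not to contain commas
        List<String> keys = List.of(matcher.group(2).split(","));
        return Optional.of(new CitationGroupData(id, keys));
    }
}
```

Because decode ignores names it does not recognise, a refresh could iterate over all reference marks in the document and keep only those that belong to the integration.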

So the information flow:

  • User -> location, citation group data (cited keys and extra)
  • Refresh: Collect locations, the corresponding citation group data,
    order the groups, feed to CSL, update text in the document.

No parsing of CSL output, just comparing old and new output, or new
output and the corresponding part (text, maybe text and markup) of the
document.
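The refresh step described above could be sketched as follows for a numeric style. The renderer parameter stands in for the CSL processor, and the document itself is reduced to a list of groups already in document order; RefreshSketch, Group and Update are hypothetical names, and storedText models the previously inserted text that the new output is compared against.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical refresh: number, render, then compare old and new text without parsing either. */
class RefreshSketch {
    /** One citation group: its mark, the keys it cites, and the text last written for it. */
    record Group(String markName, List<String> keys, String storedText) {}

    /** A pending change: the text under markName should become newText. */
    record Update(String markName, String newText) {}

    static List<Update> refresh(List<Group> groupsInDocumentOrder,
                                Function<List<Integer>, String> renderer) {
        // Number keys from 1 by first appearance across the whole document
        Map<String, Integer> numbers = new LinkedHashMap<>();
        for (Group group : groupsInDocumentOrder) {
            for (String key : group.keys()) {
                numbers.putIfAbsent(key, numbers.size() + 1);
            }
        }
        List<Update> updates = new ArrayList<>();
        for (Group group : groupsInDocumentOrder) {
            List<Integer> groupNumbers = group.keys().stream().map(numbers::get).toList();
            // Sorting and joining within the group is left to the renderer (CSL's job)
            String fresh = renderer.apply(groupNumbers);
            // Plain string comparison: no attempt to understand the rendered text
            if (!fresh.equals(group.storedText())) {
                updates.add(new Update(group.markName(), fresh));
            }
        }
        return updates;
    }
}
```

Only groups whose rendered text differs from the stored copy produce an update, which keeps the number of writes to the document small.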


Terminology: I started to use the terms "CitationGroup" and "Citation" before I looked at
CSL. They use different names for these (maybe "Citation" and "Cite"). This may be confusing, sorry.

@subhramit
Collaborator Author

subhramit commented Sep 6, 2024

  1. The indexing begins from 0, mostly because of public int getCitationNumber(String citationKey). If I change the default to 1, the behavior begins to be weird. One needs to investigate.

Fixed this.

@subhramit
Collaborator Author

  2. The citation text update happens in the reverse order, and I am unable to fix it:
    image

Fixed.

Contributor

@github-actions github-actions bot left a comment

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

@subhramit
Collaborator Author

subhramit commented Sep 6, 2024

Added number re-distribution in #11712.

@subhramit
Collaborator Author

subhramit commented Sep 7, 2024

@Siedlerchr I suppose we can close this? I don't think we have anything left to experiment with, in the context.

@Siedlerchr Siedlerchr closed this Sep 7, 2024
@subhramit subhramit deleted the numberings branch September 7, 2024 17:17
4 participants