BBT parser #4

Open
retorquere opened this issue Nov 6, 2019 · 58 comments

@retorquere
Contributor

WRT the issues reported here on the BBT parser:

  • does not seem to support all diacritics (errors on {\r a}): this is fixed now.
  • does not seem to support chained concatenations (a # b # c): I can't replicate this.

The sample below imports in BBT since 5.1.154

@string{j = {a space between this }}
@string{a = { string a}}
@string{b = { string b}}
@string{c = { string c}}
@article{key,
    author  = "Author",
    title   = "{\r a}Title" # a # b # c,
    year    = 1990,
    journal = j # "and this"
}

but the concat part of the title imported before that too.
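
To make the expected result of the sample concrete, here is a minimal sketch of how # concatenation against @string macros could be resolved. It is not the BBT parser's code: resolveConcat and the pre-split parts representation are names made up for this illustration, throwing on an undefined macro is a choice of the sketch, and no LaTeX commands are converted.

```javascript
// Hypothetical helper: resolve a BibTeX "#" concatenation against @string macros.
// Each part is either a literal ({ literal: '...' }) or a macro reference ({ macro: 'name' }).
function resolveConcat(parts, strings) {
  return parts
    .map((part) => {
      if ('literal' in part) return part.literal;
      const key = part.macro.toLowerCase(); // macro names are matched case-insensitively
      // Choice of this sketch: an undefined macro is an error.
      if (!strings.has(key)) throw new Error(`Undefined @string: ${part.macro}`);
      return strings.get(key);
    })
    .join('');
}

// The @string definitions from the sample above.
const strings = new Map([
  ['j', 'a space between this '],
  ['a', ' string a'],
  ['b', ' string b'],
  ['c', ' string c'],
]);

const title = resolveConcat(
  [{ literal: '{\\r a}Title' }, { macro: 'a' }, { macro: 'b' }, { macro: 'c' }],
  strings
);
const journal = resolveConcat([{ macro: 'j' }, { literal: 'and this' }], strings);

console.log(title);   // {\r a}Title string a string b string c
console.log(journal); // a space between this and this
```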

@larsgw
Member

larsgw commented Nov 7, 2019

does not seem to support all diacritics (errors on {\r a}) is fixed now

I've updated the package and will update the surrounding documentation.

does not seem to support chained concatenations (a # b # c) I can't replicate this

I think I removed a part of the test file which removed the concatenation but that also removed the real culprit: it throws for {\i} and {\'i} which both work for natbib (on my Overleaf anyway).

@retorquere
Contributor Author

I also cannot reproduce this. It imports both using the standalone BBT parser and in BBT as I'd expect. I've added this testcase but that passes for me.

@retorquere
Contributor Author

As an aside, it's a simple fix for me to add missing diacritics (or other constructs), but {\i} and {\'i} are in my mapping, so I don't currently know which diacritics "Misses some forms of diacritics" refers to.

@retorquere
Contributor Author

BTW, as far as completeness testing goes, I'd suggest testing at least

  • Names both in standard and biblatex extended format
  • Dates in EDTF
  • Verbatim fields
  • Sentence-casing
  • Math

and optionally

  • JabRef groups

@retorquere
Contributor Author

BTW the BBT parser builds on the astrocite parser (parts of which I wrote), so the BBT parser will necessarily be slower than astrocite. I'm open to looking at the idea parser in this test, but that would need it to either produce an AST which I can postprocess, or that the postprocessing happens during parsing (which I'd not recommend).

@larsgw
Member

larsgw commented Nov 7, 2019

I also cannot reproduce this. It imports both using the standalone BBT parser and in BBT as I'd expect.

It seems to be specifically when a user-defined string with the aforementioned specific forms of diacritics is used in a field (it works fine if they're not used, or if the diacritic is in the field itself like the test case you made). Here's proof I'm not crazy :)

@larsgw
Member

larsgw commented Nov 7, 2019

BTW, as far as completeness testing goes, I'd suggest testing at least

I wanted to leave value parsing to a different part of the parser (namely, the mapping) as the mapping is used for Bib.TXT as well, which I assume has names in both formats, dates in EDTF, verbatim fields and everything else basically. And this parser would be specifically for everything that is BibLaTeX/BibTeX except the values, if that makes sense. I get that that makes it a bit of an unfair comparison, especially performance-wise, but I didn't mean this repository as a way of calling people out, just to see if my results were somewhat adequate.

@larsgw
Member

larsgw commented Nov 7, 2019

I'm open to looking at the idea parser in this test, but that would need it to either produce an AST which I can postprocess, or that the postprocessing happens during parsing (which I'd not recommend).

I'm not trying to convince you to switch, either. I knew my old parser was bad and I wanted to see which method of parsing was best for my purposes (PEG.js, nearley-js, rolling my own parser, etc.). There isn't much postprocessing going on, it converts commands to Unicode, concatenates fields, and puts everything into an object. No conversion to CSL though, but it's not an AST and I understand if this is too much pre-processing.

{
  type: String,
  label: String,
  properties: {
    field: "value" // note: this is verbatim, except command -> Unicode
  }
}

@retorquere
Contributor Author

retorquere commented Nov 7, 2019

I wanted to leave value parsing to a different part of the parser (namely, the mapping) as the mapping is used for Bib.TXT as well, which I assume has names in both formats, dates in EDTF, verbatim fields and everything else basically. And this parser would be specifically for everything that is BibLaTeX/BibTeX except the values, if that makes sense.

Verbatim fields can't be parsed outside the grammar, because verbatim fields have a different parsing mode; the grammar has to know about them. EDTF dates can indeed be done later, but sentence casing and name parsing must be done at the AST level:

  • Sentence casing: for I Like {ISDN} Heaps Better than {dial-up}, the converter must know that ISDN and dial-up are exempt from case changes. Having I Like ISDN Heaps Better than dial-up in CSL does not mean the same thing, and CSL styles that demand sentence-case titles would produce the wrong rendering.
  • {Bausch and Lomb} and {{Bausch and Lomb}} are not the same when parsing lists (such as names).

It's not just the speed difference. The BBT parser (and biblatex-csl-converter) keep the intended meaning structurally better than others in the list.
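
As a rough illustration of the sentence-casing point, the sketch below lowercases a title outside brace groups and leaves braced text untouched. It is a simplification and not how BBT does it: sentenceCase is a hypothetical name, every brace group is treated as case-protected, and LaTeX commands such as {\r a} are not handled at all.

```javascript
// Hypothetical sketch: lowercase a title outside brace groups.
// Text inside braces and the first letter of the title are kept as written.
function sentenceCase(title) {
  let depth = 0;
  let seenLetter = false;
  let out = '';
  for (const ch of title) {
    if (ch === '{') { depth += 1; continue; }                    // drop the brace, enter protected text
    if (ch === '}') { depth = Math.max(0, depth - 1); continue; } // drop the brace, leave protected text
    const isLetter = /\p{L}/u.test(ch);
    if (depth > 0 || (isLetter && !seenLetter)) {
      out += ch;               // protected text and the first letter stay as they are
    } else {
      out += ch.toLowerCase(); // everything else is lowercased
    }
    if (isLetter) seenLetter = true;
  }
  return out;
}

console.log(sentenceCase('I Like {ISDN} Heaps Better than {dial-up}'));
// I like ISDN heaps better than dial-up
console.log(sentenceCase('I Like ISDN Heaps Better than dial-up'));
// I like isdn heaps better than dial-up
```

Once the braces are gone, the second call has no way to know that ISDN was meant to keep its capitals.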

@retorquere
Contributor Author

It seems to be specifically when a user-defined string with the aforementioned specific forms of diacritics is used in a field (it works fine if they're not used, or if the diacritic is in the field itself like the test case you made). Here's proof I'm not crazy :)

This is now fixed.

@retorquere
Contributor Author

(I tried parsing syntax.bib, but it'd require changes to the astrocite parser, and it isn't valid bib(la)tex it seems; overleaf chokes on it at least)

@larsgw
Member

larsgw commented Nov 7, 2019

Verbatim fields can't be parsed outside the grammar, because verbatim fields have a different parsing mode; the grammar has to know about them.

Right, I mixed that up. It's in #3 (checkbox 3).

sentence casing and name parsing must be done at the AST level

Braces in values are kept for that reason (except around some diacritic commands, as D<span class="nocase">é</span>coret is a bit over the top for {\' e}, in my opinion). Bib.TXT needs those too for authors and casing, so that's dealt with in the mapping for the moment.

It's not just the speed difference. The BBT parser (and biblatex-csl-converter) keep the intended meaning structurally better than others in the list.

True, but I think this is well enough for my intended purposes. I can try to make a switch for it to return an AST, given the structure of the parser I think that should be very possible. Anyway, I updated the README to mention the AST capabilities.

@retorquere
Contributor Author

Braces in values are kept for that reason (except around some diacritic commands, as D<span class="nocase">é</span>coret is a bit over the top for {\' e}, in my opinion).

Sure. But it's more complicated than that; the braces usually, but not always, mean nocase. See https://retorque.re/zotero-better-bibtex/support/faq/#why-the-double-braces for some examples and links to details. And then there's still the point that lists (literal lists and names) can only be properly distinguished at the grammar level. nocase isn't appropriate there.

Bib.TXT needs those too for authors and casing, so that's dealt with in the mapping for the moment.

I don't know what Bib.TXT is BTW.

True, but I think this is well enough for my intended purposes.

Can't argue that of course, but then "complete" doesn't mean a whole lot. But at the least, footnote 5 has been fixed now, unless there's more diacritics I missed.

I can try to make a switch for it to return an AST, given the structure of the parser I think that should be very possible. Anyway, I updated the README to mention the AST capabilities.

Cool. BTW if name parsing and the meaning of braces (nocase or not) happens inside the parser, and the parser also converts markup (such as superscript, emph, etc), an AST may not be required. But I found it easier to do those by transforming the AST; that's actually what the BBT parser adds to the astrocite parser. The actual grammar is just that of astrocite, although I did add changes to the astrocite parser to be able to parse my test suite.

My parser also adds a simple form of error recovery BTW. The astrocite parser is an all-or-nothing parser. The BBT parser will parse entries one by one and give some info on entries that failed to parse.

@larsgw
Member

larsgw commented Nov 7, 2019

I don't know what Bib.TXT is BTW.

Sorry, Bib.TXT is just a reskin of BibTeX. I don't think it has gotten much use, but the premise is that it supports Unicode and presents the key/value pairs in a different way, but the values, in theory, stay the same. I say that now, and that's how I implemented it, but I don't really have a way of knowing; that website is my only point of reference.

Anyway, if the values stay the same there are basically two ways of presenting the values, Bib(La)TeX and Bib.TXT. My parser only makes level ground for the two, the rest is in the mapping, i.e. the Bib(La)TeX/Bib.TXT to CSL mapping.

The BBT parser will parse entries one by one and give some info on entries that failed to parse.

I had something like that in my previous parser, I'll see how I can fit that in in this one. I guess braces still have to be paired for your one?

@retorquere
Contributor Author

retorquere commented Nov 7, 2019

Sorry, Bib.TXT is just a reskin of BibTeX. I don't think it has gotten much use, but the premise is that it supports Unicode and presents the key/value pairs in a different way, but the values, in theory, stay the same. I say that now, and that's how I implemented it, but I don't really have a way of knowing; that website is my only point of reference.

I mean... if you're leaning that way, wouldn't TOML or YAML make more sense? At least the more naive parsers (which can sometimes be useful) become trivial.

Anyway, if the values stay the same there are basically two ways of presenting the values, Bib(La)TeX and Bib.TXT. My parser only makes level ground for the two, the rest is in the mapping, i.e. the Bib(La)TeX/Bib.TXT to CSL mapping.

It may be that we see the meaning of "values" differently. For a title, HTML markup will mostly do, as long as the actual intent (which is, as noted, non-trivial) comes through. But name-lists and literal-lists are not strings, they're lists of strings, and you can't safely deduce where they're to be broken into parts without passing on the structure.
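
A small sketch of why the brace structure matters for lists, assuming the value passed in is the field content with the outer field delimiters already removed (so author = {Bausch and Lomb} arrives as Bausch and Lomb, and author = {{Bausch and Lomb}} arrives as {Bausch and Lomb}). splitList is a hypothetical helper, only a lowercase "and" is recognized, and escaped braces are ignored.

```javascript
// Hypothetical sketch: split a list field on the word "and", but only at brace depth 0.
function splitList(value) {
  const separator = /\s+and\s+/y; // sticky: only matches exactly at lastIndex
  const items = [];
  let depth = 0;
  let current = '';
  let i = 0;
  while (i < value.length) {
    const ch = value[i];
    if (ch === '{') depth += 1;
    else if (ch === '}') depth = Math.max(0, depth - 1);

    separator.lastIndex = i;
    const match = depth === 0 ? separator.exec(value) : null;
    if (match) {
      items.push(current);  // close the current item at an unprotected "and"
      current = '';
      i += match[0].length;
      continue;
    }
    current += ch;
    i += 1;
  }
  items.push(current);
  return items.map((item) => item.trim());
}

console.log(splitList('Bausch and Lomb'));   // [ 'Bausch', 'Lomb' ]
console.log(splitList('{Bausch and Lomb}')); // [ '{Bausch and Lomb}' ]
```

If the braces are stripped before this step, both inputs look identical and the company name is split into two names.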

I had something like that in my previous parser, I'll see how I can fit that in in this one. I guess braces still have to be paired for your one?

An unclosed open brace will consume all the input after it, yes, but all other errors (also unexpected closing braces) will skip ahead to the first @ it can find and attempt reparsing from that point on, repeatedly, until all input is parsed or consumed this way. So it will generally report and skip the smallest error it can, with the worst case being a single unpaired open brace.
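
The recovery strategy described here can be sketched roughly as follows. This is not the BBT chunker: parseOne is a toy stand-in for a real grammar (it only checks for an "@type{" header and balanced braces), and parseWithRecovery and the shape of the error objects are assumptions made for the example.

```javascript
// Toy stand-in for a real grammar: reads one "@type{...}" block with balanced braces.
function parseOne(input, start) {
  const header = /@(\w+)\s*\{/y; // sticky: must match exactly at `start`
  header.lastIndex = start;
  const m = header.exec(input);
  if (!m) throw new Error(`expected "@type{" at offset ${start}`);
  let depth = 1;
  let i = header.lastIndex;
  while (i < input.length && depth > 0) {
    if (input[i] === '{') depth += 1;
    else if (input[i] === '}') depth -= 1;
    i += 1;
  }
  if (depth > 0) {
    // An unclosed brace swallows the rest of the input.
    const err = new Error(`unclosed brace in entry starting at offset ${start}`);
    err.end = input.length;
    throw err;
  }
  return { type: m[1], text: input.slice(start, i), end: i };
}

// Parse entry by entry; on failure, report it and retry from the next "@".
function parseWithRecovery(input) {
  const entries = [];
  const errors = [];
  let pos = input.indexOf('@');
  while (pos !== -1) {
    try {
      const entry = parseOne(input, pos);
      entries.push(entry);
      pos = input.indexOf('@', entry.end);
    } catch (err) {
      errors.push({ offset: pos, message: err.message });
      pos = input.indexOf('@', Math.max(pos + 1, err.end ?? 0));
    }
  }
  return { entries, errors };
}

const sample = '@book{a, title = {A}}\n@article b, title = {B}}\n@misc{c, title = {C}}';
const result = parseWithRecovery(sample);
console.log(result.entries.map((e) => e.type)); // [ 'book', 'misc' ]
console.log(result.errors.length);              // 1
```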

@larsgw
Member

larsgw commented Nov 7, 2019

This mixes tokens (lowercase) and rules (capitalized), but that could be changed as long as there are no naming conflicts.

@book{label,
  title = "{T}est"
}

{
  kind: 'Main',
  loc: {
    start: { offset: 0, line: 1, col: 1 },
    end: { offset: 33, line: 3, col: 2 }
  },
  children: [
    {
      kind: 'Entry',
      loc: {
        start: { offset: 0, line: 1, col: 1 },
        end: { offset: 33, line: 3, col: 2 }
      },
      children: [
        {
          kind: 'at',
          loc: {
            start: { offset: 0, line: 1, col: 1 },
            end: { offset: 1, line: 1, col: 2 }
          },
          value: '@'
        },
        {
          kind: 'dataEntryType',
          loc: {
            start: { offset: 1, line: 1, col: 2 },
            end: { offset: 5, line: 1, col: 6 }
          },
          value: 'book'
        },
        {
          kind: 'lbrace',
          loc: {
            start: { offset: 5, line: 1, col: 6 },
            end: { offset: 6, line: 1, col: 7 }
          },
          value: '{'
        },
        {
          kind: 'label',
          loc: {
            start: { offset: 6, line: 1, col: 7 },
            end: { offset: 11, line: 1, col: 12 }
          },
          value: 'label'
        },
        {
          kind: 'comma',
          loc: {
            start: { offset: 11, line: 1, col: 12 },
            end: { offset: 12, line: 1, col: 13 }
          },
          value: ','
        },
        {
          kind: '_',
          loc: {
            start: { offset: 12, line: 1, col: 13 },
            end: { offset: 15, line: 2, col: 2 }
          },
          children: [
            {
              kind: 'whitespace',
              loc: {
                start: { offset: 12, line: 1, col: 13 },
                end: { offset: 15, line: 2, col: 2 }
              },
              value: '\n  '
            }
          ],
          value: undefined
        },
        {
          kind: 'EntryBody',
          loc: {
            start: { offset: 15, line: 2, col: 3 },
            end: { offset: 32, line: 3, col: 0 }
          },
          children: [
            {
              kind: 'Field',
              loc: {
                start: { offset: 15, line: 2, col: 3 },
                end: { offset: 32, line: 3, col: 0 }
              },
              children: [
                {
                  kind: 'identifier',
                  loc: {
                    start: { offset: 15, line: 2, col: 3 },
                    end: { offset: 20, line: 2, col: 8 }
                  },
                  value: 'title'
                },
                {
                  kind: '_',
                  loc: {
                    start: { offset: 20, line: 2, col: 8 },
                    end: { offset: 21, line: 2, col: 9 }
                  },
                  children: [
                    {
                      kind: 'whitespace',
                      loc: {
                        start: { offset: 20, line: 2, col: 8 },
                        end: { offset: 21, line: 2, col: 9 }
                      },
                      value: ' '
                    }
                  ],
                  value: undefined
                },
                {
                  kind: 'equals',
                  loc: {
                    start: { offset: 21, line: 2, col: 9 },
                    end: { offset: 22, line: 2, col: 10 }
                  },
                  value: '='
                },
                {
                  kind: '_',
                  loc: {
                    start: { offset: 22, line: 2, col: 10 },
                    end: { offset: 23, line: 2, col: 11 }
                  },
                  children: [
                    {
                      kind: 'whitespace',
                      loc: {
                        start: { offset: 22, line: 2, col: 10 },
                        end: { offset: 23, line: 2, col: 11 }
                      },
                      value: ' '
                    }
                  ],
                  value: undefined
                },
                {
                  kind: 'Expression',
                  loc: {
                    start: { offset: 23, line: 2, col: 11 },
                    end: { offset: 32, line: 3, col: 0 }
                  },
                  children: [
                    {
                      kind: 'ExpressionPart',
                      loc: {
                        start: { offset: 23, line: 2, col: 11 },
                        end: { offset: 31, line: 2, col: 19 }
                      },
                      children: [
                        {
                          kind: 'QuoteString',
                          loc: {
                            start: { offset: 23, line: 2, col: 11 },
                            end: { offset: 31, line: 2, col: 19 }
                          },
                          children: [
                            {
                              kind: 'quote',
                              loc: {
                                start: { offset: 23, line: 2, col: 11 },
                                end: { offset: 24, line: 2, col: 12 }
                              },
                              value: '"'
                            },
                            {
                              kind: 'Text',
                              loc: {
                                start: { offset: 24, line: 2, col: 12 },
                                end: { offset: 27, line: 2, col: 15 }
                              },
                              children: [
                                {
                                  kind: 'BracketString',
                                  loc: {
                                    start: { offset: 24, line: 2, col: 12 },
                                    end: { offset: 27, line: 2, col: 15 }
                                  },
                                  children: [
                                    {
                                      kind: 'lbrace',
                                      loc: {
                                        start: {
                                          offset: 24,
                                          line: 2,
                                          col: 12
                                        },
                                        end: {
                                          offset: 25,
                                          line: 2,
                                          col: 13
                                        }
                                      },
                                      value: '{'
                                    },
                                    {
                                      kind: 'Text',
                                      loc: {
                                        start: {
                                          offset: 25,
                                          line: 2,
                                          col: 13
                                        },
                                        end: {
                                          offset: 26,
                                          line: 2,
                                          col: 14
                                        }
                                      },
                                      children: [
                                        {
                                          kind: 'text',
                                          loc: {
                                            start: {
                                              offset: 25,
                                              line: 2,
                                              col: 13
                                            },
                                            end: {
                                              offset: 26,
                                              line: 2,
                                              col: 14
                                            }
                                          },
                                          value: 'T'
                                        }
                                      ],
                                      value: 'T'
                                    },
                                    {
                                      kind: 'rbrace',
                                      loc: {
                                        start: {
                                          offset: 26,
                                          line: 2,
                                          col: 14
                                        },
                                        end: {
                                          offset: 27,
                                          line: 2,
                                          col: 15
                                        }
                                      },
                                      value: '}'
                                    }
                                  ],
                                  value: 'T'
                                }
                              ],
                              value: '{T}'
                            },
                            {
                              kind: 'Text',
                              loc: {
                                start: { offset: 27, line: 2, col: 15 },
                                end: { offset: 30, line: 2, col: 18 }
                              },
                              children: [
                                {
                                  kind: 'text',
                                  loc: {
                                    start: { offset: 27, line: 2, col: 15 },
                                    end: { offset: 30, line: 2, col: 18 }
                                  },
                                  value: 'est'
                                }
                              ],
                              value: 'est'
                            },
                            {
                              kind: 'quote',
                              loc: {
                                start: { offset: 30, line: 2, col: 18 },
                                end: { offset: 31, line: 2, col: 19 }
                              },
                              value: '"'
                            }
                          ],
                          value: '{T}est'
                        }
                      ],
                      value: '{T}est'
                    },
                    {
                      kind: '_',
                      loc: {
                        start: { offset: 31, line: 2, col: 19 },
                        end: { offset: 32, line: 3, col: 0 }
                      },
                      children: [
                        {
                          kind: 'whitespace',
                          loc: {
                            start: { offset: 31, line: 2, col: 19 },
                            end: { offset: 32, line: 3, col: 0 }
                          },
                          value: '\n'
                        }
                      ],
                      value: undefined
                    }
                  ],
                  value: '{T}est'
                }
              ],
              value: [ 'title', '{T}est' ]
            }
          ],
          value: { title: '{T}est' }
        },
        {
          kind: 'rbrace',
          loc: {
            start: { offset: 32, line: 3, col: 1 },
            end: { offset: 33, line: 3, col: 2 }
          },
          value: '}'
        }
      ],
      value: { type: 'book', label: 'label', properties: { title: '{T}est' } }
    }
  ],
  value: [ { type: 'book', label: 'label', properties: { title: '{T}est' } } ]
}

@retorquere
Contributor Author

What is being mixed? I don't understand? This is the AST produced by the new idea parser?

@larsgw
Member

larsgw commented Nov 7, 2019

This is the AST produced by the new idea parser?

Yes.

What is being mixed?

I'm using a tokenizer (moo) which splits up the text into parts like lbrace and at and text based on where it's at in the file. Then, the rules are defined based on those tokens instead of individual characters, which helped a lot with the performance on abstracts for example. However, the AST has both rules (as branches) and tokens (as leaves) with no real distinction except their name and their position in the tree.

@retorquere
Contributor Author

I see. But as far as I can tell, the tokens should be easy enough to filter out, and that should leave a fairly clean nested AST, which I could then inspect and transform.
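
A minimal sketch of that filtering, assuming the node shape from the dump above (kind, optional children, value) and assuming that rule kinds start with an uppercase letter while everything else is a token. pruneTokens is a hypothetical helper; it also drops the _ whitespace nodes and the loc information to keep the output short.

```javascript
// Assumption: rule kinds start with an uppercase letter; tokens (and "_") do not.
const isRule = (node) => /^[A-Z]/.test(node.kind);

// Hypothetical helper: keep only rule nodes, recursively, and drop `loc` for brevity.
function pruneTokens(node) {
  const pruned = { kind: node.kind, value: node.value };
  const children = (node.children || []).filter(isRule).map(pruneTokens);
  if (children.length > 0) pruned.children = children;
  return pruned;
}

// Abbreviated, hand-written version of the Field node from the dump above.
const field = {
  kind: 'Field',
  value: ['title', '{T}est'],
  children: [
    { kind: 'identifier', value: 'title' },
    { kind: '_', children: [{ kind: 'whitespace', value: ' ' }] },
    { kind: 'equals', value: '=' },
    {
      kind: 'Expression',
      value: '{T}est',
      children: [
        {
          kind: 'QuoteString',
          value: '{T}est',
          children: [
            { kind: 'quote', value: '"' },
            { kind: 'Text', value: '{T}', children: [{ kind: 'BracketString', value: 'T' }] },
            { kind: 'Text', value: 'est', children: [{ kind: 'text', value: 'est' }] },
            { kind: 'quote', value: '"' },
          ],
        },
      ],
    },
  ],
};

const prunedField = pruneTokens(field);
console.log(JSON.stringify(prunedField, null, 2));
```

Because each rule node carries its own value, dropping the leaf tokens does not lose the text itself.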

Can I play with this? I am curious what {Bausch and Lomb} and {{Bausch and Lomb}} would return. From what I see above I suspect I'd get something where I can see the difference between these two ands.

How would I add test cases to the idea parser? First thing is I'd be curious to see if my existing tests parse at all.

Error recovery is separate in my parser BTW. If it can be built into the idea parser it will almost certainly be faster, but if not, I could just keep my existing one; the error recovery works by chunking the input into individual entries/strings/comments, then parsing these individually with the astrocite parser, then reassembling the results (among other things by replacing references to @strings with the AST of those @strings).

@larsgw
Member

larsgw commented Nov 7, 2019

Can I play with this? I am curious what {Bausch and Lomb} and {{Bausch and Lomb}} would return. From what I see above I suspect I'd get something where I can see the difference between these two ands.

I'll push the changes to ast.

How would I add test cases to the idea parser? First thing is I'd be curious to see if my existing tests parse at all.

In principle just by adding files to the test/files/ directory. I updated the test suite (in the ast branch) so it works for a single parser and numerous files instead of many parsers and a few files. You can run npm test to run the parser on every file in test/files/. You can also run

node test/ast.js test/files/single.bib

to get a single file's AST output. Note that those can be pretty long, longer than my terminal's scrollback anyway. For the sake of brevity, the updated test suite only prints success on success.

@retorquere
Contributor Author

If I do

node test/ast.js test/files/syntax.bib 

I get

[
  {
    "type": "book",
    "label": "sweig42",
    "properties": {
      "author": "Stefan Swe{\\i}g and Xavier D\\'ecoret",
      "title": " The {impossible} ℡—book ",
      "publisher": " D\\\"ead Poₑeet Society",
      "year": 1942,
      "month": "03"
    }
  }
]

which isn't what I expected. Should this have been the AST?

@larsgw
Member

larsgw commented Nov 7, 2019

Did you run npm run babel to update lib/ first?

@retorquere
Contributor Author

Right, now it gives me the AST.

@retorquere
Contributor Author

It parses most of my test suite files, with these exceptions:

../bibtex/tests/better-bibtex/export/Really Big whopping library.bib
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

../bibtex/tests/better-bibtex/import/Async import, large library #720.bib
Error: invalid syntax at line 64197 col 1:

  @inproceedings{Mills2012a,
  ^

../bibtex/tests/better-bibtex/import/Endnote should parse.bib
SyntaxError: expected "comma", got "label" at line 3 col 11:

  	  author =
            ^ (Main->Entry)

../bibtex/tests/better-bibtex/import/Import Jabref fileDirectory, unexpected reference type #1058.bib
SyntaxError: expected "comma", got "label" at line 33 col 23:

  @Comment{jabref-meta: databaseType:bibtex;}
                        ^ (Main->Entry)

../bibtex/tests/better-bibtex/import/Jabref groups import does not work #717.3.8.bib
SyntaxError: expected "comma", got "label" at line 36 col 23:

  @Comment{jabref-meta: databaseType:bibtex;}
                        ^ (Main->Entry)

../bibtex/tests/better-bibtex/import/Maintain the JabRef group and subgroup structure when importing a BibTeX db #97.bib
Error: invalid syntax at line 9242 col 52:

  results for Z~S/CUZO and 7.nO/Cu20 heterojunctions.},
                                                     ^

../bibtex/tests/better-bibtex/import/Some bibtex entries quietly discarded on import from bib file #873.bib
SyntaxError: expected "lbrace", got "label" at line 1954 col 10:

  @Comment Len
           ^ (Main->Entry)

Cleanup of the AST will be a bit of work; I'll take a look over the weekend.

@larsgw
Member

larsgw commented Nov 7, 2019

I'll look at the test results this weekend as well.

@retorquere
Contributor Author

One other thing that the chunker adds is optional async BTW. It's not really "background" async, but it will yield to the event loop after every chunk which allows other tasks to interleave with parsing.
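
A sketch of what yielding after every chunk could look like, assuming the input has already been split into chunks. parseChunksAsync and the parseChunk callback are hypothetical names, and setTimeout is used here only as one simple way to hand control back to the event loop; it is not necessarily what the chunker does.

```javascript
// Resolves on a later turn of the event loop, so pending timers and I/O callbacks can run.
const yieldToEventLoop = () => new Promise((resolve) => setTimeout(resolve, 0));

// Hypothetical sketch: parse chunk by chunk, yielding to the event loop in between.
async function parseChunksAsync(chunks, parseChunk) {
  const results = [];
  for (const chunk of chunks) {
    results.push(parseChunk(chunk)); // the parse itself is still synchronous
    await yieldToEventLoop();        // other tasks may interleave here
  }
  return results;
}

// Usage with a dummy "parser" that just measures the chunk length.
const pending = parseChunksAsync(['@a{x}', '@b{y}', '@c{z}'], (chunk) => chunk.length);
pending.then((lengths) => console.log(lengths)); // [ 5, 5, 5 ]
```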

@retorquere
Contributor Author

retorquere commented Nov 8, 2019

Oh and wrt verbatim fields, mendeley gets this wrong for eg file fields so my parser has an option to choose whether file fields are verbatim or not.

At one time, endnote also exported items without citation keys. There's a ton of real-life crap in my test suite - just because it parses doesn't necessarily mean the meaning is extracted properly.

@retorquere
Contributor Author

BTW, I've put together a quicky test runner based on benchmark.js and the numbers shift a little; some better, some worse: https://gist.github.com/retorquere/79fb0ad7062a85a1d83e4b004d40985e

@larsgw
Member

larsgw commented Nov 10, 2019

Oh and wrt verbatim fields, mendeley gets this wrong for eg file fields so my parser has an option to choose whether file fields are verbatim or not.

Good idea, I'll make them configurable when I implement it.

BTW, I've put together a quicky test runner based on benchmark.js and the numbers shift a little; some better, some worse: https://gist.github.com/retorquere/79fb0ad7062a85a1d83e4b004d40985e

Cool! I'll add the figures (and/or the test suite) to the repo.

@retorquere
Contributor Author

Another thing (just added to @retorquere/bibtex-parser): only "engl-ish" (english and some variants like usenglish, american, etc) should be sentence cased on import.
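
As an illustration only, such a check could look like the sketch below. The set contains just the language names mentioned above, so a real list would be longer; shouldSentenceCase, the langid/language field lookup, and the choice to treat entries without a language as English are all assumptions of this sketch, not the behaviour of @retorquere/bibtex-parser.

```javascript
// Assumption: only the language names mentioned above; a real list would be longer.
const ENGLISH_LANGIDS = new Set(['english', 'usenglish', 'american']);

// Hypothetical helper: decide whether the titles of an entry should be sentence-cased on import.
function shouldSentenceCase(fields) {
  const langid = (fields.langid || fields.language || '').trim().toLowerCase();
  // Assumption: an entry without a language is treated as English.
  return langid === '' || ENGLISH_LANGIDS.has(langid);
}

console.log(shouldSentenceCase({ langid: 'american' })); // true
console.log(shouldSentenceCase({ langid: 'german' }));   // false
```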

@retorquere
Contributor Author

The BBT parser has been updated -- {\emph same} wasn't recognized properly. I didn't think anyone would ever use this, but it's in my test suite. It behaves differently from {\it same} BTW. {\it same} italicizes same, {\emph same} italicizes just the s.

On two of my test files, at least astrocite runs out of memory, where my parser will parse them correctly (if slowly; they're 8.2 MB and 11 MB respectively).

@retorquere
Contributor Author

retorquere commented Jan 3, 2020

Does the citation-js parser handle verbatim fields (like url and file) and verbatim commands (\url, \href, probably others)?

A few things recently fixed in the BBT parser that citation-js may not yet be aware of:

  1. $\frac n 2 + 5$ is valid, and equivalent to $\frac{n}{2} + 5$
  2. < and > mean different things depending on whether you're parsing in math mode or text mode.
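
The first point comes down to how TeX reads an undelimited argument: either a braced group or the next single token, with leading spaces skipped. A rough sketch of that rule, assuming a plain string as input; readArgument and normalizeFrac are hypothetical helpers, only the first \frac is rewritten, and escaped braces are ignored.

```javascript
// Read one TeX argument starting at `pos`: a braced group, a control sequence, or one character.
function readArgument(src, pos) {
  while (pos < src.length && /\s/.test(src[pos])) pos += 1; // spaces before an argument are skipped
  if (pos >= src.length) throw new Error('missing argument');
  if (src[pos] === '{') {
    let depth = 1;
    let i = pos + 1;
    while (i < src.length && depth > 0) {
      if (src[i] === '{') depth += 1;
      else if (src[i] === '}') depth -= 1;
      i += 1;
    }
    if (depth > 0) throw new Error('unclosed brace');
    return { arg: src.slice(pos + 1, i - 1), end: i }; // content without the outer braces
  }
  if (src[pos] === '\\') {
    const controlSequence = /\\(?:[A-Za-z]+|.)/y;
    controlSequence.lastIndex = pos;
    const m = controlSequence.exec(src);
    if (m) return { arg: m[0], end: controlSequence.lastIndex };
  }
  return { arg: src[pos], end: pos + 1 }; // a single character is a complete argument
}

// Rewrite the first \frac so that both arguments are explicitly braced.
function normalizeFrac(math) {
  const start = math.indexOf('\\frac');
  if (start === -1) return math;
  const num = readArgument(math, start + '\\frac'.length);
  const den = readArgument(math, num.end);
  return `${math.slice(0, start)}\\frac{${num.arg}}{${den.arg}}${math.slice(den.end)}`;
}

console.log(normalizeFrac('\\frac n 2 + 5'));   // \frac{n}{2} + 5
console.log(normalizeFrac('\\frac{n}{2} + 5')); // \frac{n}{2} + 5
```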

BBT has its own AST parser now, which is based on a version of astrocite grammar but has seen substantial (and incompatible) changes since.

It still seems strange to me to label parsers "complete" merely because they don't crash. Name parsing, verbatim fields, title-sentence casing, command-argument handling are all crucial parts of parsing bibtex. I'd wager that none of the "complete" parsers will parse {Bausch and Lomb} vs {{Bausch and Lomb}} correctly, or handle $\frac n 2 + 5$ properly.

@retorquere
Contributor Author

retorquere commented Mar 9, 2020

Nice on the updated tests! BBT 3.1.20 fixes all non-gimmick tests and some gimmick tests.

What do you think the state of idea-reworked is now? Given how fast it is I may want to build on it, but I'd need to be able to pass my own test suite.

@larsgw
Copy link
Member

larsgw commented Mar 9, 2020

The main part missing from idea-reworked right now is the actual mapping to CSL or other output formats. That includes field information as well, such as url and verbatim fields and automatic recognition of list fields. And for that, I need some distinction between natbib and biblatex, as they have minor differences in syntax. Note: I am aware that a lot of this is minor edge cases (apart from the field information).

I have been working on mappings over at the aptly named bibtex-mappings, I don't remember if I linked it before. The repository contains some data text-mined from documentation (the biblatex docs are especially usable for this) to be combined with hand-crafted mappings.

#3 is still pretty up-to-date, I have been mainly focused on fixing the test suites and README, and a workaround for the command concatenation gimmick. I'm trying to fully get back into it and sift through the issues and comments soon.

@retorquere
Copy link
Contributor Author

I understand why you'd want mapping to other objects, but I just want the parsed object (pretty much what _intoFixtureOutput delivers), and I'll take it from there, as I'm targeting specifically conversion to Zotero objects.

The command concatenation gimmick would be pretty difficult to address in my parser, but to me that wouldn't be any kind of priority. It's interesting to see that your parser can deal with it successfully, but it's not something I expect to see in the wild.

#3 still has a long list of stuff I absolutely need in the todo list, so I'd have to wait on that. I'm subscribed to the issue, but I won't be notified of edits, just new comments.

@larsgw
Copy link
Member

larsgw commented Sep 10, 2020

Do you happen to have some documentation for the extended name format? I am working on name parsing now and I did find 3.6 Data Annotations in the BibLaTeX manual but that's slightly different from how the feature fixture you added works.

@retorquere
Copy link
Contributor Author

I don't have docs handy, no, and maybe I misunderstood it when I built it. What difference do you see?

@larsgw
Copy link
Member

larsgw commented Sep 10, 2020

Apparently, what you have works but I have not found it in the manual yet. I did find this, on page 82 in http://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf:

@MISC{ann1,
    AUTHOR = {Last1, First1 and Last2, First2 and Last3, First3},
    AUTHOR+an = {1:family=student;2=corresponding}
}

But the name-parts are not overwritten by the annotation.
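
For reference, a sketch of how such an annotation value could be parsed, assuming the shape `[item[:part]]=annotation[,annotation]` with specs separated by `;`, which is how the example above reads. `parseAnnotations` and the `Annotation` shape are hypothetical. The result is kept separate from the parsed names, so nothing is overwritten.

```typescript
// Hypothetical result shape for one annotation spec.
interface Annotation {
  item: number | null // 1-based list item, null for a field-level annotation
  part: string | null // name part such as "family", null for the whole item
  values: string[]    // the comma-separated annotations
}

function parseAnnotations(spec: string): Annotation[] {
  const result: Annotation[] = []
  const specs = spec.split(';')
  for (let i = 0; i < specs.length; i++) {
    const one = specs[i].trim()
    if (one === '') continue
    // Optional "item" and optional ":part", then "=" and the annotations.
    const m = /^(?:(\d+)(?::([^=]+))?)?=(.*)$/.exec(one)
    if (m === null) throw new Error('bad annotation: ' + one)
    result.push({
      item: m[1] === undefined ? null : parseInt(m[1], 10),
      part: m[2] === undefined ? null : m[2].trim(),
      values: m[3].split(',').map(function (v) { return v.trim() }),
    })
  }
  return result
}
```

For the `AUTHOR+an` value above this gives one annotation on the `family` part of the first name and one on the second name as a whole.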

@larsgw
Copy link
Member

larsgw commented Sep 10, 2020

There's an example of what you implemented here: https://github.com/plk/biblatex/blob/dev/doc/latex/biblatex/examples/93-nameparts.tex

@retorquere
Copy link
Contributor Author

But the name-parts are not overwritten by the annotation.

Looking at the docs, I don't think they're meant to overwrite name-parts? They add annotations to the specific name-parts, and those annotations can be used in specialized styles; I've only seen it used in annotated bibliographies myself.

@larsgw
Copy link
Member

larsgw commented Sep 11, 2020

I updated the feature fixtures to include all the name parts instead of just the last name, and on one I encountered unexpected \u0004 characters in BBT's output. They seem to come from https://github.com/retorquere/bibtex-parser/blob/f41af75fd9350507279b42078d07de1187699455/index.ts#L63-L67, when is that used for specifically? Should it still be there in the output?

@larsgw
Copy link
Member

larsgw commented Sep 12, 2020

Looking at the docs, I don't think they're meant to overwrite name-parts? They add annotations to the specific name-parts, and those annotations can be used in specialized styles; I've only seen it used in annotated bibliographies myself.

I think you're right, still a bit confused about the annotation in the example though. Why would someone annotate specifically the family part of the name with "student"?

@retorquere
Copy link
Contributor Author

I can't say with certainty, but this looks to me like a synthetic sample meant to show what's possible with annotations, more than an actual sample from an actual annotated bibliography.

@retorquere
Copy link
Contributor Author

retorquere commented Sep 12, 2020

Those 0004 chars should not be in the output, I'll look into that.

@larsgw
Copy link
Member

larsgw commented Sep 12, 2020

If it helps, I saw it when there were braces in explicit name part values in the extended name format:

@article{test,
  author = {family=Duchamp, given=Philippe, given-i={Ph}}
}
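
A sketch of parsing that extended name value, assuming `key=value` pairs separated by commas at brace depth zero and one pair of protective outer braces around a value. `splitTopLevel` and `parseExtendedName` are hypothetical helpers; whether BBT strips the braces this way is not stated in the thread.

```typescript
// Hypothetical helper: splits on a separator character at brace depth zero.
function splitTopLevel(value: string, separator: string): string[] {
  const parts: string[] = []
  let depth = 0
  let current = ''
  for (let i = 0; i < value.length; i++) {
    const ch = value.charAt(i)
    if (ch === '{') depth++
    else if (ch === '}') depth--
    if (ch === separator && depth === 0) {
      parts.push(current)
      current = ''
    } else {
      current += ch
    }
  }
  parts.push(current)
  return parts
}

function parseExtendedName(value: string): { [part: string]: string } {
  const name: { [part: string]: string } = {}
  const pairs = splitTopLevel(value, ',')
  for (let i = 0; i < pairs.length; i++) {
    const eq = pairs[i].indexOf('=')
    if (eq === -1) throw new Error('not in extended name format: ' + pairs[i])
    const key = pairs[i].slice(0, eq).trim().toLowerCase()
    let val = pairs[i].slice(eq + 1).trim()
    // Drop one pair of outer braces: {Ph} becomes Ph.
    // Simplification: assumes the first and last brace belong together.
    if (val.length >= 2 && val.charAt(0) === '{' && val.charAt(val.length - 1) === '}') {
      val = val.slice(1, -1)
    }
    name[key] = val
  }
  return name
}
```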

@retorquere
Copy link
Contributor Author

Thanks, that is fixed in the latest release.

@retorquere
Copy link
Contributor Author

I'm also tinkering with chevrotain to remove a pass from my parser.

@larsgw
Copy link
Member

larsgw commented Sep 24, 2020

Cool! I think I might have heard of chevrotain before but I do not recognize the website... the uppercase function names seem familiar though.

@retorquere
Copy link
Contributor Author

I've tried chevrotain, but if your test results are anything to go by, your parser is 2-3 times faster than my lexer alone. I can't replicate your results because npm install fails for me, but clearly I should be looking to use your parser for speed. What's the current state of things? I see that moo is only used "for now", do you intend to remove that dependency?

@larsgw
Copy link
Member

larsgw commented Nov 9, 2020

I see that moo is only used "for now", do you intend to remove that dependency?

I see I have not updated those READMEs in a while, mainly focusing on the automated test suite. I do not like the current tokenizer as it encodes a lot of state, but I believe it works well in terms of speed (you would have to test that though, it still feels weird to me that that code is actually faster). I used my own tokenizer before and I was thinking of making something similar to replace moo, but right now I am focusing on making a stable release of @citation-js/plugin-bibtex with the new parser.

@retorquere
Copy link
Contributor Author

My lexer takes 5s on long.bib. I was really surprised by that. This lexer has so much state that it's close to being a parser -- I'm considering just doing everything there, but not if your parser is already 3 times faster.

@retorquere
Copy link
Contributor Author

What's the downside of using moo?

@retorquere
Copy link
Contributor Author

At the current state of things I'd be better off either helping build out citationjs, or even just using citationjs as a lexer.

How do you feel about typescript?

@larsgw
Copy link
Member

larsgw commented Nov 9, 2020

What's the downside of using moo?

moo isn't really a problem.

How do you feel about typescript?

I would like to try it out at some point but I also want the citationjs parser to be part of @citation-js/plugin-bibtex and I was not planning on adding TypeScript infrastructure to that. I guess the parser itself could be a separate package though.

@retorquere
Copy link
Contributor Author

I wouldn't mind cooperating on a parser, especially seeing how fast yours is -- if that's of any interest to you of course. I don't want to rely on a parser I'm not personally involved in though -- my users are often on tight deadlines, and I'd prefer to be able to roll out fixes quickly when necessary. I wouldn't want to incur functional loss against the BBT parser though, and I've grown really fond of typescript, it has prevented so many problems at compile time over the years for me.

@larsgw
Copy link
Member

larsgw commented Nov 10, 2020

I wouldn't mind cooperating on a parser, especially seeing how fast yours is -- if that's of any interest to you of course. I don't want to rely on a parser I'm not personally involved in though -- my users are often on tight deadlines, and I'd prefer to be able to roll out fixes quickly when necessary. I wouldn't want to incur functional loss against the BBT parser though, and I've grown really fond of typescript, it has prevented so many problems at compile time over the years for me.

It does sound interesting but right now I feel like the parsers are incompatible in that sense (and our goals maybe as well). If it is okay with you, I will get back to you about this later.

What's the downside of using moo?

Some more thoughts:

Right now my parser lexes everything with moo. This works with different states to group tokens when they have no individual meaning, like alphanumeric characters in text fields. This grouping improved performance a lot in tests, especially with fields like abstract. However, with this the syntax is encoded in both the lexer states and the actual grammar, as you mentioned. As an alternative, maybe a lexer could be built where you can specify the "state" in the grammar itself. The lexer does not lex everything at once anyway, it has an iterator. This might improve performance:

  • no double passes over syntax (lexer & grammar)
  • more specific tokens

But it might also not: somehow, moo is pretty fast, maybe by using one RegExp per state, and I do not know yet if a custom lexer can replicate that with multiple RegExps (as states would be eliminated and tokens matched based on pattern names).
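
The "one RegExp per state" idea can be sketched as follows: every rule of a state becomes one capture group in a single alternation, so each token costs one `exec` call, and a run of ordinary characters is grouped into a single `text` token. This is an illustrative single-state sketch with hypothetical rule names, not moo's implementation and not the citation-js tokenizer.

```typescript
interface Token { type: string, value: string }

// Hypothetical rule table for one lexer state; rule order decides priority.
// Rule patterns must not contain capture groups of their own.
const RULES: Array<[string, string]> = [
  ['command', '\\\\(?:[A-Za-z]+|.)'],
  ['lbrace', '\\{'],
  ['rbrace', '\\}'],
  ['math', '\\$'],
  ['whitespace', '\\s+'],
  // Grouping: a whole run of ordinary characters becomes one token.
  ['text', '[^\\\\{}$\\s]+'],
]

// Builds one alternation with a capture group per rule.
function compile(rules: Array<[string, string]>): RegExp {
  const alternatives: string[] = []
  for (let i = 0; i < rules.length; i++) alternatives.push('(' + rules[i][1] + ')')
  return new RegExp(alternatives.join('|'), 'g')
}

function tokenize(input: string): Token[] {
  const re = compile(RULES)
  const tokens: Token[] = []
  let pos = 0
  while (pos < input.length) {
    re.lastIndex = pos
    const m = re.exec(input)
    if (m === null || m.index !== pos) throw new Error('unexpected character at ' + pos)
    // The capture group that matched tells which rule fired.
    for (let g = 0; g < RULES.length; g++) {
      if (m[g + 1] !== undefined) {
        tokens.push({ type: RULES[g][0], value: m[0] })
        break
      }
    }
    pos += m[0].length
  }
  return tokens
}
```

A lexer that tests several regexes per position does the same work with one `exec` call per rule instead of one per token, which may account for part of the speed difference discussed here.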

I guess this would be the structure of a more normal parser?

especially seeing how fast yours is

I am really curious now why mine is faster. Maybe the lack of back-tracking? I keep being afraid I messed up the benchmarks though.

@retorquere
Copy link
Contributor Author

retorquere commented Nov 11, 2020

It does sound interesting but right now I feel like the parsers are incompatible in that sense (and our goals maybe as well). If it is okay with you, I will get back to you about this later.

Sure. I'm going to tinker with it to see if it's a better base than my lexer in the interim.

Right now my parser lexes everything with moo. This works with different states to group tokens when they have no individual meaning, like alphanumeric characters in text fields.

That's what my lexer does too.

This grouping improved performance a lot in tests, especially with fields like abstract.

What is special about abstract?

However, with this the syntax is encoded in both the lexer states and the actual grammar, as you mentioned. As an alternative, maybe a lexer could be built where you can specify the "state" in the grammar itself.

The same holds for my chevrotain attempt, and I think by necessity, unless you use kludges for things like name fields (of the kind my main parser has right now).

The lexer does not lex everything at once anyway, it has an iterator. This might improve performance:

I don't know how that could make a difference? Are iterators especially efficient at generating tokenized text?

* no double passes over syntax (lexer & grammar)

Yeah but if I understand correctly, citationjs in the tests of this repo already does a double pass, and it is about 2-3 times faster than my single-pass lexer.

But it might also not: somehow, moo is pretty fast, maybe by using one RegExp per state

That could well be it. I test multiple regexes per state. My lexer did already pick out commands though.

and I do not know yet if a custom lexer can replicate that with multiple RegExps (as states would be eliminated and tokens matched based on pattern names).

I'm not sure what this says, sorry.

especially seeing how fast yours is

I am really curious now why mine is faster. Maybe the lack of back-tracking? I keep being afraid I messed up the benchmarks though.

My lexer doesn't back-track either, so yours seems to be structurally faster.
