more documentation
lfoppiano committed Nov 27, 2024
1 parent 6840da6 commit 8484a95
Showing 9 changed files with 952 additions and 6 deletions.
48 changes: 42 additions & 6 deletions doc/Grobid-specialized-processes.md
@@ -25,19 +25,55 @@ Following, an updated view of the cascade architecture:

At the moment, the flavored processes are available as follows:

| Name | Identifier | Flavored models | Description | Advantages and Limitations |
|-----------------------------------------------|-----------------------|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Article lightweight structure | `article/light` | `segmentation`, `header` | Simple process that extracts only the title, authors, publication date and DOI from the header, and puts everything else in the body | Simple model that can work with any document and brings the advantage of pdfalto processing, which solves many issues with text ordering and column recognition. The limitation is that all noise that is not part of the article, such as references, page numbers, headnotes, and footnotes, is also included in the body. |
| Article lightweight structure with references | `article/light-ref` | `segmentation`, `header` | Simple process that extracts only the title, authors, publication date and DOI from the header, plus the references, and puts everything else in the body | Variation of `article/light` that includes the recognition of references. More versatile than `article/light` across the variety of scientific articles, such as corrections, errata and letters, which may contain references. |

## Benchmarking

The evaluation of the flavors is performed in the same way as the standard processing for scientific articles:

- **BidLSTM_ChainCRF_FEATURES** as sequence labeling for the header model

- **BidLSTM_ChainCRF_FEATURES** as sequence labeling for the reference-segmenter model

- **BidLSTM_CRF_FEATURES** as sequence labeling for the citation model

- **BidLSTM_CRF_FEATURES** as sequence labeling for the affiliation-address model

- **CRF Wapiti** as sequence labeling engine for all other models.

Header extractions are consolidated by default with the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service (the results with the CrossRef REST API as consolidation service should be similar, but processing is much slower).

The evaluation usually creates Grobid files with the suffix `fulltext.tei.xml`. When a flavor is used, the suffix also carries the flavor: for example, files produced with `article/light` are suffixed with `article_light.tei.xml`.
This makes it possible to run the evaluation for multiple flavors without losing the files Grobid has already processed.
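
As an illustration only, the naming convention can be sketched as follows. The helper name `flavored_suffix` is hypothetical, and the rule that `/` becomes `_` is an assumption inferred from the single `article/light` example above; it is not taken from the Grobid code.

```python
def flavored_suffix(flavor: str) -> str:
    """Return the assumed TEI file suffix for a flavor identifier.

    Assumption: the '/' in the identifier is replaced by '_', which matches
    the documented example `article/light` -> `article_light.tei.xml`.
    """
    return flavor.replace("/", "_") + ".tei.xml"


print(flavored_suffix("article/light"))
```

Because each flavor gets its own suffix, the output of one evaluation run does not overwrite the output of another.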

The evaluation is performed on a reduced set of fields:

| Flavor | Header fields | Fulltext fields | Citation fields |
|---------------------|--------------------------------------|-----------------|----------------------------------|
| `article/light` | `title`, `first author`, `authors` | N/A | N/A |
| `article/light-ref` | `title`, `first author`, `authors` | N/A | Same as the standard processing* |

(*) For this flavor the citation fields are included in the evaluation to detect regressions, since citation parsing is performed with the standard citation model.

The benchmark results are listed below, with links to the full reports.

### Article lightweight structure

| Corpus | Header (avg. micro F1 Ratcliff/Obershelp@0.95) | Full report |
|-----------------|------------------------------------------------|----------------------------------------------------------------------------------|
| bioRxiv | 89.4 | [benchmaking-bioxiv.md](benchmarks/flavors/article_light/benchmaking-bioxiv.md) |
| PMC_sample_1943 | 95.71 | [benchmaking-pmc.md](benchmarks/flavors/article_light/benchmaking-pmc.md) |
| PLOS_1000 | 99.37 | [benchmaking-plos.md](benchmarks/flavors/article_light/benchmaking-plos.md) |
| eLife_984 | 90.97 | [benchmaking-elife.md](benchmarks/flavors/article_light/benchmaking-elife.md) |

### Article lightweight structure with references

| Corpus | Header (avg. micro F1 Ratcliff/Obershelp@0.95) | Citations (Instance-level f-score (RatcliffObershelp)) | Full report |
|-----------------|------------------------------------------------|--------------------------------------------------------|-------------------------------------------------------------------------------------|
| bioRxiv | 89.79 | 56.31 | [benchmaking-bioxiv.md](benchmarks/flavors/article_light_ref/benchmaking-bioxiv.md) |
| PMC_sample_1943 | 95.74 | 58.78 | [benchmaking-pmc.md](benchmarks/flavors/article_light_ref/benchmaking-pmc.md) |
| PLOS_1000 | 99.52 | 48.04 | [benchmaking-plos.md](benchmarks/flavors/article_light_ref/benchmaking-plos.md) |
| eLife_984 | 91.35 | 76.14 | [benchmaking-elife.md](benchmarks/flavors/article_light_ref/benchmaking-elife.md) |
72 changes: 72 additions & 0 deletions doc/benchmarks/flavors/article_light/benchmaking-bioxiv.md
@@ -0,0 +1,72 @@
## Header metadata

Evaluation on 1996 random PDF files out of 1998 PDF (ratio 1.0).

#### Strict Matching (exact matches)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 82.92 | 81.5 | 82.2 | 1995 |
| first_author | 96.33 | 94.78 | 95.55 | 1993 |
| title | 78.16 | 73.7 | 75.86 | 1996 |
| | | | | |
| **all fields (micro avg.)** | **85.91** | **83.32** | **84.59** | 5984 |
| all fields (macro avg.) | 85.8 | 83.33 | 84.54 | 5984 |
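
The macro-average row appears to be the unweighted mean of the per-field scores. The sketch below checks this for the F1 column, using only values copied from the table above; the function name `macro_average` is illustrative. The micro average pools the raw counts across fields, which this report does not list, so it is not recomputed here.

```python
from statistics import mean

# Per-field F1 scores copied from the strict-matching table above.
f1_by_field = {"authors": 82.2, "first_author": 95.55, "title": 75.86}


def macro_average(scores: dict) -> float:
    """Unweighted mean over fields: every field counts equally."""
    return mean(list(scores.values()))


macro_f1 = macro_average(f1_by_field)
print(round(macro_f1, 2))
```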

#### Soft Matching (ignoring punctuation, case and space characters mismatches)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|----------|---------|
| authors | 83.53 | 82.11 | 82.81 | 1995 |
| first_author | 96.63 | 95.08 | 95.85 | 1993 |
| title | 80.66 | 76.05 | 78.29 | 1996 |
| | | | | |
| **all fields (micro avg.)** | **87.03** | **84.41** | **85.7** | 5984 |
| all fields (macro avg.) | 86.94 | 84.41 | 85.65 | 5984 |
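
A minimal sketch of what soft matching means, assuming that "ignoring punctuation, case and space characters" amounts to keeping only lowercased alphanumeric characters before comparing. The function names are hypothetical and the exact normalization used by the Grobid evaluation may differ.

```python
def soft_normalize(text: str) -> str:
    """Drop punctuation and whitespace, and lowercase what remains."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def soft_match(expected: str, predicted: str) -> bool:
    """True when both strings are equal after normalization."""
    return soft_normalize(expected) == soft_normalize(predicted)


print(soft_match("Deep Learning: A Survey", "deep learning a survey"))
print(soft_match("Deep Learning", "Deep Earning"))
```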

#### Levenshtein Matching (Minimum Levenshtein distance at 0.8)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 91.59 | 90.03 | 90.8 | 1995 |
| first_author | 96.84 | 95.28 | 96.05 | 1993 |
| title | 92.03 | 86.77 | 89.32 | 1996 |
| | | | | |
| **all fields (micro avg.)** | **93.5** | **90.69** | **92.08** | 5984 |
| all fields (macro avg.) | 93.48 | 90.69 | 92.06 | 5984 |
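
The following sketch illustrates one plausible reading of the Levenshtein criterion: the edit distance is normalized by the length of the longer string and turned into a similarity, which must reach 0.8. This normalization is an assumption made for illustration, not the confirmed formula of the Grobid evaluation, and all function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def levenshtein_match(expected: str, predicted: str, threshold: float = 0.8) -> bool:
    return levenshtein_similarity(expected, predicted) >= threshold


print(levenshtein("kitten", "sitting"))
print(levenshtein_match("kitten", "sitting"))
```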

#### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|----------|---------|
| authors | 87.51 | 86.02 | 86.75 | 1995 |
| first_author | 96.33 | 94.78 | 95.55 | 1993 |
| title | 88.42 | 83.37 | 85.82 | 1996 |
| | | | | |
| **all fields (micro avg.)** | **90.78** | **88.05** | **89.4** | 5984 |
| all fields (macro avg.) | 90.75 | 88.05 | 89.37 | 5984 |
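
To give an intuition for the Ratcliff/Obershelp criterion, the sketch below uses Python's `difflib.SequenceMatcher`, whose `ratio()` is based on a closely related gestalt pattern matching approach. It is a stand-in for illustration, not the implementation used by the Grobid evaluation, so scores near the 0.95 threshold may differ between the two. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def ro_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: twice the matched characters over the total length."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def ro_match(expected: str, predicted: str, threshold: float = 0.95) -> bool:
    return ro_similarity(expected, predicted) >= threshold


print(ro_match("Grobid flavors", "Grobid flavors"))
print(ro_match("abc", "xyz"))
```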

#### Instance-level results

```
Total expected instances: 1996
Total correct instances: 1278 (strict)
Total correct instances: 1312 (soft)
Total correct instances: 1613 (Levenshtein)
Total correct instances: 1496 (ObservedRatcliffObershelp)
Instance-level recall: 64.03 (strict)
Instance-level recall: 65.73 (soft)
Instance-level recall: 80.81 (Levenshtein)
Instance-level recall: 74.95 (RatcliffObershelp)
```
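
The instance-level recall values are consistent with the ratio of correct instances to expected instances, expressed as a percentage. The short check below reproduces them from the counts listed above; the function name `instance_recall` is illustrative.

```python
# Counts copied from the instance-level results above.
expected_instances = 1996
correct_instances = {
    "strict": 1278,
    "soft": 1312,
    "Levenshtein": 1613,
    "RatcliffObershelp": 1496,
}


def instance_recall(correct: int, expected: int) -> float:
    """Percentage of expected instances that were fully correct."""
    return 100.0 * correct / expected


for name, count in correct_instances.items():
    print(f"{name}: {instance_recall(count, expected_instances):.2f}")
```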

Evaluation metrics produced in 15.364 seconds
72 changes: 72 additions & 0 deletions doc/benchmarks/flavors/article_light/benchmaking-elife.md
@@ -0,0 +1,72 @@
## Header metadata

Evaluation on 957 random PDF files out of 982 PDF (ratio 1.0).

#### Strict Matching (exact matches)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 78.74 | 78.16 | 78.45 | 957 |
| first_author | 92 | 91.42 | 91.71 | 956 |
| title | 89.92 | 87.67 | 88.78 | 957 |
| | | | | |
| **all fields (micro avg.)** | **86.87** | **85.75** | **86.31** | 2870 |
| all fields (macro avg.) | 86.89 | 85.75 | 86.31 | 2870 |
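
Each F1 value in these tables is the harmonic mean of the precision and recall on the same row. The sketch below recomputes the three per-field F1 scores of the strict-matching table from its rounded precision and recall; because the inputs are already rounded, a recomputed value can in general differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the strict-matching table above.
rows = {
    "authors": (78.74, 78.16),
    "first_author": (92.0, 91.42),
    "title": (89.92, 87.67),
}

for label, (p, r) in rows.items():
    print(f"{label}: {f1_score(p, r):.2f}")
```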

#### Soft Matching (ignoring punctuation, case and space characters mismatches)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 79.05 | 78.47 | 78.76 | 957 |
| first_author | 92 | 91.42 | 91.71 | 956 |
| title | 97 | 94.57 | 95.77 | 957 |
| | | | | |
| **all fields (micro avg.)** | **89.3** | **88.15** | **88.73** | 2870 |
| all fields (macro avg.) | 89.35 | 88.15 | 88.75 | 2870 |

#### Levenshtein Matching (Minimum Levenshtein distance at 0.8)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 90.53 | 89.86 | 90.19 | 957 |
| first_author | 92.32 | 91.74 | 92.03 | 956 |
| title | 98.5 | 96.03 | 97.25 | 957 |
| | | | | |
| **all fields (micro avg.)** | **93.75** | **92.54** | **93.14** | 2870 |
| all fields (macro avg.) | 93.78 | 92.54 | 93.16 | 2870 |

#### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 84.32 | 83.7 | 84.01 | 957 |
| first_author | 92 | 91.42 | 91.71 | 956 |
| title | 98.5 | 96.03 | 97.25 | 957 |
| | | | | |
| **all fields (micro avg.)** | **91.56** | **90.38** | **90.97** | 2870 |
| all fields (macro avg.) | 91.61 | 90.38 | 90.99 | 2870 |

#### Instance-level results

```
Total expected instances: 957
Total correct instances: 678 (strict)
Total correct instances: 729 (soft)
Total correct instances: 811 (Levenshtein)
Total correct instances: 773 (ObservedRatcliffObershelp)
Instance-level recall: 70.85 (strict)
Instance-level recall: 76.18 (soft)
Instance-level recall: 84.74 (Levenshtein)
Instance-level recall: 80.77 (RatcliffObershelp)
```

Evaluation metrics produced in 13.732 seconds
72 changes: 72 additions & 0 deletions doc/benchmarks/flavors/article_light/benchmaking-plos.md
@@ -0,0 +1,72 @@
## Header metadata

Evaluation on 1000 random PDF files out of 998 PDF (ratio 1.0).

#### Strict Matching (exact matches)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 98.97 | 99.28 | 99.12 | 969 |
| first_author | 99.28 | 99.59 | 99.43 | 969 |
| title | 95.79 | 95.5 | 95.64 | 1000 |
| | | | | |
| **all fields (micro avg.)** | **97.99** | **98.09** | **98.04** | 2938 |
| all fields (macro avg.) | 98.01 | 98.12 | 98.07 | 2938 |

#### Soft Matching (ignoring punctuation, case and space characters mismatches)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 98.97 | 99.28 | 99.12 | 969 |
| first_author | 99.28 | 99.59 | 99.43 | 969 |
| title | 99.3 | 99 | 99.15 | 1000 |
| | | | | |
| **all fields (micro avg.)** | **99.18** | **99.29** | **99.23** | 2938 |
| all fields (macro avg.) | 99.18 | 99.29 | 99.24 | 2938 |

#### Levenshtein Matching (Minimum Levenshtein distance at 0.8)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 99.28 | 99.59 | 99.43 | 969 |
| first_author | 99.38 | 99.69 | 99.54 | 969 |
| title | 99.7 | 99.4 | 99.55 | 1000 |
| | | | | |
| **all fields (micro avg.)** | **99.46** | **99.56** | **99.51** | 2938 |
| all fields (macro avg.) | 99.45 | 99.56 | 99.51 | 2938 |

#### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95)

**Field-level results**

| label | precision | recall | f1 | support |
|-----------------------------|-----------|-----------|-----------|---------|
| authors | 99.18 | 99.48 | 99.33 | 969 |
| first_author | 99.28 | 99.59 | 99.43 | 969 |
| title | 99.5 | 99.2 | 99.35 | 1000 |
| | | | | |
| **all fields (micro avg.)** | **99.32** | **99.42** | **99.37** | 2938 |
| all fields (macro avg.) | 99.32 | 99.42 | 99.37 | 2938 |

#### Instance-level results

```
Total expected instances: 1000
Total correct instances: 950 (strict)
Total correct instances: 985 (soft)
Total correct instances: 989 (Levenshtein)
Total correct instances: 988 (ObservedRatcliffObershelp)
Instance-level recall: 95 (strict)
Instance-level recall: 98.5 (soft)
Instance-level recall: 98.9 (Levenshtein)
Instance-level recall: 98.8 (RatcliffObershelp)
```

Evaluation metrics produced in 12.571 seconds