Merge branch 'feature/segmentation-light' into feature/segmentation-l…

…ight_das_training
kermitt2 · Dec 25, 2024 · 41784a2
2 parents 797f4d5 + 57fbe4d
commit 41784a2

Showing 25 changed files with 1,493 additions and 1,447 deletions.
diff --git a/doc/Grobid-specialized-processes.md b/doc/Grobid-specialized-processes.md
@@ -25,10 +25,52 @@ Following, an updated view of the cascade architecture:
 
 At the moment, the flavored processes are available as follows:
 
-| Name                                          | Identifier          | Flavored models          | Description                                                                                                                                     | Advantages and Limitations                                                                                                                                                                                                                                                                                            |
-|-----------------------------------------------|---------------------|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Article lightweight structure                 | `article/light`     | `segmentation`, `header` | Simple process that extracts only title, authors, publication date and doi from the header, and put everything else in the body                 | Simple model that can work with any document and bring the advantage of pdfalto processing which solves many issue with text ordering and column recognition. Limitation are that all noise not being part of the article, such as references, page numbers, head notes, and footnotes are also included in the body. |
-| Article lightweight structure with references | `article/light-ref` | `segmentation`, `header` | Simple process that extracts only title, authors, publication date and doi from the header, the references, and put everything else in the body | Variation of the `article/light` that includes the recognition of references. More versatile than `article/light` in the realm of variation of scientific articles, such as corrections, erratums, letters which may contain references.                                                                              |
+| Name                                          | Identifier          | Flavored models          | Description                                                                                                                                                                                                                              | Advantages                                                                                                                                                                                                                                   | Limitations                                                                                                                                                |
+|-----------------------------------------------|---------------------|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Article lightweight structure                 | `article/light`     | `segmentation`, `header` | Simple process that extracts only the title, authors, publication date and DOI from the header, and puts everything else into fulltext body paragraphs (the first paragraph is the unclassified header text, the second paragraph is the body text) | Simple model that can work with any document and brings the advantage of pdfalto processing, which solves many issues with text ordering and column recognition. | All noise that is not part of the article, such as references, page numbers, head notes and footnotes, is also included in the body. No fulltext processing. |
+| Article lightweight structure with references | `article/light-ref` | `segmentation`, `header` | Simple process that extracts only the title, authors, publication date and DOI from the header, plus the references, and puts everything else into fulltext body paragraphs (the first paragraph is the unclassified header text, the second paragraph is the body text) | Variation of `article/light` that includes the recognition of references. More versatile than `article/light` across variants of scientific articles, such as corrections, errata and letters, which may contain references. | Same as `article/light`. No fulltext processing. |
+
+### Article lightweight structure (`article/light`), and with references (`article/light-ref`)
+
+In large harvesting pipelines for scientific content, it is common to have documents that are related to scientific articles, such as corrections, letters to editors, editorials, etc., that are not well processed by the standard Grobid processing because their structure is different.
+The most common difference is the absence of an abstract and references. Some may just have a title and a body.
+These kinds of documents still represent a relevant part of scientific communication.
+
+![corrigendum-1.png](img/corrigendum-1.png)
+
+_Article lightweight_ specialised processes are a simplified version of the standard processing for scientific articles, which extract only a subset of information.
+This simple approach places no restriction on the type of document that can be processed when only a basic extraction is needed.
+
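+The header information is reduced to title, authors, publication date and DOI.
+The body is not formatted as structured text, but as plain paragraphs: the first holds the unclassified header text and the second holds the body text, as in the following example: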
+
+```xml
+<body>
+    <div
+        xmlns="http://www.tei-c.org/ns/1.0">
+        <p>ONCOLOGY REPORTS 49: 10, 2023 CORRIGENDUM</p>
+        <p>Oncol Rep 42: 509-520, 2019; DOI: 10.3892/or.2019.7175 Subsequently to the publication of the above article, and a Corrigendum that has already been published with the intention of showing corrected versions of Figs. 1 and 8 (DOI: 10.3892/or.2022.8348; published online on June 14, 2022), the authors have belatedly realized that the revisions made to Fig. 8 necessitated changes that should have been introduced into Fig. 9, although these were not attended to in the first corrigendum. Essentially, Fig. 8 was revised as the cell apoptosis and cell proliferation assays therein were poorly presented, which made the interpretation of the data difficult; Fig. 9 showed the fractions of apoptotic cells in the SKM-1 and THP-1 cell lines with lncENST00000444102 overexpression as this pertained to Fig. 8. A revised version of Fig. 9, presenting the analysis of the data shown in the revised version of Fig. 8, is shown opposite. In addition to the revision of Fig. 9, the sentence starting on p. 517, left-hand column, line 12 ["The flow cytometric apoptosis assay revealed that lncENST00000444102 overexpression promoted tumor cells to undergo apoptosis compared to control cells (P&lt;0.001, Fig. 9)"] should be replaced with the following text, to reflect the change in the level of statistical significance: 'The flow cytometric apoptosis assay revealed that lncENST00000444102 overexpression promoted tumor cells to undergo apoptosis compared to control cells (P&lt;0.01, Fig. 9)". Note that the revisions made to Figs. 8 and 9 in this paper have not had a major impact on the reported results, and do not affect the overall conclusions reported in the study. All the authors agree to the publication of this corrigendum. The authors are grateful to the Editor of Oncology Reports for allowing them the opportunity to publish this additional Corrigendum; furthermore, they apologize for any inconvenience caused to the readership of the Journal. Figure 9. 
The fraction of apoptotic cells in the SKM-1 and THP-1 cell lines with lncENST00000444102 overexpression. Data are presented as the means ± standard deviation. ** P&lt;0.01 vs. the controls. This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.</p>
+    </div>
+</body>
+```
+
+## Training the specialised flavor models  
+
+The training data for the flavor models `article/light` and `article/light-ref` follows the same structure as for the standard models. In other words, the annotated training data for, e.g., the lightweight segmentation model with references for articles follows the same guidelines as the standard Grobid segmentation model.
+The Grobid parser automatically selects the right subset of labels to include.
+However, this is left to the discretion of the user: for example, a flavor `sdo/ietf` for parsing IETF standards documents can follow its own specific guidelines.
+
+To train the specialised models, the same procedure as for the standard models is used, but the flavor is indicated in the training command, e.g. to train the segmentation model for the flavor `article/light`:
+
+```shell
+./gradlew train_segmentation_article_light
+```
+
+or the header model for the flavor `article/light-ref`: 
+
+```shell
+./gradlew train_header_article_light_ref
+```
 
 ## Benchmarking
 
@@ -62,21 +104,21 @@ The benchmarks results are listed here with links to the full reports.
 
 ### Article lightweight structure
 
-| Corpus          | Header avg. f1* | Full report                                                                     | 
-|-----------------|-----------------|---------------------------------------------------------------------------------|
-| Bioxiv          | 89.4            | [benchmaking-bioxiv.md](benchmarks/flavors/article_light/benchmaking-bioxiv.md) |
-| PMC_sample_1943 | 95.71           | [benchmaking-pmc.md](benchmarks/flavors/article_light/benchmaking-pmc.md)       |
-| PLOS_1000       | 99.37           | [benchmaking-plos.md](benchmarks/flavors/article_light/benchmaking-plos.md)     |
-| eLife_984       | 88.73           | [benchmaking-elife.md](benchmarks/flavors/article_light/benchmaking-elife.md)   |
+| Corpus          | Header avg. f1* | Full report                                                                       | 
+|-----------------|-----------------|-----------------------------------------------------------------------------------|
+| bioRxiv         | 89.4            | [benchmaking-biorxiv.md](benchmarks/flavors/article_light/benchmaking-biorxiv.md) |
+| PMC_sample_1943 | 95.71           | [benchmaking-pmc.md](benchmarks/flavors/article_light/benchmaking-pmc.md)         |
+| PLOS_1000       | 99.37           | [benchmaking-plos.md](benchmarks/flavors/article_light/benchmaking-plos.md)       |
+| eLife_984       | 88.73           | [benchmaking-elife.md](benchmarks/flavors/article_light/benchmaking-elife.md)     |
 
 ### Article lightweight structure with references
 
-| Corpus          | Header avg. f1* | Citations avg. f1+ | Full report                                                                         | 
-|-----------------|-----------------|--------------------|-------------------------------------------------------------------------------------|
-| Bioxiv          | 89.79           | 56.31              | [benchmaking-bioxiv.md](benchmarks/flavors/article_light_ref/benchmaking-bioxiv.md) |
-| PMC_sample_1943 | 95.74           | 58.78              | [benchmaking-pmc.md](benchmarks/flavors/article_light_ref/benchmaking-pmc.md)       |
-| PLOS_1000       | 99.52           | 48.04              | [benchmaking-plos.md](benchmarks/flavors/article_light_ref/benchmaking-plos.md)     |
-| eLife_984       | 91.35           | 76.14              | [benchmaking-elife.md](benchmarks/flavors/article_light_ref/benchmaking-elife.md)   |
+| Corpus          | Header avg. f1* | Citations avg. f1+ | Full report                                                                           | 
+|-----------------|-----------------|--------------------|---------------------------------------------------------------------------------------|
+| bioRxiv         | 89.79           | 56.31              | [benchmaking-biorxiv.md](benchmarks/flavors/article_light_ref/benchmaking-biorxiv.md) |
+| PMC_sample_1943 | 95.74           | 58.78              | [benchmaking-pmc.md](benchmarks/flavors/article_light_ref/benchmaking-pmc.md)         |
+| PLOS_1000       | 99.52           | 48.04              | [benchmaking-plos.md](benchmarks/flavors/article_light_ref/benchmaking-plos.md)       |
+| eLife_984       | 91.35           | 76.14              | [benchmaking-elife.md](benchmarks/flavors/article_light_ref/benchmaking-elife.md)     |
 
 (*) avg. micro F1 Ratcliff/Obershelp@0.95
 

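Note on the `Ratcliff/Obershelp@0.95` footnote in the tables above: a predicted field counts as correct when its Ratcliff/Obershelp similarity to the expected value is at least 0.95. The sketch below illustrates the idea with Python's standard `difflib.SequenceMatcher`, which implements a similar gestalt pattern matching ratio. It is only an illustration: Grobid's own evaluation code may differ in detail, and the function name `soft_match` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def soft_match(expected: str, predicted: str, threshold: float = 0.95) -> bool:
    """Return True when the similarity ratio reaches the threshold.

    SequenceMatcher.ratio() returns 2*M/T, where M is the number of matched
    characters and T is the total length of both strings.
    """
    ratio = SequenceMatcher(None, expected, predicted).ratio()
    return ratio >= threshold


# Hypothetical title pair, not taken from the benchmark corpora.
expected = "Deep learning for document segmentation"
predicted = "Deep learning for document segmentation."

# A trailing period is tolerated at the 0.95 threshold,
# while an unrelated string is rejected.
print(soft_match(expected, predicted))
print(soft_match(expected, "Completely different title"))
```

Under this kind of matching, small punctuation or spacing differences do not count as errors, which is why the Ratcliff/Obershelp scores in the full reports sit between the strict and the Levenshtein scores.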
diff --git a/doc/benchmarks/flavors/article_light/benchmaking-biorxiv.md b/doc/benchmarks/flavors/article_light/benchmaking-biorxiv.md
@@ -0,0 +1,80 @@
+
+## Header metadata 
+
+Evaluation on 1996 random PDF files out of 1998 PDF (ratio 1.0).
+
+#### Strict Matching (exact matches)
+
+**Field-level results**
+
+| label            |  precision |   recall  |     f1     | support |
+|---               |---         |---        |---         |---      |
+| authors | 82.99 | 81.45 | 82.22 | 1995 |
+| first_author | 96.32 | 94.63 | 95.47 | 1993 |
+| title | 78.19 | 73.65 | 75.85 | 1996 |
+|                  |            |           |            |         |
+| **all fields (micro avg.)** | **85.94** | **83.24** | **84.57** | 5984 |
+| all fields (macro avg.) | 85.84 | 83.24 | 84.51 | 5984 |
+
+
+
+#### Soft Matching (ignoring punctuation, case and space characters mismatches)
+
+**Field-level results**
+
+| label            |  precision |   recall  |     f1     | support |
+|---               |---         |---        |---         |---      |
+| authors | 83.55 | 82.01 | 82.77 | 1995 |
+| first_author | 96.63 | 94.93 | 95.77 | 1993 |
+| title | 80.64 | 75.95 | 78.22 | 1996 |
+|                  |            |           |            |         |
+| **all fields (micro avg.)** | **87.03** | **84.29** | **85.64** | 5984 |
+| all fields (macro avg.) | 86.94 | 84.3 | 85.59 | 5984 |
+
+
+
+#### Levenshtein Matching (Minimum Levenshtein distance at 0.8)
+
+**Field-level results**
+
+| label            |  precision |   recall  |     f1     | support |
+|---               |---         |---        |---         |---      |
+| authors | 91.57 | 89.87 | 90.72 | 1995 |
+| first_author | 96.78 | 95.08 | 95.93 | 1993 |
+| title | 92.13 | 86.77 | 89.37 | 1996 |
+|                  |            |           |            |         |
+| **all fields (micro avg.)** | **93.51** | **90.57** | **92.02** | 5984 |
+| all fields (macro avg.) | 93.49 | 90.58 | 92 | 5984 |
+
+
+
+#### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95)
+
+**Field-level results**
+
+| label            |  precision |   recall  |     f1     | support |
+|---               |---         |---        |---         |---      |
+| authors | 87.59 | 85.96 | 86.77 | 1995 |
+| first_author | 96.32 | 94.63 | 95.47 | 1993 |
+| title | 88.35 | 83.22 | 85.71 | 1996 |
+|                  |            |           |            |         |
+| **all fields (micro avg.)** | **90.79** | **87.93** | **89.34** | 5984 |
+| all fields (macro avg.) | 90.75 | 87.94 | 89.32 | 5984 |
+
+
+#### Instance-level results
+
+```
+Total expected instances: 	1996
+Total correct instances: 	1280 (strict) 
+Total correct instances: 	1313 (soft) 
+Total correct instances: 	1615 (Levenshtein) 
+Total correct instances: 	1497 (ObservedRatcliffObershelp) 
+
+Instance-level recall:	64.13	(strict) 
+Instance-level recall:	65.78	(soft) 
+Instance-level recall:	80.91	(Levenshtein) 
+Instance-level recall:	75	(RatcliffObershelp) 
+```
+
+Evaluation metrics produced in 13.666 seconds
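Note on reading the report above: the figures can be cross-checked with simple arithmetic. The sketch below assumes that F1 is the harmonic mean of precision and recall, that the macro-averaged F1 is the plain mean of the per-field F1 values, and that instance-level recall is the share of correct instances among the expected ones. These assumptions appear consistent with the strict-matching numbers reported here, but the helper names are hypothetical and this is not Grobid's evaluation code.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_avg(values: list[float]) -> float:
    """Unweighted mean over the per-field scores."""
    return sum(values) / len(values)


def instance_recall(correct: int, expected: int) -> float:
    """Percentage of documents for which all evaluated fields are correct."""
    return 100.0 * correct / expected


# Strict-matching figures copied from the report above.
strict_micro_f1 = round(f1(85.94, 83.24), 2)                   # reported: 84.57
strict_macro_f1 = round(macro_avg([82.22, 95.47, 75.85]), 2)   # reported: 84.51
strict_instance_recall = round(instance_recall(1280, 1996), 2)  # reported: 64.13

print(strict_micro_f1, strict_macro_f1, strict_instance_recall)
```

Small differences in the last digit can occur when recomputing averages from the rounded per-field values, as the report itself is computed from unrounded counts.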
diff --git a/doc/benchmarks/flavors/article_light/benchmaking-bioxiv.md b/doc/benchmarks/flavors/article_light/benchmaking-bioxiv.md