Add data documentation

umcu · Jul 4, 2024 · 4b3bb43
1 parent 2c3bc95
commit 4b3bb43

Showing 3 changed files with 27 additions and 0 deletions.
diff --git a/data/README.md b/data/README.md
@@ -0,0 +1 @@
+This folder contains some open data files. See https://clinlp.readthedocs.io/en/latest/data.html for more information.
diff --git a/docs/source/data.md b/docs/source/data.md
@@ -0,0 +1,25 @@
+# Data
+
+The `clinlp` repository contains some open data files with real (or semi-real) examples. Some of these are used by `clinlp` itself (for example, in the tests), but they are also available for others to use.
+
+The files are located at: https://github.com/umcu/clinlp/tree/main/data
+
+## `tokenizer_cases.json`
+
+Cases for testing tokenizers, collected during the development of `clinlp` and often based on real examples.
+
+## `sentencizer_cases.json`
+
+Cases for testing sentencizers, collected during the development of `clinlp` and often based on real examples.
+
+## `qualifier_cases.json`
+
+Cases for testing qualifier detectors, collected during the development of `clinlp` and often based on real examples. Each document contains exactly one entity, which makes it easier to mark individual cases as skipped in our regression tests.
+
+You can load this file into an `InfoExtractionDataset` for further evaluation using:
+
+```python
+from clinlp.data import InfoExtractionDataset
+
+dataset = InfoExtractionDataset.from_json("data/qualifier_cases.json")
+```
diff --git a/docs/source/index.md b/docs/source/index.md
@@ -13,6 +13,7 @@ Introduction <introduction>
 Installation <installation>
 Getting started <getting_started>
 Roadmap <roadmap>
+Data <data>
 Citing <citing>
 ```