Add data documentation
vmenger committed Jul 4, 2024
1 parent 2c3bc95 commit 4b3bb43
Showing 3 changed files with 27 additions and 0 deletions.
1 change: 1 addition & 0 deletions data/README.md
@@ -0,0 +1 @@
This folder contains some open data files. See https://clinlp.readthedocs.io/en/latest/data.html for more information.
25 changes: 25 additions & 0 deletions docs/source/data.md
@@ -0,0 +1,25 @@
# Data

The `clinlp` repository contains some open data files with real (or semi-real) examples. Some of these are used by `clinlp` (for example in the tests), but they are also available for others to use.

The files are located at: https://github.com/umcu/clinlp/tree/main/data

## `tokenizer_cases.json`

Some cases for testing tokenizers, collected during development of `clinlp`, often based on real examples.

## `sentencizer_cases.json`

Some cases for testing sentencizers, collected during development of `clinlp`, often based on real examples.
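
The exact layout of `tokenizer_cases.json` and `sentencizer_cases.json` is not described here, so a cautious first step is to inspect a file before writing code against it. The following is a minimal sketch using only the Python standard library; `summarize_json` is a hypothetical helper (not part of `clinlp`), and it makes no assumption about the schema beyond the file being valid JSON:

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return the top-level JSON type and its size, without assuming a schema.

    Hypothetical helper for illustration; not part of clinlp.
    """
    with Path(path).open(encoding="utf-8") as f:
        content = json.load(f)

    # Only lists and dicts have a meaningful number of top-level items.
    size = len(content) if isinstance(content, (list, dict)) else None
    return type(content).__name__, size


# Example call (the path assumes the repository root as working directory):
# summarize_json("data/tokenizer_cases.json")
```

The returned pair tells you whether the file is a list of cases or a keyed object, and how many top-level items it holds, which is usually enough to decide how to iterate over it.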

## `qualifier_cases.json`

Some cases for testing qualifier detectors, collected during development of clinlp, often based on real examples. Each doc contains exactly one entity, which makes it easier for our regression tests to mark skips.

You can load this file into an `InfoExtractionDataset` for further evaluation using:

```python
from clinlp.data import InfoExtractionDataset

dataset = InfoExtractionDataset.from_json("data/qualifier_cases.json")
```
1 change: 1 addition & 0 deletions docs/source/index.md
@@ -13,6 +13,7 @@ Introduction <introduction>
Installation <installation>
Getting started <getting_started>
Roadmap <roadmap>
Data <data>
Citing <citing>
```
