diff --git a/data/README.md b/data/README.md
new file mode 100644
index 0000000..59a0f01
--- /dev/null
+++ b/data/README.md
@@ -0,0 +1 @@
+This folder contains some open data files. See https://clinlp.readthedocs.io/en/latest/data.html for more information.
\ No newline at end of file
diff --git a/docs/source/data.md b/docs/source/data.md
new file mode 100644
index 0000000..d6d0d40
--- /dev/null
+++ b/docs/source/data.md
@@ -0,0 +1,25 @@
+# Data
+
+The `clinlp` repository contains some open data files with real (or semi-real) examples. Some of these are used by `clinlp` (for example in the tests), but they are also available for others to use.
+
+The files are located at: https://github.com/umcu/clinlp/tree/main/data
+
+## `tokenizer_cases.json`
+
+Some cases for testing tokenizers, collected during development of clinlp, often based on real examples.
+
+## `sentencizer_cases.json`
+
+Some cases for testing sentencizers, collected during development of clinlp, often based on real examples.
+
+## `qualifier_cases.json`
+
+Some cases for testing qualifier detectors, collected during development of clinlp, often based on real examples. Each doc contains exactly one entity, which makes it easier for our regression tests to mark skips.
+
+You can load this file to an `InfoExtractionDataset` for further evaluation using:
+
+```python
+from clinlp.data import InfoExtractionDataset
+
+dataset = InfoExtractionDataset.from_json("data/qualifier_cases.json")
+```
\ No newline at end of file
diff --git a/docs/source/index.md b/docs/source/index.md
index 71fdbf5..28aa8bc 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -13,6 +13,7 @@
 Introduction
 Installation
 Getting started
 Roadmap
+Data
 Citing
 ```