This project provides a dataset for fine-tuning large language models (LLMs) in Wolof. A Python script builds a JSONLines (`.jsonl`) file from a list of words, retrieving detailed entries from a Wolof-French dictionary API.
- Input File: The script reads a list of words from a text file (`mots.txt`), one word per line.
- API Requests: For each word, the script calls an API to fetch JSON data containing definitions, examples, and etymology.
- Data Transformation: The retrieved data is transformed into a structured format and saved to a JSONLines file (`dataset.jsonl`).
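The transformation step above might look like the following sketch. The API's exact response schema is not documented here, so the field names (`definitions`, `examples`, `etymology`) and the record layout are assumptions for illustration only:

```python
import json

def to_record(word, api_data):
    """Flatten one (assumed) API response into a training record."""
    return {
        "word": word,
        "definitions": api_data.get("definitions", []),
        "examples": api_data.get("examples", []),
        "etymology": api_data.get("etymology", ""),
    }

def write_jsonl(records, path="dataset.jsonl"):
    """Write one JSON object per line (the JSONLines format)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps Wolof and French characters readable
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Writing each record on its own line is what makes the file easy to stream during training: consumers can read it line by line without parsing the whole file.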
The data is sourced from the "Corpus Oraux du LLACAN," which provides comprehensive Wolof-French dictionary data. You can access the API for this dictionary at:
- API Endpoint: https://corporan.huma-num.fr
- Automated Data Extraction: Fetches data from the Wolof-French dictionary API.
- Structured Output: Converts data into JSONLines format for easy use in model training.
- Rate Limiting: Includes a delay between API calls to manage rate limits.
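The rate-limited fetch loop can be sketched as below. The actual lookup path and query parameters of the API are not documented here, so `fetch` is left as a caller-supplied function (in the real script it would wrap a `requests.get` call against https://corporan.huma-num.fr); the one-second delay is likewise an assumed default:

```python
import time

def fetch_with_delay(words, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(word) for each word, pausing `delay` seconds
    between calls to stay under the API's rate limit."""
    results = {}
    for i, word in enumerate(words):
        if i:  # no pause needed before the first request
            sleep(delay)
        results[word] = fetch(word)
    return results
```

Injecting `sleep` as a parameter keeps the pacing logic testable without actually waiting.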
- Install Dependencies: Ensure you have Python 3.x and install the required library: `pip install requests`
- Prepare Input File: Create a `mots.txt` file with one word per line.
- Run the Script: Execute the Python script to generate the dataset: `python script.py`
This project is licensed under the MIT License.