This project provides a dataset for fine-tuning large language models (LLMs) in Wolof. A Python script builds a JSONLines (`.jsonl`) file from a list of words, retrieving detailed entries from a Wolof-French dictionary API.
- Input File: The script reads a list of words from a text file (`mots.txt`), one word per line.
- API Requests: For each word, the script calls an API to fetch JSON data containing definitions, examples, and etymology.
- Data Transformation: The retrieved data is transformed into a structured format and saved to a JSONLines file (`dataset.jsonl`).
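The transformation step above might look like the following sketch. The API's exact response schema is not documented here, so the field names (`definitions`, `examples`, `etymology`) and the record layout are assumptions for illustration only:

```python
import json

def to_record(word, api_data):
    """Flatten one (assumed) API response into a training record."""
    return {
        "word": word,
        "definitions": api_data.get("definitions", []),
        "examples": api_data.get("examples", []),
        "etymology": api_data.get("etymology", ""),
    }

def write_jsonl(records, path="dataset.jsonl"):
    """Write one JSON object per line (the JSONLines format)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps Wolof and French characters readable
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Writing each record on its own line is what makes the file easy to stream during training: consumers can read it line by line without parsing the whole file.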
The data is sourced from the "Corpus Oraux du LLACAN," which provides comprehensive Wolof-French dictionary data. You can access the API for this dictionary at:
- API Endpoint: https://corporan.huma-num.fr
- Automated Data Extraction: Fetches data from the Wolof-French dictionary API.
- Structured Output: Converts data into JSONLines format for easy use in model training.
- Rate Limiting: Includes a delay between API calls to manage rate limits.
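The rate-limited fetch loop can be sketched as below. The actual lookup path and query parameters of the API are not documented here, so `fetch` is left as a caller-supplied function (in the real script it would wrap a `requests.get` call against https://corporan.huma-num.fr); the one-second delay is likewise an assumed default:

```python
import time

def fetch_with_delay(words, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(word) for each word, pausing `delay` seconds
    between calls to stay under the API's rate limit."""
    results = {}
    for i, word in enumerate(words):
        if i:  # no pause needed before the first request
            sleep(delay)
        results[word] = fetch(word)
    return results
```

Injecting `sleep` as a parameter keeps the pacing logic testable without actually waiting.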
- Install Dependencies: Ensure you have Python 3.x and install the required library: `pip install requests`
- Prepare Input File: Create a `mots.txt` file with one word per line.
- Run the Script: Execute the Python script to generate the dataset: `python script.py`
This project is licensed under the MIT License.