Add Wikipedia Persian Dataset #3629

Merged: 1 commit, Aug 3, 2023

@pourmand1376 (Contributor) commented Aug 2, 2023

Currently, the Open-Assistant model doesn't support Farsi. This is a text-only dataset from which a model can learn Farsi (Persian).

One of my friends fine-tuned LLaMA on this dataset, and it could understand Farsi grammar and word usage very well. If the Open-Assistant team wants to add support for Farsi, this should be the first step.

I have converted the dataset to the standard format described here and uploaded it to my Hugging Face account.

@olliestanley olliestanley merged commit 65f5c2b into LAION-AI:main Aug 3, 2023
olliestanley added a commit that referenced this pull request Aug 3, 2023
The level of importance of this data is less than Wikipedia. So, I think
[this pull
request](#3629) should be
merged first.

I have uploaded the data to
[huggingface](https://huggingface.co/datasets/pourmand1376/isna-news)
according to Open-assistant's standard. So, it shouldn't need any
processing.

---------

Co-authored-by: Oliver Stanley <olivergestanley@gmail.com>
@somerandomguyontheweb

Hi @pourmand1376, sorry for a slightly off-topic question: could you please share any details on how your friend managed to fine-tune LLaMA on a text-only dataset, without instructions? I'm interested in doing the same thing with Belarusian Wikipedia, but so far I've only seen tutorials on how to instruct-tune LLaMA, and Wikipedia articles as such don't contain clearly delimited prompts and responses. Could you please briefly describe the approach?

Thanks in advance for any comments.
