Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add two popular datasets for character level LM #254

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

entron
Copy link

@entron entron commented Apr 25, 2023

Added data preparation and example trainning config scripts for two popular datasets: text8 and enwik8.

@entron
Copy link
Author

entron commented Apr 25, 2023

Also tried out feeding the outputs of each layer to itself mutiple times in the 2nd commit.
For the shakespeare_char dataset, this actually gives better val at 1.4543 with only 1 layer and 1.8M parameters.
For bigger datasets such as text8, this also gives better results when the
number of parameters are the same. Haven't tested on GPT-2 yet.
The 2nd commit may be not so relavant though.

@karpathy
Copy link
Owner

That's nice, but prefer we keep n_layer_update separate

@entron
Copy link
Author

entron commented Apr 26, 2023

I have removed the 2nd commit.

@Andrei-Aksionov
Copy link

Maybe it's about time to have a separate .py file with the shared logic?
Because all prepare.py files for shakespeare and these two new datasets basically do the same thing.
I understand that it's sometimes better to have some code duplication for the sake of simplicity and easiness of understanding, but this is not the case (in my opinion).

I am opened to hear why I am wrong (again 😄 ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants