You can review all the steps in the notebook, or run the notebook on Kaggle if you want to try it out yourself.
This tokenizer was trained on more than 3.2 million rows of data.
- To learn more about the training data, check out the dataset page on 🤗 Datasets.
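As a rough sketch, training a tokenizer like this with the 🤗 `tokenizers` library could look like the snippet below. The corpus, vocabulary size, and special tokens here are placeholders for illustration, not the actual values used in the notebook:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a BPE tokenizer from scratch (placeholder configuration)
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # placeholder; tune for your corpus
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"],
)

# Stand-in for the real 3.2M-row dataset: any iterator of strings works,
# so you can stream rows from a 🤗 dataset instead of loading them all.
corpus = [
    "this is a tiny example corpus",
    "a real run would iterate over millions of rows",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("tiny example")
print(encoding.tokens)
```

In practice you would replace `corpus` with a generator over the dataset's text column and then save the result with `tokenizer.save("tokenizer.json")`.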