Adding dataset documentation
Rajashekar Chintalapati committed Aug 19, 2024
1 parent db83195 commit 43901fc
Showing 3 changed files with 10 additions and 2 deletions.
10 changes: 9 additions & 1 deletion README.rst
@@ -11,12 +11,20 @@ instate: predict the state of residence from last name


Using the Indian electoral rolls data (2017), we provide a Python package that takes the last name of a person and gives its distribution across states.
-This package can also predict the most spoken language in the state based on the last name.
+This package can also predict the spoken language of the person based on the last name.

Potential Use Cases
---------------------
India has 22 official languages. Serving such a diverse language base is a challenge for businesses and surveyors. When a business has access only to the last name, and no other data that would allow it to model a person's spoken language, the distribution of last names across states is the best signal available.

+Dataset
+---------
+Refer to `lastname_langs_india.csv.tar.gz <https://github.com/appeler/instate/blob/main/instate/data/lastname_langs_india.csv.tar.gz>`__ for the dataset used to predict or look up the spoken language based on the last name.
+
+Refer to `lastname_langs_india_top3.csv.tar.gz <https://github.com/appeler/instate/blob/main/instate/data/lastname_langs_india_top3.csv.tar.gz>`__ for the dataset used to predict the top-3 spoken languages based on the last name. An LSTM model has been trained on this dataset to predict the top-3 spoken languages.
+
+Refer to `notebooks <https://github.com/appeler/instate/tree/main/instate/notebooks>`__ for the notebooks that were used to prepare the above datasets and train the models.
+
Web UI
--------------
Streamlit App.: https://appeler-instate-streamlitstreamlit-app-e39m4c.streamlit.app/
Binary file added instate/data/lastname_langs_india_top3.csv.tar.gz
Binary file not shown.
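
Since the archive is not rendered in the diff, a reviewer may want to peek at it locally. The following is a minimal sketch using only the Python standard library. It assumes the archive holds a single CSV member with a header row; the commit does not state the column names, so none are assumed here, and `read_csv_from_targz` is a hypothetical helper, not part of the instate package.

```python
import csv
import io
import tarfile


def read_csv_from_targz(source, max_rows=5):
    """Return (header, first rows) of the first CSV member in a .tar.gz archive.

    `source` is either a filesystem path (str) or a binary file-like object.
    Hypothetical helper for inspection only; not part of the instate API.
    """
    # tarfile.open accepts a path via `name` or an open binary stream via `fileobj`.
    kwargs = {"name": source} if isinstance(source, str) else {"fileobj": source}
    with tarfile.open(mode="r:gz", **kwargs) as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".csv"):
                raw = archive.extractfile(member)
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text)
                header = next(reader, [])
                # Read at most `max_rows` data rows so large files stay cheap to preview.
                rows = [row for _, row in zip(range(max_rows), reader)]
                return header, rows
    raise FileNotFoundError("no .csv member found in archive")
```

From a checkout of the repository, calling `read_csv_from_targz("instate/data/lastname_langs_india_top3.csv.tar.gz")` should return the header and the first few rows, provided the assumption about the archive layout holds.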
2 changes: 1 addition & 1 deletion setup.py
@@ -62,7 +62,7 @@ def run_tests(self):

setup(
name="instate",
-version="0.1.6",
+version="0.1.7",
description="Instate: predict the state of residence from last name",
long_description=readme,
long_description_content_type = "text/x-rst",