Name2Gender

Using character sequences in first names to predict gender. This is a quick exploration of the problem; my Medium post explains why it is interesting: https://medium.com/@ellisbrown/name2gender-introduction-626d89378fb0.

I have implemented a Naïve-Bayes approach and a Char-RNN approach, each contained in its own subdirectory.

Naïve-Bayes /naive_bayes

In this approach, I defined features of first names (last two letters, count of vowels, etc.) and used them to learn the genders. I explain this in more detail in my blog post and in the /naive_bayes subdirectory.
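As a rough illustration of what such hand-defined features might look like, here is a minimal sketch. The function name `name_features` and the exact feature set are assumptions for illustration only; the last-two-letters and vowel-count features come from the description above, while the others are guesses at what "etc." could cover. The code in /naive_bayes may differ.

```python
VOWELS = set("aeiou")

def name_features(name):
    """Hypothetical feature extractor; the repository's real features may differ."""
    n = name.strip().lower()
    return {
        "last_two": n[-2:],                               # mentioned in the README
        "vowel_count": sum(ch in VOWELS for ch in n),     # mentioned in the README
        "last_letter": n[-1:],                            # assumed extra feature
        "first_letter": n[:1],                            # assumed extra feature
        "length": len(n),                                 # assumed extra feature
    }

if __name__ == "__main__":
    print(name_features("Emma"))
```

Feature dictionaries of this shape can be paired with labels and passed to a Naïve-Bayes trainer; for example, NLTK's `NaiveBayesClassifier.train` accepts a list of `(feature_dict, label)` pairs.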

Char-RNN /rnn

In this second approach, I feed the characters of a name one by one through a character-level recurrent neural network built in PyTorch, in the hope of learning the latent space of all character sequences that denote gender without having to define them a priori. I explain this in more detail in my blog post and in the /rnn subdirectory.
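To make "feeding characters one by one" concrete, the sketch below shows one common way to turn a name into a sequence of per-character one-hot vectors, which a character-level RNN could then consume one step at a time. This is plain Python rather than PyTorch, and the alphabet, the handling of unknown characters, and the names `one_hot` and `encode_name` are all assumptions; the encoding actually used in /rnn may be different.

```python
import string

# Assumed vocabulary: lowercase ASCII letters only. Real data spanning
# multiple cultures would likely need accents, hyphens, apostrophes, etc.
ALPHABET = string.ascii_lowercase
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot(ch):
    """Return a one-hot list for ch; all zeros if ch is outside ALPHABET."""
    vec = [0] * len(ALPHABET)
    idx = CHAR_TO_INDEX.get(ch)
    if idx is not None:
        vec[idx] = 1
    return vec

def encode_name(name):
    """Encode a name as a list of one-hot vectors, one per character."""
    return [one_hot(ch) for ch in name.lower()]

if __name__ == "__main__":
    encoded = encode_name("Ada")
    print(len(encoded), len(encoded[0]))
```

In a PyTorch setting, a nested list like this would typically be converted to a tensor and passed to the recurrent layer step by step, with the final hidden state used to predict the label.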

Dataset /data

I have aggregated multiple smaller datasets representing various cultures into one large dataset (~135k instances) of gender-labeled first names. See data/dataset.ipynb for details on how I pulled it together. Note: I did not spend much time pruning this dataset, so it is probably not particularly clean. I would greatly appreciate any PRs from anyone who has the time!

Acknowledgement

Below are a bunch of links I found useful: