A word2vec implementation with CUDA & Python
A simple word2vec implementation written in Python using CUDA. Requires a CUDA-capable GPU (i.e. an NVIDIA GPU).
Features implemented:
- negative sampling
- skip-gram
- basic data loading and vocabulary handling
- for negative sampling, biasing the word frequency counts with a given exponent, e.g. 0.75, as in the original paper (see the sketch after this list)
- given a maximum context window size C, the actual context window size is a random R with R ~ Unif(1, C)
- basic gradient descent with a linearly decreasing learning rate
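To make the negative-sampling, context-window and learning-rate items above concrete, here is a minimal NumPy sketch of how the biased sampling distribution, the random window size and a linearly decaying learning rate could be computed. The function names and exact formulas are illustrative only, not the actual myw2v code.

```python
import numpy as np

def negative_sampling_distribution(word_counts, exponent=0.75):
    """Bias raw word frequency counts with the given exponent (0.75 in the
    original word2vec paper) and normalise into a sampling distribution."""
    biased = np.asarray(word_counts, dtype=np.float64) ** exponent
    return biased / biased.sum()

def sample_window_size(max_window, rng):
    """Draw the actual context window size R ~ Unif(1, C) for one center word."""
    return rng.integers(1, max_window + 1)

def linear_learning_rate(initial_lr, step, total_steps, min_lr=1e-4):
    """Linearly decrease the learning rate over the course of training."""
    return max(min_lr, initial_lr * (1.0 - step / total_steps))

rng = np.random.default_rng(0)
probs = negative_sampling_distribution([100, 10, 1])
negatives = rng.choice(3, size=5, p=probs)   # sample 5 negative word indices
window = sample_window_size(5, rng)          # e.g. 3
lr = linear_learning_rate(0.025, step=500, total_steps=10_000)
```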
Currently all data is loaded into GPU memory first, after which the model is trained. Afterwards the first weight matrix is written to disk in plain text format (called "word2vec format" in gensim).
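The plain text output follows gensim's "word2vec format": a header line with the vocabulary size and vector dimension, then one line per word with its space-separated vector components. A minimal sketch of writing an embedding matrix in that format (the helper name here is illustrative, not the actual myw2v code):

```python
import numpy as np

def save_word2vec_text(path, words, embeddings):
    """Write vectors in the plain text "word2vec format" understood by gensim:
    a header line "<vocab_size> <dim>", then one "<word> <v1> <v2> ..." line per word."""
    embeddings = np.asarray(embeddings)
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{embeddings.shape[0]} {embeddings.shape[1]}\n")
        for word, vec in zip(words, embeddings):
            f.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")

# e.g. save_word2vec_text("vectors", vocab_words, first_weight_matrix)
```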
To set up the environment and run the unit tests:
conda env create -f env_demo.yaml
conda activate myw2v-demo
pytest -s
Check out the unit tests (and the rest of the code) to see how things work and what's going on.
To run the demo:
conda env create -f env_demo.yaml
conda activate myw2v-demo
python mywv2/demo.py [-d <data_dir>]
The -d <data_dir> parameter is optional; the default is ./demo_data.
Requirements: about 2 GB of disk space; a CUDA-capable GPU (i.e. NVIDIA) with probably 5 GB or more of GPU RAM; and, for the gensim model, perhaps 8 GB of main RAM to read the data into memory first (TODO: perhaps optimise this part).
The demo code will first download and process a partial Wikipedia dump and then train a myw2v model on it. Note that this requires a few gigabytes of disk space and takes a while. Please see demo.py for details, or just run it: it will print the details and wait for a key press.
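The Wikipedia preprocessing is handled by demo.py itself, but it relies on gensim. A rough sketch of the kind of steps involved (the exact commands and code in the demo may differ):

```python
# Segment the raw Wikipedia dump into per-article JSON (one article per line);
# gensim ships a script for this (the demo's exact invocation may differ):
#
#   python -m gensim.scripts.segment_wiki \
#       -f enwiki-20210801-pages-articles-multistream6.xml-p958046p1483661.bz2 \
#       -o enwiki-20210801-pages-articles-multistream6.xml-p958046p1483661.json.gz
#
# The resulting .json.gz can then be turned into plain text sentences:
import gzip
import json

path = "enwiki-20210801-pages-articles-multistream6.xml-p958046p1483661.json.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        for section_text in article["section_texts"]:
            # tokenisation / sentence splitting into the txt/ directory would happen here
            pass
```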
After training, the demo code uses gensim to do the standard "word analogy" task on the trained vectors, to gauge accuracy.
In addition, the demo code will also train another word2vec model on the same data, using gensim, which is a well-known and reliable word2vec implementation. Afterwards the same analogy task is done on the gensim vectors as well.
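For reference, the gensim-based analogy evaluation looks roughly like this (a sketch using gensim's public API; the demo's actual parameters and file paths may differ):

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

# Load the trained vectors (plain text "word2vec format") and run the standard
# "questions-words" analogy test that ships with gensim.
vectors = KeyedVectors.load_word2vec_format("demo_data/vectors", binary=False)
score, sections = vectors.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"word analogy accuracy: {score:.1%}")
```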
Under the given data directory, the following will be created:
enwiki-20210801-pages-articles-multistream6.xml-p958046p1483661.bz2 # original Wikipedia dump, as downloaded
enwiki-20210801-pages-articles-multistream6.xml-p958046p1483661.json.gz # Wikipedia dump, as processed with gensim
txt/ # Wikipedia data processed into plain text sentences
vectors # output of myw2v, in plain text format
vectors_accuracy.json # gensim's word analogy task results for myw2v output (summary only)
vectors_accuracy_details.json # gensim's word analogy task results for myw2v output (full details)
vectors_params.json # myw2v parameters used for training
vectors_stats.json # myw2v statistics about source data
vectors_gensim # output of gensim, in plain text format
vectors_gensim_accuracy.json # gensim's word analogy task results for gensim output (summary only)
vectors_gensim_accuracy_details.json # gensim's word analogy task results for gensim output (full details)
vectors_gensim_params.json # gensim parameters used for training
vectors_gensim_stats.json # gensim statistics about source data
Here are the approximate results of the demo code above, when evaluated on the standard word analogy test:
- gensim: approx. 48-49%
- myw2v: approx. 44-45% (~ 10% lower)
The training times of the demo (on my PC) were:
- gensim: approx. 11.5 minutes
- myw2v: approx. 6.5 minutes (~ 40% faster)
See also the results of running another comparison of gensim vs myw2v.
See the blog post here.
- Parameterise all the model parameters properly
- "Streaming" data handling would be nice (as opposed to loading everything into memory all at once)
- Perhaps try to optimise the code more
Copyright 2021 Taneli Saastamoinen taneli.saastamoinen@gmail.com
See also COPYING, COPYING.LESSER
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this program. If not, see https://www.gnu.org/licenses/.