
K-means clustering of texts (survey answers) using word-embeddings, finding optimal elbow-point, and averaging multiple-word expressions.


elmurod1202/survey-clustering



K-means Clustering of Survey Answers

Answers to a survey in Spanish are represented using word embeddings and grouped using k-means clustering.
The project also handles multi-word expressions by removing stopwords and averaging the vectors of the remaining words.
The example is for Spanish, but it can easily be adapted to other languages.
The number of clusters is chosen by an optimal-curvature-finding algorithm (the elbow method).

Table of Contents
  1. About The Project
  2. How it works
  3. Usage
  4. License
  5. Acknowledgments

About The Project

Spanish words scattered in 2D space

This project is meant for anyone looking for a way to group or categorize words, or even multi-word expressions, by their meaning. Many tools and services can compute statistics or draw diagrams from data, but they mostly work with numbers. For words and texts they are less useful, since they offer no way to visualize them in 2D/3D space based on usage or meaning. This repository helps perform the following operations on texts:

  • Visualizing texts:
    • Visualizing single words using the word-embedding vectors of a language;
    • Visualizing multiple-word texts by averaging the vectors of the words they contain (stopwords are removed for better output quality);
  • Finding the optimal number of groups/clusters/categories to split words/texts into based on their meaning, using the Within-Cluster Sum of Squares (WCSS) to find where the curve levels off (the elbow method);
  • Grouping/clustering texts using the k-means clustering algorithm;
  • Visualizing grouped texts in different colors, using matplotlib.
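
The averaging step for multi-word texts can be sketched as follows. This is a minimal illustration and not the repository's own code: the function name `phrase_vector`, the whitespace tokenization, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


def phrase_vector(text: str,
                  embeddings: Dict[str, List[float]],
                  stopwords: Set[str]) -> Optional[List[float]]:
    """Average the vectors of the non-stopword tokens found in `embeddings`."""
    # Naive whitespace tokenization (an assumption; real text may need more care).
    tokens = [t for t in text.lower().split() if t not in stopwords]
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # only stopwords or out-of-vocabulary words
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy 2-dimensional embeddings, invented purely for illustration.
toy_embeddings = {
    "buen": [1.0, 0.0],
    "servicio": [0.0, 1.0],
    "precio": [0.5, 0.5],
}
toy_stopwords = {"el", "la", "de", "muy"}

averaged = phrase_vector("muy buen servicio", toy_embeddings, toy_stopwords)
print(averaged)  # [0.5, 0.5]
```

Texts made up only of stopwords or unknown words yield `None` here, so a caller would need to decide whether to drop them or handle them separately.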

(back to top)

Built With

Programming language used:

These are the major libraries used inside Python:

(back to top)

How it works

  • First, the code loads the list of words/texts from the given file, input/answers.txt (named "answers" here because the data were answers to a particular survey), and obtains a vector for each text. An example diagram would look like this:
Spanish words scattered in 2D space
  • Then, the code finds the optimal number of clusters to split the given texts into, using the elbow method. For our example it would look like this:
Elbow method example diagram
  • Lastly, the code categorizes the list of texts into groups by their meaning. The final result would look like this:
Clustered words in a scatter
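
The last two steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code in survey-clustering.py: the helper names `kmeans` and `elbow_k` are hypothetical, the deterministic farthest-first initialization is chosen only to keep the example reproducible, and the elbow is picked with one common heuristic (the point on the WCSS curve farthest from the straight line joining its endpoints), which may differ from the curvature-finding routine the project uses.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, max_iter: int = 100) -> Tuple[List[int], float]:
    """Lloyd's k-means; returns (labels, WCSS)."""
    # Farthest-first initialization: deterministic, chosen for the demo only.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))

    def assign() -> List[int]:
        # Label each point with the index of its nearest center.
        return [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]

    labels = assign()
    for _ in range(max_iter):
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append([sum(col) / len(members) for col in zip(*members)]
                               if members else centers[c])
        if new_centers == centers:
            break  # converged
        centers = new_centers
        labels = assign()
    wcss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wcss


def elbow_k(ks: List[int], wcss: List[float]) -> int:
    """Return the k whose (k, WCSS) point is farthest from the chord joining
    the first and last points of the curve, after scaling both axes to [0, 1]."""
    x_span = (ks[-1] - ks[0]) or 1
    y_span = (max(wcss) - min(wcss)) or 1.0
    xs = [(k - ks[0]) / x_span for k in ks]
    ys = [(w - min(wcss)) / y_span for w in wcss]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = math.hypot(dx, dy) or 1.0
    dists = [abs(dy * (x - xs[0]) - dx * (y - ys[0])) / norm for x, y in zip(xs, ys)]
    return ks[dists.index(max(dists))]


# Toy data: three well-separated groups of 2D points, invented for illustration.
blob = [(0, 0), (0, 1), (1, 0), (1, 1)]
points = (blob
          + [(x + 10, y + 10) for x, y in blob]
          + [(x, y + 10) for x, y in blob])

ks = list(range(1, 7))
wcss = [kmeans(points, k)[1] for k in ks]
best_k = elbow_k(ks, wcss)
labels, _ = kmeans(points, best_k)
print(best_k, labels)
```

On real survey answers the points would be the (averaged) embedding vectors rather than 2D toy coordinates, and a library implementation of k-means with several random restarts would usually be preferable to this hand-rolled loop.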

(back to top)

Usage

To use this code you need at least a basic understanding of how to run Python code, and a machine with Python installed. You also need to install the libraries mentioned above. There are two ways to run the code:

  1. Either clone the repo by running the command below, and run survey-clustering.py:

    git clone https://github.com/elmurod1202/survey-clustering.git
  2. Or download only the survey-clustering.py file (or survey-clustering-minimum.py if you want minimal working code without graphic visualisations) and make a few small changes, such as where to read the input files from and where to store the results.

IMPORTANT: This code uses a Spanish word-embedding vector file that is not included here because of its size. Please download the file into the src/ folder from the link: Spanish word vectors (3.4 GB)

Adapting for other languages:

This code is intended for Spanish, but it can be adapted to many other languages just by replacing two files in the src/ folder:

  • src/embeddings-l-model.vec : replace the Spanish word vectors file with a word-vector file for the target language;
  • src/spanish-stopwords.txt : replace the Spanish stopwords file with a stopwords file for the target language.
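
As a rough sketch of what swapping these two files involves, the snippet below reads both kinds of file using only the standard library. It assumes the .vec file is in the common word2vec/fastText text layout (a header line with the word count and dimension, then one word and its values per line) and that the stopwords file holds one word per line; neither layout is confirmed by this README, and the helper names `load_vec` and `load_stopwords` are hypothetical.

```python
import io
from typing import Dict, IO, List, Optional, Set


def load_vec(handle: IO[str], limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read word vectors from a text-format .vec stream (assumed layout)."""
    header = handle.readline().split()
    dim = int(header[1])  # header is assumed to be "<word count> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break  # stop early to save memory on very large files
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def load_stopwords(handle: IO[str]) -> Set[str]:
    """Read one stopword per line, ignoring blank lines."""
    return {line.strip().lower() for line in handle if line.strip()}


# In-memory stand-ins for the two files, invented for illustration.
sample_vec = io.StringIO("3 2\nhola 0.1 0.2\nmundo 0.3 0.4\ngracias 0.5 0.6\n")
sample_stop = io.StringIO("el\nla\n\nde\n")

embeddings = load_vec(sample_vec, limit=2)
stopwords = load_stopwords(sample_stop)
print(sorted(embeddings), sorted(stopwords))
```

With real files, the same functions could be given handles opened on the two paths above; the `limit` argument is there because loading a multi-gigabyte vector file in full may not fit in memory.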

(back to top)

License

Distributed under the GNU General Public License. See LICENSE.txt for more information.

(back to top)

Acknowledgments

Big shoutouts to Luis for bringing this problem to the table.

We are grateful for these resources and tutorials for making this repository possible:

(back to top)
