eksi-cluster

ekşi sözlük is the most popular Turkish social media website. It is organized like a dictionary: users create 'titles' and others write 'entries' under them, as in a thread. However, a title (topic) might be homonymous; for example, there is a footballer Harun Tekin and a musician Harun Tekin. Currently, users of the website report such titles to the moderators, who then separate hundreds of entries manually. We present eksi-cluster as a tool for automatically clustering the entries under such titles. Clustering is not only useful for this purpose, so feel free to play with the tool and maybe find new uses. It is meant to help ekşi sözlük moderators as a preprocessor for title separation, not to replace them.

Usage

main.py takes in 4 arguments:

Title name such as mit or 'im westen nichts neues'. Make sure to wrap the title in quotes if there are multiple words in it.
Number of clusters. If you think there are 2 distinct notions in a title, you might still go for 3 to weed out unrelated stuff.
First relevant page. Acceptable range: 1 to (last page - 1).
Last relevant page. Acceptable range: 2 to the last page.

python main.py mit 3 1 18
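
For illustration, the four positional arguments could be parsed and range-checked roughly as follows. This is a minimal sketch and not the actual contents of main.py: the function name parse_args, the argument names, and the rule that the first page must be smaller than the last page (inferred from the acceptable ranges listed above) are all assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the four positional arguments (hypothetical names, not from main.py)."""
    parser = argparse.ArgumentParser(
        description="Cluster the entries under an ekşi sözlük title."
    )
    parser.add_argument("title", help="title name; quote it if it has several words")
    parser.add_argument("n_clusters", type=int, help="number of clusters")
    parser.add_argument("first_page", type=int, help="first relevant page")
    parser.add_argument("last_page", type=int, help="last relevant page")
    args = parser.parse_args(argv)

    # Assumed validation: at least one cluster, and 1 <= first_page < last_page.
    if args.n_clusters < 1:
        parser.error("n_clusters must be at least 1")
    if args.first_page < 1 or args.last_page <= args.first_page:
        parser.error("pages must satisfy 1 <= first_page < last_page")
    return args


# Equivalent to running: python main.py mit 3 1 18
args = parse_args(["mit", "3", "1", "18"])
print(args.title, args.n_clusters, args.first_page, args.last_page)
```

Passing the argument list explicitly, as in the last lines, makes the parser easy to try out without touching the command line.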

A CSV file named after the title ('title name'.csv) will be saved in data/, containing the Turkish entries, the translated entries and the cluster labels.

test.py is useful for validating the performance of this tool. You can use titles that have already been separated to get an F1 score on that dataset. To do this, change the names of the titles in the script and the number of pages to be evaluated. The script supports 2 titles.

python test.py

A test.csv will be saved in data/ containing the Turkish entries, the translated entries, the real labels and the cluster labels.

How it Works

eksi-cluster is simple.

Translate the entries from Turkish to English using Google Translate (unofficial API, not meant to be used commercially)
Convert from text to vectors using OpenAI embeddings (need to have an API key for this)
Cluster the vectors using k-means.

We also tried embedding the Turkish text directly, without translating it first, and found that this approach does not work well.
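
The clustering step can be sketched in plain Python. The translation and embedding steps need network access and an OpenAI API key, so the toy 2-D vectors below stand in for real embeddings. The function names are hypothetical, and the farthest-first initialization is chosen here only to keep the example deterministic; the repository's own k-means setup may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iterations=50):
    """Plain k-means with farthest-first initialization (an assumption)."""
    # Start from the first vector, then repeatedly add the vector that is
    # farthest from all centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-ins for entry embeddings: two well-separated groups.
toy_embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],
]
print(kmeans(toy_embeddings, k=2))  # [0, 0, 0, 1, 1, 1]
```

Real embeddings have far more dimensions, but the assignment and update steps are the same.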

Testing

Method

50 entries are drawn from each of two titles that share the same name and have already been manually separated by ekşi sözlük admins. These entries are then merged into a single title, which is clustered into two clusters. Below are the results.
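
Cluster labels from k-means are arbitrary (cluster 0 may correspond to either title), so scoring needs a step that matches clusters to the true titles. The sketch below shows one way to do this for the two-title case: it tries both mappings and keeps the better macro-averaged F1. How test.py actually computes its F1 score is not stated here, so the averaging choice and the function names are assumptions.

```python
def f1_for_class(true_labels, predicted_labels, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_macro_f1(true_labels, cluster_labels):
    """Macro F1 for two clusters under the better of the two label mappings."""
    flipped = [1 - c for c in cluster_labels]
    scores = []
    for candidate in (cluster_labels, flipped):
        macro = (
            f1_for_class(true_labels, candidate, 0)
            + f1_for_class(true_labels, candidate, 1)
        ) / 2
        scores.append(macro)
    return max(scores)


# Made-up example: 4 entries per title, one entry placed in the wrong cluster.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
cluster_labels = [1, 1, 1, 0, 0, 0, 0, 0]
print(round(best_macro_f1(true_labels, cluster_labels), 3))  # 0.873
```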

Results

Topic 1	Topic 2	F1 Score
harun tekin (football player)	harun tekin (musician)	0.92
koray avcı (football player)	koray avcı (musician)	0.93
dart	dart (programming language)	0.85
suskunlar (series)	suskunlar (book)	0.94
camel (music group)	camel (cigarette)	0.93
aydın	aydın (city)	0.88

As can be seen from the results, the tool distinguishes between specific titles well, with F1 scores between 0.92 and 0.94. If one of the titles is more encompassing, such as dart (0.85) or aydın (0.88), the tool may fall short because of the variety of information under that title.
