
# Threaded KMeans - Jython

## Introduction

Jython is the Java implementation of Python. One advantage of Jython over CPython is that it has no Global Interpreter Lock (GIL), so it can fully exploit multiprocessor systems. Threading in Jython is super easy; just follow these steps (a short sketch follows the list):

  1. Import the Thread class from the threading module
  2. Subclass Thread
  3. Override Thread's run() method
  4. Create an instance of the class
  5. Call start()
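
A minimal sketch of those five steps (the Worker class and its toy workload are placeholders, not code from this repository):

```python
from threading import Thread

class Worker(Thread):
    def __init__(self, data):
        Thread.__init__(self)
        self.data = data
        self.result = None

    def run(self):
        # run() is what executes when start() is called; put the thread's work here
        self.result = sum(self.data)

# create, start, and wait for a few workers
workers = [Worker(range(i * 100, (i + 1) * 100)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print([w.result for w in workers])
```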

Congratulations, you are now multithreading in Jython! Admittedly, there is sometimes more to it than that, but that is the basic gist. But seriously, threading in Jython is nice. I've had instances where I've cut run times of programs from 30+ minutes to under 2 minutes. (One specific example involved searching through millions of items and then logging attributes to a database.)

One drawback of Jython is that most packages are made for CPython. In particular, there is no pandas, numpy, or scikit-learn. However, in my experience working on very specialized problems, there is often a need to develop a large portion of your own algorithms from scratch. Usually this arises because of some special case or because general algorithms do not give the control needed.

## Basics of the Program

When devising a multithreaded program, one of the most challenging parts is deciding how to divide the work among threads; in other words, finding the best way to break the program into smaller, independent pieces.

The part of KMeans that takes the most time is evaluating all the points against the new centroids and then assigning each point to its new cluster. One key observation is that evaluating one point against all the centroids has no bearing on evaluating a different point against all the centroids. In other words, evaluating points against centroids can be done independently.
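
As a concrete illustration, assigning a single point only needs that point and the centroids. The helper names euclidean and nearest_centroid below are illustrative, not necessarily the functions used in this repository:

```python
from math import sqrt

def euclidean(a, b):
    # straight-line distance between two points of the same dimension
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(point, centroids):
    # each point is evaluated against the centroids independently,
    # which is what makes this step easy to split across threads
    distances = [euclidean(point, c) for c in centroids]
    return distances.index(min(distances))
```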

The second challenging part of multithreading is deciding how to merge the threads' results back together. Each thread in our threaded KMeans assigns its points to the clusters they belong to. Once all the threads have finished, all we have to do is merge the clusters. Once the clusters are merged, we calculate new centroids and keep repeating until the centroids stop moving or we reach the maximum number of iterations.
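
A rough sketch of that merge-and-update step, assuming each thread reports its assignments as a dict of cluster index to points (the exact data structures in the repository may differ):

```python
def merge_clusters(per_thread_assignments, k):
    # per_thread_assignments: one {cluster_index: [points]} dict per thread
    merged = dict((i, []) for i in range(k))
    for assignments in per_thread_assignments:
        for i, pts in assignments.items():
            merged[i].extend(pts)
    return merged

def recompute_centroids(clusters):
    # new centroid = mean of the points assigned to the cluster;
    # assumes every cluster received at least one point
    centroids = []
    for pts in clusters.values():
        dims = zip(*pts)  # group coordinates by dimension
        centroids.append([sum(d) / float(len(pts)) for d in dims])
    return centroids
```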

## Code Highlights

We start by importing the Thread class from the threading module. We will also use the square root function from math, so import that as well. We also import random, since we use random.sample to pick the starting centroids.
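
Those imports look roughly like this:

```python
from threading import Thread
from math import sqrt
import random
```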

The ComputeDistances class subclasses Thread. This class is a thread, and it is the workhorse of the KMeans algorithm. One way to use the Thread class is to override its run() method. When an instance of ComputeDistances calls start(), the run() method begins executing in the new thread. Notice that it is not start() that is overridden, but run(). Inside run() we define the work we want done; I usually reserve this space for calls to other functions, as opposed to writing everything inside run(), to keep the code easier to follow. Each ComputeDistances instance takes a portion of the global list points.

There is usually a point at which adding more threads doesn't make the program any faster, and I wish I had a fail-safe way of finding the magic number of threads that maximizes speed. Trial and error seems to work best, as the number will differ (often wildly) between systems. (I've used anywhere between 4 and 50 threads.) The number of threads in the program does not depend on the number of clusters desired.
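
A sketch of how such a class might look, assuming the euclidean helper above and the global points list (the repository's actual ComputeDistances may be organized differently):

```python
class ComputeDistances(Thread):
    def __init__(self, start_index, end_index, centroids):
        Thread.__init__(self)
        # this thread handles points[start_index:end_index] from the global list
        self.start_index = start_index
        self.end_index = end_index
        self.centroids = centroids
        self.assignments = {}  # cluster index -> list of points

    def run(self):
        # start() causes run() to execute in the new thread;
        # keep run() thin and delegate the real work to another method
        self.assign_points()

    def assign_points(self):
        for point in points[self.start_index:self.end_index]:
            distances = [euclidean(point, c) for c in self.centroids]
            nearest = distances.index(min(distances))
            self.assignments.setdefault(nearest, []).append(point)
```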

One thing to note is that each thread is accessing the global list points. This might cause some synchronization across the threads and may add unintended overhead. Another option would be to pass a copy of the points to each thread, but this has the drawback of using more memory.

With the previous paragraph in mind, version 2 creates a global list of points that is partitioned into sublists, one per thread. This cuts the run time in half, so the first version definitely incurs some synchronization cost from accessing a global variable. For version 2, the number of threads that makes the biggest difference in run time is 12; on another system the best number will more than likely be different. Adding more threads beyond that actually starts increasing the run time, but using fewer threads increases it as well.
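
One way version 2's partitioning could look, as a sketch that splits the points into roughly equal sublists (the repository's exact code may differ):

```python
def partition(points, num_threads):
    # split the points into num_threads roughly equal sublists,
    # one per thread, so threads do not share a list object
    size = len(points) // num_threads
    sublists = []
    for i in range(num_threads):
        start = i * size
        # the last sublist picks up any remainder
        end = (i + 1) * size if i < num_threads - 1 else len(points)
        sublists.append(points[start:end])
    return sublists
```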

## Updated Metrics

The first version runs a bit slower than v2. Increasing the thread count to 40 for v2 gets the time under 0.012 seconds, while increasing it to 80 gets under 0.009 seconds. Thus, beyond a certain point, adding more threads only speeds up the process by a small amount. We also run into problems when partitioning the data into more sets; we should probably add better error checking to the ComputeDistances class.