Implementation of the algorithm to calculate the closest pair of points in a 2-dimensional plane in O(n log n) time, as described in *Algorithm Design* by Kleinberg & Tardos.
It is a divide and conquer implementation that relies on some geometric properties of the problem. By contrast, the brute-force version (which consists of calculating the distance from each point to every other point and taking the minimum) is O(n^2).
The implementation is done in different programming languages, using different parallelization techniques:
Language | Sequential | Multi-thread | Multi-process | OpenMP |
---|---|---|---|---|
C | yes | yes | yes | yes |
Python | yes | no | yes | no |
Scala | yes | yes | no | no |
We'll start by implementing the brute-force version, to be used as a test oracle to verify the correctness of the divide and conquer implementation.
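As an illustration, a minimal brute-force oracle could look like this in Python (a sketch; the names are not taken from the repository):

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python >= 3.8

def closest_pair_brute_force(points):
    """O(n^2) oracle: measure every pair and keep the closest one."""
    return min(combinations(points, 2), key=lambda pair: dist(*pair))
```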
Next, we implement the algorithm using the divide and conquer technique.
Unit testing is used to verify correctness of basic and edge cases and to guide development. Some of the test cases covered are:
- solution is in the left half
- solution is in the right half
- solution is a pair of points from different halves
- repeated points
- even number of points
- odd number of points
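For example, a unit test for the first case might look like this (a sketch, assuming a hypothetical `closest_pair` module and function returning the winning pair):

```python
from closest_pair import closest_pair  # hypothetical module under test

def test_solution_in_left_half():
    # The two leftmost points are far closer to each other than to the rest,
    # so the solution lies entirely in the left half.
    points = [(0, 0), (1, 0), (10, 0), (20, 0), (35, 0)]
    p, q = closest_pair(points)
    assert {p, q} == {(0, 0), (1, 0)}
```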
Using property-based testing, we can compare the divide and conquer version to the test oracle on thousands of randomly generated inputs.
Depending on the language, we use different frameworks to do this testing:
- Scala: ScalaTest
- Python: Hypothesis
- C: manual implementation
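For instance, with Hypothesis such a comparison might look like this (a sketch, assuming the same hypothetical module; it compares distances rather than pairs, for the reason explained in the section on stability below):

```python
from math import dist
from hypothesis import given, strategies as st

from closest_pair import closest_pair, closest_pair_brute_force  # hypothetical module

point = st.tuples(st.integers(-1000, 1000), st.integers(-1000, 1000))

@given(st.lists(point, min_size=2, unique=True))
def test_agrees_with_oracle(points):
    # Only the distance is a reliable invariant (see the note on stability below).
    assert dist(*closest_pair(points)) == dist(*closest_pair_brute_force(points))
```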
Depending on the language, we use a multi-thread and/or multi-process approach to implement the parallel version of the algorithm.
And again, we can use property-based testing to compare the sequential and parallel versions.
In order to compare the implementations across languages, we generate a file with random points that is fed to each implementation, ensuring that they all compute the same solution.
This section highlights some of the key elements needed to understand the implementation.
The input is split into two halves recursively, giving O(log n) levels of recursion, and the work done at each level is linear, O(n).
The way to achieve linear time at each level is to have the points sorted by coordinate y.
The points get sorted before starting the recursive algorithm and they remain sorted during the recursion as long as the input is split in a specific way. Basically, we need to keep two arrays of points, each of them sorted by a different coordinate, x and y. The points sorted by coordinate y need to keep a reference to their position in the array sorted by x.
Of course, the sorting algorithm must run in O(n log n), like mergesort.
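Here is a sketch of that split in Python (illustrative: `xrank`, the point's index in Px, is an assumed bookkeeping field, not necessarily how the repository stores it):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: int
    y: int
    xrank: int  # this point's index in Px (assumed bookkeeping field)

def split(Px, Py):
    """Split into left/right halves in O(n), preserving both sort orders."""
    mid = len(Px) // 2
    Qx, Rx = Px[:mid], Px[mid:]
    # One pass over Py keeps each half sorted by y without re-sorting.
    Qy = [p for p in Py if p.xrank < mid]
    Ry = [p for p in Py if p.xrank >= mid]
    return Qx, Qy, Rx, Ry
```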
The notation (variable names) used in the implementation is the one used in the book.
Although those names may not comply with the naming conventions of each language used, keeping the same notation across the different implementations makes the code easier to understand.
- P = array of points
- Px, Py = array of points sorted by coordinates x and y respectively
- Qx, Qy = left half of P sorted by coordinates x and y respectively
- Rx, Ry = right half of P sorted by coordinates x and y respectively
The parallel implementation is based on the Fork-Join model, which is well suited for divide and conquer algorithms.
Given that the algorithm to calculate the closest pair is CPU-bound, we need real parallelism and not just concurrency. Therefore, Python's GIL rules out multi-threading here, and only a multi-process implementation can take advantage of multiple cores.
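Here is a minimal fork-join sketch in Python using `concurrent.futures` (illustrative only: for brevity, each half is solved with the brute-force oracle, and the combine step that must also examine the strip around the dividing line is omitted):

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations
from math import dist

def closest_pair_brute_force(points):
    return min(combinations(points, 2), key=lambda pair: dist(*pair))

def parallel_closest(points):
    points = sorted(points)  # sort by x (tuples compare by x first)
    mid = len(points) // 2
    with ProcessPoolExecutor(max_workers=2) as pool:
        # Fork: one subproblem per process.
        left = pool.submit(closest_pair_brute_force, points[:mid])
        right = pool.submit(closest_pair_brute_force, points[mid:])
        # Join: block until both halves are solved.
        best = min(left.result(), right.result(), key=lambda pair: dist(*pair))
    return best  # a real implementation would also check the strip

if __name__ == "__main__":
    print(parallel_closest([(0, 0), (1, 1), (9, 9), (9, 10), (20, 0)]))
```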
On the other hand, a multi-process implementation on the JVM is not as straightforward as in Python or C, and therefore we opted for the multi-thread approach.
Here's a visual representation of the parallel algorithm in action (either multi-thread or multi-process):
The C implementation also includes parallelization with OpenMP. One of the advantages of OpenMP is the ability to decorate a sequential application with pragmas to turn it into a parallel one. As a result, `nlogn_sol.c` can be compiled to run in parallel or sequentially, depending on whether the compilation flag `-fopenmp` is added or not.
Property-based testing (PBT) is an excellent tool to catch edge cases and to get insights about the inner workings of an application.
When some of the tests failed, I realised that the solutions given by the quadratic algorithm are not stable, in the sense that the solution depends on the original order of the points. For instance, given two possible solutions, the one selected is the one corresponding to the pair of points that comes first in the input.
On the other hand, solutions given by the O(n log n) algorithm are stable, as they do not depend on the original order of the points.
As a consequence, we cannot simply compare the pair of points returned by each algorithm. The only invariant is the distance between each pair of points.
Note: in case of multiple possible solutions, our implementation of the O(n log n) algorithm always selects:
- the pair of points with the lowest value of coordinate x if solutions are in the left or right half
- the pair of points with the lowest value of coordinate y if solutions are in the strip around the line dividing left and right halves
The multi-process implementation in Scala is considerably more cumbersome than in C or Python.
In C and Python, you can just invoke a function (passing its parameters alongside) in a new process in a similar way to invoking a new thread.
However, in Java/Scala that's not possible. Instead, you need to invoke a new `java` or `scala` command on a class and write the corresponding parameters as arguments to the command (they will be passed to the program as arguments of the corresponding `main` method).
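For contrast, here is roughly what that looks like in Python (a sketch; `solve` is an illustrative function, not the repository's code):

```python
from multiprocessing import Process

def solve(points):
    print(f"child process working on {len(points)} points")

if __name__ == "__main__":
    # The target function is run directly in a new process,
    # much like starting a new thread.
    p = Process(target=solve, args=([(0, 0), (1, 1)],))
    p.start()
    p.join()
```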
Despite the big gap between them on the high-level/low-level language spectrum, Python and C can integrate very easily in some situations.
For instance, a binary file generated in C containing a sequence of random points, each represented as a struct of 2 integers, can be read easily in Python by using the `struct` module:

```python
struct.iter_unpack('ii', data)
```
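A slightly fuller sketch of reading such a file (the filename `points.bin` is an assumption):

```python
import struct

# The C side wrote each point as a struct of two ints: {int x; int y;}
with open("points.bin", "rb") as f:
    data = f.read()

# 'ii' matches two C ints; iter_unpack yields one (x, y) tuple per struct.
points = list(struct.iter_unpack("ii", data))
```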
Another example is how easily Python can map shared memory between processes, in a similar way to C's `mmap`, by making use of the `Value` object from the `multiprocessing` module.
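A minimal sketch of sharing a double between processes this way (illustrative, not the repository's code):

```python
from multiprocessing import Process, Value

def update(best):
    # The child writes to memory shared with the parent.
    with best.get_lock():
        best.value = min(best.value, 1.5)

if __name__ == "__main__":
    best = Value("d", float("inf"))  # 'd' = C double in shared memory
    p = Process(target=update, args=(best,))
    p.start()
    p.join()
    print(best.value)  # prints 1.5
```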
After testing the C implementation on my Mac, I decided to give it a go in a VM running Ubuntu.
To my surprise, I started to get random failures in the multi-process version of the program. For a while I assumed that it had something to do with the functions used to share memory between processes.
But then I realised that it was a bug in my code: instead of using the function `waitpid`, I was using `wait`! And this bug would manifest as a race condition.
On my Mac it did not happen because it sports 8 cores, so all the processes would run in parallel, finishing in the same order as they were started. The VM, on the other hand, had 2 cores, resulting in the processes interleaving and triggering the race condition.
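To make the difference concrete, here is a sketch of the two patterns using Python's `os` module, which mirrors the C calls (the actual bug was in the C implementation):

```python
import os
import time

# Spawn two children; the first one takes longer, so with few cores
# they can finish in the reverse order of creation.
pids = []
for delay in (0.2, 0.0):
    pid = os.fork()  # Unix only
    if pid == 0:  # child
        time.sleep(delay)
        os._exit(0)
    pids.append(pid)

# Buggy pattern: os.wait() reaps WHICHEVER child exits first, so a result
# can be attributed to the wrong subproblem once processes interleave.
# done_pid, status = os.wait()

# Correct pattern: wait for each specific child with os.waitpid.
for pid in pids:
    os.waitpid(pid, 0)
```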