Skip to content

murex/Kata-MedianListOfLists

Repository files navigation

Gradle Maven CMake Go Pytest Npm Check Markdown links Add contributors

Median of a List of Lists (with no concatenation)

Kata Image
"CPU Heat Sink" by Michael Dziedzic is licenced under Unsplash License

Description

This kata focuses on practicing TDD with algorithms.

TDD does not 'magically' generate non-trivial algorithms.

TDD can help in a different way though. You can use it to validate that your algorithm works, and then use the tests to refactor your algorithm.

Here is the process (roughly):

  1. Think about the algorithm that will solve your problem.
  2. Think of an order of tests that will allow you to build this algorithm incrementally.
  3. Use TDD to write a crude algorithm that solves your problem, and that can later be re-factored into our target algorithm. This usually involves laying down the main blocks we will need, while using very low performance algorithms as much as possible.
  4. With the tests in place, refactor the algorithm to what we want.

Part 1 - Solving the Problem

Problem Description

We have a cluster of machines that each holds a very large list of integers. We want to get the median of all these numbers, but we cannot bring all the numbers on a single machine.

The central question is: how to compute the median value of a list of lists without concatenating all the lists? (A Median value does not need to be an element of the list, and can be a float value even if the lists only contain integers).

Possible Solution

The definition of a median is that there are as many values that are greater as values that are smaller.

A simple way to find the median, is to search for a value that fits the definition by asking for the count of greater and smaller values to all lists, and then summing.

Here is a starting skeleton of the algorithm, that leaves place for incremental optimizations:

List<List<int>> values = ...

isMedian = function(n) {
    totalCountSmaller = sum(countSmaller(n, values[0]), countSmaller(n, values[1]), countSmaller(n, values[2]), ...)
    totalCountGreater = sum(countGreater(n, values[0]), countGreater(n, values[1]), countGreater(n, values[2]), ...)
    return totalCountSmaller == totalCountGreater
}

maxValue = max(max(values[0]), max(values[1]), max(values[2]), ...)
minValue = min(min(values[0]), min(values[1]), min(values[2]), ...)

return findByPredicateBetween(minValue, maxValue, isMedian)

At first, we can implement all the bricks (min, max, findByPredicateBetween, countSmaller, countGreater) in a very basic brute force way.

Part 2 - Code Optimization

The current solution is working, but it is slow! Your task now is to reach a faster solution! Therefore, do not attempt this step before completing part 1 (with a greedy solution).

For this, we will be using two techniques:

  1. Big(O) analysis of each function
  2. Code profiling: For this technique, you rely on tools to help you analyze the space and time complexity of your program. Refer to the below pages for more details:

Detailed Steps

Here are the steps we suggest you follow:

  1. Introduce a benchmark test
  2. Do a big(O) analysis. For each function, compute the Big(O) (comment the result at the top of the function)
  3. Run the profiler and compare the result to the results you got from the previous step.
  4. Based on the analysis you've done in previous steps, think of optimizations you can do
  5. Apply the optimizations
  6. Re-run the profiler and analyze the output. Was there any improvement in time?
  7. Plan further possible optimization (if any)
  8. Repeat steps 4-7

Possible Optimizations

  • Use a dichotomous search
  • Sort the lists to speed up max and min
  • Use a binary search for countSmaller and countGreater
  • Compute countSmaller and countGreater in a single call (remember the lists are supposed to be on a different machine)
  • Parallelize the computations happening on the different lists (BONUS: how to abstract this to keep the unit tests deterministic and single threaded?)

Getting Started

Session Quick Retrospective

You can fill it from here

Useful Links

For this Kata

General

Session Information

Style & Duration

Topic

  • TDD on algorithms

Focus Points

  • Acceptance tests
  • Greedy or not?
  • ⚠ Premature optimization

Source Files

License

Kata-MedianListOfLists and the accompanying materials are made available under the terms of the MIT License which accompanies this distribution, and is available at the Open Source site

Acknowledgements

See ACKNOWLEDGEMENTS.md for more information.

Contributors

Damien
Damien Menanteau
Ahmad
Ahmad Atwi
Philippe
Philippe Bourgau
AntoineMx/
AntoineMx