/****************************************************************************
 * File: MajorityElement.hh
 * Author: Keith Schwarz (htiek@cs.stanford.edu)
 *
 * A one-pass, linear time algorithm that returns the majority element of a
 * range of data.  The majority element is an element that appears strictly
 * more than half the time.  For example, in the sequence
 *
 *                                  0 1 0 0 2 0 3
 *
 * The number 0 is a majority element, since it appears 4/7 times.  However,
 * in the sequence
 *
 *                                 0 1 0 0 2 0 3 3
 *
 * There is no majority element, since even though 0 occurs 4/8 times, this
 * isn't strictly greater than half the elements.
 *
 * The algorithm for finding the majority element is remarkably simple, but its
 * correctness is not immediately obvious.  The algorithm works as follows.  At
 * each step, we maintain our "guess" of what the majority element will be, and
 * also a counter.  We then scan across the array.  At each point, if the new
 * element matches our current guess we increment the counter, and otherwise
 * we decrement it.  If the counter is ever zero, then on the next element we
 * change the counter to 1 and pick the next element as our guess.  Finally, we
 * output the guessed element.  For example, here is the algorithm running on
 * the first sequence above.  The topmost row shows the input, below that our
 * guess, and below that the counter.  The first column is the initial state,
 * before any input has been read, and a ? means there is no current guess:
 *
 *  INPUT      0 1 0 0 2 0 3
 *  GUESS    ? 0 ? 0 0 0 0 0
 *  COUNTER  0 1 0 1 2 1 2 1
 *
 * Since our guess at the end is zero, we output zero.  This algorithm is
 * due to Boyer and Moore and is described in their paper "MJRTY - A Fast
 * Majority Vote Algorithm."
 *
 * There are many ways to think about why this algorithm works.  One good
 * intuition is to think of the algorithm as breaking the input down into lots
 * of stretches of consecutive copies of particular values.  Incrementing the
 * counter then corresponds to marking that multiple copies of the same value
 * were found, while decrementing it corresponds to some other sequences of
 * values "canceling out" the accumulation of values of a particular type.
 *
 * A formal proof of correctness of this algorithm (based on the proof in
 * Boyer and Moore's paper) relies on a key lemma.  In this section, we'll
 * let C be the number that is currently a candidate for the majority element,
 * K be its count after some number of steps, and N be the number of total
 * elements.
 *
 * Lemma 1: For any i, 1 <= i <= N, after i steps of the algorithm, the
 * elements in the range [1, i] can be rearranged into two groups A and B such
 * that A is K copies of C, and B is a collection of elements with at most
 * |B| / 2 = (i - K) / 2 copies of any one element.
 *
 * Let's hold off on the proof of this lemma for now, and show that if it
 * holds and there is a majority element, the algorithm must be correct.  Using
 * the above lemma, note that when the algorithm terminates, there must be some
 * element C that was chosen with some count K.  Assume for the sake of
 * contradiction that C is not the majority element; then there is some other
 * element C' that must be the majority element.  Consequently, strictly more
 * than N / 2 elements of the range are equal to C'.  Let's consider where they
 * are.  By the above lemma, all the elements of the input can be broken up
 * into groups A and B, where everything in group A has value C and at most
 * |B| / 2 elements of B have value C'.  Since C' != C, no element of A is
 * equal to C'.  Since |A| = K and |A| + |B| = N, this means that there are at
 * most (N - K) / 2 <= N / 2 copies of C', contradicting the fact that C' is
 * the actual majority element.  We have reached a contradiction, and so C
 * must be the majority element at the end of the algorithm's run.
 *
 * We can now prove the claim of the lemma by induction on i.  As a base case,
 * if i = 1, then K = 1 and C is the first element of the range.  Then we can
 * let A be the singleton element and B be the empty set, which trivially obeys
 * the criteria of the lemma.  For the inductive step, assume that for some i
 * the claim holds and consider the execution of the algorithm on step i + 1.
 * Let A and B be the sets A and B from the ith step.  Then we consider three
 * possible cases:
 *
 * 1. On entry to this step, K = 0.  Then after this step finishes, K = 1
 *    and C is the newest element.  This means that on entry to this step,
 *    A was the empty set and B was some set where no element appeared more
 *    than i/2 times in B.  If we then let A' be the singleton set containing
 *    the new element and B' = B, then these sets satisfy the requirements of
 *    the lemma and the claim holds.
 * 2. On entry to this step, K > 0 and the new element matches the current
 *    candidate C.  Then we can add this element to A to get a new set
 *    A' meeting the lemma's requirements, so the claim holds.
 * 3. On entry to this step, K > 0, and the new element does not match the
 *    current candidate C.  This means that the new K is one less than the
 *    previous K, but the candidate does not change.  If we then move one
 *    copy of C from A into B, then place the new element into the set B,
 *    then the updated A and B will satisfy the lemma's claims.
 *    This is tedious but simple to check, so I'll leave it as an exercise
 *    to the reader. :-)
 *
 * In the case where there is no majority element, the element produced by the
 * algorithm will be arbitrary.  We can then check whether we have the majority
 * element by performing a linear scan over the input range and counting the
 * frequency of the element.
 */
#ifndef MajorityElement_Included
#define MajorityElement_Included

#include <cstddef>    // For size_t
#include <functional> // For equal_to
#include <iterator>   // For iterator_traits

/**
 * Function: ForwardIterator MajorityElement(ForwardIterator begin, 
 *                                           ForwardIterator end);
 * Usage: std::vector<int>::iterator maj = MajorityElement(v.begin(), v.end());
 * -----------------------------------------------------------------------
 * Given a range of values [begin, end), returns an iterator to an element
 * whose value accounts for strictly more than half the elements of the
 * range.  If no element is a majority element, end is returned as a
 * sentinel.
 */
template <typename ForwardIterator>
ForwardIterator MajorityElement(ForwardIterator begin, ForwardIterator end);

/**
 * Function: ForwardIterator MajorityElement(ForwardIterator begin, 
 *                                           ForwardIterator end,
 *                                           Comparator comp);
 * Usage: std::vector<int>::iterator maj =
 *            MajorityElement(v.begin(), v.end(), MyComparator);
 * -----------------------------------------------------------------------
 * Given a range of values [begin, end), returns an iterator to an element
 * whose value, according to the binary equality predicate comp, accounts for
 * strictly more than half the elements of the range.  Here comp(a, b) should
 * return true exactly when a and b count as the same value.  If no element
 * is a majority element, end is returned as a sentinel.
 */
template <typename ForwardIterator, typename Comparator>
ForwardIterator MajorityElement(ForwardIterator begin, ForwardIterator end,
                                Comparator comp);

/* * * * * Implementation Below This Point * * * * */

/* Main implementation uses the parameterized comparator. */
template <typename ForwardIterator, typename Comparator>
ForwardIterator MajorityElement(ForwardIterator begin, ForwardIterator end,
                                Comparator comp) {
  /* Initially, we have no guess and our count is zero.  However, to avoid
   * edge cases with the empty range, we initialize the candidate to end.
   */
  ForwardIterator candidate = end;
  size_t confidence = 0;

  /* Scan over the input using the Boyer-Moore update rules. */
  for (ForwardIterator itr = begin; itr != end; ++itr) {
    /* If we have no confidence in our previous guess, update it to this new
     * element.
     */
    if (confidence == 0) {
      candidate = itr;
      ++confidence;
    }
    /* Otherwise, increment or decrement the confidence based on whether the
     * next element matches.
     */
    else if (comp(*candidate, *itr)) 
      ++confidence;
    else 
      --confidence;
  }

  /* Finally, do one more pass to confirm that we have a majority element. */
  size_t numMatches = 0, totalElements = 0;
  for (ForwardIterator itr = begin; itr != end; ++itr) {
    /* Check whether this is a match and update appropriately. */
    if (comp(*candidate, *itr)) 
      ++numMatches;

    /* Either way, increment the total number of elements. */
    ++totalElements;
  }

  /* This is a majority element only if it accounts for strictly more than
   * half the number of elements.
   */
  return totalElements / 2 < numMatches ? candidate : end;
}
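
/* Illustrative sketch, not part of the original interface: the helper below
 * replays the candidate-selection scan from the function above and records
 * the guess and counter after every element, which should reproduce the
 * GUESS and COUNTER rows of the worked example in the file comment (minus
 * the initial-state column).  The names MajorityTraceStep and
 * MajorityElementTrace are hypothetical, and for brevity the sketch assumes
 * a std::vector input whose elements are compared with operator==.
 */
#include <cstddef>
#include <vector>

template <typename T>
struct MajorityTraceStep {
  T guess;              // Candidate held after this element; stale if counter == 0
  std::size_t counter;  // Confidence in that candidate
};

template <typename T>
std::vector<MajorityTraceStep<T> >
MajorityElementTrace(const std::vector<T>& input) {
  std::vector<MajorityTraceStep<T> > trace;
  if (input.empty()) return trace;

  T guess = input[0];       // Placeholder; overwritten on the first iteration
  std::size_t counter = 0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (counter == 0) {
      // No confidence in the old guess, so adopt this element.
      guess = input[i];
      counter = 1;
    } else if (guess == input[i]) {
      ++counter;
    } else {
      --counter;
    }
    MajorityTraceStep<T> step = { guess, counter };
    trace.push_back(step);
  }
  return trace;
}
// On the input 0 1 0 0 2 0 3 the recorded counters are 1 0 1 2 1 2 1.  Where
// the counter is 0, the recorded guess is the stale one that the file
// comment's table shows as ?.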

/* Non-comparator version implemented in terms of comparator version. */
template <typename ForwardIterator>
ForwardIterator MajorityElement(ForwardIterator begin, ForwardIterator end) {
  return MajorityElement(begin, end,
                         std::equal_to<typename std::iterator_traits<ForwardIterator>::value_type>());
}
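
/* Illustrative sketch, not part of the original interface: a brute-force
 * check of Lemma 1 from the file comment.  The helper name LemmaOneHolds is
 * hypothetical, and the sketch assumes int elements compared with operator==.
 * After each step it sets aside K copies of the candidate C as group A and
 * confirms that no value makes up more than half of the remaining group B.
 * It rescans a frequency table on every step, so it is meant for
 * experimenting on small inputs only.
 */
#include <cstddef>
#include <map>
#include <vector>

inline bool LemmaOneHolds(const std::vector<int>& input) {
  int candidate = 0;                // C; meaningless while count == 0
  std::size_t count = 0;            // K
  std::map<int, std::size_t> freq;  // Occurrences of each value seen so far

  for (std::size_t i = 0; i < input.size(); ++i) {
    // Same update rules as the first loop of MajorityElement.
    if (count == 0) {
      candidate = input[i];
      count = 1;
    } else if (candidate == input[i]) {
      ++count;
    } else {
      --count;
    }
    ++freq[input[i]];

    // Group A needs K copies of C, so the prefix must contain that many.
    if (freq[candidate] < count) return false;

    // Group B is everything else; it holds (i + 1) - K elements.
    const std::size_t sizeOfB = (i + 1) - count;
    for (std::map<int, std::size_t>::const_iterator it = freq.begin();
         it != freq.end(); ++it) {
      std::size_t copiesInB = it->second;
      if (it->first == candidate) copiesInB -= count;
      if (2 * copiesInB > sizeOfB) return false;
    }
  }
  return true;
}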

#endif