choosing a random element with weighted probability? #123

joelmiller · 2019-07-30T08:15:21Z

Let's say I have a collection of many objects, each with a weight associated with it. I'd like to be able to choose one of these at random, with probability proportional to its weight.

Is this currently built in to sortedcontainers? Basically given an ordered list of (non-negative) numbers, I'd like to return a random index chosen with probability proportional to the corresponding number.

I have some ideas about how this could be done, but first I want to check if it's doable already before I put in the effort to see if my ideas actually work (and are faster than what I've already implemented using a different data structure).

grantjenks · 2019-07-30T15:16:10Z

I think what you're describing is part of the random module: https://docs.python.org/3/library/random.html#random.choices

joelmiller · 2019-07-30T23:49:12Z

I don't think that will work very well in my case. I have a large collection of objects, each with a weight associated with them. I am continually adding and removing things to this collection. The version you've linked to will recalculate the cumulative sum each time it is called, or I have to update the cumulative sum each time I add/remove things:

Internally, the relative weights are converted to cumulative weights before making selections, so supplying the cumulative weights saves work.

This is too slow for my purposes. I've got a better approach than that already, but I know an even better algorithm if I can guarantee that my weights are in decreasing order.

joelmiller · 2019-07-31T12:49:57Z

Hmm,

I was just trying to implement my "better algorithm" which was an adaptation of "reservoir sampling" to take advantage of the fact that we know the weights are ordered. There is an unexpected subtlety that is blocking my initial attempt. I think it can be fixed, but I'm going to have to set it aside for a bit before I can devote the time it'll need.

If I manage to work it out, I'll try to share it with you for sortedcontainers. I think it would be a valuable contribution.

joelmiller · 2019-08-01T06:21:07Z

Okay, so I've got some working code, and comparisons with numpy and random functions. The function itself is just 14 lines of code. Is this something that can be implemented directly into sortedcontainers?

I believe it does almost as well as random.choices does when I put in the effort to calculate the cumulative sum. It does much better than the numpy equivalent or random.choices without the cumulative sum. I haven't done a really careful comparison, just eyeballing how long a bunch of repeated choices take. I expect my algorithm (as well as random.choices with a cumulative sum) is logarithmic in the length of the list, while the other two are linear. I think I know how to prove it, but haven't worked through all the details yet.

My example does not need to know the cumulative sum to work, so it is clearly better in cases where we cannot keep track of the cumulative sum.

import random
import numpy as np
import matplotlib.pyplot as plt


def my_selection(ordered_list):

    r'''
    Input is 
        ordered_list, a list of positive weights in monotonically decreasing order.

    Returns 
         an index with probability proportional to its weight.

    Will fail and break if ordered_list is empty or not a list.

    Will return a value (but incorrectly) if ordered_list has negative values or isn't ordered

    This output similar to, but as far as I've been able to test faster than, 
    random.choices and np.random.choice where we put out just a single entry.
    It is comparable in speed (but I believe slower) to random.choices where a 
    cumulative sum is provided but in many contexts it may not be possible to do that.


    I believe I can prove that runtime of this algorithm is logarithmic in the length of
    the list.  

   
    This is based on the algorithm A-Chao described here:
    https://en.wikipedia.org/wiki/Reservoir_sampling#Weighted_random_sampling_using_Reservoir
    but with an additional step allowing us to do a long jump to a later element.

    The basic idea change from A-Chao is that I know that the list is ordered.
    So in order for the next r to be greater than the existing maximum (largest_r), the
    random number drawn between 0 and 1 has to be at least largest_r**W where 
    W is the weight of the element I just considered.  

    Then set q = largest_r**W, and p = 1-q.  The time till the next draw of
    a random number in (q,1) is geometric with parameter p.  So we can sample
    from a geometric distribution to find the next one that might have a larger
    r.  We go to that one, and then select a random numper in (q,1).  Then
    test to see if the new r is larger.  If so, we replace our previous candidate
    with this candidate.  If not, we leave the previous candidate in place

    Then we update the weight (because a smaller weight will lead to a smaller
    p, and then we do our next jump, which is probably longer.


    Note that as given in Wikipedia, the algorithm can return multiple entries by
    keeping a priority queue.  We should be able to do the same here.
    '''
    current_index = 0
    largest_r = 0
    q=0
    p=1
    while current_index < len(ordered_list):
        W = ordered_list[current_index]
        r = (q+p*random.random())**(1./W)
        if r> largest_r:
            largest_r = r
            returnval = current_index
                
        q = largest_r**W
        p = 1-q
        current_index += np.random.geometric(p)
    
    return returnval


iterations = 10000
L = []

for counter in range(iterations ):
    L.append(np.exp(2*random.random())-1)  #choosing some varied weights

L.sort(reverse=True)

L = np.array(L)

Lsum = sum(L)
cum_sum = np.cumsum(L)

probs = L/Lsum
random1_indices=[]
random2_indices = []
numpy_indices = [] 
my_indices = []

print('Now doing {} choices from the list L:\n'.format(iterations))
print('using random.choices (without a cumulative sum)')
for counter in range(iterations):
    random1_indices.append(random.choices(range(iterations), weights = L))
print('now using numpy.random.choice')
for ctr in range(iterations):
    numpy_indices.append(np.random.choice(range(iterations), p = probs))
print('using random.choices (with a cumulative sum)')
for counter in range(iterations):
    random2_indices.append(random.choices(range(iterations), cum_weights = cum_sum))
print('now using my_indices')
for ctr in range(iterations):
    my_indices.append(my_selection(L))
print('done with calculations\n\n')


random1_indices.sort()
random2_indices.sort()
numpy_indices.sort()
my_indices.sort()

print('the plots should overlie each other if they are all choosing from the same distribution.')
plt.plot(cum_sum/cum_sum[-1], label = 'Predicted cumulative density function')
plt.plot(np.array(random1_indices), np.arange(0,iterations)/iterations, label = 'random_indices without cumsum')
plt.plot(np.array(random2_indices), np.arange(0,iterations)/iterations, label = 'random_indices with cumsum')
plt.plot(np.array(numpy_indices), np.arange(0,iterations)/iterations, label = 'numpy_indices')
plt.plot(np.array(my_indices), np.arange(0,iterations)/iterations, label = 'my_indices')

plt.legend()
plt.xlabel('index in list L')
plt.ylabel('cumulative probability')
plt.show()

grantjenks · 2019-09-08T15:01:00Z

I don’t think this belongs in sortedcontainers. It’s an interesting use case but the topic is niche and result can be achieved with the existing api. There might be a place for it in sortedcollections or some other recipes-oriented package/post.

grantjenks · 2020-05-11T18:09:19Z

Reference #110

grantjenks · 2020-06-07T05:57:42Z

For reference, sorted collections is at https://github.com/grantjenks/python-sortedcollections

grantjenks closed this as completed Jun 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

choosing a random element with weighted probability? #123

choosing a random element with weighted probability? #123

joelmiller commented Jul 30, 2019 •

edited

Loading

grantjenks commented Jul 30, 2019

joelmiller commented Jul 30, 2019

joelmiller commented Jul 31, 2019

joelmiller commented Aug 1, 2019 •

edited

Loading

grantjenks commented Sep 8, 2019

grantjenks commented May 11, 2020

grantjenks commented Jun 7, 2020

choosing a random element with weighted probability? #123

choosing a random element with weighted probability? #123

Comments

joelmiller commented Jul 30, 2019 • edited Loading

grantjenks commented Jul 30, 2019

joelmiller commented Jul 30, 2019

joelmiller commented Jul 31, 2019

joelmiller commented Aug 1, 2019 • edited Loading

grantjenks commented Sep 8, 2019

grantjenks commented May 11, 2020

grantjenks commented Jun 7, 2020

joelmiller commented Jul 30, 2019 •

edited

Loading

joelmiller commented Aug 1, 2019 •

edited

Loading