Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

choosing a random element with weighted probability? #123

Closed
joelmiller opened this issue Jul 30, 2019 · 7 comments
Closed

choosing a random element with weighted probability? #123

joelmiller opened this issue Jul 30, 2019 · 7 comments

Comments

@joelmiller
Copy link

joelmiller commented Jul 30, 2019

Let's say I have a collection of many objects, each with a weight associated with it. I'd like to be able to choose one of these at random, with probability proportional to its weight.

Is this currently built in to sortedcontainers? Basically given an ordered list of (non-negative) numbers, I'd like to return a random index chosen with probability proportional to the corresponding number.

I have some ideas about how this could be done, but first I want to check if it's doable already before I put in the effort to see if my ideas actually work (and are faster than what I've already implemented using a different data structure).

@grantjenks
Copy link
Owner

I think what you're describing is part of the random module: https://docs.python.org/3/library/random.html#random.choices

@joelmiller
Copy link
Author

I don't think that will work very well in my case. I have a large collection of objects, each with a weight associated with them. I am continually adding and removing things to this collection. The version you've linked to will recalculate the cumulative sum each time it is called, or I have to update the cumulative sum each time I add/remove things:

Internally, the relative weights are converted to cumulative weights before making selections, so supplying the cumulative weights saves work.

This is too slow for my purposes. I've got a better approach than that already, but I know an even better algorithm if I can guarantee that my weights are in decreasing order.

@joelmiller
Copy link
Author

Hmm,

I was just trying to implement my "better algorithm" which was an adaptation of "reservoir sampling" to take advantage of the fact that we know the weights are ordered. There is an unexpected subtlety that is blocking my initial attempt. I think it can be fixed, but I'm going to have to set it aside for a bit before I can devote the time it'll need.

If I manage to work it out, I'll try to share it with you for sortedcontainers. I think it would be a valuable contribution.

@joelmiller
Copy link
Author

joelmiller commented Aug 1, 2019

Okay, so I've got some working code, and comparisons with numpy and random functions. The function itself is just 14 lines of code. Is this something that can be implemented directly into sortedcontainers?

I believe it does almost as well as random.choices does when I put in the effort to calculate the cumulative sum. It does much better than the numpy equivalent or random.choices without the cumulative sum. I haven't done a really careful comparison, just eyeballing how long a bunch of repeated choices take. I expect my algorithm (as well as random.choices with a cumulative sum) is logarithmic in the length of the list, while the other two are linear. I think I know how to prove it, but haven't worked through all the details yet.

My example does not need to know the cumulative sum to work, so it is clearly better in cases where we cannot keep track of the cumulative sum.

import random
import numpy as np
import matplotlib.pyplot as plt


def my_selection(ordered_list):

    r'''
    Input is 
        ordered_list, a list of positive weights in monotonically decreasing order.

    Returns 
         an index with probability proportional to its weight.

    Will fail and break if ordered_list is empty or not a list.

    Will return a value (but incorrectly) if ordered_list has negative values or isn't ordered

    This output similar to, but as far as I've been able to test faster than, 
    random.choices and np.random.choice where we put out just a single entry.
    It is comparable in speed (but I believe slower) to random.choices where a 
    cumulative sum is provided but in many contexts it may not be possible to do that.


    I believe I can prove that runtime of this algorithm is logarithmic in the length of
    the list.  

   
    This is based on the algorithm A-Chao described here:
    https://en.wikipedia.org/wiki/Reservoir_sampling#Weighted_random_sampling_using_Reservoir
    but with an additional step allowing us to do a long jump to a later element.

    The basic idea change from A-Chao is that I know that the list is ordered.
    So in order for the next r to be greater than the existing maximum (largest_r), the
    random number drawn between 0 and 1 has to be at least largest_r**W where 
    W is the weight of the element I just considered.  

    Then set q = largest_r**W, and p = 1-q.  The time till the next draw of
    a random number in (q,1) is geometric with parameter p.  So we can sample
    from a geometric distribution to find the next one that might have a larger
    r.  We go to that one, and then select a random numper in (q,1).  Then
    test to see if the new r is larger.  If so, we replace our previous candidate
    with this candidate.  If not, we leave the previous candidate in place

    Then we update the weight (because a smaller weight will lead to a smaller
    p, and then we do our next jump, which is probably longer.


    Note that as given in Wikipedia, the algorithm can return multiple entries by
    keeping a priority queue.  We should be able to do the same here.
    '''
    current_index = 0
    largest_r = 0
    q=0
    p=1
    while current_index < len(ordered_list):
        W = ordered_list[current_index]
        r = (q+p*random.random())**(1./W)
        if r> largest_r:
            largest_r = r
            returnval = current_index
                
        q = largest_r**W
        p = 1-q
        current_index += np.random.geometric(p)
    
    return returnval


iterations = 10000
L = []

for counter in range(iterations ):
    L.append(np.exp(2*random.random())-1)  #choosing some varied weights

L.sort(reverse=True)

L = np.array(L)

Lsum = sum(L)
cum_sum = np.cumsum(L)

probs = L/Lsum
random1_indices=[]
random2_indices = []
numpy_indices = [] 
my_indices = []

print('Now doing {} choices from the list L:\n'.format(iterations))
print('using random.choices (without a cumulative sum)')
for counter in range(iterations):
    random1_indices.append(random.choices(range(iterations), weights = L))
print('now using numpy.random.choice')
for ctr in range(iterations):
    numpy_indices.append(np.random.choice(range(iterations), p = probs))
print('using random.choices (with a cumulative sum)')
for counter in range(iterations):
    random2_indices.append(random.choices(range(iterations), cum_weights = cum_sum))
print('now using my_indices')
for ctr in range(iterations):
    my_indices.append(my_selection(L))
print('done with calculations\n\n')


random1_indices.sort()
random2_indices.sort()
numpy_indices.sort()
my_indices.sort()

print('the plots should overlie each other if they are all choosing from the same distribution.')
plt.plot(cum_sum/cum_sum[-1], label = 'Predicted cumulative density function')
plt.plot(np.array(random1_indices), np.arange(0,iterations)/iterations, label = 'random_indices without cumsum')
plt.plot(np.array(random2_indices), np.arange(0,iterations)/iterations, label = 'random_indices with cumsum')
plt.plot(np.array(numpy_indices), np.arange(0,iterations)/iterations, label = 'numpy_indices')
plt.plot(np.array(my_indices), np.arange(0,iterations)/iterations, label = 'my_indices')

plt.legend()
plt.xlabel('index in list L')
plt.ylabel('cumulative probability')
plt.show()

@grantjenks
Copy link
Owner

I don’t think this belongs in sortedcontainers. It’s an interesting use case but the topic is niche and result can be achieved with the existing api. There might be a place for it in sortedcollections or some other recipes-oriented package/post.

@grantjenks
Copy link
Owner

Reference #110

@grantjenks
Copy link
Owner

For reference, sorted collections is at https://github.com/grantjenks/python-sortedcollections

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants