Weighted random shuffling

1. Weighted random shuffling

For some application I needed a random shuffle \(r\) of a list of numbers \(a\). The shuffle, however, should be weighted, such that, given a list of weights \(w\), the element \(a_i\) with weight \(w_i\) should be the first element of the random shuffle \(r\) with probability \(w_i/\sum_k w_k\). I searched for a python implementation of such an algorithm, but I couldn't find one. Also, numpy doesnt' provide it, and numpy.random.shuffle only offers shuffling with uniform weights. Thus, this code.

2. Weighted shuffling in pure python

Weighted random shuffling is the same as weighted random sampling from a list \(a\) without replacement. In steps:

  1. Choose with probability \(w_{i}/\sum_{i} w_{i}\) element \(a_{i}\) from \(a\).
  2. Add this element to a list \(r\).
  3. Remove element \(a_{i}\) from \(a\) and \(w_{i}\) from \(w\).
  4. Stop if \(a\) is empty, else,
  5. Return to step 1 with the updated lists \(a\) \(w\).

However, I found the code below for weighted random sampling on this page.

from random import random

def weighted_choice(weights):
    rnd = random() * sum(weights)
    for i, w in enumerate(weights):
        rnd -= w
        if rnd < 0:
            return i

Thus, weighted_choice() gives an index to the chosen element of an array. With this index I can fill a random shuffle \(r\) with elements of \(a\).

As removing elements from lists is a bit expensive, I chose to set the weight \(w_{i}\) of the chosen element to zero. This ensures that the element \(i\) cannot be selected the next time.

def weighted_shuffle(a,w):
    w = list(w) # make a copy of w
    if len(a) != len(w):
        print("weighted_shuffle: Lenghts of lists don't match.")
        return

    r = [0]*len(a) # contains the random shuffle
    for i in range(len(a)):
        j = weighted_choice(w)
        r[i]=a[j]
        w[j] = 0
    return r

3. Fast shuffling

Once I had the above code, I went to Gerlach van der Heide. He managed to speed it up by several orders of magnitude.

from random import random
from bisect import bisect_right
import numpy as np

def weighted_shuffle(a,w):
    r = np.empty_like(a)
    cumWeights = np.cumsum(w)
    for i in range(len(a)):
         rnd = random() * cumWeights[-1]
         j = bisect_right(cumWeights,rnd)
         #j = np.searchsorted(cumWeights, rnd, side='right')
         r[i]=a[j]
         cumWeights[j:] -= w[j]
    return r

a = np.arange(1,1000)
w = np.arange(1,1000)
r = weighted_shuffle(a,w)

Here is the result

print(r[:2])

Some testing shows, for a reason unknown to me, that np.searchsorted is slower than bisect_right.

4. A wrong idea

While searching the web I came across the one-liner below. It is elegant, but wrong. The weights according to which the elements of \(a\) are chosen are not in proportion to \(w\). Some basic probability theory shows why this is the case. It comes down to proving that \[P\{w_1 U_1 \leq w_2 U_2\} \neq \frac{w_1}{w_1+w_1},\] where \(w_i\) are the weights and \(U_i\) uniform random deviates on \([0,1]\).

def wrong(a,w):
    aa = range(len(a))
    aa.sort(key=lambda i: w[i]*random()*beta[i])
    return [a[i] for i in aa]