-
Notifications
You must be signed in to change notification settings - Fork 0
/
RandomSample.hh
137 lines (126 loc) · 6.62 KB
/
RandomSample.hh
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
/****************************************************************************
* File: RandomSample.hh
* Author: Keith Schwarz (htiek@cs.stanford.edu)
*
* Implementation of a random sampling algorithm on streams. The algorithm
* takes as input a stream of values (represented by a range of input
* iterators), as well as an output range, then fills the output range with
* elements sampled randomly from the input stream with uniform probability.
* The algorithm need not know in advance the number of elements in the
* stream.
*
* Internally, the algorithm works as follows. Suppose that we are to pick
* k elements at random. Initially, the algorithm guesses that it will pick
* the first k elements from the stream. From that point forward, upon
* seeing a new element, the algorithm chooses a random number in the range
* [0, n], where n is the number of elements encountered so far, including
* this new one. If the value is in the range [0, k), then the element at
* that position in the output is overwritten with the newly-sampled value.
* Otherwise, the element is ignored.
*
* We can show that this chooses each element with uniform probability by
* induction on the stream length. This induction is valid because it does
* not consider the length of the stream during its decision-making, and so
* the behavior of the algorithm on a stream of n + 1 elements is the same
* as the behavior of the algorithm on a stream of n elements, plus one
* extra iteration of the loop.
*
* As a base case, if the length of the stream is k or less, then the
* algorithm will pick each element, and so each element is correctly chosen
* with uniform probability (since every element must be picked).
*
* For the inductive step, assume that for a stream of n elements, each
* element is correctly picked with probability k / n and consider the
* next iteration of the algorithm. The probability that the algorithm
* will choose to store the next value that comes from the stream is the
* probability that a number in the range [0, n + 1] is in the range
* [0, k). This has probability k / (n + 1). Moreover, consider the
* probability that any element from the first n elements of the stream
* is chosen. By the inductive hypothesis, each element has probability
* k / n of being chosen. The probability that an element chosen this way
* survives after this iteration is then the probability that the randomly-
* chosen number is not equal to the slot in which that element is stored,
* and since all slots are uniform this is n / (n + 1), since only one
* choice can evict this element. Thus the total probability that the
* element is chosen is (k / n) * (n / (n + 1)) = k / (n + 1), and we
* have that each element in the series has probability k / (n + 1) of
* being chosen, which is a uniform distribution.
*
* The implementation provided here by default uses rand to generate random
* numbers, but a custom random generator may be used instead.
*/
#ifndef RandomSample_Included
#define RandomSample_Included
#include <cstdlib> // For rand
/**
* Function: RandomSample(InputIterator inBegin, InputIterator inEnd,
* RandomIterator outBegin, RandomIterator outEnd);
* -------------------------------------------------------------------------
* Populates the output range [outBegin, outEnd) with a uniform random
* sample of the elements in the range [inBegin, inEnd). Internally, this
* function uses rand to generate random numbers. If the input range does
* not contain enough elements, then only some of the values will be filled
* in and the algorithm will return an iterator to the last element written.
* If at least outEnd - outBegin elements were written, the return value
* is outEnd.
*/
template <typename InputIterator, typename RandomIterator>
RandomIterator RandomSample(InputIterator inBegin, InputIterator inEnd,
RandomIterator outBegin, RandomIterator outEnd);
/**
* Function: RandomSample(InputIterator inBegin, InputIterator inEnd,
* RandomIterator outBegin, RandomIterator outEnd,
* RandomGenerator rng);
* -------------------------------------------------------------------------
* Populates the output range [outBegin, outEnd) with a uniform random
* sample of the elements in the range [inBegin, inEnd). Internally, this
* function uses the generator rng to generate random numbers. If the input
* range does not contain enough elements, then only some of the values will
* be filled in and the algorithm will return an iterator to the last element
* written. If at least outEnd - outBegin elements were written, the return
* value is outEnd.
*/
template <typename InputIterator, typename RandomIterator, typename RandomGenerator>
RandomIterator RandomSample(InputIterator inBegin, InputIterator inEnd,
RandomIterator outBegin, RandomIterator outEnd,
RandomGenerator rng);
/* * * * * Implementation Below This Point * * * * */
/* Main implementation uses the parameterized generator. */
template <typename InputIterator, typename RandomIterator, typename RandomGenerator>
RandomIterator RandomSample(InputIterator inBegin, InputIterator inEnd,
RandomIterator outBegin, RandomIterator outEnd,
RandomGenerator rng) {
/* Try reading in outEnd - outBegin elements, aborting early if they can't
* be read.
*/
RandomIterator itr = outBegin;
for (; itr != outEnd && inBegin != inEnd; ++itr, ++inBegin)
*itr = *inBegin;
/* If we ran out of elements early, report that. We can detect this by
* checking whether our advancing iterator hit the end of the output
* range.
*/
if (itr != outEnd)
return itr;
/* For simplicity, cache the number of elements in the output range. */
const size_t numOutputSlots = outEnd - outBegin;
/* Now apply the main algorithm by reading elements and deciding whether to
* randomly evict an element or to skip it.
*/
for (size_t count = numOutputSlots; inBegin != inEnd; ++inBegin, ++count) {
size_t index = rng() % (count + 1);
if (index < numOutputSlots)
outBegin[index] = *inBegin;
}
/* Report that we read everything in by handing back the end of the output
* range.
*/
return outEnd;
}
/* Non-generator version just passes in rand to the generator version. */
template <typename InputIterator, typename RandomIterator>
RandomIterator RandomSample(InputIterator inBegin, InputIterator inEnd,
RandomIterator outBegin, RandomIterator outEnd) {
return RandomSample(inBegin, inEnd, outBegin, outEnd, std::rand);
}
#endif