Pseudo-random number generators

Author

Jeet Sukumaran

Key concept: Pseudo-random number generators use finite internal state and seeded algorithms to produce reproducible sequences that statistically approximate random behavior for practical and scientific use.

A computer algorithm cannot generate a truly random result. Computer algorithms are deterministic. However, a computer algorithm can generate a sequence of numbers based on a series of deterministic calculations that behave like a sequence of random numbers would under statistical analysis. These algorithms are known as pseudo-random number generators (PRNGs). Pseudo-random number generators are deterministic, and, being deterministic, their results — the sequence of numbers they generate — are completely determined by the algorithm and the initial seed.

A computer program (or algorithm) is a deterministic machine. For some given input, it produces a particular output, based on its programming. It cannot produce a truly random value in the sense of being fundamentally unpredictable from its inputs, as, for any particular input, an algorithm will produce the same output. For many applications, however, what is needed are not values that are absolutely random, but rather a series of numbers that behave as though they are random with respect to each other, at least up to some scale — values that can be generated one after another that do not exhibit detectable periodicities, patterns, ordering, or structure in the sequence, at least for the scale or size relevant to the application. (For applications where true physical unpredictability is strictly required, computers rely on hardware incorporating some physical source of randomness, such as radioactive decay or other entropy-harvesting mechanisms.)

A pseudo-random number generator is a computer algorithm that can produce a series of values that behave like a random sequence with no detectable periodicity or pattern, statistically or otherwise, for a very long run of numbers. (The number of values a particular PRNG can generate before repeating is known as its period. All PRNGs have finite periods, i.e. will eventually begin repeating the sequence, but good PRNGs, and the ones we will be using, have HUGE periods (\(2^{128}\), for example). All PRNGs have finite periods because they must operate with a finite internal state represented with a finite number of bits. The number of bits used to represent the internal state that the PRNG tracks sets the upper bound on the number of distinct states the generator can occupy. This also sets an upper bound on the length of the sequence before repetition must occur. At some point, the update step will produce a state value that has been used before in the sequence, resulting in the sequence repeating from that point. A 32-bit state PRNG might repeat after at most \(2^{32}\) steps, while cryptographic applications (and many modern computing platforms, including Julia) use 128-bit or larger states, resulting in periods on the order of \(2^{128}\) or larger.)

The sequence of numbers generated by a PRNG behaves as though it were random in most practical statistical senses, including measures such as empirical entropy or goodness-of-fit to a uniform probability distribution, except for one crucial feature: the sequence is completely determined by the algorithm and the seed. For most non-cryptographic applications, this determinism is not a limitation. What matters is that there is no detectable ordering, patterns, autocorrelations, cross-correlations, or other structure in the numbers at the scale relevant to the application, and that the empirical frequency distribution of values approximates that of a uniform probability distribution. This a well-designed PRNG can deliver, provided usage does not exceed its intended operational limits. (While the entire sequence of numbers generated by a PRNG is determined by the algorithm and initial seed, predicting future outputs without knowledge of the internal state or transition function may be computationally infeasible in the case of cryptographically secure generators, though it is often feasible for simpler non-cryptographic PRNGs.)

Given an initial starting or “seed” value, a PRNG algorithm computes and returns a value, and then updates an internal state that is used to generate the next value, thus forming a chain of state updates and outputs. In more complex algorithms, a separate internal state consisting of multiple values is updated from the previous state, and a separate function is used to calculate the output from this internal state. It is this internal state that is passed to the next step rather than the output value itself. In this way, with each state update deterministically producing the next, a sequence of numbers is generated from the initial seed. The entire sequence is therefore completely determined by the seed value. Any particular seed value given to the algorithm, e.g., 108 or 99, will always return the exact same sequence of numbers. While not truly random in a physical sense, the sequence behaves as random in that there are no detectable autocorrelations, patterns, or short-period repetitions at the scale for which the generator is designed.

This determinism of the number sequence from a pseudo-random number generator is not a practical limitation in most scientific computing contexts, and, in fact, has important roles in software engineering as well as research. It can be critically useful in software development: the key to solving a bug is to be able to reliably replicate it, and the reproducibility of a pseudo-random number sequence allows for debugging of programs that rely on randomness, including scientific software (e.g. Bayesian Markov chain Monte Carlo inference programs). It can also be very useful in scientific research: often it is sufficient to record the random seed used for a particular suite of simulations or stochastic heuristic-based inference for replicability purposes, instead of storing the giga-, tera-, or peta-bytes of data generated downstream from the random number sequence itself.