## Improving on QBasic's Random Number Generator

Pixelmusement produces videos about MS-DOS games and software. Each video ends with a short, randomly-selected listing of financial backers. In ADG Filler #57, Kris revealed the selection process, and it absolutely fits the channel’s core theme: a QBasic program. His program relies on QBasic’s built-in pseudo random number generator (PRNG). Even accounting for the platform’s limitations, the PRNG is much poorer quality than it could be. Let’s discuss these weaknesses and figure out how to make the selection more fair.

Kris’s program seeds the PRNG with the system clock (`RANDOMIZE TIMER`, a QBasic idiom), populates an array with the backers represented as integers (indices), continuously shuffles the list until the user presses a key, then finally prints out a random selection from the array. Here’s a simplified version of the program (note: QBasic comments start with apostrophe `'`):

``````CONST ntickets = 203  ' input parameter
CONST nresults = 12

RANDOMIZE TIMER

DIM tickets(0 TO ntickets - 1) AS LONG
FOR i = 0 TO ntickets - 1
tickets(i) = i
NEXT

CLS
PRINT "Press any key to stop shuffling..."
DO
i = INT(RND * ntickets)
j = INT(RND * ntickets)
SWAP tickets(i), tickets(j)
LOOP WHILE INKEY\$ = ""

FOR i = 0 to nresults - 1
PRINT tickets(i)
NEXT
``````

This should be readable even if you don’t know QBasic. Note: In the real program, backers at higher tiers get multiple tickets in order to weight the results. This is accounted for in the final loop such that nobody appears more than once. It’s mostly irrelevant to the discussion here, so I’ve omitted it.

The final result is ultimately a function of just three inputs:

1. The system clock (`TIMER`)
2. The total number of tickets
3. The number of loop iterations until a key press

The second item has the nice property that by becoming a backer you influence the result.

### QBasic RND

QBasic’s PRNG is this 24-bit Linear Congruential Generator (LCG):

``````uint32_t
rnd24(uint32_t *s)
{
*s = (*s*0xfd43fd + 0xc39ec3) & 0xffffff;
return *s;
}
``````

The result is the entire 24-bit state. `RND` divides this by 2^24 and returns it as a single precision float so that the caller receives a value between 0 and 1 (exclusive).

Needless to say, this is a very poor PRNG. The LCG constants are reasonable, but the choice to limit the state to 24 bits is strange. According to the QBasic 16-bit assembly (note: the LCG constants listed here are wrong), the implementation is a full 32-bit multiply using 16-bit limbs, and it allocates and writes a full 32 bits when storing the state. As expected for the 8086, there was nothing gained by using only the lower 24 bits.

To illustrate how poor it is, here’s a randogram for this PRNG, which shows obvious structure. (This is a small slice of a 4096x4096 randogram where each of the 2^23 24-bit samples is plotted as two 12-bit coordinates.) Admittedly this far overtaxes the PRNG. With a 24-bit state, it’s only good for 4,096 (2^12) outputs, after which it no longer follows the birthday paradox: No outputs are repeated even though we should start seeing some. However, as I’ll soon show, this doesn’t actually matter.

Instead of discarding the high 8 bits — the highest quality output bits — QBasic’s designers should have discarded the low 8 bits for the output, turning it into a truncated 32-bit LCG:

``````uint32_t
rnd32(uint32_t *s)
{
*s = *s*0xfd43fd + 0xc39ec3;
return *s >> 8;
}
``````

This LCG would have the same performance, but significantly better quality. Here’s the randogram for this PRNG, and it is also heavily overtaxed (more than 65,536, 2^16 outputs). ### QBasic RANDOMIZE

That’s not the end of our troubles. The `RANDOMIZE` statement accepts a double precision (i.e. 64-bit) seed. The high 16 bits of its IEEE 754 binary representation are XORed with the next highest 16 bits. The high 16 bits of the PRNG state is set to this result. The lowest 8 bits are preserved.

To make this clearer, here’s a C implementation, verified against QBasic 7.1:

``````uint32_t s;

void
randomize(double seed)
{
uint64_t x;
memcpy(&x ,&seed, 8);
s = (x>>24 ^ x>>40) & 0xffff00 | (s & 0xff);
}
``````

In other words, `RANDOMIZE` only sets the PRNG to one of 65,536 possible states.

As the final piece, here’s how `RND` is implemented, also verified against QBasic 7.1:

``````float
rnd(float arg)
{
if (arg < 0) {
memcpy(&s, &arg, 4);
s = (s & 0xffffff) + (s >> 24);
}
if (arg != 0.0f) {
s = (s*0xfd43fd + 0xc39ec3) & 0xffffff;
}
return s / (float)0x1000000;
}
``````

### System clock seed

The `TIMER` function returns the single precision number of seconds since midnight with ~55ms precision (i.e. the 18.2Hz timer interrupt counter). This is strictly time of day, and the current date is not part of the result, unlike, say, the unix epoch.

This means there are only 1,572,480 distinct values returned by `TIMER`. That’s small even before considering that these map onto only 65,536 possible seeds with `RANDOMIZE` — all of which are fortunately realizable via `TIMER`.

Of the three inputs to random selection, this first one is looking pretty bad.

### Loop iterations

Kris’s idea of continuously mixing the array until he presses a key makes up for much of the QBasic PRNG weaknesses. He lets it run for over 200,000 array swaps — traversing over 2% of the PRNG’s period — and the array itself acts like an extended PRNG state, supplementing the 24-bit `RND` state.

Since iterations fly by quickly, the exact number of iterations becomes another source of entropy. The results will be quite different if it runs 214,600 iterations versus 273,500 iterations.

Possible improvement: Only exit the loop when a certain key is pressed. If any other key is pressed then that input and the `TIMER` are mixed into the PRNG state. Mashing the keyboard during the loop introduces more entropy.

### Replacing the PRNG

Since the built-in PRNG is so poor, we could improve the situation by implementing a new one in QBasic itself. The challenge is that QBasic has no unsigned integers, not even unsigned integer operators (i.e. Java and JavaScript’s `>>>`), and signed overflow is a run-time error. We can’t even re-implement QBasic’s own LCG without doing long multiplication in software, since the intermediate result overflows its 32-bit `LONG`.

Popular choices in these constraints are Park–Miller generator (as we saw in Bash) or a lagged Fibonacci generator (as used by Emacs, which was for a long time constrained to 29-bit integers).

However, I have a better idea: a PRNG based on RC4. Specifically, my own design called Sponge4, a sponge construction built atop RC4. In short: Mixing in more input is just a matter of running the key schedule again. Implementing this PRNG requires just two simple operations: modular addition over 2^8, and array swap. QBasic has a `SWAP` statement, so it’s a natural fit!

Sponge4 (RC4) has much higher quality output than the 24-bit LCG, and I can mix in more sources of entropy. With its 1,700-bit state, it can absorb quite a bit of entropy without loss.

#### Learning QBasic

Until this past weekend, I had not touched QBasic for about 23 years and had to learn it essentially from scratch. Though within a couple of hours I probably already understood it better than I ever had. That’s in large part because I’m far more experienced, but also probably because QBasic tutorials are universally awful. Not surprisingly they’re written for beginners, but they also seem to be all written by beginners, too. I soon got the impression that QBasic community has usually been another case of the blind leading the blind.

There’s little direct information for experienced programmers, and even the official documentation tends to be thin in important places. I wanted documentation that started with the core language semantics:

• The basic types are INTEGER (int16), LONG (int32), SINGLE (float32), DOUBLE (float64), and two flavors of STRING, fixed-width and variable-width. Late versions also had incomplete support for a 64-bit, 10,000x fixed-point CURRENCY type.

• Variables are SINGLE by default and do not need to be declared ahead of time. Arrays have 11 elements by default.

• Variables, constants, and functions may have a suffix if their type is not SINGLE: INTEGER `%`, LONG `&`, SINGLE `!`, DOUBLE `#`, STRING `\$`, and CURRENCY `@`. For functions, this is the return type.

• Each variable type has its own namespace, i.e. `i%` is distinct from `i&`. Arrays are also their own namespace, i.e. `i%` is distinct from `i%(0)` is distinct from `i&(0)`.

• Variables may be declared explicitly with `DIM`. Declaring a variable with `DIM` allows the suffix to be omitted. It also locks that name out of the other type namespaces, i.e. `DIM i AS LONG` makes any use of `i%` invalid in that scope. Though arrays and scalars can still have the same name even with `DIM` declarations.

• Numeric operations with mixed types implicitly promote like C.

• Functions and subroutines have a single, common namespace regardless of function suffix. As a result, the suffix can (usually) be omitted at function call sites. Built-in functions are special in this case.

• Despite initial appearances, QBasic is statically-typed.

• The default is pass-by-reference. Use `BYVAL` to pass by value.

• In array declarations, the parameter is not the size but the largest index. Multidimensional arrays are supported. Arrays need not be indexed starting at zero (e.g. `(x TO y)`), though this is the default.

• Strings are not arrays, but their own special thing with special accessor statements and functions.

• Scopes are module, subroutine, and function. “Global” variables must be declared with `SHARED`.

• Users can define custom structures with `TYPE`. Functions cannot return user-defined types and instead rely on pass-by-reference.

• A crude kind of dynamic allocation is supported with `REDIM` to resize `\$DYNAMIC` arrays at run-time. `ERASE` frees allocations.

These are the semantics I wanted to know getting started. Throw in some illustrative examples, and then it’s a tutorial for experienced developers. (Future article perhaps?) Anyway, that’s enough to follow along below.

#### Implementing Sponge4

Like RC4, I need a 256-element byte array, and two 1-byte indices, `i` and `j`. Sponge4 also keeps a third 1-byte counter, `k`, to count input.

``````TYPE sponge4
i AS INTEGER
j AS INTEGER
k AS INTEGER
s(0 TO 255) AS INTEGER
END TYPE
``````

QBasic doesn’t have a “byte” type. A fixed-size 256-byte string would normally be a good match here, but since they’re not arrays, strings are not compatible with `SWAP` and are not indexed efficiently. So instead I accept some wasted space and use 16-bit integers for everything.

There are four “methods” for this structure. Three are subroutines since they don’t return a value, but mutate the sponge. The last, `squeeze`, returns the next byte as an INTEGER (`%`).

``````DECLARE SUB init (r AS sponge4)
DECLARE SUB absorb (r AS sponge4, b AS INTEGER)
DECLARE SUB absorbstop (r AS sponge4)
DECLARE FUNCTION squeeze% (r AS sponge4)
``````

Initialization follows RC4:

``````SUB init (r AS sponge4)
r.i = 0
r.j = 0
r.k = 0
FOR i% = 0 TO 255
r.s(i%) = i%
NEXT
END SUB
``````

Absorbing a byte means running the RC4 key schedule one step. Absorbing a “stop” symbol, for separating inputs, transforms the state in a way that absorbing a byte cannot.

``````SUB absorb (r AS sponge4, b AS INTEGER)
r.j = (r.j + r.s(r.i) + b) MOD 256
SWAP r.s(r.i), r.s(r.j)
r.i = (r.i + 1) MOD 256
r.k = (r.k + 1) MOD 256
END SUB

SUB absorbstop (r AS sponge4)
r.j = (r.j + 1) MOD 256
END SUB
``````

Squeezing a byte may involve mixing the state first, then it runs the RC4 generator normally.

``````FUNCTION squeeze% (r AS sponge4)
IF r.k > 0 THEN
absorbstop r
DO WHILE r.k > 0
absorb r, r.k
LOOP
END IF

r.j = (r.j + r.i) MOD 256
r.i = (r.i + 1) MOD 256
SWAP r.s(r.i), r.s(r.j)
squeeze% = r.s((r.s(r.i) + r.s(r.j)) MOD 256)
END FUNCTION
``````

That’s the entire generator in QBasic! A couple more helper functions will be useful, though. One absorbs entire strings, and the second emits 24-bit results.

``````SUB absorbstr (r AS sponge4, s AS STRING)
FOR i% = 1 TO LEN(s)
absorb r, ASC(MID\$(s, i%))
NEXT
END SUB

FUNCTION squeeze24& (r AS sponge4)
b0& = squeeze%(r)
b1& = squeeze%(r)
b2& = squeeze%(r)
squeeze24& = b2& * &H10000 + b1& * &H100 + b0&
END FUNCTION
``````

QBasic doesn’t have bit-shift operations, so we must make due with multiplication. The `&H` is hexadecimal notation.

#### Putting the sponge to use

One of the problems with the original program is that only the time of day was a seed. Even were it mixed better, if we run the program at exactly the same instant on two different days, we get the same seed. The `DATE\$` function returns the current date, which we can absorb into the sponge to make the whole date part of the input.

``````DIM sponge AS sponge4
init sponge
absorbstr sponge, DATE\$
absorbstr sponge, MKS\$(TIMER)
absorbstr sponge, MKI\$(ntickets)
``````

I follow this up with the timer. It’s converted to a string with `MKS\$`, which returns the little-endian, single precision binary representation as a 4-byte string. `MKI\$` does the same for INTEGER, as a 2-byte string.

One of the problems with the original program was bias: Multiplying `RND` by a constant, then truncating the result to an integer is not uniform in most cases. Some numbers are selected slightly more often than others because 2^24 inputs cannot map uniformly onto, say, 10 outputs. With all the shuffling in the original it probably doesn’t make a practical difference, but I’d like to avoid it.

In my program I account for it by generating another number if it happens to fall into that extra “tail” part of the input distribution (very unlikely for small `ntickets`). The `squeezen` function uniformly generates a number in 0 to N (exclusive).

``````FUNCTION squeezen% (r AS sponge4, n AS INTEGER)
DO
x& = squeeze24&(r) - &H1000000 MOD n
LOOP WHILE x& < 0
squeezen% = x& MOD n
END FUNCTION
``````

Finally a Fisher–Yates shuffle, then print the first N elements:

``````FOR i% = ntickets - 1 TO 1 STEP -1
j% = squeezen%(sponge, i% + 1)
SWAP tickets(i%), tickets(j%)
NEXT

FOR i% = 1 TO nresults
PRINT tickets(i%)
NEXT
``````

Though if you really love Kris’s loop idea:

``````PRINT "Press Esc to finish, any other key for entropy..."
DO
c& = c& + 1
LOCATE 2, 1
PRINT "cycles ="; c&; "; keys ="; k%

FOR i% = ntickets - 1 TO 1 STEP -1
j% = squeezen%(sponge, i% + 1)
SWAP tickets(i%), tickets(j%)
NEXT

k\$ = INKEY\$
IF k\$ = CHR\$(27) THEN
EXIT DO
ELSEIF k\$ <> "" THEN
k% = k% + 1
absorbstr sponge, k\$
END IF
absorbstr sponge, MKS\$(TIMER)
LOOP
``````

If you want to try it out for yourself in, say, DOSBox, here’s the full source: `sponge4.bas`

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht , or see existing discussions.