Tuesday, January 3, 2023

The MOS 6502 is (mostly) Turing-complete without registers

It is known that the x86 MOV instruction is Turing-complete (PDF) all by itself, and is even a compiler target. More usefully, x86 can be made Turing-complete without the overt use of any registers.

These tricks work primarily because the ISA allows memory-to-memory operations, i.e., altering a memory location without explicitly moving data through a program-visible register, a historical holdover from its roots in the Intel 8086 and its ancestors. (Let's not even talk about its Turing-complete faults.) Other pre-RISC CPUs of that era also have memory-to-memory addressing, including the MOS 6502, which despite its simplicity being inspiration for the RISC ARM architecture is not itself RISC. It should be no surprise you can make the 6502 do this trick too even with its more constrained instruction set, and we can do it with just four instructions, not counting rts to return to the operating system.

I say the 6502 is "mostly" Turing-complete without using registers because there are two conditions on the 6502 needed to make this true. The first is that self-modifying code is impossible to avoid; the memory-to-memory instructions we will be using don't support indirect addressing, so to maintain the Turing machine's data pointer requires altering code inline. Dolan's paper above considers this cheating but this was so common on CPUs of the time, including the 8086, that I think this requirement is acceptable. The "uses no registers" example above does it too.

The other condition is how a register is defined. Most texts consider the 8-bit accumulator (A) and two 8-bit index registers (X and Y) of more limited function to be the 6502's canonical registers, but the 6502 also maintains an 8-bit stack pointer (S), a 16-bit program counter (PC) and eight bits of processor status (P, or as individual flags NV-BDIZC). On the x86 examples above, they don't count the program counter either and they make jumps (including, in the second case, to an indirect memory vector), so we're only doing the same. We will be making subroutine calls in this demonstration for maintaining the Turing machine's data pointer, which is definitely using the stack pointer, but this is simply expediency: our forthcoming example also works fine if you alternatively patch every place the data pointer is manipulated instead of encapsulating it in a set of global subroutines, and the stack pointer will be the same value at the conclusion regardless.

However, most 6502 instructions, including memory-to-memory ones, implicitly and mandatorily change flags in P (x86 MOV sets no flags). Additionally, the 8086 has an ALU as wide as its in-segment addresses (16 bits), as do its 32- and 64-bit descendants, so it can compute addresses with single instructions. This property is necessary for the "uses no registers" example, which calculates indirect jumps that way to "read" memory. The 6502's ALU is 8-bit and only supports 8-bit math but uses 16-bit addressing, and since its memory-to-memory instructions all unavoidably change the zero and negative flags (Z and N) anyway, we might as well use them to overcome the deficit. If you don't consider N or Z to be a true register — I've chosen not to because P is more impaired than even S, N and Z can't be set explicitly, and the majority of whatever ad-hoc 6502 calling conventions there are don't require saving them — then the argument holds. To make this a little less cheating-adjacent, we'll agree not to use any other processor flag including any memory-to-memory instructions that also modify carry (C), and it turns out we can get away with testing just Z anyway.

I should also parenthetically note that this concept generates absolutely horrid code and the paucity of 6502 registers means you would be much more effective by simply saving and restoring them. That said, if you're into this sort of Turing tarpit, you've already implicitly defenestrated efficiency; here we're merely throwing it through a couple windows more.

We'll demonstrate Turing completeness by showing how Corrado Böhm's six-instruction 𝒫′′ (P'') language, proven Turing-complete, can be written in this constrained set of 6502 instructions. Conveniently and more entertainingly for modern audiences, those six P'' instructions directly translate to Brainf*ck operators, hereafter BF (pardon the censoring, but this is after all a family-friendly blog), so we'll use those operators as shorthand.

Our example will run on the Commodore 64 or any emulator, and emits "HELLO" and a space to the top of the screen. The Commodore 64 has a memory-mapped screen that by default starts at location 1024, so we don't need to implement BF's I/O to print anything; we'll just treat screen memory as the Turing tape. (If you're adapting this to another 6502-based computer, you'll need to make analogous changes below.) The basic 6502 instructions we'll use are inc and dec. Among other addressing modes these two instructions increment and decrement a memory location by 1 respectively. Since these implicitly set N and Z, we'll allow the use of bne (branch if Z flag is false, i.e., last operation was non-zero) to handle 16-bit addresses and to implement testing zero. As 6502 branches have limited displacements, a jmp may be required, yielding our four instructions. Again, while we'll use jsr and rts to make this demonstration tractable, if we were willing to simultaneously patch the Turing data register everywhere it occurs we can still make the example work with just those four. That exercise is left to the exceptionally masochistic reader.

To begin, let's implement BF + and -, which are P'' r and r' and increase or decrease the memory location currently pointed to by 1 respectively. These are the only subroutines we will use. As we rely on self-modifying code we shall make the further (rational) assertion that no part of the program is running in zero page, i.e., below location 256, where special addressing rules and differing instruction lengths apply.

plus:
        inc $0400
        rts
minus:
        dec $0400
        rts

These necessarily have the data pointer encapsulated within them, which is assembled to ee0004 and ce0004 respectively. That makes increment jsr plus and decrement jsr minus.

BF >, which is P'' R and increments the data pointer, thus is

        inc plus+1
; skip the next instruction if Z flag false
; it is three bytes long
; include the size of the bne instruction
        bne *+5
        inc plus+2
        inc minus+1
        bne *+5
        inc minus+2

which increases the low byte of both instructions' operands, and if it overflows to zero, increases their high byte.

Neither inc nor dec set carry, nor did we agree to use it, so that makes BF < (P'' L, decrement the data pointer) slightly more complex. The N flag is appropriately defined as the high bit being set after an operation, but this would occur for any unsigned value from 128 to 255, meaning an initial dec with a result in that range would set this flag. What we want to do is to decrement the high byte of the data pointer only if the low byte started at zero. We can do by incrementing and decrementing first; to wit:

        inc plus+1
        dec plus+1
        bne *+5
        dec plus+2
        dec plus+1
        inc minus+1
        dec minus+1
        bne *+5
        dec minus+2
        dec minus+1

Finally BF's [ and ], which are P'' ( and ). These implement a jump forward to the matching ] if the data pointer is already zero, and a jump backwards to the matching [ if it isn't. Note that the range for bne's single signed byte offset may well be exceeded by even a modestly complex loop, possibly requiring a trampoline jmp. In the general case that potential requirement makes [ into

; test value under data pointer
        jsr plus
        jsr minus
        bne *+5
        jmp loop_end_label
loop_top_label:

where the label name of course differs per loop.

Like we did for </L, we do an increment and decrement as a test for zero because we can't assume that the Z flag is already correctly set for the location the data pointer is currently pointing to. We might be able to reason about that in a more advanced compiler, such as omitting the test if the last instruction was + or -, but I'm lazy and this suffices for pedagogical purposes. Unfortunately we have to assume the same for ], so:

        jsr plus
        jsr minus
        bne *+5
        jmp loop_end_label
        jmp loop_top_label
loop_end_label:

You can see why I've used subroutines here because we only have two instructions to patch versus keeping track of a whole bunch and patching them all every single time, but as I said, you can do it that way and I look forward to someone trying to.

That suffices for the operators; now for the runtime environment. We can clear a memory location without the use of A, X or Y with eight asls or shifts left (or lsr shift rights), but this also sets C, and since we agreed not to use that flag we're gonna do it the hard way. To make our program re-entrant we need to ensure the data pointer is set to 1024 on each run (change this for your system if you're not running it on a C64). That means we must decrement the low byte of the data pointer to zero and set the high byte to four no matter what those actually started at. Our entry code thus is

        dec plus+1
        bne *-3
        dec minus+1
        bne *-3
        dec plus+2
        bne *-3
        dec minus+2
        bne *-3
        inc plus+2
        inc plus+2
        inc plus+2
        inc plus+2
        inc minus+2
        inc minus+2
        inc minus+2
        inc minus+2

On top of all that, P'' assumes the Turing tape is clean, i.e., all symbols are c0. In BF terms, that means the entire memory range defaults to zero, but screen memory on the C64 defaults to byte value 32 (the space character) plus anything else printed by the Kernal or BASIC. We could simply clear the six locations from 1024 to 1029 the same way to run our demonstration program, but this doesn't suffice in the general case, so in a diversion from spec we're going to make our job harder by requiring the program itself to clear the memory it uses. The program we'll be running is, and is valid BF,

[-]>[-]>[-]>[-]>[-]>[-]
<<<<<
++++++++>
+++++>
++++++++++++>
++++++++++++>
+++++++++++++++>
++++++++++++++++++++++++++++++++

This sets memory locations 1024 through 1029 to values 8, 5, 12, 12, 15 and 32 respectively (recall that on the C64, screen codes are neither PETSCII nor ASCII). Again, change these values if you are not running this on a C64 or other Commodore 8-bit.

We now have enough of a specification and a test program to write a minimal compiler (in Perl, because I'm one of those people). I've put the compiler, the test program and test output up as a Github gist. Let me say it again: this is an absolutely potty BF-to-6502 compiler, and it is actually non-standard by requiring the programs it compiles assume previously dirtied memory which orthodox BF does not. The innovation here is that the code it emits does not use any of the 6502's canonical registers, with naturally the exception of PC and the N and Z flags (and for this situation S, which exits the routine unchanged).

You will need a 6502 assembler to compile the output, for which I invariably recommend xa65, which yours truly maintains. Assuming you are using it as well, invoke like so:

perl -s no-reg-6502-bf.pl < test.bf > test.s
xa test.s

I've provided the output in the gist so you can look at the generated assembly if you don't want to actually run the commands. If you choose to do so, though, you'll get a file named a.o65, assembled to start at 49152 (pass -sa=address to the Perl compiler to use a different location). Load it into your real C64 or emulator of choice:

Thus endeth the demonstration.

No comments:

Post a Comment

Comments are subject to moderation. Be nice.