Saturday, April 27, 2024

Virtualizing the 6502 on a 6502 with 6o6 (and The Incredible KIMplement goes 1.0)

Okay, promises, promises. Here's the first of my bucket list projects I'm completing which I've intermittently worked on for literally two decades. Now that I've finally shaken out more bugs, tuned it up and cleaned it off, it's time to let people play with the source code.
This is the official 1.0 release of the Incredible KIMplement, an emulator of the one kilobyte, 1MHz MOS/Commodore KIM-1 6502-based single board computer. It provides access to the KIM's built-in TTY support (even through your computer's real serial port) and has expanded RAM with 16K of addressing space, all on an unexpanded stock Commodore 64.

It's almost burying the lede to announce that, though, because the real meat in this entry is how the Commodore 64 manages to emulate a very different 6502-based system. That piece is "6o6," for "6502-on-6502," and is a full virtualized software NMOS 6502 CPU that runs on a 6502 CPU — which I've open-sourced too. It has full control of guest code execution, including trapping undocumented and jam opcodes, and completely abstracts all memory access, making it possible to remap addresses, intercept illegal reads or writes, or even run entirely from virtual memory. On top of that, it's complete enough to not only pass a full functional test but also virtualize itself virtualizing itself:

These GIF screencasts are real-time with no tricks. Here a Commodore 64 and Apple IIe are both running a guest "hello world" payload within 6o6 (stage 1), which is nearly instantaneous, then 6o6 running the payload as a payload within another instance of 6o6 (stage 2), which is a little slower, then 6o6 running 6o6 running 6o6 running the payload (stage 3), which is glacial. But all of it works!

I think all nerds at a certain age want to design "the ultimate operating system." I made a lot of attempts at this as a kid, mostly concentrating on windowing and menus, and none of them progressed much beyond the level of primitive tech demos. I have a few of these disks still around to smile at.

When I had my first taste of multiuser systems, it hadn't occurred to me before that people might (unwittingly or otherwise) try to run crap code. After all, I'd grown up in an era where computers generally ran one program that had control of the entire machine, so operating systems — to naïve little me, at least — were more of an exercise in interface design. The program was expected to take over everything, so if the program misbehaved, you simply reset the computer. But people expect multiuser systems not to reboot randomly when they're in the middle of something. An effective system would have to either deal with the fallout or actively protect you and others from code that might (unwittingly or otherwise) corrupt other address spaces, run deleterious instructions or hog all the resources.

Naturally, the trick is how a little 6502 with less than 4,000 transistors can guard against all that. Although memory management with the 6502 was improved with bank-switching hardware — the C64 could have never juggled its own 84K of addressing space without the I/O port on its 6510 — few historical NMOS 6502 systems could move around the special zero page or the processor stack in memory (as it happens, the Commodore 128 is one of those systems but little software takes advantage of it), and none could remap addresses such that program code could be moved to and run from an arbitrary location without fixups. Nor could they comprehensively prohibit programs from accessing certain memory locations, and if any of the number of undocumented jam or "KIL" opcodes were executed, then the processor would lock up completely. (These opcodes are unintended consequences of how the original NMOS 6502 decodes instructions, and CMOS versions of the 6502 make them into no-ops or repurpose them for other operations.)

Some other issues could also be conceivably mitigated with hardware. A process determined to monopolize the system by setting the interrupt flag (to disable interrupts) can still be halted externally with a source of periodic non-maskable interrupts (NMIs), which is how some multitasking 6502 kernels implement preemptive task switching, assuming that source exists. For mitigating dangerous instructions, in-circuit emulators did exist — the Eastern House Software "Trap65" comes to mind, which turned bogus opcodes into trappable BRKs — but they were very expensive and at the time weren't generally sophisticated enough to directly manipulate live bus lines in a complex fashion. And, of course, you'd still need some sort of supervisor/hypervisor kernel to control all this.

So what if we just do it all in software? Even a simple-minded interpreter isn't completely nuts: the 6502 has few registers, so keeping track of the processor state is easy, with a manageable 56 instructions and only a handful of addressing modes. We can abstract memory access in any way we want, and we can throw controlled exceptions for bad instructions or protection faults instead of crashing or hanging.
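The interpreter idea above can be sketched in a few lines of Python — this is a hedged illustration of the concept, not 6o6's actual structure: the 6502's tiny register file fits in a handful of variables, and the dispatch loop turns jam ("KIL") opcodes into a controlled trap instead of a hang.

```python
# Illustrative sketch only: a minimal 6502-style dispatch loop that traps
# jam opcodes. The NMOS jam opcodes listed are the real ones; the CPU class
# and its two implemented instructions are just for demonstration.
JAM_OPCODES = {0x02, 0x12, 0x22, 0x32, 0x42, 0x52,
               0x62, 0x72, 0x92, 0xB2, 0xD2, 0xF2}

class Trap(Exception):
    pass

class CPU:
    def __init__(self, mem):
        self.mem = mem                        # guest memory, fully abstracted
        self.a = self.x = self.y = 0
        self.s, self.p, self.pc = 0xFF, 0x00, 0x0000

    def step(self):
        op = self.mem[self.pc]
        if op in JAM_OPCODES:
            raise Trap("jam opcode $%02X at $%04X" % (op, self.pc))
        if op == 0xE8:                        # INX
            self.x = (self.x + 1) & 0xFF
        elif op == 0xEA:                      # NOP
            pass
        else:
            raise Trap("unimplemented opcode $%02X" % op)
        self.pc = (self.pc + 1) & 0xFFFF
```

A real interpreter would fill out the whole opcode table, of course; the point is that a bad opcode becomes a catchable exception rather than a locked-up machine.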

The "virtualization" part comes in by 6o6 also using the host's ALU to do processor-internal operations on behalf of the guest. Again, this is facilitated by how simple the 6502's internal state is. If we're asked to add a value to the accumulator, we load the guest register and processor flags on the host 6502, run the same instruction the guest would have, record the result and processor flags and clean up. (This is how we avoid a guest with, say, the interrupt flag and decimal flag set from corrupting our own state.) Here's an example from the source code for the ADC # immediate (opcode $69) instruction.

-IMMPTR = hpadc69+1
opadc69 ARGIMM
        GETAP
hpadc69 adc #$00
        PUTAP

This uses xa preprocessor macros to fetch an immediate argument from the instruction stream and store it as the operand to the immediate instruction at hpadc69, then loads the accumulator and processor state from the guest into the host CPU, runs the instruction, stores the accumulator and processor state and clears the I and D flags. (Do note this is self-modifying in case you wanted to put it in ROM.) Not only does this give us the answer without having to implement the logic ourselves, but we also get all the CPU flags precisely handled "for free," and stuff like decimal mode (i.e., BCD) arithmetic just works as a natural consequence. Similarly, if we're asked to shift the accumulator, or subtract, or increment or decrement the X or Y registers, we do the same thing.

Even if we're just loading or operating on a value from memory, or transferring registers, we still do this because we still have to set the negative and zero flags. As an example, here's an indirect indexed LDA (),Y (opcode $b1), where we take the result from memory and then load it as an immediate. Notice we don't bother loading the guest accumulator since we'll immediately clobber it.

-IMMPTR = ipldab1+1
opldab1 ARGINYF
        GETP
ipldab1 lda #$00
        PUTAP

Regardless of how we get there, the approach yields a result exactly the same as a 6502 because a 6502 computed it, and we get the results faster than doing the math and flags manually.

That, in a nutshell, was my first version of 6o6 in 2002. As a test case to emulate the simplest 6502 system I could think of, I added code to support the LEDs and TTY, and that became the first version of the Incredible KIMplement.

In our little virtual world, the 6o6 VM is not the prime mover, just the engine. Since the original intent was to be part of a future operating system, I believed strongly that any 6o6-based environment should be highly modular, and this diagram reflects that. The virtual machine (in red) is almost completely hardware-agnostic other than assuming it runs on a real 6502. As such, in order to run an arbitrary payload, i.e., the guest, there are two other parts (in green) required to specify a complete system.

The first part is the harness, which is 6o6's interface to guest memory and the rest of the managed hardware. The harness has a standardized jump table that serves as its binary interface. It implements loads and stores for a given address, including instruction fetches, and maintains the hardware stack and stack pointer. The VM assumes nothing — not even the page size or that memory is even paged. (By the way, while the VM calls the harness to get and store the 8-bit stack pointer register using X, it also calls the harness for all pushes and pops because it deliberately doesn't touch the stack itself. That means the VM doesn't have an actual dependency on the location or size of the stack, so regardless of what you say is the stack pointer value, nothing says you couldn't implement a bigger stack internally.)
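As a loose analogy, the harness contract might look like this rendered in Python — the method names are my own invention (the real interface is a 6502 jump table), and the flat-memory implementation is just the simplest possible example:

```python
# Illustrative Python rendering of the harness contract: loads, stores,
# pushes, pulls, and stack pointer access, all abstracted away from the VM.
from abc import ABC, abstractmethod

class Harness(ABC):
    @abstractmethod
    def poke(self, vaddr, value): ...    # store a byte at a virtual address
    @abstractmethod
    def peek(self, vaddr): ...           # load a byte (incl. instruction fetch)
    @abstractmethod
    def push(self, value): ...           # push onto the guest stack
    @abstractmethod
    def pull(self): ...                  # pull from the guest stack
    @abstractmethod
    def get_sp(self): ...                # read the 8-bit stack pointer
    @abstractmethod
    def set_sp(self, value): ...         # write the 8-bit stack pointer

class FlatHarness(Harness):
    """Trivial harness: flat 64K with the stack at $0100, as on a stock 6502.
    Nothing requires this layout -- the VM never touches memory directly."""
    def __init__(self):
        self.mem = bytearray(0x10000)
        self.sp = 0xFF
    def poke(self, vaddr, value): self.mem[vaddr & 0xFFFF] = value & 0xFF
    def peek(self, vaddr): return self.mem[vaddr & 0xFFFF]
    def push(self, value):
        self.poke(0x0100 + self.sp, value)
        self.sp = (self.sp - 1) & 0xFF
    def pull(self):
        self.sp = (self.sp + 1) & 0xFF
        return self.peek(0x0100 + self.sp)
    def get_sp(self): return self.sp
    def set_sp(self, value): self.sp = value & 0xFF
```

Because the VM routes even pushes and pulls through this interface, a harness could just as well back the "stack" with something bigger than 256 bytes, exactly as the parenthetical above suggests.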

Converting a virtual to a physical address could be as simple as addition or bit shifts, but if you wanted to implement something more complex like a paged virtual memory scheme, you'd also do it here: instead of generating page faults for something else to handle, the harness itself would do the paging in and out as part of a load or store. The harness can also raise protection exceptions. Here's an example store from the included demonstrations (we'll talk about it a little later too):

        * = HARNESS

        jmp mpeek
        jmp spush
        jmp spull
        jmp stsx
        jmp stxs

        ; these drivers should not change dptr

        ; poke .a into location indicated by dptr
        ; okay to clobber x and y
mpoke   tax
        lda dptr
        sta hhold0
        lda dptr+1
        cmp #>KERNEL
        bcs mpokehi
        ; $0000 and $0100 go to EMURAM and EMURAM+0x100
mpokelo adc #>EMURAM
        bcc mpokec      ; carry still clear
        ; accesses to kernel and up go to the payload
        ; add offset keeping in mind carry is set
mpokehi adc #(PAYLOAD >> 8)-(KERNEL >> 8)-1
        bcc mpokehj
        ; fault if the result wraps
mpokefa lda #R_MFAULT
        jmp BAILOUT
mpokehj ; fault if the result hits I/O
#ifdef C64
        ; VIC
        cmp #$d0
#else
        ; Apple II
        cmp #$c0
#endif
        bcs mpokefa
        ; OK to store
mpokec  sta hhold1
        txa
        ldy #0
        sta (hhold0),y
        rts

I included the jump table so you can see the functions the harness is expected to provide. dptr is the virtual address pointer provided by 6o6 for the harness to dereference and hhold0/1 are zero page work areas the harness can use for this task. EMURAM indicates where the bottom of guest memory is in physical memory, PAYLOAD indicates where the payload is in physical memory, and KERNEL is the address of the kernel, which in this case may or may not be a physical address (more later). Here, accesses to virtual zero page and the stack go to the same place offset by EMURAM and accesses to the kernel address and up are offset by the computation at mpokehi. If the resulting physical address wraps or hits I/O, then this routine will raise an exception, though it could just as easily ignore it. Otherwise, having computed the physical address, it performs the store and returns to the VM. You should note that the way this scheme is constructed, memory above the stack but below the starting address of the payload may be aliased; also, the full 64K addressing space is not available beyond a certain point. Neither is an intrinsic limitation of 6o6, just how this particular harness was written for pedagogic purposes.
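The translation this harness performs is easier to see written out in Python. All the constants below are assumptions for illustration; the real values come from the build.

```python
# Rough Python model of the demo harness's address translation.
# Constants are illustrative placeholders, not the project's actual values.
EMURAM  = 0x8000   # assumed: physical home of guest zero page and stack
KERNEL  = 0x2000   # assumed: guest-visible start of the payload/kernel
PAYLOAD = 0x4000   # assumed: physical home of the payload
IO_PAGE = 0xD0     # C64 case: fault if the physical page reaches the VIC

class MemoryFault(Exception):
    pass

def translate(vaddr):
    page = vaddr >> 8
    if page < (KERNEL >> 8):
        # virtual $0000/$0100 (zero page and stack) land at EMURAM
        return (EMURAM + (page << 8)) | (vaddr & 0xFF)
    # the kernel address and up are offset into the payload area
    phys_page = page + (PAYLOAD >> 8) - (KERNEL >> 8)
    if phys_page > 0xFF or phys_page >= IO_PAGE:
        raise MemoryFault("$%04x" % vaddr)   # wrapped, or hit I/O
    return (phys_page << 8) | (vaddr & 0xFF)
```

With these placeholder constants, virtual $0010 lands at $8010 and virtual KERNEL lands at PAYLOAD, mirroring what the mpokelo and mpokehi paths compute with a single add each.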

Now, suppose you do have an MMU or some other hardware assist for managing memory. The beauty of it is, you just write it into the harness and make the harness the driver for your hardware; 6o6's memory model is so completely abstracted that it won't notice the difference. I'll have an example of that later on.

The second part is the kernel, which is what actually kicks off the VM. The kernel and its associated headers define for the VM where the harness is located and what zero page locations it can use for registers, which are shared with the kernel so that the kernel can initialize and manipulate them. The kernel next calls the VM to run an instruction from the payload (or a group of instructions; more in a moment). The VM executes some portion of the instruction stream and returns a status code to the kernel, which may include an exception from either the harness or 6o6 itself that the kernel needs to act on, after which the kernel can inspect or modify the guest processor state and do any needed emulation tasks. For example, it might notice the PC is at a service routine it handles natively, so it does that task, adjusts the registers with any return value, and (I provide a utility routine for this) pulls the return address off the stack at the end. All necessary work having been done, it goes back to call the VM again for the next payload instruction or instruction group, and so on and so forth, looping until termination.

Between calls to the VM the guest CPU is "frozen," making it possible to completely capture the guest state for later reconstitution, or context-switch to a new task or environment by just swapping in a new set of registers. The kernel also determines when externally triggered events occur because the VM doesn't fire virtual IRQs or NMIs (or, for that matter, resets); the kernel decides when such events should take place and accordingly loads the stack (there's a routine for this too) and sets the guest program counter (PC) and any other needed registers. Although the VM supports the BRK instruction, traditionally treated as a "software" interrupt, it will set up the stack for you but, instead of setting the PC to the new location, return an exception. Your kernel can handle it in the conventional way by setting the PC to the IRQ vector, or perhaps to some other vector, or perhaps handle the situation natively, or even just ignore it.

The harness in KIMplement provides a virtual standard 6502 stack in the usual location and a KIM-4 expansion device with read/write memory between $0000 and $17ff, ROM from $1800 to $1fff, RAM from $2000 to $3fff, unmapped unwriteable space from $4000 to $fff7, and mirroring $1ff8-$1fff to $fff8-$ffff for vectors. The host C64 stores the low 16K at $4000 through $7fff and the rest is synthesized by the harness, so in most cases computing the effective physical address is just bit twiddling and adds. Meanwhile, the KIMplement kernel handles emulating the RRIOTs (picking up the writes from the harness), drawing the LEDs, servicing the TTY, inserting NMIs for stops and the Single-Step Switch, and also trapping portions of the KIM-1 ROM monitor. After a VM run the KIMplement kernel examines 6o6's PC to see if we want to intercept the current routine, which is how the TTY and certain other features are implemented.

There are two other performance enhancements in 6o6. The clever among you will have already recognized the 6o6-harness edge as a potentially massive bottleneck, since even the simplest instructions will involve at least one fetch and things like indirect addressing could require many. To eliminate the overhead of a subroutine call (at least 12 cycles) on every byte, as part of KIMplement 0.2 I allowed memory loads in 6o6 to be partially inlined with preprocessor macros. These optional macros provide inlineable load routines — one for any arbitrary virtual address and an optimized alternative for zero page — which are linked right into 6o6. Doing so yielded a substantial speedup, but at the cost of bloating the VM, so I chose to keep stores as subroutine calls since they happen less frequently and are usually more complicated than fetches. In this very latest iteration, I also cleaned up an inefficiency in how the inline macro accesses the program counter, which improves instruction fetches further.

The other means is an optional and primitive form of instruction fusion I call "extra helpings," which I added to 6o6 for KIMplement 0.3. For instructions that don't touch memory, instruction fetches notwithstanding, there's no need to return to the kernel right away because there would likely be no operation the kernel needs to observe yet. This includes immediates, accumulator oriented instructions, most implied instructions (anything that wouldn't call into the harness or access the processor stack), and branches if they are not taken. For example, something like clc:adc #32:tax:inx:cpx #50 is otherwise handled all within the processor, so there's no reason from the kernel's perspective that these sorts of instructions couldn't all be taken together as a group. However, if an instruction does any loads or stores, or the PC changes to anything but the next instruction, or there's an exception (duh), then the effect is potentially observable by the kernel and the VM stops trying to cram in more. This doesn't make the VM any faster — in fact, if we have the "extra helpings" gated on where the PC is in a page, as KIMplement does to handle exact ROM traps, it's slightly slower — but it also means that the kernel doesn't have to run uselessly after every instruction where nothing could have changed, so instead it makes everything else faster. The approach is profitable, as we'll show in a moment, but it does interfere with applications that need precise control of the program counter and so there are options to progressively or completely disable it.
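The gating decision behind "extra helpings" can be sketched as a simple classifier. This is my own hedged illustration — the opcode sets are real NMOS 6502 opcodes but deliberately exclude anything that calls into the harness or touches the processor stack (so no PHA/PLP, no TXS/TSX), per the rule described above:

```python
# Sketch of the "extra helpings" gating test: keep executing inside the VM
# as long as the instruction's effects can't yet be observed by the kernel.
IMMEDIATES  = {0x09, 0x29, 0x49, 0x69, 0xA0, 0xA2,
               0xA9, 0xC0, 0xC9, 0xE0, 0xE9}
ACCUMULATOR = {0x0A, 0x2A, 0x4A, 0x6A}            # ASL/ROL/LSR/ROR A
IMPLIED     = {0x18, 0x38, 0x58, 0x78, 0x88, 0x8A, 0x98, 0xA8,
               0xAA, 0xB8, 0xC8, 0xCA, 0xD8, 0xE8, 0xEA, 0xF8}
BRANCHES    = {0x10, 0x30, 0x50, 0x70, 0x90, 0xB0, 0xD0, 0xF0}

def can_fuse(opcode, branch_taken=False, faulted=False):
    """True if the kernel couldn't have observed anything yet."""
    if faulted:
        return False                  # exceptions are always observable
    if opcode in BRANCHES:
        return not branch_taken       # untaken branch changes nothing visible
    return opcode in IMMEDIATES or opcode in ACCUMULATOR or opcode in IMPLIED
```

Anything that loads, stores, or redirects the PC falls out of the fused group and hands control back to the kernel, which is exactly the observable-effect boundary the paragraph above describes.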

That's a good segue to talk about 6o6's built-in validation and testing, because we can directly test any optimizations we make and see if that lowers execution time or instruction count. As part of cleaning up the code I pulled down Klaus Dormann's functional test suite, using the binary he provides so that there's nothing up my sleeve. This has been a widely accepted stress test for 6502 systems for over a decade. The provided binary assumes nothing about the hardware and signals an exit by entering an infinite loop. If the loop isn't at the point marked as successful completion, then we conclude a failure has occurred.

So that I can test changes easily and quickly, the test suite uses Ian Piumarta's venerable lib6502 CPU emulator and runs directly from the shell. To detect the test suite's "signal" loops I modified David Schmenk's patch as part of his PLASMA project to implement single-stepping, and ran Klaus' suite directly against lib6502 to start with (after all, since 6o6 relies on the host CPU's ALU, it's only as accurate as the ALU that's present). Interestingly, lib6502 failed due to an edge case in decimal mode which I also had to patch. I don't claim my patch is optimal or even correct, but it passes now.

Next we mix in 6o6. The binary Klaus provides is a full 64K, so there's no room in the basic 6502 address space for both 6o6 and the test suite at the same time. I solved this with another patch to lib6502 to create a single bankswitched minimal system for $7000-$efff (32K), putting the first 32K of Klaus' binary in the first bank and the second half in the other. The test harness will bank in the proper segment depending on which address is being referenced. The first stage of testing exercises this harness with a payload that stores to each bank and ensures the values are correct. We test this three different ways, one with no extra helpings and no inline fetch macros, then one with inline fetch macros but still no extra helpings, and then both extra helpings and inline fetch macros. If that passes, we move on to the same configurations tested against Klaus' suite. I had to add a synthetic "X" bit to the status register and adjust the code for BRK slightly, but once done, it passed in all three forms.

The lib6502 patch I added also reports the number of opcodes executed. (At some point I'll do cycle counts, but even just an instruction count is already useful information.) Against the test suite, the naked lib6502 without 6o6 present executes it in 30,646,178 instructions, while the three configurations of 6o6 from least to most optimized execute it in 2,188,322,914, 1,713,350,225 and 1,602,516,769 instructions respectively. Those are significant savings: at its fastest 6o6 is executing 26.8% fewer instructions than the least optimized version and an average of 52.3 instructions for every guest instruction. Considering those figures necessarily include the harness and the kernel as well as 6o6, I think that's a darn good number. Notice that while inline fetch macros make the biggest delta, the improvement with extra helpings is certainly no slouch, and the instructions per instruction should not be interpreted as a speed multiplier since different instructions have different cycle counts.
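Those raw counts can be checked directly; a quick Python back-of-envelope reproduces the instructions-per-guest-instruction figure:

```python
# Sanity-checking the per-instruction figure from the raw counts quoted above.
native_6502 = 30_646_178          # Klaus suite on bare lib6502
best_6o6    = 1_602_516_769       # most optimized 6o6 configuration
per_guest   = best_6o6 / native_6502
print(round(per_guest, 1))        # → 52.3 host instructions per guest one
```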

I've included four examples with 6o6, not counting KIMplement, its original application. Three of the four will run on either an unexpanded Commodore 64 or on an Apple IIe with 64K (tested using DOS 3.3), and I'll discuss the fourth after that. (If someone wants to submit an Atari 8-bit port, open a pull request.) You've already met one, so let's meet them all.

The first, and simplest, is a "hello world." It runs a program which prints "hello world" first natively on the CPU, then through 6o6. xa handles the character set, so the Makefiles set the right options to make the string correct on either the Apple II or the C64. Here's the payload in nearly its entirety:

        * = KERNEL
        ; NOT payload!

        ldx #0
:       lda string,x
        beq :+
        jsr CHROUT
        inx
        bne :-
:       rts
        
        ; character set handled by xa
string  .asc "hello world", CR, $00

This is a very normal-looking blob of code (on the Apple II we intentionally don't end with a jmp $03d0, though that wouldn't be necessary for ProDOS, of course — I'll explain why in a second). You'll notice this calls a character-out routine, which maps to either $ffd2 on the Commodore or $fded on the Apple; when the kernel sees the call, it just passes it through to the ROM routine. The kernel is also small, and looks like this:

        * = KERNEL

cold    lda #<KERNEL
        sta pc
        lda #>KERNEL
        sta pc+1
        lda #0
        sta preg
        lda #$ff
        sta sptr

lup     jsr VMU

        ; if we get a stack underflow, treat as clean termination status
        cmp #R_STACKUNDER
        bne chekok
dun     rts             ; propagates up

        ; otherwise we don't handle any return status other than OK
chekok  cmp #R_OK
        beq chekpc
        ; err out, wait for a rescue
bail    sta $d020
        inc $d020
        jmp bail

        ; check for a call to $ffd2 and redirect to Kernal call
chekpc  lda pc+1
        cmp #>CHROUT
        bne lup
        lda pc
        cmp #<CHROUT
        bne lup

kffd2   lda areg
        jsr CHROUT      ; propagates up
        jsr DORTS
        cmp #R_STACKUNDER
        beq dun
        jmp lup

Our cold start sets the start address and zaps both the processor status register and stack pointer, and then starts calling the VM. As the PC ticks along, if the kernel discovers that it points to CHROUT, it grabs the guest accumulator, calls that routine on behalf of the payload, pulls the return address off the stack and returns to the loop.

You'll have noticed that both the kernel and the payload have the same starting address of KERNEL — and that's because this kernel can also be its own payload! That's how it virtualized itself virtualizing itself in those animated GIFs back at the beginning, using the exact same harness and kernel as this example. In fact, I've symlinked them to make it clear they're exactly the same. Only the main program that kicks all that off is different because it has to shift things forward in memory. Here's how that looks (both the C64 and the Apple IIe use the same memory map):

The bright red "caps" on the right of each bar are the ultimate payload. Each little band has its own zero page and stack, and the payload for the previous stage overlaps with the kernel for the next stage. Because 6o6 currently uses self-modifying code for expediency, the main program (which I've puckishly termed the "hypervisor") just copies everything up in memory, since each stage needs its own copy of the VM. As the harness I showed you earlier computes the physical address with an add, the same harness works at each level to propel the virtual address to the right place and everything is seen in the same location by each preceding stage.

Consequently, it should also be noted that by stage 3 the majority of memory usage here is given over to three entire copies of the VM, which with the inlined fetch macros weighs in at over ten kilobytes a pop. While I could probably cut out a few more executed instructions by even more aggressively inlining everything, I think I'm currently at or near the point of diminishing returns with respect to memory usage. Twenty years occasionally refining your own code will do that.

Just like the movie, the deeper you get, the more levels you have to bubble up through. At stage 3, a call to CHROUT gets turned into a call to CHROUT at stage 2, which gets turned into a call to CHROUT at stage 1, which finally gets turned into a call to the native routine. Each is the same code and each handles the call the same way which percolates up to the lowest stage where the native call actually occurs.

Of course, when the payload completes, we'll need a "kick" to exit.

The "kick" is the RTS instruction — a BRK could also serve, and may even be more appropriately explosive, but I wanted the payload to be "more or less normal" code. You may have noticed we check for stack underflow in the kernel, which is a condition the harness can throw an exception for if it wants. (In KIMplement we don't; the stack just wraps, as it would on a real KIM-1.) Since we just "start" into the payload/kernel, there is no return address on the stack, so when the terminal payload, wherever it is, hits the RTS, we have nothing to pull. This gets reported up to the kernel that's supervising it, which then does an RTS itself. At the deeper stages this too gets propagated back up and up until it reaches the native kernel, which then returns to the "hypervisor."

After all three stages are run, the hypervisor cleans up so you can run it again — on the Apple II, CALL 2051; on the C64, just RUN it — and then terminates to BASIC. I note parenthetically that this program extends up into the resident DOS range on the Apple II above $9000, so you should reboot after you're done playing around.

The third demo is a tiny task-switching kernel that jumps back and forth between two independent tasks. Each task has its own zero page and stack, along with a tiny address range for its code to live in. Each task is encapsulated and completely unaware of the other task or, indeed, of the VM actually running it.

The two tasks are one displaying the letters of the alphabet, and one displaying numbers (in reverse video to make them visually distinct). Each time you press a key, the task switches. Even though each task uses the same location in zero page to track its state, their zero pages are independent, so they pick up where they last left off. If you hold down a key, then as the keypresses go through, the kernel switches back and forth between the two of them. (In fact, on the C64 version, the "reverse video" will bleed because it's possible for the numbers task to print the RVS ON sequence but get immediately swapped out after for the other task. This doesn't happen on the Apple II version where inverse video is a separate set of printable screen codes.)

All that's necessary to context-switch is to keep track of the current task and have storage areas for the processor state of each one (i.e., A, X, Y, P, S and PC). The harness looks at the task "on CPU" and selects the proper physical addresses for the zero page, stack and executable code accordingly; the kernel stores and loads the other state when instructed to swap and marks the other task as "on processor." This demo runs until you stop it with RUN-STOP/RESTORE or Control-Break.
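Since the frozen-between-calls guest state is just six values, the whole context switch fits in a few lines. This Python sketch uses illustrative names of my own, not the demo's actual labels:

```python
# Minimal sketch of the context switch described above: capture the frozen
# guest register file (A, X, Y, P, S, PC) and swap in the other task's set.
from dataclasses import dataclass, replace

@dataclass
class GuestState:
    a: int = 0
    x: int = 0
    y: int = 0
    p: int = 0
    s: int = 0xFF
    pc: int = 0

def switch_task(states, on_cpu, live):
    """Save the live registers for task on_cpu, return the other task's set."""
    states[on_cpu] = replace(live)       # between VM calls the CPU is frozen
    other = 1 - on_cpu                   # this demo has exactly two tasks
    return states[other], other
```

The harness's job is the other half: once the kernel flips which task is "on CPU," the harness resolves zero page, stack and code fetches to that task's physical areas.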

For the last demonstration, which runs only on the Commodore 64 (though it could be adapted for 128K systems like the 128 or Apple IIe), we'll provide a full 64K addressing space that uses none of the system's own RAM — it's all external memory. Since the C64 has only 64K of RAM of its own, this will require some hardware.

The geoRAM is a paged RAM device distinct from the Commodore-official RAM Expansion Unit (REU), which is DMA-oriented. REU memory is not accessible directly, only by operations that read from, overwrite or exchange it with main memory using the custom MOS 8726 REC. Commodore 64 GEOS is very memory-hungry and GEOS supports the REU to reduce dependence on the disk drive, but the proprietary REC chip was expensive and suffered from supply shortages, so Berkeley Softworks came up with a cheaper alternative of its own. (Ah, Shattuck Avenue, says this Berkeley Master's graduate.)

Similar to schemes like early LIM EMS, the geoRAM maps its memory into a "window" page — in this case a 256-byte page in the I/O range at $de00 — with its control registers at $dffe and $dfff to change which page is accessed. This is convenient since that block of I/O space doesn't take away any RAM. Pulling up the correct page is merely a matter of setting the registers to the one you want and the window page is instantaneously read-write, so you don't need to flush any caches to bank in another one. Though the geoRAM complicates this slightly by using 16K "banks" (64K would have been more convenient), overall the geoRAM was a very straightforward and inexpensive scheme that can be implemented with off-the-shelf components, and compatible modern clones now have capacities up to 4MB. VICE can easily emulate a geoRAM.

For this demo I wanted to emulate a simple terminal-based system with a full 64K. As it happens, for a period of time the RC2014 Z80 kit computer had a third-party 6502 processor module available, for which a ROM was provided that has both a simple monitor and Lee Davison's EhBASIC (rip). Like before I wanted to use a pre-built ROM so that you can see there's "nothing up my sleeve," and fortunately there's a set on GitHub. We'll use the 6551 version, though we could use any of them (mutatis mutandis) since we're going to trap the terminal routines anyway.

Here's the harness store routine. As I mentioned, the geoRAM deals in 16K instead of 64K banks. We're going to just occupy the first 64K of what's there but we'll still need to handle the bank computation for anything between 16K and 64K.

        ; poke .a into location indicated by dptr
        ; okay to clobber x and y
mpoke   ldx dptr+1
        cpx #>ROMSTART
        bcc :+
        ; no writes to emulated ROM, cheaters
        rts
        ; page unchanged? if so, skip all this nonsense
:       cpx curpage
        beq :++
        ; map the high byte onto the geoRAM page registers
        stx curpage     ; cache it
        cpx #64
        bcs :+
        ; below $4000, direct mapping
        stx $dffe
        ldx #0
        stx $dfff
        ldx dptr
        sta $de00,x
        rts
        ; $4000 and up indirect mapping, convert to block and bank
:       tay
        txa
        and #63
        sta $dffe
        txa
        and #192
        clc             ; carry still set from the bcs, so clear it
        rol
        rol
        rol
        sta $dfff
        tya
:       ldx dptr
        sta $de00,x
        rts

ROMSTART is the location of ROM in guest memory, which for the preassembled ROM image is $c100. Writes to it are simply ignored. For writes below 16K, we have a fast path; for everything else, there's MasterCard: a quick banking adjust with masks and shifts. This isn't a slow calculation, but it isn't necessary if we're already on the same page (particularly true for instruction fetches), so we always cache the current page the geoRAM is on.
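The register math in that routine is compact enough to restate in Python. This mirrors the mask-and-shift computation above for a guest address in the first 64K of expansion RAM:

```python
# Python model of the geoRAM register computation in the mpoke routine:
# $dffe selects the page within a 16K bank, $dfff selects the bank.
def georam_regs(vaddr):
    """Return the ($dffe, $dfff) values that map vaddr's page into $de00."""
    page = (vaddr >> 8) & 0xFF
    if page < 64:                  # below $4000: direct mapping into bank 0
        return page, 0
    return page & 63, page >> 6    # low 6 bits -> page, top 2 bits -> bank
```

The `page >> 6` here is what the assembly achieves with `and #192` followed by `clc` and three `rol`s, rotating the top two bits around into the bottom of the byte.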

The reason I wanted something with both a monitor and BASIC is so that you can mess around with the emulated machine's memory and any errors just bail out to the monitor. The main program and the kernel are one and the same in this example, so once the main program has checked that there's a geoRAM present and it seems to be working (by using its own harness, also a good self-test), it will copy the ROM image into geoRAM and begin execution. The relevant instruction loop looks like this:

lup     jsr VMU

        ; check status
        cmp #R_BRK
        beq dobrk
        cmp #R_BADINS
        beq doill
        cmp #R_UDI
        beq doill
        cmp #R_MFAULT   ; nb - current version can't generate this
        beq doill
        ; other conditions ignored

BRK instructions drop back into the monitor, as they would on most systems with one.

        ; brk handler
dobrk   ; stack already set up by VM
        lda PAYLOAD+ROMSIZE-2
        sta pc
        lda PAYLOAD+ROMSIZE-1
        sta pc+1
        jmp lup

But so that we can use the same machinery in the monitor, we "forge" a BRK for other failures like an illegal instruction or the special user-defined instruction trap (also notionally an illegal instruction).

        ; illegal failure handler
        ; effectively turn the faulting instruction into a brk
doill   ; do what VM opbrk would do, but wind the pc back one instruction
        ; since the bad opcode was already fetched
        lda pc
        clc
        adc #1          ; not 2
        sta hold0
        lda pc+1
        adc #0
        sta hold1       
        ; put high byte on first so it comes off last
        jsr SPUSH
        lda hold0
        jsr SPUSH
        lda preg
        ora #%00110000
        jsr SPUSH
        lda preg
        ora #%00010000  ; set bit 4 for B-flag, leave IRQs alone
        sta preg
        jmp dobrk
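
To see why the handler adds 1 rather than 2, here's a small Python model of the forged frame (forge_brk_frame and push are hypothetical names, not in the source): a real BRK at address A pushes A+2, and since the VM's pc already sits at A+1 after fetching the bad opcode, pushing pc+1 yields the same value.

```python
def forge_brk_frame(pc, preg, push):
    """Model of the doill handler: make an illegal opcode look like a BRK.
    pc points just past the already-fetched bad opcode, so pushing pc+1
    matches what a real BRK at the opcode's address would push (addr+2).
    push() is a stand-in for the VM's SPUSH."""
    ret = (pc + 1) & 0xFFFF
    push(ret >> 8)              # high byte first so it comes off last
    push(ret & 0xFF)
    push(preg | 0b00110000)     # pushed status has B flag and bit 5 set
    return preg | 0b00010000    # live P gets the B flag; I is left alone
```
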

The other main part of the kernel/main program is handling a simple emulated terminal. Since the RC2014 assumes a PC-type terminal, we intercept its serial vectors (for the other UARTs these values would need to be changed), translate to and from PETSCII and maintain a little cursor, twiddling the guest registers and flags according to the results. At any time you can do a "three-finger salute" and reset the emulated system with CTRL-SHIFT-Commodore, leaving memory intact.

        ; check for emulated ROM routines
        ; these come from disassembling the reset routine at $ff0f
        ; use the targets rather than ff03, ff06, ff09, ff0c so that
        ; the vectors can be redirected to user code if desired
        ;
        ; f931 = init (no-op)
        ; f941 = input_char (wait) ($ff03 vectored at $03d0)
        ; f94c = input_char (no wait, carry flag) ($ff06 vectored at $03d2)
        ; f959 = output_char ($ff09 vectored at $03d4)
        ; fa05 = print string at a, x ($ff0c)

        lda pc+1
        cmp #$f9
        beq lowchek
        cmp #$fa
        bne lup
        ; $fa05 is the only routine in $faxx we patch
        lda pc
        cmp #$05
        bne lup
        jmp epstrax
        ; check $f9xx routines
lowchek lda pc
        cmp #$31
        beq euinit
        cmp #$41
        beq euinput
        cmp #$4c
        beq euscan
        cmp #$59
        beq euout
        jmp lup
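
The PETSCII translation itself is mostly case-swapping for letters; here's an illustrative Python approximation (the emulator's actual tables also deal with control codes and cursor handling, and may differ in details):

```python
def ascii_to_petscii(c):
    """Rough ASCII -> PETSCII for printable text: unshifted PETSCII
    letters occupy $41-$5A, so ASCII lower case maps down into that
    range and ASCII upper case maps up into the shifted $C1-$DA range."""
    if 0x61 <= c <= 0x7A:       # a-z
        return c - 0x20
    if 0x41 <= c <= 0x5A:       # A-Z
        return c + 0x80
    return c                    # digits and most punctuation coincide
```
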

Let's pop it into VICE.

The screenshot here shows starting it up and entering EhBASIC (from the monitor, g c100 for cold start or g c103 for warm). This particular version has been minimally patched by the RC2014-6502 author to implement a SYS command to return to the monitor. On a cold start, EhBASIC will ask for the memory size or, if you give it none, count it up itself. The unaccelerated C64 using geoRAM takes about a minute to discover that it has 32768 bytes free, because it has to swap the window back and forth between the instructions it fetches and the RAM locations it reads; you can just type 32768 to skip all that. (There's actually more free memory, but this ROM was built with a hard cap at $8000, so you can still put things at $8000 through $c0ff.) EhBASIC annoyingly does not accept lower-case commands or keywords, so you must type everything in UPPERCASE. We run a simple little BASIC program to show it works, then intentionally try to foul the machine by storing and executing a jam instruction. The monitor is duly and swiftly invoked with the PC correctly pointing to the offender.
And here it is on my real Commodore 128DCR with my real geoRAM cartridge installed. Notice that floating point math works just fine, too, and that the bad instruction is once again immediately intercepted in a controlled fashion. Other than the 256 byte window peeping through, none of what we're looking at here actually executes from the 6502's own address space. Even on this basic 512K geoRAM unit you could have eight entire 6502 system tasks, all separate and independent, with only a minimum of state for each one needing to be maintained on the actual processor.

What are some future improvements to 6o6? Naturally I'd like it to be able to run from ROM, though this would require some refactoring and would possibly make it slower, so it would only ever be an option. Also, although I consider making something like this emulate a 65816 on an NMOS 6502 to be thoroughly out of scope, it might be possible to have it emulate CMOS instructions on an NMOS system, just as it will (mostly) act like an NMOS CPU when run on a CMOS system. Note that since the host's own ALU is used, an NMOS 6502 emulating a CMOS 65C02 will still have flags set the way the NMOS system would set them and vice versa, and the addressing is currently written the way an NMOS CPU would do it, like indirect jumps to $xxff pulling the vector from $xxff and $xx00. Additionally, despite the performance improvements I've implemented over the years, depending on how the inline memory macros are done there are peephole optimization opportunities that could be exploited with a "post-preprocessor" pass before actual assembly. That would make the tooling more complex, so I'd have to see how much benefit it realizes in the general case.
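
The $xxff indirect jump quirk mentioned above is easy to state precisely; here's a small Python model (mem is any byte-indexable store):

```python
def nmos_jmp_indirect(mem, ptr):
    """NMOS 6502 JMP (indirect) page-wrap bug: when the vector pointer
    sits at $xxff, the high byte is fetched from the start of the *same*
    page ($xx00) rather than from the following page."""
    lo = mem[ptr]
    hi = mem[(ptr & 0xFF00) | ((ptr + 1) & 0x00FF)]
    return (hi << 8) | lo
```

With a vector pointer at $10ff, the high byte comes from $1000, where a 65C02 would correctly fetch it from $1100.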

Other than using 6o6 to power a custom operating system, though, another obvious use is running downloaded code without messing up what you're currently doing. My next application might be as part of a Gopher client where you can dynamically run what you download, like a networked file system. That could have possibilities!

If you're designing your own ultimate 6502 system, of course, 6o6 is not (primarily) meant for you — you can just implement the hardware you want to implement the features you need, and because you're in control of the design you'd be able to do it faster. But if you're trying to do this on an NMOS CPU where you have fewer guardrails, or to do it with a minimum of added silicon, then here's another option that's designed to be flexible and adaptable.

Meanwhile, the updates for KIMplement 1.0 are just minor bug fixes and cleaning it up for public display. I've also included Tiny PILOT courtesy of Dave Hassler, which originally appeared in MICRO magazine written by Nicholas Vrtis in 1979 with patches by Bob Applegate (rip) and Dave himself. It runs splendidly and is a nice small implementation of an interesting language that ought to have a second life in scripting. Dave also ported ELIZA to it from the 1980 Atari PILOT implementation by Carol Shaw and Harry Stewart, which serves as a historically noteworthy demonstration.

The Incredible KIMplement is available from its homepage, along with source code on GitHub. 6o6 is also available on GitHub, along with all four of the examples shown here. Both the KIMplement and 6o6 are under the Floodgap Free Software License.

2 comments:

  1. Have you thought about/tried doing JIT?

    1. Yes, I did consider it and it might be an approach I look at again, possibly storing an executable sequence to memory after it runs it in the VM and all addresses are defined. It's certainly an attractive option theoretically. However, other than the obvious first-run slowdown, of course, there are a couple additional concerns to solve: it would be hard to figure out when this is profitable (there are no structured loops, so no loop headers to hint us, and no symbols to tell us we've entered a different code unit), the VM already has quite a large footprint and we'd add additional memory overhead storing sequences we may never run again, it may make the harness less flexible in some corner cases, self-modifying code would invalidate any previous trace (not a rare phenomenon), and after all that it's not clear that the code we generate would be substantially quicker than what the VM does now -- except if we do an additional optimization pass that may or may not pay off. We could require code to insert meta-hints or generate a new executable format that does, but that would be some additional complexity of its own. A JIT or AOT idea would probably pay off most in a system with free memory to burn and/or a second CPU generating the traces. Not helpful as such to the C64 or Apple II, but maybe to new hobbyist machines?


Comments are subject to moderation. Be nice.