Old Vintage Computing Research: Instruction fusion and a real serial port for your virtual KIM-1: The Incredible KIMplement 0.3

Everyone should have the retrocomputing experience of a 1976 1MHz MOS 6502 single-board computer with 1K of memory, six hex digit LEDs and a keypad. One of the earliest such systems and one of the least expensive, you program the KIM-1 in 6502 assembly language right on the keypad in hexadecimal and it's amazing what you could do with a system that little. You could even hook up a cassette deck and an external terminal and have a full system for just a few hundred dollars; MOS Technology (and later Commodore) consequently sold a ton of them. We first experienced the KIM-1 in high school and having grown up with Commodore 64s and 128s it was like meeting their long-lost little brother. We spent the whole weekend typing in hex opcodes and learned how to bang on the hardware and make it do surprising things in a space that small. That's the very unit in the picture, still in my possession, and over four and a half decades old it still works.

While I now personally own four KIMs (an MOS Revision A, plus a Commodore Revision D, a Commodore Revision G and this Commodore Revision F, my first), it's naturally more convenient to develop on an emulator and then test on the real thing. The KIM is such an easy system to understand that there are other KIM-1 emulators like it, but this one is mine. And the Incredible KIMplement runs on a Commodore 64, so anyone can run it on just about anything that can emulate a Commodore 64, or even a real Commodore 64.

However, I also think the KIMplement is a darn handy emulator if I do say so my darn self. It naturally supports the keypad and hex LEDs, and loads and saves memory like every other basic KIM-1 emulator, but it also supports a virtual teletype on the Commodore 64's console (the MAME driver still can't do that) and implements a true KIM-4 expander with 16K of total addressing space.

But new in this version, and not currently supported by any other emulator, is the ability to redirect the virtual KIM's TTY to the Commodore 64 user port as a real physical serial connection: a physical serial port for your virtual KIM-1. The picture shows a real Commodore 128D running the emulator, connected to minicom on my Linux workstation over USB serial from the 128's user port at 300 baud. From the Linux machine's perspective it's practically indistinguishable from my real Revision F unit, and on Commodore 64 emulators that support it (like VICE), you can tunnel the emulated 64's user port over a TCP socket to give your virtual KIM an Internet connection (we'll demonstrate that below). Plus, this means your virtual KIM can now call out to the real world as well as in! (What can you do with that? Stay tuned for a future entry!)

Still, emulating a 1MHz 6502 CPU on a computer with a 1.02273MHz 6510 (or, worse, 0.985250MHz if you're PAL) at full speed is impossible. Besides replacing some portions of the ROM with native traps to bridge the performance gulf, and a lot of behavioural tricks to make it seem full speed, this version also finally adds partial support for the RIOT I/O chips' real-time interval timers (implemented with the Commodore 64 Kernal's Timer A interrupt), which helps programs that use them for delays or have loops that can be rewritten to use them. That doesn't mean there aren't opportunities for improvement in the emulator core, of course. While version 0.3 is still about 30 times slower clock-for-clock than a real KIM (modulo exact workload), it's 10-15% faster than the previous version 0.2b, and almost all of that improvement comes from a feature I like to call "extra helpings."

Recall from previous posts that the virtualizer in KIMplement is called "6o6," for 6502-on-6502. It achieves its speed by examining the emulated instruction stream and executing an equivalent series of instructions for their side effects. For example, to load the accumulator from an immediate value (such as LDA #1), 6o6 actually executes LDA #1 itself and saves the resulting value of the accumulator and status flags. Since the instruction was run on the physical silicon, there's no need to load the value into an emulated accumulator or figure out the values of the Negative or Zero flags because the 64's 6510 CPU has already done it for real, making it more like primitive virtualization than simply a brute-force CPU emulator. Arithmetic instructions like ADC and SBC therefore "just work": regardless of whether carry is set or decimal mode is on, by just loading the accumulator and status flags back and running the same instruction, the math is handled identically to how the emulated NMOS CPU would do it because a real NMOS CPU is doing it. A load from memory, say, LDX $D020, is turned into a virtual memory call to get the value of $D020 (in KIMplement this is just computing a real-memory effective address, since everything is in RAM), and then 6o6 runs an LDX-immediate with that value and saves X and the status. Stores to memory fetch the register and execute another virtual memory call to store it, and so forth. A handful of instructions are fully emulated (i.e., no direct execution) for functionality, such as BRK, RTS and RTI, or speed reasons, like the opcodes that directly manipulate flags (patching the shadow status register in place is faster than loading it, running the same instruction and saving it again). I've also made some space tradeoffs since in this version 6o6 is now almost 16K all by itself, making it KIMplement's largest single component.

As we just saw, the 6502 has multiple addressing modes, not all of which are available or even sensible for every instruction. MOS defined eight, but in broad strokes there are four main camps, namely instructions with an immediate 8-bit quantity (like register loads or ALU operations), implied instructions (where there is no argument and any source or target are implicit, usually the accumulator, but also true of flag setters and register-to-register instructions), relative branch instructions (where an 8-bit argument is used as a signed offset to the program counter if the branch is taken) and instructions specifying memory addresses (which can be absolute 16-bit addresses, indirect 16-bit addresses where the effective address is stored in the provided 16-bit address [JMP only], an eight-bit address in zero page [i.e., $00xx], an indexed absolute address, an indexed indirect address where the effective address comes from a zero page address plus X, or an indirect indexed address where the effective address comes from a zero page address and Y is added to the effective address).

In earlier versions of 6o6, the emulator and the virtualizer operated in lockstep. The emulator called the virtualizer, which ran exactly one instruction and returned to the emulator, and the emulator handled all the other tasks such as displaying the KIM's LEDs, looking at the program counter to see if a trap needs running, reading the TTY, translating keys to the keypad and so on before going back to the virtualizer to run another instruction. This design is very precise, but if you look at those general types of instructions above, only the instructions that actually reference memory care about the state of the rest of the emulator. Consider this contrived sequence of instructions:

    ldx #0
lup inx
    bne lup

Except for instruction fetches, this snippet of code, which is basically a delay loop increasing the X register until it overflows to zero, doesn't touch memory. There are no I/O port instructions on the 6502; everything is memory-mapped. Thus, assuming everything is in memory and all events are internal, there's no reason to run the rest of the emulator while this section of code is executing because this loop can't have changed or observed anything. In fact, we could even reason that it could simply execute as a "single instruction" from the emulator's perspective by the virtualizer going back for another "extra helping" of instructions after one it knows it can consolidate — theoretically the loop in its entirety.

By dispensing with that unnecessary work, we haven't made the virtualizer itself more efficient, but we have made it possible for the virtualizer to make the emulator more efficient, and that's where that 10-15% improvement comes from. If you've delved into microprocessor design, this approach might seem conceptually familiar: what we're doing is a primitive in-order analogue of instruction fusion, just in software. From the emulator's perspective, the virtual CPU merely executed one "big instruction" that we managed to cobble together out of smaller ones we were able to reason could be safely executed as a group. That's also why the speed gain is dependent on instruction mix, as lots of memory access will defeat the virtualizer's ability to reason about what it's trying to execute.

There are downsides to this approach because real world. While it's nice to assume that nothing happens within the system unless the CPU triggers it, there are of course all kinds of external events occurring, so there needs to be an escape hatch to allow the emulator to still do periodic work (or an instruction sequence like lda #0:beq *-2 will never return to the emulator: IRQs and even non-maskable interrupts [NMIs] are handled external to 6o6). We can't solve the halting problem in 1MHz and we don't want to make the set of eligible instructions Turing-complete, so besides instructions that touch memory or the stack, anything that could change the program counter (PC) other than to move to the next instruction is considered ineligible for extra helpings (that means JSR, JMP, RTS, RTI, BRK and a conditional branch that is taken, but a branch that is not taken is eligible, because we know the condition being tested). In practice the eligible instructions are the set of all immediate instructions, instructions that operate directly on the status word, implied accumulator-accumulator instructions, conditional branches not taken, NOP, and register-register instructions except TSX and TXS (mostly paranoia on my part).
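To make the grouping rule concrete, here is a toy Python model of the delay loop from earlier. It is only a sketch of the control flow, not 6o6's code: the tuple-based instruction format, the list-index PC and the names `step` and `run` are invented for illustration, and `brk` merely stands in for handing control back for good.

```python
# Toy model of "extra helpings": the virtualizer keeps going while the
# instruction it just ran was eligible, and returns to the emulator otherwise.
# Instruction encoding and PC-as-list-index are simplifications for this sketch.

PROGRAM = [
    ("ldx_imm", 0),   #     ldx #0
    ("inx", None),    # lup inx
    ("bne", 1),       #     bne lup   (operand = list index of lup)
    ("brk", None),    # ends the toy program
]

def step(cpu):
    """Run one instruction; return True if it was eligible for extra helpings."""
    op, arg = PROGRAM[cpu["pc"]]
    cpu["pc"] += 1
    cpu["executed"] += 1
    if op == "ldx_imm":
        cpu["x"] = arg
        return True                       # immediate load: eligible
    if op == "inx":
        cpu["x"] = (cpu["x"] + 1) & 0xFF  # 8-bit wraparound
        return True                       # register-only: eligible
    if op == "bne":
        if cpu["x"] != 0:
            cpu["pc"] = arg
            return False                  # branch taken: PC changed, ineligible
        return True                       # branch not taken: eligible
    cpu["halted"] = True                  # brk
    return False                          # always ineligible

def run(extra_helpings):
    """Return (instructions executed, emulator visits) for the toy program."""
    cpu = {"pc": 0, "x": 0, "executed": 0, "halted": False}
    visits = 0
    while not cpu["halted"]:
        eligible = step(cpu)              # every call runs at least one instruction
        while extra_helpings and eligible:
            eligible = step(cpu)          # go back for another helping
        visits += 1                       # LEDs, traps, TTY, keypad would be serviced here
    return cpu["executed"], visits

print("lockstep:      ", run(False))
print("extra helpings:", run(True))
```

Both runs execute the same 514 instructions, but the lockstep run returns to the emulator 514 times and the grouped run only 256 times, because each taken `bne` still ends a group.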

KIMplement's internal requirements cause other complications we must account for. Since an entire sequence of multiple instructions could be taken as a group, this approach becomes incompatible with anything needing exact control over the PC or the rest of the processor state. The KIM-1 has a "single step switch" (SST) that triggers an NMI for all instructions in RAM, allowing you to step through code one instruction at a time no matter what that instruction is, so we'll need to provide a flag to 6o6 to disable extra helpings when it's on. Additionally, we need the precise location of the PC to determine if we've moved into a ROM trap: it's not merely sufficient to look at the PC when entering the emulator after a branch because we may also move into a trap by either crossing a page boundary from RAM to ROM or continuing execution in a ROM routine up to a trapped location. And we need to make sure any test we do in 6o6 doesn't erase the benefit of extra helpings by being too heavyweight in execution or bloating every eligible instruction in code size.

The test in the current version of KIMplement and 6o6 is thus very simple: look at the LSB of the emulated program counter and ensure it is less than a flag byte the emulator sets. If the SST is on or the high byte of the program counter is in ROM, the emulator sets the flag byte to zero, and since the compare is unsigned extra helpings become dynamically disabled. Otherwise, to deal with the situation of moving from RAM to ROM, the flag byte is set to $fc. The last instruction in an execution group must always be an ineligible opcode by definition and those instructions could be up to three bytes long. If a three-byte instruction is at $xxfc, then the next instruction is $xxff and no matter the length of that next instruction (one, two or three bytes), extra helpings will be off to execute it over the page boundary into ROM where the flag will go to zero anyway. While additional tests could be added, the most productive ones largely depend on snooping the next instruction which makes the logic too complicated. By contrast, while this LSB-less-than approach is conservative and may potentially ignore short eligible instruction sequences in those last couple bytes, they're probably not worth the effort trying to fish out.
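The flag-byte test described above can be sketched in a few lines of Python. This is a model of the logic only: the function names and the boolean `pc_high_in_rom` parameter are hypothetical, and on the 6502 itself the comparison would presumably be a compare-and-branch on the PC's low byte rather than a function call.

```python
RAM_FLAG = 0xFC  # flag value the text gives for "running in RAM with SST off"

def helpings_flag(sst_on, pc_high_in_rom):
    """Flag byte the emulator sets before calling the virtualizer."""
    return 0x00 if (sst_on or pc_high_in_rom) else RAM_FLAG

def may_take_extra_helping(pc, flag):
    """Unsigned compare: the low byte of the PC must be below the flag byte."""
    return (pc & 0xFF) < flag

# With SST on (or PC in ROM) the flag is zero, and nothing is below zero.
assert not may_take_extra_helping(0x0200, helpings_flag(True, False))
assert not may_take_extra_helping(0x0200, helpings_flag(False, True))

# In RAM, everything up to $xxfb qualifies; the last four bytes of a page do not.
flag = helpings_flag(False, False)
for pc in (0x0200, 0x02FB, 0x02FC, 0x02FF):
    print(f"${pc:04x}: {may_take_extra_helping(pc, flag)}")
```

Because the flag of zero can never be greater than an unsigned low byte, the same single comparison covers both the "disabled" case and the end-of-page guard.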

After all that, the other improvements are almost small potatoes. A whole mess of bugs were fixed and a better, less flickery LED routine was written, meaning you can now play (Hunt the) Wumpus from the First Book of KIM with impunity. There are also some conveniences for programmers: the NMI vector and the KIM's processor flags shadow register at $00f1 are properly set for you on startup, and you can also use the $fb opcode (illegal on a real 6502, generates ISC) as a breakpoint to trigger an NMI on execution in place of any JSR. And there's several more games, too!

The emulator is available as either a .d64.gz you can expand and use as a disk image in an emulator such as VICE or write to a real 1541 floppy, or as .prg files you can download and run directly (the .prg archives of the demonstration programs are .sda [Self-Dissolving Archive] that when run with a disk in device 8 will write out their component files). Here are downloads and the full manual.

If you want to run KIMplement in a Commodore 64 emulator (Inception!), VICE, the Versatile Commodore Emulator, supports redirecting the emulated Commodore 64's user port to a network connection, letting your virtual KIM tunnel over the Internet. As I promised above, here's the steps, done on my Raptor Talos II running Fedora 37. First, configure VICE's RS-232 settings to redirect the user port (not a 6551 ACIA cartridge) to a socket. By default this will be 127.0.0.1:25232, which I've set all the virtual serial ports to below in this screenshot, and then check "Enable userport RS232 emulation." I've also hardcoded the baud rate even though that technically doesn't matter here.

Shut down VICE and open a terminal window. The command I use to listen on the socket is

stty raw ; nc -l -p 25232 ; stty cooked iutf8

or adjust as you see fit (add -brkint if you need to send CTRL-C, etc.). The socket should be open and listening before restarting the emulator. Once it is, start VICE, load and run the KIMplement, load your program, set the toggle for the TTY-user port (toggle #3) to 1, and press INST/DEL or SHIFT on the Commodore emulator. KIMplement will start TTY mode and data will start flowing to your open socket. Ta-daa! Remember to have CAPS LOCK down when talking to the virtual KIM because ASCII-1963 didn't have lowercase.

Here's FOCAL running on the virtual KIM running on the virtual Commodore 64 over a virtual serial port connection computing a fairly accurate fractional estimate of π:

The socket closes automatically when you quit VICE.

A parenthetical programming note about the Commodore 64's Kernal RS-232 routines. This program uses 300 baud not only because it's the most reliable speed for a KIM-1 but also because it works properly with the C64 Kernal, meaning no special serial code was needed. However, when you OPEN the RS-232 device, the Kernal lowers the top of memory (normally $a000) by 512 bytes to store its buffers. By default this happens to be right in the middle of where 6o6 was, so the Kernal routines immediately corrupted it. As soon as any programs were run, the emulator would crash, but only when I had the RS-232 channel opened. Fortunately it put bytes there that caused an immediate jam instruction, meaning it didn't take long to figure out why.

Instead, we want the 512 bytes to be within the space I've allocated for the BASIC menu and shell. This code ends at $2000, so we want $1e00 and $1f00 for buffers. To get that set up correctly requires a bit of a dance:

1 poke52,30:poke56,30:poke644,31:clr:open2,2,0,chr$(6)+chr$(0)
2 poke247,0:poke248,31:poke249,0:poke250,30:poke169,144

The first set of POKEs lowers both of BASIC's top of memory pointers, plus the operating system's top of memory to be just above it. When the OPEN is executed, the Kernal then puts its buffers in our desired locations and sets up its pointers. However, there seems to be a Kernal bug that incorrectly sets one of them, so we immediately ensure those pointers are correct, and hint to the Kernal's RS-232 routines that no start bit has been received yet so that the first character received is not mangled.
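For readers who don't think in decimal POKE values, the arithmetic in those two BASIC lines can be worked through in Python. The location names in the comments come from commonly published C64 memory maps rather than from anything stated here, so treat them as assumptions. The low bytes that are never POKEd are assumed to still hold $00.

```python
def pointer(lo, hi):
    """Combine a little-endian low/high byte pair into a 16-bit address."""
    return (hi << 8) | lo

# Line 1: only the high bytes are POKEd (low bytes assumed to be $00).
basic_string_bottom = pointer(0, 30)   # 51-52, poke52,30
basic_top = pointer(0, 30)             # 55-56, poke56,30
os_top = pointer(0, 31)                # 643-644, poke644,31

# Line 2: both bytes of each RS-232 buffer pointer are POKEd.
rs232_input_buffer = pointer(0, 31)    # 247-248 (usually listed as RIBUF)
rs232_output_buffer = pointer(0, 30)   # 249-250 (usually listed as ROBUF)
start_bit_flag = 144                   # 169; $90 is documented as "no start bit received"

# OPEN's first filename byte, chr$(6): the low nibble is the baud rate code,
# and code 6 is 300 baud in the usual Kernal table (an assumption here).
baud_code = 6 & 0x0F

BUFFER_SIZE = 256
MENU_END = 0x2000                      # where the text says the menu area ends

for name, addr in [("BASIC top", basic_top), ("OS top", os_top),
                   ("input buffer", rs232_input_buffer),
                   ("output buffer", rs232_output_buffer)]:
    print(f"{name:14s} ${addr:04x}")
print(f"buffers end at ${rs232_input_buffer + BUFFER_SIZE:04x}")
```

The two 256-byte buffers land at $1e00 and $1f00 and end exactly at $2000, with BASIC's own top of memory pulled down to $1e00 so that it stays clear of them.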

Version 0.3 should be the last closed-source version of the Incredible KIMplement and 6o6, and assuming I find no other critical bugs, I'm planning to open-source both the KIMplement and 6o6 with the "1.0" release which hopefully should be next up. More KIM-1 tricks to come as well, but for now, check out the emulator and read more about the MOS KIM-1.

Sunday, February 5, 2023
