Old Vintage Computing Research: After 41 years, my first assembly program on my first computer, the Tomy Tutor

We got it in 1983, I think, so it only took me about 41 years to get around to it. This Tomy Tutor isn't a replacement system I secondarily acquired, nor is it a Ship of Theseus Frankenstein rebuild. This is my actual first computer, in its original case, on its original components, with the Federated Group sticker still on the original box. And it still works.

Now, why so long? Well, for one thing, it was only supposed to be a training wheels computer because a full Commodore 64 system would have cost too much, but my folks wanted to see whether we'd take to a home computer and His High Holy Munificence Fred R. Rated was blowing these babies out for a song by then. The receipt has long since disappeared, though $99 sounds about right plus maybe around $40 or so for a joystick, cassette deck and some cartridges, compared to somewhere between $200 and $300 for the recently discounted 64 — which didn't include anything else. (It tells you something about our family finances at the time when a C64 was too expensive.) I immediately started writing my own BASIC programs on it in its perverse little BASIC dialect and when my folks indeed saved up and bought us a C64 system the next year (complete with 1702 monitor and 1541 disk drive), I refused to use it. In ~~retaliation~~ my best interests, my parents forcibly relocated the Tomy to storage and I went on to do even bigger things on the Commodore, making it, not the Tutor, the defining computer of my childhood. That's why there's still a Commodore 128DCR on my desk.

The other reason is that there was never really a simple way to do it. Even when I found out what CPU was actually inside (incredibly a 16-bit TMS 9995, an evolved version of the TMS 9900 in the Texas Instruments 99/4 and 99/4A), there was never a Tomy assembler, and other than its small amount of scratchpad RAM (256 bytes) the entirety of the Tutor's 16K of memory is tied up in the 9918ANL VDP video chip. That sort of architecture was typical for the family, but that also means that almost everything is stored in non-executable VDP RAM, so short of burning your own cartridge EPROMs there's no way to actually create and run a machine language program on the Tutor. The first flashcart for the Tutor didn't exist until around 2016 and it was still all ROM; furthermore, while the 99/4A could have its CPU-addressable RAM expanded (as well as the 99/8, its unreleased successor to which the Tomy Tutor is closely related), there wasn't ever a Tutor RAM expansion cartridge either until very recently. But now there are multiple homebrew options even for obscure home computers like this one, and at last I've got my own assembly language program finally running on it.

And it's all done with its own, better I/O routines (if I do say my own better self) as a basis for bigger projects. But first, a little tour of the Tutor itself, and then we'll dig in.

Texas Instruments' home computer series, including the famous TI 99/4A, was the logical consequence of TI's "one company, one computer architecture" policy. Indeed, the 1976 TMS 9900 CPU was basically their 16-bit 990 minicomputer architecture in an ungainly 64-pin DIP chip package and quite possibly (the only alternative is the General Instrument CP1600) the first single chip 16-bit microprocessor commercially available. It was fabbed in NMOS on a 4.5 micron process with about 8,000 transistors and initially topped out at a respectable 3MHz, though its pervasively microcoded architecture required sometimes large cycle counts per instruction. The 9918 video display processor in the original 99/4 evolved with a new bitmap mode to become the 9918A in the 99/4A, and, outlasting its originator, was one of the most common video chips of the early home computer era (Sord, MSX, CreatiVision/Dick Smith Wizzard, ColecoVision/Coleco ADAM, and many, many more). The original 99/4 was the winning design of three internal home system efforts in 1979, but was derided for its "calculator" keyboard, lack of lower case and a high MSRP; the upgraded 99/4A débuted in 1981 with an improved keyboard, better video, more expansion options and a lower price.

But TI was first and foremost in the chip business, and at the time was the largest semiconductor manufacturer on the planet. TI realized that the physical size of the CPU was harming its commercial viability — though TI's dubious management decisions were just as big a factor — and developed more conventional 40-pin versions, first as microcontrollers with on-board RAM and ROM, then with more typical 8-bit data buses. The most advanced of these was the TMS 9995 which had a few extra opcodes, a primitive pre-fetch facility, 256 bytes of on-chip RAM (we'll get to why this is notable when we discuss assembly language programming) and an internal decrementer for timing and event counting. It was noticeably faster than the 9900 and TI planned to implement it in its next generation of home computers, the low-end black and white 99/2 intended to compete against systems like the ZX-81 and Timex Sinclair 1000, and the high-end 99/8 with more memory, built-in peripherals and a larger keyboard.

Before that could happen, however, TI got deep in the weeds against their old nemesis Commodore under Jack Tramiel and ended up cancelling both the 99/2 and 99/8 in 1983 (exiting the home computer market completely in 1984), though not before there were spinoffs. It's not clear how Japanese toy manufacturer Tomy got involved but in 1982 Tomy adapted the in-development 99/8 architecture using the same 9995 CPU and 9918A VDP into their own home computer in Japan, manufactured by Matsushita (Panasonic) under contract. This computer was called the Tomy Pyuuta (ぴゅう太, also variously romanized as the Tomy Pyuta or Tomy Pyūta).

The Pyuuta wasn't, and wasn't intended as, a 99/8 clone. Unlike the 99/8's higher-end aspirations, the Pyuuta was targeted explicitly at younger children, using a friendly yet durable large plastic case and spill-resistant rubber Chiclet keys. For cartridges Tomy licensed some of Konami's arcade games like Frogger and Scramble and created a few of their own, and for peripherals they provided game controllers (included) and a cassette recorder (optional) for saving your work. Eventually they planned to release a modem, floppy disk, printer and speech synthesizer, all presumably using TI's reference designs except the printer which was a modified Astor MCP-40 plotter.

Although the basic BIOS was based on the TI's and the title screen in particular is very similar, Tomy prominently advertised it was a 16-bit system, yet focused more on games and graphics than programming. Like the unexpanded 99/4 and 99/4A, all of the included 16K RAM in the Pyuuta is dedicated to the VDP, for which Tomy created a built-in paint program and a highly constrained dialect of BASIC ("G-BASIC") to manipulate screen elements and sprites with katakana keywords. It ran using a 10.738635MHz 945/88 crystal divided by three for video (standard NTSC 315/88 3.579545MHz) and four for the CPU (945/352, 2.68465875MHz). The processor clock speed was slower on paper than the 3MHz 9900 in the 99/4A which came off a 12MHz crystal, but the Pyuuta was nevertheless faster because of the 9995's efficiencies and a critical architectural difference I'll discuss shortly.
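The clock figures are easy to check on a modern machine. This is a minimal Python sketch, not anything that runs on the Pyuuta; the names `CRYSTAL_MHZ` and `divided_clock` are made up for illustration. It works from the exact 945/88 ratio, whereas the decimal values quoted above follow from the nominal 10.738635MHz figure, so the two differ slightly in the last digits.

```python
from fractions import Fraction

# Master crystal as the exact NTSC-derived ratio quoted above, in MHz.
CRYSTAL_MHZ = Fraction(945, 88)

def divided_clock(divisor: int) -> Fraction:
    """Return the crystal frequency divided by an integer, in MHz."""
    return CRYSTAL_MHZ / divisor

video_mhz = divided_clock(3)   # NTSC colour subcarrier for the VDP
cpu_mhz = divided_clock(4)     # 9995 CPU clock

print("video:", video_mhz, "=", float(video_mhz), "MHz")
print("cpu:  ", cpu_mhz, "=", float(cpu_mhz), "MHz")
```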

The Pyuuta was a reasonable success in Japan and Tomy decided to export it to other markets by translating the OS and G-BASIC to English. However, British importer Adam Imports sent this first prototype back, finding G-BASIC too limited to be commercially viable. As an upgrade TI must have provided (unwittingly or otherwise) the code for TI Extended BASIC to Tomy to port, since Tomy BASIC has similar to nearly identical tokens, memory usage and syntax. This updated version was imported more or less directly by Adam Imports as the Grandstand Tutor, and its dual-BASIC system was released as an add-on device for the Pyuuta and then built-in as well to the next generation system Tomy themselves intended to sell in the United States. This was the Tomy Tutor in 1983.

Tomy USA hired Real People host Sarah Purcell as their spokesperson, who touted the computer as a system so easy to use that kids could teach themselves how to use it (it's true: I did!). She wasn't as high-profile as TI's Bill Cosby, but she was hardly unknown to the target demographic('s parents), and she hadn't committed any criminal offenses either. Unfortunately their otherwise promising marketing campaign was most notable for the most frequent use of the word "real" in a single pamphlet, as well as a five-day kick-the-tires-for-free deal which was about as successful as Apple's later "Test Drive a Macintosh" promotion. (NARRATOR: By which he means it wasn't.)

As it had with the Pyuuta, Tomy prominently touted the Tutor's 16-bit processor but provided no way to directly access it. A couple years earlier the Tutor might have been a compelling system, and one of the "real" kids on the box even wrote me a few years back to mention he rather enjoyed the games. But the video game crash was in full swing by then and Tomy's intentional toy aesthetic quickly became the kiss of death. No wonder Fred R. Rated was trying to get rid of them.

Tomy apparently lost so much money on the Tutor that they ended up producing very few peripherals for the system in either the United States or Japan. This picture shows the complete setup in the United States, namely a tape recorder with custom electronics, an Anglicized version of those disc-based joy controllers which made the Intellivision seem like a paragon of ergonomics, and a tough-as-nails single joystick sold as the Joy Stick (insert joke here). And that was it. The floppy disk drive, printer and speech synthesizer promised in both countries never appeared as other than a single picture in the Purcell pamphlet, the printer interface sold in Japan as part of the BASIC-1 add-on was never sold abroad, and most critically the "TI Adaptor" — nothing less than a Tomy rebadge of the TI Peripheral Expansion Box — that would have included additional memory and storage options was vapourware too.

The Tutor was not an exact copy of the Pyuuta either. The core silicon (the 9995 CPU, 9918A VDP video chip [9929A for PAL] and SN76489 DCSG sound chip) is the same, and the two primarily differ in the BIOS ROM, the absence of a Japanese character set, the presence of Tomy BASIC, and slightly different memory banking logic. The systems are otherwise nearly totally compatible such that Japanese Pyuuta cartridges will generally run on American or PAL Tutors and vice versa, language support notwithstanding, with only one of the American cartridges (its sole explicitly educational title) being specific to the United States.

The Tutor, as with the Pyuuta before it, started with the TI 99/4A's title screen but with animation, scrolling the colour bars vertically. It felt like a friendly computer from the moment you turned it on and the larger 32x24 text cells actually reinforced that somewhat (plus making it much easier to read on our little Panasonic colour TV).

The Tutor menu, directly translated from the Japanese menu in the Pyuuta, was also inspired by the TI's menu, but instead uses a "pointer" rather than selecting items by number. It was likewise very easy for a child to grasp. On the Tutor, the GRAPHIC and BASIC modes are always available as part of its sizeable 48K of built-in ROM. The CARTRIDGE option only appears if a cartridge is detected, which we'll talk about in a moment.

Unlike the TI 99/4A which used serially addressed "GROMs" for BASIC and much other software (which on top of that can't contain native code and are necessarily written in an interpreted bytecode called GPL), all of the Tutor ROMs and cartridges are directly connected to the bus and therefore tremendously faster. Although Tomy BASIC is also based on GPL, Tomy's GPL dialect is a stripped-down variant specialized for this task, and the program text is directly accessible like any other data in ROM. All of this, plus the 9995's prefetch, are why the Tutor's (and Pyuuta's) slightly slower clocked CPU runs so much more swiftly in practice than the 99/4A's.

The Tutor's GRAPHIC mode is a simple built-in paint program that takes full advantage of the 9918A's 256x192 bitmap mode, offering two colours per line in each 8x8 cell — substantially better than systems like the C64 or ZX Spectrum with two colours per cell. A little rocket cursor moved with the arrow keys indicated the current location, and when you were at the desired cell, you could then edit it using the palette and the editing square on the lower right. GRAPHIC mode also supported four large 16x16 sprites — to hide the 9918's limitation of no more than four sprites per scan line — which could be crudely moved or animated by GBASIC programs (no hyphen in the English name).

GBASIC, however, was so limited — barely any string support, small program space, terse syntax and a couple severe bugs — that I spent most of my early elementary life in Tomy BASIC. I wrote some games and some simple utilities and saved them to tape, and I still have one of these tapes, though I carelessly overwrote most of its contents later. Yet despite its lineage as a descendant of TI Extended BASIC, Tomy BASIC intentionally supported less than its ancestor, likely to keep inquisitive kids like me in a memory-safe "sandbox." There were commands for sound, character graphics and some custom character shapes, but most of these features were poorly documented (if at all), and there wasn't any supported facility for directly accessing bitmapped graphics or sprites — let alone arbitrary reads and writes to VDP memory.

Still, the "sandbox" concept ended up being unsuccessful because a number of Tomy BASIC commands don't bounds-check properly, and I remember very clearly crashing it multiple times one day trying to figure out how the COLOR keyword worked. (Alas, I was too young at the time to realize the significance of what I'd done.) Such bugs even facilitated a clever hack to enable sprites, though this hack unsurprisingly has notable limitations, and there was no way to directly access VDP registers for other features like high resolution or 40-column mode. BASIC was as much as you could do on a stock Tutor and other than a small user group in the Los Angeles area I didn't know anyone else who had one. It wasn't until several years later that I got the Tutor back, and by then I was knee-deep in Commodore programming, including handcoding 6502 machine language opcodes in the Epyx FastLoad monitor. The Tutor had been fun but I could do more with the Commodore 64 and the 128 we got after that.

Emulation came late to the Tutor both due to its obscurity and a profound lack of hardware documentation. In 1998 yours truly wrote the first Tutor "simulator," which I christened Tutti, ironically for the Commodore 64 so that it could be run anywhere a C64 emulator was supported (back then I used C64S and later Frodo). It was designed to mimic the Tutor's look and feel using a character set I laboriously drew by eye, a custom keyboard driver, raster interrupts for the 9918A's split screen modes, simple tone audio, and colour approximations with the VIC-II's palette. It had a fully functional title screen and menu plus reasonably accurate looking but very primitive GRAPHIC, GBASIC and BASIC modes. For its behaviour I manually figured out how fast things ran and added delays and tweaks, and reverse-engineered the BASIC and GBASIC editors. Surprisingly, a portion of Tutti is actually part of the project we'll do today, so hang onto it in the back of your head.

It took five more years for the first true Tomy Tutor emulator to appear, namely Ian Gledhill's 2003 TutorEm with functional 9995 and 9918A emulation; it was very slow, very buggy, incomplete and Windows-only, but it really did work and finally opened the floodgates. Later that year MESS 0.70 added a driver written by Raphael Nabet that I helped beta-test, and it is still part of modern MAME. While I have since updated TutorEm and made many fixes for my tape-enabled Tutti II emulator, we'll use MAME for debugging this entry because it is currently the only Tutor emulator that handles cartridge ROMs.

Tutti didn't emulate the CPU because I didn't know how its I/O worked and it would have been impossible to execute code in any performant fashion on the C64; even the relatively lightweight 6502-on-6502 emulator I maintain for the KIM-1 KIMplement emulator runs about 30 times slower or so than actual. I had done a little playing around with TMS9900 assembly on the 99/4A using the Editor/Assembler cartridge ("module") on a friend's machine, and I even had a basic 9900 programming book, so the 9995 wasn't really an alien architecture to me — which makes a good transition into talking about the CPU itself.

The original TI-990 minicomputers supported multiprogramming in a then-innovative fashion: most of their registers were actually stored in RAM. The only CPU-internal registers are the program counter (PC), a Workspace Pointer (WP) that indicates where in RAM the 16 registers (32 bytes) reside, and a Status (ST) register for flags. This meant that a context switch could be as simple as merely changing the PC, WP and ST registers to those of the new task. Though zero or direct pages on CPUs like the 6502 or 6809 are a related concept, the 990 WP was more versatile and indeed absolutely intrinsic to how the 990 operated. The architecture has a generally orthogonal instruction set for the time (ceteris paribus), and aside from R0 not being valid as an index the registers can be used for any general purpose, though certain instructions are fixed to specific registers like R11 as a link register for subroutine calls or R12 as the address for bit-serial I/O over the Communications Register Unit bus. Byte operations exist but all word accesses are aligned to even addresses.
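To make the registers-in-RAM idea concrete, here is a toy Python model (a sketch only; the class name `Workspace990` and its methods are invented for illustration and ignore the PC and ST registers entirely). Register Rn is simply the big-endian word at WP + 2n, so pointing WP somewhere else swaps in a whole new register set without copying anything.

```python
class Workspace990:
    """Toy model: 16 word registers that live in RAM at the workspace pointer."""

    def __init__(self, ram_size: int = 256):
        self.ram = bytearray(ram_size)
        self.wp = 0  # workspace pointer; kept word-aligned (even)

    def _addr(self, reg: int) -> int:
        if not 0 <= reg <= 15:
            raise ValueError("registers are R0..R15")
        return self.wp + 2 * reg

    def write(self, reg: int, value: int) -> None:
        a = self._addr(reg)
        self.ram[a:a + 2] = (value & 0xFFFF).to_bytes(2, "big")

    def read(self, reg: int) -> int:
        a = self._addr(reg)
        return int.from_bytes(self.ram[a:a + 2], "big")


cpu = Workspace990()
cpu.wp = 0x00
cpu.write(1, 0x1234)   # task A's R1
cpu.wp = 0x20          # "context switch" to the next 32-byte workspace
cpu.write(1, 0xBEEF)   # task B's R1 is a different word in RAM
cpu.wp = 0x00          # switch back; task A's R1 is untouched
```

With 32 bytes per workspace, a 256-byte RAM like the one modelled here holds eight such register sets.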

For the TI 99/4 in 1979 (and later the 99/4A), TI determined that designing a full 16-bit system around the 9900 would have required new chips for its exclusively 16-bit bus, making the effort too expensive for a home computer. TI solved this problem by devising two buses. The most directly connected 16-bit bus hosted the lowest level system ROM with the GPL interpreter plus 256 bytes of "scratchpad RAM," enough to store eight complete sets of registers. The scratchpad was composed of two 128-byte 8-bit static RAMs wired as low and high bytes (the 9900 is big-endian) which the CPU could access in parallel. SRAM was expensive, however, so the remainder of the machine's RAM was 16K of dynamic RAM given to the VDP, which has its own DRAM refresh circuitry. Unfortunately, although the VDP was on this 16-bit bus also, the VDP only supported byte accesses and ignored the lower half of the word, slowing DRAM reads further. Worse, everything else was behind the second "multiplexed" 8-bit bus, where a small circuit stalled out the CPU on reads until two 8-bit fetches could assemble the full 16 bits. While this meant less expensive 8-bit parts could be used, the savings came at a significant cost in performance.

The use of SRAM suggests that the 99/4 was originally intended to use a different chip that had RAM on-die, where refresh circuitry wouldn't have been needed, but cost and market considerations apparently prevailed. The intended CPU may have been something like the 1979 TMS9940 with an on-chip CRU, 2K of ROM, a decrementer and 128 bytes of SRAM, or the later TMS9985 with 8K of ROM and 256 bytes of scratchpad, though neither would have been ready in time for the 99/4. As mentioned earlier, after the disadvantages of the 9900's strict 16-bit data bus became more acute TI moved the multiplex circuitry on-chip and exposed only an 8-bit bus starting with the TMS9980 — but this also doubled access time to its external scratchpad RAM, condemning it to lower-performance applications like TI's Silent 700 teletypes. TI's first attempt was to turn the 9940 into the 9985 by adding the same multiplexer and bumping up the ROM and scratchpad RAM, which were both internal and thus avoided the bus problem. There was reportedly no market appetite for the 9985, so TI removed the ROM and reduced instruction latency further by using prefetch steps in the microcode which could be parallel with a preceding ALU operation. This is the 9995, released in 1981.

(A digression: how do you use the 9900 to implement a language like C? The simplest method is to just implement a stack, which is facilitated on the 990/9900 by its support for post-increment addressing. This is in fact the approach taken by the GCC port for the TMS9900, which treats the CPU more or less like a modern CPU with a link register [usually R11], defines an ABI for arguments and volatile/non-volatile registers, and reserves one of the registers as the stack pointer, in this case R10. R10 is a 16-bit register like all the others, so the stack can be as large as the addressing space, a significant improvement over C-hostile architectures like the 6502. Another way is to go "full Berkeley" and treat the WP as a means to implement register windows, a la SPARC: the WP can be moved to any word-aligned address, so a caller can move the WP down a few notches, set up its arguments, call the routine, capture the return value and set it back. However, the 9995 — and for that matter the stock 99/4(A) — doesn't have CPU-addressable RAM other than the scratchpad, so in the base configuration neither system would have much capacity for function calls no matter how they were implemented. The Tomy BIOS gets around this by simply moving the WP or individual registers around by hand, which is space efficient, but also makes some individual routines or subsections more difficult to use because there is no standard calling convention.)

For our purposes, although the 9995 has a few extra instructions, we can treat it in practical terms as a faster 9900. The main difference at the assembly level is where the scratchpad RAM lives: since it's external to the 9900, its location is wherever it gets decoded (e.g., in the $8000 range in the 99/4A), but in the 9995 the internal RAM always occupies $f000-$f0fb (for compatibility with the 9900 the last four bytes are seen at $fffc to $ffff and serve as the NMI vector). The 9995 also has an internal decrementer at $fffa but we won't need to deal with that right now for this particular project. The only other concern is that the prefetch in the 9995 will affect self-modifying code if it changes the very next instruction which our example doesn't do either. Otherwise, programming it is almost completely the same.

The Tomy BIOS obviously has support routines for displaying text and reading the keyboard, but we're not going to use them for several reasons: first, I'm not particularly conversant in them, second, we can probably do it faster and more flexibly ourselves, third, it's good education, and fourth, they kind of suck. For input, while we can't do anything about the Tutor's mushy Chiclets or its single SHIFT key, we certainly can improve upon the BIOS' terrible key rollover. Additionally, the Tutor's default character set is inconveniently organized for modern applications: while you can apparently use the SCELL() (the Tomy equivalent of TI CALL HCHAR) command to store characters by their ASCII value directly into VDP screen memory, this is in fact an artifact of BASIC and not actually how the glyphs are laid out in VDP RAM. We would like to organize our character set to be exactly the same as true ASCII so that no translation is needed, as well as support the 9918A's 40 column text mode which the Tutor BIOS never did. To do all of these things, we'll devise our own library.

The homebrew hardware we'll use is all from TeamEurope (hi Klaus!), who made one of the earliest Tutor flash multicarts. This is his newest unit which is the only currently available CPU RAM expansion for the Tomy Tutor and Pyuuta, providing 16K of CPU-accessible RAM in two 8K ranges as well as multiple 32K ROMs accessible by DIP switch from a 512K flash ROM. (This cartridge was actually conceived of first by tanam, but this unit is an expansion of that design.) We'll explore this device more in a future entry. However, we don't need the RAM nor the extra ROM capacity today and the device additionally requires a passive I/O port adapter for those extra addressing lines, so we'll use one of his simpler items.

That simpler item is this one, his first. It has every USA and Japanese cartridge ROM except the very rare USA and Japanese Demonstration cartridges — with a little luck I'm hoping to rectify that soon. It also lacks the later "3-D" series (a misnomer, they weren't 3D with the possible exception of Rescue Copter) which require the extra addressing line for 32K ROMs and are provided on a separate multicart.

The flash ROM itself is a socketed off-the-shelf 512K Microchip Technology SST39SF040. These chips are end-of-life but they're still inexpensive and easy to find as DIPs or PLCCs, and by using Klaus' board I don't need to make one of my own. For this I started with another DIP 39SF040 that I got cheaply since we won't need to do too many insertion cycles on the socket to get this simple program working. There is free space in the default cartridge loadout for four more 16K ROM images and we'll use two of them.

Pretty much any programmer will work for this. Since my daily driver is a POWER9 Linux workstation, I use the open-source minipro and this older XGecu TL866-II+ (minipro has experimental support for the newer T48 but the TL866-II+ is well-tested with it; unfortunately you can't trust many of the eBay and Amazon sellers to get you the older model).

For the cross-assembler, we'll use the AS macroassembler, which is multi-architecture, cross-platform, open-source and has specific support for the 9995. It builds just fine on any modern OS, including Linux and macOS. The macroassembler will create an intermediate object which we then link with an included tool into the final executable.

The Tomy machines place their VDP ports at $e000 and $e002 in the 9995's regular addressing space, while the keyboard and joy controllers (which share keyboard lines) are on the CRU bus at $ec00 through $ec70 with each group of eight lines separated by 16. The "little" 8K and "medium" 16K cartridges are both mapped to $8000-$bfff, where the Tutor expects to see two $55 bytes at $8000. If these two $55 bytes are present, the CARTRIDGE option is enabled in the menu, and selecting it triggers a jump to $8002. (There are other ways to signal its presence, but this method is the simplest and the one used by the majority of official Tomy cartridges.) So we'll start off with this:

        padding off

        ; vdp ports on the tutor
vdpwd   equ 0e000h
vdpwr   equ 0e002h
        ; CRU address for reading the keyboard
keycru  equ 0ec00h

        org 08000h

        ; cartridge signature word
        word 05555h

The leading zeroes for these particular 16-bit values are a required quirk of AS. Since we're using all our own routines, we don't want any interference from the BIOS, so we'll turn off all interrupts by setting the interrupt mask to zero and load the WP with the lowest address of the 9995's built-in scratchpad RAM. (We'll have more to say about interrupts later.)

        limi 0
        lwpi 0f000h     ; don't even trust the Tomy OS here
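Before burning anything to flash, it can be handy to sanity-check the assembled image on the host. The following Python sketch encodes only the header convention described above (two $55 bytes at $8000, entry at $8002, a 16K window ending at $bfff). The function names are hypothetical and this is not part of any Tomy or AS tooling.

```python
CART_BASE = 0x8000                 # cartridge window starts here
CART_WINDOW = 0x4000               # $8000-$bfff is 16K
SIGNATURE = bytes([0x55, 0x55])    # what the menu looks for at $8000

def has_signature(image: bytes) -> bool:
    """True if the image starts with the two $55 bytes."""
    return image[:2] == SIGNATURE

def entry_point() -> int:
    """Address jumped to when CARTRIDGE is chosen from the menu."""
    return CART_BASE + len(SIGNATURE)

def fits_window(image: bytes) -> bool:
    """True if the image fits in the little/medium cartridge range."""
    return len(image) <= CART_WINDOW
```

Reading the linked binary with `open(path, "rb").read()` and passing it to `has_signature` and `fits_window` would catch a misplaced `org` or an oversized font table before a trip to the programmer.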

The Tomy Tutor BIOS gives us literally nothing to work with anyway: before the cartridge is started, the registers are set to default values and the entire VDP RAM is cleared. That means there's no screen matrix nor a character set, and we'll have to write them to VDP RAM ourselves. (The expectation is that you'd call the BIOS' own utility routines to set those up, and that's indeed what regular Tomy cartridges do, but we're not going to do that here.) To make working with the VDP a bit more convenient, we'll construct a little utility subroutine.

vdpr    ; write to VDP registers
        ; MSB of r0: command nybble + value (8r = register,
        ; 4x = MSB VDP RAM for write, 0x = for read)
        ; LSB of r0: new register value (xx) or LSB of VDP RAM address
        ; the swapping around gives enough time for the VDP to operate,
        ; and we have no IRQs on, so nothing will interfere
        swpb r0
        movb r0, @vdpwr
        swpb r0
        movb r0, @vdpwr
        b *r11

Recall that the 9918A only has an eight-bit data bus, so we must communicate with it through byte-sized operations. This subroutine takes a single 16-bit argument in r0 that either encodes an absolute VDP RAM address for reading or writing, or encodes one of the eight VDP registers and the byte to store in it. (Because of the way these addresses are represented, i.e., either $4xxx to write or $0xxx to read, a "super 9918A" would need to implement some sort of bankswitching register to handle more than 16K. The only other supported RAM size for the 9918A is 4K.) The 9918A expects the LSB first, but on the big-endian 9995 a byte move transfers the MSB of the register, so we swap the bytes before each write to the 9918A's control register, exiting back to the caller through r11 as our link register. As a happy convenience the swap operation takes just long enough for the 9918A to handle the bus transaction and be ready for the next. With that, we can set the following:

        ; register 0 turn off bitmap and external video
        li r0,08000h
        bl @vdpr
        ; register 1
        ; - 16K mode
        ; screen off
        ; no IRQs
        ; no 40 column text mode (except if we asked for it?)
        ; no multicolour
        ; no bit 5
        ; normal 8x8 sprites
        ; normal sized sprites
        li r0,08180h
        bl @vdpr
        ; register 2: put screen table at 0800h
        li r0,08202h
        bl @vdpr
        ; register 3: put colour table at 0c00h
        li r0,08330h
        bl @vdpr
        ; register 4: put character set at 0000h
        li r0,08400h
        bl @vdpr
        ; register 5: put sprite attributes at 0000h
        li r0,08500h
        bl @vdpr
        ; register 6: put sprite pattern table at 1000h
        li r0,08602h
        bl @vdpr
        ; register 7: white text on green background
        ; (the only colours available for 40-column)
        li r0,087f2h
        bl @vdpr
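The r0 values in that listing follow directly from the encoding the vdpr comments describe. As a cross-check, here is a short Python sketch that builds the same words and shows the order the two bytes reach the control port. The helper names are invented for this illustration.

```python
def vdp_register_word(reg: int, value: int) -> int:
    """r0 value asking vdpr to store `value` in VDP register `reg` (0-7)."""
    if not 0 <= reg <= 7:
        raise ValueError("the 9918A has registers 0-7")
    return ((0x80 | reg) << 8) | (value & 0xFF)

def vdp_address_word(addr: int, write: bool) -> int:
    """r0 value that sets the VDP RAM address for reading or writing."""
    if not 0 <= addr <= 0x3FFF:
        raise ValueError("VDP RAM is 16K")
    return (0x4000 if write else 0x0000) | addr

def control_port_bytes(r0: int) -> tuple:
    """Bytes in the order vdpr sends them: low byte first, then high byte."""
    return (r0 & 0xFF, (r0 >> 8) & 0xFF)

print(hex(vdp_register_word(7, 0xF2)))   # same word as the register 7 setup
print(control_port_bytes(vdp_register_word(2, 0x02)))
```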

The locations for the screen table, colour table, character set, sprites and so forth are encoded as multiples of particular alignments. For our character set, we'll reorganize the Tutti one to match ASCII order (I told you we'd be coming back to that), add that to our binary and copy it in. We're only using positions 32-127, so there is plenty of space for expansion if we want to add graphics characters or an alternate font weight (for this purpose I just added a reverse/inverse set). Once we set the VDP memory address, we can just keep sending data to the data port as the VDP's address internally autoincrements with each write or read.

        ; load our font to >0000
        li r0,04000h
        bl @vdpr
        li r1,fontt
lup     movb *r1+,@vdpwd
        ci r1,fontt+00800h
        jne lup

We go on to clear our screen in the same way (by storing the appropriate number of space characters starting from the top left of screen memory), then set the colour matrix (if 32 columns), print our character set and display a welcome message in similar fashion.
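For reference, the table addresses and the size of the screen clear can be worked out from the register values set earlier. This Python sketch assumes the 9918A's usual per-table alignments for the non-bitmap modes, which come from the chip's documentation rather than from anything above. The names are hypothetical and the routine it describes is not shown here.

```python
# Alignment of each 9918A table: base address = register value * step
# (assumed for the non-bitmap modes used here).
TABLE_STEP = {
    2: 0x400,   # screen (name) table
    3: 0x040,   # colour table
    4: 0x800,   # character patterns
    5: 0x080,   # sprite attributes
    6: 0x800,   # sprite patterns
}

def table_base(reg: int, value: int) -> int:
    """VDP RAM address of the table selected by a register value."""
    return (value * TABLE_STEP[reg]) & 0x3FFF

def screen_cells(columns: int, rows: int = 24) -> int:
    """Number of space characters needed to blank the screen."""
    return columns * rows

screen = table_base(2, 0x02)      # register 2 was loaded with 02h
clear_word = 0x4000 | screen      # r0 for vdpr to start writing there
print(hex(screen), hex(clear_word), screen_cells(32), screen_cells(40))
```

Since the font is laid out in ASCII order, the byte written to each cell to blank it would be $20.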

Next, we want to accept keyboard input and echo it to the user. Despite the rubbery nature of the keycaps themselves and their non-standard layout, the keyboard matrix proper is actually pretty good quality: each key independently sets a particular bit in the matrix and some quick tests show there's little to no shorting or ghosting. Reading it is a simple matter of requesting every key bit in groups of eight from the CRU at their specific addresses. However, like any keyboard handler, we'll need to properly debounce the keys, and this is where the Tomy BIOS is particularly bad: if you type too quickly and multiple keys are down as you transition from one key to the next, the keyscan routine will fail to make a match and the new key will be dropped. This makes the Tutor's already somewhat mushy keyboard even worse to type on, an absolutely needless situation since the Tomy keyboard has all the hardware requirements to implement N-key rollover and is only let down by its software. The solution is to track each individual key bit using the debounce matrix to filter out key bits we already know were previously down. This is made a bit easier by the fact there's only one modifier key to watch (i.e., SHIFT), but the principles are the same.

        ; scan keyboard
        ; needs 16 bytes of scratchpad RAM
keyzone equ 0f020h
keezc   clr r0                  ; clear debounce
        mov r0,@keyzone+8
        mov r0,@keyzone+10
        mov r0,@keyzone+12
        mov r0,@keyzone+14
keez    clr r9                  ; clear test
        li r2,keyzone
        li r12,keycru-16
keezl   ai r12,00010h
        clr r0
        stcr r0,8
        movb r0,*r2+
        socb r0,r9              ; bitwise or
        ci r2,keyzone+8
        jne keezl
        ci r9,0
        jeq keezc               ; clear debounce if nothing pressed

The "keyzone" block is our current matrix followed by the debounce matrix we'll use to filter it. This section can be entered either at keezc, which blanks the debounce matrix and falls through to keez, or directly at keez to read the matrix. Reading from the CRU requires placing the CRU address (a parallel addressing space) into R12 and asking for the needed number of bits. We fetch in groups of eight bits which are in eight locations stored 16 CRU bytes apart, keeping a running logical-OR (which the 9900/9995 atypically calls soc/socb "Set Ones Corresponding" for non-immediate arguments). If the running logical-OR is zero, then no key was pressed, so we branch back to clear the debounce and scan the matrix again.
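
The shape of that scan loop can be sketched in Python. The real CRU reads are replaced by a hypothetical read_group callback, and the function names are invented for illustration; only the structure (eight byte-wide reads, a running OR, and big-endian packing into four words) mirrors the listing.

```python
def scan_matrix(read_group, groups=8):
    """Read `groups` bytes of key bits, keeping a running OR of everything seen.

    read_group(i) is a stand-in for the stcr from the i-th CRU location.
    """
    rows = []
    running_or = 0
    for i in range(groups):
        bits = read_group(i) & 0xFF
        rows.append(bits)
        running_or |= bits          # plays the role of socb r0,r9
    return rows, running_or


def pack_words(rows):
    """Pair the bytes into big-endian 16-bit words, as the 9900 sees them."""
    return [(rows[i] << 8) | rows[i + 1] for i in range(0, len(rows), 2)]


# Made-up scan: one bit set in the first byte, one in the seventh.
fake = [0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x04, 0x00]
rows, any_down = scan_matrix(lambda i: fake[i])
words = pack_words(rows)
```

A zero any_down corresponds to the jeq keezc case: nothing is pressed, so the debounce matrix can simply be wiped.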

The simplest case is where the current state of the matrix exactly equals the last time (modulo the state of the SHIFT key). This can be checked for by exclusive-ORing with the debounce matrix, masking off the SHIFT bit. We then logical-OR all the resulting bits together and if it's zero again, we go back to scanning — but leave the debounce matrix alone.

        mov @keyzone,r2
        mov @keyzone+2,r3
        mov @keyzone+4,r4
        mov @keyzone+6,r5
        xor @keyzone+8,r2       ; xor current bits with last set
        xor @keyzone+10,r3
        xor @keyzone+12,r4
        ; clear shift bit (prevent "eeking" characters when releasing)
        andi r5,0fbffh
        xor @keyzone+14,r5
        ; if exactly equal (i.e., all zeroes), go back
        soc r3,r2
        soc r4,r2
        soc r5,r2
        jeq keez        ; don't clear debounce

This is the block of code we'd use to set up key repeat, which as currently written this routine doesn't support yet (an exercise for the future). Otherwise, we need to filter the debounce to remove any keys that are now up, filter the new matrix to remove any keys already present in the debounce (which may give us another zero matrix again, but our keyscan table doesn't match an all zeroes matrix, so it's "fine"), and update the debounce matrix with the new bits that are down while clearing the SHIFT flag. This heavily uses the oddball szc instruction, which is an inverted logical-AND (though, like soc and ori, there is a regular immediate andi that is not inverted, a curious non-orthogonality in the instruction set). I won't show every store here but I'll give the overall flavour — there's probably a more efficient way to do it than I've done, but this is also pretty easy to follow conceptually:

        ; remove any bits in the debounce that aren't set currently
        mov @keyzone,r0
        inv r0
        szc r0,@keyzone+8       ; inverted and
        mov @keyzone+2,r0
        inv r0
        szc r0,@keyzone+10
[...]
        ; remove any bits in the new keyscan that were still set in debounce
        ; if we end up with a cleared keyscan, it doesn't matter since we
        ; won't be able to decode it anyway
        mov @keyzone,r0
        szc @keyzone+8,r0       ; inverted and
        mov r0,@keyzone
        mov @keyzone+2,r0
        szc @keyzone+10,r0
        mov r0,@keyzone+2
[...]
        ; update debounce, clearing shift
        ; add any new bits to debounce so they get masked off too
[...]
        mov @keyzone+4,r0
        soc @keyzone+12,r0
        mov r0,@keyzone+12
        mov @keyzone+6,r0
        soc @keyzone+14,r0
        andi r0,0fbffh
        mov r0,@keyzone+14
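
To see why this gives rollover behaviour, here is a hedged Python model of the two stages above: the "nothing changed" test and the three-step filter. The matrices are lists of four 16-bit words, the function names are invented for the sketch, and the scan values are made up; the 0FBFFh SHIFT mask is the one from the listing. The all-zero case that branches to keezc is not modelled.

```python
SHIFT_MASK = 0xFBFF  # same constant as the andi in the listing


def unchanged(current, debounce):
    """True when the scan matches the debounce matrix, ignoring SHIFT."""
    masked = list(current)
    masked[3] &= SHIFT_MASK
    diff = 0
    for cur, old in zip(masked, debounce):
        diff |= cur ^ old
    return diff == 0


def filter_new_keys(current, debounce):
    """Return (newly pressed bits, updated debounce matrix)."""
    # 1. forget debounce bits for keys that are no longer down
    held = [old & cur for cur, old in zip(current, debounce)]
    # 2. strip keys we already reported from the current scan
    fresh = [cur & ~old & 0xFFFF for cur, old in zip(current, held)]
    # 3. remember the new keys too, but never remember SHIFT
    updated = [new | old for new, old in zip(fresh, held)]
    updated[3] &= SHIFT_MASK
    return fresh, updated


debounce = [0, 0, 0, 0]
reported = []
scans = [
    [0x0100, 0, 0, 0],  # first key goes down
    [0x0100, 0, 0, 0],  # still held: nothing new
    [0x0300, 0, 0, 0],  # second key pressed before the first is released
    [0x0200, 0, 0, 0],  # first key released, second still held
]
for scan in scans:
    if unchanged(scan, debounce):
        continue
    fresh, debounce = filter_new_keys(scan, debounce)
    if any(fresh):
        reported.append(fresh)
```

In this run the overlapping second key is reported on its own instead of being dropped, and releasing the first key produces no spurious character.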

Now with a clean set of keybits, we need to match them against a table. I organized a table of four words representing the eight matrix bytes in ASCII order so once you've found a matching set, the index into the table is the result. This table is stored at label keytab and looks like this:

[...]
        ; symbols and numbers, 32-64
        ; SPACE
        word 00000h, 00000h, 00000h, 08000h
        ; !
        word 00100h, 00000h, 00000h, 00400h
        ; "
        word 00200h, 00000h, 00000h, 00400h
        ; #
        word 00001h, 00000h, 00000h, 00400h
        ; $
        word 00002h, 00000h, 00000h, 00400h
[...]

The Tutor does not have CONTROL or ALT keys, just SHIFT, nor does it have a backspace or delete. This lets us redefine our special keys (the cursor keys, MON and MOD) to generate indices in the control character range. Our table turns MOD into ^C (consistent with its use in Tomy BASIC as break), LEFT/UP/DOWN/RIGHT as ^H ^K ^M/CR ^L, RT (RETURN) as ^J/LF, and MON as ^[/ESC. To round out other common ASCII points the default keyboard doesn't generate, ^I/TAB is encoded as SHIFT-SPACE, backtick as SHIFT-UP, tilde as SHIFT-DOWN and ^?/DEL as SHIFT-LEFT. The pipe and backslash characters remain represented by flat and degree/handaku, which have the same ASCII value. The only key our matrix table does not handle is LOCK, which would be word 00000h, 00000h, 00000h, 00200h. I'd probably implement this as a conventional CAPS LOCK defaulting to up but we'll exclude that from the logic for now. Anything not matched in the table gets a result of 0.
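
For reference, the control-character assignments described above work out to the following ASCII values. This is just a Python restatement of the mapping; the dictionary and the ctrl helper are illustrative names, not part of the project.

```python
def ctrl(letter):
    """ASCII control code for ^letter: keep only the low five bits."""
    return ord(letter) & 0x1F


special_keys = {
    "MOD": ctrl("C"),          # break
    "LEFT": ctrl("H"),
    "UP": ctrl("K"),
    "DOWN": ctrl("M"),         # CR
    "RIGHT": ctrl("L"),
    "RT": ctrl("J"),           # LF
    "MON": ctrl("["),          # ESC
    "SHIFT-SPACE": ctrl("I"),  # TAB
    "SHIFT-UP": ord("`"),
    "SHIFT-DOWN": ord("~"),
    "SHIFT-LEFT": 0x7F,        # DEL
}

for name, code in special_keys.items():
    print(f"{name:12s} -> 0x{code:02X}")
```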

        ; decode key
        ; each table entry corresponds to CRUs >EC00-EC70
        ; use a custom table to generate a standard ASCII value
        clr r6
        li r1,keytab
dekodl  mov *r1+,r2
        mov *r1+,r3
        mov *r1+,r4
        mov *r1+,r5
        ci r2,0ffffh            ; no key here
        jeq dekodn
        ; if the key matrix is an exact match, should be all zeroes
        xor @keyzone,r2
        xor @keyzone+2,r3
        xor @keyzone+4,r4
        xor @keyzone+6,r5
        soc r3,r2
        soc r4,r2
        soc r5,r2
        jeq dekodo
dekodn  inc r6
        ci r6,128
        jne dekodl
        b @keez
dekodo  mov r6,r0               ; got a good key

Like the debounce comparator, this code XORs the current matrix value against the current table entry; if it gets all zeroes, we have a match. At the end the resulting character code is in R6 and R0. Parenthetically, the 9900 has an inct instruction that increments by two instead of just one with regular inc, useful for skipping words (you can also use an instruction like c *r1+,*r1+ to increment by four in one word).
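
The lookup is simple enough to model in a few lines of Python. This sketch fills in only the five entries shown in the table excerpt and marks every other slot unused with the 0FFFFh sentinel the listing tests for; the names keytab and decode are reused for readability but the code is an illustration. It returns 0 for no match, following the text, whereas the listing simply branches back to scanning.

```python
UNUSED = (0xFFFF, 0x0000, 0x0000, 0x0000)   # first word 0xFFFF: no key here

# Only the entries from the excerpt are populated; the rest are placeholders.
keytab = [UNUSED] * 128
keytab[ord(" ")] = (0x0000, 0x0000, 0x0000, 0x8000)
keytab[ord("!")] = (0x0100, 0x0000, 0x0000, 0x0400)
keytab[ord('"')] = (0x0200, 0x0000, 0x0000, 0x0400)
keytab[ord("#")] = (0x0001, 0x0000, 0x0000, 0x0400)
keytab[ord("$")] = (0x0002, 0x0000, 0x0000, 0x0400)


def decode(matrix, table=keytab):
    """Return the table index whose four words equal the matrix, else 0."""
    for code, entry in enumerate(table):
        if entry[0] == 0xFFFF:
            continue                     # skip unused slots
        diff = 0
        for have, want in zip(matrix, entry):
            diff |= have ^ want          # XOR then OR, like the listing
        if diff == 0:
            return code                  # index is the ASCII value
    return 0


print(chr(decode([0x0100, 0, 0, 0x0400])))
```

Because the table is in ASCII order, no translation step is needed after the match: the loop counter is the character code.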

Because we left the VDP memory pointer at the end of our "hello world" blurb, to print the character to the screen we could simply do swpb r0 to get it in the upper byte followed by movb r0,@vdpwd. This doesn't scroll at the end as there's no bounds-checking, and remember the Tutor doesn't have a backspace (control characters are simply printed as blanks anyway), but it's really really fast. However, we also want to display a cursor, for which we'll use our reversed space character, so we'll keep a rolling screen pointer in R7. We'll additionally have RT clear the screen and as a convenience use MON to bail out to the Tomy title.

        ; check for mon - implemented as escape
        ci r0,001bh
        jeq bye

        ; check for RT - implemented as line feed
        ci r0,000ah
        jne putc
        bl @clrscr
        b @cursor

        ; otherwise print character using a cursor
        ; overwrite previous cursor character with new character
putc    swpb r7
        movb r7, @vdpwr
        swpb r7
        movb r7, @vdpwr
        swpb r0
        ; MAME will actually allow a mov here but not the real machine
        movb r0, @vdpwd
        inc r7
        ; and print cursor
cursor  li r0,0a000h
        movb r0, @vdpwd
        b @keez
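
The swpb dance in putc is easier to see with concrete numbers. The following Python sketch assumes R7 holds the screen position with the 0x4000 write flag already set, and uses a made-up screen base of 0x0800 purely for the demonstration; the function names are also illustrative. The 0xA0 cursor value is the one loaded in the listing.

```python
def address_bytes(r7):
    """Bytes sent to the address port, in order: low byte first, then high."""
    return r7 & 0xFF, (r7 >> 8) & 0xFF


def putc(vram, r7, ch, cursor=0xA0):
    """Write ch where the cursor was, draw a new cursor after it, bump R7."""
    low, high = address_bytes(r7)
    addr = ((high & 0x3F) << 8) | low     # strip the 0x40 write flag
    vram[addr] = ch & 0xFF                # overwrites the old cursor
    vram[(addr + 1) & 0x3FFF] = cursor    # autoincrement puts the cursor next
    return r7 + 1


vram = bytearray(0x4000)
SCREEN = 0x0800                           # hypothetical screen table address
r7 = 0x4000 | SCREEN
for ch in b"HI":
    r7 = putc(vram, r7, ch)
```

After the two calls the screen holds "HI" followed by the cursor, and R7 points at the cursor cell so the next character replaces it.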

The simplest way to go back to the title screen is to call the Tutor's reset vector, but paradoxically the 9900's built-in rset instruction is not what we want for this. There are a handful of TI-990 holdovers called external instructions which were used for special context switching operations, such as lrex to jump into front panel code. However, on the 9900/9995, most of the instructions with the possible exception of idle do nothing useful and in some cases could be potentially harmful depending on what's listening on the bus.

Instead, we'll use the low memory vectors. 9900 vectors consist of a pair of WP and PC words, with interrupt vectors starting at $0000. When an interrupt is triggered, or a vector is branched to using the blwp instruction, the WP and PC are loaded in order from those words (instantly saving the previous code's registers, assuming there is no conflict) and the previous values of WP, PC and ST are placed in the new R13, R14 and R15 respectively. The rtwp instruction then reverses everything using those registers and thus returns to the prior execution context. Theoretically the TMS9900 can support up to 16 levels of interrupt, starting at $0000 with level 0 for resets through $003c for level 15, though the 99/4 and 99/4A just wire everything to interrupt level 1. In the 9900 memory map these vectors are followed by XOP vectors for up to 16 software-defined opcodes via the xop family of instructions.
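
A toy Python model may make the context switch more concrete. It only tracks WP, PC and ST plus a word-addressed memory dictionary; the class name and the vector contents and register values in the demo are invented. The vector arithmetic (four bytes per level, XOP vectors directly after the 16 interrupt vectors) follows the description above.

```python
def interrupt_vector(level):
    """Address of the WP/PC pair for an interrupt level (0-15)."""
    return 4 * level


def xop_vector(n):
    """Address of the WP/PC pair for XOP n, following the interrupt vectors."""
    return 0x40 + 4 * n


class Cpu:
    def __init__(self, mem):
        self.mem = dict(mem)             # word address -> 16-bit value
        self.wp = self.pc = self.st = 0

    def blwp(self, vector):
        old = (self.wp, self.pc, self.st)
        self.wp = self.mem.get(vector, 0)
        self.pc = self.mem.get(vector + 2, 0)
        # old WP, PC, ST land in the NEW workspace's R13, R14, R15
        for reg, value in zip((13, 14, 15), old):
            self.mem[self.wp + 2 * reg] = value

    def rtwp(self):
        wp = self.wp
        self.st = self.mem[wp + 30]      # R15
        self.pc = self.mem[wp + 28]      # R14
        self.wp = self.mem[wp + 26]      # R13


# Hypothetical vector contents and caller state, for illustration only.
cpu = Cpu({0x0000: 0xF000, 0x0002: 0x0024})
cpu.wp, cpu.pc, cpu.st = 0xF0A0, 0x8123, 0x2000
cpu.blwp(interrupt_vector(0))
entered = (cpu.wp, cpu.pc)
cpu.rtwp()
```

Since registers are just memory at WP, swapping WP is the whole context switch, which is why it is so cheap as long as the two workspaces don't overlap.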

However, the 9995 only implements seven distinct interrupt levels, two of which are actually software interrupts (and one of those doesn't even work properly according to the manual errata). The highest level is level 0, connected to the reset pin, followed by the MID interrupt used for software opcodes, then NMIs, and then four numbered interrupts consisting of an external interrupt (1) on the INT1 pin, a not-reliably-functional arithmetic overflow interrupt (2), the on-chip decrementer (3) and another external interrupt (4) on INT2. These have their own vectors except for the MID interrupt and interrupt level 2 which share the same vector, and other than the NMI vector at $fffc all the rest come from low memory as well.

In this particular regard, the Tutor is no different from the 99/4A: a blwp @0 will jump into the reset vector at $0000 just as if you'd powered the machine on, which takes you back to the menu. While the Tutor also uses the same reset vector values for levels 1 and 2, level 3 (the decrementer) runs normally to service its regular tasks and the level 4 (external INT2 triggered) interrupt is used for triggering on tape reads. On the other hand, the Tutor uses the entire XOP vector range as part of a jump table, so it isn't possible to use any XOP instructions on the Tutor with the standard ROMs (the 99/4A at least has a couple useful values there). Anyway, all this is to say that a simple blwp @0 will be sufficient.

That's pretty much it, and we're ready to assemble our first draft. I'll get to the code in a minute, but let's test it in MAME with mame tutor -skip_gameinfo -cart tello.rom. Our character set and welcome message appear beautifully though typing is a bit ... messy.

The problem isn't our code, it's MAME's default settings. You'll find the same mojibake occurs while typing in regular BASIC as well. I mentioned that the keyboard matrix is shared with the joy controller lines (though that's actually useful because it allows you to read some keys from GBASIC which wouldn't ordinarily permit this), and because MAME defines some keys for the controller, you can't type normally with the default keyboard settings. In my case, I have a Hyperkin Trooper 2 USB joystick I use for Tutor games because it has two buttons for SL and SR, so I removed the key equivalents for the joy controllers and set it to exclusively use the joystick. Now we can type normally.

And as hoped for, typing now flows beautifully. There is one more issue we need to solve, though:

I mentioned we would like this to also support the 9918A's 40 column mode, something the Tutor BIOS doesn't support but which would be very useful for productivity applications (the Tutor has a lot of great games but it's time it achieved its potential, darn it). With a little tweaking we can turn on the 40 column bit in VDP register 1 and adjust our message and screen layout so everything ends up in the right place. However, the 9918A can't display a 320-pixel-wide screen, so instead it displays a 240-pixel-wide screen using only the leftmost six columns of each character cell. The cells are still eight bits wide in memory; the rightmost two are simply not displayed. This sort of works for some of the characters (lowercase in particular, which makes me wonder if this was a consideration during the Tutor's development) but clearly doesn't for others.
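
The arithmetic behind the two layouts, and a quick way to spot glyphs that will lose pixels, can be sketched in Python. The function names and the two sample glyphs are hypothetical stand-ins rather than data from the actual font; the 24-row screen height is the standard 9918A figure.

```python
ROWS = 24


def screen_geometry(columns):
    """Visible width in pixels and screen table size in bytes."""
    cell_width = 6 if columns == 40 else 8
    return {"pixels": columns * cell_width, "cells": columns * ROWS}


def cell_offset(row, col, columns):
    """Offset of a character cell from the start of the screen table."""
    return row * columns + col


def clipped_in_40_columns(glyph):
    """True if any pixel sits in the two rightmost (undisplayed) columns."""
    return any(row & 0x03 for row in glyph)


# Made-up 8x8 glyphs: one uses all eight columns, one only the left six.
wide_glyph = [0xFF, 0x81, 0x81, 0x81, 0x81, 0x81, 0xFF, 0x00]
narrow_glyph = [0xFC, 0x84, 0x84, 0x84, 0x84, 0x84, 0xFC, 0x00]

print(screen_geometry(32), screen_geometry(40))
print(clipped_in_40_columns(wide_glyph), clipped_in_40_columns(narrow_glyph))
```

Running a check like this over a whole character set would list exactly which glyphs need redrawing for the narrower cells.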

Since the font started as a Commodore 64 character set, after all, we'll go back into Ultrafont+ and start shrinking them down, leaving a bit of gutter space.

I think we're ready to try it on the real thing!

I used minipro to dump the current contents of the multicart ROM and wrote up a little Perl script to do an in-place overwrite on the resulting file with our new binaries. We generate two ROM files called tello.rom (32 column) and tello40.rom (40 column), which we splice in at locations 0x70000 and 0x74000, and then burn the result to flash.

% make burn
perl splice /home/censored/tutor.bin 0x70000 tello.rom 0x2000
successfully replaced 8192 bytes at offset 458752 in /home/censored/tutor.bin with tello.rom
perl splice /home/censored/tutor.bin 0x74000 tello40.rom 0x2000
successfully replaced 8192 bytes at offset 475136 in /home/censored/tutor.bin with tello40.rom
minipro -p SST39SF040 -z
Found TL866II+ 04.2.131 (0x283)
Pin test passed.
minipro -p SST39SF040 -w /home/censored/tutor.bin
Found TL866II+ 04.2.131 (0x283)
Chip ID OK: 0xBFB7
Erasing... 0.40Sec OK
Writing Code...  30.12Sec  OK
Reading Code...  4.30Sec  OK
Verification OK
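
The Perl splice script itself isn't reproduced here, but the idea of an in-place overwrite is easy to sketch. The following Python stand-in is an assumption about what such a tool does, not a port of the real script: the function name, the error handling and the demo file names are all invented, and the demo runs against throwaway files in a temporary directory rather than a real flash dump.

```python
import os
import tempfile


def splice(image_path, offset, rom_path, length):
    """Overwrite `length` bytes of image_path at `offset` with rom_path's data."""
    with open(rom_path, "rb") as rom:
        data = rom.read(length)
    if len(data) != length:
        raise ValueError("ROM file is shorter than the requested length")
    with open(image_path, "r+b") as image:      # r+b: modify without truncating
        image.seek(0, os.SEEK_END)
        if offset + length > image.tell():
            raise ValueError("splice would run past the end of the image")
        image.seek(offset)
        image.write(data)
    return len(data)


# Demonstration with dummy files sized like a 512K flash and an 8K ROM.
with tempfile.TemporaryDirectory() as tmp:
    image_path = os.path.join(tmp, "image.bin")
    rom_path = os.path.join(tmp, "demo.rom")
    with open(image_path, "wb") as f:
        f.write(b"\xff" * 0x80000)
    with open(rom_path, "wb") as f:
        f.write(b"\x42" * 0x2000)
    replaced = splice(image_path, 0x70000, rom_path, 0x2000)
    with open(image_path, "rb") as f:
        result = f.read()

print(f"replaced {replaced} bytes at offset {0x70000}")
```

Opening the image in "r+b" mode is the important design choice: everything outside the spliced window is left untouched, so the other multicart slots survive the rewrite.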

The multicart DIP settings for the 32-column version are (1=on) 00011 if you use those addresses. The first time I tried, the hello message appeared but typing generated no output. This was because I had a mov r0,@vdpwd instead of movb r0,@vdpwd; MAME will accept either instruction, but a real Tutor will not. With that corrected, we're in business!

For the 40-column version, set the DIP switches to 00010.

These are real composite video captures from my real Tutor. Hurray! We did it!

Now, what things could you do with better keyboard support, true ASCII and fast character display? Well, obviously this whole proof of concept is the start of doing something more practical with the Tutor. I'll answer that question in a couple months once all the parts arrive. The first order of business will be installing a PLCC adapter in the multicart so I don't have to pull the flash chip out repeatedly for testing.

Let's briefly finish our Tutor story. There are in fact two other members of the Tutor/Pyuuta family, both domestic to Japan: the Pyuuta Jr., a game console that came out between the Pyuuta and Tutor that implemented GRAPHIC mode (but no GBASIC) and could play most cartridges — but was entirely in English with no katakana support at all — and this, the last and final Tutor, the Pyuuta Mark II (variously Pyuuta mk II and Mark II). The mk II had proper keycaps with a rearranged layout and used the American Tutor BIOS instead of the Pyuuta's, with Tomy BASIC (also the same as the American Tutor's version) available as an add-on cartridge. Notably, neither system supported a composite monitor, only RF TV output, though they're pretty easy to comp-mod. Both systems have slight hardware differences from the Pyuuta and Tutor but both will run all the same cartridge games, and they also don't require special hardware for 32K cartridges. Unfortunately, their English-only nature probably didn't endear them to their home markets and the mk II can't load Pyuuta tapes either (only US Tutor ones). Likely as a result, both sold poorly, and Tomy exited the home computer market as well in 1984.

As for the 9995, in the end it was only ever implemented in three systems: the Tutor/Pyuuta family, a PEB upgrade called the Myarc Geneve 9640 which was basically a new TI-compatible computer on a card, and the Powertran Cortex, a home and business computer first built at TI in the United Kingdom that never got released due to internal squabbles. Instead, its plans were published in the Electronics Today International magazine and a company called Powertran Cybernetics sold kits and fully assembled machines. The Cortex ran at a full 3MHz, had 64K of chip RAM (with a memory mapper supporting up to 1MB) and used a 16K PAL equivalent 9928/29 VDP for graphics, though the more advanced Yamaha/Maplin V9938 could also be substituted with up to 128K of VRAM. Floppy, serial and DMA were all supported along with a built-in BASIC and multiple operating system options, even a small v6 UNIX port called LSX. Although popular with enthusiasts, it was an obscure system then and now, and relatively few examples remain in operation.

While the 9995 was a much more tractable chip than its ancestors, the reliance of the 9900 series on RAM was what eventually stunted its technological evolution. In the days when it was taped out, CPU die space was expensive, so shifting register space onto cheaper RAM which often ran at a similar speed was a logical alternative. Indeed, tricks like 6502/6800 zero page are the same basic idea, using a special expanse of memory with faster access and special addressing as if it were CPU registers itself. As CPUs became substantially faster than memory, however, this architectural quirk became more of a liability and contemporary 16-bit CPUs like the Motorola 68000 and the Intel 8086 and 80286 eclipsed it. A chip like the 6502 only got away with it for as long as it did because it was incredibly cheap and incredibly common and even today still sells in quantity, neither advantage being one the 9900 or 9995 ever possessed. Today, modern CPUs have comparatively massive register files and caches as proof that the 9900 idea was a dead end. After the 99000 family, an upgraded 9900 with segmented memory only used in TI's last range of minicomputers, TI abandoned further development of the architecture in 1983 for the TMS320 DSP series and the exceptionally swift TMS32010, a much more popular (and, especially for a DSP, more conventional) processor.

The source code for our demonstration project along with a Makefile, the character set binaries and the keyscan table are available on GitHub under a 3-clause BSD license.


Monday, March 18, 2024
