Saturday, August 12, 2023

Cracking DesignWare's The Grammar Examiner on the C64

It's been a while since I've stripped the copy protection off a Commodore 64 software package. This weekend I had a reason to.
I should point out a couple things as preamble. First, my parents insisted I would not rot my brain with games (much), so we had a lot of educational titles for our C64, and second, my wife is a high-school English teacher. I kept a number of packages from back then and one of them was a secondarily acquired copy of DesignWare's The Grammar Examiner from 1984, something like a mashup between a board game and Strunk and White's Elements of Style, where you get to edit a fictional newspaper and fix all the typos and bad punctuation in your quest to become editor-in-chief.

I rather liked it back in the day. Don't judge. My wife — who used Commodores as a girl in the Australian school system but not this particular title — enjoyed it even more than I thought she would, enough so that she occupied VICE on the Talos II playing it all afternoon and prevented me from writing this.

The Grammar Examiner plays well enough on my real Commodore 128DCR, though it's a very slow loader, and I only have an original disk which I'd like to preserve. (My original original copy disappeared a while back, though I've had this particular one at least a couple decades.) A quick sector-by-sector D64 image using a ZoomFloppy yielded a number of apparently intentionally bad sectors typical of early 1980s copy protection, but even with the error information the program's loader just plain hung up in VICE trying to boot the copy. Yes, a nibbled raw copy of the GCR would work and I imagine people have made one of this title, but we'd also like to speed up the process instead of burdening the emulator further (and it would be nicer on the real system too).

So in this post we'll explore the loader routine, decrypt and extract it, figure out how the copy protection is implemented and work around it, and then pull out the payload it reads for a faster start. While we're at it, let's look briefly at the program itself, an interesting example of Forth programming "in the large" on 1980s home computers.

* * *

Back in the day, educational software was generally neglected by piracy. We got our warez through two distinct sources, namely our teenaged babysitter whom we got in trouble when we convinced him to let us stay up late (ending both his employment and our connection), and the son of a friend of the family who also had an elder sister I had a fairly hopeless crush on. There's just something about permed hair and braces. Anyway, almost all of those were games.

As a result, educational titles didn't have the evolutionary pressure games did to contain clandestine copies; they only had to stop the casual home copier who wouldn't have the latest tools like Fast Hack'Em or Copy II Plus or some such. They also sold by and large to generally law-abiding consumers (i.e., schools, parents), and not in numbers sufficient to justify licensing one of the higher-end copy protection schemes — I never met an education program on the Commodore 64 that used Rapidlok or V-MAX!, for example. That may also have been why or because educational software makers were some of the earliest to leave the home computer market for PCs and compatibles, whereas games continued to flourish on the C64 due to its massive installed base as late as the early 1990s.

None of that means there weren't cracks, of course, but there were certainly fewer. Kids like me whose parental-sponsored software libraries (unwillingly) skewed more towards edutainment had to do our own workarounds, most of which by now are hopefully past the statute of limitations.

That brings us to today's victim.

DesignWare was founded by Jim Schuyler in 1980 on a $3000 credit card, a PhD in computer science from Northwestern and an Apple II personally given to him off the assembly line by Steve Jobs (who was admonished by another exec to never do that again). Schuyler's background was in computer-aided instruction, and he had developed a language called LINGO, not to be confused with Macromedia Director's scripting language, to model the interaction between a user and the machine. Around the same time was another larger system innovating in that very same space using similar ideas, which was of course the TUTOR language underlying UIUC CERL's PLATO. Schuyler reworked LINGO into MULTITUTOR, which could run TUTOR lessons on their own local CDC 6400 mainframe but wasn't limited to PLATO terminals. While at the University of the Pacific in San Francisco, MULTITUTOR's ideas and architecture in turn came to underlie Schuyler's CDS in 1977, the CAI Design System (later the Courseware Design System), with its own bespoke TUTOR-like language that could run converted TUTOR, PILOT and other CAI systems' lessons.

Early microcomputers were starting to appear, notably the Apple II and the Atari 8-bit family, and after meeting with their leadership Schuyler felt these more inexpensive home computers had greater potential for democratizing education than the client-server models he first worked on. Initially DesignWare did contract and commissioned work for other companies, most notably Reader's Digest Software (Trickster Coyote comes to mind) and later in 1982 Spinnaker Software, where their first big hits FaceMaker and Story Machine were published. In 1983 they started selling educational and edutainment software under their own name, one of the few submarkets in the industry back then not severely impacted by the video game crash. These first-generation first-party titles included notables such as Crypto Cube and especially Spellicopter, which became one of their biggest sellers. The above ad was from Creative Computing in October 1983.

By 1984 DesignWare was developing for the Apple II family, the Atari 8-bit family, the IBM PC, the IBM PCjr and the Commodore 64, meeting Schuyler's goal of educational opportunities on all kinds of home systems. In fact, this catalogue from October 1984 shows the Commodore 64 eventually got ports of all their first-party software; the only computers that didn't run the full line were the IBM PCjr that DesignWare didn't port one title to, and ironically the Atari 8-bit family, which despite being one of their launch platforms eventually became too small a market to maintain development for. Unfortunately cash flow problems caused DesignWare to be acquired twice, first in 1984 by Management Software Associates along with other purchases such as accounting giant Peachtree, then spun off 38 days later to absorb and revive some of MSA's other failing holdings like the hapless EduWare until DesignWare was sold off to Britannica Software in 1986.

In this guise DesignWare continued producing and publishing titles like Designasaurus under the DesignWare name as a sublabel of Britannica, expanding to the Amiga, until around 1989 when the brand was dissolved; the subsequent Spellicopter II sequel was developed and published under Britannica in 1991. After a name change to Compton's, Britannica eventually merged with SoftKey in 1995 and through SoftKey's later disastrous history the DesignWare IP now rests as part of Houghton Mifflin Harcourt Learning Technology — notably Harcourt was one of DesignWare's original contract publishers — along with other forgotten properties like Spinnaker and Springboard Software's back catalogues.

Early DesignWare titles came in typical boxes. We had Trap-A-Zoid from that era, but I've never enjoyed geometry, and I can't say I'm completely sad that one got lost. Later on DesignWare switched to these slicker spiral-bound books where the diskette was in a Tyvek pouch in the back. This wasn't my original copy of The Grammar Examiner, which I think got lost in the garage somewhere, but rather a replacement that I've still had for a good couple decades or more. DesignWare's 1984 MSRP for this title was $44.95 [2023 around $132]; this package was marked down to $29.95, though I don't remember exactly when that was.
DesignWare programs were also distinguished by a common visual style: lots of high contrast black-and-white (the lowest common denominator for their supported systems), the ever-present DesignWare logo, and sort of a scratchy hand-drawn look to the graphics which actually was somewhat endearing.
Less endearing was the loader. I never played these titles on an Apple II or Atari, but the Commodore 64 version took ages to even get started (the manual apologetically says it "is a large, complex program that will take a few minutes to load" — about two minutes and 15 seconds on a stop watch to the title screen) and my trusty Epyx FastLoad didn't make any difference. As it turns out, there are two stages to the loader and unfortunately not only does the program's on-disk layout make typical fastloading impossible, but the initial primary loader is also slower than it should be due to a technical oversight. More to come on that.

The game would frequently pause to load more data from the disk. When it did, a little coloured floppy disk icon would flash in the lower corner, sounding a little two-tone chirp ("ke-LOP!") each time. This sound has deeply ingrained itself into my consciousness.

Here's a brief tour of The Grammar Examiner before we break it apart.

The main menu. DesignWare was very forward-thinking, including a lot of content on one disk side and allowing you to create your own lessons. I never did that as a kid but I always thought it was cool. For kids like me who were socially apathetic and/or inept, you could play with Mel, who had an adjustable IQ (though what this really did was just set the percent chance he had of randomly picking the right answer — even at an "IQ" of 150 he would still make some amazingly bad gaffes).
Sadly, I don't believe DesignWare ever issued any game disks of their own. If they had, we might have bought them. Disk swapping would have probably made it more maddening than it should have been but at least the program supported multiple disk drives.
Four game boards were available on disk (plus your own) but I most enjoyed the default N. Y. Times layout. Every board had seven types of squares, six of them being start (does nothing), chance, a quick grammar question, edit and revise a paragraph, randomly jump somewhere on the board, and return to start. You rolled a simulated die with RETURN and then moved your piece on the board (I'm the camera, Mel is the ball behind me). While you could turn loops as part of your moves, you could not immediately go back the way you travelled.

The chance, quick question and editing squares either added to your accumulated pay or (if you were unlucky or incompetent) docked it. All players start with $20,000, which seemed like a nice nest egg for a cub reporter in 1984 (in 2023 about $58,500). If you clear $23,000, then the seventh type of game square will appear at the top right, allowing you to bribe the publisher to let you become Editor-in-Chief if you have the highest sum when someone gets there. You then win the game until the early 2020s when local news reporting suffers a major financial downturn. The random squares were particularly important for getting to the finish line because of the ever-present threat of the return-to-start squares and Mel had a distinct programmed tropism for them.

In our collective expert opinion, the chance squares exist largely as a hazard and we seemed to almost always lose money. The only one who made bank overall from these was Mel, which at least kept him in the game at the lower IQ levels. On the other hand, the artwork was "drawn" on screen instead of being stored as bitmaps, a nice animated touch, and was very well executed.
Paragraph editing could get you a cool grand if you made no mistakes (that's more than I've typically made writing entire articles, starting with when I wrote for COMPUTE!'s Gazette in the 1990s!), and started docking your pay if you made more than two. This was my favourite part of the game due to the score boost and the amusing nature of the prose, with more than a few tense moments watching the evaluator go through it word by word to check you.

The editing screen did not let you insert words, and the idea was not to entirely rewrite the article. Instead, you corrected grammar, spelling and punctuation by retyping individual words in place, keeping the overall format, tone and sentential structure the same. My wife and I quibbled over this in a couple places since style can be sometimes a subjective choice, but the game, and probably appropriately for an age range where ambiguity can be pedagogically unhelpful, has but one single view of truth and tolerates no variations. The manual says the style format came from the Chicago Manual of Style and Warriner's English Grammar and Composition, choosing Warriner's where they differed due to its focus on high school instead of college. Hint if you're unable to decide: some of the sentences from the paragraphs appear in the manual as examples.

We did find the in-universe explanation for the pop grammar quiz questions pretty suspect, but hey: if you already know the answer, and you're just willing to pay $400 to see if we do, that seems pretty fair to us.

I enjoyed playing it as a kid, and I still do as an adult, but even as a larval dweeb I was still more fascinated by how it worked under the hood. I certainly picked apart my share of loaders and educational software back then; I've kept one particular title in my collection as a fond memory not because it's particularly good (it's a so-so reading comprehension program) but because I ripped off its graphics library to use in my own stuff. The DesignWare titles, on the other hand, were a bit more opaque.

The disk directory shows two files, DWARF and BOOT2, which (with 664 blocks free, the amount on a freshly formatted 1541 floppy disk) seem to take up absolutely no space themselves. Even the filenames were bizarre and baffling. DWARF is just a BASIC program that loads and executes BOOT2 with a SYS 49152. It then sits there and loads for a long period before the screen blanks and the cheerful chirp starts regularly playing.

Now, we did have a backup copy of this back in the day and any nibbler should be able to copy it (we used Fast Hack'Em), so copy protection wasn't really the issue for 9-year-old me. I wanted to understand what it was doing and how it loaded from a disk that appeared to be completely empty. The only clue was a footnote in the manual:

MicroMotion FORTH-79 was an implementation of (wait for it) FORTH-79 initially for the Apple II. 9-year-old me had a very rudimentary awareness of Forth from the issue of CTW ENTER it appeared in, so I knew it was another programming language, but back then I didn't otherwise get the significance. Forth doesn't just provide an operating environment for programs; it is the operating environment, generally dealing in fixed-size numbered screens and blocks instead of variable-length named files.

As such, it performed direct disk access strictly by track and sector instead of using the Commodore Kernal's load routine, something most fast loaders wouldn't accelerate, or indeed using any filesystem at all. In fact, the only "files" on the disk are the BASIC bootstrap and loader, and these appear to take "no space" by actually occupying sectors normally reserved for the disk directory on track 18. After all, there are only two files, so most of the directory track isn't needed to store the directory. (There are actually three files, but one is a deleted file called MONITOR$8000 which has only one valid sector and the rest was apparently lost. The name and the ghost file's remnant first sector suggest a simple machine language monitor that was used as a debugger. Another orphaned sector contains a different BASIC loader that loads both along with old file entries for BOOT ALL, MACROS and FORTH.BRK, but none of this is connected to anything.) This allowed almost the entirety of the disk to be allocated to Forth code and data.

MicroMotion's mention in the manual was in fact a contractual requirement. Forth was selected as DesignWare's development language because it was probably the closest thing back then to a common language feature set available for the disparate home systems of the day. Since Forth has a standard memory layout for its vocabulary and object code, much of it could be reused directly on other 6502-based systems, adding custom words for platform-specific features like high-resolution graphics, sound, and the track-and-sector routine; on non-6502-based platforms, the Forth screens could simply be recompiled.

MicroMotion's implementation of blocks was also useful for paging because it kept track of what was already loaded. When the program requested a certain block range be brought into memory, the Forth runtime would only load what was needed, reducing disk access. This paging process is when the disk icon would light up with the cheerful chirp. You could even mark blocks as to be updated on disk, almost like a manually-driven virtual memory system, though I don't think any of the DesignWare titles used that feature and this disk is write-protected (no write notch).

Let's figure out how the loader works by ripping the disk to a D64 image. d64copy shows multiple errors on track 31. Notice that Commodore GCR disks have variable sector density between tracks, with the larger-circumference outer tracks holding more. On a standard 1541 formatted floppy there are 35 tracks in total yielding 683 total blocks of 256 bytes each. Although tracks up to 40 are possible and used by some custom formats, this disk doesn't seem to use them.

 1: *********************                  
 2: *********************                  
 3: *********************                  
 4: *********************                  
 5: *********************                  
 6: *********************                  
 7: *********************                  
 8: *********************                  
 9: *********************                  
10: *********************                  
11: *********************                  
12: *********************                  
13: *********************                  
14: *********************                  
15: *********************                  
16: *********************                  
17: *********************                  
18: *******************                    
19: *******************                    
20: *******************                    
21: *******************                    
22: *******************                    
23: *******************                    
24: *******************                    
25: ******************                     
26: ******************                     
27: ******************                     
28: ******************                     
29: ******************                     
30: ******************                     
31: --*-----*----*---        87%   601/683[Warning] read error: 1f/0e: 5
23,read error,00,00 (bad checksum)
31: --***---**---*?*-        88%   606/683[Warning] read error: 1f/0a: 5
23,read error,00,00 (bad checksum)
31: **********?***?*-        89%   614/683[Warning] read error: 1f/10: 3
20,read error,00,00 (no header)
31: **********?***?*?        90%   615/683[Warning] giving up...
31: **********?***?*?                      
32: *****************                      
33: *****************                      
34: *****************                      
35: *****************       100%   683/683
680 blocks copied.

Three of the sectors err out. I'll provide a spoiler now: the errors d64copy thinks it duplicated aren't actually what's going on with that track. Hang tight.
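If you want to check the geometry arithmetic yourself, here's a minimal Python sketch of the standard 35-track layout. The zone table is the stock 1541 one; `sectors_on_track` and `d64_offset` are names I made up for illustration, and the offset calculation assumes a plain 35-track D64 with no trailing error bytes.

```python
# Standard 35-track 1541 layout: (first_track, last_track, sectors_per_track)
ZONES = [(1, 17, 21), (18, 24, 19), (25, 30, 18), (31, 35, 17)]

def sectors_on_track(track):
    """Number of 256-byte sectors on a given track of a stock 1541 disk."""
    for first, last, count in ZONES:
        if first <= track <= last:
            return count
    raise ValueError(f"track {track} is outside the standard 35-track layout")

def d64_offset(track, sector):
    """Byte offset of a sector inside a plain D64 image (hypothetical helper)."""
    if not 0 <= sector < sectors_on_track(track):
        raise ValueError(f"track {track} has no sector {sector}")
    blocks_before = sum(sectors_on_track(t) for t in range(1, track))
    return (blocks_before + sector) * 256

total_blocks = sum(sectors_on_track(t) for t in range(1, 36))
blocks_free = total_blocks - sectors_on_track(18)  # directory track is reserved

print(total_blocks)            # 683
print(blocks_free)             # 664
print(sectors_on_track(31))    # 17, numbered 0 through 16
print(hex(d64_offset(31, 3)))  # 0x25900
```

The 664 figure matches the "blocks free" count on the directory, and track 31 sector 0 works out to block 598 of the image.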

We'll now start disassembling BOOT2, the machine-language portion of the bootstrap and the first-stage loader, with the VICE monitor. It's not actually zero blocks, of course, but it's small, loading from $c000 to $c14f. Execution starts at $c000.

.C:c000  20 28 C0    JSR $C028
.C:c003  20 41 C0    JSR $C041
.C:c006  20 59 C0    JSR $C059
.C:c009  A9 05       LDA #$05
.C:c00b  20 C3 FF    JSR $FFC3
.C:c00e  A9 0F       LDA #$0F
.C:c010  20 C3 FF    JSR $FFC3
.C:c013  4C 00 08    JMP $0800

In broad strokes, this seems simple enough: it calls a couple subroutines, then closes channels 5 and 15 (implying those subroutines it calls opened them for reading and commands), and jumps to what we presume is the main program at $0800. (If you look at the MicroMotion memory map, the Forth dictionary starts at that location.) Let's start with the first call to $c028.

.C:c028  A0 0E       LDY #$0E
.C:c02a  B9 29 C0    LDA $C029,Y
.C:c02d  59 28 C0    EOR $C028,Y
.C:c030  99 29 C0    STA $C029,Y
.C:c033  C8          INY
.C:c034  C0 B7       CPY #$B7
.C:c036  D0 22       BNE $C05A
.C:c038  D2          JAM

This code doesn't make any sense, which is immediately suspicious: the loop isn't closed and is terminated with one of the NMOS 6502 halt-and-catch-fire instructions. That's because this is the first of two places in which it decrypts itself by exclusive-ORing bytes, here each byte with the one immediately preceding it, beginning with location $c037 (the offset for the BNE instruction in the loop). In this case it turns what looks like a forward branch into nonsense code into an actual loop back to $c02a ($22 EORed with $d0 is $f2) to continue the decryption. We'll drop a breakpoint in VICE to hit immediately after the loop to see what code is revealed.

(C:$c053) break c038
BREAK: 2  C:$c038  (Stop on exec)
(C:$c053) x
#2 (Stop on  exec c038)  136/$088,  37/$25
.C:c038  20 E7 FF    JSR $FFE7      - A:88 X:00 Y:B7 SP:f4 ..-...ZC   11154817
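For those who'd rather not step through it in the monitor, the rolling exclusive-OR is easy to model. This is only a sketch in Python using the three bytes visible in the listing at $c036 through $c038; `rolling_xor_decrypt` and `branch_target` are my own illustrative names, not anything from the loader.

```python
def rolling_xor_decrypt(mem, first, last):
    """Decrypt mem[first..last] in place: each byte is XORed with the byte
    before it, which has already been decrypted by the time we use it."""
    for i in range(first, last + 1):
        mem[i] ^= mem[i - 1]
    return mem

def branch_target(operand_addr, offset):
    """Destination of a 6502 relative branch whose offset byte is at operand_addr."""
    signed = offset - 0x100 if offset & 0x80 else offset
    return (operand_addr + 1 + signed) & 0xFFFF

# Bytes at $c036-$c038 as loaded from disk: BNE opcode, its offset, the "JAM".
loaded = bytearray([0xD0, 0x22, 0xD2])
plain = rolling_xor_decrypt(bytearray(loaded), 1, 2)

print(plain.hex())                       # d0f220
print(hex(branch_target(0xC037, 0x22)))  # 0xc05a, the fake forward branch
print(hex(branch_target(0xC037, 0xF2)))  # 0xc02a, the real loop
```

Because each step depends on the previous decrypted byte, the very first pass through the loop fixes its own branch offset before the branch is ever executed.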

The jam "HCF" instruction at $c038 is now a call ($f2 EORed with $d2 is $20) to Kernal CLALL to close all open files. Let's continue disassembly with the first section of code decrypted. I've annotated the calls if you don't have the Kernal calls in your head like I do after all these years.

; close all
.C:c038  20 E7 FF    JSR $FFE7
; open 15,dev,15,""
.C:c03b  A9 00       LDA #$00
.C:c03d  20 BD FF    JSR $FFBD
.C:c040  A9 0F       LDA #$0F
.C:c042  A6 BA       LDX $BA
.C:c044  A8          TAY
.C:c045  20 BA FF    JSR $FFBA
.C:c048  20 C0 FF    JSR $FFC0
; open 3,dev,3,"#"
.C:c04b  A9 23       LDA #$23
.C:c04d  85 FB       STA $FB
.C:c04f  A9 01       LDA #$01
.C:c051  A2 FB       LDX #$FB
.C:c053  A0 00       LDY #$00
.C:c055  20 BD FF    JSR $FFBD
.C:c058  A9 03       LDA #$03
.C:c05a  A6 BA       LDX $BA
.C:c05c  A8          TAY
.C:c05d  20 BA FF    JSR $FFBA
.C:c060  20 C0 FF    JSR $FFC0
; copy routine at $c07a to $9000
.C:c063  A9 00       LDA #$00
.C:c065  85 FB       STA $FB
.C:c067  A9 90       LDA #$90
.C:c069  85 FC       STA $FC
.C:c06b  A0 66       LDY #$66
.C:c06d  B9 7A C0    LDA $C07A,Y
.C:c070  91 FB       STA ($FB),Y
.C:c072  88          DEY
.C:c073  C0 FF       CPY #$FF
.C:c075  D0 F6       BNE $C06D
; jmp $9000
.C:c077  6C FB 00    JMP ($00FB)

The new code opens the command channel to the disk drive (as channel 15) and then a buffer for loading direct sectors from disk (as channel 3). It then copies a routine from $c07a to $9000 and jumps to it. We drop another breakpoint in VICE and pick up from $9000, where it immediately calls a subroutine at $903f.

; $9000 (copied from $c07a)
.C:9000  20 3F 90    JSR $903F

; $903f
; clear channels
.C:903f  20 CC FF    JSR $FFCC
; send disk drive command: read sector 31 03
.C:9042  A2 0F       LDX #$0F
.C:9044  20 C9 FF    JSR $FFC9
.C:9047  A0 00       LDY #$00
.C:9049  B9 57 90    LDA $9057,Y
.C:904c  F0 06       BEQ $9054
.C:904e  20 D2 FF    JSR $FFD2
.C:9051  C8          INY
.C:9052  D0 F5       BNE $9049
.C:9054  4C CC FF    JMP $FFCC
(C:$9068) m 9057
>C:9057  55 31 3a 33  20 30 20 33  31 20 30 33  0d 00 88 98   U1:3 0 31 03....
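As a sanity check on that dump, here's a small Python sketch that rebuilds the command string. It assumes the usual CBM DOS block-read form (U1, then channel, drive, track and sector); `block_read_command` is a made-up helper name, and the spacing and zero-padded sector simply mimic what's sitting at $9057.

```python
def block_read_command(channel, drive, track, sector):
    """Build a CBM DOS U1 (block read) command with the same spacing as the
    string at $9057. The characters used are identical in ASCII and PETSCII."""
    return f"U1:{channel} {drive} {track} {sector:02d}\r".encode("ascii")

# The first 13 bytes of the monitor dump, before the $00 terminator.
dump = bytes.fromhex("55 31 3a 33 20 30 20 33 31 20 30 33 0d")

cmd = block_read_command(channel=3, drive=0, track=31, sector=3)
print(cmd)          # b'U1:3 0 31 03\r'
print(cmd == dump)  # True
```

The loop at $9049 stops on the $00 byte, so the trailing bytes in the dump aren't part of the command.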

This sets up a block read command to channel 3 from track 31, sector 3. Note that this wasn't one of the sectors d64copy complained was bad. It returns to $9003 with the terminal call to $ffcc and then does this strange thing:

; select channel 3 for input
.C:9003  A2 03       LDX #$03
.C:9005  20 C6 FF    JSR $FFC6
.C:9008  A0 00       LDY #$00
.C:900a  84 FB       STY $FB
; get a byte from the disk drive over the serial bus
.C:900c  20 A5 FF    JSR $FFA5
.C:900f  99 28 C0    STA $C028,Y
.C:9012  99 28 C0    STA $C028,Y
.C:9015  45 FB       EOR $FB
.C:9017  85 FB       STA $FB
.C:9019  C8          INY
; do it 256 times
.C:901a  D0 F0       BNE $900C
; ???
.C:901c  A5 FB       LDA $FB
.C:901e  C9 2E       CMP #$2E
.C:9020  F0 07       BEQ $9029
.C:9022  A9 59       LDA #$59
.C:9024  8D 0F 90    STA $900F
.C:9027  D0 D7       BNE $9000 ; branch always, since #$59 is non-zero

This is the second decryption routine; we can see an exclusive OR occurring, but not obviously to the memory it's storing in. There are also two stores (STA) to the exact same location. Why is that?

Again, this is self-modifying code. If we hand-execute through this routine, after loading a full 256 bytes from that sector, it uses the exclusive-ORed value in $fb as a kind of check. If the value doesn't match, it changes the first STA to EOR and runs the entire routine from the beginning, asking for that sector again.

This would seem even crazier: now it's apparently exclusive-ORing the same bytes it got before with the same bytes it's getting now. Indeed, when I let it continue execution at this point in VICE, it read that sector over and over and over and over in an infinite loop until I stopped it. It clearly works fine with the original disk, so what's in that sector?

The first time I pulled up the sector in the Epyx FastLoad disk editor, I got this.

It's a little odd. There was nothing in the high nybble at all. In terms of machine code, these bytes would be gibberish and couldn't possibly be executed directly. More to the point, I couldn't see how exclusive-ORing it would generate anything sensible since the high nybble would never get populated. I puzzled over that and went to bed.

In the morning I pulled it up again to ponder further. It had changed. On an original, write-protected disk. This wasn't the sector I looked at last night:

I reloaded it multiple times. The sector has a split personality: it randomly flip-flops between one or the other. In the other morph, you can see there's nothing but a high nybble, and now XORing those values would make more sense. In fact, if you hand-merge the two using the disassembled loader, you get valid machine code.

This is a critical portion of the routine. If you look way back at the beginning, after this first call we've been deobfuscating there's a second one to $c041 and then to $c059. Those calls hit gobbledygook by default, but the composite sector we load here overwrites that region. The loader thus loads a new piece of itself as its first task.

How do we get two views of one sector? We're only seeing part of the picture here by just looking at the data. The sector appears to be a normal CBM DOS sector, which has a header portion followed by the 256 data bytes of the sector we saw in the FastLoad editor, so it must be something about the header for that sector that's responsible. Let's get Maverick out (I own a physical copy of v5 on disk), the last word in Commodore disk copiers and copy-protection analysis tools, and look at the headers for track 31.

This is the Maverick GCR editor. GCR stands for Group Code Recording, the particular encoding method (at least Commodore's form of it) used by native Commodore floppy drives, and is the raw on-disk format. Using this tool we can see how the track is physically encoded (if you don't have Maverick, here's a BASIC program that will do the same general thing, just much more slowly). On the left is the raw disk GCR, and on the right is the interpreted bytes. The Maverick editor interleaves each sector header with its data, so the first two lines are sector 0, the next two are sector 1, and so on. We'll only need the interpreted bytes for our analysis as this disk isn't using a custom encoding.

In the right column we can see each header on alternating lines. The header (which is described incorrectly in the 1571 user guide; see something like Inside Commodore DOS, the essential text on Commodore disk drives, for much more) starts with a sync mark of 40 1-bits, which Maverick doesn't show, then a byte $08, then a checksum (the exclusive-OR of the two ID bytes, track and sector numbers), then the sector number, the track number, the ID bytes set when the disk was formatted, and a padding gap. For track 31 sector 3, we seem to have a normal-looking header; in particular, the sector and track bytes in the header are $03 $1f, for sector 3 and track 31 respectively.
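To make the header layout concrete, here's a minimal Python sketch of the decoded header bytes in the order just described. The ID bytes below are placeholders I picked (I'm not reproducing this disk's real ID), the ID2-before-ID1 order and the two trailing $0F filler bytes are my recollection of the standard 1541 layout, and `sector_header` and `header_is_valid` are hypothetical names.

```python
def sector_header(track, sector, id1, id2):
    """Decoded (non-GCR) header bytes for one sector, sync mark not included."""
    checksum = sector ^ track ^ id2 ^ id1
    # $08 marker, checksum, sector, track, ID2, ID1, then two filler bytes.
    return bytes([0x08, checksum, sector, track, id2, id1, 0x0F, 0x0F])

def header_is_valid(header):
    """True if the marker byte and the XOR checksum both look right."""
    marker, checksum, sector, track, id2, id1 = header[:6]
    return marker == 0x08 and checksum == sector ^ track ^ id2 ^ id1

ID1, ID2 = 0x44, 0x57  # placeholder ID bytes, NOT taken from the real disk

header = sector_header(track=31, sector=3, id1=ID1, id2=ID2)
print(header[2:4].hex())        # 031f
print(header_is_valid(header))  # True
```

Note that nothing in the header says where on the track the sector physically sits, which matters a great deal in a moment.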

The line I have highlighted on screen is sector 3's data. It has its own sync mark, then starts with a byte $07 and the 256 bytes of data. Usually this is the forward pointer to the next track and sector and 254 bytes after, but here the entirety is used. Compare the screenshot with what we got from the FastLoad disk editor; we're seeing the second view I got of the sector. No matter how many times I reloaded the disk from Maverick, all I saw for sector 3 was this.

The explanation comes when we try to go through the rest of the sectors.

Track 31 on a 1541 disk should have 17 sectors total (numbered 0 through 16) and on this screen I have the header highlighted for sector 15. So far so good. Notice the header shows $0f $1f for sector 15, track 31. But when we go two rows down to where the header for sector 16 should be, the header reads ... $03 $1f again. The data immediately below it is what we got from the first view of the sector. This disk has two sectors that call themselves "sector 3."

How could this possibly work? Recall that the Commodore 1541 and 1571 disk drives are largely "blind." More specifically, they generally only see what's under the read/write head, so they only know which track and sector they're on by what's in the header (the 1541-II doesn't even have an optical track zero sensor). It is non-deterministic which sector, i.e., the real sector 3 or the fake one that's actually a disguised sector 16, will come under the head first. As the drive's software assumes that the track and sector coordinate in the header is unique for every sector on the disk, it doesn't bother looking again once it sees something that matches, as under normal circumstances such a doppelgänger sector couldn't exist. The disk drive doesn't cache these reads, so the next time we ask, it will faithfully again retrieve the first "sector 3" it sees, which may or may not be the other one.
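A toy model makes the behaviour easier to see. This Python sketch is a big simplification under my own assumptions: it pretends the sectors sit on the track in numerical order, ignores timing and retries, and treats the drive as simply scanning forward from wherever the head happens to be. `find_sector` is an illustrative name, not a DOS routine.

```python
SECTORS_ON_TRACK_31 = 17

# Sector number claimed by the header at each physical position on track 31.
# Position 16 ought to say 16, but on this disk its header says 3.
claimed = list(range(SECTORS_ON_TRACK_31))
claimed[16] = 3

def find_sector(claimed, wanted, head_position):
    """Return the physical position of the first header claiming `wanted`,
    scanning forward from head_position. None stands in for error 20."""
    n = len(claimed)
    for step in range(n):
        pos = (head_position + step) % n
        if claimed[pos] == wanted:
            return pos
    return None

hits = {find_sector(claimed, 3, start) for start in range(SECTORS_ON_TRACK_31)}
print(sorted(hits))                  # [3, 16]
print(find_sector(claimed, 16, 0))   # None
```

Which physical sector answers to "3" depends only on the starting position, and nothing ever answers to "16".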

(Parenthetically, this means "logical" sectors don't need to be in physical order as long as the drive can see them and they are unique. You could even have a disk where the physical sectors are logically numbered differently, e.g., sector headers being reordered to reflect an ideal interleave. We'll have more to say about interleave later, but such disks and programs to generate them exist, such as Datamost KwikLoad.)

Maverick isn't fooled by this because it reads and interprets the raw GCR of the entire track at once under the assumption the headers could be untrustworthy (and they are). But a regular sector-by-sector read will indeed, as we saw in the FastLoad editor, randomly get one or the other. This routine must therefore be very flexible to compensate: whatever view of the sector it sees first is copied ("double STA") to that block of memory, but this is incomplete and the check byte will be wrong, so now the next time it goes through it will read again, exclusive-OR and then store. It has to get both views one after the other, which may take several tries, so it makes no assumptions otherwise. Once the check byte matches, it knows it succeeded in getting both sectors.

What happened with our disk copy in the emulator, then? Refer back to the three errors we saw when we imaged it. The drive got confused by the double sector, which explains the spurious "bad checksum" (error 23) for some of the other sectors, but also correctly noted that it couldn't access sector 16 at all (error 20) because nothing is tagged as sector 16. For sector 3, however, the only thing written to disk is one or the other of whichever "sector 3" the disk copy routine hit first; the other form of that sector will never be referenced because the disk copy routine already "got" that sector. On the resulting D64 image the stored sector 3 will thus be invariant, so the routine will run forever in the emulator because it will only ever receive the same data. (A G64 "nibbled" image would reproduce this accurately because it deals in raw GCR, but this is slower despite being a more complete storage format.)

After the load is this mildly naughty little section:

; clear channels, close 3, close all
.C:9029  20 CC FF    JSR $FFCC
.C:902c  A9 03       LDA #$03
.C:902e  20 C3 FF    JSR $FFC3
.C:9031  20 E7 FF    JSR $FFE7
; store a $4c at $900f, making it jmp $c028
.C:9034  A9 4C       LDA #$4C
.C:9036  2C 0F 90    BIT $900F
.C:9039  8D 0F 90    STA $900F
.C:903c  10 D1       BPL $900F
.C:903e  02          JAM

It clears the channels and closes everything, and then does an almost insulting little trick by storing a $4c (JMP) into the STA $C028,Y opcode, making it JMP $C028. The BIT instruction here will always clear the N flag, so the BPL branch also becomes a "branch always" and jumps into the new code at $c028 we would have just loaded.

I took the two sectors and merged them manually with a quick Perl script to get the correct bytes, and then disassembled them in dxa.

; open 15,8,15,"i0"
lc028   lda #$03
        ldy #$c0
        ldx #$3e
        jsr $ffbd
        lda #$0f
        ldx #$08
        ldy #$0f
        jsr $ffba
        jsr $ffc0
        rts
c03e: "i0" $0d

This just calls a "soft init" in the disk drive and returns ... all the way back to $c003, which (look back) is JSR $C041. That's also in our newly loaded code.

; open 5,8,5,"#"
lc041   lda #$02
        ldx #$57
        ldy #$c0
        jsr $ffbd
        lda #$05
        ldx #$08
        ldy #$05
        jsr $ffba
        jsr $ffc0
        rts
c057: "#" $0d

This opens up a new sector buffer as channel 5, and returns back to $c006, which is JSR $C059. No, I don't understand why these don't come one after the other either except as another attempt at obfuscation.

Finally, this routine is the actual meat. I've annotated the calls it makes below and we'll look briefly at those smaller subroutines.

lc059   lda #$08
        sta $fc
        lda #$00
        sta $fb
        jsr lc098       ; send load sector command
        jsr lc0e3       ; read sector into $0800
        lda $082b       ; get number of sectors to load and ...
        sec
        sbc #$08        ; ... subtract eight
        sta $c018
        inc $fc         ; next page
        inc $c017       ; next sector
lc075   jsr lc098       ; send load sector command
        jsr lc0e3       ; read sector into $0900 and so on
        inc $fc         ; next page
        inc $c017       ; next sector
        dec $c018       
        beq lc097       ; return if loaded all sectors
lc085   lda $c017
        cmp #$15        
        bne lc075
lc08c   inc $c016       ; next track (tracks 1-17 hold 21 sectors each)
        lda #$00
        sta $c017
        jmp lc075
lc097   rts

This routine is loading sectors one after the other into memory starting with location $0800. We already know this is our target execution address, so it's very likely this is the payload we want to extract.

The routine at $c098 sends another block read command, but this time to the new channel 5 we opened. That command is stored in memory at $c019, as below:

(C:$c17e) m c000
>C:c000  20 28 c0 20  41 c0 20 59  c0 a9 05 20  c3 ff a9 0f    (. A. Y... ....
>C:c010  20 c3 ff 4c  00 08 01 00  00 55 31 3a  30 35 2c 30    ..L.....U1:05,0
>C:c020  30 2c 00 00  2c 00 00 0d  a0 0e b9 29  c0 59 28 c0   0,..,......).Y(.

Since the U1 command expects an ASCII track and sector number, this routine will convert it, and then send the string. This is a family of several related routines. The load from track 31 sector 3 was only 256 bytes, so it only covers some of the code (you'll see the break below where the execution path switches back to what was already loaded as part of BOOT2).

; turn track-sector at $c016 $c017 into ASCII for the U1 command
lc098   lda $c016
        jsr lc126       ; converts
        stx $c022
        sta $c023
        lda $c017
        jsr lc126
        stx $c025
        sta $c026
        jsr lc0b7
        jsr lc0c2
        rts
lc0b7   lda #$0f        ; sets up command string pointer
        ldx #$19
        stx $fd
        ldx #$c0
        stx $fe
        rts
; send command to disk drive
lc0c2   pha
        ldx #$0f
        jsr $ffc9
        bcc $c0cd
lc0ca   jsr $c13c       ; appears to be a debugging stub if an error
        ldy #$00
        pla
        tax
lc0d1   lda ($fd),y
        jsr $ffd2
        bcc lc0db
lc0d8   jsr $c13c       ; same
lc0db   iny
        dex
        bne lc0d1
lc0df   jsr $ffcc
        rts

[...]
; do conversion to ASCII numbers
lc126   ldx #$00
;; bring back BOOT2
.C:c128  C9 0A       CMP #$0A
.C:c12a  90 07       BCC $C133
.C:c12c  E8          INX
.C:c12d  38          SEC
.C:c12e  E9 0A       SBC #$0A
.C:c130  4C 28 C1    JMP $C128
.C:c133  48          PHA
.C:c134  8A          TXA
.C:c135  09 30       ORA #$30
.C:c137  AA          TAX
.C:c138  68          PLA
.C:c139  09 30       ORA #$30
.C:c13b  60          RTS

; debugging stub (calls 6502 software interrupt)
.C:c13c  00          BRK
.C:c13d  60          RTS
.C:c13e  00          BRK
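
For readers who don't want to trace the 6502, the conversion and command-building routines above can be sketched in Python roughly like this. The function names are mine, and unlike the original (which patches the digits into a fixed 15-byte template at $c019) this sketch builds the whole string each time.

```python
def two_ascii_digits(value: int) -> bytes:
    """Mimic the $c126 routine: repeated subtraction of ten, then OR with $30."""
    if not 0 <= value <= 99:
        raise ValueError("only two decimal digits are supported")
    tens = 0
    while value >= 10:      # CMP #$0A / BCC exits once the value is below ten
        tens += 1           # INX
        value -= 10         # SEC / SBC #$0A
    return bytes([tens | 0x30, value | 0x30])   # ORA #$30 on both digits


def u1_command(track: int, sector: int, channel: int = 5, drive: int = 0) -> bytes:
    """Build a 'U1:channel,drive,track,sector' block-read command with a CR."""
    fields = [two_ascii_digits(n) for n in (channel, drive, track, sector)]
    return b"U1:" + b",".join(fields) + b"\r"


print(u1_command(1, 0))
```

The resulting command is 15 bytes long, which matches the $0f length loaded at $c0b7.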

From the routine at $c059 we know that it is incrementing $c017 for the sector, and when that reaches 21 (the CMP #$15), it increments $c016 and resets the sector to zero (tracks 1-17 contain 21 sectors each, numbered 0 through 20). The default values for those locations are (in memory order) 1 and 0, meaning it's loading from track 1 sector 0, the very first sector on the disk. The load routine is very simple.

; read sector
lc0e3   ldx #$05
        jsr $ffc6    ; set input channel
        bcc $c0ed
lc0ea   jsr $c13c    ; bomb on error
        ldy #$00
lc0ef   jsr $ffcf    ; read byte
        bcc lc0f7
lc0f4   jsr $c13c    ; bomb on error
lc0f7   sta ($fb),y
        iny
        bne lc0ef    ; loop 256 times
lc0fc   jsr $ffcc    ; clear channels
        rts

That first sector looks like this:

00000000  ea 4c d3 08 ea 4c db 08  31 2e 31 32 43 4f 4d 4d  |.L...L..1.12COMM|
00000010  4f 44 4f 52 45 20 36 34  73 6c 08 00 90 ab 80 00  |ODORE 64sl......|
00000020  ff 01 00 01 06 00 88 a3  88 6c 88 6c c4 41 2c 2a  |.........l.l.A,*|
00000030  46 1d 26 1d ea 6b 68 26  b4 1c 09 1d f5 1b 17 6c  |F.&..kh&.......l|
00000040  fc 1f c7 1c 8c d1 96 1c  0b ac 9a ce 39 29 8f 5b  |............9).[|
00000050  db 1c d6 29 9e 0f 28 09  cb 0d c0 11 37 12 0c 00  |...)..(.....7...|
00000060  87 09 18 03 cb 0d 28 09  b4 0d 2c 09 87 09 18 03  |......(...,.....|
00000070  b4 0d 87 09 18 08 cb 0d  87 09 8c 16 4a 10 b4 0d  |............J...|
00000080  8a 16 68 28 79 0d f5 1d  be 0c ac 0d b4 0d be 0c  |..h(y...........|
00000090  a2 0d b4 0d 33 27 5f 24  9e 0f c7 1c 7b 1c 25 17  |....3'_$....{.%.|
000000a0  b4 0d 70 1c 1b 17 b4 0d  87 09 19 00 87 09 00 00  |..p.............|
000000b0  2a 13 87 09 00 00 87 09  00 d4 81 0f ac 0f e0 0d  |*...............|
000000c0  76 12 f0 ff b4 1c 8d 0a  7d 13 84 43 4f 4c c4 00  |v.......}..COL..|
000000d0  00 d3 08 a9 08 a2 54 a0  3b d0 06 a9 08 a2 98 a0  |......T.;.......|
000000e0  0d d8 86 88 85 89 20 5b  17 a9 1b 8d 11 d0 a9 c8  |...... [........|
000000f0  8d 16 d0 a9 17 8d 18 d0  a9 00 8d 20 d0 8d 21 d0  |........... ..!.|

When loaded into memory, the byte $6c at offset $2b will be in location $082b. That value minus eight (note the subtraction) yields a count of $64, or 100, additional sectors to be loaded, for a total payload size of 25856 bytes. (Note that since the total number of sectors is not fixed but rather specified by the payload, I suspect that this or a similar loader was used for other DesignWare titles where the second-stage loader length varied.)
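
As a sanity check on that arithmetic, here is a small Python sketch that derives the sector count from the byte at $082b and walks the same track/sector sequence the loop at $c059 does. The names are my own, and it assumes every track involved has 21 sectors, which holds for tracks 1-17.

```python
SECTORS_PER_TRACK = 21  # true for tracks 1-17, the only ones this loader touches


def payload_sectors(count_byte: int, start_track: int = 1, start_sector: int = 0):
    """Return the (track, sector) pairs read for a given count byte."""
    total = (count_byte - 8) + 1    # the first sector plus the remaining count
    track, sector = start_track, start_sector
    walk = []
    for _ in range(total):
        walk.append((track, sector))
        sector += 1
        if sector == SECTORS_PER_TRACK:   # CMP #$15, then step to the next track
            track += 1
            sector = 0
    return walk


walk = payload_sectors(0x6C)
print(len(walk), "sectors,", len(walk) * 256, "bytes, last read at", walk[-1])
```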

101 sectors is less than 5 tracks' worth if we start at track 1, as the first 17 tracks all have 21 sectors, which makes the routine straightforward. What it isn't is efficient, for two reasons. First, it has an effective sector interleave of one, while the default interleave on a 1541 is ten; i.e., it hopscotches ten sectors between those it loads, expecting at the drive spindle's typical rotational speed that the distant sector will be most likely already under the drive head ready to go on the next read. (This value is stored in 1541 RAM at location $0069 [SECINC], but an exhaustive disassembly of the 1541 ROM shows the only place the interleave differs is on the directory track 18, where the interleave is 3.) The optimal interleave of a 1541 disk is sometimes debated and will vary depending on the loader and other factors, but it is certainly not one, as the spindle will likely have rotated the disk away from the immediately following sector by the time the disk drive is ready to read it.

But — as we'll see — a far bigger contributor to load time is that this routine loads bytes from each sector one by one. Since that's an IEC bus transaction for each byte, it generates several times more IEC bus traffic than a regular file load using "bulk transfer" mode (i.e., the IEC "data" command byte but with a special reserved channel of 0) that sends the entire contents of a file to the computer in one huge transaction.

Finally, this sector loader routine terminates and we end up back at $c009, which closes channels 5 and 15, and jumps to $0800.

While this is a terrible way with a real 1541 to store the payload, it makes extracting it from the D64 (now that we know where it is) stupidly easy. It starts at track 1 sector 0, which is the very beginning of the D64, and everything is stored sector after sector and track after track in ascending numerical order, so we just take the first 25856 bytes of the image. That's the entirety of the executable it's loading and where we find the second-stage loader.
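
If you'd rather not eyeball offsets, the following Python sketch computes where any track and sector lives in a standard 35-track D64 and slices out the payload. The function names are hypothetical; the sectors-per-track zones are the usual 1541 layout (21, 19, 18 and 17 sectors).

```python
def sectors_in_track(track: int) -> int:
    """Sectors per track for a standard 35-track 1541 disk."""
    if 1 <= track <= 17:
        return 21
    if 18 <= track <= 24:
        return 19
    if 25 <= track <= 30:
        return 18
    if 31 <= track <= 35:
        return 17
    raise ValueError(f"track {track} is outside 1-35")


def d64_offset(track: int, sector: int) -> int:
    """Byte offset of a track/sector pair in a plain D64 image."""
    if not 0 <= sector < sectors_in_track(track):
        raise ValueError(f"sector {sector} does not exist on track {track}")
    preceding = sum(sectors_in_track(t) for t in range(1, track))
    return (preceding + sector) * 256


PAYLOAD_BYTES = 101 * 256   # 25856, from the count byte at $082b


def extract_payload(image: bytes) -> bytes:
    """Slice the contiguous payload starting at track 1 sector 0."""
    start = d64_offset(1, 0)
    return image[start:start + PAYLOAD_BYTES]


print(d64_offset(1, 0), d64_offset(5, 16) + 256)
```

The second number printed is the end of the last payload sector (track 5 sector 16), which is another way of confirming the payload really is just the front of the image.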

As we scroll through it in a hex editor, we start seeing text words.

000001e0  7f 09 a3 16 06 82 52 b0  dc 09 a3 16 08 83 54 49  |......R.......TI|
000001f0  c2 e5 09 a3 16 0a 85 57  49 44 54 c8 ed 09 a3 16  |.......WIDT.....|
00000200  0c 86 4d 45 4d 54 4f d0  f6 09 a3 16 0e 85 46 45  |..MEMTO.......FE|
00000210  4e 43 c5 01 0a a3 16 10  82 44 d0 0d 0a a3 16 12  |NC.......D......|
00000220  88 56 4f 43 2d 4c 49 4e  cb 18 0a a3 16 14 86 27  |.VOC-LIN.......'|
00000230  2d 46 49 4e c4 20 0a a3  16 16 85 2d 46 49 4e c4  |-FIN. .....-FIN.|
00000240  2e 0a c5 16 16 85 27 3f  4b 45 d9 3a 0a a3 16 18  |......'?KE.:....|
00000250  84 3f 4b 45 d9 45 0a c5  16 18 8a 27 3f 54 45 52  |.?KE.E.....'?TER|
00000260  4d 49 4e 41 cc 50 0a a3  16 1a 89 3f 54 45 52 4d  |MINA.P.....?TERM|
00000270  49 4e 41 cc 5a 0a c5 16  1a 86 27 41 42 4f 52 d4  |INA.Z.....'ABOR.|
00000280  6a 0a a3 16 1c 85 41 42  4f 52 d4 79 0a c5 16 1c  |j.....ABOR.y....|
00000290  86 27 42 4c 4f 43 cb 85  0a a3 16 1e 85 42 4c 4f  |.'BLOC.......BLO|
000002a0  43 cb 90 0a c5 16 1e 83  27 43 d2 9c 0a a3 16 20  |C.......'C..... |
000002b0  82 43 d2 a7 0a c5 16 20  84 27 43 56 c8 b0 0a a3  |.C..... .'CV....|
000002c0  16 22 83 43 56 c8 b8 0a  c5 16 22 85 27 45 4d 49  |.".CV.....".'EMI|
000002d0  d4 c2 0a a3 16 24 84 45  4d 49 d4 cb 0a c5 16 24  |.....$.EMI.....$|
000002e0  86 27 45 52 52 4f d2 d6  0a a3 16 26 85 45 52 52  |.'ERRO.....&.ERR|
000002f0  4f d2 e0 0a c5 16 26 87  27 45 58 50 45 43 d4 ec  |O.....&.'EXPEC..|
00000300  0a a3 16 28 86 45 58 50  45 43 d4 f7 0a c5 16 28  |...(.EXPEC.....(|
00000310  85 27 48 4f 4d c5 04 0b  a3 16 2a 84 48 4f 4d c5  |.'HOM.....*.HOM.|
00000320  10 0b c5 16 2a 8a 27 49  4e 54 45 52 50 52 45 d4  |....*.'INTERPRE.|
00000330  1b 0b a3 16 2c 89 49 4e  54 45 52 50 52 45 d4 25  |....,.INTERPRE.%|
00000340  0b c5 16 2c 84 27 4b 45  d9 35 0b a3 16 2e 83 4b  |...,.'KE.5.....K|
00000350  45 d9 44 0b c5 16 2e 85  27 4c 4f 41 c4 4e 0b a3  |E.D.....'LOA.N..|
00000360  16 30 84 4c 4f 41 c4 57  0b c5 16 30 88 27 4d 45  |.0.LOA.W...0.'ME|
00000370  53 53 41 47 c5 62 0b a3  16 32 87 4d 45 53 53 41  |SSAG.b...2.MESSA|
00000380  47 c5 6c 0b c5 16 32 87  27 4e 55 4d 42 45 d2 7a  |G.l...2.'NUMBE.z|
00000390  0b a3 16 34 86 4e 55 4d  42 45 d2 87 0b c5 16 34  |...4.NUMBE.....4|

Forth programmers will recognize some of them and the general format. Each word begins with a length byte with its high bit set and ends with the high bit of its last character also set, each acting as a delimiter (the name field). Each word also has a backlink to the previous word (the link field), a pointer to what should be executed (code field), and any trailing data (parameter field). This builds a linked list of words in memory.
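
Here is a rough Python sketch of pulling one header apart, using the TIB and WIDTH entries from the dump above (file offsets $1ed through $200). The function name is mine, and masking the length byte with $1f is an assumption borrowed from the common fig-Forth layout; I haven't confirmed MicroMotion uses the other bits the same way.

```python
def parse_header(data: bytes, offset: int):
    """Return (name, link, code_field, parameter_field_offset) for one word."""
    length_byte = data[offset]
    if not length_byte & 0x80:
        raise ValueError("name field should start with a high-bit-set length byte")
    count = length_byte & 0x1F              # assumption: fig-Forth style 5-bit length
    raw = data[offset + 1:offset + 1 + count]
    if count == 0 or not raw[-1] & 0x80:
        raise ValueError("last character of the name should have its high bit set")
    name = bytes(b & 0x7F for b in raw).decode("ascii")
    pos = offset + 1 + count
    link = int.from_bytes(data[pos:pos + 2], "little")        # link field
    code = int.from_bytes(data[pos + 2:pos + 4], "little")    # code field
    return name, link, code, pos + 4


# Bytes copied from the hex dump above.
sample = bytes.fromhex(
    "83 54 49 c2 e5 09 a3 16 0a "
    "85 57 49 44 54 c8 ed 09 a3 16 0c"
)

for start in (0, 9):
    name, link, code, params = parse_header(sample, start)
    print(f"{name}: link=${link:04x} code=${code:04x}")
```

Note that WIDTH's link of $09ed is file offset $1ed plus the $0800 load address, i.e., the name field of TIB, the word right before it.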

Executable Forth words are typically threaded calls, where the pointer in the code field points to a Forth runtime routine to step through each call in the parameter field (MicroMotion calls this routine DOCOL), but can also be raw 6502 machine language (a so-called "code" word), where the execution pointer points into the parameter field itself and runs it directly. The very first word in this program, not shown here, is named COLD and is a code word. In most Forth implementations of the period, COLD and WARM would serve as the routines that initialize the Forth runtime, though as there is no need to warm-start the runtime here, it doesn't seem to be part of this dictionary.

To prepare a Forth dictionary for use as a commercial product, the MicroMotion manual required you to specify what word would run on startup (using TURNKEY, which other Forths also implemented) and what word would run for errors (ONERR), and to disable the formation of new words with DESTRUCT. The dictionary was then saved and would run all the words previously compiled but not be able to form new ones. It's difficult to tell exactly how many words are in the program total, especially since some are likely paged in from other portions of the disk, and some sectors may well be orphans containing old data that's never used. At least for this initial loader, however, we're probably talking three figures at least, which would have been a sizeable number of symbols for an 8-bit computer.

It also can't be determined whether this program was built directly on the Commodore 64, or on some other system (like the Apple II) with Commodore-specific code added later. However, we do know that it's MicroMotion-generated as implementation-specific words like COPY-BLOCKS and COPY-DISK are present. There are also obviously Commodore-specific words like CALL-KERNAL, plus code words for major Kernal routines:

0000e420  91 23 42 4c 4f 43 cb 97  6c a2 15 0b 00 85 43 48  |.#BLOC..l.....CH|
0000e430  4b 49 ce a4 6c a2 15 c6  ff 86 43 48 4b 4f 55 d4  |KI..l.....CHKOU.|
0000e440  b1 6c a2 15 c9 ff 85 43  48 52 49 ce bd 6c a2 15  |.l.....CHRI..l..|
0000e450  cf ff 86 43 48 52 4f 55  d4 ca 6c a2 15 d2 ff 85  |...CHROU..l.....|
0000e460  43 4c 4f 53 c5 d6 6c a2  15 c3 ff 86 53 45 54 4c  |CLOS..l.....SETL|
0000e470  46 d3 e3 6c a2 15 ba ff  86 53 45 54 4e 41 cd ef  |F..l.....SETNA..|
0000e480  6c a2 15 bd ff 84 4f 50  45 ce fc 6c a2 15 c0 ff  |l.....OPE..l....|
0000e490  86 43 4c 52 43 48 ce 09  6d a2 15 cc ff 85 43 4d  |.CLRCH..m.....CM|

These call a routine living at $15a2 that bridges the Forth-raw ML impedance mismatch with the address of the Kernal routine in question in the parameter field. Other words handle the sound output, including the very amusing SHUTUP.

In other sectors we can see the game data and grammar tasks, though unlike most Commodore programs, the human-facing text portions are rendered in true ASCII and not Commodore-native PETSCII. That's certainly the strongest indication that the C64 version may not have been (entirely) developed on the C64 itself. There is also an on-disk copyright message, which seems to have been placed entirely for miscreant grown-up children rooting through the code like me, and is 7-bit ASCII. It doesn't seem to be referenced anywhere in the program.

0002a940  57 41 52 4e 49 4e 47 3a  20 54 48 49 53 20 50 52  |WARNING: THIS PR|
0002a950  4f 47 52 41 4d 20 41 4e  44 20 49 54 53 20 44 41  |OGRAM AND ITS DA|
0002a960  54 41 20 43 4f 4e 54 41  49 4e 20 50 52 4f 50 52  |TA CONTAIN PROPR|
0002a970  49 45 54 41 52 59 20 20  20 20 20 20 20 20 20 20  |IETARY          |
0002a980  49 4e 46 4f 52 4d 41 54  49 4f 4e 20 41 4e 44 20  |INFORMATION AND |
0002a990  54 52 41 44 45 20 53 45  43 52 45 54 53 20 4f 46  |TRADE SECRETS OF|
0002a9a0  20 44 45 53 49 47 4e 57  41 52 45 2c 49 4e 43 20  | DESIGNWARE,INC |
0002a9b0  41 4e 44 20 01 01 01 01  01 01 01 01 01 01 01 01  |AND ............|
0002a9c0  41 52 45 20 4e 4f 54 20  54 4f 20 42 45 20 43 4f  |ARE NOT TO BE CO|
0002a9d0  50 49 45 44 20 46 4f 52  20 41 4e 59 20 50 55 52  |PIED FOR ANY PUR|
0002a9e0  50 4f 53 45 20 57 49 54  48 4f 55 54 20 57 52 49  |POSE WITHOUT WRI|
0002a9f0  54 54 45 4e 20 01 01 01  01 01 01 01 01 01 01 01  |TTEN ...........|
0002aa00  50 45 52 4d 49 53 53 49  4f 4e 2e 20 74 48 45 59  |PERMISSION. tHEY|
0002aa10  20 41 52 45 20 50 52 4f  54 45 43 54 45 44 20 55  | ARE PROTECTED U|
0002aa20  4e 44 45 52 20 53 54 41  54 45 20 41 4e 44 20 46  |NDER STATE AND F|
0002aa30  45 44 45 52 41 4c 20 4c  41 57 2e 20 20 20 20 20  |EDERAL LAW.     |
0002aa40  01 01 01 01 01 01 01 01  01 01 01 01 01 01 01 01  |................|

So noted. With the payload thus extracted, an easy first cut is to compress it down with pucrunch with an execution address of $0800 (if the first 25856 bytes of the D64 are saved as gramload.prg and we add starting address bytes of 00 08 at the beginning, then pucrunch -c64 +f -ffast -x2048 gramload.prg gramload will do the job). This yields a binary which is about 30% smaller, LOADs and RUNs like a BASIC program, and decompresses in a few seconds. For the emulator, this is quite simple to work with: I put the crunched Grammar Examiner loader in my prg folder that VICE treats as a very fast virtual disk, and then run it with the sector dumped D64 in drive 8 to play the game. This is also great for something like a 1541-Ultimate where the loader can be DMAed straight into memory.
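
Adding the starting address is a two-byte job, sketched here in Python. The function name is hypothetical; the only real content is that a PRG file begins with its load address in little-endian order, and that $0800 is the 2048 passed to pucrunch's -x option.

```python
import struct


def make_prg(payload: bytes, load_address: int = 0x0800) -> bytes:
    """Prefix a raw memory image with a little-endian PRG load address."""
    if not 0 <= load_address <= 0xFFFF:
        raise ValueError("load address must fit in 16 bits")
    return struct.pack("<H", load_address) + payload


# Stand-in payload; in practice this would be the first 25856 bytes of the D64.
prg = make_prg(bytes(25856))
print(prg[:2].hex(" "), len(prg))
```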

For a real C64/128 with just a regular disk drive, though, we'll want to get this back on the original disk; it would be clumsy to juggle floppies. Fortunately, since the crunched binary is 30% smaller, it will easily fit in the sectors the original payload was loaded from even with the overhead of adding next sector links to each sector, and load even faster due to the reduced size. As such, the Kernal effectively becomes our first-stage loader instead of using DWARF. After loading and running it, the program self-decompresses to memory, and then jumps right into the second-stage loader.

This simple Perl script will take our compressed payload on standard input and emit sectors back to back with the proper linkages on standard output.

#!/usr/bin/perl

use bytes;
read(STDIN, $buf, 65536);

# make it an even multiple of 254 bytes for ease
$buf .= "\0" x (254 - (length($buf) % 254)) if (length($buf) % 254);
print STDERR "splicing @{[ length($buf) ]} bytes/@{[ length($buf)/254 ]} sectors\n";

$t = 1;
$s = 0;
$l = length($buf) - 254;

for($i=0; $i<length($buf); $i+=254) {
        $s++;
        if ($s == 21) { $t++; $s = 0; }
        if ($i == $l) {
                print STDERR "last sector\n";
                print STDOUT pack("H*", "00ff");
        } else {
                print STDOUT chr($t).chr($s);
        }
        print STDOUT substr($buf, $i, 254);
}

print STDERR "finished at $t $s\n";

The result is a lake of sectors 18432 bytes long. We take the D64 sector dump of the disk and add on the remaining 156416 bytes starting at offset 18432 for a total length of 174848 bytes, the nominal size. Finally, we'll jump into the D64's directory track with a hex editor and add a link to our new file at track 1 sector 0. I "deleted" the original booter files, but kept them for posterity along with the other ghosts. Since we're talking about DWARFs (I also have a cloudy memory that one of our DesignWare disks had a reference to FIDDLE), a Firesign Theatre reference seems appropriate.
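
The splice itself could be sketched in Python like so. The names are hypothetical, and it only handles a plain 174848-byte D64 without appended error bytes; the directory edit is still a manual job in the hex editor.

```python
D64_SIZE = 174848   # 683 sectors of 256 bytes, no error information appended


def splice(lake: bytes, original: bytes) -> bytes:
    """Overlay the new sector lake on the front of the original sector dump."""
    if len(original) != D64_SIZE:
        raise ValueError("expected a plain 35-track D64 image")
    if len(lake) % 256 or len(lake) > D64_SIZE:
        raise ValueError("lake must be a whole number of sectors that fits")
    return lake + original[len(lake):]


# Stand-in data: a lake of $01 bytes over an all-zero image.
lake = bytes([0x01]) * 18432
image = splice(lake, bytes(D64_SIZE))
print(len(image), len(image) - len(lake))
```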

I wrote this out to a 5.25" floppy and did a few tests on the 128DCR. Because the loading time from the start of the second-stage loader (when the screen goes black) to the title screen is all raw track-and-sector access and therefore constant, any time savings will be observed in bringing up the second-stage loader. These times were done by me with a stopwatch. They necessarily include the time spent in memory decompression since that's technically part of the load (approximately a fixed 5-second penalty). I tested with and without Epyx FastLoad, which is my fastload cartridge of choice due to its ubiquity and compatibility.

1:24 original disk (regardless of fastloader)
0:52 1 interleave no FastLoad
0:23 1 interleave FastLoad

This is already a big savings even on a stock 64 with no fastloader at all, but we already know we're using a non-standard interleave, so can we do better?

The answer is, "it's complicated." Below is a hacky little Perl script that emits entire tracks with an adjustable interleave, defaulting to 10.

#!/usr/bin/perl -s

use bytes;
read(STDIN, $buf, 65536);

# interleave
$interleave ||= 10;

# make it an even multiple of 254 bytes for ease
$buf .= "\0" x (254 - (length($buf) % 254)) if (length($buf) % 254);
# ass-U-me all tracks are 21 sectors long
$scount = length($buf)/254;
$tcount = int(($scount/21)+0.999999);
print STDERR "splicing @{[ length($buf) ]} bytes/$scount sectors in $tcount tracks interleave $interleave\n";

# paranoia if we get longer
$maxtrax = 5;

$foff = 0;
$end = length($buf) - 254;
$ss = 0;
$ns = 0;
for($t=1;$t<=$tcount;$t++) {
        @tbam = ();
        $nt = $t;

        # handle last track differently than the others
        if ($t < $maxtrax) {
                for ($s=0; $s<21; $s++) {
                        # predict next sector. this is 'close enough'
                        # to the 1541's algorithm
                        if ($s == 20) {
                                # out of sectors, step track
                                # 1541 algorithm keeps the same sector #
                                $nt++;
                        } else {
                                $ns = ($ss + $interleave) % 21;
                                while (length($tbam[$ns])) {
                                        $ns++;
                                        $ns = $ns % 21;
                                }
                        }
#                       print STDERR "$nt $ss -> $ns\n";
                        if ($foff >= length($buf)) {
                                $tbam[$ss] = chr(0) x 256;
                        } elsif ($foff >= $end) {
                                $tbam[$ss] = chr(0) . chr(255) 
                                        .substr($buf, $foff, 254);
                        } else {
                                $tbam[$ss] = chr($nt)
                                        .chr($ns)
                                        .substr($buf, $foff, 254);
                        }
                        $ss = $ns;
                        $foff += 254;
                }
                # fill in empty sector holes
                for ($s=0; $s<21; $s++) {
                        $tbam[$s] = chr(0) x 256 if (!length($tbam[$s]));
                }
                $track = join('', @tbam);
                die("assert: track $t is wrong size @{[ length($track) ]}\n")
                        if (length($track) != 21*256);
                print STDOUT $track;
                @tbam = ();
        } else {
                # for the remainder, emit sequentially
                while($foff < length($buf)) {
                        print STDOUT chr($t)
                                .chr($ss++)
                                .substr($buf, $foff, 254);
                        $foff += 254;
                }
        }
}

The separate handling for track 5 was paranoia in case my calculations were off and it ended up being longer than four tracks, but fortunately this isn't the case (I left the code in for future expansion). For veracity I checked its results against a real floppy with files written with the 1541's own ROM routines and the sector interleave output at the default value of 10 seemed the same. We'll then fill in the remainder with the original sector dump to get a proper D64 and write that to a floppy.
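
To see what order the sectors end up chained in on a single 21-sector track, here is a small Python sketch in the spirit of the Perl script above: step ahead by the interleave, and if that sector is already taken, slide forward to the next free one. It is a simplification of my own (one track only, no track stepping), not the 1541 ROM's exact algorithm.

```python
def track_order(interleave: int, sectors: int = 21, start: int = 0):
    """Return the order in which sectors are chained on one track."""
    used = [False] * sectors
    order = []
    current = start
    for _ in range(sectors):
        used[current] = True
        order.append(current)
        if len(order) == sectors:
            break
        nxt = (current + interleave) % sectors
        while used[nxt]:                 # collision: take the next free sector
            nxt = (nxt + 1) % sectors
        current = nxt
    return order


for value in (1, 6, 10):
    print(value, track_order(value))
```

With an interleave of 10 there are no collisions at all on a 21-sector track, since 10 and 21 share no common factor; with 6, the chain wraps onto used sectors and has to slide forward twice.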

Does the interleave make a difference? Here's the complete table including our three prior entries:

1:24 original disk (regardless of fastloader)
0:52 12 interleave no FastLoad
0:52 10 interleave no FastLoad
0:52 6 interleave no FastLoad
0:52 1 interleave no FastLoad
0:25 6 interleave FastLoad
0:23 1 interleave FastLoad
0:17 12 interleave FastLoad
0:15 10 interleave FastLoad

First off, even with a stock Kernal/1541 and no fastloader, executing a LOAD instead of a byte-by-byte, sector-by-sector read is way faster. If we remove the 5-second fixed decompressor penalty and multiply the remaining 47 seconds by 25856/18432 to compensate for the fewer sectors loaded, we get 66 seconds to load the same amount of payload, which is still 18 seconds and 21% faster than the original's 84. 52 seconds is thus already a big improvement, and even more so when a fastloader (any fastloader) is involved.
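
The normalization is easy to fumble, so here it is spelled out as a short Python sketch. The variable names are mine; the numbers are the stopwatch times and byte counts quoted in this post.

```python
original_load = 84        # 1:24 for the original disk, in seconds
stock_load = 52           # crunched file, stock Kernal/1541, no FastLoad
decompress = 5            # approximate fixed decompression penalty
original_bytes = 25856    # uncompressed payload
crunched_bytes = 18432    # crunched sector lake

transfer = stock_load - decompress                      # time spent loading
scaled = transfer * original_bytes / crunched_bytes     # same bytes as original
saved = original_load - round(scaled)
fraction = saved / original_load

print(f"{transfer}s scaled to {scaled:.1f}s; saves {saved}s ({fraction:.0%})")
```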

But we expected that. What was unexpected was that interleave didn't seem to make any difference at all to the stock Kernal/1541 ROM loader. It should be noted that we're doing this on track 1 and going inward instead of track 17 and going outward as you'd optimally do, and this is a 1571 instead of a 1541, but the 1571's compatibility is generally excellent and functions indistinguishably from a 1541, and you'd think that there would be some load time difference between interleaves if it really were a factor. My conclusion here is that the time spent waiting for the next sector is much less in comparison to the time spent actually transferring data, and if that's true, then it probably made virtually no difference to the original disk's loader after all. I didn't test every single value but it seems very unlikely it would change with anything else.

Where interleave does make a difference is when Epyx FastLoad is active. There are interleaves that are clearly worse and clearly better, and again I didn't test every single value, but FastLoad seems calibrated to achieve best performance on the default interleave of 10. And that makes sense, really, because it had to show improvement on existing software that likely would have been written to disk with that interleave. Notably, on an interleave of 10 and removing the five-second decompressor penalty, FastLoad reads in the same second-stage loader in 10 seconds that the stock Kernal does in 47 seconds, very close to the 5x speedup Epyx's marketing material claimed.

This is already pretty fast. You could maybe shave a few more seconds by using a different loader (WarpSpeed or Super Snapshot should easily beat it), or, since we're already working with track 1, you could make it into a 128 mode autobooter. On the other hand, as the loading time gets smaller, the fixed decompression time (and we're already using pucrunch's fast mode on purpose) starts to dominate, and a 128 mode autobooter would incur the additional penalty of switching to 64 mode. Since I like Scott Nelson's black beauty, I'm perfectly happy with 15 seconds to start. From the 135 seconds total time to the title screen on the original, we're now down to 65 on actual hardware, over twice as fast. And on an emulator it flies.

The last frontier is to ensure the entire title is playable, because the road to crack heaven is littered with hal-fassed attempts that didn't account for the game abruptly doing a protection check later on. Fortunately, my wife insisted on playing it all the way through and beating poor old Mel.

Don't look at the screen if you don't want any spoilers.

In the end, this article proves The Grammar Examiner was far more educational for far longer than I (or, probably, its designers) ever thought possible. And hey: look at the quality of the prose in this blog. Even my wife couldn't find any errors, and she's had a lot of practice.

2 comments:

  1. This looks so nasty, and I mean in a good way. Stepping through this brought back too many memories of late night hacks and tons of forgotten 6502 assembly.

    The most insidious protection I saw was code that loaded directly into the 1541 RAM and decrypted sectors from there. The code I ran into changed the interleave and I think the rotational speed to get its data loaded.

    I didn't have much in the way of direct hacking tools for the C64, except for one cheater cartridge. I think it was called Icepik. It had a button on the cartridge that you pressed after your game completely loaded (after passing all of its copy protection steps) and was at its title screen or some other idle loop. It would save a full snapshot of the C64 RAM space to another disk and create a small BASIC loader with a SYS statement to jump back into the assembly. It didn't work on all games but worked enough for me to keep it around. It sounds like it could have worked on your app. Of course, that would have defeated the purpose of learning all the neat hacks this code offered to you.

    1. Yup, I've got an Isepic here myself. It would probably have worked just fine at the title screen since it didn't load any special code into drive memory, but as you say, the crack was more fun (and elegant :-).
