Saturday, November 12, 2022

Meet your new two-factor authenticator: your Commodore 64

Multi-factor authentication is ripe for disruption. SMS 2FA is inherently defective. Phone authenticators get stolen. Security tokens get lost.

But just try misplacing a Commodore SX-64. And any thief who tries to grab it and run gets a free hernia truss from the prison infirmary:

Plus, I've got a colour for every key!
And it actually works:
The terminal window is showing a generated time-based one-time password for a full key, and the emulated 64 is showing the correct key, at the correct time, which was known and tested to be valid. Yes, you really can use your Commodore 64 for multi-factor authentication to generate TOTP codes!

Keys can be entered manually in hexadecimal or loaded as binary files from disk (you specify the file, offset and length), and it can either use the real-time clock in CMD FD and HD devices or with devices implementing a compatible T-RA command or you can just manually enter the time for demonstration.

Some of you are asking already if this idea is totally nuts or just mostly. But consider: the C64 has a very small attack surface and it can be made completely airgapped. Keys can be entered manually, or stored as binary files which you have to know the file, offset and length to correctly use (unless you make the entire file the key). Heck, you have to even know what disk (or cassette tape?) it's on. Plus, anything fun is always a satisfactory justification!

Additionally, this project poses some special technological challenges on a constrained system with an 8-bit ALU that doesn't have hardware multiply or divide, and this blog is all about conquering technological challenges on constrained hardware. Here's what we have to write to make this work, direct from RFC 6238:

  • A SHA-1 hasher
  • An HMAC generator using that hasher (TOTP is HMAC-SHA-1 a la RFC 2104, which is still considered secure despite SHA-1's issues as a hash algorithm by itself because HMAC is substantially less affected by the risk of collision in the underlying hash)
  • Conversion of a real-time clock source into Unix time, adjusting for local timezone
  • Extraction of a target value from the resulting HMAC, especially its last six digits (basically computing mod 1000000)
And we're gonna do all that on an MOS 6510 running at just over 1MHz! (Or, on PAL systems, even less!)

First, before we talk about it, try it out yourself in VICE or on a real Commodore 64 (or 128 in 64 mode) to convince yourself it works. A note about keys: keys should be provided in hexadecimal or as binary data, not base 40. If you have a base 40 key, you can convert it to hex with this short Perl program (pass your base 40 key on standard input; some test vectors are in the __DATA__ section, but don't forget that last blank line if you want to use them):


@digots = qw(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 2 3 4 5 6 7);
%digits = map { $_, $i++ } @digots;
while(<>) {
        exit if (!length || length($_) & 1);

        if (length($_) & 2) {
                # 2 base 32 characters equals 2.5 base 16 characters,
                # which we null-pad with A's. this works enough for totp.
                $_ .= "AA";

        $_ = uc($_);
        # 4 base 32 characters equal 5 base 16 characters.
        foreach $group (unpack("(A4)*", $_)) {
                $sum = 0;
                foreach $char (split('', $group)) {
                        exit if (!defined($digits{$char}));
                        $sum <<= 5;
                        $sum |= $digits{$char};
                printf("%05x", $sum & 0xfffff);
        print "\n";


Continuing on with your key:

  • Optionally, create a key on a disk (the key must be binary data, not ASCII hexadecimal: a key of 0102 should consist of a file containing bytes 0x01 and 0x02). You can pad it with data because you can provide the program with an offset and length of the actual key.

    Parenthetically, I suppose you could put the key on a cassette tape and read it from a Datasette. I haven't tried this but I have tried very hard to stay out of areas the Kernal tape routines use, so hopefully it should work. You would need to modify the program to not check for the file's existence and query the error channel though.

  • Optionally, if you're using VICE, you can emulate a CMD FD or HD to provide the real-time clock. VICE does not provide this ROM by default. Once you have the appropriate ROM (I emulate an FD-2000), before loading TOTP-C64 create a blank disk image for it and make sure the image is loaded into the emulated drive, then reset the drive once. If you don't do this little dance when you fire up VICE, the RTC may not respond correctly. VICE syncs the clock to the host system time for you.
  • Download it from the Releases tab of the Github project. It's a single PRG you LOAD and RUN like a BASIC program — in fact, the menu actually is a BASIC program — so do that (e.g., LOAD"TOTP",8 and RUN, substituting filename and device). There will be a brief pause while it decompresses and unpacks its components.
  • If you created an on-disk key, press L to load it, insert your disk or disk image if not already done, then specify the filename, device, filetype (anything but a RELative file), offset (can be zero) and length (if zero, loads the whole rest of the file or 64 bytes of it, whichever is less). If you didn't, press E to enter one, and type the hexadecimal key (lowercase, no spaces, maximum 128 characters). This key will not be saved to disk. You can just enter a couple random bytes if all you want to do is play with the program.
  • After the key is loaded, you can press F1 if you messed up and want to redo, or press C to get time data from a CMD or compatible device that implements T-RA, or E to enter the time manually.
    • If you pressed C, it will ask you to enter the device to load the time from, which doesn't have to be the device you loaded your key from (if any). The menu will try to query it and display the response. If the response is garbage and you're trying to emulate a CMD drive in VICE, make sure an image is mounted and reset the drive once more, and try again (but if you get a syntax error from the device, then it doesn't support this feature and you should try something else). Once a valid response is received, enter your time zone relative to UTC, first hours (possibly negative), and then minutes at the prompt after. For example, Pacific standard time (PST) is -0800, so enter -8 hours and 0 minutes; Australian eastern daylight time (AEDT) is +1100, so enter 11 hours and 0 minutes; Australian central standard time (ACST) is +930, so enter 9 hours and then 30 minutes.
    • If you pressed E, it will ask you for your timezone first (enter it the same way as in the part above), then the current month, day, and year (only 2000 and later are supported; 2 digit years will have 2000 added), then the current time as hours (in 24 hour notation: midnight = 0, 8am = 8, noon = 12, 8pm = 20), minutes and seconds. Enter the time about 10 or 15 seconds ahead. The program will then pause for a key. When the clock matches the time you've entered, press any key other than F1, or press F1 if you did it wrong to re-enter the time. You could just enter a random time here if all you want to do is play with the program.
  • The TOTP display will appear with the current 6 digit code in the centre of the screen and a progress bar below it. This bar fills with a different colour every 30 second interval, after which a new code is generated. The display continues until you press F1 to stop. This will halt the display and run the program from the beginning starting with acquiring a key.
The TOTP display depends on the Time-of-Day clock in CIA #1. This is different from the Timer A jiffy clock accessible in BASIC as TI and TI$; the CIA TOD clock is a real clock that keeps time in BCD and uses either the 60Hz or 50Hz frequency from your wall mains. We use this clock because it is unaffected by interrupts. If the progress bar fails to advance, your CIA #1 is broken, and/or the 9V AC line from your power supply is defective, and/or the 2.7V Zener diode and/or glue chip (on breadbin 64s, a 74LS08 or similar, on 128s, a 74LS14) connected to the TOD pin of the CIAs have blown. If everything else with your system functions normally, then it's likely the latter two components since relatively few programs use the TOD clock but do use other CIA features. On a breadbox 64, check U27 and CR1. On a 128 or 128D(CR), check U16 and CR8; on the 128D(CR), CR8 is inconveniently under the power supply at the edge of the logic board, so you'll have to unscrew it from its standoff and the rear and side of the case to move it away. (I mention this because during testing it turned out my "daily driver" 128DCR has a faulty TOD clock. Fixing this will be a future blog entry. Fortunately both VICE and the SX-64 worked fine.)

The source code is also on Github (released under the BSD 3-clause license). There are two main parts to TOTP-C64, the BASIC section that provides the menu to read and accept user settings (which provided the prompts you interacted with), and then the main assembly language portion which actually does the heavy computations and maintains the code display. This is divided into three modules, the SHA-1 hasher, the time computer, and the actual main loop that calls those modules for the time and computes the HMAC, displays the digits and maintains the clock until you stop it.

The code display is very simple, just a bar of reversed spaces with six big sprites for the digits and a padlock sprite for decoration. The 16x32 sprites are created by vertically stretching the number character glyphs from the built-in character ROM to double height (yielding an 8x16 digit) and then having the VIC-II's hardware double-size each digit sprite again in both dimensions. The colour that fills the progress bar is computed off the least-significant byte of the clock counter value. Those are all pretty straightforward.

The rest of it is where the meat is. When you're writing code for a constrained platform that requires heavyweight computation, what's at least as important as using good algorithms is finding good shortcuts. You simply want to do the minimum amount of work possible. Sometimes this is done by merely remembering old work steps; sometimes this requires putting steps out of order; sometimes this is done by changing an algorithm entirely. Particularly for the last two a compiler will only get you so far, which is why I wrote the critical sections entirely in 6502 assembly language.

The first part of the code that I wrote was the SHA-1 hash implementation, since without that (or without being able to do it quickly) there was no point in writing anything else. Perhaps unsurprisingly, there's some prior art here: a cursory Google search turned up several 6502-based cryptographic hash projects, such as this Commodore 64 SHA-256 implementation, also in 100% assembly, another one in assembly, an actual Commodore 64 SHA-256 file checksum also presumably in assembly, and an Atari SHA-1 file checksum written in C, which strikes me as light treason. I should say officially for the record that I didn't look at the source code for any of these projects and didn't use any of their work in writing this one.

Fortunately, SHA-1 only requires adds, subtracts, bitwise logical operations and bitwise rotations and shifts. The two big tricks here are 1) everything is 32-bit, and 2) a single error in the implementation, even just one bad bit, can completely munge the hash because of the (desirable) avalanche effect. You don't want to be able to work the hash backwards because that means it wouldn't be opaque — but also that means it's going to be harder to debug it, too.

To help with this I wrote a Perl model emulating the eight bit operations a 6502 would have to do to compute SHA-1 and later an HMAC, and pervasively instrumented it so I could see each computational step. The code wasn't a 1:1 correspondence with what I expected to write in the assembler but it was enough to make the translation to assembly language mostly direct.

You can see the trace of the five SHA-1 h variables (h0 through h4) on both the 64's screen and in the terminal session. Each step through is printed so I could identify almost the exact spot the assembly version diverged from the Perl model. Conveniently, displaying a 160-bit hash on the 64's screen is 40 hexadecimal digits, exactly the width of a screen line.

There's no great magic to a 32-bit (or 16, 24, 48, 64, etc.) operation on an 8-bit CPU; you just do it one byte at a time. I wrote up preprocessor macros for this which generate the boilerplate code for me in the xa65 cross assembler. For adds and subtracts, the byte-by-byte math needs to be sequential from the LSB to the MSB so that carry is propagated through. In particular you need to load and store each byte in order, so these tend to be the slowest such operations in this module. This is also true for shifts and rotates since the moved-out bit also ends up as carry, but the 6502 has some memory-memory forms of these instructions which avoid an explicit load/store (another reason the 6502 isn't truly RISC, or at least not in the way RISC is currently conceived of), which can speed a few operations up and don't require a trip through the accumulator.

However, for bitwise logicals like AND, OR, XOR, etc., each byte of the 32-bit result can be handled independently, and there is no carry. This means that an AND followed by an XOR, for example, can be part of the same chain: we load from memory, do the AND, do the XOR, and write it back, and do this four times total for each byte. This saves loading and storing between each logical operation and we use this a lot in the main loop (here's an example).

There are also some additional shortcuts to reduce CPU overhead further:

  • The first part of SHA-1 requires extending the sixteen 32-bit words (512 bits) to be hashed into eighty 32-bit words by exclusive-ORing four indexes with constant offsets into the original 16 words. We know where everything is in memory, so we can simply have the assembler precalculate the necessary effective addresses and maintain individual pointers in zero page for them. All the code has to do is load those words, exclusive-OR them all together (and since this is a string of bitwise logical operations, we only have to load, chain-XOR, rotate and store each byte once for four words, using the Y register as the index within each 32-bit word), and then advance the pointers by four each time through. Through some parsimonious programming we can keep the loop count in the X register the entire time.
  • A rotate or shift by 8 is by definition equivalent with rotating or shifting by an entire byte, so we can just load and store whole bytes offset by 1 instead of doing eight labourious single-bit moves. We exploit this for a left rotate of 5 by turning it into a left rotate of 8 (very fast) followed by a right rotate of 3 (slightly slower than left rotates because we have to recover the carry bit at the end, but we're only doing three).
  • On a 32-bit quantity, a left rotate of 30 can alternatively be a right rotate of 2.
  • Because we located h0 through h4 contiguously in memory, the full hash (from the view of an eight-bit CPU, at least) is automagically a 160-bit quantity because their bytes are sequential. We don't need to manually concatenate anything.

I certainly don't claim this to be the fastest you can do SHA-1 on the 6502, but I think it's pretty performant, considering.

The next necessary part was computing the time. The message for HMAC-SHA-1 is a big-endian counter value (being derived from HOTP, TOTP's count-based predecessor) computed from the current Unix time in seconds minus the Unix epoch (zero) divided by a time interval, which is usually 30 seconds. We'll need the remainder, too, because this tells us how long to wait initially before generating the second code. The RFC requires that any compliant algorithm handle at least the Y2038 problem, which most modern systems have dealt with by using a full 64-bit time_t. Already this is bad, because we're now in the position of dividing at least 32 bits and probably more by 30 (eventually I adopted a 64-bit time value to be consistent with the RFC; see below). Division by anything that's not a power of two on a system without hardware integer divide is a huge pain!

But wait, it gets worse. The Commodore 64 has no built-in time source, and no real-time clock presently available for it gives you Unix time: they all give you local time in a human-parseable format by seconds, minutes, hours, days, months and year. This is convenient for almost all of the things people would want a real-time clock for and incredibly inconvenient for us — effectively we're being forced to implement something like timegm() from scratch. Look at what Perl's Time::Local has to do to just compute days past the Unix epoch from a date: fiddle with the month and year, multiply the fiddled year count by 365 to get the number of days, account for leap years (a divide by 100 and a divide by 400!) and then add in the month — multiplied by 306, adding 5 and dividing by 10 — and the current date. To get from that count of days to seconds requires multiplying by 86,400. At that point "merely" multiplying by 3600 for hours to seconds and 60 for minutes to seconds (and adding it all together) seems trivial by comparison. Oh, and don't forget that when we're done we'll have to divide that whole honking number by 30 to get the current counter value, and we need to do all this within a second or so in real-time or the clock will be behind!

The CMD time-string template, depending on exact implementation, is more or less DOW. MM/DD/YY HH:MM:SS XM using a 12-hour time (e.g., SAT. 11/12/22 07:57:40 PM). There are variants to give us packed BCD and byte quantities but this PETSCII string form is apparently the most widely supported. The CMD drive (at least as emulated in VICE) actually generates a three-digit year for 2000 and later (11/12/122), though that doesn't seem to be part of the spec, so we handle both ways.

Immediately some potential shortcuts become apparent. Subtracting 48 away from the PETSCII values of the digits, we're effectively getting the date from the CMD device in decimal (technically unpacked BCD). In fact, we can exploit this by trivially converting the manually entered time and date into unpacked BCD as well and using the same code to process it. That means dividing by 10 and 100 is just a matter of extracting the right digits; in particular, since we expand the year out to a full four digits, dividing the year by 100 just means taking the top two of them. (Combining a two-byte decimal number into a single byte quantity is fast: multiply the tens byte by 10, i.e., shift it twice to multiply by 4, add the original value to make it times 5, shift again to make it times 10, and add the ones. We have a macro for this, naturally.) To divide by 400, we just divide the "divide by 100" quantity by 4, which is two bit shifts. As a practical matter this means we can't handle years past 9999, but this meets the requirements of the RFC, and I apologise to any far-future users of this program in advance. Finally, the code to fiddle with the month must only account for 12 months, which practically screams at us to use a lookup table (so is dividing that resulting figure by 10, so that's another lookup table, and so is int( ( ( $month * 306 ) + 5 ) / 10 ) )).

Two multiplys we can't avoid, but we can turn them into shifts and adds. To turn the year into a 16-bit quantity to multiply against it, we have to multiply the upper two digits by 100, which decomposes into the sum of multiplying them by 4, multiplying them by 32 and multiplying them by 64, which equal 100. We add that to the lower digits to get the year as a 16-bit short. That, in turn, must be multiplied by 365, which decomposes into the sum of times 256, times 64, times 32, times 8, times 4 and "times 1." We already have a fast path for multiplying by 256: that's a left shift of 8, so we just shift the bytes up by one byte and add the other shifts to it. We'll be summing all of this into the 64-bit time value, so we have macros for adding variously sized quantities. (The first way I did this as a check was adding to itself 256 times, and then adding to itself 109 times — 256 plus 109 equals 365. Similarly, I did the multiply by 100 as a 100-add loop as a verifier. I've left this code in for comparison but it's obviously much slower.)

After rebasing the total number of days by subtracting the equivalent figure for the Unix epoch (which we store as a constant), we now need to make it into seconds, which confronts us with computing the sum of days times 86,400, hours times 3,600, minutes times 60 and seconds, and then dividing that sum by 30 to generate the TOTP counter and noting the remainder. But things just got brighter in a hurry, because the improvement we can make here is massive and massively obvious: we can immediately divide all the constants by 30 up front, because they're themselves integer multiples of 30, and thus eliminate the division entirely. That means we multiply days by 2,880, a much more manageable number, hours times 120 (which can be a 24-item lookup table) and minutes by 2 (now trivially a single left shift). The remainder-30 can be computed easily off the seconds all by itself: if greater or equal than 30, take away 30 and add one to the total, and the residual is the remainder; otherwise, the seconds itself is the remainder (think for a second why this must be true). But wait, there's more! x2880 decomposes into the sum of x2048, x512, x256 and x64. Hey, there's a x256, so we can just do a byte shift! In addition, we know that the seconds divided by 30 (1 or 0) and minutes times two (0-118) summed are less than 256, so they will fit into a single byte. We thus start off the times-2880 with the x256 shifting the 64-bit quantity by one byte, but making the lowest byte the seconds plus minutes instead of just zero, doing the add and first part of the full multiply all together. We then add the sums of the other shifts and add hours from a times-120 lookup table.

That leaves adjusting for UTC from localtime (Unix time is always UTC), which is now straightforward: multiply the hours difference from UTC with the same times-120 table and the minutes by 2, and either add or subtract them from the total. We now have Unix time divided by 30 ready to serve as a counter, plus the remainder so that we can synchronise correctly on 30-second intervals. Whew!

Now for the HMAC and that mod 1000000. Our SHA-1 module proceeds in "chunks" of 512 bits/64 bytes; it expects whatever calls it to have the buffer set up correctly. An HMAC is computed by first computing the hash of the 64-byte key XOR-ed with a repeated inner padding "ipad" byte (0x36) concatenated with the big-endian 64-bit counter value, a grand total of 576 bits or 72 bytes. SHA-1 requires the payload to be terminated with a single one bit and then padded out with zero bits for a bit length congruent to −64 ≡ 448 (mod 512), followed by a 64-bit big-endian length of the entire message in bits. The 160-bit result of this first hash is then concatenated to the end of the 64-bit key XOR-ed this time with an outer padding "opad" byte (0x5c) for a total of 672 bits or 84 bytes, and hashed a final time for the result.

In practical terms, we thus run four chunks through the SHA-1 module, starting with resetting the hash variables to their initial state. The first one is the full 64-byte "ipad" (which we precompute). We don't reset the hash variables prior to the second chunk, which is the 64-bit counter concatenated with 0x80 (the terminating bit), followed by 53 nulls and 0x2 0x40 (576 bits). We run this chunk through, copy off the hash to use in the fourth chunk, and reset the hash variables for the new hash. The third chunk is the full 64-byte "opad" (which we also precompute); after processing it we again don't reset the hash variables and continue to the fourth chunk, which is the 160-bit/20-byte hash from the second chunk concatenated with 0x80, 41 nulls, and 0x2 0xa0 (672 bits). The hash value after the fourth chunk is the HMAC result.

The last step is to turn the HMAC into the 6-digit TOTP code, which is described in RFC 4226, the RFC for HOTP. A 32-bit slice is taken out of the 160-bit result based on the third byte of h4 ANDed with 15; this is the index to the first of the four bytes making up the 32-bit value. The value then has its most significant bit masked off, leaving a 31-bit value. We can do this easily from the contiguous hash variables in memory. The code is that 31-bit value mod 10digits; most implementations use 6, so we have to compute that value modulo 106, or 1000000.

The first way I did this was to figure out all of the decimal digits by labouriously dividing by powers of 10 (basically brute-force repeated subtraction) and then taking the lowest six. The maximum value of an unsigned 31-bit value is representable in 10 digits, so we start with the billions place and work our way down. As the exponent gets smaller we don't need to check all 32 bits, so smaller quantities get faster. The macros for this are somewhat tortured by the need for labels for each divide step.

Performance wasn't abysmal but while conceptually simple I didn't find my initial approach to be particularly elegant. A better algorithm turned up, however: leverage the 6502's decimal mode to turn the value into BCD, and take the lower six digits of that. Credit goes to Andrew Jacobs for writing this lookup-free version of Garth Wilson's original (a variation on the double-dabble or "add three" method), which I expanded from 16 bits into 32 and flipped around to keep everything big-endian. It works by clocking out bits and adding them (via carry) to a value that doubles each time through the loop, effectively the BCD version of double-dabble, completing after a constant 32 loops with just one add per iteration for every two digits. Much quicker! (The old repeated-subtraction version is left in for comparison. Note that interrupts have to be off to have decimal mode set on the Commodore 64 because the Kernal Timer A IRQ doesn't properly handle this flag.) We then take the lowest six digits (three bytes) of that and unpack them. I printed each digit and compared it to a real TOTP generator, and it matched!

At this point all that was left was the user interface, which is the fun part. I wrote the menu program first, which is straightforward BASIC, using my own tools to tokenize the source and link it in memory into a single runnable file with the assembled machine code. I also added the padlock sprite for decoration and wrote a little tool to turn that into .byt pseudo-ops or DATA statements (eventually I settled on the former), and wrote the code for the big stretch number glyphs. To ensure the clock was as accurate as possible and undisturbed by changes to interrupt status (remember we have to turn them off to convert the 31-bit hash extract to BCD), I used the CIA Time-of-Day clock as an interval timer, which needs adjustment for 50Hz or 60Hz power. The TOD clock itself tracks time in BCD, so to load the initial value we convert the remainder-30 we calculated way back when into BCD too (using the 8-bit version of Andrew's algorithm, since the remainder will never be greater than 29). The TOD clock then keeps time independently. As the seconds change we advance the little colourful progress bar, computed off the LSB of the counter. When the TOD clock's seconds count reaches 30 (i.e., 48, the decimal version of 30 in BCD), we reset it to zero, increment the clock counter, compute a new 6-digit code and start a new colour in the progress bar, looping repeatedly until we're stopped by the user. And that's how it all works.

Although I'm not accepting major refactors or tuneups or ports to other systems, I do intend to add more support for additional real-time clock devices and will consider pull requests related to that. (The one I'm hoping to do is for the RTC in the Ultimate-II+, but mine is still on backorder.) Otherwise, please only report verifiably incorrect behaviour; I consider this project largely feature complete. Feel free to fork it and port it to your favourite system. If you want to build the source yourself, you'll need:

  • the xa65 cross assembler, which is highly portable and runs on many systems
  • Perl 5.005 or better to run the tokenizer, linker and sprite builder scripts (included in tools/, you don't need to install them)
  • pucrunch to make a runnable compressed version of the resulting linked executable (reduces the size by about half). This is technically optional: if you don't have pucrunch, the build will err out at this point, leaving a totp.arc file in the current directory as the product from the linking step. That file is itself a runnable program and fully useable, just not compressed.
Adjust OBJ in the Makefile to your desired target if necessary (it's set to ../prgs/totp for my convenience with VICE) and then make to build all the pieces, link and crunch them. If you alter the assembly code, you may need to change the offsets in the BASIC portion or the menu will crash trying to initialize keys and/or run the TOTP display loop. These addresses are displayed as part of the build output and comments in the BASIC portion will tell you what needs to match up.

You can also use the time and SHA-1 modules in your own projects by building them as objects and linking them, or simply #includeing them in your assembly source. They have copious comments to explain their function and how to call them, and other than the pre-processor macros the meaty parts are encapsulated with xa's .( .) pseudo-ops to keep the labels from clashing with your own code's. Only "exported" labels are outwardly visible.

Now, if you'll excuse me, I need to log in and do a few things. Just gotta wait for the 1541 to finish loading so I can enter the 2FA code. See you in a bit.

No comments:

Post a Comment

Comments are subject to moderation. Be nice.