Saturday, June 11, 2022

prior-art-dept.: ProleText, encoding HTML before Markdown (and a modern reimplementation)

Steven P. Spackman allegedly once observed that "flat text is just never what you want." Which, I guess, is true: half the historical advances in computing have come from figuring out ways to tart up plain text, whether embedding control codes or out-of-band styling or in-band markup. However, with the exception of out-of-band styling (I always liked the Macintosh text file formats that kept the text in the data fork and the styling in a resource), you still needed to parse the file or at best you'd get blocks of text separated by gobbledygook. Enhanced text formats like Markdown were thus designed to make cognitive sense to human eyes without further parsing, while still encoding sufficient metadata to facilitate improved ways of rendering the document.

Markdown circa 2004 has displaced most of the others today, but it explicitly never claimed to be the first such human-readable format; indeed, AsciiDoc predates it by about two years, reStructuredText a year before that and MakeDoc about a year before that. For that matter, some of the concepts popularized in Markdown might not have existed at all were it not for earlier ancestors like 2002's Textile.

But a forgotten rich text language predates most of these, with the interesting property that much of the markup is encoded using trailing whitespace, almost a fusion of in-band and out-of-band styling systems. If the whitespace is munged, it's largely just a text document (like those particular Mac files if you pass along only the data fork); but if it passes through intact, an intelligent converter can use attributes encoded in the whitespace to style it into HTML. That system is ProleText.

Brad Templeton was one of the early names in microcomputers, starting out as Personal Software's employee #1 while still a teenager. Peter R. Jennings, who wrote Microchess for the Commodore KIM-1, founded the company with Dan Fylstra to publish it. In 1979 they made a deal with outside company Software Arts to publish their new program for the Apple II called VisiCalc, which is now known as the first electronic spreadsheet package.

Templeton was experienced with the Commodore PET and one of his initial jobs was the VisiCalc port to that platform. Once he had done the job, Personal Software, which in 1982 changed its name to VisiCorp, assigned him to work on a graphics companion package for a new and secret upcoming machine (that we now know as the IBM PC) while he was an undergraduate at the University of Waterloo in Canada. He developed the software in C remotely over a Tymnet X.25 packet-switched link to an East Coast minicomputer that also happened to have Usenet newsgroup access via uucp and Arpanet. He was hooked, and successfully lobbied to get the University a Usenet feed with generous connectivity paid for by Digital Equipment Corporation. It was the first international link to Usenet outside of the United States.

Templeton left VisiCorp in 1983 and founded a new software company, but didn't forget the experience. His online activities started to take up substantial portions of his time, particularly his work as moderator for rec.humor.funny, and he began looking at ways to potentially earn from it without arousing the ire of NSFNet, who still administered the nascent Internet in those days and banned commercial use of the backbone.

In 1988 The Source was dying out partially because of the expense of its content contracts, but CompuServe was flourishing, not least because it was cheaper and had lower costs. Part of those lower costs came from outsourcing a large part of its news and content acquisition to a company called Comtex, which started in 1980 electronically publishing scientific research and expanded into financial and general mass-market news. Templeton contracted with them and others as well, converting their feeds to articles that could be consumed over Usenet but within a private newsgroup feed for which he planned to sell subscriptions. He reasoned, and NSFNet agreed, that academic institutions could get his feed over the Internet because they would use it for academic purposes; thus, as long as he got academic institutions up first, they could then propagate his premium newsgroups to commercial subscribers using uucp and other non-Internet means where NSFNet had no involvement (and he would give them and other such feeders a discount as incentive). Eventually he was able to cut out the middleman to contract with content providers like United Press International directly and added other syndicated content. The new ClariNet delivered its first set of articles to Stanford University in 1989, which in turn propagated throughout Silicon Valley, and Brad tells the rest of that story himself. (I met Brad at a Vintage Computer Festival one year, where, instead of merely infrared-beaming his contact to my Palm m505, he sent me a "You Have Been Hacked" card which also served as his contact. Very funny.)

ProleText (circa 1995-6 or thereabouts) came about as a way to visually enhance ClariNet's premium articles, but in a fashion that wouldn't look any different (or at least much different) to text consumers. Articles were authored in ProleText on the ClariNet side with the intent that a sufficiently advanced newsreader could do the processing on the client side or with a CGI script on a webserver (a translator written in C was provided). If the articles remained unprocessed by a text newsreader, however, the text was still legible. The encoded formatting could even survive cutting and pasting in a text editor to at least some degree. In fact, that's the case for the inline ProleText in this very article.

Whitespace formatting works by constraining most lines in the document to 60 characters and then using a sequence of spaces delimited by tabs to emit "tuples" (for example, [SPACE][SPACE][TAB][SPACE][SPACE][TAB][TAB] translates as a 2,2,0 tuple, which is understood as a header marker). The tuples tag the line with a single format type, or possibly a continuation from a preceding tuple. Special tags occupy their own lines and some act as containers. It is able to encode hyperlinks, headings, basic paragraphs and breaks, horizontal rules, images, lists of varying types and pre-formatted text. An in-band system of escapes using various metacharacters (#, * and _) allowed text decoration with boldface and italics, as well as inline links and images.

There is no great publicly available corpus of ProleText, so for didactic and personal gratification purposes I did two things. First, I reimplemented a ProleText to HTML translator in Perl using Brad's specification and probing his original C version of same, called inform (not to be confused with the interactive fiction package). My version corrects some edge cases with how it processed inline substitutions (see below), and I think it does a better job with more standards-compliant HTML (admittedly when this was written there was much less concern about it). On the security side it's also less likely to get suckered into emitting arbitrary characters in bad places by malicious input, and by being written in a memory-safe(r) language, it is much less prone to general misbehaviour as well.

Second, because of the extreme paucity of ProleText in the field and the relative difficulty with handwriting it in a text editor, I also implemented an HTML to ProleText translator, similar to things like Turndown. I'm not aware of any such module for ProleText ever existing publicly (Brad himself says "[a]n HTML to Proletext translator is needed") nor for any other source format. You can feed the output of one to the other and see how they interact. We'll do some examples below.

The two tools are on Github as the "PeroleText" (heheh) toolset. They are tested with Perl 5.8 or higher, and do not depend on any external modules (the HTML-to-ProleText converter in particular includes its own miniature HTML parser, because I'm one of those people who will reinvent the wheel given the slightest opportunity). They are coded so that you can either chmod +x them and run them directly, or require them into a script, or rename them to a .pm and use them; the magic unless (caller) lets the script determine its mode of operation based on the context (see example symlinks). Both tools are stream-oriented on both input and output, so you can send files or pipe output of any length to them. If you want to use them in your own code, I'll explain briefly the class functions at the end, though there are only three and they are largely the same for both modules.
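For anyone who hasn't met the unless (caller) trick before, here is a minimal sketch of the general idiom. It is not taken from the PeroleText source: the package name Shouter, its convert_line method and the main routine are all made up for illustration.

```perl
package Shouter;    # hypothetical name, for illustration only
use strict;
use warnings;

sub new {
    my ($class) = @_;
    return bless {}, $class;
}

# Trivial stand-in for a real per-line conversion.
sub convert_line {
    my ($self, $line) = @_;
    return uc $line;
}

# Command-line behaviour: convert every file named in the arguments.
sub main {
    my @files = @_;
    my $s = Shouter->new;
    for my $file (@files) {
        open my $fh, '<', $file or die "$file: $!";
        while (my $line = <$fh>) {
            print $s->convert_line($line);
        }
        close $fh;
    }
}

# caller() returns nothing only when this file is the top-level script,
# so main() runs when executed directly but not under require or use.
main(@ARGV) unless caller;

1;
```

The trailing 1; is what lets the same file satisfy require or use when it is loaded as a module instead.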

Let's start by seeing how a ProleText file is tagged. This example is the very same one Brad himself provides (the Clinton references are very 1990s), and is in the Github repo as example.txt.

% chmod +x pt2html
% ./pt2html -debug example.txt
220001> 
    80>                                 A big title
     1>                               With more for H1
    37> 
     8>   Point One
     3>                 This is the definition for Point One
     8>   Point Two
     3>                 This is an even bigger definition for point two
     2> 
     2>         WASHINGTON (AP) -- Greek Prime Minister Andreas Papandreou will
[...]

What pt2html in debug mode does is merely emit the contents of the file, but display any tuples it finds on individual lines. Blank lines can have tuples, too (a completely blank line is, reasonably, seen as a paragraph break), and in fact certain tuples only have meaning or function in a different way when they're attached to a blank line instead of one with text. The (2,2,0,0,0,1) tuple says this is a header for a ProleText document, and the ProleText version in use is 0.0.1 (which is the only known extant version). When a decoder sees this, it shifts into decoding mode and starts translating any tuples it finds on subsequent lines until it gets a trailer (2,3,0). A document might shift in and out of ProleText multiple times. Lines that are not within a header-trailer pair are considered unformatted and emitted as plain text; similarly, if the line has no tuple at all, it is also treated as unformatted plain text.
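As a rough illustration of what that tuple extraction might look like, here is a short sketch. It is not the code in pt2html, and split_tuple is a hypothetical helper. It assumes that tabs separate the fields of the trailing whitespace and that each field's value is simply its count of spaces; that reading reproduces the 220001, 80 and trailer values discussed here, but treat it as an assumption rather than a restatement of the spec.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into its visible text and the tuple
# encoded in its trailing whitespace.
# Assumption: tabs separate fields and each field's value is the number of
# spaces in it, so "  \t   \t" reads as 2,3,0.
sub split_tuple {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;                       # drop the line ending
    my ($text, $ws) = $line =~ /\A(.*?)([ \t]*)\z/s;
    return ($text) if $ws eq '';                  # no trailing whitespace: no tuple
    my @fields = split /\t/, $ws, -1;             # -1 keeps empty trailing fields
    return ($text, map { length $_ } @fields);
}

# Print a few sample lines in a style similar to the debug listing.
for my $line ("  \t  \t\t\t\t \n", "A big title        \t\n", "plain text\n") {
    my ($text, @tuple) = split_tuple($line);
    printf "%6s> %s\n", join('', @tuple), $text;
}
```

A real decoder would additionally track whether it is currently between a header and a trailer before acting on any tuple it finds.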

Tuples don't necessarily translate into specific HTML tags, even though many do, and some translate into multiples. For example, tuple 8,0 provides both a title and an <H1> heading. The first line is used as both the title and the first part of the heading, and then subsequent lines are incorporated in the heading. In HTML as generated by pt2html, it looks like this (hard wrapped for legibility; actual output is two lines):

% ./pt2html example.txt
<!DOCTYPE html><HTML><HEAD><TITLE>                              A bi
g title</TITLE></HEAD><BODY><H1>                            A big ti
tle
                              With more for H1</H1>

The tuple 1 on the next line indicates it is a continuation of tuple 8,0, so it becomes the second line of the block.

ProleText tuples are not true containers, even though some modal line tags can act like containers. Moreover, no single line can have multiple tuples attached to it (ProleText is nearly completely line-oriented). Instead, to facilitate things like inline links and boldface and other kinds of text decoration, ProleText provides a system of inline substitutions which operate within blocks. Continuing the debug output:

     2>         Clinton invited Papandreou to Washington last fall and the dates
     1> have now been #<www.clari.net/foo#> *arranged* #:, said White House Press
     1> Secretary Dee Dee Myers.
      > 
     9>                         news:clari.news.briefs
     1>                    The newsgroup clari.news.briefs

There are two ways to make a link, both demonstrated here. The second way using tuple 9 takes the first line as the URL and the second line as the text of the anchor (if a single line block, the URL is used for both the link and text). However, the first method used (in the middle of a tuple 2, a regular paragraph) uses hash characters and greater/less-than symbols to set off the URL, and a hash character followed by a colon as the terminator. This maps directly to <A HREF...> and </A>:

<P>        Clinton invited Papandreou to Washington last fall and th
e dates have now been <A HREF="http://www.clari.net/foo"> <STRONG>ar
ranged</STRONG> </A>, said White House Press
Secretary Dee Dee Myers.</P>
<P>

<A HREF="news:clari.news.briefs">                  The newsgroup cla
ri.news.briefs</A><

Also note the boldface with asterisks, same as many later formats (you can thus guess that italics use underscores). Even though these are not preformatted lines such as you would use <PRE> with, the spacing and line delimiters are passed through and become whitespace to HTML. As such the whitespace around the link text "arranged" is faithfully maintained, but neither Brad's C implementation nor my Perl implementation requires spacing to recognize inline sequences. Interestingly, boldface and italics are always automatically cancelled at the end of a line.
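The toggling and end-of-line cancellation can be sketched in a few lines of Perl. This is a simplified illustration and not the logic in pt2html: decorate_line is a hypothetical name, it handles only the asterisk and underscore decorations plus their #* and #_ escapes, and it assumes the metacharacters toggle wherever they appear.

```perl
use strict;
use warnings;

# Hypothetical helper: turn *bold* and _italic_ into STRONG/EM on one line,
# honouring the #* and #_ escapes and closing anything left open at the end.
sub decorate_line {
    my ($line) = @_;
    my $eol = $line =~ s/(\r?\n)\z// ? $1 : '';   # set the line ending aside
    my %tag_for = ('*' => 'STRONG', '_' => 'EM');
    my @open;                                     # stack of open tag names
    my $out = '';
    while ($line =~ /\G(#[*_]|[*_]|[^#*_]+|#)/gc) {
        my $tok = $1;
        if ($tok =~ /\A#([*_])\z/) {
            $out .= $1;                           # escaped literal character
        } elsif (my $tag = $tag_for{$tok}) {
            if (grep { $_ eq $tag } @open) {
                # Close down to the matching tag, then reopen what was
                # nested inside it so the HTML stays well formed.
                my @reopen;
                while ((my $top = pop @open) ne $tag) {
                    $out .= "</$top>";
                    unshift @reopen, $top;
                }
                $out .= "</$tag>";
                for my $t (@reopen) {
                    $out .= "<$t>";
                    push @open, $t;
                }
            } else {
                $out .= "<$tag>";
                push @open, $tag;
            }
        } else {
            $out .= $tok;                         # ordinary text
        }
    }
    $out .= "</$_>" for reverse @open;            # cancel at end of line
    return $out . $eol;
}

print decorate_line("have now been *arranged* today\n");
```

Other hash sequences such as the link markers pass through this sketch untouched, so they would need their own handling.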

Inline sequences also permit inserting images (there's a conventional tuple for images too). Another unusual touch is that even though the link lacked the http:// portion, the spec specifically requires that anything starting with www. have the protocol added to it. This saves precious line width, though I note with some amusement that even though the spec strongly urges a max of 60 character lines, this example file doesn't adhere to that advice.
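A minimal sketch of the inline link substitution, including the www. rule, might look like the following. The function name inline_links is hypothetical and this is not the code from pt2html; it handles only links contained on a single line, and it deliberately declines to substitute anything containing quotation marks or angle brackets between the markers.

```perl
use strict;
use warnings;

# Hypothetical helper: expand #<url#> ... #: into an HTML anchor.
sub inline_links {
    my ($line) = @_;
    # Opening marker: the URL may not contain quotes or angle brackets.
    $line =~ s{\#<([^"<>]+?)\#>}{
        my $url = $1;
        $url = "http://$url" if $url =~ /\Awww\./;   # add the missing protocol
        qq{<A HREF="$url">}
    }ge;
    # Closing marker.
    $line =~ s{\#:}{</A>}g;
    return $line;
}

print inline_links("have now been #<www.clari.net/foo#> arranged #:, said\n");
```

A production version would also want to escape characters like ampersands in the URL before placing it in the attribute.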

No system of generating ProleText from other types of documents seems to have publicly survived, so let's turn our attention to the second tool I wrote, html2pt. This turns arbitrary HTML, even something you might pipe to it from curl, into as good a reproduction in ProleText as it can automatically generate.

Let's compare it with the more famous Turndown, which is the same type of process for Markdown (alternatively, compare with the output from something like Html2Markdown). The test vector used below is provided in the Github repo as test.html. Turndown, using that HTML as input, generates this output:

Turndown Demo
=============

This demonstrates [turndown](https://github.com/mixmark-io/turndown) – an HTML to Markdown converter in JavaScript.

Usage
-----

    var turndownService = new TurndownService()
    console.log(
      turndownService.turndown('<h1>Hello world</h1>')
    )

* * *

It aims to be [CommonMark](http://commonmark.org/) compliant, and includes options to style the output. These options include:

*   headingStyle (setext or atx)
*   horizontalRule (\*, -, or \_)
*   bullet (\*, -, or +)
*   codeBlockStyle (indented or fenced)
*   fence (\` or ~)
*   emDelimiter (\_ or \*)
*   strongDelimiter (\*\* or \_\_)
*   linkStyle (inlined or referenced)
*   linkReferenceStyle (full, collapsed, or shortcut)

The ProleText equivalent, or at least the best match I can make the script generate, isn't too different superficially (the tuples are present in this output; try drag-selecting the text to see them):

% ./html2pt test.html
  	  				 

Turndown Demo  	


This demonstrates  
#<https://github.com/mixmark-io/turndown#>turndown#: – 
an HTML to Markdown converter in JavaScript. 

Usage   	

  	   	
var turndownService = new TurndownService()
console.log(
  turndownService.turndown('<h1>Hello world</h1>')
)
  	  				 

------------------------------------------------------------	 

It aims to be #<http://commonmark.org/#>CommonMark#:  
compliant, and includes options to style the output. These 
options include: 
   	 
    * headingStyle (setext or atx)   
    * horizontalRule (#*, -, or #_)   
    * bullet (#*, -, or +)   
    * codeBlockStyle (indented or fenced)   
    * fence (` or ~)   
    * emDelimiter (#_ or #*)   
    * strongDelimiter (#*#* or #_#_)   
    * linkStyle (inlined or referenced)   
    * linkReferenceStyle (full, collapsed, or shortcut)   

 
  

Unlike the example file, html2pt tries aggressively to keep everything to 60 characters as preferred in the specification. If we pipe that to ./pt2html -debug, we can see the tuples explicitly, or pipe it to ./pt2html to generate something very similar to the original HTML, including the preformatted plain text section in the middle and the unordered list. The literal asterisks and underscores in the list use different escapes than the backslashes in Markdown, though another part of the spec is that inline lists using asterisk bullets become list items, as you would expect. The same thing is true for ordered lists and dictionaries.
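To make the wrapping and tagging concrete, here is a simplified sketch of how a plain paragraph could be turned into ProleText. It is not html2pt's code and to_prole_paragraph is a hypothetical name. It assumes that the 60-character limit applies to the visible text only, and that tuple 2 (paragraph) and tuple 1 (continuation) are written as two and one trailing spaces respectively, as the sample output suggests.

```perl
use strict;
use warnings;

# Hypothetical helper: escape, wrap and tag one paragraph of plain text.
sub to_prole_paragraph {
    my ($text, $width) = @_;
    $width ||= 60;
    $text =~ s/([*_])/#$1/g;             # escape literal metacharacters
    my @lines;
    my $cur = '';
    for my $word (split ' ', $text) {    # greedy word wrap
        if ($cur ne '' && length($cur) + 1 + length($word) > $width) {
            push @lines, $cur;
            $cur = $word;
        } else {
            $cur = $cur eq '' ? $word : "$cur $word";
        }
    }
    push @lines, $cur if $cur ne '';
    my $tag = '  ';                      # assumed tuple 2: start of paragraph
    my $out = '';
    for my $l (@lines) {
        $out .= "$l$tag\n";
        $tag = ' ';                      # assumed tuple 1: continuation
    }
    return $out;
}

print to_prole_paragraph(
    "It aims to be CommonMark compliant, and includes options to style " .
    "the output. These options include:"
);
```

A single word longer than the width is left on a line of its own rather than split.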

Its aggressiveness about line length extends to links. If a link is too long to inline in the text, html2pt will try to use tuple 9 and just eat the long line there rather than have a long line mess up text flow. The downside of doing it this way is that tuple 9 is a block of its own, so it tends to introduce an epenthetic paragraph break immediately after even in those situations where it can recover what the last tuple block was. If a line breaks in the middle of a section of boldface and/or italics, a new set of inline substitutions is automatically emitted on the next line to continue it as seamlessly as possible.

Relative links are emitted as she is spoke and thus will work fine if the document is rendered back to HTML, though jumps to fragment anchors in the text aren't currently possible (while tuple 4,2 lets you emit a series of autogenerated anchors, there's no way to generate an exact anchor the document specifies).

Let's feed it something a little bigger; I picked this old entry of mine because it has some preformatted blocks as well. Blogger puts a lot of crap into the page which doesn't translate through and causes a lot of blank lines, so html2pt has output filtering to try to cut down on the spew. Tuples are invisibly present in the text below.

% curl http://oldvcr.blogspot.com/2022/04/tonight-were-gonna-log-on-like-its-1979.html | ./html2pt
 				 
Old Vintage Computing Research: Tonight we're gonna log on       	
like it's 1979 (Telenet, Dialcom and The Source) 


#<http://oldvcr.blogspot.com/#> Old Vintage Computing  	
Research #: 


REWIND and PLAY  



Sunday, April 10, 2022   	

#:  

Tonight we're gonna log on like it's 1979 (Telenet,    	
Dialcom and The Source) 


https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_vWjs5J6UFSm28d70dufz1R59zD4SURtPn7em4NHdZv4JASS9mTvhFy1YXTBpendiyQ_0iPGSgo5uyTwH3SpVtoNV6--ySkAQBu2pomDfuWlC-KtdzLy2EWJ3Wdk3topIHlbXQnrGAgb5ZtQMgjjBrp-F5zCZxwAvnKzDVYtfietfEXtss4iHzGmV/s4080/IMG_20220403_210236.jpg         
[Image] 

Teletypes may have killed a lot of forests by emitting  
every line to hard copy instead of a screen, but there's 
something to be said for the permanence of paper, 
especially when people hang onto it for some reason. While 
getting duff units to build a functional 
/2022/02/refurb-weekend-texas-instruments-silent.html         
Silent 700 Model 765 ASR teletype 

Obviously pages with lots of text like this translate relatively well, but ProleText was designed for that use case specifically, so that shouldn't be too surprising. Notice the spurious #: end of an anchor, which came from a <a name="..."> it couldn't translate. After dithering over whether I should suppress them there, I decided to leave it alone since it makes a nice visual section header (pull requests welcome if you disagree). As you go through the document, you'll notice it jump in and out of preformatted blocks by emitting trailer and header tuples, which is absolutely acceptable behaviour per the spec.

While most of the documented tuples are implemented in pt2html (note that html2pt does not generate the full spectrum, for a variety of reasons), there are a few which I decided not to implement. Nevertheless, some are unusual features that have little peer in other systems, so they're worth mentioning. Perhaps the most unusual is the macro facility, where a document encoded with a future version of ProleText can "polyfill" an old decoder by providing substitution hints on tuples it may not support. These hints can even nest up to 10 levels. However, there's no future version of ProleText to demonstrate with it, so the facility presently has no functional use (if you pass an unknown tuple to pt2html, currently it will simply echo the line and flag the offending value). Related to this I've also not implemented the behaviour for handling undefined line tags: again, there's nothing to test it with, and coming up with some and causing html2pt to generate them frankly defeats the purpose of historical reconstruction.

A tuple I've outright refused to implement is tuple 4, or raw HTML. That's like it sounds: the HTML is emitted to the client. I think raw HTML escapes are asking for trouble — I don't like it in Markdown either — and so using tuple 4 will generate a warning and be treated like an ordinary paragraph. Although line tag 3,4 is also called RAW, this version wraps its blocks in <XMP> tags instead of emitting it straight. Theoretically this should escape any HTML but the tag is quite antiquated and may not be handled well in modern browsers, so I map them to <PRE> and substitute any problematic characters.
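The general shape of that substitution is straightforward. The following sketch is illustrative only: raw_to_pre is a hypothetical name, and the set of characters treated as problematic here (ampersand, angle brackets and double quote) is an assumption rather than a list taken from pt2html.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a block of lines in <PRE>, escaping anything
# a browser might otherwise interpret as markup.
sub raw_to_pre {
    my (@lines) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    my $body = join '', @lines;
    $body =~ s/([&<>"])/$entity{$1}/g;   # single pass, so & is not double-escaped
    return "<PRE>$body</PRE>\n";
}

print raw_to_pre("turndownService.turndown('<h1>Hello world</h1>')\n");
```

Doing the replacement in one pass over a character class avoids the classic mistake of escaping the ampersands that earlier substitutions just introduced.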

I've also not followed the spec exactly as written for #<...#> and #{...#}, which implement inline substitutions for links and images respectively. In the spec the first half of these pairs emits a partial HTML tag <TAG... and the second half closes the tag with >, which smells ripe for abuse by a cleverly malicious document that might insert metacharacters in the middle. Instead, an entire discrete set of one or the other must appear on one line and not have quotation marks or greater/less than signs to be substituted between them; if such hankypanky is detected, or the substitutions are unbalanced, they are emasculated by escaping the offending sequence. I haven't done the same for any unbalanced #: because multi-line links are allowed, and #: generates a full tag </A> anyway, which doesn't seem to hurt anything if there wasn't a link specified before it.

In any event, the HTML-to-ProleText converter should be considered a work in progress that worked acceptably well only for the corpus I ran through it during development. It also needs to have more HTML entities added to its converter. Something for a future lazy weekend.

The code and these examples are on Github. Since Brad merely copyrighted the existing C converter and didn't put it under any particular license, I've put these under the Perl Artistic License 2.0. If you want to play with ProleText in your own Perl script instead of just running them on the command line, you can either require or use them directly into your own code (no modification required). The ProleText-to-HTML converter object is called PeroleText and the HTML-to-ProleText converter object is called PeroleHTML. They are line-oriented; you feed lines of text to them until you're out of data. Both have just three methods, differing only in the arguments to the constructor:

 my $p = new PeroleText([$debug]);
 print $p->proline($string);
 print $p->done;
 
 my $p = new PeroleHTML([$debug,][$img]);
 print $p->proline($string);
 print $p->done;

The $debug argument (0=false, 1=true) indicates whether to run the converter in debug mode. For ProleText to HTML, this emits the tuples. For HTML to ProleText, this emits the view of the HTML parser. The second argument, only with PeroleHTML, says whether to always replace images with their ALT text (if true, <IMG> tags are used as is). Any unspecified argument to the constructor is treated as false.

Then, for each line of data, call the method proline (this is a specific pun I refuse to explain but Canadian Commodore users should get — hint: Jim Butterfield and Steve Punter), which will return a string you can print or store. This string may not include all of the elements you passed it, particularly in the HTML converter, so when you're out of data call the method done to return anything left over in the object's buffer. The object instance is not reusable when the conversion is complete; destroy it or let it go out of scope after calling done, and make a new instance if you need to do a new operation. Although there are other "secret" instance methods available for these objects, they should be considered neither public, stable nor supported. If you want a one-liner example, try this (in the repo directory):

% perl -I. -MPeroleHTML -e 'my $p=new PeroleHTML();while(<>){print $p->proline($_)} print $p->done;' < test.html

The remit and spectrum of supported HTML being as limited as it was in the mid-1990s, ProleText was no doubt a concept a bit too ahead of its time. Combine that with the lack of an ecosystem, no publicly available generators, a more difficult means of editing and its origination as part of a proprietary commercial service, and it's not too surprising it didn't catch on outside of ClariNet. (For that matter, it's not clear how much, if at all, it remained in use even at ClariNet after the company was sold in 1997 to Individual Inc.) Plus, the multiple and not quite overlapping alternatives for some constructions and an eclectic, non-orthogonal and incomplete (even for 1996, the last date on the C translator files) selection of tags make it questionably suited for the expanding Web despite attempts at future-proofing. Templeton, while asserting it "worked well for delivering ClariNet's news," himself acknowledged its flaws and proposed a later standard called Out of Band HTML, which ended up never being implemented at all. Any similarity to later text markup systems thus seems frankly coincidental and almost certainly a case of convergent evolution.

Nevertheless, ProleText remains an indisputable example of (forgotten) prior art, and demonstrates clearly that even in the mid-1990s there existed a system tolerant of manipulation that does nearly everything modern Markdown does, and did it even before many of Markdown's explicitly acknowledged influences. Its unique design choices have their own unusual consequences, but it seems to have achieved the purpose it was created for even if it survived no further.
