Sunday, 2013-04-28
Forth to HTML
I keep thinking about writing a literate programming tool for Forth. It seems like a good fit. Peter Knaggs suggests that the major benefits of literate programming (in order of importance) are1
- Flexible order of elaboration
- Automatic support for browsing
- Typeset documentation
Modern languages often allow you to feed your programs to the compiler in whatever order you choose, and have good support for factoring out any piece you want. Similarly, modern IDEs generally provide good support for browsing -- jumping to the definition of a function, popping up the documentation for a function while you are writing code so you can see what the arguments are supposed to be, even some automatic refactoring operations. And they almost always provide syntax highlighting to display code in a nicely readable format. So in many cases the only real benefit of literate programming these days is to provide typesetting for mathematics and figures in the comments of your program.
But Forth lacks many of these advantages. It requires you to define words before using them, dictating the order of elaboration. And Forth enthusiasts mostly seem to use development environments which are abysmally lacking in features compared to those for other languages.
So here is a little program which marks up
Forth source code with HTML syntax highlighting, with a stylesheet which displays it in pseudo-colorforth
style. It is somewhat gforth specific, as I haved used
evaluate-with
and include-file-with
(aliases
of gforth's names execute-parsing
and
execute-parsing-file
, which I don't like). I have copied
the marked-up code into this HTML document and added cross-references by
hand. Eventually I'd like to write a tool which does it
automatically.
We use what is basically a Forth interpreter loop, only instead of interpreting the code, we simply mark it up and output it. We wrap up this loop in two words allowing us to easily mark up a file or a (single-line) string.
: markup-line ( "..." -- )
begin parse-word dup while
2dup find-markup execute
repeat 2drop cr ;
: markup-input ( "..." -- ) begin markup-line refill 0= until ;
: markup-string ( c-addr u -- ) ['] markup-input evaluate-with ;
: markup ( "filename" -- )
parse-word r/o open-file throw
['] markup-input include-file-with ;
We have redefined parse-word
to echo the whitespace
before the word, and avoid eating any whitespace after the word.
: parse-word ( "word" -- c-addr u )
source >in @ /string dup >r
2dup skip-ws 0 split type-html
2dup scan-ws 0 split
2swap nip r> swap - >in +! ;
type-html
escapes ‘&’,
‘<’, and ‘>’. We start with a word which
takes an xt which decides whether a character should be replaced with an
encoded version. Then we use this to implement an HTML-encoding version
and one which also encodes double quotes (for use in HTML
attributes).
: type-encoded ( c-addr u xt -- )
>r 2dup begin dup while over c@
r@ execute if
2>r dup >r 2swap r> - type 2r> type
1 /string 2dup
else 1 /string then
repeat 2drop type r> drop ;
: entity? ( char -- false | c-addr u true )
dup [char] & = if drop s" &" true else
dup [char] < = if drop s" <" true else
dup [char] > = if drop s" >" true else
dup 9 = if drop s" " true else \ tabs are 4 chars
drop false then then then then ;
: type-html ( c-addr u -- ) ['] entity? type-encoded ;
: attr-encoded? ( char -- false | c-addr u true )
dup [char] " = if drop s" %22" true else entity? then ;
: type-attr ( c-addr u -- ) ['] attr-encoded? type-encoded ;
Then we have a few words which output HTML tags.
: quoted ( c-addr u -- ) [char] " dup >r emit type-html r> emit ;
: attr ( c-addr u -- )
[char] " dup >r emit type-attr r> emit ;
: span{ ( c-addr u -- ) ." <span class=" quoted ." >" ;
: named-span{ ( s.class s.name -- )
." <span id=" attr space ." class=" quoted ." >" ;
: }span ( -- ) ." </span>" ;
: .tagged ( s.text s.type -- ) span{ type-html }span ;
On top of the basic tag-generating words, we build others for each class of text—definitions, immediate words, literal values, and comments.
: .definition ( c-addr u -- )
s" d" 2over named-span{ type-html }span ;
: .immediate ( c-addr u -- ) s" i" .tagged ;
: .literal ( c-addr u -- ) s" l" .tagged ;
: comment-span{ ( -- ) s" t" span{ ;
We want to define our markup words in a separate wordlist to avoid overwriting things. But different Forth systems handle naming differently—some put the new word in the wordlist in effect when you begin the definition, some (including gforth) put it in the wordlist which is current when you end the definition. So we want to switch the definitions wordlist for the whole definition and then restore it. These helper definitions make that easier...
variable restore-wid false restore-wid !
: in-wordlist ( wid1 -- wid2 )
get-current true restore-wid ! swap set-current ;
: ?restore-current ( | wid -- )
restore-wid dup @ false rot ! if set-current then ;
: ; ( C: colon-sys | wid colon-sys -- )
postpone ; ?restore-current ; immediate
Then our markup definition and lookup words are simple. Note that
the lookup word returns type-html
if no special behavior is
found. Also, since we are not adding the markup
wordlist to the search order, we define a word to bind to another markup
definition.
wordlist constant markup-words
: markup: ( "name" -- wid colon-sys )
markup-words in-wordlist : ;
: find-markup ( c-addr u -- xt ) ( xt: ... c-addr u -- ... )
markup-words search-wordlist 0= if ['] type-html then ;
: [markup-for] ( "name" -- )
parse-word find-markup compile, ; immediate
Now we can define all the markup. Control flow words are immediate.
markup: ahead .immediate ;
markup: if .immediate ;
markup: else .immediate ;
markup: then .immediate ;
markup: begin .immediate ;
markup: again .immediate ;
markup: until .immediate ;
markup: repeat .immediate ;
markup: while .immediate ;
markup: do .immediate ;
markup: loop .immediate ;
markup: +loop .immediate ;
There are a few words like postpone
which parse the next
word and mark it as a literal.
markup: postpone .immediate parse-word .literal ;
markup: ['] [markup-for] postpone ;
markup: [char] [markup-for] postpone ;
Comments and string literals are similar, but with a different parsing (and of course different display classes).
: parse-line ( -- c-addr u ) source >in @ /string dup >in +! ;
markup: \
comment-span{ type-html parse-line type-html }span ;
markup: (
comment-span{ type-html
[char] ) dup parse type-html emit }span ;
markup: .(
type-html [char] ) parse .literal [char] ) emit ;
markup: s"
type-html [char] " parse .literal [char] " emit ;
markup: ." [markup-for] s" ;
Colon definitions use a state variable.
variable markup-state false markup-state !
markup: :
markup-state @ if type-html else
type-html parse-word .definition
s" c" span{ true markup-state !
then ;
markup: ;
}span type-html false markup-state ! ;
Defining words and normal (non-immediate) character and xt literals check this variable to see whether they should mark the next word as a literal.
markup: variable
type-html markup-state @ 0= if
parse-word .definition
then ;
markup: constant [markup-for] variable ;
markup: alias [markup-for] variable ;
markup: char
type-html markup-state @ 0= if
parse-word .literal
then ;
markup: ' [markup-for] char ;
And of course we want to highlight our markup defining words correctly.
markup: markup: [markup-for] : ;
markup: [markup-for] [markup-for] postpone ;
And...that's about it, except for the string-handling words we used
to define our new parse-word
.
\ split S at the start of S', removing N characters
: split ( s s' n -- s'' s.first )
over >r over min /string 2swap r> - ;
: scan-ws ( c-addr u -- c-addr' u' )
begin dup while over c@ bl > while
1 /string
repeat then ;
: skip-ws ( c-addr u -- c-addr' u' )
begin dup while over c@ bl <= while
1 /string
repeat then ;