The Bad Ballet of Regular Expressions: Pathological Programming in Thutu

For today’s installation of programming insanity, I decided to go with a relative of Thue, which is one of my favorite languages insane languages that I wrote about before. Thue is a language based on a rewriting system specified by a semi-Thue grammar. Todays language is called Thutu (pronounced tutu); it’s a string rewriting system like Thue, only it’s based on regular expressions instead of grammars, and it’s even got regular expression-based control flow mechanisms, making it a sort of hybrid language.

The scary thing about Thutu is that it’s not all that different from a language I’ve wanted to find some time to write myself – except that the one I want to write isn’t intended to be pathological. I’ve never stopped missing Teco for writing text processing programs; and since
my TECO programs tended to be roughly of the form: “Find something matching this pattern, and then take this action”, a regular-expression based language would make a lot of sense.

But anyway, today we’re looking at Thutu, which is a deliberately obscure version of this idea.

Thutu syntax is… interesting. It uses whitespace/indentation for program structure – except that a tab is always counted as 8 spaces. Not spaces up to the next 8th column, but 8 space. So space-tab is 9 spaces. And it’s deliberately ambiguous in how it uses regular expressions. But we’ll get to that in a moment. First let’s look at the regular expression syntax. It’s also perl-REs, so this should be familiar to most people, since Perl REs have become the de-facto standard for regular expression syntax. Regular expressions start with a “/”, and continue to the next non-escaped “/”. Most characters in a RE represent themselves, except for the regular expression’s syntax characters. To use a syntax character in a RE, you precede it with a backslash. The RE constructs are:

  • re?: match re 0 or 1 times (with 1 preferred).
  • re??: match re 0 or 1 times (with 0 preferred).
  • re+: match re 1 or more times, matching as many as possible while having the entire regular expression match successfully.
  • re+?: match re 1 or more times, matching as few as possible while having the entire regular expression match successfully.
  • re*: match re 0 or more times, matching as many as possible while having the entire regular expression match successfully.
  • re*?: match re 0 or more times, matching as few as possible while having the entire regular expression match successfully.
  • (re): grouping. The grouped regular expression can be referenced
    by position later, and constructs like “?” and “+” will be applied to the entire content of
    the parens, rather than just the single preceeding character.
  • re1|re2: match either re1 or re2.
  • number: match a copy of whatever was already matched by the numberth paren group.
  • [chars]: match any one of the characters inside the brackets.
  • [^chars]: match any character which is not one of the characters inside the brackets.
  • .: match any single character.
  • ^: match the beginning of the string.
  • $: match the end of the string.

There are two types of statements in Thutu: replacement statements, and control statements. Both replacement and control statements start with a sequence of regular expressions. If the last regular expression is followed by a command character, then it’s a control statement; otherwise, it’s a replacement statement, and the last regular expression actually specifies a replacement. When a statement executes, all of the regular expressions are matched in sequence against the input string. If all match, then the replacement or command is executed. If it’s a replacement, the replacement expression is used to replace the string matched by the last regular expression preceeding it.

A command statement starts a block of code. The block of code consists of the sequence of lines following the command statement that are indented more than the command statement. The command statement that starts a block is called the block marker line for the block, and is not considered part of the block: the block is the statements indented more than the block marker line. (Confused yet? Good! me too!)

Fortunately, there aren’t many commands: just “*”, “@”, “!”, “^”, “<“, “>”, and “.”; and of those, “<“, “>”, and “.” don’t actually start blocks.

  • “*”: loop including block marker: if the regular expressions preceeding the “*” all match, then start trying to match the statements inside the block. If any replacement is successful, or if a line with the “<” as a command reached, go back to the block marker line and do it again. The loop exits if either none of the body statements match, or if after an iteration, the regular expressions in the block marker line itself don’t match.
  • “@”: same as “*”, except that successful replacements and “<” go back to the first line of the block, not to the block control line. So “@” only matches the block control line’s expressions once.
  • “!”: same as “*”, except that the loop body is executed only if none of the
    regular expressions preceeding the “!” match. Like “*”, after a match, the block marker
    line is re-matched.
  • “^”: same as “@”, except that the loop body is executed only if none of the regular expressions preceeding the “^” match; like “@”, the block marker line’s regular expressions
    are not re-matched.
  • “<“: reloop – go back to either the block marker line, or the block’s first line, depending on which type of block it is.
  • “>”: unloop – go to the first statement after the end of the block.
  • “.”: noop – do nothing.

The basic flow and input-output cycle of Thutu is decidedly weird. A Thutu program is treated as if the entire program was enclosed in an “@” loop. The program is always started with an initial input string of “=1”. After an iteration is complete, it does the following:

  • If the string contains “=x”, then everything up to the “=x” is removed, and the character before “=x” are printed to the standard output.
  • If the string contains “=9”, then then the program exits.
  • If the string did not contain “=9”, then one line of input is read, escaped appropriately, and the resulting string concatenated with “=x” is inserted onto the front of the
    string.

And then, if no “=9” was found, it runs another iteration.

There’s not much in the way of example programs in Thutu, which is kind of a shame. But there are a couple of real beauties. As usual, we’ll start with a hello world, which is kind of boring, but it gets exciting pretty quickly after that.

/^=1$/Hello, world!=n=x=9/

So, we’ve got one handy little regular expression here. It looks for “=1”, the normal initial string, and replaces it with “Hello world”, followed by a newline, followed by “=x” and “=9”. So, the main program loop finishes its iteration, and looks for “=x”; prunes it out, and outputs whatever preceeded it – the “Hello world” string with the newline. Then it checks for “=9”, finds it, and terminates. Kaplooie, end of program. Simple!

/=t//
/ //
/=9./=9/
/^(.*)=x/<$1>/
/<0=+0>//
/>*/%>/
/=+([*%]?)>/=+0$1>/
/<=+([0-9])//1>/
/1%>/2>/
/2%>/3>/
/3%>/4>/
/4%>/5>/
/5%>/6>/
/6%>/7>/
/7%>/8>/
/8%>/9>/
/9%>/*0>/
/%(.)/$1%/
/9=+/8=+%/
/8=+/7=+%/
/7=+/6=+%/
/6=+/5=+%/
/5=+/4=+%/
/4=+/3=+%/
/3=+/2=+%/
/2=+/1=+%/
/1=+/0=+%/
/0=+([^%>*]*)(*?[0-9])>/=+$1>$2/
/=x/=1/=9/!
/$/=n=x/
*
/=1//
.

This beast is an adder. First, it strips out whitespace and other forms of noise. Then
it does the same sort of add algorithm that we saw in Thue; only in Thue, it was simple binary addition; Thutu is more ambitious and goes for decimal. But the basic principle is the same – use a marker character to keep track of where you are, and do it one step at a time, with hard-coded rules like “/7=+/6=+%/” for subtracting one from the left operand, and adding one “%” to the right.. So “3+2” turns into “2+%2”. then “1+%%2”, then “0+%%%2”.

Then one at a time, it flips the “%”s to the other side of the right operand, using “/%(.)/$1%/“; and finally gets rid of the “%”s using increment rules like “/3%>/4>/“. So for our example, it would proceed through “%%%2”, “%%2%”, “%2%%”,
“2%%%%”, “3%%”, “4%”, “5”.

Pretty nifty, huh?

And now, it’s nightmare time. The proof that it’s Turing complete: an Underload interpreter. Fortunately, it’s pretty nicely commented.

/=x//
# We don't want the line above running until the next iteration,
# so guard it with a loop. The same loop also checks for =9 and
# prevents anything further happening in that case.
/=9/!
# -- is used to mark the current location within the program.
/--/=1/!
/^/--/
# =x is needed to mark the end of the output. Also output a newline!
/=x/=1/!
/^/=n=x/
# If there isn't a stack, create one. %% is used as a stack marker.
/%%/=1/!
/$/%%/
# Turn temporary brackets {{ }} back to =( =)
/{{/=(/
/}}/=)/
# The next block can leave the temporary bracket reversion until later.
/=1/!
# '()': push onto stack.
/--=(/@
# Change inner brackets at the start of the program to {{ }} temporarily.
/--=(([^)]*?)=(([^)]*?)=)/--=($1{{$2}}/
# Move the set from the program to the stack.
/--(=([^)]*?))(.*?)%%/--$2%%$1/
# ':': duplicate the TOS.
/--=:/@
# Change inner brackets at the start of the stack to {{ }} temporarily.
/%%=(([^)]*?)=(([^)]*?)=)/%%=($1{{$2}}/
# Simultaneously remove the : and duplicate the TOS.
/--=:(.*?)%%(=([^)]*?))/--$1%%$2$2/
# 'a': enclose the TOS in brackets.
/--a/@
# Change inner brackets at the start of the stack to {{ }} temporarily.
/%%=(([^)]*?)=(([^)]*?)=)/%%=($1{{$2}}/
# Simultaneously remove the a and enclose the TOS.
/--a(.*?)%%(=([^)]*?))/--$1%%=($2=)/
# 'S': output the dequoted TOS, popping it.
/--S/@
# Change inner brackets at the start of the stack to {{ }} temporarily.
/%%=(([^)]*?)=(([^)]*?)=)/%%=($1{{$2}}/
# Move the TOS to the end of the output area, and remove the S.
/=n=x(.*?)--S(.*?)%%=(([^)]*?)=)/$3=n=x$1--$2%%/
# '!': pop the TOS.
/--=!/@
# Change inner brackets at the start of the stack to {{ }} temporarily.
/%%=(([^)]*?)=(([^)]*?)=)/%%=($1{{$2}}/
# Simultaneously remove the ! and the TOS.
/--=!(.*?)%%=([^)]*?)/--$1%%/
# '~': swap the top two stack elements.
/--=~/@
# Change inner brackets at the start of the stack to {{ }} temporarily.
/%%=(([^)]*?)=(([^)]*?)=)/%%=($1{{$2}}/
# Change inner brackets in the second stack element to {{ }} temporarily.
/%%=(([^)]*?)=)=(([^)]*?)=(([^)]*?)=)/%%=($1=)=($2{{$3}}/
# Simultaneously remove the ~ and swap TOS and SE2.
/--=~(.*?)%%(=([^)]*?))(=([^)]*?))/--$1%%$3$2/
# '*': concatenate the TOS after the second stack element.
/--=*/@
# Change inner brackets at the start of the stack to {{ }} temporarily.
/%%=(([^)]*?)=(([^)]*?)=)/%%=($1{{$2}}/
# Change inner brackets in the second stack element to {{ }} temporarily.
/%%=(([^)]*?)=)=(([^)]*?)=(([^)]*?)=)/%%=($1=)=($2{{$3}}/
# Simultaneously remove the ~ and concatenate SE2 and TOS.
/--=*(.*?)%%=(([^)]*?)=)=(([^)]*?)=)/--$1%%=($3$2=)/
# Loop back if there are any non-^ commands left to run.
/--[aS]/<
/--=[(:!~*]/<
# '^': move TOS to the program, dequoting it. We then need {{ }} reversion.
/--=^/@
# Change inner brackets at the start of the stack to {{ }} temporarily.
/%%=(([^)]*?)=(([^)]*?)=)/%%=($1{{$2}}/
# Simultaneously remove the ^ and move the TOS to the start of the program.
/--=^(.*?)%%=(([^)]*?)=)/--$2$1%%/
# Loop back if there are any commands left to run.
/--[aS]/<
/--=[(:!~^*]/<
# Loop back if dequoting's needed.
/{{/<
/}}/<
# Clear the stack and IP in preparation for the next program to run.
/=x./*
/=x./=x/
# Remove =1 for the first iteration. Don't loop back afterwards!
/=1/*
/=1//
.

0 thoughts on “The Bad Ballet of Regular Expressions: Pathological Programming in Thutu

  1. Xanthir

    Very interesting, actually. Sort of a combo of a regexp and a PDA, with looping control thrown in for Turing-completeness. Pathological, but relatively simply once you get used to it.
    The control statements don’t seem too difficult. * can be compared to a while, and @ is an if-while, in a way. ! and ^ are the negations of each, an until and an unless-while. Then are self-explanatory control structures, allowing arbitrary looping.
    It doesn’t appear to have any input capability. Everything’s hardcoded?

    Reply
  2. ais523

    As the creator of Thutu, I can say that it does have I/O. At the end of each loop of the main program, a line’s output (if necessary), and if the program doesn’t exit via =9 a line’s read from the input, escaped using =, and prepended to the input string followed by =x. So the null program’s almost cat, except that it strips all newlines from the input.

    Reply
  3. Xanthir, FCD

    Oh, that’s what was meant by that line in Mark’s post. Sorry, I was quite confused. Consider me corrected, then.
    I think the reason I like Thutu is because the concept of iterating through the program, constantly pattern-matching, is in the same vein as Fractran. That’s still my favorite pathological language.

    Reply
  4. Daniel Lyons

    Mark,
    You mention that you’d like to have a language that takes the form pattern -> action. That sounds just like AWK to me. The cover of the AWK book used to be a picture of a tree with different colored apples on it, with boxes around the red ones, and a picture of the tree with the red apples falling onto the ground. That’s a good analogy for the language itself, which consists of regular expressions mapped to blocks of code. It assumes that you’re dealing with standard input. I also consider it SQL for plain text files, since (by default) it is trying to break the input into records with fields.
    Hope you’ll check it out. Definitely a forgotten tool from the golden era of UNIX.

    Reply
  5. Mark C. Chu-Carroll

    Daniel:
    I know about awk. It’s no TECO :-).
    There are a lot of differences… awk is basically line-oriented, focused on a model of “read line/match/process”; whereas TECO was focused on text as character buffers. In TECO, you could treat a buffer as characters, or words, or lines, or paragraphs, or…. And you had the ability to push the cursor around the file in whatever direction you wanted based on previous actions.
    sed is actually closer to what I’d like than awk.

    Reply
  6. Daniel Lyons

    Mark,
    It’s true that it has a certain metaphor. Working around it is hard. You can, at least with GNU AWK, get it to treat the “record” as a bundle of lines and separate fields more intelligently. But you’re definitely not working with the input as just this big string.
    Sed is capable of some pretty cool tricks. I haven’t ever used TECO though I know about its history. You’re probably one of the only people left who’s used it, let alone misses it. OTOH, there is an open-source SNOBOL interpreter out there. It takes all kinds. 🙂
    Do let us know how your TECO-in-Haskell implementation starts coming along. 😉

    Reply
  7. Jonathan Vos Post

    TECO: I’ve used it, I miss it, I said so on a previous thread of your blog. With all due respect, Daniel Lyons is being some sort of reverse snob. I’ve also used and taught SNOBOL. No programming language is best for everything; each is best at something. Let a thousand flowers bloom.

    Reply
  8. Xanthir, FCD

    If by “something” you restrict yourself to “some kind of program”, then I will venture that Intercal is good for nothing.

    Reply
  9. davidp

    I had a useful early programming lesson in TECO.
    I wrote a script about 0.5k in size and:
    a) never fully debugged it
    b) went on 10 days holiday and couldn’t make head or tail of it on coming back.
    I should have learnt something about documenting designs
    I certainly learnt about write only code.

    Reply

Leave a Reply