{"id":337,"date":"2007-03-09T08:07:17","date_gmt":"2007-03-09T08:07:17","guid":{"rendered":"http:\/\/scientopia.org\/blogs\/goodmath\/2007\/03\/09\/the-bad-ballet-of-regular-expressions-pathological-programming-in-thutu\/"},"modified":"2007-03-09T08:07:17","modified_gmt":"2007-03-09T08:07:17","slug":"the-bad-ballet-of-regular-expressions-pathological-programming-in-thutu","status":"publish","type":"post","link":"http:\/\/www.goodmath.org\/blog\/2007\/03\/09\/the-bad-ballet-of-regular-expressions-pathological-programming-in-thutu\/","title":{"rendered":"The Bad Ballet of Regular Expressions: Pathological Programming in Thutu"},"content":{"rendered":"<p> For today&#8217;s installation of programming insanity, I decided to go with a relative of <a href=\"http:\/\/scientopia.org\/blogs\/goodmath\/2006\/08\/friday-pathological-programming-programming-through-grammars-in-thue\">Thue<\/a>, which is one of my favorite languages insane languages that I wrote about before. Thue is a language based on a rewriting system specified by a semi-Thue grammar. Todays language is called <a href=\"http:\/\/www.esolangs.org\/wiki\/Thutu\">Thutu<\/a> (pronounced tutu); it&#8217;s a string rewriting system like Thue, only it&#8217;s based on regular expressions instead of grammars, and it&#8217;s even got regular expression-based control flow mechanisms, making it a sort of hybrid language.<\/p>\n<p> The scary thing about Thutu is that it&#8217;s not all that different from a language I&#8217;ve wanted to find some time to write myself &#8211; except that the one I want to write isn&#8217;t intended to be pathological. I&#8217;ve never stopped missing Teco for writing text processing programs; and since<br \/>\nmy TECO programs tended to be roughly of the form: &#8220;Find something matching this pattern, and then take this action&#8221;, a regular-expression based language would make a lot of sense.<\/p>\n<p> But anyway, today we&#8217;re looking at Thutu, which is a deliberately obscure version of this idea.<\/p>\n<p><!--more--><\/p>\n<p> Thutu syntax is&#8230; interesting. It uses whitespace\/indentation for program structure &#8211; except that a tab is always counted as 8 spaces. <em>Not<\/em> spaces up to the next 8th column, but 8 space. So space-tab is 9 spaces. And it&#8217;s deliberately ambiguous in how it uses regular expressions. But we&#8217;ll get to that in a moment. First let&#8217;s look at the regular expression syntax. It&#8217;s also perl-REs, so this should be familiar to most people, since Perl REs have become the de-facto standard for regular expression syntax. Regular expressions start with a &#8220;\/&#8221;, and continue to the next non-escaped &#8220;\/&#8221;. Most characters in a RE represent themselves, except for the regular expression&#8217;s syntax characters. To use a syntax character in a RE, you precede it with a backslash. The RE constructs are:<\/p>\n<ul>\n<li> <code><em>re<\/em>?<\/code>: match <em>re<\/em> 0 or 1 times (with 1 preferred).<\/li>\n<li> <code><em>re<\/em>??<\/code>: match <em>re<\/em> 0 or 1 times (with 0 preferred).<\/li>\n<li> <code><em>re<\/em>+<\/code>: match <em>re<\/em> 1 or more times, matching as many as possible while having the entire regular expression match successfully.<\/li>\n<li> <code><em>re<\/em>+?<\/code>: match <em>re<\/em> 1 or more times, matching as <em>few<\/em> as possible while having the entire regular expression match successfully.<\/li>\n<li> <code><em>re<\/em>*<\/code>: match <em>re<\/em> 0 or more times, matching as many as possible while having the entire regular expression match successfully.<\/li>\n<li> <code><em>re<\/em>*?<\/code>: match <em>re<\/em> 0 or more times, matching as <em>few<\/em> as possible while having the entire regular expression match successfully.<\/li>\n<li> <code>(<em>re<\/em>)<\/code>: grouping. The grouped regular expression can be referenced<br \/>\nby position later, and constructs like &#8220;?&#8221; and &#8220;+&#8221; will be applied to the entire content of<br \/>\nthe parens, rather than just the single preceeding character.<\/li>\n<li> <code><em>re<sub>1<\/sub><\/em>|<em>re<sub>2<\/sub><\/em><\/code>: match either <em>re<sub>1<\/sub><\/em> or <em>re<\/sub>2<\/sub><\/em>. <\/li>\n<li> <code><em>number<\/em><\/code>: match a copy of whatever was already matched by the <em>number<\/em>th paren group.<\/li>\n<li> <code>[<em>chars<\/em>]<\/code>: match any one of the characters inside the brackets.<\/li>\n<li> <code>[^<em>chars<\/em>]<\/code>: match any character which is <em>not<\/em> one of the characters inside the brackets.<\/li>\n<li> <code>.<\/code>: match any single character.<\/li>\n<li> <code>^<\/code>: match the beginning of the string.<\/li>\n<li> <code>$<\/code>: match the end of the string.<\/li>\n<\/ul>\n<p> There are two types of statements in Thutu: replacement statements, and control statements. Both replacement and control statements start with a sequence of regular expressions. If the last regular expression is followed by a <em>command<\/em> character, then it&#8217;s a control statement; otherwise, it&#8217;s a replacement statement, and the <em>last<\/em> regular expression actually specifies a replacement. When a statement executes, all of the regular expressions are matched in sequence against the input string. If all match, then the replacement or command is executed. If it&#8217;s a replacement, the replacement expression is used to replace the string matched by the <em>last<\/em> regular expression preceeding it.<\/p>\n<p> A command statement starts a block of code. The block of code consists of the sequence of lines following the command statement that are indented <em>more<\/em> than the command statement. The command statement that starts a block is called the <em>block marker line<\/em> for the block, and is <em>not<\/em> considered part of the block: the block is the statements indented <em>more<\/em> than the block marker line. (Confused yet? Good! me too!)<\/p>\n<p> Fortunately, there aren&#8217;t many commands: just &#8220;*&#8221;, &#8220;@&#8221;, &#8220;!&#8221;, &#8220;^&#8221;, &#8220;&lt;&#8220;, &#8220;&gt;&#8221;,  and &#8220;.&#8221;; and of those, &#8220;&lt;&#8220;, &#8220;&gt;&#8221;, and &#8220;.&#8221; don&#8217;t actually start blocks.<\/p>\n<ul>\n<li> &#8220;*&#8221;: loop including block marker: if the regular expressions preceeding the &#8220;*&#8221; all match, then start trying to match the statements inside the block. If any replacement is successful, or if a line with the &#8220;&lt;&#8221; as a command reached, go back to the block marker line and do it again. The loop exits if either none of the body statements match, or if after an iteration, the regular expressions in the block marker line itself don&#8217;t match.<\/li>\n<li> &#8220;@&#8221;: same as &#8220;*&#8221;, except that successful replacements and &#8220;&lt;&#8221; go back to the first line of the block, <em>not<\/em> to the block control line. So &#8220;@&#8221; only matches the block control line&#8217;s expressions once.<\/li>\n<li> &#8220;!&#8221;: same as &#8220;*&#8221;, except that the loop body is executed only if <em>none<\/em> of the<br \/>\nregular expressions preceeding the &#8220;!&#8221; match. Like &#8220;*&#8221;, after a match, the block marker<br \/>\nline is re-matched.<\/li>\n<li> &#8220;^&#8221;: same as &#8220;@&#8221;, except that the loop body is executed only if <em>none<\/em> of the regular expressions preceeding the &#8220;^&#8221; match; like &#8220;@&#8221;, the block marker line&#8217;s regular expressions<br \/>\nare not re-matched.<\/li>\n<li> &#8220;&lt;&#8220;: reloop &#8211; go back to either the block marker line, or the block&#8217;s first line, depending on which type of block it is.<\/li>\n<li> &#8220;&gt;&#8221;: unloop &#8211; go to the first statement after the end of the block.<\/li>\n<li> &#8220;.&#8221;: noop &#8211; do nothing. <\/li>\n<\/ul>\n<p> The basic flow and input-output cycle of Thutu is decidedly weird.  A Thutu program is treated as if the entire program was enclosed in an &#8220;@&#8221; loop. The program is always started with an initial input string of &#8220;=1&#8221;. After an iteration is complete, it does the following:<\/p>\n<ul>\n<li> If the string contains &#8220;=x&#8221;, then everything up to the &#8220;=x&#8221; is removed, and the character before &#8220;=x&#8221; are printed to the standard output.<\/li>\n<li> If the string contains &#8220;=9&#8221;, then then the program exits.<\/li>\n<li> If the string did <em>not<\/em> contain &#8220;=9&#8221;, then one line of input is read, escaped appropriately, and the resulting string concatenated with &#8220;=x&#8221; is inserted onto the front of the<br \/>\nstring.<\/li>\n<\/ul>\n<p> And then, if no &#8220;=9&#8221; was found, it runs another iteration.<\/p>\n<p> There&#8217;s not much in the way of example programs in Thutu, which is kind of a shame. But there are a couple of real beauties. As usual, we&#8217;ll start with a hello world, which is kind of boring, but it gets exciting pretty quickly after that.<\/p>\n<pre>\n\/^=1$\/Hello, world!=n=x=9\/\n<\/pre>\n<p> So, we&#8217;ve got one handy little regular expression here. It looks for &#8220;=1&#8221;, the normal initial string, and replaces it with &#8220;Hello world&#8221;, followed by a newline, followed by &#8220;=x&#8221; and &#8220;=9&#8221;. So, the main program loop finishes its iteration, and looks for &#8220;=x&#8221;; prunes it out, and outputs whatever preceeded it &#8211; the &#8220;Hello world&#8221; string with the newline. Then it checks for &#8220;=9&#8221;, finds it, and terminates. Kaplooie, end of program. Simple!<\/p>\n<p><pre>\n\/=t\/\/\n\/ \/\/\n\/=9.\/=9\/\n\/^(.*)=x\/&lt;$1&gt;\/\n\/&lt;0=+0&gt;\/\/\n\/&gt;*\/%&gt;\/\n\/=+([*%]?)&gt;\/=+0$1&gt;\/\n\/&lt;=+([0-9])\/\/1&gt;\/\n\/1%&gt;\/2&gt;\/\n\/2%&gt;\/3&gt;\/\n\/3%&gt;\/4&gt;\/\n\/4%&gt;\/5&gt;\/\n\/5%&gt;\/6&gt;\/\n\/6%&gt;\/7&gt;\/\n\/7%&gt;\/8&gt;\/\n\/8%&gt;\/9&gt;\/\n\/9%&gt;\/*0&gt;\/\n\/%(.)\/$1%\/\n\/9=+\/8=+%\/\n\/8=+\/7=+%\/\n\/7=+\/6=+%\/\n\/6=+\/5=+%\/\n\/5=+\/4=+%\/\n\/4=+\/3=+%\/\n\/3=+\/2=+%\/\n\/2=+\/1=+%\/\n\/1=+\/0=+%\/\n\/0=+([^%&gt;*]*)(*?[0-9])&gt;\/=+$1&gt;$2\/\n\/=x\/=1\/=9\/!\n\/$\/=n=x\/\n*\n\/=1\/\/\n.\n<\/pre>\n<p> This beast is an adder. First, it strips out whitespace and other forms of noise. Then<br \/>\nit does the same sort of add algorithm that we saw in Thue; only in Thue, it was simple binary addition; Thutu is more ambitious and goes for decimal. But the basic principle is the same &#8211; use a marker character to keep track of where you are, and do it one step at a time, with hard-coded rules like &#8220;<code>\/7=+\/6=+%\/<\/code>&#8221; for subtracting one from the left operand, and adding one &#8220;%&#8221; to the right.. So &#8220;3+2&#8221; turns into &#8220;2+%2&#8221;.  then &#8220;1+%%2&#8221;, then &#8220;0+%%%2&#8221;.<\/p>\n<p> Then one at a time, it flips the &#8220;%&#8221;s to the other side of the right operand, using &#8220;<code>\/%(.)\/$1%\/<\/code>&#8220;; and finally gets rid of the &#8220;%&#8221;s using increment rules like &#8220;<code>\/3%&gt;\/4&gt;\/<\/code>&#8220;. So for our example, it would proceed through &#8220;%%%2&#8221;, &#8220;%%2%&#8221;, &#8220;%2%%&#8221;,<br \/>\n&#8220;2%%%%&#8221;, &#8220;3%%&#8221;, &#8220;4%&#8221;, &#8220;5&#8221;.<\/p>\n<p> Pretty nifty, huh?<\/p>\n<p> And now, it&#8217;s nightmare time. The proof that it&#8217;s Turing complete: an <a href=\"http:\/\/scientopia.org\/blogs\/goodmath\/2006\/12\/pathological-stack-hell-underload\">Underload<\/a> interpreter. Fortunately, it&#8217;s pretty nicely commented.<\/p>\n<pre>\n\/=x\/\/\n# We don't want the line above running until the next iteration,\n# so guard it with a loop. The same loop also checks for =9 and\n# prevents anything further happening in that case.\n\/=9\/!\n# -- is used to mark the current location within the program.\n\/--\/=1\/!\n\/^\/--\/\n# =x is needed to mark the end of the output. Also output a newline!\n\/=x\/=1\/!\n\/^\/=n=x\/\n# If there isn't a stack, create one. %% is used as a stack marker.\n\/%%\/=1\/!\n\/$\/%%\/\n# Turn temporary brackets {{ }} back to =( =)\n\/{{\/=(\/\n\/}}\/=)\/\n# The next block can leave the temporary bracket reversion until later.\n\/=1\/!\n# '()': push onto stack.\n\/--=(\/@\n# Change inner brackets at the start of the program to {{ }} temporarily.\n\/--=(([^)]*?)=(([^)]*?)=)\/--=($1{{$2}}\/\n# Move the set from the program to the stack.\n\/--(=([^)]*?))(.*?)%%\/--$2%%$1\/\n# ':': duplicate the TOS.\n\/--=:\/@\n# Change inner brackets at the start of the stack to {{ }} temporarily.\n\/%%=(([^)]*?)=(([^)]*?)=)\/%%=($1{{$2}}\/\n# Simultaneously remove the : and duplicate the TOS.\n\/--=:(.*?)%%(=([^)]*?))\/--$1%%$2$2\/\n# 'a': enclose the TOS in brackets.\n\/--a\/@\n# Change inner brackets at the start of the stack to {{ }} temporarily.\n\/%%=(([^)]*?)=(([^)]*?)=)\/%%=($1{{$2}}\/\n# Simultaneously remove the a and enclose the TOS.\n\/--a(.*?)%%(=([^)]*?))\/--$1%%=($2=)\/\n# 'S': output the dequoted TOS, popping it.\n\/--S\/@\n# Change inner brackets at the start of the stack to {{ }} temporarily.\n\/%%=(([^)]*?)=(([^)]*?)=)\/%%=($1{{$2}}\/\n# Move the TOS to the end of the output area, and remove the S.\n\/=n=x(.*?)--S(.*?)%%=(([^)]*?)=)\/$3=n=x$1--$2%%\/\n# '!': pop the TOS.\n\/--=!\/@\n# Change inner brackets at the start of the stack to {{ }} temporarily.\n\/%%=(([^)]*?)=(([^)]*?)=)\/%%=($1{{$2}}\/\n# Simultaneously remove the ! and the TOS.\n\/--=!(.*?)%%=([^)]*?)\/--$1%%\/\n# '~': swap the top two stack elements.\n\/--=~\/@\n# Change inner brackets at the start of the stack to {{ }} temporarily.\n\/%%=(([^)]*?)=(([^)]*?)=)\/%%=($1{{$2}}\/\n# Change inner brackets in the second stack element to {{ }} temporarily.\n\/%%=(([^)]*?)=)=(([^)]*?)=(([^)]*?)=)\/%%=($1=)=($2{{$3}}\/\n# Simultaneously remove the ~ and swap TOS and SE2.\n\/--=~(.*?)%%(=([^)]*?))(=([^)]*?))\/--$1%%$3$2\/\n# '*': concatenate the TOS after the second stack element.\n\/--=*\/@\n# Change inner brackets at the start of the stack to {{ }} temporarily.\n\/%%=(([^)]*?)=(([^)]*?)=)\/%%=($1{{$2}}\/\n# Change inner brackets in the second stack element to {{ }} temporarily.\n\/%%=(([^)]*?)=)=(([^)]*?)=(([^)]*?)=)\/%%=($1=)=($2{{$3}}\/\n# Simultaneously remove the ~ and concatenate SE2 and TOS.\n\/--=*(.*?)%%=(([^)]*?)=)=(([^)]*?)=)\/--$1%%=($3$2=)\/\n# Loop back if there are any non-^ commands left to run.\n\/--[aS]\/&lt;\n\/--=[(:!~*]\/&lt;\n# '^': move TOS to the program, dequoting it. We then need {{ }} reversion.\n\/--=^\/@\n# Change inner brackets at the start of the stack to {{ }} temporarily.\n\/%%=(([^)]*?)=(([^)]*?)=)\/%%=($1{{$2}}\/\n# Simultaneously remove the ^ and move the TOS to the start of the program.\n\/--=^(.*?)%%=(([^)]*?)=)\/--$2$1%%\/\n# Loop back if there are any commands left to run.\n\/--[aS]\/&lt;\n\/--=[(:!~^*]\/&lt;\n# Loop back if dequoting's needed.\n\/{{\/&lt;\n\/}}\/&lt;\n# Clear the stack and IP in preparation for the next program to run.\n\/=x.\/*\n\/=x.\/=x\/\n# Remove =1 for the first iteration. Don't loop back afterwards!\n\/=1\/*\n\/=1\/\/\n.\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>For today&#8217;s installation of programming insanity, I decided to go with a relative of Thue, which is one of my favorite languages insane languages that I wrote about before. Thue is a language based on a rewriting system specified by a semi-Thue grammar. Todays language is called Thutu (pronounced tutu); it&#8217;s a string rewriting system [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[92],"tags":[],"class_list":["post-337","post","type-post","status-publish","format-standard","hentry","category-pathological-programming"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4lzZS-5r","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/337","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/comments?post=337"}],"version-history":[{"count":0,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/337\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/media?parent=337"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/categories?post=337"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/tags?post=337"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}