The Problem with String Stored Regex

While regular expressions have limited usefulness, especially in larger programs, they're still very handy to have from time to time. It's usually difficult to write a lexer or tokenizer without one. Because of this several languages build them right into the language itself, rather than tacked on as a library. It allows the regular expressions to be stored literally in the code, treated as its own type, rather than inside a string. The problem with storing a regular expression inside a string is that it can easily make an already complex regular expression much more complex. This is because there are two levels of parsing going on.

Consider this regular expression where we match an alphanumeric word inside of quotes. I'm going to use slashes to delimit the regular expression itself.

/"\w+"/

Notice there is no escaping going on. The backslash is there is indicate a special sequence \w, which is equal to [a-zA-Z0-9_]. This will get parsed and compiled into some form in memory before it is run by a program. If the language doesn't directly support regular expressions then we usually can't put it in the code as is, since the language parser won't know how to deal with it. The solution is to store it inside of a string.

However, our regular expression contains quotes and these will need to be escaped when in a quote delimited string. But I no longer need slashes to delimit my regular expression.

"\"\w+\""

Did you notice the error yet? If not, stop and think about it for a minute. Our special sequence \w will not make it intact to the regular expression compiler. That backslash will escape the w during the string parsing step, leaving only the w. The string we typed will get parsed into a series of characters in memory, performing escapes along the way, and then that sequence will be handed to the regular expression compiler. So we have to fix it,

"\"\\w+\""

That's getting hard to understand, compared to the original. Now let's throw a curve-ball into this: let's match a backslash at the beginning of the word. The normal regular expression looks like this now,

/"\\\w+"/

We have to escape our backslash to make it a literal backslash, so it takes two of them. Now, when we want to do this in a string-stored regular expression we have to escape both of those backslashes again. It looks like this,

"\"\\\\\\w+\""

Now to match a single backslash we have to insert four backslashes! Quite unfortunately, Emacs Lisp doesn't directly support regular expressions even though the language has a lot of emphasis on text parsing, so a lot of Elisp code is riddled with this sort of thing. Elisp is especially difficult because sometimes, such as during prompts, you can enter a regular expression directly and can ignore the layer of string parsing. It's a very conscious effort to remember which situation you're in at different times.

Perl, Ruby, and JavaScript have regular expressions as part of the language and it makes a lot of sense for these languages; they tend to do a lot of text parsing. Python does it partially, with its r' syntax. Any string preceded with an r loses its escape rules, but it also means you can't match both single or double quotes without falling back to a normal string with escaping. Common Lisp may be able to do it with a reader macro, but I've never seen it done.

Remember those two levels of parsing when writing string stored regex. It helps avoid hair-pulling annoying mistakes.

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht [mailing list etiquette] , or see existing discussions.

null program

Chris Wellons

wellons@nullprogram.com (PGP)
~skeeto/public-inbox@lists.sr.ht (view)