Author: Patrik Nyblom <pan(at)erlang(dot)org>
Status: Accepted/R12B-3 Proposal is implemented in OTP release R12B-3,
 except for Unicode support according to EEP 10
Type: Standards Track
Created: 04-Jun-2008
Erlang-Version: R12B-5
Post-History: 01-Jan-1970

EEP 11: Built-in regular expressions in Erlang

Abstract

This EEP suggests how to integrate an external regular expression library into the Erlang virtual machine.

Motivation

Regular expressions are widely used. Regardless of how other features of a language can be used to match or pick out parts of a string, many programmers prefer the regular expression syntax and expect regular expressions to be available in a modern language.

The Perl programming language has integrated regular expressions directly into the syntax and Perl programmers are often highly skilled at writing complicated regular expressions that parse e.g. text files, HTTP requests or simple user input. The Perl extensions to the common regular expressions are widely known and many modern programming languages support something similar.

Erlang currently has a minimalistic regular expression module (regexp module in STDLIB), which lacks features commonly available in other implementations. The current library is also painfully slow compared to the native C libraries utilized in other languages.

Erlang needs to interface with a modern regular expression library in a way that does not break the properties of the virtual machine.

Rationale

Preconditions

Writing a more efficient regular expression library purely in Erlang has been attempted, but so far no really efficient implementation has been proposed, and the work involved in creating one is deemed extensive.

On the other hand, several more or less successful attempts to integrate an external regular expression library into the virtual machine have been presented. None of them, however, have addressed the issue of long-running regular expressions stalling the schedulers.

A built-in function in the Erlang VM needs to stop execution after a certain number of iterations, to avoid stalling a scheduler and thereby starving other processes in the system. When the Erlang process gets scheduled again, the built-in function restarts; it must therefore have some way of storing its current state so that execution can continue where it left off.

The execution of a regular expression match is in many ways similar to the virtual machine's execution of ordinary beam code, but the available libraries are (for obvious reasons) not prepared to give up execution temporarily to allow other processes to execute. A complicated regular expression on large amounts of subject data can take seconds or even minutes to execute. Stalling one of the schedulers in the VM for that amount of time is not an option in a real parallel system. As suggested interfaces to external libraries have never addressed this problem, none have been accepted or integrated into the main Erlang distribution.

Stack usage is another issue seldom addressed. The Erlang virtual machine may run a lot of scheduler threads, especially on processors with large numbers of cores. Multithreaded applications need to be careful about stack usage, which is why recursive C routines are best avoided. The Erlang virtual machine avoids recursion in C code, and a linked-in library should do the same. When it comes to realtime operating systems, the need to avoid recursion in the C code is even more obvious. The library used for Erlang regular expressions simply cannot be recursive on the C stack, at least not in a way where stack usage cannot be determined at compile time.

Multithreading versus interruptable execution

The problem of interrupting the execution of a regular expression (or any other lengthy operation) when another Erlang process should be scheduled has two obvious solutions:

  1. Count the number of iterations in a regular expression match, store the state after a certain number of iterations (or a certain amount of time) and return control to the scheduler when the execution time slot is exceeded.

  2. Let the operating system take care of the problem by executing the regular expression matches in separate kernel threads.

In the virtual machine's file driver, the second approach is used, introducing the concept of the asynchronous thread pool. The file I/O case is however special, as the I/O operation itself usually consumes far more time than the inter-thread communication and task switching involved when using asynchronous threads. Besides, there simply is no other solution for I/O, so OS threads are the only option in that case.

If regular expressions were to be executed in separate threads, even very small and simple expressions would have to carry the extra burden of OS level task switching and communication.

Other lengthy operations in the virtual machine use the first approach of voluntary interruption and rescheduling. In the cases where external libraries are involved, like IP communication, the emulator provides ways to passively wait for events by supplying interfaces to I/O multiplexing (select/poll). This is the way to avoid blocking the schedulers in most drivers. Asynchronous threads are only utilized where there simply are no other options, like in file I/O (which cannot utilize I/O multiplexing).

Using the first solution when interfacing with an external library in a driver or BIF involves either finding a library where interruption and restart of execution is possible, or modifying an existing library to support this.

Even though modifying a library will make upgrading and patching it much harder, the benefits are significant. When executing e.g. regular expressions, the same thread that executes the beam code is utilized, which keeps setup time and overhead in general to a minimum. Of course, the execution time of the regexp itself will be slightly longer, as the implementation needs to keep track of the number of executed iterations and must be prepared to store the current state for a later restart. The much smaller setup time is however expected to dominate for smaller regular expressions (or rather expressions that involve a small number of loops). One also has to bear in mind that this solution imposes much less load on the operating system scheduler, which is a good thing for large and/or embedded systems.

For operating systems where no kernel threads are available, the first solution is the only acceptable one. Separate threads for pure user space code execution will do more harm than good to the realtime properties of the Erlang system.

Selecting a suitable library to integrate

The library to integrate into the virtual machine should in an ideal situation fulfill the following wishes:

No available regular expression library currently provides a perfect match. The best available is the PCRE library, which has compile-time options for avoiding use of the C stack, supports Perl (and Python) compatible regular expressions, and is written in a well-structured way, making it suitable for integration, porting and implementing the extensions needed in the Erlang case.

Other alternatives include rxlib (no longer maintained), the Tcl/Tk regular expression implementation, GNU regex, Jakarta and Oniguruma, among others. Of those, the Tcl/Tk implementation seems the most promising, especially as it is in many situations much faster than other implementations. The algorithms and code are, however, quite incomprehensible, and the regular expression flavor is not the most widespread.

After having had a good look at the alternatives, I came to the conclusion that PCRE was the best choice for the following reasons:

Although the subjective reasoning about code readability might seem somewhat out of place, the PCRE code base makes updates to the library easier to integrate, as relatively few and comprehensible alterations need to be made to the library to make it fit into the virtual machine. Being able to maintain the library is important, and being able to understand the code is crucial.

The most appealing feature of the library is, however, the extensive support for Perl-compatible regular expressions. PCRE is certainly one of the most powerful libraries around, and Erlang programmers used to Perl's regular expressions will feel at home.

Programmers interface

In Perl, regular expressions are integrated into the language itself. This could of course be done in Erlang too. However, Erlang already has syntax for matching structured data as well as binaries. Introducing new primitives for string matching with regular expressions seems out of place. Erlang is also not a language designed for processing textual data in the way Perl is, but a language that can handle complicated structured data. The bit syntax might one day benefit from regular expression extensions, but that is beyond the scope of this EEP.

A regular expression module interfacing with the library through built-in functions is the usual way to do it in Erlang, and that is what this EEP suggests. As the module name regexp is already taken, the abbreviation "re" seems to be a good choice for the module name.

As a base implementation, I suggest a module with two basic functions: one for precompiling a regular expression into "bytecode" for the regular expression matching engine, and one for actually running the match. The function that runs the matching should take either a compiled regular expression or the source of a regular expression as input (together with the subject string and the options for execution).

Around these two suggested functions one can implement functionality in Erlang to mimic the existing regular expression library or implement new functionality.
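
As an illustration only (not part of the suggested interface), a wrapper returning 1-based {match, Start, Length} tuples in the style of the old regexp module could be written in plain Erlang on top of re:run/3. The name first_match/2 is hypothetical, and unlike regexp:match/2 it reports the first match rather than the longest one:

first_match(Subject, RE) ->
    case re:run(Subject, RE, [{capture, first, index}]) of
        {match, [{Start, Len}]} ->
            %% re:run/3 reports 0-based offsets; the old regexp module is 1-based
            {match, Start + 1, Len};
        nomatch ->
            nomatch
    end.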

The current regexp module can, apart from matching, split a string according to a regular expression (functionality similar to the Perl built-in function split) and do substitution of substrings based on regular expression matching (like the s/// expression in Perl or awk). With corresponding functions in the "re" module, the new module would provide all functionality of the old one.

The names of the functions should, as much as possible, be chosen so that mix-ups with the current regexp library functions are avoided, which is why I suggest "compile", "run" and "replace" as names for regexp compilation, execution and substitution respectively. As no good synonym for the name "split" has emerged, that name is retained in the new module.

Here follows part of the suggested manual page:

Excerpt from a suggested manual page

DATA TYPES

iodata() = iolist() | binary()
iolist() = [char() | binary() | iolist()]
           % a binary is allowed as the tail of the list

mp() = Opaque datatype containing a compiled regular expression.

EXPORTS

compile(Regexp) -> {ok, MP} | {error, ErrSpec}

Types:

Regexp = iodata()

The same as compile(Regexp,[]).

compile(Regexp,Options) -> {ok, MP} | {error, ErrSpec}

Types:

Regexp = iodata()
Options = [ Option ]
Option = anchored | caseless | dollar_endonly | dotall | extended |
         firstline | multiline | no_auto_capture | dupnames |
         ungreedy | {newline, NLSpec}
NLSpec = cr | crlf | lf | anycrlf
MP = mp()
ErrSpec = {ErrString, Position}
ErrString = string()
Position = int()

This function compiles a regular expression with the syntax described below into an internal format to be used later as a parameter to the run/2,3 functions.

Compiling the regular expression before matching is useful if the same expression is to be used in matching against multiple subjects during the program's lifetime. Compiling once and executing many times is far more efficient than compiling each time one wants to match.
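
For example (a hedged sketch based on the functions specified here), an expression used against many subjects could be compiled once up front:

{ok, MP} = re:compile("[0-9]+"),
[re:run(S, MP, [{capture, first, list}]) || S <- ["abc123", "no digits", "42"]].

gives:

[{match,["123"]},nomatch,{match,["42"]}]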

The options have the following meanings:

run(Subject,RE) -> {match, Captured} | nomatch | {error, ErrSpec}

Types:

Subject = iodata()
RE = mp() | iodata()
Captured = [ CaptureData ]
CaptureData = {int(),int()} | string() | binary()
ErrSpec = {ErrString, Position}
ErrString = string()
Position = int()

The same as run(Subject,RE,[]).

run(Subject,RE,Options) -> {match, Captured} | match | nomatch | {error, ErrSpec}

Types:

Subject = iodata()
RE = mp() | iodata()
Options = [ Option ]
Option = anchored | global | notbol | noteol | notempty | {offset, int()} |
         {newline, NLSpec} | {capture, ValueSpec} |
         {capture, ValueSpec, Type} | CompileOpt
Type = index | list | binary
ValueSpec = all | all_but_first | first | ValueList
ValueList = [ ValueID ]
ValueID = int() | string() | atom()
CompileOpt = see compile/2 above
NLSpec = cr | crlf | lf | anycrlf
Captured = [ CaptureData ] | [ [ CaptureData ] ... ]
CaptureData = {int(),int()} | string() | binary()
ErrSpec = {ErrString, Position}
ErrString = string()
Position = int()

Executes a regular expression match, returning match, {match, Captured} or nomatch. The regular expression can be given either as iodata(), in which case it is automatically compiled (as by re:compile/2) and executed, or as a precompiled mp(), in which case it is executed against the subject directly.

When compilation is involved, the function may return compilation errors as when compiling separately ({error, {string(),int()}}); when only matching, no errors are returned.

If the regular expression was previously compiled, the option list can only contain the options anchored, global, notbol, noteol, notempty, {offset, int()}, {newline, NLSpec} and {capture, ValueSpec} / {capture, ValueSpec, Type}. Otherwise all options valid for the re:compile/2 function are allowed as well. Options allowed both for compilation and execution of a match, namely anchored and {newline, NLSpec}, will affect both the compilation and the execution if present together with a non-precompiled regular expression.

The {capture, ValueSpec} / {capture, ValueSpec, Type} option defines what to return from the function upon successful matching. The capture tuple may contain both a value specification, telling which of the captured substrings are to be returned, and a type specification, telling how the captured substrings are to be returned (as index tuples, lists or binaries). The capture option makes the function quite flexible and powerful. The different options are described in detail below.

If the capture options specify that no substring capturing at all is to be done ({capture, none}), the function returns the single atom match upon successful matching; otherwise the tuple {match, ValueList} is returned. Capturing can be disabled by specifying either none or an empty list as ValueSpec.
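
To illustrate (hedged shell examples based on the specification above; the exact output formatting is of course up to the implementation):

re:run("abcdef","bc(d)",[{capture,all,list}]).

gives:

{match,["bcd","d"]}

while:

re:run("abcdef","bc(d)",[{capture,none}]).

gives:

match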

A description of all the options relevant for execution follows:

The options solely affecting the compilation step are described in the re:compile/2 function.

replace(Subject, RE, Replacement) -> iodata() | {error, ErrSpec}

Types:

Subject = iodata()
RE = mp() | iodata()
Replacement = iodata()
ErrSpec = {ErrString, Position}
ErrString = string()
Position = int()

The same as replace(Subject, RE, Replacement,[]).

replace(Subject, RE, Replacement, Options) -> iodata() | binary() | list() | {error, ErrSpec}

Types:

Subject = iodata()
RE = mp() | iodata()
Replacement = iodata()
Options = [ Option ]
Option = anchored | global | notbol | noteol | notempty |
         {offset, int()} | {newline, NLSpec} |
         {return, ReturnType} | CompileOpt
ReturnType = iodata | list | binary
CompileOpt = see compile/2 above
NLSpec = cr | crlf | lf | anycrlf
ErrSpec = {ErrString, Position}
ErrString = string()
Position = int()

Replaces the matched part of the Subject string with the content of Replacement.

Options are given as to the re:run/3 function, except that the capture option of re:run/3 is not allowed. Instead, a {return, ReturnType} option is present. The default return type is iodata, constructed in a way that minimizes copying. The iodata result can be used directly in many I/O operations. If a flat list() is desired, specify {return, list}; if a binary is preferred, specify {return, binary}.
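
For instance (a hedged example of the return option, based on the specification above):

re:replace("abcd","c","X",[{return,binary}]).

gives:

<<"abXd">>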

The replacement string can contain the special character &, which inserts the whole matching expression in the result, and the special sequence \N (where N is an integer > 0), which causes subexpression number N to be inserted in the result. If no subexpression with that number is generated by the regular expression, nothing is inserted.

To insert an & or \ in the result, precede it with a \. Note that Erlang already gives a special meaning to \ in literal strings, which is why a single \ has to be written as "\\" and therefore a double \ as "\\\\". Example:

re:replace("abcd","c","[&]",[{return,list}]).

gives:

"ab[c]d"

while:

re:replace("abcd","c","[\\&]",[{return,list}]).

gives:

"ab[&]d"

The {error, ErrSpec} return value can only arise from compilation, i.e. when a malformed, non-precompiled RE is given.
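
A further hedged example, showing the \N back references described above (\1 refers to the first subexpression):

re:replace("abcd","(b)c","\\1X",[{return,list}]).

gives:

"abXd"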

split(Subject,RE) -> SplitList | {error, ErrSpec}

Types:

Subject = iodata()
RE = mp() | iodata()
SplitList = [ iodata() ]
ErrSpec = {ErrString, Position}
ErrString = string()
Position = int()

The same as split(Subject, RE, []).

split(Subject,RE,Options) -> SplitList | {error, ErrSpec}

Types:

Subject = iodata()
RE = mp() | iodata()
Options = [ Option ]
Option = anchored | global | notbol | noteol | notempty |
         {offset, int()} | {newline, NLSpec} | {return, ReturnType} |
         {parts, NumParts} | group | CompileOpt
NumParts = int() | infinity
ReturnType = iodata | list | binary
CompileOpt = see compile/2 above
NLSpec = cr | crlf | lf | anycrlf
SplitList = [ RetData ] | [ GroupedRetData ]
GroupedRetData = [ RetData ]
RetData = iodata() | binary() | list()
ErrSpec = {ErrString, Position}
ErrString = string()
Position = int()

This function splits the input into parts by finding tokens according to the regular expression supplied.

The splitting is done basically by running a global regexp match and dividing the initial string wherever a match occurs. The matching part of the string is removed from the output.

The result is given as a list of "strings", of the datatype given in the return option (default iodata).

If subexpressions are given in the regular expression, the matching subexpressions are returned in the resulting list as well. An example:

re:split("Erlang","[ln]",[{return,list}]).

will yield the result:

["Er","a","g"]

while:

re:split("Erlang","([ln])",[{return,list}]).

will yield:

["Er","l","a","n","g"]

The text matching the subexpression (marked by the parentheses in the regexp) is inserted in the result list where it was found. In effect this means that concatenating the result of a split where the whole regexp is a single subexpression (as in the example above) will always result in the original string.

As there is no matching subexpression for the last part in the example (the "g"), there is nothing inserted after that. To make the group of strings and the parts matching the subexpressions more obvious, one might use the group option, which groups together the part of the subject string with the parts matching the subexpressions when the string was split:

re:split("Erlang","([ln])",[{return,list},group]).

gives:

[["Er","l"],["a","n"],["g"]]

Here the regular expression first matched the "l", causing "Er" to be the first part in the result. When the regular expression matched, the (only) subexpression was bound to the "l", which is why the "l" is inserted in the group together with "Er". The next match is of the "n", making "a" the next part to be returned. As the subexpression is bound to the substring "n" in this case, the "n" is inserted into this group. The last group consists of the rest of the string, as no more matches are found.

All empty strings are by default removed from the end of the result list, the semantics being that we split the string into as many parts as possible until we reach the end of the string. In effect this means that all empty strings at the end are stripped from the result list (or all empty groups at the end, if the group option is given). The parts option can be used to change this behaviour. Let's look at an example:

re:split("Erlang","[lg]",[{return,list}]).

The result will be:

["Er","an"]

as the match of the "g" at the end effectively makes the matching reach the end of the string. If we, however, say that we want more parts:

re:split("Erlang","[lg]",[{return,list},{parts,3}]).

We will get the last part as well, even though there is only an empty string after the last match (matching the "g"):

["Er","an",[]]

More than three parts are not possible with this input, which is why:

re:split("Erlang","[lg]",[{return,list},{parts,4}]).

will give the same result. To specify that as many results as possible are to be returned, including any empty results at the end, you can specify infinity as the number of parts to return. Specifying 0 as the number of parts gives the default behaviour of returning all parts except empty parts at the end.
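
For completeness, a hedged example of the infinity variant, continuing the example above:

re:split("Erlang","[lg]",[{return,list},{parts,infinity}]).

gives:

["Er","an",[]]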

If subexpressions are captured, empty subexpression matches at the end are also stripped from the result if {parts, N} is not specified. If you are familiar with Perl, the default behaviour corresponds exactly to the Perl default; {parts, N}, where N is a positive integer, corresponds exactly to the Perl behaviour with a positive numerical third parameter; and {parts, infinity} corresponds to the behaviour when the Perl routine is given a negative integer as the third parameter.

Summary of options not previously described for the re:run/3 function:

Supported string representations

As can be seen in the manual excerpt, I suggest allowing both the regular expressions and the subject strings to be provided as iodata(), which means either binaries, lists, or a mix of binaries and deep lists. When Unicode is not involved, this basically means an implicit iolist_to_binary() when supplying data to the re module.
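
As an illustration (a hedged example assuming the interface above), the subject may mix binaries and (deep) lists:

re:run([<<"ab">>, "cd", [<<"ef">>]], "cde").

gives:

{match,[{2,3}]}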

Further extensions

The following extensions are not yet implemented in the prototype, but should be included in a final release:

Of these, Unicode support is by far the most important, and also the one that cannot be implemented efficiently purely in Erlang code.

Prototype implementation

A prototype implementation using the PCRE library is present along with a reference manual page in the R12B-4 distribution. This implementation does not yet fully support Unicode, as EEP 10 is not accepted at the time of writing. The prototype implementation also lacks the "split" function, which was implemented after the R12B-4 release.

In terms of performance, fairly simple regular expression matches are, with this prototype, up to 75 times faster than with the current regexp module. The bookkeeping that allows for interruption of the regular expression execution costs between 1 and 2% in performance when no rescheduling is needed. In the worst cases a 5% performance loss can be noted compared to an untouched library, but then actual restarting is involved, so the numbers are not fully comparable.

Compiling PCRE to use the C stack for recursive calls and to avoid restarting is expected to give the best results in terms of execution speed. The difference in benchmarks compared to the fully interruptible version is, however, only in the range of 1 to 3% when no restarting occurs, and still no more than 6% when restarting actually occurs.

The conclusion is that the extra cost imposed on the PCRE library to allow integration into the Erlang emulator without using asynchronous threads is, in an absolute worst-case scenario, no more than 6% compared to the theoretical maximum.

Copyright

This document has been placed in the public domain.