Author:
Raimo Niskanen <raimo(at)erlang(dot)org>
Status:
Final/27.0 Implemented in OTP release 27; the regular expression sigils (`~r` and `~R`) are not implemented
Type:
Standards Track
Created:
25-Sep-2023
Erlang-Version:
OTP-27.0

EEP 66: Sigils for String Literals #

Abstract #

This EEP proposes Sigils for string literals very much like Elixir Sigils. The chief reason is to facilitate other suggested language features, many of which exists in Elixir under the umbrella of Sigils, such as:

  • Binary Strings: unicode:unicode_binary()
  • Regular Expression Syntax
  • Choice of string delimiters
  • Verbatim Strings
  • String Interpolation Syntax, or Variable Interpolation

Rationale #

Many existing suggestions about features in the Abstract use a prefix before a normal erlang string such as:

u"For UTF-8 encoded binary strings"

or

bf"For UTF-8 encoded binary with interpolation formatting: ~foo()~"

This EEP suggests using the same or very similar syntax as in Elixir for Sigils on literal strings to avoid syntactical problems with simple prefixes, and to not make these sibling languages deviate too much without good reason:

~"For UTF-8 encoded binary strings"

Design Decisions #

In the following text double angle quotation marks are used to mark source code characters to improve clarity. For example: the dot character (full stop): «.».

Erlang Language Structure (Tokenizer and Parser) #

The Erlang programming language is built according to a traditional tokenizer+parser+compiler model.

The tokenizer a.k.a. scanner a.k.a. lexer scans the source code character sequence and converts it into a sequence of Tokens, like atom, variable, string, integer, reserved word, punctuation character or operator: atom, Variable, "string", 123, case, : and ++.

The parser takes a sequence of tokens and builds a parse tree, AST (Abstract Syntax Tree), according to the Erlang grammar. This AST is then what the compiler compiles into executable (virtual machine) code.

The Tokenizer #

The tokenizer is simple. It stems from the tool lex that try a set of regular expressions on the input and when one matches it becomes a token and is removed from the input. Rinse and repeat.

The tokenizer is no longer that simple, but it doesn’t keep much state and looks just a few fixed number of characters ahead in the input.

For example; from the start state, if the tokenizer sees a ' character, it switches state to scanning a quoted atom. While doing so it translates escape sequences such as \n (into ASCII 10) and when it sees a ' character it produces an atom token and goes back to the start state.

Problems with simple prefixes #

All of these simple prefixes have to become separate tokens in the tokenizer: «bf"» would constitute the start token for a binary string with interpolation syntax. So would «bf"""», «b"», «b"""», and so on.

The tokenizer would have to know of all combinations of prefix characters and emit distinct tokens for every combination.

Today, the character sequence «b», «f», «"» is scanned as a token for the atom bf followed by the string start token ". That combination fails in the parser so it is syntactically invalid today, which is what makes simple prefixes a possible language extension.

A simple prefix approach would have to scan a number of characters ahead to distinguish between an atom followed by string start vs. prefixed string start, and it would be a different number of characters depending on which atom characters that have been found so far. This is rather messy.

Furthermore, it is likely that we want the feature of choosing String Delimiters, especially for regular expressions such as:

re(^"+.*/.*$)

Among the desired delimiters are / and < >. The currently valid code «b<X» meaning atom b less than X, would instead have to be interpreted as prefixed string start b< with X being the first string content character.

For the / character we run into similar problems with for example «b/X», which would be a run-time error today, but if we also would want capital letter prefixes, then «B/X» is perfectly valid today but would become a string start.

There are more likely problems with simple string prefixes: «#bf{» is today the start of a record named bf, and is scanned as punctuation character #, atom bf and separator {, which the parser figures out to be a record start.

With simple prefix characters the tokenizer would have to be rewritten to recognize «#bf» as a new record token, a rewrite that might cause unexpected changes in record handling. For example, today, «# bf {» is also a valid record start, so to be compatible the tokenizer would have to allow white-space or even newlines within the new record token, between # and the atom characters, which would be really ugly…

For other reasons, namely that function call parenthesis are optional, Elixir has chosen to use the ~ character as the start of a string prefix which they call a “Sigil”.

Having a distinct start character for this feature simplifies tokenizing and parsing.

Sigil #

In a general sense, a Sigil, is a prefix to a variable that indicates its type, such as $I in Basic or Perl, where $ is the sigil and I is the variable.

Here we define a Sigil as a prefix (and maybe a suffix) to a string literal that indicates how it should be interpreted. The Sigil is a syntactic sugar that is transformed into some Erlang term, or expression.

A Sigil string literal consists of:

  1. The Sigil Prefix, ~ followed by a name that may be empty.
  2. The String Content within String Delimiters.
  3. The Sigil Suffix, a name character sequence that may be empty.

Sigil Transformation #

The sigil is transformed early by the tokenizer and the parser into some other term or expression. Later steps in the parsing and compilation finds out if the transformation result is valid.

Patterns and Expressions #

Where the transformed term is valid depends on what it was transformed into. For example, if the sigil is transformed into some other literal term, it would be valid in a pattern.

Should the sigil have become something containing a function call, then it is only valid in a general expression, not in a pattern.

String Concatenation #

Adjacent strings are concatenated by the parser so for example «"abc" "def"» is concatenated to "abcdef".

A Sigil looks like a string with a prefix (and maybe a suffix), but may be transformed into something other than a string, so it cannot be subject to string concatenation.

Therefore «~s"abc" "def"» should be illegal, and also all other sequences consisting of a Sigil of any type, and any other term, in any order.

Sigil Prefix #

The Sigil Prefix starts whith the Tilde character ~, followed by the Sigil Type which is a name composed of a sequence of characters that are allowed as the second or later characters in a variable or an atom. In short ISO Latin-1 letters, digits, _ and @. The Sigil Type may be empty.

The Sigil Type defines how the Sigil syntactic sugar shall be interpreted. The suggested Sigil Types are:

  • «»: the vanilla (default (empty name)) Sigil.

    Creates a literal Erlang unicode:unicode_binary(). It is a string represented as a UTF-8 encoded binary, equivalent to applying unicode:characters_to_binary/1 on the String Content. The String Delimiters and escape characters work as they already do for regular strings or triple-quoted strings.

    So «~"abc\d"» is equivalent to «<<"abc\d"/utf8>>», and «~'abc"d'» is equivalent to «<<"abc\"d"/utf8>>».

    Regular strings honour escape sequences but triple-quoted strings are verbatim, so «~"» is equivalent to «~b"» but «~"""» is equivalent to «~B"""», as described below.

    A simple way to create strings as UTF-8 binaries is supposedly the first and most desired missing string feature in Erlang. This sigil does just that.

  • b: unicode:unicode_binary()

    Creates a literal UTF-8 encoded binary, handling escape characters in the string content. Other features such as string interpolation will require another Sigil Type or using the Sigil Suffix.

    In Elixir this corresponds to the ~s sigil, a string.

  • B: unicode:unicode_binary(), verbatim.

    Creates a literal UTF-8 encoded binary, with verbatim string content. The content ends when the end delimiter is found. There is no way to escape the end delimiter.

    In Elixir this corresponds to the ~S sigil, a string.

  • s: string().

    Creates a literal Unicode codepoint list, handling escape characters in the string content. Other features such as string interpolation will require another Sigil Type or using the Sigil Suffix.

    In Elixir this corresponds to the ~c sigil, a charlist.

  • S: string(), verbatim.

    Creates a literal Unicode codepoint list, with verbatim string content. The content ends when the end delimiter is found. There is no way to escape the end delimiter.

    In Elixir this corresponds to the ~C sigil, a charlist.

  • r: regular expression.

    This EEP proposes to not implement regular expressions yet. It is still unclear how integration with the re module should be done, and if it is worth the effort compared to just using the S or the B Sigil Type.

    The best idea so far was that this sigil creates a literal term {re,RE::unicode:charlist(),Flags::[unicode:latin1_char()]} that is an uncompiled regular expression with compile flags, suitable for (yet to be implemented) functions in the re module. The RE element is the String Content, and the Flags element is the Sigil Suffix.

    See the Regular Expressions section about the reasoning behind this proposed term type.

    First the end delimiter is found and within the String Content, character escape sequences are handled according to the regular expression rules.

    The main advantage of a regular expression Sigil is to avoid the additional escaping of \ that regular erlang strings require.

    Looking for name\number in quotes such as: "foo\17"

    Today: re:run(Subject, "^\\s*\"[a-z]+\\\\\\d+\"", [caseless,unicode])

    Sigil: re:run(Subject, ~r/^\s*"[a-z]+\\\d+"/iu)

    Other advantages are possible tools and library integration features such as making the re module recognize this tuple format, and having the code loader pre-compile them.

Sigil Prefixes with other, unknown, Sigil Types should cause an error “illegal sigil prefix” in the tokenizer or the parser. Another possibility would be to pass them further in the compilation chain enabling parse transforms to act on them, but that feature can be added later, and in general one should avoid using parse transforms since they are often a source for hard to find problems.

These proposed Sigil Types are named according to the corresponding Erlang types. The Sigil Types in Elixir are named according to Elixir types. So, for example, a ~s Sigil Prefix in Erlang creates an Erlang string(), which is a list of Unicode codepoints, but in Elixir the ~s Sigil Prefix creates an Elixir String which is a UTF-8 encoded binary.

Consistency within the language is supposedly more important that between the languages, and that the string types are different between the languages is already a known quirk.

String Delimiters #

Immediately following the Sigil Prefix is the string start delimiter. A specific start delimiter character has a corresponding end delimiter character.

The allowed start-end delimiter character pairs are: () [] {} <>.

The following characters are start delimiters that have themselves as end delimiters: / | ' " ` #.

Triple-quote delimiters are also allowed, that is; a sequence of 3 or more double quote " characters as described in EEP 64.

For a given Sigil Type except the Vanilla Sigil, which String Delimiters that are used does not affect how the string content is interpreted, besides finding the end delimiter.

For a triple-quoted string, though, conceptually the end delimiter doesn’t occur in the string’s content, so interpreting the string content does not interfere with finding the end delimiter.

The proposed set of delimiters is the same as in Elixir, plus ` and #. They are the characters in ASCII that are normally used for bracketing or text quoting, and those that feel like full height vertikal lines, except: \ is too often used for character escaping, plus # which is too useful to not include since in many contexts (shell scripts, Perl regular expressions) it is a comment character that is easy to avoid in the String Content.

Even though Latin-1 is the character set that Erlang is defined in, it is still ASCII that is the common denominator for programming languages. Only western Europeean keyboards and code pages that have the possibility to produce Latin-1 characters above 127.

Latin-1 characters above 127 are allowed in variable names and unquoted atoms, but the programmer that uses such should be aware that the code will not read correctly for non-Latin-1 users. On the other hand it would be bad to lure a programmer into using e.g a quote character that happens to exist on a Latin-1 keyboard but will be something completely different for other programmers. Therefore characters like « » should not be used for a general syntactical element.

String Content #

Between the start and end String Delimiters, all characters are string content.

In a triple-quoted string all characters are verbatim, but stripping of indentation and leading and trailing newline is done as usual as described in EEP 64.

In a string with single character String Delimiters, normal Erlang escape sequences prefixed with \ are honoured, as usual for regular Erlang strings and quoted atoms

A specific Sigil Type can have it’s own character escaping rules, which may affect finding the end delimiter.

Sigil Suffix #

Immediately following the String Content comes the Sigil Suffix, which may be empty.

The Sigil Suffix consists, as the Sigil Type in the Sigil Prefix, of name characters.

The Sigil Suffix may indicate how to interpret the String Content, for a specific Sigil Type. For example; for the ~R Sigil Prefix (regular expression), the Sigil Suffix is interpreted as short form compile options such as «i» that makes the regular expression character case insensitive. For example «~R/^from: /i».

Things that may have to be performed by the tokenizer, such as how to handle escape character rules, should not be affected by the Sigil Suffix, since the tokenizer has already scanned the String Content when it sees the Sigil Suffix.

If a Sigil Type doesn’t allow a Sigil Suffix, an error “illegal sigil suffix” should be generated in the tokenizer or the parser.

Regular Expressions #

A regular expression sigil «~R"expression"flags» should be translated to something useful for tools/libraries. There are at least two ways; uncompiled regular expressions, or compiled regular expressions.

Uncompiled Regular Expression #

The value of a regular expression Sigil is chosen to be a tuple {re,RE,Flags}.

With this representation, the re module can be augmented with functions that accept this tuple format that bundles a regular expression with compile flags. These functions are re:compile/1,2, re:replace/3,4 re:run/2,3, and re:split/2,3. Translation of the Flags’ characters into re:compile_option()s should be done by these functions.

Example of calling a yet to be implemented re:run/3:

1> re:run("ABC123", ~r"abc\d+"i, [{capture,first,list}]).
{match,["ABC123"]}

Since the Sigil value represents an uncompiled regular expression, the user can choose when to compile it with re:compile/1,2, or use it directly in for example re:run/2,3.

It is possible to implement an optimization to make the compiler aware that when passing a regular expression Sigil, which is a literal, to functions like re:run/2,3, code can be emitted for the code loader (a now missing feature) to compile the regular expression at load time and instead pass the pre-compiled regular expression to re:run/2,3.

For this optimization to be safe, other compile options than the ones in the Sigil value cannot be allowed to affect for example re:run/3 that has options as the third argument. If re:run/3 would fail for any compile options (only allow run-time options), or if the options argument is a literal to be included in pre-compilation, then such an optimization is safe.

Compiled Regular Expression #

Another possibility would be that the value of a regular expression Sigil is a compiled regular expression; the re:mp() type.

Then it can be used as above, except as an argument to re:compile/1,2. Pre-compilation would be a hard requirement since the running Erlang code must see a compiled regular expression.

And we would still have to decide on another sigil type to be used in re:compile/1,2 that is syntactic sugar for an uncompiled regular expression. Without that a ~S sigil could be used but that won’t have the compile flags as suffix so those flags cannot be given in the same way for compiled vs. uncompiled regular expressions.

Therefore Uncompiled #

Since we in any case need a Sigil that is syntactic sugar for an uncompiled regular expression, and pre-compilation optimization is possible with that, this EEP suggests that a regular expression Sigil should represent an uncompiled regular expression with compile flags.

Comparison with Elixir #

There is no Vanilla Sigil (empty Sigil Type) in Elixir.

This EEP proposes to add the following String Delimiters to the set that Elixir has: # `.

The string and binary Sigil Types are named differently between the languages, to keep the names consistent within the language (Erlang): ~s in Elixir is ~b in Erlang, and ~c in Elixir is ~s in Erlang, so ~s means different things, because strings are different things.

When Elixir allows escape sequences in the String Content it also allows string interpolation. This EEP proposes to not implement string interpolation in the suggested Sigil Types.

When Elixir doesn’t allow escape sequences in the String Content, it still allows escaping the end delimiter. This EEP proposes that such strings should be truly verbatim whith no possibility to escape the end delimiter.

There are small differences in which escape sequences that are implemented in the languages; Elixir allows escaping of newlines, and has an escape sequence \a, that Erlang does not have.

There are also small differences in how newlines are handled between ~S heredocs in Elixir and triple-quoted strings in Erlang. See EEP 64.

Details about regular expression sigils, ~R, in particular their Sigil Suffixes remains to be decided in Erlang. Also, there still is a question about escaping the end delimiter or not.

It has not been decided how or even if string interpolation will be implemented in Erlang, but a Sigil Suffix or new Sigil Types would most probably be used.

Reference Implementation #

PR-7684 Implements the ~s, ~S, ~b, ~B and the ~ (vanilla) Sigil, according to this EEP.

The tokenizer produces a sigil_prefix token before the string literal, and a sigil_suffix token after. The parser merges and transforms them into the correct output term.

Another approach would be to produce (for example) a sigil_string token for the whole string and then handle that in the parser. It would require more state to be kept in the tokenizer between the parts of the sigil prefixed string, and therefore need more tokenizer rewriting.

Copyright #

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.