This EEP proposes Sigils for string literals very much like Elixir Sigils. The chief reason is to facilitate other suggested language features, many of which exist in Elixir under the umbrella of Sigils, such as:
unicode:unicode_binary()
Many existing suggestions about features in the Abstract use a prefix before a normal Erlang string such as:
u"For UTF-8 encoded binary strings"
or
bf"For UTF-8 encoded binary with interpolation formatting: ~foo()~"
This EEP suggests using the same or very similar syntax as in Elixir for Sigils on literal strings to avoid syntactical problems with simple prefixes, and to not make these sibling languages deviate too much without good reason:
~"For UTF-8 encoded binary strings"
In the following text double angle quotation marks are used to
mark source code characters to improve clarity.
For example: the dot character (full stop): «.».
The Erlang programming language is built according to a traditional tokenizer+parser+compiler model.
The tokenizer a.k.a. scanner a.k.a. lexer scans the source code
character sequence and converts it into a sequence of Tokens,
like atom, variable, string, integer, reserved word,
punctuation character or operator:
atom, Variable, "string", 123, case, : and ++.
The parser takes a sequence of tokens and builds a parse tree, AST (Abstract Syntax Tree), according to the Erlang grammar. This AST is then what the compiler compiles into executable (virtual machine) code.
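The two stages described above can be observed with the standard library modules erl_scan and erl_parse. The following is a minimal sketch; the module name scan_parse_demo and its function names are only illustrative.

```erlang
-module(scan_parse_demo).
-export([tokens/1, parse/1]).

%% Run the tokenizer on a source string and return the token list.
tokens(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    Tokens.

%% Tokenize and then parse a dot-terminated expression sequence
%% into its abstract syntax tree (AST).
parse(Source) ->
    {ok, Exprs} = erl_parse:parse_exprs(tokens(Source)),
    Exprs.

%% scan_parse_demo:tokens("atom Variable \"string\" 123 case : ++")
%%   returns [{atom,1,atom},{var,1,'Variable'},{string,1,"string"},
%%            {integer,1,123},{'case',1},{':',1},{'++',1}]
%% scan_parse_demo:parse("X = 1 + 2.")
%%   returns [{match,1,{var,1,'X'},{op,1,'+',{integer,1,1},{integer,1,2}}}]
```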
The tokenizer was originally simple. It stems from the tool lex, which tries a set of regular expressions on the input; when one matches, the matched text becomes a token and is removed from the input. Rinse and repeat.
The tokenizer is no longer that simple, but it doesn’t keep much state and looks only a small, fixed number of characters ahead in the input.
For example; from the start state, if the tokenizer sees a ' character,
it switches state to scanning a quoted atom.
While doing so it translates escape sequences such as \n (into ASCII 10)
and when it sees a ' character it produces
an atom token and goes back to the start state.
All of these simple prefixes have to become separate tokens in the tokenizer:
«bf"» would constitute the start token for a binary string
with interpolation syntax. So would «bf"""», «b"», «b"""»,
and so on.
The tokenizer would have to know of all combinations of prefix characters and emit distinct tokens for every combination.
Today, the character sequence «b», «f», «"» is scanned as a token
for the atom bf followed by the string start token ".
That combination fails in the parser so it is syntactically invalid today,
which is what makes simple prefixes a possible language extension.
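This can be checked with the existing scanner and parser. The sketch below (the module name prefix_today is illustrative) scans a prefixed string and reports whether the token sequence parses as an expression.

```erlang
-module(prefix_today).
-export([scan/1, parses/1]).

%% Return the tokens for a source string.
scan(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    Tokens.

%% true if the dot-terminated source parses as an expression sequence,
%% false if the parser rejects it.
parses(Source) ->
    case erl_parse:parse_exprs(scan(Source)) of
        {ok, _Exprs} -> true;
        {error, _Reason} -> false
    end.

%% prefix_today:scan("bf\"abc\".")
%%   returns [{atom,1,bf},{string,1,"abc"},{dot,1}]
%% prefix_today:parses("bf\"abc\".") returns false
%% prefix_today:parses("\"abc\".")   returns true
```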
A simple prefix approach would have to scan a number of characters ahead to distinguish between an atom followed by string start vs. prefixed string start, and the number of characters would differ depending on which atom characters have been found so far. This is rather messy.
Furthermore, it is likely that we want the feature of choosing String Delimiters, especially for regular expressions such as:
re(^"+.*/.*$)
Among the desired delimiters are / and < >. The currently
valid code «b<X» meaning atom b less than X, would instead
have to be interpreted as prefixed string start b< with X
being the first string content character.
For the / character we run into similar problems with for example
«b/X», which would be a run-time error today, but if we also would
want capital letter prefixes, then «B/X» is perfectly valid today
but would become a string start.
There are likely more problems with simple string prefixes:
«#bf{» is today the start of a record named bf, and is
scanned as punctuation character #, atom bf and separator {,
which the parser figures out to be a record start.
With simple prefix characters the tokenizer would have to be rewritten
to recognize «#bf» as a new record token, a rewrite that might cause
unexpected changes in record handling. For example, today, «# bf {»
is also a valid record start, so to be compatible the tokenizer
would have to allow white-space or even newlines within the new record
token, between # and the atom characters, which would be really ugly…
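The delimiter and record cases above can be inspected the same way. This is a small sketch with an illustrative module name; it only shows how today's scanner splits these character sequences.

```erlang
-module(prefix_clash_demo).
-export([tokens/1, same_tokens/2]).

%% Return the tokens for a source string.
tokens(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    Tokens.

%% true if two single-line sources produce identical token lists.
same_tokens(SourceA, SourceB) ->
    tokens(SourceA) =:= tokens(SourceB).

%% prefix_clash_demo:tokens("b<X") returns [{atom,1,b},{'<',1},{var,1,'X'}]
%% prefix_clash_demo:tokens("B/X") returns [{var,1,'B'},{'/',1},{var,1,'X'}]
%% prefix_clash_demo:tokens("#bf{") returns [{'#',1},{atom,1,bf},{'{',1}]
%% prefix_clash_demo:same_tokens("#bf{", "# bf {") returns true
```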
For other reasons, namely that function call parentheses are optional,
Elixir has chosen to use the ~ character as the start of
a string prefix which they call a “Sigil”.
Having a distinct start character for this feature simplifies tokenizing and parsing.
In a general sense, a Sigil is a prefix to a variable
that indicates its type, such as $I in Basic or Perl,
where $ is the sigil and I is the variable.
Here we define a Sigil as a prefix (and maybe a suffix) to a string literal that indicates how it should be interpreted. The Sigil is syntactic sugar that is transformed into some Erlang term or expression.
A Sigil string literal consists of:
~ followed by a name that may be empty.
The sigil is transformed early by the tokenizer and the parser into some other term or expression. Later steps in the parsing and compilation find out if the transformation result is valid.
Where the transformed term is valid depends on what it was transformed into. For example, if the sigil is transformed into some other literal term, it would be valid in a pattern.
Should the sigil have become something containing a function call, then it is only valid in a general expression, not in a pattern.
Adjacent strings are concatenated by the parser so for example
«"abc" "def"» is concatenated to "abcdef".
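That the concatenation happens in the parser and not in the tokenizer can be seen by comparing the two stages. A minimal sketch, with illustrative module and function names:

```erlang
-module(concat_demo).
-export([string_tokens/1, parsed/1]).

%% The tokenizer keeps adjacent string literals as separate tokens;
%% return only the string tokens it produced.
string_tokens(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    [T || T <- Tokens, element(1, T) =:= string].

%% The parser merges them into one string node.
parsed(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    {ok, [Expr]} = erl_parse:parse_exprs(Tokens),
    Expr.

%% concat_demo:string_tokens("\"abc\" \"def\".")
%%   returns [{string,1,"abc"},{string,1,"def"}]
%% concat_demo:parsed("\"abc\" \"def\".")
%%   returns {string,1,"abcdef"}
```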
A Sigil looks like a string with a prefix (and maybe a suffix), but may be transformed into something other than a string, so it cannot be subject to string concatenation.
Therefore «~s"abc" "def"» should be illegal, and also all other
sequences consisting of a Sigil of any type, and any other term,
in any order.
The Sigil Prefix starts with the Tilde character ~, followed
by the Sigil Type which is a name composed of a sequence of characters
that are allowed as the second or later characters in a variable or an atom.
In short: ISO Latin-1 letters, digits, _ and @.
The Sigil Type may be empty.
The Sigil Type defines how the Sigil syntactic sugar shall be interpreted. The suggested Sigil Types are:
«»: the vanilla (default (empty name)) Sigil.
Creates a literal Erlang unicode:unicode_binary().
It is a string represented as a UTF-8 encoded binary,
equivalent to applying unicode:characters_to_binary/1
on the String Content. The String Delimiters
and escape characters work as they already do for regular strings
or triple-quoted strings.
So «~"abc\d"» is equivalent to «<<"abc\d"/utf8>>», and
«~'abc"d'» is equivalent to «<<"abc\"d"/utf8>>».
Regular strings honour escape sequences but triple-quoted strings
are verbatim, so «~"» is equivalent to «~b"» but
«~"""» is equivalent to «~B"""», as described below.
A simple way to create strings as UTF-8 binaries is supposedly the first and most desired missing string feature in Erlang. This sigil does just that.
b: unicode:unicode_binary()
Creates a literal UTF-8 encoded binary, handling escape characters in the string content. Other features such as string interpolation will require another Sigil Type or using the Sigil Suffix.
In Elixir this corresponds to the ~s sigil, a string.
B: unicode:unicode_binary(), verbatim.
Creates a literal UTF-8 encoded binary, with verbatim string content. The content ends when the end delimiter is found. There is no way to escape the end delimiter.
In Elixir this corresponds to the ~S sigil, a string.
s: string().
Creates a literal Unicode codepoint list, handling escape characters in the string content. Other features such as string interpolation will require another Sigil Type or using the Sigil Suffix.
In Elixir this corresponds to the ~c sigil, a charlist.
S: string(), verbatim.
Creates a literal Unicode codepoint list, with verbatim string content. The content ends when the end delimiter is found. There is no way to escape the end delimiter.
In Elixir this corresponds to the ~C sigil, a charlist.
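The vanilla, b, B, s and S Sigil Types described above can be compared against today's literals. This is a sketch that assumes a compiler implementing this EEP (such as the PR-7684 implementation mentioned in this document); the module name is illustrative.

```erlang
-module(sigil_types_demo).
-export([checks/0]).

%% Each list element compares a sigil literal with its equivalent
%% written in today's syntax. Assumes a compiler that implements
%% the sigils proposed in this EEP.
checks() ->
    [
     %% Vanilla sigil with "..." delimiters: UTF-8 binary, escapes handled.
     ~"abc\d" =:= <<"abc\d"/utf8>>,
     %% Vanilla sigil with '...' delimiters: " needs no escaping.
     ~'abc"d' =:= <<"abc\"d"/utf8>>,
     %% ~b handles escapes: \n becomes one newline byte.
     ~b"a\nb" =:= <<$a, $\n, $b>>,
     %% ~B is verbatim: backslash and n stay two separate bytes.
     ~B"a\nb" =:= <<$a, $\\, $n, $b>>,
     %% ~s handles escapes and yields a code point list.
     ~s"a\nb" =:= [$a, $\n, $b],
     %% ~S is verbatim and yields a code point list.
     ~S"a\nb" =:= [$a, $\\, $n, $b]
    ].
```

Under that assumption every element of the list returned by sigil_types_demo:checks() is true.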
r: regular expression.
This EEP proposes to not implement regular expressions yet.
It is still unclear how integration with the re module
should be done, and if it is worth the effort compared
to just using the S or the B Sigil Type.
The best idea so far was that this sigil creates a literal term
{re,RE::unicode:charlist(),Flags::[unicode:latin1_char()]}
that is an uncompiled regular expression with compile flags,
suitable for (yet to be implemented) functions in the re module.
The RE element is the String Content, and the Flags element
is the Sigil Suffix.
See the Regular Expressions section about the reasoning behind this proposed term type.
First the end delimiter is found and within the String Content, character escape sequences are handled according to the regular expression rules.
The main advantage of a regular expression Sigil is to avoid
the additional escaping of \ that regular Erlang strings require.
Looking for name\number in quotes such as: "foo\17"
Today: re:run(Subject, "^\\s*\"[a-z]+\\\\\\d+\"", [caseless,unicode])
Sigil: re:run(Subject, ~r/^\s*"[a-z]+\\\d+"/iu)
Other advantages are possible tools and library integration features
such as making the re module recognize this tuple format,
and having the code loader pre-compile them.
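The escaping advantage can already be had with the verbatim S Sigil Type, which the text above names as the alternative to a dedicated regular expression sigil. The sketch below assumes a compiler implementing this EEP; the module name is illustrative, and the compile flags are given as ordinary re options since no Sigil Suffix is defined for ~S.

```erlang
-module(re_escape_demo).
-export([same_pattern/0, match/1]).

%% The verbatim sigil yields exactly the same character list as
%% today's heavily escaped string literal.
same_pattern() ->
    Today = "^\\s*\"[a-z]+\\\\\\d+\"",
    Verbatim = ~S/^\s*"[a-z]+\\\d+"/,
    Today =:= Verbatim.

%% Look for name\number in quotes, such as "foo\17".
match(Subject) ->
    re:run(Subject, ~S/^\s*"[a-z]+\\\d+"/,
           [caseless, unicode, {capture, first, list}]).

%% re_escape_demo:same_pattern()         returns true
%% re_escape_demo:match("\"foo\\17\"")   returns {match,["\"foo\\17\""]}
%% re_escape_demo:match("foo17")         returns nomatch
```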
Sigil Prefixes with other, unknown, Sigil Types should cause an error “illegal sigil prefix” in the tokenizer or the parser. Another possibility would be to pass them further in the compilation chain enabling parse transforms to act on them, but that feature can be added later, and in general one should avoid using parse transforms since they are often a source of hard-to-find problems.
These proposed Sigil Types are named according to the corresponding
Erlang types. The Sigil Types in Elixir are named according to
Elixir types. So, for example, a ~s Sigil Prefix in Erlang
creates an Erlang string(), which is a list of Unicode codepoints,
but in Elixir the ~s Sigil Prefix creates an Elixir String
which is a UTF-8 encoded binary.
Consistency within the language is supposedly more important than consistency between the languages, and that the string types are different between the languages is already a known quirk.
Immediately following the Sigil Prefix is the string start delimiter. A specific start delimiter character has a corresponding end delimiter character.
The allowed start-end delimiter character pairs are: () [] {} <>.
The following characters are start delimiters that have themselves
as end delimiters: / | ' " ` #.
Triple-quote delimiters are also allowed, that is; a sequence of
3 or more double quote " characters as described in EEP 64.
For a given Sigil Type except the Vanilla Sigil, the choice of String Delimiters does not affect how the string content is interpreted, besides finding the end delimiter.
For a triple-quoted string, though, conceptually the end delimiter doesn’t occur in the string’s content, so interpreting the string content does not interfere with finding the end delimiter.
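As an illustration of the single character delimiters listed above, the following sketch writes the same ~s string with every proposed delimiter and checks that the result does not depend on the choice. It assumes a compiler implementing this EEP; the module name is illustrative.

```erlang
-module(sigil_delims_demo).
-export([all_equal/0]).

%% The same content written with each proposed start-end delimiter.
%% Assumes a compiler that implements the sigils proposed in this EEP.
all_equal() ->
    Variants = [~s(abc), ~s[abc], ~s{abc}, ~s<abc>,
                ~s/abc/, ~s|abc|, ~s'abc', ~s"abc", ~s`abc`, ~s#abc#],
    lists:all(fun(V) -> V =:= "abc" end, Variants).
```

Under that assumption sigil_delims_demo:all_equal() returns true.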
The proposed set of delimiters is the same as in Elixir,
plus ` and #. They are the characters in ASCII
that are normally used for bracketing or text quoting,
and those that feel like full height vertical lines,
except \, which is too often used for character escaping,
plus #, which is too useful to not include since
in many contexts (shell scripts, Perl regular expressions)
it is a comment character that is easy to avoid
in the String Content.
Even though Latin-1 is the character set that Erlang is defined in, it is still ASCII that is the common denominator for programming languages. Only western European keyboards and code pages can produce Latin-1 characters above 127.
Latin-1 characters above 127 are allowed in variable names
and unquoted atoms, but the programmer that uses such should
be aware that the code will not read correctly for
non-Latin-1 users. On the other hand it would be bad to lure
a programmer into using e.g. a quote character that happens to exist
on a Latin-1 keyboard but will be something completely different
for other programmers. Therefore characters like « »
should not be used for a general syntactical element.
Between the start and end String Delimiters, all characters are string content.
In a triple-quoted string all characters are verbatim, but stripping of indentation and leading and trailing newline is done as usual as described in EEP 64.
In a string with single character String Delimiters,
normal Erlang escape sequences prefixed with \ are honoured,
as usual for regular Erlang strings and quoted atoms.
A specific Sigil Type can have its own character escaping rules, which may affect finding the end delimiter.
Immediately following the String Content comes the Sigil Suffix, which may be empty.
Like the Sigil Type in the Sigil Prefix, the Sigil Suffix consists of name characters.
The Sigil Suffix may indicate how to interpret the String Content,
for a specific Sigil Type.
For example; for the ~R Sigil Prefix (regular expression),
the Sigil Suffix is interpreted as short form compile options
such as «i» that makes the regular expression character
case insensitive. For example «~R/^from: /i».
Things that may have to be performed by the tokenizer, such as how to handle escape character rules, should not be affected by the Sigil Suffix, since the tokenizer has already scanned the String Content when it sees the Sigil Suffix.
If a Sigil Type doesn’t allow a Sigil Suffix, an error “illegal sigil suffix” should be generated in the tokenizer or the parser.
A regular expression sigil «~R"expression"flags» should
be translated to something useful for tools/libraries.
There are at least two ways; uncompiled regular expressions,
or compiled regular expressions.
The value of a regular expression Sigil is chosen
to be a tuple {re,RE,Flags}.
With this representation, the re module can be augmented
with functions that accept this tuple format that bundles
a regular expression with compile flags. These functions
are re:compile/1,2, re:replace/3,4, re:run/2,3,
and re:split/2,3. Translation of the Flags’ characters
into re:compile_option()s should be done by these functions.
Example of calling a yet to be implemented re:run/3:
1> re:run("ABC123", ~r"abc\d+"i, [{capture,first,list}]).
{match,["ABC123"]}
Since the Sigil value represents an uncompiled regular expression,
the user can choose when to compile it with re:compile/1,2,
or use it directly in for example re:run/2,3.
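How a function could accept the proposed tuple may be sketched as a wrapper around today's re:run/3. Everything here is hypothetical: the module re_tuple_sketch, its functions, and the flag letters, of which only i and u are taken from the «iu» example earlier in this document. The tuple is written out by hand since no regular expression sigil exists.

```erlang
-module(re_tuple_sketch).
-export([run/3, flags_to_options/1]).

%% Hypothetical stand-in for a re:run/3 that accepts the proposed
%% {re, RE, Flags} tuple: translate the flags and delegate to re:run/3.
run(Subject, {re, RE, Flags}, RunOptions) ->
    re:run(Subject, RE, flags_to_options(Flags) ++ RunOptions).

%% Translate Sigil Suffix characters into re compile options.
%% Assumed mapping: only i and u, from the "iu" example in the text.
flags_to_options(Flags) ->
    [flag_to_option(Flag) || Flag <- Flags].

flag_to_option($i) -> caseless;
flag_to_option($u) -> unicode;
flag_to_option(Flag) -> error({illegal_sigil_suffix, [Flag]}).

%% re_tuple_sketch:run("ABC123", {re, "abc\\d+", "i"},
%%                     [{capture, first, list}])
%%   returns {match,["ABC123"]}
```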
It is possible to implement an optimization to make the compiler
aware that when passing a regular expression Sigil,
which is a literal, to functions like re:run/2,3, code can be emitted
for the code loader (a now missing feature) to compile
the regular expression at load time and instead pass
the pre-compiled regular expression to re:run/2,3.
For this optimization to be safe, other compile options than the ones
in the Sigil value cannot be allowed to affect for example re:run/3
that has options as the third argument. If re:run/3 would fail
for any compile options (only allow run-time options), or if
the options argument is a literal to be included in
pre-compilation, then such an optimization is safe.
Another possibility would be that the value of a regular expression
Sigil is a compiled regular expression; the re:mp() type.
Then it can be used as above, except as an argument to
re:compile/1,2. Pre-compilation would be a hard requirement
since the running Erlang code must see a compiled regular expression.
And we would still have to decide on another sigil type to be used
in re:compile/1,2 that is syntactic sugar for an uncompiled
regular expression. Without that a ~S sigil could be used
but that won’t have the compile flags as suffix so those flags
cannot be given in the same way for compiled vs. uncompiled
regular expressions.
Since we in any case need a Sigil that is syntactic sugar for an uncompiled regular expression, and pre-compilation optimization is possible with that, this EEP suggests that a regular expression Sigil should represent an uncompiled regular expression with compile flags.
There is no Vanilla Sigil (empty Sigil Type) in Elixir.
This EEP proposes to add the following String Delimiters
to the set that Elixir has: # `.
The string and binary Sigil Types are named differently
between the languages, to keep the names consistent within
the language (Erlang): ~s in Elixir is ~b in Erlang,
and ~c in Elixir is ~s in Erlang, so ~s means
different things, because strings are different things.
When Elixir allows escape sequences in the String Content it also allows string interpolation. This EEP proposes to not implement string interpolation in the suggested Sigil Types.
When Elixir doesn’t allow escape sequences in the String Content, it still allows escaping the end delimiter. This EEP proposes that such strings should be truly verbatim with no possibility to escape the end delimiter.
There are small differences in which escape sequences are implemented
in the languages; Elixir allows escaping of newlines, and has
an escape sequence \a, that Erlang does not have.
There are also small differences in how newlines are handled
between ~S heredocs in Elixir and triple-quoted strings in Erlang.
See EEP 64.
Details about regular expression sigils, ~R, in particular
their Sigil Suffixes, remain to be decided in Erlang.
Also, there is still a question about whether to allow escaping the end delimiter.
It has not been decided how or even if string interpolation will be implemented in Erlang, but a Sigil Suffix or new Sigil Types would most probably be used.
PR-7684 implements the ~s, ~S, ~b, ~B and the ~ (vanilla) Sigil, according to this EEP.
The tokenizer produces a sigil_prefix token before the string literal,
and a sigil_suffix token after. The parser merges and transforms them
into the correct output term.
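The token sequence described above can be inspected with erl_scan. This sketch assumes a scanner that implements this EEP as in PR-7684; it only looks at the token categories named in the text and makes no claim about the payload of each token. The module name is illustrative.

```erlang
-module(sigil_tokens_demo).
-export([categories/1]).

%% Return only the category (first tuple element) of each token.
%% Assumes an erl_scan that implements the sigils of this EEP.
categories(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    [element(1, Token) || Token <- Tokens].
```

With such a scanner, sigil_tokens_demo:categories("~s\"abc\"") is expected to yield the categories sigil_prefix, string and sigil_suffix, in that order.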
Another approach would be to produce (for example) a sigil_string token
for the whole string and then handle that in the parser.
It would require more state to be kept in the tokenizer between
the parts of the sigil prefixed string, and therefore need
more tokenizer rewriting.
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.