A user-perceived character, consisting of one or more codepoints.
string
Description
This module provides functions for string processing.
A string in this module is represented by unicode:chardata(), that is, a list of codepoints, binaries with UTF-8-encoded codepoints (UTF-8 binaries), or a mix of the two.
"abcd" is a valid string <<"abcd">> is a valid string ["abcd"] is a valid string <<"abc..åäö"/utf8>> is a valid string <<"abc..åäö">> is NOT a valid string, but a binary with Latin-1-encoded codepoints [<<"abc">>, "..åäö"] is a valid string [atom] is NOT a valid string
This module operates on grapheme clusters. A grapheme cluster is a user-perceived character, which can be represented by several codepoints.
"å" [229] or [97, 778] "e̊" [101, 778]
The string length of "ß↑e̊" is 3, even though it is represented by the codepoints [223,8593,101,778] or the UTF-8 binary <<195,159,226,134,145,101,204,138>>.
Grapheme clusters for codepoints of class prepend and non-modern (or decomposed) Hangul is not handled for performance reasons in find/3, replace/3, split/2, split/3 and trim/3.
Splitting and appending strings is to be done on grapheme clusters borders. There is no verification that the results of appending strings are valid or normalized.
Most of the functions expect all input to be normalized to one form, see for example unicode:characters_to_nfc_list/1.
Language or locale specific handling of input is not considered in any function.
The functions can crash for non-valid input strings. For example, the functions expect UTF-8 binaries but not all functions verify that all binaries are encoded correctly.
Unless otherwise specified the return value type is the same as the input type. That is, binary input returns binary output, list input returns a list output, and mixed input can return a mixed output.
1> string:trim(" sarah "). "sarah" 2> string:trim(<<" sarah ">>). <<"sarah">> 3> string:lexemes("foo bar", " "). ["foo","bar"] 4> string:lexemes(<<"foo bar">>, " "). [<<"foo">>,<<"bar">>]
This module has been reworked in Erlang/OTP 20 to handle unicode:chardata() and operate on grapheme clusters. The old functions that only work on Latin-1 lists as input are still available but should not be used, they will be deprecated in a future release.
casefold(String :: unicode:chardata()) -> unicode:chardata()OTP 20.0
Converts String to a case-agnostic comparable string. Function casefold/1 is preferred over lowercase/1 when two strings are to be compared for equality. See also equal/4.
Example:
1> string:casefold("Ω and ẞ SHARP S").
"ω and ss sharp s"
chomp(String :: unicode:chardata()) -> unicode:chardata()OTP 20.0
Returns a string where any trailing \n or \r\n have been removed from String.
Example:
182> string:chomp(<<"\nHello\n\n">>). <<"\nHello">> 183> string:chomp("\nHello\r\r\n"). "\nHello\r"
equal(A, B) -> boolean()
equal(A, B, IgnoreCase) -> boolean()OTP 20.0
equal(A, B, IgnoreCase, Norm) -> boolean()OTP 20.0
Returns true if A and B are equal, otherwise false.
If IgnoreCase is true the function does casefolding on the fly before the equality test.
If Norm is not none the function applies normalization on the fly before the equality test. There are four available normalization forms: nfc, nfd, nfkc, and nfkd.
By default, IgnoreCase is false and Norm is none.
Example:
1> string:equal("åäö", <<"åäö"/utf8>>). true 2> string:equal("åäö", unicode:characters_to_nfd_binary("åäö")). false 3> string:equal("åäö", unicode:characters_to_nfd_binary("ÅÄÖ"), true, nfc). true
find(String, SearchPattern) -> unicode:chardata() | nomatchOTP 20.0
find(String, SearchPattern, Dir) -> unicode:chardata() | nomatchOTP 20.0
Removes anything before SearchPattern in String and returns the remainder of the string or nomatch if SearchPattern is not found. Dir, which can be leading or trailing, indicates from which direction characters are to be searched.
By default, Dir is leading.
Example:
1> string:find("ab..cd..ef", "."). "..cd..ef" 2> string:find(<<"ab..cd..ef">>, "..", trailing). <<"..ef">> 3> string:find(<<"ab..cd..ef">>, "x", leading). nomatch 4> string:find("ab..cd..ef", "x", trailing). nomatch
is_empty(String :: unicode:chardata()) -> boolean()OTP 20.0
Returns true if String is the empty string, otherwise false.
Example:
1> string:is_empty("foo"). false 2> string:is_empty(["",<<>>]). true
length(String :: unicode:chardata()) -> integer() >= 0OTP 20.0
Returns the number of grapheme clusters in String.
Example:
1> string:length("ß↑e̊"). 3 2> string:length(<<195,159,226,134,145,101,204,138>>). 3
lexemes(String :: unicode:chardata(),
SeparatorList :: [grapheme_cluster()]) ->
[unicode:chardata()]OTP 20.0
Returns a list of lexemes in String, separated by the grapheme clusters in SeparatorList.
Notice that, as shown in this example, two or more adjacent separator graphemes clusters in String are treated as one. That is, there are no empty strings in the resulting list of lexemes. See also split/3 which returns empty strings.
Notice that [$\r,$\n] is one grapheme cluster.
Example:
1> string:lexemes("abc de̊fxxghix jkl\r\nfoo", "x e" ++ [[$\r,$\n]]). ["abc","de̊f","ghi","jkl","foo"] 2> string:lexemes(<<"abc de̊fxxghix jkl\r\nfoo"/utf8>>, "x e" ++ [$\r,$\n]). [<<"abc">>,<<"de̊f"/utf8>>,<<"ghi">>,<<"jkl\r\nfoo">>]
lowercase(String :: unicode:chardata()) -> unicode:chardata()OTP 20.0
Converts String to lowercase.
Notice that function casefold/1 should be used when converting a string to be tested for equality.
Example:
2> string:lowercase(string:uppercase("Michał")).
"michał"
next_codepoint(String :: unicode:chardata()) ->
maybe_improper_list(char(), unicode:chardata()) |
{error, unicode:chardata()}OTP 20.0
Returns the first codepoint in String and the rest of String in the tail. Returns an empty list if String is empty or an {error, String} tuple if the next byte is invalid.
Example:
1> string:next_codepoint(unicode:characters_to_binary("e̊fg")).
[101|<<"̊fg"/utf8>>]
next_grapheme(String :: unicode:chardata()) ->
maybe_improper_list(grapheme_cluster(),
unicode:chardata()) |
{error, unicode:chardata()}OTP 20.0
Returns the first grapheme cluster in String and the rest of String in the tail. Returns an empty list if String is empty or an {error, String} tuple if the next byte is invalid.
Example:
1> string:next_grapheme(unicode:characters_to_binary("e̊fg")).
["e̊"|<<"fg">>]
nth_lexeme(String, N, SeparatorList) -> unicode:chardata()OTP 20.0
Returns lexeme number N in String, where lexemes are separated by the grapheme clusters in SeparatorList.
Example:
1> string:nth_lexeme("abc.de̊f.ghiejkl", 3, ".e").
"ghi"
pad(String, Length) -> unicode:charlist()OTP 20.0
pad(String, Length, Dir) -> unicode:charlist()OTP 20.0
pad(String, Length, Dir, Char) -> unicode:charlist()OTP 20.0
Types
Pads String to Length with grapheme cluster Char. Dir, which can be leading, trailing, or both, indicates where the padding should be added.
By default, Char is $\s and Dir is trailing.
Example:
1> string:pad(<<"He̊llö"/utf8>>, 8). [<<72,101,204,138,108,108,195,182>>,32,32,32] 2> io:format("'~ts'~n",[string:pad("He̊llö", 8, leading)]). ' He̊llö' 3> io:format("'~ts'~n",[string:pad("He̊llö", 8, both)]). ' He̊llö '
prefix(String :: unicode:chardata(), Prefix :: unicode:chardata()) ->
nomatch | unicode:chardata()OTP 20.0
If Prefix is the prefix of String, removes it and returns the remainder of String, otherwise returns nomatch.
Example:
1> string:prefix(<<"prefix of string">>, "pre"). <<"fix of string">> 2> string:prefix("pre", "prefix"). nomatch
replace(String, SearchPattern, Replacement) ->
[unicode:chardata()]OTP 20.0
replace(String, SearchPattern, Replacement, Where) ->
[unicode:chardata()]OTP 20.0
Replaces SearchPattern in String with Replacement. Where, default leading, indicates whether the leading, the trailing or all encounters of SearchPattern are to be replaced.
Can be implemented as:
lists:join(Replacement, split(String, SearchPattern, Where)).
Example:
1> string:replace(<<"ab..cd..ef">>, "..", "*"). [<<"ab">>,"*",<<"cd..ef">>] 2> string:replace(<<"ab..cd..ef">>, "..", "*", all). [<<"ab">>,"*",<<"cd">>,"*",<<"ef">>]
reverse(String :: unicode:chardata()) -> [grapheme_cluster()]OTP 20.0
Returns the reverse list of the grapheme clusters in String.
Example:
1> Reverse = string:reverse(unicode:characters_to_nfd_binary("ÅÄÖ")). [[79,776],[65,776],[65,778]] 2> io:format("~ts~n",[Reverse]). ÖÄÅ
slice(String, Start) -> SliceOTP 20.0
slice(String, Start, Length) -> SliceOTP 20.0
Types
Returns a substring of String of at most Length grapheme clusters, starting at position Start.
By default, Length is infinity.
Example:
1> string:slice(<<"He̊llö Wörld"/utf8>>, 4). <<"ö Wörld"/utf8>> 2> string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,4). "ö Wö" 3> string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,50). "ö Wörld"
split(String, SearchPattern) -> [unicode:chardata()]OTP 20.0
split(String, SearchPattern, Where) -> [unicode:chardata()]OTP 20.0
Splits String where SearchPattern is encountered and return the remaining parts. Where, default leading, indicates whether the leading, the trailing or all encounters of SearchPattern will split String.
Example:
0> string:split("ab..bc..cd", ".."). ["ab","bc..cd"] 1> string:split(<<"ab..bc..cd">>, "..", trailing). [<<"ab..bc">>,<<"cd">>] 2> string:split(<<"ab..bc....cd">>, "..", all). [<<"ab">>,<<"bc">>,<<>>,<<"cd">>]
take(String, Characters) -> {Leading, Trailing}OTP 20.0
take(String, Characters, Complement) -> {Leading, Trailing}OTP 20.0
take(String, Characters, Complement, Dir) -> {Leading, Trailing}OTP 20.0
Types
Takes characters from String as long as the characters are members of set Characters or the complement of set Characters. Dir, which can be leading or trailing, indicates from which direction characters are to be taken.
Example:
5> string:take("abc0z123", lists:seq($a,$z)). {"abc","0z123"} 6> string:take(<<"abc0z123">>, lists:seq($0,$9), true, leading). {<<"abc">>,<<"0z123">>} 7> string:take("abc0z123", lists:seq($0,$9), false, trailing). {"abc0z","123"} 8> string:take(<<"abc0z123">>, lists:seq($a,$z), true, trailing). {<<"abc0z">>,<<"123">>}
titlecase(String :: unicode:chardata()) -> unicode:chardata()OTP 20.0
Converts String to titlecase.
Example:
1> string:titlecase("ß is a SHARP s").
"Ss is a SHARP s"
to_float(String) -> {Float, Rest} | {error, Reason}
Types
Argument String is expected to start with a valid text represented float (the digits are ASCII values). Remaining characters in the string after the float are returned in Rest.
Example:
> {F1,Fs} = string:to_float("1.0-1.0e-1"), > {F2,[]} = string:to_float(Fs), > F1+F2. 0.9 > string:to_float("3/2=1.5"). {error,no_float} > string:to_float("-1.5eX"). {-1.5,"eX"}
to_integer(String) -> {Int, Rest} | {error, Reason}
Types
Argument String is expected to start with a valid text represented integer (the digits are ASCII values). Remaining characters in the string after the integer are returned in Rest.
Example:
> {I1,Is} = string:to_integer("33+22"), > {I2,[]} = string:to_integer(Is), > I1-I2. 11 > string:to_integer("0.5"). {0,".5"} > string:to_integer("x=2"). {error,no_integer}
to_graphemes(String :: unicode:chardata()) -> [grapheme_cluster()]OTP 20.0
Converts String to a list of grapheme clusters.
Example:
1> string:to_graphemes("ß↑e̊"). [223,8593,[101,778]] 2> string:to_graphemes(<<"ß↑e̊"/utf8>>). [223,8593,[101,778]]
trim(String) -> unicode:chardata()OTP 20.0
trim(String, Dir) -> unicode:chardata()OTP 20.0
trim(String, Dir, Characters) -> unicode:chardata()OTP 20.0
Returns a string, where leading or trailing, or both, Characters have been removed. Dir which can be leading, trailing, or both, indicates from which direction characters are to be removed.
Default Characters is the set of nonbreakable whitespace codepoints, defined as Pattern_White_Space in Unicode Standard Annex #31. By default, Dir is both.
Notice that [$\r,$\n] is one grapheme cluster according to the Unicode Standard.
Example:
1> string:trim("\t Hello \n"). "Hello" 2> string:trim(<<"\t Hello \n">>, leading). <<"Hello \n">> 3> string:trim(<<".Hello.\n">>, trailing, "\n."). <<".Hello">>
uppercase(String :: unicode:chardata()) -> unicode:chardata()OTP 20.0
Converts String to uppercase.
See also titlecase/1.
Example:
1> string:uppercase("Michał").
"MICHAŁ"
Obsolete API functions
centre(String, Number) -> Centered
centre(String, Number, Character) -> Centered
chars(Character, Number) -> String
chars(Character, Number, Tail) -> String
Returns a string consisting of Number characters Character. Optionally, the string can end with string Tail.
This function is obsolete. Use lists:duplicate/2.
chr(String, Character) -> Index
concat(String1, String2) -> String3
Concatenates String1 and String2 to form a new string String3, which is returned.
This function is obsolete. Use [String1, String2] as Data argument, and call unicode:characters_to_list/2 or unicode:characters_to_binary/2 to flatten the output.
copies(String, Number) -> Copies
Returns a string containing String repeated Number times.
This function is obsolete. Use lists:duplicate/2.
cspan(String, Chars) -> Length
join(StringList, Separator) -> String
Returns a string with the elements of StringList separated by the string in Separator.
This function is obsolete. Use lists:join/2.
Example:
> join(["one", "two", "three"], ", "). "one, two, three"
left(String, Number) -> Left
left(String, Number, Character) -> Left
len(String) -> Length
rchr(String, Character) -> Index
right(String, Number) -> Right
right(String, Number, Character) -> Right
rstr(String, SubString) -> Index
span(String, Chars) -> Length
str(String, SubString) -> Index
strip(String :: string()) -> string()
strip(String, Direction) -> Stripped
strip(String, Direction, Character) -> Stripped
Returns a string, where leading or trailing, or both, blanks or a number of Character have been removed. Direction, which can be left, right, or both, indicates from which direction blanks are to be removed. strip/1 is equivalent to strip(String, both).
This function is obsolete. Use trim/3.
Example:
> string:strip("...Hello.....", both, $.). "Hello"
sub_string(String, Start) -> SubString
sub_string(String, Start, Stop) -> SubString
substr(String, Start) -> SubString
substr(String, Start, Length) -> SubString
sub_word(String, Number) -> Word
sub_word(String, Number, Character) -> Word
Returns the word in position Number of String. Words are separated by blanks or Characters.
This function is obsolete. Use nth_lexeme/3.
Example:
> string:sub_word(" Hello old boy !",3,$o). "ld b"
to_lower(String) -> Result
to_lower(Char) -> CharResult
to_upper(String) -> Result
to_upper(Char) -> CharResult
The specified string or character is case-converted. Notice that the supported character set is ISO/IEC 8859-1 (also called Latin 1); all values outside this set are unchanged
This function is obsolete use lowercase/1, uppercase/1, titlecase/1 or casefold/1.
tokens(String, SeparatorList) -> Tokens
Returns a list of tokens in String, separated by the characters in SeparatorList.
Example:
> tokens("abc defxxghix jkl", "x "). ["abc", "def", "ghi", "jkl"]
Notice that, as shown in this example, two or more adjacent separator characters in String are treated as one. That is, there are no empty strings in the resulting list of tokens.
words(String) -> Count
words(String, Character) -> Count
Notes
Some of the general string functions can seem to overlap each other. The reason is that this string package is the combination of two earlier packages and all functions of both packages have been retained.