<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hello Dan,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Thanks for your answer; indeed,
rewriting most uses of string:rstr/2 in terms of string:split/3
and similar functions should be possible, and I understand that
each string traversal is now more expensive.</div>
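<p>For instance, the replace_extension/3 function of the original
message might be rewritten along these lines with string:split/3.
This is only a minimal sketch: the module name is hypothetical, all
three arguments are assumed to be plain strings (lists), and, like
the rstr/2-based version, it drops anything found after the last
occurrence of the source extension:</p>
<pre>-module(extension_sketch).
-export([replace_extension/3]).

% Returns the specified file path once its last occurrence of the source
% extension (and anything after it) has been replaced by the target one.
%
% Assumption: all arguments are plain strings (lists), so that '++' applies.
%
-spec replace_extension( string(), string(), string() ) -> string().
replace_extension( FilePath, SourceExtension, TargetExtension ) ->
    % With 'trailing', the split happens at the last occurrence:
    case string:split( FilePath, SourceExtension, trailing ) of
        [ Prefix, _Suffix ] ->
            Prefix ++ TargetExtension;
        % A single-element list means that the extension was not found:
        [ _Unsplit ] ->
            throw( { extension_not_found, SourceExtension, FilePath } )
    end.</pre>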
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Yet the newer API could perhaps first be
enriched so that it offers services similar to those of the previous
Latin1 one (albeit, admittedly, a bit inefficiently) by applying
string:to_graphemes/1 to all unicode:chardata() parameters of these
functions: one could then iterate on grapheme clusters just as we
iterated on a plain list of Latin1 characters. I think that index
look-up (which is often useful) could even be done by (ab)using the
current implementation of string:str/2 and string:rstr/2, which does
not seem to care much whether it deals with characters or, here,
grapheme clusters.</div>
<div class="moz-cite-prefix"><a><br>
</a></div>
<div class="moz-cite-prefix"><a>From this perspective, if made
consistent with the newer kind of indexes used by slice/{2,3}
(integers that start at zero and that actually count grapheme
clusters [1]), Unicode-aware replacements for string:str/2 and
string:rstr/2 could look like [2].</a></div>
<div class="moz-cite-prefix"><a><br>
</a></div>
<div class="moz-cite-prefix"><a>They seem to work correctly:</a></div>
<div class="moz-cite-prefix">
<pre><a>> String = <<"He̊llö Wörld"/utf8>>.
> find_substring_index( String, "Nöpe̊" ).
nomatch
> find_substring_index( String, "lö", leading ).
3
> find_substring_index( String, "lö", trailing ).
3
> find_substring_index( String, "ö", leading ).
4
> find_substring_index( String, "ö", trailing ).
7
</a></pre>
</div>
<p>Best regards,<br>
</p>
<p>Olivier.</p>
<p><br>
</p>
<div class="moz-cite-prefix"><a>[1] Other languages offer similar
substring index look-up (e.g.
https://docs.python.org/3/library/stdtypes.html#str.find), but at
least some operate on codepoints rather than on grapheme
clusters. Maybe there is room for both, yet grapheme clusters, even
if they are more complex, seem to me more appropriate for
many use cases?<br>
</a></div>
<p>[2] Corresponding code:</p>
<pre>% Index in a Unicode string, in terms of grapheme clusters (i.e. not
% codepoints, not bytes).
%
-type gc_index() :: non_neg_integer().


% Returns the index, in terms of grapheme clusters, of the first occurrence of
% the specified pattern substring (if any) in the specified string.
%
-spec find_substring_index( unicode:chardata(), unicode:chardata() ) ->
          gc_index() | 'nomatch'.
find_substring_index( String, SearchPattern ) ->
    find_substring_index( String, SearchPattern, _Direction=leading ).


% Returns the index, in terms of grapheme clusters, of the first or last
% occurrence (depending on the specified direction) of the specified pattern
% substring (if any) in the specified string.
%
-spec find_substring_index( unicode:chardata(), unicode:chardata(),
                            string:direction() ) -> gc_index() | 'nomatch'.
find_substring_index( String, SearchPattern, Direction ) ->

    GCString = string:to_graphemes( String ),
    GCSearchPattern = string:to_graphemes( SearchPattern ),

    % str/2 and rstr/2 return a 1-based position, or 0 if there is no match:
    PseudoIndex = case Direction of

        leading ->
            string:str( GCString, GCSearchPattern );

        trailing ->
            string:rstr( GCString, GCSearchPattern )

    end,

    case PseudoIndex of

        0 ->
            nomatch;

        % Indexes of grapheme clusters are to start at 0, not 1:
        I ->
            I-1

    end.</pre>
<br>
<p>Notes:</p>
<p>- this is probably not very efficient, but it may be replaced later
by an optimised version, with no API change for the user code<br>
</p>
<p>- the point is to reuse the *code* of string:str/2 and
string:rstr/2 (even if these functions are to disappear from the
API)</p>
<p>- maybe such functions could also operate directly on [<span
class="bold_code bc-17"><a>grapheme_cluster()</a></span>], to
avoid too many conversions<br>
</p>
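<p>Regarding the first two notes, an index could presumably also be
obtained without relying on str/2 and rstr/2 at all, by combining
string:find/3 and string:length/1, which both belong to the newer
API. The following is only a minimal sketch: the module and function
names are hypothetical, and it traverses the string several times,
so it is not expected to be any faster:</p>
<pre>-module(gc_index_sketch).
-export([find_substring_index_alt/3]).

% Returns the zero-based index, in terms of grapheme clusters, of the first or
% last occurrence of the specified pattern in the specified string, if any.
%
-spec find_substring_index_alt( unicode:chardata(), unicode:chardata(),
                                string:direction() ) ->
          non_neg_integer() | 'nomatch'.
find_substring_index_alt( String, SearchPattern, Direction ) ->

    case string:find( String, SearchPattern, Direction ) of

        nomatch ->
            nomatch;

        % find/3 returns the string from the match onward, so the index is
        % the number of grapheme clusters located before that remainder:
        Remainder ->
            string:length( String ) - string:length( Remainder )

    end.</pre>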
<p><br>
</p>
<div class="moz-cite-prefix">On 4/7/21 at 9:03 AM, Dan Gudmundsson
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CANX4uuO67mn09z3VRiDZB3Pmkbc+AdJUkVKVn74DEtMxk3AKvA@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">There are currently no replacements for those
functions,
<div>the thought was that it is now a lot more expensive to
traverse the string, so you should only traverse it once,</div>
<div>or you should at least think about it instead of just
switching to the new API.</div>
<div><br>
</div>
<div>I believe 'string:split(File, Ext, trailing)' or why not
'string:replace(File, OldExt, NewExt, trailing)'<br>
does what you want in this case, or you could use the
'filename' module for handling filenames.</div>
<div><br>
</div>
<div>But yes the string api could be extended with a function or
two.</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Apr 6, 2021 at 11:29
PM Olivier Boudeville <<a
href="mailto:olivier.boudeville@online.fr"
moz-do-not-send="true">olivier.boudeville@online.fr</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
It must be a silly question, but, since the Latin1 ->
Unicode switch in <br>
OTP 20.0, is there a (non-obsolete) way in the string module
to look up <br>
the index of a string within another one, i.e. to find the
location of a <br>
given substring?<br>
<br>
rstr/2 is supposed to be replaced with find/3, yet the former
returns an <br>
index whereas the latter returns a part of the original
string. I could <br>
not find a way to obtain a relevant index with any of the
newer string <br>
functions - whereas I would guess it is a fairly common need?<br>
<br>
To give a bit more context, the goal was to prevent the
implementation <br>
of [1] from becoming obsolete; string:substr/3 and
string:sub_string/3 <br>
are flagged as obsolete and may be replaced by slice/3 (see
[2]); yet <br>
what can be done for rstr/2?<br>
<br>
(even if a smart use of some function was found to address the
<br>
particular need of this replace_extension/3 function,
obtaining indexes <br>
of substrings would still be useful in many cases, wouldn't it?)<br>
<br>
Thanks in advance for any hint!<br>
<br>
Best regards,<br>
<br>
Olivier.<br>
<br>
<br>
[1] Soon obsolete apparently:<br>
<br>
% Returns a new filename whose extension has been updated.<br>
%<br>
% Ex: replace_extension("/home/jack/rosie.ttf", ".ttf",
".wav") should <br>
return<br>
% "/home/jack/rosie.wav".<br>
%<br>
-spec replace_extension( file_path(), extension(), extension()
) -> <br>
file_path().<br>
replace_extension( FilePath, SourceExtension, TargetExtension
) -><br>
<br>
case string:rstr( FilePath, SourceExtension ) of<br>
<br>
0 -><br>
throw( { extension_not_found, SourceExtension,
FilePath } );<br>
<br>
Index -><br>
string:substr( FilePath, 1, Index-1 ) ++
TargetExtension<br>
<br>
end.<br>
<br>
<br>
[2] BTW there is a change in the indexing convention that
could be <br>
better advertised in the doc:<br>
<br>
> string:substr("abc",1).<br>
"abc"<br>
<br>
> string:sub_string("abc",1).<br>
"abc"<br>
<br>
> string:slice("abc",1).<br>
"bc"<br>
<br>
> string:slice("abc",0).<br>
"abc"<br>
<br>
-- <br>
Olivier Boudeville<br>
<br>
</blockquote>
</div>
</blockquote>
<p><br>
</p>
<pre class="moz-signature" cols="72">--
Olivier Boudeville
</pre>
</body>
</html>