<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">Hello Dan,</div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">Thanks for your answer. Indeed,
      rewriting most uses of string:rstr/2 in terms of string:split/3
      and similar functions should be possible, and I understand that
      each string traversal is more expensive now.</div>
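    <p>For instance, the replace_extension/3 helper quoted at the end
      of this message might be rewritten along these lines. This is
      only a sketch: the ext_sketch module name is made up for the
      example, the input is assumed to be valid Unicode, and, like the
      rstr/2-based version, it drops whatever follows the last
      occurrence of the source extension.</p>
    <pre>-module(ext_sketch).
-export([replace_extension/3]).

% Returns a flat string in which the last occurrence of SourceExtension (and
% anything after it) is replaced by TargetExtension.
%
% Throws if SourceExtension does not occur in FilePath.
%
replace_extension( FilePath, SourceExtension, TargetExtension ) ->
    case string:split( FilePath, SourceExtension, trailing ) of

        % A match was found; the part after it is dropped, as before:
        [ Prefix, _Suffix ] ->
            unicode:characters_to_list( [ Prefix, TargetExtension ] );

        % No split happened, hence no occurrence:
        [ _Unsplit ] ->
            throw( { extension_not_found, SourceExtension, FilePath } )

    end.</pre>
    <p>With it, replace_extension( "/home/jack/rosie.ttf", ".ttf",
      ".wav" ) returns "/home/jack/rosie.wav", without needing any
      index.</p>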

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">Yet the newer API could perhaps first
      be enriched so that it offers services similar to the previous
      Latin1 one (albeit, admittedly, a bit inefficiently) by applying
      string:to_graphemes/1 to all unicode:chardata() parameters of
      these functions: one could then iterate on grapheme clusters just
      as we iterated on a plain list of Latin1 characters. I think that
      index look-up (which is often useful) could even be done by
      (ab)using the current implementation of string:str/2 and
      string:rstr/2, which does not seem to care much whether it deals
      with characters or, here, grapheme clusters.</div>
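    <p>To illustrate the iteration idea, here is a minimal sketch
      (gc_iter_sketch and count_grapheme/2 are hypothetical names, not
      part of the string module) that walks the list returned by
      string:to_graphemes/1 just as one would walk a plain list of
      characters:</p>
    <pre>-module(gc_iter_sketch).
-export([count_grapheme/2]).

% Counts how many times the specified grapheme cluster occurs in String.
%
% Fails with badmatch if Grapheme is not exactly one grapheme cluster.
%
count_grapheme( String, Grapheme ) ->
    [ GC ] = string:to_graphemes( Grapheme ),
    length( [ G || G <- string:to_graphemes( String ), G =:= GC ] ).</pre>
    <p>For example, count_grapheme( [$o,16#308,$x,$o,16#308,$o],
      [$o,16#308] ) returns 2: the final plain "o" is a different
      grapheme cluster from "o" followed by the combining diaeresis
      (U+0308).</p>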

    <div class="moz-cite-prefix"><a><br>

      </a></div>

    <div class="moz-cite-prefix">With this in mind, and if aligned with
      the newer kind of indexes introduced by slice/{2,3} (integers
      that start at zero and that count grapheme clusters [1]),
      Unicode-aware replacements for string:str/2 and string:rstr/2
      could look like [2].</div>

    <div class="moz-cite-prefix"><a><br>

      </a></div>

    <div class="moz-cite-prefix"><a>They seem to work correctly:</a></div>

    <div class="moz-cite-prefix">

      <pre>> String = <<"He̊llö Wörld"/utf8>>.
> find_substring_index( String, "Nöpe̊" ).
nomatch
> find_substring_index( String, "lö", leading ).
3
> find_substring_index( String, "lö", trailing ).
3
> find_substring_index( String, "ö", leading ).
4
> find_substring_index( String, "ö", trailing ).
7
</pre>

    </div>
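    <p>Since such indexes follow the slice/{2,3} convention, they can
      be fed directly to string:slice. The sketch below assumes this;
      slice_sketch, split_at_match/2 and gc_index/2 are names invented
      for the example, the latter being a condensed, leading-only
      variant of the look-up in [2].</p>
    <pre>-module(slice_sketch).
-export([split_at_match/2]).

% Returns { Before, FromMatch }, i.e. String cut at the first occurrence of
% Pattern, or 'nomatch' if Pattern does not occur.
%
split_at_match( String, Pattern ) ->
    case gc_index( String, Pattern ) of

        nomatch ->
            nomatch;

        % Zero-based grapheme cluster index, as expected by slice/{2,3}:
        I ->
            { string:slice( String, 0, I ), string:slice( String, I ) }

    end.

% Zero-based index of the first occurrence of Pattern, in grapheme clusters.
gc_index( String, Pattern ) ->
    case string:str( string:to_graphemes( String ),
                     string:to_graphemes( Pattern ) ) of

        0 ->
            nomatch;

        I ->
            I-1

    end.</pre>
    <p>For example, split_at_match( "Hello World", "lo" ) returns
      { "Hel", "lo World" }.</p>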

    <p>Best regards,<br>

    </p>

    <p>Olivier.</p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">[1] Other languages offer a similar
      substring index look-up (e.g.
      https://docs.python.org/3/library/stdtypes.html#str.find), but at
      least some of them operate on codepoints rather than on grapheme
      clusters. Maybe there is room for both, yet grapheme clusters,
      even if they are more complex, seem to me more appropriate for
      many use cases.<br>
    </div>
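    <p>The difference between the two kinds of indexes can be sketched
      as follows (index_units_sketch and indexes/2 are made-up names for
      the illustration; in both cases string:str/2 is simply applied to
      a plain list, of codepoints or of grapheme clusters):</p>
    <pre>-module(index_units_sketch).
-export([indexes/2]).

% Returns { CodepointIndex, GraphemeClusterIndex } for the first occurrence of
% Pattern in String; both are zero-based, or 'nomatch'.
%
indexes( String, Pattern ) ->
    CpIndex = zero_based( string:str( unicode:characters_to_list( String ),
                                      unicode:characters_to_list( Pattern ) ) ),
    GcIndex = zero_based( string:str( string:to_graphemes( String ),
                                      string:to_graphemes( Pattern ) ) ),
    { CpIndex, GcIndex }.

% Converts the one-based result of str/2 (0 meaning "not found"):
zero_based( 0 ) ->
    nomatch;

zero_based( I ) ->
    I-1.</pre>
    <p>For example, indexes( [$e,16#30A,$x], "x" ) returns { 2, 1 }:
      "e" followed by the combining ring (U+030A) counts as two
      codepoints but as a single grapheme cluster.</p>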


    <p>[2] Corresponding code:</p>

    <pre>% Index in a Unicode string, in terms of grapheme clusters (i.e. not codepoints,
% not bytes).
%
-type gc_index() :: non_neg_integer().


% Returns the index, in terms of grapheme clusters, of the first occurrence of
% the specified pattern substring (if any) in the specified string.
%
-spec find_substring_index( unicode:chardata(), unicode:chardata() ) ->
                                    gc_index() | 'nomatch'.
find_substring_index( String, SearchPattern ) ->
    find_substring_index( String, SearchPattern, _Direction=leading ).


% Returns the index, in terms of grapheme clusters, of the first or last
% occurrence (depending on the specified direction) of the specified pattern
% substring (if any) in the specified string.
%
-spec find_substring_index( unicode:chardata(), unicode:chardata(),
                            string:direction() ) -> gc_index() | 'nomatch'.
find_substring_index( String, SearchPattern, Direction ) ->

    GCString = string:to_graphemes( String ),
    GCSearchPattern = string:to_graphemes( SearchPattern ),

    % str/2 and rstr/2 return a one-based position, or 0 if not found:
    PseudoIndex = case Direction of

        leading ->
            string:str( GCString, GCSearchPattern );

        trailing ->
            string:rstr( GCString, GCSearchPattern )

    end,

    case PseudoIndex of

        0 ->
            nomatch;

        % Indexes of grapheme clusters are to start at 0, not 1:
        I ->
            I-1

    end.</pre>

    <br>

    <p>Notes:</p>

    <p>- probably not very efficient, but may be replaced later by an

      optimised version, with no API change for the user code<br>

    </p>

    <p>- the point is to reuse the *code* of string:str/2 and

      string:rstr/2 (even if this API is to disappear)</p>

    <p>- maybe such functions could also operate directly on
      [grapheme_cluster()], to avoid too many conversions<br>
    </p>
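    <p>As a rough sketch of that last note (gc_list_sketch,
      find_gc_index/3 and find_all/2 are hypothetical names), a variant
      taking already-converted grapheme cluster lists would allow a
      string to be converted once and then searched several times:</p>
    <pre>-module(gc_list_sketch).
-export([find_gc_index/3, find_all/2]).

% Same look-up as find_substring_index/3, but operating on lists of grapheme
% clusters, as returned by string:to_graphemes/1.
%
find_gc_index( GCString, GCPattern, leading ) ->
    to_gc_index( string:str( GCString, GCPattern ) );

find_gc_index( GCString, GCPattern, trailing ) ->
    to_gc_index( string:rstr( GCString, GCPattern ) ).

% Converts the one-based result of str/2 or rstr/2 (0 meaning "not found"):
to_gc_index( 0 ) ->
    nomatch;

to_gc_index( I ) ->
    I-1.

% Looks up the first occurrence of each pattern, converting String only once.
find_all( String, Patterns ) ->
    GCString = string:to_graphemes( String ),
    [ find_gc_index( GCString, string:to_graphemes( P ), leading )
      || P <- Patterns ].</pre>
    <p>For example, find_all( "Hello World", [ "He", "lo", "xyz" ] )
      returns [ 0, 3, nomatch ].</p>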

    <p><br>

    </p>

    <p><br>

    </p>


    <div class="moz-cite-prefix">On 4/7/21 at 9:03 AM, Dan Gudmundsson
      wrote:<br>
    </div>

    <blockquote type="cite"

cite="mid:CANX4uuO67mn09z3VRiDZB3Pmkbc+AdJUkVKVn74DEtMxk3AKvA@mail.gmail.com">


      <div dir="ltr">There are currently no replacements for those
        functions.
        <div>The thinking was that traversing a string is now a lot
          more expensive, so you should traverse it only once,</div>
        <div>or at least think about it rather than mechanically
          switching to the new API.</div>

        <div><br>

        </div>

        <div>I believe 'string:split(File, Ext, trailing)' or why not

          'string:replace(File, OldExt, NewExt, trailing)'<br>

          does what you want in this case, or you could use the

          'filename' module for handling filenames.</div>

        <div><br>

        </div>

        <div>But yes the string api could be extended with a function or

          two.</div>

        <div><br>

        </div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Tue, Apr 6, 2021 at 11:29

          PM Olivier Boudeville <<a

            href="mailto:olivier.boudeville@online.fr"

            moz-do-not-send="true">olivier.boudeville@online.fr</a>>

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px

          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

          <br>

          It may be a silly question, but since the Latin1 ->
          Unicode switch in OTP 20.0, is there a (non-obsolete) way in
          the string module to look up the index of a string within
          another one, i.e. to find the location of a given
          substring?<br>

          <br>

          rstr/2 is supposed to be replaced with find/3, yet the former

          returns an <br>

          index whereas the latter returns a part of the original

          string. I could <br>

          not find a way to obtain a relevant index with any of the

          newer string <br>

          functions - whereas I would guess it is a fairly common need?<br>

          <br>

          To give a bit more context, the goal was to prevent the

          implementation <br>

          of [1] from becoming obsolete; string:substr/3 and

          string:sub_string/3 <br>

          are flagged as obsolete and may be replaced by slice/3 (see

          [2]); yet <br>

          what can be done for rstr/2?<br>

          <br>

          (even if a smart use of some function was found to address the
          particular need of this replace_extension/3 function,
          obtaining indexes of substrings would still be useful in many
          cases, wouldn't it?)<br>

          <br>

          Thanks in advance for any hint!<br>

          <br>

          Best regards,<br>

          <br>

          Olivier.<br>

          <br>

          <br>

          [1] Soon obsolete apparently:<br>

          <br>

          % Returns a new filename whose extension has been updated.<br>

          %<br>

          % Ex: replace_extension("/home/jack/rosie.ttf", ".ttf",

          ".wav") should <br>

          return<br>

          % "/home/jack/rosie.wav".<br>

          %<br>

          -spec replace_extension( file_path(), extension(), extension()

          ) -> <br>

          file_path().<br>

          replace_extension( FilePath, SourceExtension, TargetExtension

          ) -><br>

          <br>

               case string:rstr( FilePath, SourceExtension ) of<br>

          <br>

                   0 -><br>

                       throw( { extension_not_found, SourceExtension,

          FilePath } );<br>

          <br>

                   Index -><br>

                       string:substr( FilePath, 1, Index-1 ) ++

          TargetExtension<br>

          <br>

               end.<br>

          <br>

          <br>

          [2] BTW there is a change in the indexing convention that

          could be <br>

          better advertised in the doc:<br>

          <br>

           > string:substr("abc",1).<br>

          "abc"<br>

          <br>

           > string:sub_string("abc",1).<br>

          "abc"<br>

          <br>

           > string:slice("abc",1).<br>

          "bc"<br>

          <br>

           > string:slice("abc",0).<br>

          "abc"<br>

          <br>

          -- <br>

          Olivier Boudeville<br>

          <br>

        </blockquote>

      </div>

    </blockquote>

    <p><br>

    </p>

    <pre class="moz-signature" cols="72">-- 

Olivier Boudeville

</pre>

  </body>

</html>