Substring look-up

Olivier Boudeville olivier.boudeville@REDACTED
Wed Apr 7 12:37:10 CEST 2021


Hello Dan,

Thanks for your answer; indeed rewriting most uses of string:rstr/2 in 
terms of string:split/3 and others should be possible, and I understand 
that each string traversal is now more expensive.
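
For the record, here is a minimal sketch of how the replace_extension/3 
function quoted below might be rewritten with string:split/3. The module 
name is hypothetical, and the sketch assumes that file paths and 
extensions are plain lists (not binaries):

```erlang
-module(replace_ext_sketch).
-export([replace_extension/3]).

%% Hypothetical rewrite of replace_extension/3 relying on string:split/3
%% rather than on string:rstr/2 and string:substr/3.
%% Assumes that all arguments are plain (list-based) strings.
-spec replace_extension(string(), string(), string()) -> string().
replace_extension(FilePath, SourceExtension, TargetExtension) ->
    %% 'trailing' splits at the last occurrence of the pattern.
    case string:split(FilePath, SourceExtension, trailing) of

        %% As in the original version, anything located after the last
        %% occurrence of the source extension is dropped.
        [Prefix, _AfterExtension] ->
            Prefix ++ TargetExtension;

        %% A single-element list means that the pattern was not found.
        [_Unsplit] ->
            throw({extension_not_found, SourceExtension, FilePath})

    end.
```

This traverses the path only once and needs no index at all, which is 
presumably the kind of rewriting that was suggested.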

Yet maybe the newer API could at first be enriched so that it offers 
services similar to those of the previous Latin1 one (albeit admittedly 
a bit inefficiently), by applying string:to_graphemes/1 to all the 
unicode:chardata() parameters of these functions: one could then 
iterate on grapheme clusters like we iterated on a plain list of Latin1 
characters. I think that index look-up (which is often useful) could 
even be done by (ab)using the code of the current implementation of 
string:str/2 and string:rstr/2 (which do not seem to care much about 
whether they deal with characters or, here, grapheme clusters).
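
To illustrate what string:to_graphemes/1 returns, here is a small 
sketch (the module and function names are hypothetical) in which the 
combining character is given as an explicit codepoint, so that the 
example does not depend on how the source file is encoded:

```erlang
-module(gc_sketch).
-export([demo/0]).

%% Returns the number of codepoints, the number of grapheme clusters and
%% the clusters themselves for a short string.
demo() ->
    %% "e" followed by U+030A (combining ring above), then "l":
    %% three codepoints, yet only two grapheme clusters.
    Codepoints = [$e, 16#30A, $l],
    Clusters = string:to_graphemes(Codepoints),
    %% Each element of Clusters is either a single codepoint or a list
    %% of codepoints (when a cluster spans several of them).
    {length(Codepoints), length(Clusters), Clusters}.
```

As each cluster is a single element of the resulting list, functions 
that only compare list elements can walk such a list just as they walk 
a list of Latin1 characters.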

In this respect, if put in line with the newer kind of indexes suggested 
by slice/{2,3} (integers that start at zero and that actually count 
grapheme clusters [1]), Unicode-aware replacements for string:str/2 and 
string:rstr/2 could be akin to [2].

They seem to work correctly:

 > String = <<"He̊llö Wörld"/utf8>>.

 > find_substring_index( String, "Nöpe̊" ).
nomatch

 > find_substring_index( String, "lö", leading ).
3

 > find_substring_index( String, "lö", trailing ).
3

 > find_substring_index( String, "ö", leading ).
4

 > find_substring_index( String, "ö", trailing ).
7

Best regards,

Olivier.


[1] Other languages offer similar substring index look-up (e.g. 
https://docs.python.org/3/library/stdtypes.html#str.find), but at least 
some of them operate on codepoints rather than on grapheme clusters. 
Maybe there is room for both, yet grapheme clusters, even if they are 
more complex, seem to me more appropriate for many use cases?

[2] Corresponding code:

% Index in a Unicode string, in terms of grapheme clusters (ex: not codepoints,
% not bytes).
%
-type gc_index() :: non_neg_integer().


% Returns the index, in terms of grapheme clusters, of the first occurrence of
% the specified pattern substring (if any) in the specified string.
%
-spec find_substring_index( unicode:chardata(), unicode:chardata() ) ->
                                     gc_index() | 'nomatch'.
find_substring_index( String, SearchPattern ) ->
     find_substring_index( String, SearchPattern, _Direction=leading ).


% Returns the index, in terms of grapheme clusters, of the first or last
% occurrence (depending on the specified direction) of the specified pattern
% substring (if any) in the specified string.
%
-spec find_substring_index( unicode:chardata(), unicode:chardata(),
                             string:direction() ) -> gc_index() | 'nomatch'.
find_substring_index( String, SearchPattern, Direction ) ->
     GCString = string:to_graphemes( String ),
     GCSearchPattern = string:to_graphemes( SearchPattern ),
     PseudoIndex = case Direction of

         leading ->
             string:str( GCString, GCSearchPattern );

         trailing ->
             string:rstr( GCString, GCSearchPattern )

     end,

     case PseudoIndex of

         0 ->
             nomatch;

         % Indexes of grapheme clusters are to start at 0, not 1:
         I ->
             I-1

     end.


Notes:

- probably not very efficient, but may be replaced later by an optimised 
version, with no API change for the user code

- the point is to reuse the *code* of string:str/2 and string:rstr/2 
(even if this API is to disappear)

- maybe such functions could also operate directly on 
[grapheme_cluster()], to avoid too many conversions
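
Regarding the last note, a variant operating directly on lists of 
grapheme clusters could look like the sketch below. The module and 
function names are hypothetical; the logic is the same as in [2], with 
the conversions left to the caller:

```erlang
-module(gc_index_sketch).
-export([find_gc_index/3]).

%% string:str/2 and string:rstr/2 belong to the obsolete API; depending
%% on the OTP release, the compiler may warn about them.
-compile(nowarn_deprecated_function).

%% Returns the zero-based index, in terms of grapheme clusters, of the
%% first or last occurrence of the pattern, or 'nomatch'.
%% Both arguments are expected to be already-converted lists of grapheme
%% clusters, as returned by string:to_graphemes/1.
-spec find_gc_index([string:grapheme_cluster()],
                    [string:grapheme_cluster()],
                    string:direction()) -> non_neg_integer() | 'nomatch'.
find_gc_index(GCString, GCPattern, Direction) ->
    PseudoIndex = case Direction of
        leading -> string:str(GCString, GCPattern);
        trailing -> string:rstr(GCString, GCPattern)
    end,
    case PseudoIndex of
        %% The older API returns 0 when no match is found.
        0 -> nomatch;
        %% Indexes of grapheme clusters are to start at 0, not 1.
        I -> I - 1
    end.
```

A caller performing several look-ups in the same string would then 
convert it only once with string:to_graphemes/1 and reuse the result.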




On 4/7/21 at 9:03 AM, Dan Gudmundsson wrote:
> There are currently no replacements for those functions,
> the thought was that it was a lot more expensive to traverse the 
> string now so you should only traverse it once,
> or you should at least think about it instead of just replacing the 
> new api.
>
> I believe 'string:split(File, Ext, trailing)' or why not 
> 'string:replace(File, OldExt, NewExt, trailing)'
> does what you want in this case, or you could use the 
> 'filename' module for handling filenames.
>
> But yes the string api could be extended with a function or two.
>
>
> On Tue, Apr 6, 2021 at 11:29 PM Olivier Boudeville 
> <olivier.boudeville@REDACTED> wrote:
>
>     Hi,
>
>     It must be a silly question, but, since the Latin1 -> Unicode
>     switch in OTP 20.0, is there a (non-obsolete) way in the string
>     module to look-up the index of a string into another one, i.e. to
>     find the location of a given substring?
>
>     rstr/2 is supposed to be replaced with find/3, yet the former
>     returns an index whereas the latter returns a part of the original
>     string. I could not find a way to obtain a relevant index with any
>     of the newer string functions - whereas I would guess it is a
>     fairly common need?
>
>     To give a bit more context, the goal was to prevent the
>     implementation of [1] from becoming obsolete; string:substr/3 and
>     string:sub_string/3 are flagged as obsolete and may be replaced by
>     slice/3 (see [2]); yet what can be done for rstr/2?
>
>     (even if a smart use of some function was found to address the
>     particular need of this replace_extension/3 function, obtaining
>     indexes of substrings would still be useful in many cases, isn't
>     it?)
>
>     Thanks in advance for any hint!
>
>     Best regards,
>
>     Olivier.
>
>
>     [1] Soon obsolete apparently:
>
>     % Returns a new filename whose extension has been updated.
>     %
>     % Ex: replace_extension("/home/jack/rosie.ttf", ".ttf", ".wav")
>     % should return "/home/jack/rosie.wav".
>     %
>     -spec replace_extension( file_path(), extension(), extension() ) ->
>     file_path().
>     replace_extension( FilePath, SourceExtension, TargetExtension ) ->
>
>          case string:rstr( FilePath, SourceExtension ) of
>
>              0 ->
>                  throw( { extension_not_found, SourceExtension,
>                           FilePath } );
>
>              Index ->
>                  string:substr( FilePath, 1, Index-1 ) ++ TargetExtension
>
>          end.
>
>
>     [2] BTW there is a change in the indexing convention that could be
>     better advertised in the doc:
>
>      > string:substr("abc",1).
>     "abc"
>
>      > string:sub_string("abc",1).
>     "abc"
>
>      > string:slice("abc",1).
>     "bc"
>
>      > string:slice("abc",0).
>     "abc"
>
>     -- 
>     Olivier Boudeville
>

-- 
Olivier Boudeville
