Substring look-up
Olivier Boudeville
olivier.boudeville@REDACTED
Wed Apr 7 13:35:37 CEST 2021
Hello Craig,
Thanks for your answer and for the regex-based solution.
Regarding unicode:characters_to_list/1, my understanding is that it
would mean operating on codepoints rather than on grapheme clusters, and
I suppose that this *may* result in unintended matches when searching
for clusters, i.e. "user-perceived characters" (often the ones that
matter or make sense?).
For example, if grapheme clusters such as GC1=[A,B], GC2=[C,D] and
GC3=[B,C] existed (where the A, B, C and D variables are codepoints),
then searching for a substring that would contain only the GC3 cluster
in a flattened string containing GC1 then GC2 (i.e. [A,B,C,D]) would
match, whereas it should not (unless such a case is known to be
impossible by design?).
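
A minimal sketch of the concern, assuming OTP 20 or later (where
string:to_graphemes/1 is available). The module name gc_demo and the
functions cp_find/2, gc_find/2 and sublist_of/2 are hypothetical
helpers written for this illustration only; they are not part of OTP.
The example uses "e" followed by the combining acute accent U+0301,
which forms a single grapheme cluster, so that a search for the plain
"cafe" succeeds on codepoints but not on clusters. Flag emoji, which
are pairs of regional indicator codepoints, look like a real-world
instance of the [A,B], [C,D] versus [B,C] overlap described above.

```erlang
-module(gc_demo).
-export([cp_find/2, gc_find/2]).

%% Codepoint-level search: both arguments are flattened to lists of
%% codepoints, so cluster boundaries are ignored.
cp_find(Haystack, Needle) ->
    sublist_of(unicode:characters_to_list(Needle),
               unicode:characters_to_list(Haystack)).

%% Grapheme-cluster-level search: both arguments are split into
%% clusters first (a multi-codepoint cluster is returned as a nested
%% list), so a match can only start and end on cluster boundaries.
gc_find(Haystack, Needle) ->
    sublist_of(string:to_graphemes(Needle),
               string:to_graphemes(Haystack)).

%% Naive check that Sub occurs as a contiguous run of elements in List.
sublist_of(Sub, List) ->
    lists:prefix(Sub, List) orelse
        (List =/= [] andalso sublist_of(Sub, tl(List))).
```

With these definitions, gc_demo:cp_find("cafe\x{301}", "cafe") returns
true whereas gc_demo:gc_find("cafe\x{301}", "cafe") returns false,
since the final "e" of the haystack belongs to the cluster
[$e, 16#301]. The naive traversal is only meant to show the
difference, not to be efficient.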
(disclaimer: I do not claim to understand Unicode correctly; I just
need to cope with various input filenames, not even mentioning the
so-called "raw" ones ;-))
Best regards,
Olivier.
On 4/7/21 at 12:42 PM, zxq9 wrote:
> On 2021/04/07 19:37, Olivier Boudeville wrote:
>> Hello Dan,
>>
>> Thanks for your answer; indeed rewriting most uses of string:rstr/2
>> in terms of string:split/3 and others should be possible, and each
>> string traversal must be expensive now.
>
> Sometimes going the neanderthal route is a simplification:
>
> 1. unicode:characters_to_list/1
> 2. write a custom function to iterate as a list the original way
>
> The more complex the original representation and more interesting the
> sort of work you want done the more this approach tends to save me in
> both cognitive and processing overhead. My case may be highly
> optimized for this, though, as I usually deal with English, German and
> Japanese and rarely any other text input languages -- some input forms
> for other languages can get pretty interesting and probably don't map
> as well to the concept of "characters" after conversion.
>
> -Craig
--
Olivier Boudeville