Substring look-up

Olivier Boudeville olivier.boudeville@REDACTED
Wed Apr 7 13:35:37 CEST 2021


Hello Craig,

Thanks for your answer and for the regex-based solution.

Regarding unicode:characters_to_list/1, my understanding is that it 
would mean operating on codepoints rather than on grapheme clusters, and 
I suppose that this *may* result in unintended matches if searching for 
clusters, i.e. "user-perceived characters" (often the ones that 
matter/make sense?).

For example, if grapheme clusters such as GC1=[A,B], GC2=[C,D] and 
GC3=[B,C] existed (where the A, B, C and D variables are codepoints), 
then searching for a substring that would contain only the GC3 cluster 
in a flattened string containing GC1 then GC2 (i.e. [A,B,C,D]) would 
match, whereas it should not (unless such a case is known to be 
impossible by design?).
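
To make the concern concrete, here is a minimal sketch (not a definitive implementation) that runs the same sublist search twice: once over codepoints obtained with unicode:characters_to_list/1, and once over grapheme clusters obtained with string:to_graphemes/1. The module name substring_demo and the helpers contains/2, codepoint_match/2 and grapheme_match/2 are hypothetical names chosen for illustration. The sketch assumes valid Unicode input and performs no normalization.

```erlang
-module(substring_demo).
-export([contains/2, codepoint_match/2, grapheme_match/2]).

%% True if Needle occurs as a contiguous run of elements inside Haystack.
%% Both arguments are plain lists; the elements may be codepoints or
%% grapheme clusters, depending on the caller.
contains(_Haystack, []) ->
    true;
contains([], _Needle) ->
    false;
contains([_ | Rest] = Haystack, Needle) ->
    lists:prefix(Needle, Haystack) orelse contains(Rest, Needle).

%% Codepoint-level search: flatten both inputs with
%% unicode:characters_to_list/1, then compare codepoint by codepoint.
%% Assumption: both inputs are valid Unicode, so no error tuple is returned.
codepoint_match(Haystack, Needle) ->
    contains(unicode:characters_to_list(Haystack),
             unicode:characters_to_list(Needle)).

%% Grapheme-level search: split both inputs into grapheme clusters with
%% string:to_graphemes/1, then compare cluster by cluster.
grapheme_match(Haystack, Needle) ->
    contains(string:to_graphemes(Haystack),
             string:to_graphemes(Needle)).
```

With a decomposed "e with acute accent", i.e. [$e, 16#301], as the haystack and "e" as the needle, codepoint_match/2 returns true whereas grapheme_match/2 returns false, since the only cluster in the haystack is [$e, 16#301]. A closer analogue of the [A,B,C,D] case might be flag emoji, each of which is a pair of regional-indicator codepoints: the codepoints of the France and US flags placed side by side contain, in the middle, the pair that encodes the Russia flag.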

(disclaimer: I do not claim to understand Unicode correctly; I just 
need to cope with various input filenames, not even mentioning the 
so-called "raw" ones ;-))

Best regards,

Olivier.



On 4/7/21 at 12:42 PM, zxq9 wrote:
> On 2021/04/07 19:37, Olivier Boudeville wrote:
>> Hello Dan,
>>
>> Thanks for your answer; indeed rewriting most uses of string:rstr/2 
>> in terms of string:split/3 and others should be possible, and each 
>> string traversal must be expensive now.
>
> Sometimes going the neanderthal route is a simplification:
>
> 1. unicode:characters_to_list/1
> 2. write a custom function to iterate as a list the original way
>
> The more complex the original representation and more interesting the 
> sort of work you want done the more this approach tends to save me in 
> both cognitive and processing overhead. My case may be highly 
> optimized for this, though, as I usually deal with English, German and 
> Japanese and rarely any other text input languages -- some input forms 
> for other languages can get pretty interesting and probably don't map 
> as well to the concept of "characters" after conversion.
>
> -Craig


-- 
Olivier Boudeville


