Erlang binary longest common substring

Tue Mar 31 06:13:20 CEST 2020

"substring" is a special case of "subsequence".
"OAR" is a subsequence of "cOronA viRus" but not a substring of it.
"OAR" is a substring of "bOARs" but not a prefix of it.

I am still in the dark about whether you want to view these things
as sequences of BYTEs, as encoded sequences of CODEPOINTS,
as encoded sequences of base-character+combining characters,
as sequence of grapheme clusters, or quite possibly as sequences
of tokens.

A lesson I learned back in 1980 is that "STRINGS ARE WRONG".
As Alan Perlis said, "The string is a stark data structure and
everywhere it is passed there is much duplication of process.
It is a perfect vehicle for hiding information."  Since then experience
has only reinforced this: if you are looking inside a string, then you
probably should not be using a string.

Take a step backwards.
What do these strings (binaries) *represent*?
What is the logical structure of the things you are representing?
Why do you want an operation that is completely insensitive
to that logical structure?
Can two different strings represent the same abstract concept?
What, if anything, does *part* of a string represent?

On Tue, 31 Mar 2020 at 02:45, I Gusti Ngurah Oka Prinarjaya
<okaprinarjaya@REDACTED> wrote:
>
> Hi All,
>
> Thank your very much for the answers,
>
> @Valentin Micic , Your lcs function skipping the "World" word. I need "o" and "World" also count.
>
> @Marc Worrel, Your function also similar with Valentine Micic's function.
>
> @Richard O'Keefe , I suppose substring or subsequence is same thing, but i think i am wrong, the point is, i need the function to also count
> "o" from <<"Hellaao World">>, and "World" from <<"Hello World">>
> from example: lcs(<<"Hello World">>, <<"Hellaao World">>)
>
> I think i need to make a NIF .
> I will try to make NIF based on this https://github.com/wollmers/c-lcs-bv wonderful example ANSI C code.
>
> I've try Pyrlang https://pyrlang.github.io/Pyrlang/ in the hope i that i can use / communicate with https://pypi.org/project/pylcs/ through:
> rpc:call('py@REDACTED', 'pylcs', 'lcs', [<<"Hello World">>,<<"Hellaao World">>])  in a python-process node but Pyrlang not help much in my case. still very slow
> because it run over the network.
>
>
>
>
> Pada tanggal Sen, 30 Mar 2020 pukul 20.08 Ben Murphy <benmmurphy@REDACTED> menulis:
>>
>> Hi,
>>
>> I think the implementation on rosetta code is slightly broken from a
>> performance point of view. Usually, if you are memoising in this
>> problem you would use the offsets for where you are in the string. So
>> if you were writing in an imperative language you would just have a N
>> x M array. Obviously, you can't do this in Erlang so they are storing
>> stuff in a dictionary. However, for the keys it looks like they are
>> using the string/list for the key instead of the offset. This means
>> the dict code will have to iterate over the remaining bit of the
>> string for every lookup. So instead of being O(NxM) you end up with an
>> algorithm that is O(NxMx(N + M)). So instead of being quadratic it
>> gives you cubic performance. This might explain why it is so slow.
>> Note: I haven't included the log() term that is introduced by the
>> dictionary lookup that can't be avoided in Erlang.
>>
>> On Mon, Mar 30, 2020 at 11:18 AM I Gusti Ngurah Oka Prinarjaya
>> <okaprinarjaya@REDACTED> wrote:
>> >
>> > Hi,
>> >
>> > I need longest common substring erlang implementation. Please help me
>> >
>> > I've tried https://rosettacode.org/wiki/Longest_common_subsequence#Erlang but very very slow T_T
>> >
>> > and i've tried binary:longest_common_prefix([<<"i love you 2020-03-28">>,<<"i love you 2020-03-01">>]) not effective / not valid
>> >
>> > Thank you
>> >