Erlang binary longest common substring

Tue Mar 31 08:02:50 CEST 2020

Hi,

@Richard O'Keefe, Thanks a lot for your attention, shared knowledge and
enlightenment.

>> I am still in the dark about whether you want to view these things
>> as sequences of BYTEs, as encoded sequences of CODEPOINTS,
>> as encoded sequences of base-character+combining characters,
>> as sequence of grapheme clusters, or quite possibly as sequences
>> of tokens

Honestly, my knowledge for LCS is not deep. I just observe the output of
each libraries that i test,  if the output suite my need, then i choose
that library.
https://pypi.org/project/pylcs/ already provide functions to solve
LCSubsequence and LCSubstring. But unfortunately all of the integration
options not suited my case except NIF. I've try os:cmd() , Pyrlang and i'm
not interested with erlang port because  similarr with Pyrlang there are a
terms serialization handling.

So I decided to learn to write a NIF based on this
https://github.com/wollmers/c-lcs-bv/blob/master/lcstest.c . I need your
help if someday i facing lots os error when run my very first NIF. And i do
really still welcome for any Erlang's BIF based solution (if any). So i
don't need to write a NIF that risk crashing the VM.

>> Take a step backwards.
>> What do these strings (binaries) *represent*?
>> What is the logical structure of the things you are representing?
>> Why do you want an operation that is completely insensitive
>> to that logical structure?
>> Can two different strings represent the same abstract concept?
>> What, if anything, does *part* of a string represent?

Honestly, for now i don't really understand some of your guidance above,
for now i will just act as a house builder using Ikea's solutions (analogy).
Someday i will spent my free time for learning LCS more deeply act as a
house building architect (analogy).

Pada tanggal Sel, 31 Mar 2020 pukul 11.13 Richard O'Keefe <raoknz@REDACTED>
menulis:

> "substring" is a special case of "subsequence".
> "OAR" is a subsequence of "cOronA viRus" but not a substring of it.
> "OAR" is a substring of "bOARs" but not a prefix of it.
>
> I am still in the dark about whether you want to view these things
> as sequences of BYTEs, as encoded sequences of CODEPOINTS,
> as encoded sequences of base-character+combining characters,
> as sequence of grapheme clusters, or quite possibly as sequences
> of tokens.
>
> A lesson I learned back in 1980 is that "STRINGS ARE WRONG".
> As Alan Perlis said, "The string is a stark data structure and
> everywhere it is passed there is much duplication of process.
> It is a perfect vehicle for hiding information."  Since then experience
> has only reinforced this: if you are looking inside a string, then you
> probably should not be using a string.
>
> Take a step backwards.
> What do these strings (binaries) *represent*?
> What is the logical structure of the things you are representing?
> Why do you want an operation that is completely insensitive
> to that logical structure?
> Can two different strings represent the same abstract concept?
> What, if anything, does *part* of a string represent?
>
>
> On Tue, 31 Mar 2020 at 02:45, I Gusti Ngurah Oka Prinarjaya
> <okaprinarjaya@REDACTED> wrote:
> >
> > Hi All,
> >
> > Thank your very much for the answers,
> >
> > @Valentin Micic , Your lcs function skipping the "World" word. I need
> "o" and "World" also count.
> >
> > @Marc Worrel, Your function also similar with Valentine Micic's function.
> >
> > @Richard O'Keefe , I suppose substring or subsequence is same thing, but
> i think i am wrong, the point is, i need the function to also count
> > "o" from <<"Hellaao World">>, and "World" from <<"Hello World">>
> > from example: lcs(<<"Hello World">>, <<"Hellaao World">>)
> >
> > I think i need to make a NIF .
> > I will try to make NIF based on this
> https://github.com/wollmers/c-lcs-bv wonderful example ANSI C code.
> >
> > I've try Pyrlang https://pyrlang.github.io/Pyrlang/ in the hope i that
> i can use / communicate with https://pypi.org/project/pylcs/ through:
> > rpc:call('py@REDACTED', 'pylcs', 'lcs', [<<"Hello World">>,<<"Hellaao
> World">>])  in a python-process node but Pyrlang not help much in my case.
> still very slow
> > because it run over the network.
> >
> >
> >
> >
> > Pada tanggal Sen, 30 Mar 2020 pukul 20.08 Ben Murphy <
> benmmurphy@REDACTED> menulis:
> >>
> >> Hi,
> >>
> >> I think the implementation on rosetta code is slightly broken from a
> >> performance point of view. Usually, if you are memoising in this
> >> problem you would use the offsets for where you are in the string. So
> >> if you were writing in an imperative language you would just have a N
> >> x M array. Obviously, you can't do this in Erlang so they are storing
> >> stuff in a dictionary. However, for the keys it looks like they are
> >> using the string/list for the key instead of the offset. This means
> >> the dict code will have to iterate over the remaining bit of the
> >> string for every lookup. So instead of being O(NxM) you end up with an
> >> algorithm that is O(NxMx(N + M)). So instead of being quadratic it
> >> gives you cubic performance. This might explain why it is so slow.
> >> Note: I haven't included the log() term that is introduced by the
> >> dictionary lookup that can't be avoided in Erlang.
> >>
> >> On Mon, Mar 30, 2020 at 11:18 AM I Gusti Ngurah Oka Prinarjaya
> >> <okaprinarjaya@REDACTED> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I need longest common substring erlang implementation. Please help me
> >> >
> >> > I've tried
> https://rosettacode.org/wiki/Longest_common_subsequence#Erlang but very
> very slow T_T
> >> >
> >> > and i've tried binary:longest_common_prefix([<<"i love you
> 2020-03-28">>,<<"i love you 2020-03-01">>]) not effective / not valid
> >> >
> >> > Thank you
> >> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20200331/efabbc12/attachment.htm>